
Slow MD5



So when I think of file integrity I think of checksums and MD5. I realize there are tons of different hash methods and CRCs available, but I prefer MD5. So I was excited when I heard LabVIEW 8.2 got MD5 for files natively (I think it was in vi.lib in 8.0, but nothing on the palette).

But since I've used the native MD5 I've been disappointed in how long it takes to calculate. So I did some quick tests comparing the native MD5 and the OpenG MD5 against the command-line version I've been using, found at http://www.etree.org/md5com.html . For small files (less than 30 kB) the native MD5 is relatively quick, at around 50 ms per file. This is good if you are checking the integrity of a config file, but I'd rather use it as a general-purpose file utility, checking the integrity of a directory of files.

For any file above 30 kB, the command-line version processes it faster. I performed an MD5 on four 5 MB text files: using the native MD5 it took 2,786 ms, while the command line took 125 ms. The OpenG version wasn't a good comparison, since it processed the whole file at once and took over 30 seconds.

So I wrote an "improved" MD5 calculation VI. I think you'll be horrified when you look at the source, since it just wraps the command-line version, but it works, and a lot faster than either the OpenG or the native implementation. I also saved it in 7.1.

EDIT: I seem to have a problem uploading (it says I didn't select a file), so I hosted it on my site for now.

http://brian-hoover.com/Code/LabVIEW/MyMD5File.zip
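
For reference, the approach above amounts to shelling out to an external MD5 tool and parsing the digest from its output. Below is a minimal Python sketch of that pattern (not the VI itself); the md5sum-style "<digest>  <filename>" output format is an assumption, and md5.com from etree.org may format its results differently.

# Minimal sketch of the command-line approach: run an external MD5 tool
# and parse the digest from its stdout. The md5sum-style output format
# ("<digest>  <filename>") is an assumption; adjust for the tool you use.
import subprocess

def md5_via_cli(path: str, tool: str = "md5sum") -> str:
    """Return the MD5 hex digest of `path` using an external tool."""
    result = subprocess.run(
        [tool, path],
        capture_output=True,
        text=True,
        check=True,   # raise CalledProcessError if the tool fails
    )
    # The first whitespace-delimited token is assumed to be the hex digest.
    return result.stdout.split()[0].lower()

if __name__ == "__main__":
    print(md5_via_cli("example.bin"))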


I own the one that ships with LabVIEW. If y'all figure out a way to implement it with 100% G code (i.e. no command line calls) that's faster than the current shipping code, I'd certainly be open to changing it in LV 2010. This topic came up a few years back on LAVA, and at the time, mine was quite a bit faster than the OpenG one.

-D


I didn't expect my code to be put in the next rev. of LabVIEW, for several reasons; that wasn't my intent. I just wanted the fastest possible way of calculating an MD5 for a directory of files. I ran it on 500 MB of random files in the My Documents folder: it took 3 seconds using my version (with the command line embedded) and 75 seconds using the native code. But I realize the limitations of using the command line: it can't handle crashes, it needs Windows, it needs access to a temp folder, and I'm unsure how it behaves on new versions of Windows, among other problems.

I don't know how to optimize the MD5 algorithm, but what sort of things are off limits for potential additions to LabVIEW? For example, if I found a .dll that calculated the checksum quickly, could I write a VI that just uses that .dll? I assume there are legal reasons why NI could not include random code from the internet in a commercial product.

@Ton

I saw that code on SourceForge a little while ago, but it's missing two VIs:

MD5 Unrecoverable U8 padding.vi

MD5 FGHI functions.vi

I'd be glad to do some testing to see how each stacks up.


Thanks Tom, I got all the needed VIs and ran the test again. OpenG still seems to be the slowest. I've played around with the chunk size and haven't been able to improve it much. I did one 2 MB file with a 10 kB chunk and it took 2.8 seconds. The command-line version took 0.01 seconds and the native took 0.2 seconds. For now I'm sticking with my command-line version.

BTW I reported the fact that we can't upload files.


QUOTE (Darren @ Jun 10 2009, 04:11 PM)

I experimented with the shipping implementation and found that the following changes help performance:

1. Disable debugging in "MD5 Checksum File" and the subVI "MD5Checksum Core".

2. Inside "MD5Checksum Core", the innermost loop contains a section of code that performs Swap Words and Swap Bytes on the current array element. Move these two functions to the outermost loop and place them immediately after the typecast of the string to an array of U32.

With these changes, the MD5 calculation on the 8.6.1f1 LabVIEW.exe went from 2.79 seconds to 2.12 seconds.
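
To make the second suggestion concrete, here is a hedged Python sketch of the same idea (the names are illustrative, not the VI's actual functions): MD5 consumes each 64-byte block as sixteen little-endian 32-bit words, so the byte-order conversion can be done once per chunk, up front, rather than once per word inside the innermost loop.

# Illustration of hoisting the byte-order swap out of the inner loop.
# Both helpers produce the little-endian U32 words MD5 operates on; the
# second does the conversion once per chunk instead of per element.
import struct

def words_per_element(chunk: bytes) -> list[int]:
    """Swap byte order one 32-bit element at a time (inner-loop pattern)."""
    words = []
    for i in range(0, len(chunk), 4):
        (word_be,) = struct.unpack(">I", chunk[i:i + 4])  # big-endian read
        # Swap Words + Swap Bytes together reverse the four bytes:
        words.append(int.from_bytes(word_be.to_bytes(4, "big")[::-1], "big"))
    return words

def words_per_chunk(chunk: bytes) -> list[int]:
    """Convert the whole chunk to little-endian U32s in one call."""
    count = len(chunk) // 4
    return list(struct.unpack(f"<{count}I", chunk[:count * 4]))

if __name__ == "__main__":
    block = bytes(range(64))                 # one 512-bit MD5 block
    assert words_per_element(block) == words_per_chunk(block)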

QUOTE (hooovahh @ Jun 10 2009, 02:57 PM)

For any file above 30 kB, the command-line version processes it faster. I performed an MD5 on four 5 MB text files: using the native MD5 it took 2,786 ms, while the command line took 125 ms. The OpenG version wasn't a good comparison, since it processed the whole file at once and took over 30 seconds.

I revisited my .NET implementation from here and found that one of the .NET methods was broken when I loaded the VI in LabVIEW 8.6.1. I've fixed it and cleaned it up, but can't upload to the LAVA forums at the moment (not sure why...). Maybe the .NET technique will work for you...


I like your .NET method. In my test, for files less than 16 MB the command-line version is slightly faster, with both taking around 100 ms for a 16 MB file, while the native takes around 2,380 ms.

But as files grow to around the size I want to process, the .NET method is faster. I ran a test on 500 MB of files, each between 50 MB and 80 MB: the command line took 4,900 ms and the .NET took 2,320 ms.

I know what you mean when you said it wouldn't open in a newer version of LabVIEW. It opens fine in 7.1 and 8.0, but in anything newer the Invoke Node names are slightly different and need to be re-linked; after that it works.

So I could determine the size of the file and use the right method for that size, but I'm just going to stick with your method, since the difference between them all for small files is very small. Thanks.


QUOTE (Phillip Brooks @ Jun 11 2009, 08:30 AM)

I experimented with the shipping implementation and found that the following changes help performance:

Thanks, Phillip. I have filed CAR# 173651 to myself for investigating your suggestions in LabVIEW 2010. If anybody else has any suggestions, post them here, as I will be reviewing this thread when looking into the CAR later this year. Again, I'm looking to stick with a 100% G, platform-independent implementation.

-D


QUOTE (Darren @ Jun 11 2009, 05:01 PM)

Thanks, Phillip. I have filed CAR# 173651 to myself for investigating your suggestions in LabVIEW 2010. If anybody else has any suggestions, post them here, as I will be reviewing this thread when looking into the CAR later this year. Again, I'm looking to stick with a 100% G, platform-independent implementation.

-D

I tried converting your code to read U32 values instead of a string, and I also removed the "Get File Position" and "Set File Position" functions.

This reduced the time for large files by approximately 30%.

/J
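
For readers unfamiliar with the change being described, here is a hedged, non-LabVIEW sketch of the idea: read the file sequentially in fixed-size chunks (so the file handle tracks its own position and no explicit Get/Set File Position calls are needed) and deliver each chunk already converted to 32-bit words. MD5's final-block padding is left out for brevity, and the chunk size is purely illustrative.

# Sequential chunked reads with implicit position tracking, delivering
# little-endian U32 words; MD5 final-block padding intentionally omitted.
import struct

CHUNK_BYTES = 64 * 1024   # illustrative; any multiple of 64 works for MD5

def read_as_u32_chunks(path: str):
    """Yield each chunk of `path` as a tuple of little-endian U32 words."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK_BYTES)
            if not chunk:
                break
            full_words = len(chunk) // 4
            yield struct.unpack(f"<{full_words}I", chunk[:full_words * 4])

if __name__ == "__main__":
    total = sum(len(words) for words in read_as_u32_chunks("example.bin"))
    print(f"{total} complete 32-bit words read")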


QUOTE (Darren @ Jun 11 2009, 11:01 AM)

Thanks, Phillip. I have filed CAR# 173651 to myself for investigating your suggestions in LabVIEW 2010. If anybody else has any suggestions, post them here, as I will be reviewing this thread when looking into the CAR later this year. Again, I'm looking to stick with a 100% G, platform-independent implementation.

-D

FYI, I experimented with loading the file data in one loop and passing it via a queue to the 'core' function running in a separate loop, thinking that the file I/O was a place for improvement. It appears that the majority of the overhead is in the 'core' VI; no gains were detected...
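
For context, the producer/consumer experiment described above looks roughly like the following Python sketch, with hashlib.md5 standing in for the core VI and a bounded queue decoupling the reader from the hasher. As noted, this only helps if file I/O is the bottleneck.

# Pipelined read-and-hash: a reader thread enqueues chunks while the main
# thread updates the digest. hashlib.md5 stands in for the "core" VI here.
import hashlib
import queue
import threading

CHUNK = 1 << 20       # 1 MiB reads; illustrative
SENTINEL = None       # tells the consumer the producer is finished

def md5_pipelined(path: str) -> str:
    q: "queue.Queue[bytes | None]" = queue.Queue(maxsize=8)

    def reader() -> None:
        with open(path, "rb") as f:
            while chunk := f.read(CHUNK):
                q.put(chunk)
        q.put(SENTINEL)

    threading.Thread(target=reader, daemon=True).start()

    digest = hashlib.md5()
    while (chunk := q.get()) is not SENTINEL:
        digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    print(md5_pipelined("example.bin"))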

  • 1 month later...

Ok, I had a few minutes this afternoon to re-read this thread and look into any low-hanging fruit for improving the performance of the MD5 VI that ships with LabVIEW. Unless I missed something, there were three concrete suggestions for improving the performance of the core VI:

1. Disable debugging: Done.

2. Move Swap Words and Swap Bytes functions out of the loop: Done, although this appears to have a negligible effect compared to turning off debugging.

3. Process file as a U32: I haven't done this one yet. JFM, can you post your modified version of the VI so I can take a look at it? I'm not sure yet if I want to go forward with this change, as there's another VI in that LLB (MD5Checksum string.vi) that can be used to generate the MD5 of a string, independent of File I/O, that assumes the core VI takes a string input.

-D

  • 1 year later...
  • 10 years later...

Hey, no worries. So here is the file that was on my website. It hasn't been up since I moved; I really should try to fix that. Still, I wouldn't mind revisiting this topic to see what else can be done. In the past I really wanted an MD5 just to verify a file copied properly, and for that I do a faster checksum.

MyMD5File.zip

8 minutes ago, ShaunR said:

Hmmmm. They have removed MD5Checksum from the palette in recent LabVIEW versions?

They did remove it from the palette, and I complained. If you find the VI, it has text in it saying it has been superseded by the newer file integrity functions. Which is half true: there are new file integrity functions, but they don't support MD5, and NI has said that was due to security concerns.

3 minutes ago, hooovahh said:

They did remove it from the palette, and I complained. If you find the VI, it has text in it saying it has been superseded by the newer file integrity functions. Which is half true: there are new file integrity functions, but they don't support MD5, and NI has said that was due to security concerns.

That's BS. MD5 is still the 2nd fastest checksum (SHA-1 being the fastest), and a checksum has little to do with security. They are obviously confusing security with integrity.


Yup, totally agree. Taking a look at the source again, I wonder if we could improve the speed by pipelining the process so that reading and processing of file chunks happen in parallel. Still, I doubt the native G implementation could make up that much ground when the command-line version is so much faster.

16 minutes ago, hooovahh said:

Yup, totally agree. Taking a look at the source again, I wonder if we could improve the speed by pipelining the process so that reading and processing of file chunks happen in parallel. Still, I doubt the native G implementation could make up that much ground when the command-line version is so much faster.

Yuck, cmd line. Try using the EVP_Digest interface of NIlibeay32.dll. You can even have progress events if you want to be fancy.

I don't know why NI didn't use it.
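
For the curious, calling OpenSSL's EVP digest interface from a DLL looks roughly like the sketch below (shown with Python's ctypes rather than a Call Library Function Node). The library name and exported symbols used here are assumptions: older builds such as NIlibeay32.dll expose EVP_MD_CTX_create/EVP_MD_CTX_destroy rather than the EVP_MD_CTX_new/EVP_MD_CTX_free of newer OpenSSL, so the names would need adjusting.

# Hedged sketch of the OpenSSL EVP digest interface via ctypes. Symbol
# names assume an OpenSSL 1.1-style build; adjust for older libraries.
import ctypes
from ctypes import POINTER, byref, c_char_p, c_size_t, c_ubyte, c_uint, c_void_p

def md5_via_evp(path: str, lib: str = "libcrypto.so") -> str:
    crypto = ctypes.CDLL(lib)
    crypto.EVP_md5.restype = c_void_p
    crypto.EVP_MD_CTX_new.restype = c_void_p
    crypto.EVP_DigestInit_ex.argtypes = [c_void_p, c_void_p, c_void_p]
    crypto.EVP_DigestUpdate.argtypes = [c_void_p, c_char_p, c_size_t]
    crypto.EVP_DigestFinal_ex.argtypes = [c_void_p, POINTER(c_ubyte), POINTER(c_uint)]
    crypto.EVP_MD_CTX_free.argtypes = [c_void_p]

    ctx = crypto.EVP_MD_CTX_new()
    try:
        crypto.EVP_DigestInit_ex(ctx, crypto.EVP_md5(), None)
        with open(path, "rb") as f:
            while chunk := f.read(1 << 20):
                crypto.EVP_DigestUpdate(ctx, chunk, len(chunk))
                # A per-chunk progress event could be emitted here.
        digest = (c_ubyte * 16)()     # MD5 digest is 16 bytes
        digest_len = c_uint()
        crypto.EVP_DigestFinal_ex(ctx, digest, byref(digest_len))
        return bytes(digest[:digest_len.value]).hex()
    finally:
        crypto.EVP_MD_CTX_free(ctx)

if __name__ == "__main__":
    print(md5_via_evp("example.bin"))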


Well, this code has been here for 12 years ready to be improved; I just went with the fastest, easiest method to use. If you have improvements, the community would love them.

I just ran a quick test on a 1.4 GB file on a SATA SSD. NI's method took 21.4 s, a pipelined version using queues took 21.1 s, and the command-line version took 4.2 s.

I copied the file to my NVMe drive to compare, and the numbers were very similar, indicating the slowdown likely isn't file I/O.

When something takes 21 s a progress bar is probably needed; when it takes a couple of seconds it might be nice, but it is less of a concern. I'll likely just make a wrapper that uses the G code on Linux and calls the embedded binary on Windows.

EDIT: I also tried the new file integrity functions and they all performed pretty badly. The fastest I found was 30 s on the same file. A checksum was about half a second.


My quick test shows it did an MD5 in about 3 seconds, while the other command-line version took 4.

However, if your purpose is just to ensure file integrity, MD5 might be overkill. I found a CRC32 posted here, which is written in G so it works on every platform LabVIEW does, doesn't require an external binary, and also finishes in about 4 seconds. If you really need MD5, and you need it to be fast, but don't care that it is Windows-only, you can use one of these. But if you just want a fast file compare and don't care about the format, I'd go with something simpler like the CRC32.

But also test it with the typical file sizes you'll be seeing.  On smaller files the difference might be a rounding error. 
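
If a CRC really is enough, the streaming version is tiny. Here is a hedged Python sketch, with zlib.crc32 standing in for the G implementation linked above, just to show the shape of it.

# Streaming CRC32 over a file, chunk by chunk, as a cheap integrity check.
import zlib

def crc32_file(path: str, chunk_size: int = 1 << 20) -> int:
    """Return the CRC32 of `path`, computed incrementally."""
    crc = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF

if __name__ == "__main__":
    print(f"{crc32_file('example.bin'):08x}")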

41 minutes ago, hooovahh said:

Well, this code has been here for 12 years ready to be improved; I just went with the fastest, easiest method to use. If you have improvements, the community would love them.

I just ran a quick test on a 1.4 GB file on a SATA SSD. NI's method took 21.4 s, a pipelined version using queues took 21.1 s, and the command-line version took 4.2 s.

I copied the file to my NVMe drive to compare, and the numbers were very similar, indicating the slowdown likely isn't file I/O.

When something takes 21 s a progress bar is probably needed; when it takes a couple of seconds it might be nice, but it is less of a concern. I'll likely just make a wrapper that uses the G code on Linux and calls the embedded binary on Windows.

EDIT: I also tried the new file integrity functions and they all performed pretty badly. The fastest I found was 30 s on the same file. A checksum was about half a second.

This is fun :)

Here are my comparative benchmark results for MD5

[image: comparative MD5 benchmark results]

The NI SHA implementations are abysmal.

[image: NI SHA benchmark results]

