MD5 vs CRC32: Which one's better for common use?

Recently I read somewhere that although both CRC32 and MD5 are sufficiently uniform and stable, CRC32 is more efficient than MD5. MD5 seems to be a very commonly used hashing algorithm but if CRC32 is faster/more memory efficient then why not use that?

MD5 is a one-way hash algorithm. One-way hash algorithms are often used in cryptography because, by design, it is hard to find an input that produces a given hash value, and hard to construct two different inputs that produce the same hash. They are often used to show that data has not been altered intentionally since the hash was produced. Because MD5 is a one-way hash algorithm, the emphasis is on security over speed. Unfortunately, MD5 is now considered insecure.
CRC32 is designed to detect accidental changes to data and is commonly used in networks and storage devices. The purpose of this algorithm is not to protect against intentional changes, but to catch accidents like network errors and disk write errors. The emphasis of this algorithm is on speed rather than security.
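To make the difference concrete, here is a minimal sketch (Python standard library only; the input string is arbitrary) computing both values for the same data:

```python
# Minimal sketch: CRC32 vs MD5 over the same bytes (Python standard library).
import hashlib
import zlib

data = b"The quick brown fox jumps over the lazy dog"

crc = zlib.crc32(data)                  # 32-bit error-detecting checksum
digest = hashlib.md5(data).hexdigest()  # 128-bit cryptographic-style digest

print(f"CRC32: {crc:08x}")   # 8 hex digits  (32 bits)
print(f"MD5:   {digest}")    # 32 hex digits (128 bits)
```

The output sizes alone (32 bits versus 128 bits) hint at the different design goals described above.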

From Wikipedia's article on MD5 (emphasis mine):
MD5 is a widely used cryptographic hash function
Now CRC32:
CRC is an error-detecting code
So, as you can see, CRC32 is an error-detecting code, not a cryptographic hash function. That means you should not use it where a cryptographic hash is needed, because it was not built for that.
I also don't think it makes much sense to talk about "common use": similar algorithms are used for different purposes, each with significantly different requirements. There is no single algorithm that's best for common use; instead, you should choose the algorithm that's best suited to your specific use.

It depends on your goals. Here are some examples of what CRC32 and MD5 can be used for:
Detecting duplicate files
If you want to check whether two files are the same, a CRC32 checksum is the way to go because it's faster than MD5. But be careful: CRC only reliably tells you when the binaries are different; it doesn't prove they're identical. If two files have different checksums, they cannot be the same file, so you can reject them as duplicates very quickly.
No matter what your inputs are, a CRC32 checksum is one of 2^32 possible values. Assuming random files, the probability that two given files collide is 1/2^32, and the probability that a given file collides with any of the other N - 1 files is about (N - 1)/2^32. By the birthday bound, the probability of any collision among N files is roughly N(N - 1)/2^33.
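As a rough sketch of that workflow (Python for brevity; the helper names are only illustrative), the checksum is used as a fast filter and a byte-for-byte comparison confirms a real match:

```python
# Sketch: use CRC32 to reject non-duplicates quickly, then confirm exactly.
import filecmp
import zlib

def crc32_of_file(path, chunk_size=1 << 16):
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)   # running CRC over the whole file
    return crc

def is_duplicate(path_a, path_b):
    # Different CRCs -> definitely different; equal CRCs -> maybe equal,
    # so fall back to an exact byte-for-byte comparison.
    if crc32_of_file(path_a) != crc32_of_file(path_b):
        return False
    return filecmp.cmp(path_a, path_b, shallow=False)
```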
Detecting malicious software
If security is an issue, such as downloading a file and checking the publisher's hash against your own to verify the binary hasn't been tampered with, then CRC is a poor option. Attackers can craft malware that has the same CRC checksum as the original. In this case, an MD5 digest is the better choice; CRC was simply not made for security. Two different binaries are far more likely to share a CRC checksum than an MD5 digest.
Securing passwords for user authentication
One-way hashing is usually easier, faster, and safer for storing passwords than (two-way) encryption, so it's the common approach. Basically, the password is combined with other data (a salt) and the hash is computed over the combined data. Random salts greatly reduce the chance that two identical passwords produce the same stored hash: by default, the same password gives the same hash for most algorithms, so you must add your own randomness. The salt must be stored alongside the hash so it can be reused at login.
To log a user in, you take the information they provide: use the username to look up their salt in the database, combine that salt with the supplied password, and hash the result. If it matches the hash in the database, the login succeeds. Since you're storing these password hashes, they must be VERY hard to reverse, which rules out a CRC checksum.
Cryptographic digests are more expensive to compute than CRC checksums. Also, stronger hashes like SHA-256 are more secure but slower, and they take up more database space because their digests are longer.
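A minimal sketch of the salt-then-hash flow described above, using only Python's standard library; PBKDF2 stands in for the hashing step because it is deliberately slow, and the function names and iteration count are only illustrative:

```python
# Sketch: store (salt, hash) per user; recompute and compare at login.
import hashlib
import hmac
import os

def hash_password(password: str) -> tuple[bytes, bytes]:
    salt = os.urandom(16)                # random per-user salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return salt, digest                  # both go into the database

def verify_password(password: str, salt: bytes, stored: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return hmac.compare_digest(candidate, stored)   # constant-time compare
```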

One big difference between CRC32 and MD5 is that it is usually easy to pick a CRC32 checksum and then come up with a message that hashes to that checksum, even if there are constraints imposed on the message, whereas MD5 is specifically designed to make this sort of thing difficult (although it is showing its age - this is now possible in some situations).
If you are in a situation where it is possible that an adversary might decide to sit down and create a load of messages with specified CRC32 hashes, to mimic other messages, or just to make a hash table perform very badly because everything hashes to the same value, then MD5 would be a better option. (Even better, IMHO, would be HMAC-MD5 with a keyed value that is unique to the module using it and unknown outside it).

CRCs are used to guard against random errors, for example in data transmission.
Cryptographic hash functions are designed to guard against intelligent adversaries forging the message, though MD5 has been broken in that respect.

Actually, CRC32 is not necessarily faster than MD5.
Please take a look at: https://3v4l.org/2MAUr
That PHP script runs several hashing algorithms and measures the time each one takes. It shows that, at least in PHP, MD5 is consistently faster than CRC32, and even SHA1 beats MD5 in most of the test cases.
So, if you want to do some quick error detection or look for random changes, I would still go with MD5, as it simply does it all.

The primary reason CRC32 (or CRC8, or CRC16) is used for any purpose whatsoever is that it can be cheaply implemented in hardware as a means of detecting "random" corruption of data. Even in software implementations, it is useful for detecting random corruption caused by hardware (a noisy communications line, unreliable flash media, and so on). It is not tamper-resistant, nor is it generally suitable for testing whether two arbitrary files are likely to be the same: if each chunk of data in a file is immediately followed by the CRC32 of that chunk (some data formats do that), each chunk will have the same effect on the overall file's CRC as a chunk of all zero bytes, regardless of what data was stored in that chunk.
If one has the means to calculate a CRC32 quickly, it might be helpful in conjunction with other checksum or hash methods, provided that files with identical CRCs would be likely to differ in one of the other hashes and vice versa. But on many machines, other checksum or hash methods are easier to compute relative to the amount of protection they provide.

You should use MD5, which produces a 128-bit digest.
CRC32 is only 32 bits long and its purpose is to detect errors, not to hash things.
If you only need a 32-bit hash, you can take any 32 bits of the MD5 output: the least significant, the most significant, whatever.
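If you do go that route, taking 32 bits out of the 128-bit digest is a one-liner; a small Python sketch (the payload is a placeholder):

```python
# Sketch: keep only 32 bits of an MD5 digest when a small hash is enough.
import hashlib

digest = hashlib.md5(b"example payload").digest()   # 16 bytes = 128 bits

low32  = int.from_bytes(digest[:4],  "big")   # first 4 bytes
high32 = int.from_bytes(digest[-4:], "big")   # last 4 bytes -- either works
print(f"{low32:08x} {high32:08x}")
```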

One man's common is another man's infrequent. Common varies depending on which field you are working in.
If you are doing very quick transmissions or working out hash codes for small items, then CRCs are better since they are a lot faster and the chances of getting the same 16 or 32 bit CRC for wrong data are slim.
If it is megabytes of data, for instance a Linux ISO, you could lose a few megabytes and still end up with the same 32-bit CRC; that is far less likely with MD5's 128-bit digest. For that reason MD5 is normally used for huge transfers. It is slower but more reliable.
So basically, if you are going to do one huge transmission and check at the end whether you have the correct result, use MD5. If you are going to transmit in small chunks, then use CRC.

I would say if you don't know what to choose, go for MD5.
It's less likely to cause you a headache.

Related

Very low collision non-cryptographic hashing function

I'm writing an application that uses hashing to speed up file comparisons. Basically I pre-hash file A, and then the app runs and matches files in a folder with previously hashed files. My current criteria for looking for a hash function are as follows:
It should be fast enough that disk IO is the limiting factor. I'm currently using SHA-256 which works just fine but is way too heavy and makes my application CPU bound.
Cryptography/security doesn't matter in this case, the user is inputting both files, so if they craft a hash collision intentionally, that's on them.
Hash collisions should be avoided at almost all costs. I can compare files based on size and their hash, but if both of those match, the files are assumed to be equal. I know no hash can guarantee this, since a hash maps arbitrarily large inputs to a fixed-size output, but something with the same sort of uniqueness guarantees as SHA-256 would be nice.
File sizes range from 10 bytes to 2 GB.
A streaming algorithm would be nice, as I try to keep the memory usage of the application low, in other words I don't want to load the entire file into memory to hash it.
Hash size doesn't matter, if I got all the above with 1024bit hashes, I'm completely okay with that.
So what's a good algorithm to use here? I'm using C#, but I'm sure most algorithms are available on any platform. Like I said, I'm using SHA-256, but I'm sure there's something better.
Yann Collet's xxHash may be a good choice (Home page, GitHub)
xxHash is an extremely fast non-cryptographic hash algorithm, working at speeds close to RAM limits. It is proposed in two flavors, 32 and 64 bits.
At least 4 C# implementations are available (see home page).
I had excellent results with it in the past.
The hash size is 32 or 64 bits, but XXH3 is in the making:
XXH3 features a wide internal state of 512 bits, which makes it suitable to generate a hash of up to 256 bits. For the time being, only 64-bit and 128-bit variants are exposed, but a similar recipe can be used for a 256-bit variant if there is any need for it one day. All variants feature the same speed, since only the finalization stage is different.
In general, the longer the hash, the slower its calculation. 64-bit hash is good enough for most practical purposes.
You can generate longer hashes by combining two hash functions (e.g. 128-bit XXH3 and 128-bit MurmurHash3).
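For the streaming requirement in the question, a hedged sketch (Python rather than the asker's C#, and it assumes the third-party xxhash package): the file is fed to the hash in fixed-size chunks, so memory use stays constant regardless of file size.

```python
# Sketch: hash a large file incrementally so it never has to fit in memory.
# Assumes the third-party "xxhash" package; any incremental hash object with
# update()/hexdigest() (e.g. hashlib.sha256) works the same way.
import xxhash

def xxh64_of_file(path, chunk_size=1 << 20):
    h = xxhash.xxh64()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```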

why does Apple FileVault use a block encryption algorithm instead of a stream encryption algorithm?

FileVault 2 uses the Advanced Encryption Standard (AES) encryption algorithm, which delivers robust protection for stored data. Until mid-2013, it only supported the use of 128-bit keys, not 256-bit keys. Although 128-bit keys are technically acceptable in many environments, organizations are rapidly moving toward 256-bit keys to thwart emerging threats.
Source: https://searchsecurity.techtarget.com/feature/Apple-FileVault-2-Full-disk-encryption-software-overview
Wouldn't a stream algorithm be faster and easier to handle? Won't the use of a block cipher consume more disk space?
Is there an instruction set in modern CPUs for stream encryption algorithms, as there is for block algorithms?
Thanks
A filesystem has to support all common use cases efficiently.
Now consider the case of a database file. (For example, one that uses SQLite.) It is common to know where your record is, to open up your file, seek to that place, read that record, possibly rewrite it, then close your file. With a block based algorithm that's just a question of loading the correct block, decrypting it, returning it, and then encrypting it on the way back. With a stream based algorithm you would need to read the whole database file to understand that part of the file, and would need to rewrite the whole database file again to modify a bit in the middle.
Therefore stream based algorithms would be horribly inefficient for this use case, while block based algorithms work well.
Incidentally, as long as the encryption key is stored outside the block, a block-based algorithm has very little space overhead. More precisely, it only forces you to round your file sizes up to a whole number of blocks.
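To make the random-access argument concrete, here is a toy sketch (not how FileVault itself works, and it assumes the third-party cryptography package; plain AES-ECB is used only to keep it short, whereas real disk encryption uses a mode such as AES-XTS that also ties each block to its position): only the blocks covering the requested byte range are read and decrypted.

```python
# Toy sketch: random access into an encrypted file by decrypting only the
# blocks that cover the requested range. AES-ECB is for brevity only.
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

BLOCK = 16  # AES block size in bytes

def read_encrypted_range(path, offset, length, key):
    first = (offset // BLOCK) * BLOCK                 # round down to block start
    last = -(-(offset + length) // BLOCK) * BLOCK     # round up to block end
    with open(path, "rb") as f:
        f.seek(first)
        ciphertext = f.read(last - first)
    decryptor = Cipher(algorithms.AES(key), modes.ECB()).decryptor()
    plaintext = decryptor.update(ciphertext) + decryptor.finalize()
    return plaintext[offset - first : offset - first + length]
```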

Ideal hashing method for wide distribution of values?

As part of the rhythm game I'm working on, I'm allowing users to create and upload custom songs and notecharts. I'm thinking of hashing the songs and notecharts to uniquely identify them. Of course, I'd like as few collisions as possible; however, cryptographic strength isn't as important here as a wide, uniform range. In addition, since I'd be performing the hashes rarely, computational efficiency isn't too big of an issue.
Is this as easy as selecting a tried-and-true hashing algorithm with the largest digest size? Or are there some intricacies that I should be aware of? I'm looking at either SHA-256 or 512, currently.
Any cryptographic-strength hash algorithm should exhibit no collisions at all in practice. Of course, collisions necessarily exist (there are more possible inputs than possible outputs), but it should be impossible, using existing computing technology, to actually find one.
When the hash function has an output of n bits, it is possible to find a collision with work about 2^(n/2), so in practice a hash function with less than about 140 bits of output cannot be cryptographically strong. Moreover, some hash functions have weaknesses that allow attackers to find collisions faster than that; such functions are said to be "broken". A prime example is MD5.
If you are not in a security setting, and fear only random collisions (i.e. nobody will actively try to provoke a collision; they may happen only out of pure bad luck), then a broken cryptographic hash function will be fine. The usual recommendation is then MD4. Cryptographically speaking, it is as broken as it can be, but for non-cryptographic purposes it is devilishly fast, and its 128 bits of output are enough to avoid random collisions.
However, chances are that you will not have any performance issue with SHA-256 or SHA-512. On a most basic PC, they already process data faster than what a hard disk can provide: if you hash a file, the file reading will be the bottleneck, not the hashing. My advice would be to use SHA-256, possibly truncating its output to 128 bits (if used in a non-security situation), and consider switching to another function only if some performance-related trouble is duly noticed and measured.
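A tiny sketch of the truncation suggested above (Python; the input is a placeholder for the song/notechart bytes):

```python
# Sketch: SHA-256, truncated to 128 bits, as a non-security identifier.
import hashlib

data = b"song-or-notechart-bytes"          # placeholder input
full = hashlib.sha256(data).digest()       # 32 bytes
track_id = full[:16].hex()                 # keep the first 128 bits
```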
If you're using it to uniquely identify tracks, you do want a cryptographic hash: otherwise, users could deliberately create tracks that hash the same as existing tracks, and use that to overwrite them. Barring a compelling reason otherwise, SHA-1 should be satisfactory, though SHA-256 is the safer default now that practical SHA-1 collisions have been demonstrated.
If cryptographic security is not a concern then you can look at this link & this. The fastest and simplest (to implement) would be Pearson hashing if you are planning to compute a hash of the title/name and later do lookups. Or you can have a look at SuperFastHash here. It is also very good for non-cryptographic use.
What's wrong with something like an md5sum? Or, if you want a faster algorithm, I'd just create a hash from the file length (mod 64K to fit in two bytes) and 32-bit checksum. That'll give you a 6-byte hash which should be reasonably well distributed. It's not overly complex to implement.
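A sketch of that 6-byte scheme in Python (the function name is just illustrative):

```python
# Sketch: 2 bytes of length (mod 64K) + 4-byte CRC32 = a 6-byte identifier.
import zlib

def six_byte_id(data: bytes) -> bytes:
    length_part = (len(data) % 65536).to_bytes(2, "big")   # 2 bytes
    crc_part = zlib.crc32(data).to_bytes(4, "big")         # 4 bytes
    return length_part + crc_part                          # 6 bytes total
```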
Of course, as with all hashing solutions, you should monitor the collisions and change the algorithm if the cardinality gets too low. This would be true regardless of the algorithm chosen (since your users may start uploading degenerate data).
You may end up finding you're trying to solve a problem that doesn't exist (in other words, possible YAGNI).
Isn't cryptographic hashing overkill in this case, even though modern computers do the calculation pretty fast? I assume your users will have a unique userid. When they upload, you just need to increment a number, so you can represent uploads internally as userid1_song_1, userid1_song_2, and so on. You can store this info in a database with that as the unique key, along with the user-specified name.
You also didn't mention the size of these songs. If they are MIDI, the files will be small. If the files are big (say 3 MB), SHA calculations will not be instantaneous. On my Core 2 Duo laptop, sha256sum of a 3.8 MB file takes 0.25 seconds; sha1sum takes 0.2 seconds.
If you intend to use a cryptographic hash, then SHA-1 should be more than adequate and you don't need SHA-256. Collisions exist (and have even been deliberately constructed), but you will not hit one by accident. Git, Mercurial and other distributed version control systems use SHA-1: Git is a content-addressed system and uses SHA-1 to find out whether content has been modified.

How to efficiently identify a binary file

What's the most efficient way to identify a binary file? I would like to extract some kind of signature from a binary file and use it to compare it with others.
The brute-force approach would be to use the whole file as a signature, which would take too long and too much memory. I'm looking for a smarter approach to this problem, and I'm willing to sacrifice a little accuracy (but not too much, ey) for performance.
(while Java code-examples are preferred, language-agnostic answers are encouraged)
Edit: Scanning the whole file to create a hash has the disadvantage that the bigger the file, the longer it takes. Since the hash wouldn't be unique anyway, I was wondering if there was a more efficient approach (ie: a hash from an evenly distributed sampling of bytes).
An approach I found effective for this sort of thing was to calculate two SHA-1 hashes. One for the first block in a file (I arbitrarily picked 512 bytes as a block size) and one for the whole file. I then stored the two hashes along with a file size. When I needed to identify a file I would first compare the file length. If the lengths matched then I would compare the hash of the first block and if that matched I compared the hash of the entire file. The first two tests quickly weeded out a lot of non-matching files.
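A sketch of that three-part signature (size, hash of the first block, hash of the whole file) in Python; the helper name, block size and chunk size are illustrative:

```python
# Sketch: cheap-to-expensive comparison key for identifying files.
import hashlib
import os

FIRST_BLOCK = 512

def file_signature(path):
    size = os.path.getsize(path)
    whole = hashlib.sha1()
    with open(path, "rb") as f:
        first = hashlib.sha1(f.read(FIRST_BLOCK)).hexdigest()
        f.seek(0)
        for chunk in iter(lambda: f.read(1 << 20), b""):
            whole.update(chunk)
    return size, first, whole.hexdigest()

# Compare candidates cheapest-first: size, then first-block hash,
# then the whole-file hash.
```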
That's what hashing is for. See MessageDigest.
Note that if your file is too big to be read in memory, that's OK because you can feed chunks of the file to the hash function. MD5 and SHA1 for example can take blocks of 512 bits.
Also, two files with the same hash aren't necessarily identical (though it's very rare for them not to be), but two identical files necessarily have the same hash.
The usual answer is to use MD5, but I'd like to suggest that there are too many collisions to use MD5 in modern applications: http://www.mscs.dal.ca/~selinger/md5collision/
SHA-1 replaced MD5 over a decade ago.
NIST recommended in 2005 that SHA-2 should be used in place of SHA-1 by the year 2010, because of work that had been done to demonstrate collisions in reduced variants of SHA-1. (Which is pretty good foresight, since it is now known that finding a collision takes about 2^51 work when it should ideally require 2^80.)
So please, based on what you're trying to do, and which other programs you may need to interoperate with, select among MD5 (please no), SHA-1 (I'd understand, but we can do better), and SHA-2 (pick me! pick me!).
Have you considered using header identification?
If you can design your files in such a way, this would be fast and reliable.
Using one byte you can distinguish 256 file types ;)

Why do you need lots of randomness for effective encryption?

I've seen it mentioned in many places that randomness is important for generating keys for symmetric and asymmetric cryptography and when using the keys to encrypt messages.
Can someone provide an explanation of how security could be compromised if there isn't enough randomness?
Randomness means unguessable input. If the input is guessable, then the output can be easily calculated. That is bad.
For example, Debian had a long-standing bug in its SSL implementation that failed to gather enough randomness when creating a key, so the software generated only one of about 32k possible keys. Anything encrypted with such a key can be decrypted easily by trying all 32k possibilities, which is very fast given today's processor speeds.
The important feature of most cryptographic operations is that they are easy to perform if you have the right information (e.g. a key) and infeasible to perform if you don't have that information.
For example, symmetric cryptography: if you have the key, encrypting and decrypting is easy. If you don't have the key (and don't know anything about its construction) then you must embark on something expensive like an exhaustive search of the key space, or a more-efficient cryptanalysis of the cipher which will nonetheless require some extremely large number of samples.
On the other hand, if you have any information on likely values of the key, your exhaustive search of the keyspace is much easier (or the number of samples you need for your cryptanalysis is much lower). For example, it is (currently) infeasible to perform 2^128 trial decryptions to discover what a 128-bit key actually is. If you know the key material came out of a time value that you know within a billion ticks, then your search just became 340282366920938463463374607431 times easier.
To decrypt a message, you need to know the right key.
The more possible keys there are to try, the harder it is to decrypt the message.
Taking an extreme example, let's say there's no randomness at all. When I generate a key to use in encrypting my messages, I'll always end up with the exact same key. No matter where or when I run the keygen program, it'll always give me the same key.
That means anyone who has access to the program I used to generate the key can trivially decrypt my messages. After all, they just have to ask it to generate a key too, and they'll get one identical to the one I used.
So we need some randomness to make it unpredictable which key you end up using. As David Schmitt mentions, Debian had a bug which made it generate only a small number of unique keys, which means that to decrypt a message encrypted by the default OpenSSL implementation on Debian, I just have to try this smaller number of possible keys. I can ignore the vast number of other valid keys, because Debian's SSL implementation will never generate those.
On the other hand, if there was enough randomness in the key generation, it's impossible to guess anything about the key. You have to try every possible bit pattern. (and for a 128-bit key, that's a lot of combinations.)
It has to do with some of the basic reasons for cryptography:
Make sure a message isn't altered in transit (Integrity)
Make sure a message isn't read in transit (Confidentiality)
Make sure the message is from who it says it's from (Authentic)
Make sure the message isn't the same as one previously sent (No Replay)
etc
There are a few things you need to include, then, to make sure the above holds. One of the important things is a random value.
For instance, if I encrypt "Too many secrets" with a key, it might come out with "dWua3hTOeVzO2d9w"
There are a few problems with this. An attacker might be able to break the encryption more easily since I'm using a very limited set of characters. If I send the same message again, it will come out exactly the same. And an attacker could record the message and replay it, and the recipient wouldn't know I didn't send it, even if the attacker never broke the encryption.
If I add some random garbage to the string each time I encrypt it, then not only does it make it harder to crack, but the encrypted message is different each time.
The other goals in the bullets above are addressed using means other than randomness (seed values, two-way authentication, and so on), but randomness takes care of a few of these problems and helps with the others.
A bad source of randomness limits the character set again, so it's easier to break, and if it's easy to guess, or otherwise limited, then the attacker has fewer paths to try when doing a brute force attack.
-Adam
A common pattern in cryptography is the following (sending text from Alice to Bob):
Take plaintext p
Generate random k
Encrypt p with k using symmetric encryption, producing ciphertext c
Encrypt k with Bob's public key, using asymmetric encryption, producing x
Send c+x to Bob
Bob reverses the process: he decrypts x using his private key to obtain k, then decrypts c with k to recover p
The reason for this pattern is that symmetric encryption is much faster than asymmetric encryption. Of course, it depends on a good random number generator to produce k, otherwise the bad guys can just guess it.
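A hedged sketch of that pattern, assuming the third-party cryptography package (AES-GCM plays the symmetric role and RSA-OAEP wraps the random key k); note that k and the nonce both come from a cryptographically secure random source, which is exactly where the randomness discussed in this question matters:

```python
# Sketch: hybrid encryption -- random symmetric key k for the bulk data,
# asymmetric encryption only for k itself.
import os
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Bob's key pair (Alice only needs the public half).
bob_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
bob_public = bob_private.public_key()

# Alice: fresh random k, fast symmetric encryption of p, then wrap k.
p = b"Too many secrets"
k = AESGCM.generate_key(bit_length=128)   # random symmetric key
nonce = os.urandom(12)
c = AESGCM(k).encrypt(nonce, p, None)     # fast bulk encryption
x = bob_public.encrypt(k, oaep)           # slow, but k is tiny

# Bob: unwrap k with his private key, then decrypt the bulk data.
k_recovered = bob_private.decrypt(x, oaep)
assert AESGCM(k_recovered).decrypt(nonce, c, None) == p
```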
Here's a "card game" analogy: Suppose we play several rounds of a game with the same deck of cards. The shuffling of the deck between rounds is the primary source of randomness. If we didn't shuffle properly, you could beat the game by predicting cards.
When you use a poor source of randomness to generate an encryption key, you significantly reduce the entropy (or uncertainty) of the key value. This could compromise the encryption because it makes a brute-force search over the key space much easier.
Work out this problem from Project Euler, and it will really drive home what "lots of randomness" will do for you. When I saw this question, that was the first thing that popped into my mind.
Using the method he talks about there, you can easily see what "more randomness" would gain you.
A pretty good paper that outlines why not being careful with randomness can lead to insecurity:
http://www.cs.berkeley.edu/~daw/papers/ddj-netscape.html
It describes how, back in 1995, the Netscape browser's SSL implementation was vulnerable because its SSL keys could be guessed, due to a problem seeding the PRNG.
