Very low collision non-cryptographic hashing function - algorithm

I'm writing an application that uses hashing to speed up file comparisons. Basically I pre-hash file A, and then the app runs and matches files in a folder with previously hashed files. My current criteria for looking for a hash function are as follows:
It should be fast enough that disk IO is the limiting factor. I'm currently using SHA-256 which works just fine but is way too heavy and makes my application CPU bound.
Cryptography/security doesn't matter in this case, the user is inputting both files, so if they craft a hash collision intentionally, that's on them.
Hash collisions should be avoided at almost all costs. I can compare files based on size, and their hash, but if both of those match the files are assumed to be equal. I know it's impossible guarantee this with any hash due to the compression of data, but something with the same sort of uniqueness guarantees as SHA-256 would be nice.
File sizes range from 10bytes to 2GB
A streaming algorithm would be nice, as I try to keep the memory usage of the application low, in other words I don't want to load the entire file into memory to hash it.
Hash size doesn't matter, if I got all the above with 1024bit hashes, I'm completely okay with that.
So what's a good algorithm to use here, I'm using C# but I'm sure most algorithms are available on any platform. Like I said, I'm using SHA-256, but I'm sure there's something better.
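For reference, the streaming SHA-256 setup described above can be sketched in C# roughly like this (a minimal sketch; the path and buffer size are placeholders, and ComputeHash reads the stream in chunks so memory use stays flat):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

class FileHasher
{
    // Streams the file through SHA-256 so only one buffer is held in memory.
    static byte[] HashFile(string path)
    {
        using var sha = SHA256.Create();
        using var stream = new FileStream(path, FileMode.Open, FileAccess.Read,
                                          FileShare.Read, bufferSize: 1 << 20);
        return sha.ComputeHash(stream); // reads the stream chunk by chunk internally
    }

    static void Main()
    {
        byte[] hash = HashFile("fileA.bin"); // placeholder path
        Console.WriteLine(BitConverter.ToString(hash));
    }
}
```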

Yann Collet's xxHash may be a good choice (Home page, GitHub)
xxHash is an extremely fast non-cryptographic hash algorithm, working at speeds close to RAM limits. It is proposed in two flavors, 32 and 64 bits.
At least 4 C# implementations are available (see the home page).
I had excellent results with it in the past.
The hash size is 32 or 64 bits, but XXH3 is in the making:
XXH3 features a wide internal state of 512 bits, which makes it suitable to generate a hash of up to 256 bits. For the time being, only 64-bit and 128-bit variants are exposed, but a similar recipe can be used for a 256-bit variant if there is any need for it one day. All variants feature the same speed, since only the finalization stage is different.
In general, the longer the hash, the slower its calculation. A 64-bit hash is good enough for most practical purposes.
You can generate longer hashes by combining two hash functions (e.g. 128-bit XXH3 and 128-bit MurmurHash3).
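For illustration, here is a minimal streaming sketch in C#. It assumes the System.IO.Hashing NuGet package (which provides XxHash64/XxHash128) rather than one of the ports linked above, and the file path is a placeholder:

```csharp
using System;
using System.IO;
using System.IO.Hashing; // assumed: the System.IO.Hashing NuGet package

class XxHashFile
{
    // Hashes a file in fixed-size chunks so memory use stays flat.
    static byte[] HashFile(string path)
    {
        var hasher = new XxHash64();        // XxHash128 would give a 128-bit digest instead
        var buffer = new byte[1 << 20];     // 1 MiB read buffer
        using var stream = File.OpenRead(path);
        int read;
        while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            hasher.Append(buffer.AsSpan(0, read));
        return hasher.GetCurrentHash();     // 8-byte digest
    }

    static void Main() =>
        Console.WriteLine(BitConverter.ToString(HashFile("fileA.bin"))); // placeholder path
}
```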

Fast hash function with collision possibility near SHA-1

I'm using SHA-1 to detect duplicates in a program handling files. It is not required to be cryptographic strong and may be reversible. I found this list of fast hash functions https://code.google.com/p/xxhash/ (list has been moved to https://github.com/Cyan4973/xxHash)
What do I choose if I want a faster function and collision on random data near to SHA-1?
Maybe a 128 bit hash is good enough for file deduplication? (vs 160 bit sha-1)
In my program the hash is calculated on chunks of 0 to 512 KB.
Maybe this will help you:
https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
collisions rare: FNV-1, FNV-1a, DJB2, DJB2a, SDBM & MurmurHash
I don't know much about xxHash, but it also looks promising.
MurmurHash is very fast, and version 3 supports a 128-bit length; I would choose this one. (It is implemented in Java and Scala.)
Since the only relevant property of hash algorithms in your case is the collision probability, you should estimate it and choose the fastest algorithm which fulfills your requirements.
If we suppose your algorithm has perfect uniformity, the probability of a hash collision among n files using hashes with d possible values is given by the birthday approximation:

    p ≈ 1 - e^(-n(n-1)/(2d)) ≈ n^2 / (2d)

For example, if you need a collision probability lower than one in a million among one million files, you will need more than 5*10^17 distinct hash values, which means your hashes need to have at least 59 bits. Let's round up to 64 to account for possibly bad uniformity.
So I'd say any decent 64-bit hash should be sufficient for you. Longer hashes will further reduce the collision probability, at the price of heavier computation and increased hash storage volume. Shorter hashes like CRC32 will require you to write some explicit collision-handling code.
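As a sanity check on the arithmetic, here is a small C# sketch of the birthday approximation above (the numbers are the ones from the example):

```csharp
using System;

class CollisionEstimate
{
    static void Main()
    {
        double n = 1e6;   // number of files
        double p = 1e-6;  // acceptable collision probability
        // Birthday approximation: p ≈ n^2 / (2d)  =>  d ≈ n^2 / (2p)
        double d = n * n / (2 * p);
        Console.WriteLine($"Need about {d:E1} distinct hash values, " +
                          $"i.e. at least {Math.Ceiling(Math.Log(d, 2))} bits");
        // Prints roughly 5.0E+017 values, 59 bits.
    }
}
```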
Google developed and uses (I think) FarmHash for performance-critical hashing. From the project page:
FarmHash is a successor to CityHash, and includes many of the same tricks and techniques, several of them taken from Austin Appleby’s MurmurHash.
...
On CPUs with all the necessary machine instructions, about six different hash functions can contribute to FarmHash's lineup. In some cases we've made significant performance gains over CityHash by using newer instructions that are now commonly available. However, we've also squeezed out some more speed in other ways, so the vast majority of programs using CityHash should gain at least a bit when switching to FarmHash.
(CityHash was already a performance-optimized hash function family by Google.)
It was released a year ago, at which point it was almost certainly the state of the art, at least among the published algorithms. (Or else Google would have used something better.) There's a good chance it's still the best option.
The facts:
Good hash functions, especially the cryptographic ones (like SHA-1), require considerable CPU time because they have to honor a number of properties that won't be very useful to you in this case;
Any hash function will give you only one certainty: if the hash values of two files are different, the files are surely different. If, however, their hash values are equal, chances are that the files are also equal, but the only way to tell for sure if this "equality" is not just a hash collision, is to fall back to a binary comparison of the two files.
The conclusion:
In your case I would try a much faster algorithm like CRC32, which has pretty much all the properties you need and would be capable of handling more than 99.9% of the cases, resorting to a slower comparison method (like binary comparison) only to rule out false positives. Being a lot faster in the great majority of comparisons would probably compensate for not having an "awesome" uniformity (possibly generating a few more collisions).
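A sketch of that strategy: a cheap 32-bit checksum as a pre-filter, with a byte-by-byte comparison to rule out false positives. Crc32 here comes from the System.IO.Hashing NuGet package (an assumption; any fast checksum would do), and reading whole files is fine for the small chunks discussed here, though large files would be streamed instead:

```csharp
using System;
using System.IO;
using System.IO.Hashing; // assumed: the System.IO.Hashing NuGet package
using System.Linq;

static class DuplicateCheck
{
    static uint Crc32Of(string path) =>
        BitConverter.ToUInt32(Crc32.Hash(File.ReadAllBytes(path)), 0);

    // Cheap checks first; a full binary comparison only when everything else matches.
    public static bool SameContent(string a, string b)
    {
        if (new FileInfo(a).Length != new FileInfo(b).Length) return false;  // different size
        if (Crc32Of(a) != Crc32Of(b)) return false;                          // different hash => different content
        return File.ReadAllBytes(a).SequenceEqual(File.ReadAllBytes(b));     // rule out a collision
    }
}
```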
128 bits is indeed good enough to detect different files or chunks. The risk of collision is infinitesimal, at least as long as no intentional collision is being attempted.
64 bits can also prove good enough if the number of files or chunks you want to track remains "small enough" (i.e. no more than a few million).
Once the size of the hash is settled, you need a hash with some very good distribution properties, such as the ones listed with Q.Score=10 in your link.
It rather depends on how many hashes you are going to compute. For example, a 64-bit hash reaches a collision probability of 1 in 1,000,000 at around 6 million hashes computed.
Refer to: Hash collision probabilities
Check out MurmurHash2_160. It's a modification of MurmurHash2 which produces 160-bit output.
It computes 5 unique results of MurmurHash2 in parallel and mixes them thoroughly. The collision probability is equivalent to SHA-1 based on the digest size.
It's still fast, but MurmurHash3_128, SpookyHash128 and MetroHash128 are probably faster, albeit with a higher (but still very unlikely) collision probability. There's also CityHash256, which produces a 256-bit output and should also be faster than SHA-1.

MD5 vs CRC32: Which one's better for common use?

Recently I read somewhere that although both CRC32 and MD5 are sufficiently uniform and stable, CRC32 is more efficient than MD5. MD5 seems to be a very commonly used hashing algorithm but if CRC32 is faster/more memory efficient then why not use that?
MD5 is a one-way hash algorithm. One-way hash algorithms are often used in cryptography as they have the property (by design) that it's hard to find the input that produced a specific hash value. Specifically, it's hard to make two different inputs that give the same one-way hash. They are often used as a way to show that an amount of data has not been altered intentionally since the hash code was produced. As MD5 is a one-way hash algorithm, the emphasis is on security over speed. Unfortunately, MD5 is now considered insecure.
CRC32 is designed to detect accidental changes to data and is commonly used in networks and storage devices. The purpose of this algorithm is not to protect against intentional changes, but rather to catch accidents like network errors and disk write errors, etc. The emphasis of this algorithm is more on speed than on security.
From Wikipedia's article on MD5 (emphasis mine):
MD5 is a widely used cryptographic hash function
Now CRC32:
CRC is an error-detecting code
So, as you can see, CRC32 is not a hashing algorithm. That means you should not use it for hashing, because it was not built for that.
And I think it doesn't make much sense to talk about common use, because similar algorithms are used for different purposes, each with significantly different requirements. There is no single algorithm that's best for common use, instead, you should choose the algorithm that's most suited for your specific use.
It depends on your goals. Here are some examples what can be done with CRC32 versus MD5:
Detecting duplicate files
If you want to check if two files are the same, CRC32 checksum is the way to go because it's faster than MD5. But be careful: CRC only reliably tells you if the binaries are different; it doesn't tell you if they're identical. If you get different hashes for two files, they cannot be the same file, so you can reject them as being duplicates very quickly.
No matter what your keys are, the CRC32 checksum will be one of 2^32 different values. Assuming random sample files, the probability that two given files share the same checksum is 1 / 2^32, and the probability that a given file collides with any of N - 1 other files is roughly (N - 1) / 2^32.
Detecting malicious software
If security is an issue, like downloading a file and checking the source's hash against yours to see if the binaries aren't corrupted, then CRC is a poor option. This is because attackers can make malware that will have the same CRC checksum. In this case, an MD5 digest is more secure -- CRC was not made for security. Two different binaries are far more likely to have the same CRC checksum than the same MD5 digest.
Securing passwords for user authentication
One-way hashing is usually easier, faster, and more secure than two-way (reversible) encryption, so it's a common method for storing passwords. Basically, the password will be combined with other data (a salt) and then the hash will be computed over all of this combined data. Random salts greatly reduce the chance of two passwords producing the same hash. By default, the same password will produce the same hash for most algorithms, so you must add your own randomness. Of course, the salt must be saved externally.
To log a user in, you just take the information they give you when they log in. You use their username to get their salt from a database. You then combine this salt with the user's password to get a new hash. If it matches the one in the database, then their login is successful. Since you're storing these passwords, they must be VERY secure, which means a CRC checksum is out of the question.
Cryptographic digests are more expensive to compute than CRC checksums. Also, better hashes like sha256 are more secure, but slower for hashing and take up more database space (their hashes are longer).
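A minimal sketch of the salt-then-hash-then-verify flow described above. It uses PBKDF2 (Rfc2898DeriveBytes) rather than a bare hash, which is a swapped-in choice on my part, and the sizes and iteration count are illustrative assumptions, not recommendations; newer .NET APIs are assumed:

```csharp
using System;
using System.Security.Cryptography;

static class PasswordStore
{
    // Illustrative parameters only.
    const int SaltSize = 16, HashSize = 32, Iterations = 100_000;

    public static (byte[] Salt, byte[] Hash) HashPassword(string password)
    {
        var salt = new byte[SaltSize];
        RandomNumberGenerator.Fill(salt);   // random salt, stored alongside the hash
        using var kdf = new Rfc2898DeriveBytes(password, salt, Iterations, HashAlgorithmName.SHA256);
        return (salt, kdf.GetBytes(HashSize));
    }

    public static bool Verify(string password, byte[] salt, byte[] expected)
    {
        // Recompute the hash from the stored salt and compare in constant time.
        using var kdf = new Rfc2898DeriveBytes(password, salt, Iterations, HashAlgorithmName.SHA256);
        return CryptographicOperations.FixedTimeEquals(kdf.GetBytes(HashSize), expected);
    }
}
```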
One big difference between CRC32 and MD5 is that it is usually easy to pick a CRC32 checksum and then come up with a message that hashes to that checksum, even if there are constraints imposed on the message, whereas MD5 is specifically designed to make this sort of thing difficult (although it is showing its age - this is now possible in some situations).
If you are in a situation where it is possible that an adversary might decide to sit down and create a load of messages with specified CRC32 hashes, to mimic other messages, or just to make a hash table perform very badly because everything hashes to the same value, then MD5 would be a better option. (Even better, IMHO, would be HMAC-MD5 with a keyed value that is unique to the module using it and unknown outside it).
CRCs are used to guard against random errors, for example in data transmission.
Cryptographic hash functions are designed to guard against intelligent adversaries forging the message, though MD5 has been broken in that respect.
Actually, CRC32 is not faster than MD5.
Please take a look at: https://3v4l.org/2MAUr
That PHP script runs several hashing algorithms and measures the time each algorithm spends calculating the hashes. It shows that MD5 is generally the fastest hashing algorithm around, and that even SHA1 is faster than CRC32 in most of the test cases.
So, anyway, if you want to do some quick error detection or look for random changes... I would always advise going with MD5, as it simply does it all.
The primary reason CRC32 (or CRC8, or CRC16) is used for any purpose whatsoever is that it can be cheaply implemented in hardware as a means of detecting "random" corruption of data. Even in software implementations, it can be useful as a means of detecting random corruption of data from hardware causes (such as a noisy communications line or unreliable flash media). It is not tamper-resistant, nor is it generally suitable for testing whether two arbitrary files are likely to be the same: if each chunk of data in a file is immediately followed by a CRC32 of that chunk (some data formats do that), each chunk will have the same effect on the overall file's CRC as would a chunk of all zero bytes, regardless of what data was stored in that chunk.
If one has the means to calculate a CRC32 quickly, it might be helpful in conjunction with other checksum or hash methods, if different files that had identical CRC's would be likely to differ in one of the other hashes and vice versa, but on many machines other checksum or hash methods are likely to be easier to compute relative to the amount of protection they provide.
You should use MD5, which is 128 bits long.
CRC32 is only 32 bits long, and its purpose is to detect errors, not to hash things.
If you need only a 32-bit hash function, you can take 32 bits of the MD5 output (the LSBs, the MSBs, or whatever).
One man's common is another man's infrequent. Common varies depending on which field you are working in.
If you are doing very quick transmissions or working out hash codes for small items, then CRCs are better since they are a lot faster and the chances of getting the same 16 or 32 bit CRC for wrong data are slim.
If it is megabytes of data, for instance a Linux ISO, then you could lose a few megabytes and still end up with the same CRC. Not so likely with MD5. For that reason MD5 is normally used for huge transfers. It is slower but more reliable.
So basically, if you are going to do one huge transmission and check at the end whether you have the correct result, use MD5. If you are going to transmit in small chunks, then use CRC.
I would say if you don't know what to choose, go for md5.
It's less probable to cause you a headache.

characteristics of various hash algorithms?

MD5, MD6?, all the SHA-somethings, CRC-somethings. I've used them before and seen them used in various places, but I have no idea why you would use one over another.
On a very high level, what is the difference between all these 3/4-letter acronyms in terms of performance, collision probability and general hard-to-crackness? Does any of that depend on what kind or what amount of data I am hashing?
What trade-offs am I making when I choose one over another? I've read that CRC is not suitable for security use, but what about for general hash-table collision avoidance?
CRC-whatever is used primarily (and should be used exclusively) for protection against accidental changes in data. CRCs do quite a good job of detecting noise and such, but are not intended for cryptographic purposes -- finding a second preimage (a second input that produces the same hash) is (by cryptographic standards) trivial. [Edit: As noted by @Jon, unlike the other hashes mentioned here, CRC is not and never was intended for cryptographic use.]
MD-5. Originally intended for cryptographic use, but fairly old and now considered fairly weak. Although no second preimage attack is known, a collision attack is known (i.e., a way to produce two selected inputs that produce the same result, but not a second input to produce the same result as one that's specified). About the only time to use this any more is as a more elaborate version of a CRC.
SHA-x
Once upon a time, there was simply "SHA". Very early in its history, a defect was found, and a slight modification was made to produce SHA-1. SHA was in use for a short enough time that it's rarely of practical interest.
SHA-1 is generally more secure than MD-5, but still in the same general range -- a collision attack is known, though it's a lot[1] more expensive than for MD-5. No second preimage attack is known, but the collision attack is enough to say "stay away".
SHA-256, SHA-384, SHA-512: These are sort of based on SHA-1, but are somewhat more complex internally. At least as far as I'm aware, neither a second-preimage attack nor a collision attack is known on any of these at the present time.
SHA-3: The US National Institute of Standards and Technology (NIST) is currently holding a competition to standardize a replacement for the current SHA-2 series of hash algorithms, which will apparently be called SHA-3. As I write this (September 2011) the competition is currently in its third round, with five candidates (Blake, Grøstl, JH, Keccak and Skein[2]) left in the running. Round 3 is scheduled to be over in January 2012, at which time public comments on the algorithms will no longer be (at least officially) accepted. In March 2012, a (third) SHA-3 conference will be held (in Washington DC). At some unspecified date later in 2012, the final selection will be announced.
[1] For anybody who cares about how much more expensive it is to attack SHA-1 than MD-5, I'll try to give some concrete numbers. For MD-5, my ~5-year-old machine can produce a collision in about 40-45 minutes. For SHA-1, I only have an estimate, but my estimate is that a cluster to produce collisions at a rate of one per week would cost well over a million US dollars (and probably closer to $10 million). Even given an existing machine, the cost of operating the machine long enough to find a collision is substantial.
[2] Since it's almost inevitable that somebody will at least wonder, I'll point out that the entry Bruce Schneier worked on is Skein.
Here's the really short summary:
MD4, MD5, SHA-1 and SHA-2 all belong to a category of general purpose secure hashing algorithms. They're generally interchangeable, in that they all produce a hashcode from a plaintext such that it's designed to be computationally infeasible to determine a plaintext that generates a hash (a 'preimage'), or to find two texts that hash to the same thing (a 'collision'). All of them are broken to various degrees, or at least believed to be vulnerable.
MD6 was a candidate for NIST's SHA-3 competition, but it was withdrawn. It has the same general characteristics as the above hash functions, but like many of the SHA-3 candidates it adds extra features - in this case a Merkle-tree-like structure for improving parallelization of hashes. It goes without saying that it, along with the remaining SHA-3 candidates, is not yet well tested.
A CRC is in fact not a hash algorithm at all. Its name stands for Cyclic Redundancy Check, and it's a checksum rather than a hash. Different CRCs are designed to resist different levels of transmission errors, but they all have in common a guarantee that they will detect a certain number of bit errors, something hash algorithms do not share. They're not as well distributed as a hash algorithm, so shouldn't be used as one.
There are a range of general purpose hash algorithms suitable for use in hashtables etcetera, such as FNV. These tend to be a lot faster than cryptographic hashes, but aren't designed to resist attacks by an adversary. Unlike secure hashes, some of them show quite poor distribution characteristics, and are only suitable for hashing certain types of data.
To complete the other answers: performance varies among hash functions. Hash functions are built on elementary operations, which are more or less efficient, depending on the architecture. For instance, the SHA-3 candidate Skein uses additions on 64-bit integers and is very fast on platforms which offer 64-bit operations, but on 32-bit-only systems (including all ARM processors), Skein is much slower.
SHA-256 is usually said to be "slow" but will still hash data at a rate of about 150 megabytes per second on a basic PC (a 2.4 GHz Core2), which is more than enough for most applications. It is rare that hash function performance is really important on a PC. Things can be different on embedded systems (from smartcards to smartphones) where you could get more data to process than what the CPU can handle. MD5 will typically be 3 to 6 times faster than SHA-256. SHA-256 is still the recommended default choice, since its security is still intact; consider using something else only if you get a real, duly observed and measured performance issue.
On small 32-bit architectures (MIPS, ARM...), all remaining SHA-3 candidates are slower than SHA-256, so getting something faster and yet secure could be challenging.

Ideal hashing method for wide distribution of values?

As part of the rhythm game I'm working on, I'm allowing users to create and upload custom songs and notecharts. I'm thinking of hashing the songs and notecharts to uniquely identify them. Of course, I'd like as few collisions as possible; however, cryptographic strength isn't as important here as a wide, uniform range. In addition, since I'd be performing the hashes rarely, computational efficiency isn't too big of an issue.
Is this as easy as selecting a tried-and-true hashing algorithm with the largest digest size? Or are there some intricacies that I should be aware of? I'm looking at either SHA-256 or 512, currently.
Any cryptographic-strength algorithm should exhibit no collisions at all. Of course, collisions necessarily exist (there are more possible inputs than possible outputs) but it should be impossible, using existing computing technology, to actually find one.
When the hash function has an output of n bits, it is possible to find a collision with work of about 2^(n/2), so in practice a hash function with less than about 140 bits of output cannot be cryptographically strong. Moreover, some hash functions have weaknesses that allow attackers to find collisions faster than that; such functions are said to be "broken". A prime example is MD5.
If you are not in a security setting, and fear only random collisions (i.e. nobody will actively try to provoke a collision, they may happen only out of pure bad luck), then a broken cryptographic hash function will be fine. The usual recommendation is then MD4. Cryptographically speaking, it is as broken as it can be, but for non-cryptographic purposes it is devilishly fast, and provides 128 bits of output, which avoids random collisions.
However, chances are that you will not have any performance issue with SHA-256 or SHA-512. On a basic PC, they already process data faster than a hard disk can deliver it: if you hash a file, reading the file will be the bottleneck, not the hashing. My advice would be to use SHA-256, possibly truncating its output to 128 bits (if used in a non-security situation), and consider switching to another function only if some performance-related trouble is duly noticed and measured.
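A minimal sketch of that suggestion (streaming SHA-256, keeping the first 128 bits): the 16-byte cut is an arbitrary illustrative choice and this is for non-security identification only.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

static class TruncatedId
{
    // 128-bit identifier: stream the file through SHA-256, keep the first 16 of the 32 output bytes.
    public static byte[] Id128(string path)
    {
        using var sha = SHA256.Create();
        using var stream = File.OpenRead(path);
        byte[] full = sha.ComputeHash(stream);
        byte[] id = new byte[16];
        Array.Copy(full, id, 16);
        return id;
    }
}
```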
If you're using it to uniquely identify tracks, you do want a cryptographic hash: otherwise, users could deliberately create tracks that hash the same as existing tracks, and use that to overwrite them. Barring a compelling reason otherwise, SHA-1 should be perfectly satisfactory.
If cryptographic security is not a concern, then you can look at this link & this one. The fastest and simplest (to implement) would be Pearson hashing, if you are planning to compute a hash of the title/name and later do lookups. Or you can have a look at the SuperFastHash here. It is also very good for non-cryptographic use.
What's wrong with something like an md5sum? Or, if you want a faster algorithm, I'd just create a hash from the file length (mod 64K to fit in two bytes) and 32-bit checksum. That'll give you a 6-byte hash which should be reasonably well distributed. It's not overly complex to implement.
Of course, as with all hashing solutions, you should monitor the collisions and change the algorithm if the cardinality gets too low. This would be true regardless of the algorithm chosen (since your users may start uploading degenerate data).
You may end up finding you're trying to solve a problem that doesn't exist (in other words, possible YAGNI).
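As a sketch of the length-plus-checksum idea above: Crc32 from the System.IO.Hashing NuGet package stands in for "a 32-bit checksum" (that package, and reading the whole file at once, are assumptions for brevity).

```csharp
using System;
using System.IO;
using System.IO.Hashing; // assumed: the System.IO.Hashing NuGet package for Crc32

static class CheapId
{
    // 2 bytes of (file length mod 64K) followed by a 4-byte CRC32: a 6-byte identifier.
    public static byte[] Id(string path)
    {
        byte[] data = File.ReadAllBytes(path);        // songs/notecharts assumed small
        ushort len16 = (ushort)(data.Length % 65536);
        byte[] id = new byte[6];
        BitConverter.GetBytes(len16).CopyTo(id, 0);   // bytes 0-1: truncated length
        Array.Copy(Crc32.Hash(data), 0, id, 2, 4);    // bytes 2-5: CRC32 checksum
        return id;
    }
}
```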
Isn't cryptographic hashing overkill in this case, even though I understand that modern computers do this calculation pretty fast? I assume that your users will have a unique userid. When they upload, you just need to increment a number. So, you will represent them internally as userid1_song_1, userid1_song_2 etc. You can store this info in a database with that as the unique key along with the user-specified name.
You also didn't mention the size of these songs. If it is midi, then file size will be small. If file sizes are big (say 3MB) then sha calculations will not be instantaneous. On my core2-duo laptop, sha256sum of a 3.8 MB file takes 0.25 sec; for sha1sum it is 0.2 seconds.
If you intend to use a cryptographic hash, then sha1 should be more than adequate and you don't need sha256. No collisions --- though they exist --- have been found yet. Git, Mercurial and other distributed version control systems use sha1. Git is a content-based system and uses sha1 to find out if content has been modified.

Is it worthwhile to use a bit vector/array rather than a simple array of bools?

When I want an array of flags it has typically pained me to use an entire byte (or word) to store each one, as would be the result if I made an array of bools or some other numeric type that could be set to 0 or 1. But now I wonder whether using a structure that is more space-efficient is worth it given the (albeit hopefully very slight) additional overhead of shifting and bit testing.
In my company we use Rogue Wave tools (though hopefully not for much longer) and it's their RWBitVec that I've used for this purpose up until now.
It's mostly about saving memory. If your array of bools is large enough that an 8x improvement in storage space is meaningful, then by all means, use a bitarray.
Note that the memory access is pretty expensive compared to the shift/and, so the bitarray approach is slightly faster than the array-of-chars. Basically it comes down to memory versus programmer time. Remember that premature optimization is a waste of time. I'd use whichever approach is the easiest to develop, and then refactor only after it shows that it's a primary performance bottleneck.
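To put a number on the 8x figure, here is a tiny sketch in C# (for consistency with the earlier examples in this collection; BitArray plays the role that RWBitVec or std::bitset would play in C++, and the same arithmetic applies there):

```csharp
using System;
using System.Collections;

class FlagStorage
{
    static void Main()
    {
        const int n = 1_000_000;
        bool[] flags = new bool[n];       // ~1,000,000 bytes: one byte per flag
        BitArray bits = new BitArray(n);  // ~125,000 bytes: one bit per flag, packed into ints
        Console.WriteLine($"{flags.Length} bools vs {bits.Length} packed bits");
    }
}
```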
Don't use vector<bool>, it's not really a Container:
http://www.informit.com/guides/content.aspx?g=cplusplus&seqNum=98
Use std::bitset (for fixed size bitsets) and boost::dynamic_bitset (for resizeable ones) where appropriate. They aren't Containers either, but they don't look as if they ought to be, so are less likely to cause confusion.
Whether the trade-off is worth it depends, obviously, on how big the arrays are in your program. I think you're right that the overhead of bit access is usually negligible, but if the memory overhead is negligible too then you've nothing to go on there either.
bitsets have the advantage that they do exactly what they say on the tin - none of this "declare an array of chars/ints, but the only legal values are 0 and 1" nonsense. Your code will read about the same as if you'd used an array.
I wrote some code once to unpack a bitmap image line into separate bytes per pixel, then pack it back again after processing. For the code I was benchmarking, it was actually faster to do it that way than to work at the bit level.
I've used a bit array for indexing a HUGE tree. The algorithm was:
Check the bit array to see whether the entry exists
if the entry doesn't exist
    return null
else
    do a binary search in the tree
    return the value
The advantage is that the tree was huge enough that searching for a non-existent entry would cause several cache misses before completing. Thus the algorithm was taking longer or not depending on whether the value existed.
However, adding that initial bit-array search meant I'd reduce cache misses and would avoid searching the tree at all if the answer wasn't there. By adding this extra step the algorithm became much more robust (actual performance time on a computer became nearly linear, although the Big-O would say differently), and overall performance increased by an order of magnitude.
Like they say, sometimes taking the hardware into consideration is more important than the "ideal" mathematical algorithm.
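A compact C# sketch of that pre-check pattern (BitArray and SortedSet are stand-ins for the original bit vector and tree, and keys are assumed to be small non-negative integers so they can index the bit array directly):

```csharp
using System.Collections;
using System.Collections.Generic;

class IndexedLookup
{
    readonly BitArray present;      // one bit per possible key
    readonly SortedSet<int> tree;   // stands in for the huge search tree

    public IndexedLookup(IEnumerable<int> keys, int keySpace)
    {
        present = new BitArray(keySpace);
        tree = new SortedSet<int>();
        foreach (int k in keys) { present[k] = true; tree.Add(k); }
    }

    // Cheap bit test first; only walk the cache-unfriendly tree when the bit is set.
    public bool Contains(int key) => present[key] && tree.Contains(key);
}
```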
Modern computers have barrel shifters so that a shift of any number of bits up to 31 takes a few cycles (less than many other instructions). Compilers take advantage of this and bit operations are not only space efficient but in most cases time efficient.
But it really depends on how you're using and testing the bits - there are some inefficient methods that would make using a whole integer faster.
-Adam
Is it worth it? Only if you know that you have a problem with memory usage.
But unless you're either:
Working on an embedded processor with very limited resources, or
Storing an astronomical number of bools
then the answer is no. You'll have to work somewhat harder to achieve the same level of readability in your source by using a bitmap than you will using bools, and unless you're operating under either of the previous two conditions you'll likely find that it doesn't make any noticeable difference to your memory footprint.
