Fast hash function with collision possibility near SHA-1 - performance

I'm using SHA-1 to detect duplicates in a program handling files. It does not need to be cryptographically strong, and it may be reversible. I found this list of fast hash functions: https://code.google.com/p/xxhash/ (the list has since moved to https://github.com/Cyan4973/xxHash).
What should I choose if I want a faster function with a collision probability on random data close to SHA-1's?
Maybe a 128-bit hash is good enough for file deduplication? (vs. the 160-bit SHA-1)
In my program the hash is calculated on chunks of 0 to 512 KB.

Maybe this will help you:
https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
Collisions are rare with: FNV-1, FNV-1a, DJB2, DJB2a, SDBM & MurmurHash
I don't know about xxHash, but it also looks promising.
MurmurHash is very fast, and version 3 supports a 128-bit output, so I would choose that one. (It is implemented in Java and Scala.)
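For illustration, here is a minimal Python sketch of fingerprinting a chunk with 128-bit MurmurHash3. It assumes the third-party mmh3 package (a MurmurHash3 binding); the file name is only a placeholder.

```python
# Minimal sketch: 128-bit MurmurHash3 fingerprint of a chunk.
# Assumes the third-party "mmh3" package (pip install mmh3).
import mmh3

def chunk_fingerprint(chunk: bytes) -> int:
    # mmh3.hash128 returns the 128-bit digest as a Python int.
    return mmh3.hash128(chunk)

with open("some_file.bin", "rb") as f:     # placeholder path
    chunk = f.read(512 * 1024)             # up to 512 KB, as in the question
    print(hex(chunk_fingerprint(chunk)))
```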

Since the only relevant property of hash algorithms in your case is the collision probability, you should estimate it and choose the fastest algorithm which fulfills your requirements.
If we suppose your algorithm has perfect uniformity, the probability of a hash collision among n files, using hashes with d possible values, is given by the birthday problem: approximately p ≈ 1 − e^(−n(n−1)/(2d)) ≈ n²/(2d) when p is small.
For example, if you need a collision probability lower than one in a million among one million files, you will need more than 5*10^17 distinct hash values, which means your hashes need to have at least 59 bits. Let's round to 64 to account for possibly bad uniformity.
So I'd say any decent 64-bit hash should be sufficient for you. Longer hashes will further reduce the collision probability, at the price of heavier computation and increased hash storage volume. Shorter hashes like CRC32 will require you to write some explicit collision handling code.
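As a quick sanity check of those numbers, here is a small Python sketch (standard library only) of the same birthday-bound estimate:

```python
import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday-bound approximation: p ~ 1 - exp(-n*(n-1) / (2*d))."""
    d = 2 ** hash_bits
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2 * d))

def bits_needed(n_items: int, max_probability: float) -> int:
    """Smallest hash size (in bits) keeping the collision risk below the target."""
    bits = 1
    while collision_probability(n_items, bits) > max_probability:
        bits += 1
    return bits

print(collision_probability(1_000_000, 64))   # about 2.7e-08 for a 64-bit hash
print(bits_needed(1_000_000, 1e-6))           # 59 bits, matching the estimate above
```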

Google developed and uses (I think) FarmHash for performance-critical hashing. From the project page:
FarmHash is a successor to CityHash, and includes many of the same tricks and techniques, several of them taken from Austin Appleby’s MurmurHash.
...
On CPUs with all the necessary machine instructions, about six different hash functions can contribute to FarmHash's lineup. In some cases we've made significant performance gains over CityHash by using newer instructions that are now commonly available. However, we've also squeezed out some more speed in other ways, so the vast majority of programs using CityHash should gain at least a bit when switching to FarmHash.
(CityHash was already a performance-optimized hash function family by Google.)
It was released a year ago, at which point it was almost certainly the state of the art, at least among the published algorithms. (Or else Google would have used something better.) There's a good chance it's still the best option.

The facts:
Good hash functions, especially the cryptographic ones (like SHA-1), require considerable CPU time because they have to honor a number of properties that won't be very useful to you in this case;
Any hash function will give you only one certainty: if the hash values of two files are different, the files are surely different. If, however, their hash values are equal, chances are that the files are also equal, but the only way to tell for sure that this "equality" is not just a hash collision is to fall back to a binary comparison of the two files.
The conclusion:
In your case I would try a much faster algorithm like CRC32, which has pretty much all the properties you need and would be capable of handling more than 99.9% of the cases, only resorting to a slower comparison method (like binary comparison) to rule out the false positives. Being a lot faster in the great majority of comparisons would probably compensate for not having "awesome" uniformity (possibly generating a few more collisions).
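Here is a minimal sketch of that strategy in Python (standard library only; the file paths are placeholders): filter candidate pairs with CRC32 first, then confirm suspected duplicates byte-by-byte.

```python
import filecmp
import zlib

def crc32_of_file(path: str, chunk_size: int = 512 * 1024) -> int:
    """Streaming CRC32 so large files never have to fit in memory."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc

def probably_duplicates(path_a: str, path_b: str) -> bool:
    # Cheap filter first: different CRC32 means definitely different files.
    if crc32_of_file(path_a) != crc32_of_file(path_b):
        return False
    # Equal CRC32 could still be a collision, so confirm byte-by-byte.
    return filecmp.cmp(path_a, path_b, shallow=False)
```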

128 bits is indeed good enough to detect different files or chunks. The risk of collision is infinitesimal, at least as long as no intentional collision is being attempted.
64 bits can also prove good enough if the number of files or chunks you want to track remains "small enough" (i.e. no more than a few million).
Once the hash size is settled, you need a hash with some very good distribution properties, such as the ones listed with Q.Score=10 in your link.

It kind of depends on how many hashes you are going to compute in total.
E.g., a 64-bit hash reaches a collision probability of 1 in 1,000,000 with about 6 million hashes computed.
Refer to: Hash collision probabilities

Check out MurmurHash2_160. It's a modification of MurmurHash2 which produces 160-bit output.
It computes 5 unique results of MurmurHash2 in parallel and mixes them thoroughly. The collision probability is equivalent to SHA-1 based on the digest size.
It's still fast, but MurmurHash3_128, SpookyHash128 and MetroHash128 are probably faster, albeit with a higher (but still very unlikely) collision probability. There's also CityHash256 which produces a 256-bit output which should be faster than SHA-1 as well.

Related

Very low collision non-cryptographic hashing function

I'm writing an application that uses hashing to speed up file comparisons. Basically I pre-hash file A, and then the app runs and matches files in a folder with previously hashed files. My current criteria for looking for a hash function are as follows:
It should be fast enough that disk IO is the limiting factor. I'm currently using SHA-256, which works just fine but is way too heavy and makes my application CPU-bound.
Cryptography/security doesn't matter in this case; the user is inputting both files, so if they craft a hash collision intentionally, that's on them.
Hash collisions should be avoided at almost all costs. I can compare files based on size and their hash, but if both of those match, the files are assumed to be equal. I know it's impossible to guarantee this with any hash due to the compression of data, but something with the same sort of uniqueness guarantees as SHA-256 would be nice.
File sizes range from 10 bytes to 2 GB.
A streaming algorithm would be nice, as I try to keep the memory usage of the application low; in other words, I don't want to load the entire file into memory to hash it.
Hash size doesn't matter; if I got all the above with 1024-bit hashes, I'm completely okay with that.
So what's a good algorithm to use here? I'm using C#, but I'm sure most algorithms are available on any platform. Like I said, I'm using SHA-256, but I'm sure there's something better.
Yann Collet's xxHash may be a good choice (Home page, GitHub)
xxHash is an extremely fast non-cryptographic hash algorithm, working at speeds close to RAM limits. It is proposed in two flavors, 32 and 64 bits.
At least 4 C# implementations are available (see the home page).
I had excellent results with it in the past.
The hash size is 32 or 64 bits, but XXH3 is in the making:
XXH3 features a wide internal state of 512 bits, which makes it suitable to generate a hash of up to 256 bits. For the time being, only 64-bit and 128-bit variants are exposed, but a similar recipe can be used for a 256-bit variant if there is any need for it one day. All variants feature the same speed, since only the finalization stage is different.
In general, the longer the hash, the slower its calculation. A 64-bit hash is good enough for most practical purposes.
You can generate longer hashes by combining two hash functions (e.g. 128-bit XXH3 and 128-bit MurmurHash3).
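To address the streaming requirement, here is a minimal Python sketch that hashes a file in fixed-size chunks so it never has to be loaded into memory at once. It assumes the third-party xxhash package (version 2.0 or later, which exposes XXH3); the path is a placeholder.

```python
# Minimal sketch: streaming XXH3-128 over a file without loading it all.
# Assumes the third-party "xxhash" package (pip install xxhash).
import xxhash

def xxh3_128_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    h = xxhash.xxh3_128()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)          # incremental updates keep memory usage flat
    return h.hexdigest()

print(xxh3_128_of_file("some_large_file.bin"))   # placeholder path
```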

characteristics of various hash algorithms?

MD5, MD6?, all the SHA-somethings, CRC-somethings. I've used them before and seen them used in various places, but I have no idea why you would use one over another.
On a very high level, what is the difference between all these 3/4-letter acronyms in terms of performance, collision probability and general hard-to-crackness? Does any of those depend on what kind or what amount of data I am hashing?
What trade-offs am I making when I choose one over another? I've read that CRC is not suitable for security, but what about for general hash-table collision avoidance?
CRC-whatever is used primarily (and should be used exclusively) for protection against accidental changes in data. CRCs do quite a good job of detecting noise and such, but are not intended for cryptographic purposes -- finding a second preimage (a second input that produces the same hash) is (by cryptographic standards) trivial. [Edit: As noted by @Jon, unlike the other hashes mentioned here, CRC is not and never was intended for cryptographic use.]
MD-5. Originally intended for cryptographic use, but fairly old and now considered fairly weak. Although no second preimage attack is known, a collision attack is known (i.e., a way to produce two selected inputs that produce the same result, but not a second input to produce the same result as one that's specified). About the only time to use this any more is as a more elaborate version of a CRC.
SHA-x
Once upon a time, there was simply "SHA". Very early in its history, a defect was found, and a slight modification was made to produce SHA-1. SHA was in use for a short enough time that it's rarely of practical interest.
SHA-1 is generally more secure than MD-5, but still in the same general range -- a collision attack is known, though it's a lot¹ more expensive than for MD-5. No second preimage attack is known, but the collision attack is enough to say "stay away".
SHA-256, SHA-384, SHA-512: These are sort of based on SHA-1, but are somewhat more complex internally. At least as far as I'm aware, neither a second-preimage attack nor a collision attack is known on any of these at the present time.
SHA-3: The US National Institute of Standards and Technology (NIST) is currently holding a competition to standardize a replacement for the current SHA-2 series hash algorithms, which will apparently be called SHA-3. As I write this (September 2011) the competition is currently in its third round, with five candidates (Blake, Grøstl, JH, Keccak and Skein²) left in the running. Round 3 is scheduled to be over in January 2012, at which time public comments on the algorithms will no longer be (at least officially) accepted. In March 2012, a (third) SHA-3 conference will be held (in Washington DC). At some unspecified date later in 2012, the final selection will be announced.
¹ For anybody who cares about how much more expensive it is to attack SHA-1 than MD-5, I'll try to give some concrete numbers. For MD-5, my ~5-year-old machine can produce a collision in about 40-45 minutes. For SHA-1, I only have an estimate, but my estimate is that a cluster to produce collisions at a rate of one per week would cost well over a million US dollars (and probably closer to $10 million). Even given an existing machine, the cost of operating the machine long enough to find a collision is substantial.
² Since it's almost inevitable that somebody will at least wonder, I'll point out that the entry Bruce Schneier worked on is Skein.
Here's the really short summary:
MD4, MD5, SHA-1 and SHA-2 all belong to a category of general purpose secure hashing algorithms. They're generally interchangeable, in that they all produce a hashcode from a plaintext such that it's designed to be computationally infeasible to determine a plaintext that generates a hash (a 'preimage'), or to find two texts that hash to the same thing (a 'collision'). All of them are broken to various degrees, or at least believed to be vulnerable.
MD6 was a candidate for NIST's SHA-3 competition, but it was withdrawn. It has the same general characteristics as the above hash functions, but like many of the SHA-3 candidates, adds extra features -- in this case a Merkle-tree-like structure for improving parallelization of hashes. It goes without saying that it, along with the remaining SHA-3 candidates, is not yet well tested.
A CRC is in fact not a hash algorithm at all. Its name stands for Cyclic Redundancy Check, and it's a checksum rather than a hash. Different CRCs are designed to resist different levels of transmission errors, but they all have in common a guarantee that they will detect a certain number of bit errors, something hash algorithms do not share. They're not as well distributed as a hash algorithm, so shouldn't be used as one.
There are a range of general purpose hash algorithms suitable for use in hashtables etcetera, such as FNV. These tend to be a lot faster than cryptographic hashes, but aren't designed to resist attacks by an adversary. Unlike secure hashes, some of them show quite poor distribution characteristics, and are only suitable for hashing certain types of data.
To complete the other answers: performance varies among hash functions. Hash functions are built on elementary operations, which are more or less efficient, depending on the architecture. For instance, the SHA-3 candidate Skein uses additions on 64-bit integers and is very fast on platforms which offer 64-bit operations, but on 32-bit-only systems (including all ARM processors), Skein is much slower.
SHA-256 is usually said to be "slow" but will still hash data at a rate of about 150 megabytes per second on a basic PC (a 2.4 GHz Core2), which is more than enough for most applications. It is rare that hash function performance is really important on a PC. Things can be different on embedded systems (from smartcards to smartphones) where you could get more data to process than what the CPU can handle. MD5 will typically be 3 to 6 times faster than SHA-256. SHA-256 is still the recommended default choice, since its security is still intact; consider using something else only if you get a real, duly observed and measured performance issue.
On small 32-bit architectures (MIPS, ARM...), all remaining SHA-3 candidates are slower than SHA-256, so getting something faster and yet secure could be challenging.
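As a rough way to check such throughput figures on your own hardware, here is a small Python sketch (standard library only) that times a few hashlib algorithms on an in-memory buffer; absolute numbers will vary widely by CPU and implementation.

```python
import hashlib
import time

def throughput_mb_per_s(name: str, data: bytes, rounds: int = 20) -> float:
    """Hash the same buffer repeatedly and report megabytes per second."""
    start = time.perf_counter()
    for _ in range(rounds):
        hashlib.new(name).update(data)
    elapsed = time.perf_counter() - start
    return (len(data) * rounds / 1e6) / elapsed

data = bytes(64 * 1024 * 1024)   # 64 MB of zeros is enough for a rough figure
for name in ("md5", "sha1", "sha256"):
    print(f"{name}: {throughput_mb_per_s(name, data):.0f} MB/s")
```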

when to resize a hash table?

In various hash table implementations, I have seen "magic numbers" for when a mutable hash table should resize (grow). Usually this number is somewhere between 65% and 80% of the values added per allocated slots. I am assuming the trade-off is that a higher number gives the potential for more collisions and a lower number gives fewer, at the expense of using more memory.
My question is how is this number arrived at?
Is it arbitrary? based on testing? based on some other logic?
At a guess, most people at least start from the numbers in a book (e.g., Knuth, Volume 3), which were produced by testing. Depending on the situation, some may carry out testing afterwards, and make adjustments accordingly -- but from what I've seen, these are probably in the minority.
As I outlined in a previous answer, the "right" number also depends heavily on how you resolve collisions. For better or worse, this fact seems to be widely ignored -- people frequently don't pick numbers that are particularly appropriate for the collision resolution they use.
OTOH, the other point I found in my testing is that it only rarely makes a whole lot of difference. You can pick numbers across a fairly broad range and get pretty similar overall speed. The main thing is to be careful to avoid pushing the number too high, especially if you're using something like linear probing for collision resolution.
I think you don't want to consider "how full" the table is (how many "buckets" out of total buckets have values) but rather the number of collisions it might take to find a spot for a new item.
I read some compiler book years ago (can't remember the title or authors) that suggested just using linked lists until you have more than 10 to 12 items. That would seem to support the idea that more than 10 collisions means it's time to resize.
"The Design and Implementation of Dynamic Hashing for Sets and Tables in Icon" suggests that an average hash chain length of 5 (in that algorithm, the average number of collisions) is enough to trigger a rehash. Seems supported by testing, but I'm not sure I'm reading the paper correctly.
It looks like the resize condition is mainly the result of testing.
That depends on the keys. If you know that your hash function is perfect for all possible keys (for example, using gperf), then you know that you'll have only a few collisions, so the number is higher.
But most of the time, you don't know much about the keys except that they are text. In this case, you have to guess since you don't even have test data to figure out in advance how your hash function is behaving.
So you hope for the best. If your hash function is very bad for the keys, then you will have a lot of collisions and the point of growth will never be reached. In this case, the chosen figure is irrelevant.
If your hash function is adequate, then it should create only a few collisions (less than 50%), so a number between 65% and 80% seems reasonable.
That said: Unless your hash table must be perfect (= huge size or lots of accesses), don't bother. If you have, say, ten elements, considering these issues is a waste of time.
As far as I'm aware the number is a heuristic based on empirical testing.
With a reasonably good distribution of hash values it seems that the magic load factor is -- as you say -- usually around 70%. A smaller load factor means that you're wasting space for no real benefit; a higher load factor means that you'll use less space but spend more time dealing with hash collisions.
(Of course, if you know that your hash values are perfectly distributed then your load factor can be 100% and you'll still have no wasted space and no hash collisions.)
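As a concrete illustration of the load-factor rule, here is a toy open-addressing table in Python (a minimal sketch, not production code) that doubles its bucket array once the load factor passes an assumed 0.7 threshold:

```python
class ToyHashTable:
    """Toy open-addressing (linear probing) table that grows at ~70% load."""

    MAX_LOAD = 0.7          # the "magic number" discussed above (an assumption)

    def __init__(self, capacity: int = 8):
        self._slots = [None] * capacity
        self._count = 0

    def _probe(self, key):
        i = hash(key) % len(self._slots)
        while self._slots[i] is not None and self._slots[i][0] != key:
            i = (i + 1) % len(self._slots)   # linear probing
        return i

    def put(self, key, value):
        if (self._count + 1) / len(self._slots) > self.MAX_LOAD:
            self._resize(2 * len(self._slots))
        i = self._probe(key)
        if self._slots[i] is None:
            self._count += 1
        self._slots[i] = (key, value)

    def get(self, key):
        entry = self._slots[self._probe(key)]
        return entry[1] if entry is not None else None

    def _resize(self, new_capacity):
        old_entries = [e for e in self._slots if e is not None]
        self._slots = [None] * new_capacity
        self._count = 0
        for key, value in old_entries:
            self.put(key, value)
```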
Collisions depend highly on the data and the hash function used.
Most of the numbers are based on heuristics or on an assumption about the normal distribution of hash values. (AFAIK values around 70% are typical for extendible hash tables, but one can always construct a data stream that produces far more or far fewer collisions.)

Ideal hashing method for wide distribution of values?

As part of a rhythm game that I'm working on, I'm allowing users to create and upload custom songs and notecharts. I'm thinking of hashing the song and notecharts to uniquely identify them. Of course, I'd like as few collisions as possible; however, cryptographic strength isn't as important here as a wide, uniform range. In addition, since I'd be performing the hashes rarely, computational efficiency isn't too big of an issue.
Is this as easy as selecting a tried-and-true hashing algorithm with the largest digest size? Or are there some intricacies that I should be aware of? I'm looking at either SHA-256 or 512, currently.
Any cryptographic-strength algorithm should exhibit no collisions at all. Of course, collisions necessarily exist (there are more possible inputs than possible outputs), but it should be impossible, using existing computing technology, to actually find one.
When the hash function has an output of n bits, it is possible to find a collision with work of about 2^(n/2), so in practice a hash function with less than about 140 bits of output cannot be cryptographically strong. Moreover, some hash functions have weaknesses that allow attackers to find collisions faster than that; such functions are said to be "broken". A prime example is MD5.
If you are not in a security setting, and fear only random collisions (i.e. nobody will actively try to provoke a collision; they may happen only out of pure bad luck), then a broken cryptographic hash function will be fine. The usual recommendation is then MD4. Cryptographically speaking, it is as broken as it can be, but for non-cryptographic purposes it is devilishly fast, and it provides 128 bits of output, which avoids random collisions.
However, chances are that you will not have any performance issue with SHA-256 or SHA-512. On a most basic PC, they already process data faster than what a hard disk can provide: if you hash a file, the file reading will be the bottleneck, not the hashing. My advice would be to use SHA-256, possibly truncating its output to 128 bits (if used in a non-security situation), and consider switching to another function only if some performance-related trouble is duly noticed and measured.
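A minimal Python sketch of that advice (standard library only): hash with SHA-256 and keep only the first 128 bits of the digest as the identifier.

```python
import hashlib

def track_id(data: bytes) -> str:
    """SHA-256 truncated to 128 bits (32 hex characters) as a compact identifier."""
    return hashlib.sha256(data).hexdigest()[:32]

print(track_id(b"example notechart contents"))
```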
If you're using it to uniquely identify tracks, you do want a cryptographic hash: otherwise, users could deliberately create tracks that hash the same as existing tracks, and use that to overwrite them. Barring a compelling reason otherwise, SHA-1 should be perfectly satisfactory.
If cryptographic security is not a concern then you can look at this link & this. The fastest and simplest (to implement) would be Pearson hashing, if you are planning to compute a hash for the title/name and later do lookups. Or you can have a look at the superfast hash here. It is also very good for non-cryptographic use.
What's wrong with something like an md5sum? Or, if you want a faster algorithm, I'd just create a hash from the file length (mod 64K to fit in two bytes) and a 32-bit checksum. That'll give you a 6-byte hash which should be reasonably well distributed. It's not overly complex to implement.
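A minimal sketch of that 6-byte scheme in Python (standard library only); CRC32 is used here as the 32-bit checksum, which is my assumption rather than part of the suggestion above.

```python
import zlib

def six_byte_hash(data: bytes) -> bytes:
    """File length mod 64K (2 bytes) + a 32-bit checksum (4 bytes) = 6 bytes."""
    length_part = (len(data) % 65536).to_bytes(2, "big")
    checksum_part = zlib.crc32(data).to_bytes(4, "big")   # CRC32 chosen as the checksum
    return length_part + checksum_part

print(six_byte_hash(b"some uploaded song data").hex())
```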
Of course, as with all hashing solutions, you should monitor the collisions and change the algorithm if the cardinality gets too low. This would be true regardless of the algorithm chosen (since your users may start uploading degenerate data).
You may end up finding you're trying to solve a problem that doesn't exist (in other words, possible YAGNI).
Isn't cryptographic hashing overkill in this case, even though I understand that modern computers do this calculation pretty fast? I assume that your users will have a unique userid. When they upload, you just need to increment a number. So, you will represent them internally as userid1_song_1, userid1_song_2 etc. You can store this info in a database with that as the unique key, along with the user-specified name.
You also didn't mention the size of these songs. If it is midi, then file size will be small. If file sizes are big (say 3MB) then sha calculations will not be instantaneous. On my core2-duo laptop, sha256sum of a 3.8 MB file takes 0.25 sec; for sha1sum it is 0.2 seconds.
If you intend to use a cryptographic hash, then SHA-1 should be more than adequate and you don't need SHA-256. No collisions --- though they exist --- have been found yet. Git, Mercurial and other distributed version control systems use SHA-1. Git is a content-based system and uses SHA-1 to find out if content has been modified.

calculating which strings will have the same hash

With SHA-1, is it possible to figure out which finite strings will render equal hashes?
What you are looking for is the solution to the Collision Problem (See also collision attack). A well-designed and powerful cryptographic hash function is designed with the intent of as much obfuscating mathematics as possible to make this problem as hard as possible.
In fact, one of the measures of a good hash function is the difficulty of finding collisions. (Among the other measures, the difficulty of reversing the hash function)
It should be noted that, in hashes where the input is any length of string and the output is a fixed-length string, the pigeonhole principle ensures that there is at least one collision for any given string. However, finding such a string is not easy, as it would require essentially blind guess-and-check over a practically infinite collection of strings.
It might be useful to read up on ideal hash functions. Hash functions are designed to be functions where
Small changes in the input cause radical, chaotic changes in the output
Collisions are reduced to a minimum
It is difficult or, ideally, impossible to reverse
There are no hashed values that are impossible to obtain with any inputs (this one matters significantly less for cryptographic purposes)
The theoretical "perfect" hash algorithm would be a "random oracle" -- that is, for every input, it outputs a perfectly random output, on the condition that for the same input, the output will be identical (this condition is fulfilled with magic, by the hand of Zeus and pixie fairies, or in a way that no human could ever possibly understand or figure out)
Unfortunately, this is pretty much impossible, and ultimately, all hashes are judged as "strong" based on how many of these qualities they possess, and to what degree.
A hash like SHA1 or MD5 is going to be pretty strong, and more or less computationally impossible to find collisions for (within a reasonable time frame). Ultimately, you don't need a hash that is impossible to find collisions for. You only practically need one where the difficulty is large enough that it'd be too expensive to compute (i.e., on the order of a billion or a trillion years to find a collision).
Because all hashes are imperfect, one could analyze the internal workings of one, see mathematical patterns and heuristics, and try to find collisions along that pattern. This is similar to a hash function that is simply % 7: hashing the number 13 gives 13 % 7 = 6, and 89 % 7 = 5. If you saw a hash of 3, you could use your mathematical understanding of the modulus function to easily find a collision (i.e., 10).¹ Fortunately for us, stronger hash functions have a much, much, much harder to understand mathematical basis. (Ideally, so hard that no human would ever understand it!)
Some figures:
Finding a collision for a single given SHA-0 hash takes about 13 full days of running computations on the top supercomputers in the world, using the patterns inherent in the math.
According to a helpful commenter, MD5 collisions can be generated "quickly" enough to be less than ideal for sensitive purposes.
No feasible or practical/usable collision finding method for SHA-1 has been found or proven so far, although, as pointed out in the comments, there are some weaknesses that have been discovered.
Here is a similar SO question, which has answers much wiser than mine.
¹ Note that, while this hash function is weak against collisions, it is strong in that it is impossible to go backwards and find the original key: if your hash is, say, 4, there is an infinite number of candidates (i.e., 4, 11, 18, 25, ...).
The answer is pretty clearly yes, since at the very least you could run through every possible string of the given length, compute the hashes of all of them, and then see which are the same. The more interesting question is how to do it quickly.
Further reading: http://en.wikipedia.org/wiki/Collision_attack
It depends on the hash function. With a simple hash function, it may be possible. For example, if the hash function simply sums the ASCII byte values of a string, then one could enumerate all strings of a given length that produce a given hash value. If the hash function is more complex and "cryptographically strong" (e.g., MD5 or SHA1), then it is computationally infeasible in practice.
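A minimal Python sketch of that enumeration idea (standard library only), using the simple byte-sum hash described above: it brute-forces all lowercase strings of a given length and keeps the ones sharing a target hash value.

```python
from itertools import product
from string import ascii_lowercase

def byte_sum_hash(s: str) -> int:
    """Toy hash: sum of the ASCII byte values of the string."""
    return sum(s.encode("ascii"))

def collisions_for(target_hash: int, length: int):
    """Enumerate all lowercase strings of the given length with that hash value."""
    return ["".join(chars)
            for chars in product(ascii_lowercase, repeat=length)
            if byte_sum_hash("".join(chars)) == target_hash]

# Every string listed here hashes to the same value as "ab" under the toy hash.
print(collisions_for(byte_sum_hash("ab"), 2))   # -> ['ab', 'ba']
```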
Most hashes are of cryptographic or near-cryptographic strength, so the hash depends on the input in non-obvious ways. The way this is done professionally is with rainbow tables, which are precomputed tables of inputs and their hashes. So brute force checking is basically the only way.
