Such a thing as a constant quality (variable bit) digest hashing algorithm?

Problem space: We have a ton of data to digest that can range 6 orders of magnitude in size. Looking for a way to be more efficient, and thus use less disk space to store all of these digests.
So I was thinking about lossy audio encoding, such as MP3. There are two basic approaches: constant bitrate and constant quality (aka variable bitrate). Since my primary interest is quality, I usually go for VBR. Thus, to achieve the same level of quality, a pure sine tone would require a significantly lower bitrate than something like a complex classical piece.
Using the same idea, two very small data chunks should require significantly less total digest bits than two very large data chunks to ensure roughly the same statistical improbability (what I am calling quality in this context) of their digests colliding. This is an assumption that seems intuitively correct to me, but then again, I am not a crypto mathematician. Also note that this is all about identification, not security. It's okay if a small data chunk has a small digest, and thus computationally feasible to reproduce.
I tried searching around the inter-tubes for anything like this. The closest thing I found was a posting somewhere that talked about using a fixed-size digest hash, like SHA256, as an initialization vector for AES/CTR acting as a pseudo-random generator, then taking the first x bits off that.
That seems like a totally do-able thing. The only problem with this approach is that I have no idea how to calculate the appropriate value of x as a function of the data chunk size. I think my target quality would be statistical improbability of SHA256 collision between two 1GB data chunks. Does anyone have thoughts on this calculation?
Are there any existing digest hashing algorithms that already do this? Or are there any other approaches that will yield this same result?
Update: Looks like there is the SHA3 Keccak "sponge" that can output an arbitrary number of bits. But I still need to know how many bits I need as a function of input size for a constant quality. It sounded like this algorithm produces an infinite stream of bits, and you just truncate at however many you want. However testing in Ruby, I would have expected the first half of a SHA3-512 to be exactly equal to a SHA3-256, but it was not...
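The mismatch seen in the update is expected: the fixed-length SHA-3 variants use different internal parameters, so SHA3-256 is not a prefix of SHA3-512. The extendable-output variants (SHAKE128/SHAKE256) are the ones that behave like a stream you can truncate. A minimal sketch, assuming Python's hashlib rather than the Ruby library used in the question; the input bytes are just a placeholder:

```python
import hashlib

data = b"example data chunk"  # placeholder input

# SHA3-256 and SHA3-512 are separate functions with different internal
# parameters, so truncating one does not yield the other.
sha3_256 = hashlib.sha3_256(data).digest()
sha3_512 = hashlib.sha3_512(data).digest()
print("SHA3-512 prefix equals SHA3-256:", sha3_512[:32] == sha3_256)

# SHAKE256 is an extendable-output function: asking for fewer bytes
# gives a prefix of what you get when asking for more.
shorter = hashlib.shake_256(data).digest(16)
longer = hashlib.shake_256(data).digest(64)
print("SHAKE256 short output is a prefix:", longer[:16] == shorter)
```

With SHAKE the remaining question is only how many bytes to request, which is the calculation the question asks about.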

Your logic from the comment is fairly sound. Quality hash functions will not generate a duplicate/previously generated output until the input length is nearly (or has exceeded) the hash digest length.
But the key factor in collision risk is the size of the input set relative to the size of the hash digest. When using a quality hash function, the chance of a collision for two 1 TB files is not significantly different from the chance of a collision for two 1 KB files, or even one 1 TB and one 1 KB file. This is because hash functions strive for uniformity; good functions achieve it to a high degree.
Due to the birthday problem, the collision resistance of a hash function is lower than the bit width of its output suggests: for an n-bit digest, collisions become likely after roughly 2^(n/2) inputs rather than 2^n. The wiki article for the pigeonhole principle, which is the basis for the birthday problem, says:
The [pigeonhole] principle can be used to prove that any lossless compression algorithm, provided it makes some inputs smaller (as the name compression suggests), will also make some other inputs larger. Otherwise, the set of all input sequences up to a given length L could be mapped to the (much) smaller set of all sequences of length less than L, and do so without collisions (because the compression is lossless), which possibility the pigeonhole principle excludes.
So going to a 'VBR' hash digest is not guaranteed to save you space. The birthday problem provides the math for calculating the chance that two random things will share the same property (a hash code is a property, in a broad sense), but this article gives a better summary, including the following table.
Source: preshing.com
The top row of the table says that in order to have a 50% chance of a collision with a 32-bit hash function, you only need to hash 77k items. For a 64-bit hash function, that number rises to roughly 5.06 billion for the same 50% collision risk. For a 160-bit hash function, you need 1.42 × 10^24 inputs before there is a 50% chance that a new input will have the same hash as a previous input.
Note that 1.42 × 10^24 160-bit numbers would themselves take up an unreasonably large amount of space: about 2.8 × 10^25 bytes, or tens of trillions of terabytes, if I'm doing the math right. And that's without accounting for the 10^24 item values they represent.
The bottom end of that table should convince you that a 160-bit hash function has a sufficiently low risk of collisions. In particular, you would have to have 10^21 hash inputs before there is even a 1 in a million chance of a hash collision. That's why your searching turned up so little: it's not worth dealing with the complexity.
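The figures quoted from the table can be approximated with the usual birthday-bound formula, n ≈ sqrt(2 · 2^bits · ln(1 / (1 − p))). A rough sketch; the function name is made up for illustration and the formula is an approximation that assumes a uniformly distributed hash:

```python
import math

def inputs_for_collision_probability(bits: int, p: float) -> float:
    """Approximate number of random inputs at which the chance of at least
    one collision reaches p, for an ideal hash with `bits` output bits."""
    # -log1p(-p) is ln(1 / (1 - p)), computed accurately for small p.
    return math.sqrt(2.0 * (2.0 ** bits) * -math.log1p(-p))

for bits in (32, 64, 160):
    n = inputs_for_collision_probability(bits, 0.5)
    print(f"{bits}-bit hash, 50% collision chance at about {n:.3g} inputs")

one_in_a_million = inputs_for_collision_probability(160, 1e-6)
print(f"160-bit hash, 1-in-a-million chance at about {one_in_a_million:.3g} inputs")
```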
No matter what hash strategy you decide upon however, there is a non-zero risk of collision. Any type of ID system that relies on a hash needs to have a fallback comparison. An easy additional check for files is to compare their sizes (works well for any variable length data where the length is known, such as strings). Wikipedia covers several different collision mitigation and detection strategies for hash tables, most of which can be extended to a filesystem with a little imagination. If you require perfect fidelity, then after you've run out of fast checks, you need to fallback to the most basic comparator: the expensive bit-for-bit check of the two inputs.

If I understand the question correctly, you have a number of data items of different lengths, and for each item you are computing a hash (i.e. a digest) so the items can be identified.
Suppose you have already hashed N items (without collisions), and you are using a 64bit hash code.
The next item you hash will take one of 2^64 values, so you will have an N / 2^64 probability of a hash collision when you add the next item.
Note that this probability does NOT depend on the original size of the data item. It does depend on the total number of items you have to hash, so you should choose the number of bits according to the probability you are willing to tolerate of a hash collision.
However, if you have partitioned your data set in some way such that there are different numbers of items in each partition, then you may be able to save a small amount of space by using variable sized hashes.
For example, suppose you use 1TB disk drives to store items, and all items >1GB are on one drive, while items <1KB are on another, and a third is used for intermediate sizes. There will be at most 1000 items on the first drive so you could use a smaller hash, while there could be a billion items on the drive with small files so a larger hash would be appropriate for the same collision probability.
In this case the hash size does depend on file size, but only in an indirect way based on the size of the partitions.
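To put rough numbers on the partition example, the birthday approximation p ≈ n^2 / (2 · 2^bits) can be solved for the digest width. This is only a sketch: the function name and the 1-in-a-billion target are assumptions chosen for illustration, not values from the question.

```python
import math

def digest_bits_needed(n_items: int, p_collision: float) -> int:
    """Smallest digest width (in bits) that keeps the chance of any collision
    among n_items below p_collision, using p ~ n^2 / (2 * 2^bits)."""
    return math.ceil(math.log2(n_items ** 2 / (2.0 * p_collision)))

TARGET = 1e-9  # assumed acceptable collision probability per partition

print("drive with ~1,000 large items:", digest_bits_needed(1_000, TARGET), "bits")
print("drive with ~1e9 small items: ", digest_bits_needed(1_000_000_000, TARGET), "bits")
```

The width grows with the logarithm of the item count, so the saving per digest is a few bytes at most.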

Related

Do hash functions/checksums without the diffusion property exist?

From Wikipedia:
Diffusion means that if we change a single bit of the plaintext, then (statistically) half of the bits in the ciphertext should change, and similarly, if we change one bit of the ciphertext, then approximately one half of the plaintext bits should change.[2] Since a bit can have only two states, when they are all re-evaluated and changed from one seemingly random position to another, half of the bits will have changed state.
For example, change one bit of a file and the MD5 checksum of the file becomes completely different.
Is there any hash function/checksum that does not have the diffusion property? Ideally, if 20% of the plaintext changes then 20% of the ciphertext should change and if 80% of the plaintext changes then 80% of the ciphertext should change. This way, % change in the plaintext can be tracked via the ciphertext.
I think the closest thing to what you appear to be looking for is "locality sensitive hashing", which attempts to map similar inputs to similar outputs: https://en.wikipedia.org/wiki/Locality-sensitive_hashing
The amount of change in the hash is not really proportional to the amount of change in the input, but maybe it'll do for what you want.
Depending on what the purpose of your hash-function is, you might not need the diffusion property.
E.g for large data blobs you might want to avoid reading all bytes to generate a hash-value (because of performance). So using only the first 1000 Bytes to calculate a hash-value might be good enough for usage in a HashSet, if the 1000 bytes differ enough through your data sets.
Related to your % question:
You can create such a function, but that is quite an unusual extra feature of the hash-function.
E.g. you can split your data into 100 equal-sized chunks, calculate the MD5 of each of them, then take the first n bytes of each MD5 value and append them to create your hash value. This way you can determine the % value from how many n-byte blocks changed in your hash value (you can even see at which position your data changed).
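A minimal sketch of that chunked scheme, assuming in-place modifications only (an insertion or deletion shifts every later chunk and makes them all look changed). The function names and the default of n = 2 bytes per chunk are made up for illustration:

```python
import hashlib

def chunked_digest(data: bytes, chunks: int = 100, n: int = 2) -> bytes:
    """Concatenate the first n bytes of the MD5 of each of `chunks` slices."""
    size = max(1, -(-len(data) // chunks))  # ceiling division
    parts = []
    for i in range(chunks):
        piece = data[i * size:(i + 1) * size]
        parts.append(hashlib.md5(piece).digest()[:n])
    return b"".join(parts)

def changed_fraction(a: bytes, b: bytes, n: int = 2) -> float:
    """Fraction of n-byte blocks that differ between two chunked digests."""
    blocks = len(a) // n
    differing = sum(a[i * n:(i + 1) * n] != b[i * n:(i + 1) * n]
                    for i in range(blocks))
    return differing / blocks

# Usage: touch one byte in each of the first 20 chunks of a 10,000-byte blob.
original = bytes(10_000)
modified = bytearray(original)
for i in range(20):
    modified[i * 100] = i + 1
print(changed_fraction(chunked_digest(original), chunked_digest(bytes(modified))))
```

The result is the fraction of chunks touched, not the fraction of bytes changed, so the resolution is limited by the number of chunks.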
Yes, these functions do exist and are called very bad hash functions.
People come up with such nonsense from time to time.
The diffusion property is an essential property of a good hash function.
E.g. you could return the address of the string which does look like a real hash function looking from outside, but when you change the string in place without reallocation, you see no diffusion, no change.
Or language implementors cut off long strings for faster hash processing. These hashes will collide when you change characters at the end of the string. If you are happy with bad hashes and many collisions, it's still a valid hash function. This is done in practice. Your mileage may vary.

Fastest algorithm to detect duplicate files

In the process of finding duplicates in my 2 terabytes of HDD stored images I was astonished about the long run times of the tools fslint and fslint-gui.
So I analyzed the internals of the core tool findup, which is implemented as a very well written and documented shell script using an ultra-long pipe. Essentially it's based on find and hashing (md5 and SHA1).
The author states that it was faster than any other alternative, which I couldn't believe. So I found Detecting duplicate files, where the topic quite quickly slid towards hashing and comparing hashes, which is not the best and fastest way in my opinion.
So the usual algorithm seems to work like this:
generate a sorted list of all files (path, Size, id)
group files with the exact same size
calculate the hash of all the files with a same size and compare the hashes
same hash means identical files - a duplicate is found
Sometimes the speed gets increased by first using a faster hash algorithm (like md5) with a higher collision probability and then, if the hashes match, using a second, slower but less collision-prone algorithm to confirm the duplicates. Another improvement is to first hash only a small chunk to sort out totally different files.
So I've got the opinion that this scheme is broken in several different dimensions:
duplicate candidates get read from the slow HDD again (first chunk) and again (full md5) and again (sha1)
by using a hash instead of just comparing the files byte by byte we introduce a (low) probability of a false positive
a hash calculation is a lot slower than just byte-by-byte compare
I found one (Windows) app which states to be fast by not using this common hashing scheme.
Am I totally wrong with my ideas and opinion?
[Update]
There seems to be some opinion that hashing might be faster than comparing. But that seems to be a misconception arising from the general idea that "hash tables speed things up". To generate a hash of a file the first time, the file needs to be read fully, byte by byte. So there is the byte-by-byte compare on the one hand, which only reads each duplicate candidate up to the first differing position. And there is the hash function, which generates an ID out of so and so many bytes: let's say the first 10k bytes of a terabyte, or the full terabyte if the first 10k are the same. So, assuming that I don't usually have a ready-calculated and automatically updated table of all file hashes, I need to calculate the hash and read every byte of the duplicate candidates. A byte-by-byte compare doesn't need to do this.
[Update 2]
I've got a first answer which again goes in the direction of "hashes are generally a good idea" and, out of that (not so wrong) thinking, tries to rationalize the use of hashes with (IMHO) wrong arguments. "Hashes are better or faster because you can reuse them later" was not the question.
"Assuming that many (say n) files have the same size, to find which are duplicates, you would need to make n * (n-1) / 2 comparisons to test them pair-wise all against each other. Using strong hashes, you would only need to hash each of them once, giving you n hashes in total." is skewed in favor of hashes and wrong (IMHO) too. Why can't I just read a block from each same-size file and compare it in memory? If I have to compare 100 files, I open 100 file handles, read a block from each in parallel and then do the comparison in memory. This seems to be a lot faster than updating one or more complicated, slow hash algorithms with these 100 files.
[Update 3]
Given the very big bias in favor of "one should always use hash functions because they are very good!" I read through some SO questions on hash quality e.g. this:
Which hashing algorithm is best for uniqueness and speed? It seems that common hash functions produce collisions more often than we think, thanks to bad design and the birthday paradox. The test set contained: "A list of 216,553 English words (in lowercase), the numbers "1" to "216553" (think ZIP codes, and how a poor hash took down msn.com) and 216,553 "random" (i.e. type 4 uuid) GUIDs". These tiny data sets produced from around 100 to nearly 20k collisions. So testing millions of files for (in)equality based only on hashes might not be a good idea at all.
I guess I need to modify 1 and replace the md5/sha1 part of the pipe with "cmp" and just measure times. I'll keep you updated.
[Update 4]
Thanks for all the feedback. Slowly we are converging. The background is what I observed when fslint's findup was running on my machine, md5summing hundreds of images. That took quite a while and the HDD was spinning like hell. So I was wondering "what the heck is this crazy tool thinking, destroying my HDD and taking huge amounts of time", when just comparing byte-by-byte 1) is less expensive per byte than any hash or checksum algorithm and 2) lets me return early on the first difference, so I save tons of time by not wasting HDD bandwidth reading full files and calculating hashes over full files. I still think that's true, but: I guess I didn't catch the point that although a 1:1 comparison (if (file_a[i] != file_b[i]) return 1;) might be cheaper per byte than hashing, complexity-wise hashing with O(n) may win when more and more files need to be compared against each other. I have put this problem on my list and plan to either replace the md5 part of fslint's findup with cmp, or enhance Python's filecmp.py compare lib, which only compares 2 files at once, with a multiple-files option and maybe an md5 hash version.
So thank you all for the moment.
And generally the situation is like you guys say: the best way (TM) totally depends on the circumstances: HDD vs SSD, likelihood of same-length files, duplicate files, typical file size, performance of CPU vs. memory vs. disk, single vs. multicore and so on. And I learned that I should consider using hashes more often - but I'm an embedded developer with, most of the time, very very limited resources ;-)
Thanks for all your effort!
Marcel
The fastest de-duplication algorithm will depend on several factors:
how frequent is it to find near-duplicates? If it is extremely frequent to find hundreds of files with the exact same contents and a one-byte difference, this will make strong hashing much more attractive. If it is extremely rare to find more than a pair of files that are of the same size but have different contents, hashing may be unnecessary.
how fast is it to read from disk, and how large are the files? If reading from the disk is very slow or the files are very small, then one-pass hashes, however cryptographically strong, will be faster than making small passes with a weak hash and then a stronger pass only if the weak hash matches.
how many times are you going to run the tool? If you are going to run it many times (for example to keep things de-duplicated on an on-going basis), then building an index with the path, size & strong_hash of each and every file may be worth it, because you would not need to rebuild it on subsequent runs of the tool.
do you want to detect duplicate folders? If you want to do so, you can build a Merkle tree (essentially a recursive hash of the folder's contents + its metadata); and add those hashes to the index too.
what do you do with file permissions, modification date, ACLs and other file metadata that excludes the actual contents? This is not related directly to algorithm speed, but it adds extra complications when choosing how to deal with duplicates.
Therefore, there is no single way to answer the original question. Fastest when?
Assuming that two files have the same size, there is, in general, no faster way to detect whether they are duplicates or not than comparing them byte-by-byte (even though technically you would compare them block-by-block, as the file-system is more efficient when reading blocks than individual bytes).
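A block-by-block comparison with early exit might look like the following sketch. The function name and the 1 MiB block size are assumptions; Python's standard library offers filecmp.cmp(a, b, shallow=False) for the same purpose.

```python
def files_equal(path_a: str, path_b: str, block_size: int = 1 << 20) -> bool:
    """Compare two files block by block, stopping at the first difference."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(block_size)
            block_b = fb.read(block_size)
            if block_a != block_b:
                return False  # early exit: the rest is never read
            if not block_a:
                return True   # both files ended at the same point
```

Larger blocks reduce the number of seeks when both files live on the same spinning disk, at the cost of more memory per open file.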
Assuming that many (say n) files have the same size, to find which are duplicates, you would need to make n * (n-1) / 2 comparisons to test them pair-wise all against each other. Using strong hashes, you would only need to hash each of them once, giving you n hashes in total. Even if it takes k times as long to hash as to compare byte-by-byte, hashing is better when k < (n-1)/2, since n hashes cost n * k comparison-equivalents. Hashes may yield false positives (although strong hashes will only do so with astronomically low probabilities), but testing those byte-by-byte will only increment k by at most 1. With k=3, you break even at n=7 and are ahead for any larger n; with k=2, you reach break-even at n=5. In practice, I would expect k to be very near to 1: it will probably be more expensive to read from disk than to hash whatever you have read.
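The cost model in this paragraph is easy to check numerically: n * (n-1) / 2 full comparisons against n hashes at k comparison-units each. A small sketch, with function names invented for illustration, that ignores early exits and disk effects:

```python
def pairwise_cost(n: int) -> int:
    """Number of full pairwise comparisons among n same-size files."""
    return n * (n - 1) // 2

def hashing_cost(n: int, k: float) -> float:
    """Cost of hashing n files when one hash costs k comparisons."""
    return n * k

def break_even_n(k: float) -> int:
    """Smallest n at which hashing costs no more than pairwise comparison."""
    n = 2
    while hashing_cost(n, k) > pairwise_cost(n):
        n += 1
    return n

for k in (1, 2, 3):
    print(f"k={k}: hashing costs no more than comparing from n={break_even_n(k)}")
```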
The probability that several files will have the same sizes increases with the square of the number of files (look up birthday paradox). Therefore, hashing can be expected to be a very good idea in the general case. It is also a dramatic speedup in case you ever run the tool again, because it can reuse an existing index instead of building it anew. So comparing 1 new file to 1M existing, different, indexed files of the same size can be expected to take 1 hash + 1 lookup in the index, vs. 1M comparisons in the no-hashing, no-index scenario: an estimated 1M times faster!
Note that you can repeat the same argument with a multilevel hash: if you use a very fast hash with, say, the 1st, central and last 1k bytes, it will be much faster to hash than to compare the files (k < 1 above) - but you will expect collisions, and make a second pass with a strong hash and/or a byte-by-byte comparison when found. This is a trade-off: you are betting that there will be differences that will save you the time of a full hash or full compare. I think it is worth it in general, but the "best" answer depends on the specifics of the machine and the workload.
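The multilevel idea can be sketched as three passes: group by size, then by a cheap hash over a few samples, then by a full strong hash. Everything here is an illustrative assumption (the names, the 1 KiB sample size, BLAKE2b for the quick pass, SHA-256 for the full pass); a cautious tool would finish with a byte-by-byte check of each reported group.

```python
import hashlib
import os
from collections import defaultdict

SAMPLE = 1024  # bytes sampled at the start, middle and end (assumed value)

def quick_hash(path: str, size: int) -> bytes:
    """Cheap hash over three small samples of the file."""
    h = hashlib.blake2b(digest_size=16)
    with open(path, "rb") as f:
        for offset in (0, max(0, size // 2 - SAMPLE // 2), max(0, size - SAMPLE)):
            f.seek(offset)
            h.update(f.read(SAMPLE))
    return h.digest()

def full_hash(path: str, block_size: int = 1 << 20) -> bytes:
    """Strong hash over the whole file, read in blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.digest()

def group_by(paths, key):
    """Group paths by key(path) and keep only groups with 2+ members."""
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    return [g for g in groups.values() if len(g) > 1]

def find_duplicates(paths):
    """Return lists of paths whose size, quick hash and full hash all match."""
    result = []
    for same_size in group_by(paths, os.path.getsize):
        size = os.path.getsize(same_size[0])
        for same_quick in group_by(same_size, lambda p: quick_hash(p, size)):
            result.extend(group_by(same_quick, full_hash))
    return result
```

Only files that survive the cheaper passes are ever read in full, which is where the bet described above pays off or not.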
[Update]
The OP seems to be under the impression that
Hashes are slow to calculate
Fast hashes produce collisions
Use of hashing always requires reading the full file contents, and therefore is overkill for files that differ in their 1st bytes.
I have added this segment to counter these arguments:
A strong hash (sha1) takes about 5 cycles per byte to compute, or around 1.5 ns per byte on a modern CPU. Disk latencies for an SSD or a spinning HDD are on the order of 75k ns and 5M ns, respectively. You can hash 1k of data in the time that it takes you to start reading it from an SSD. A faster, non-cryptographic hash, meowhash, can hash at 1 byte per cycle. Main memory latencies are at around 120 ns: there are easily 400 cycles to be had in the time it takes to fulfill a single access-noncached-memory request.
In 2018, the only known collision in SHA-1 comes from the shattered project, which took huge resources to compute. Other strong hashing algorithms are not much slower, and stronger (SHA-3).
You can always hash parts of a file instead of all of it; and store partial hashes until you run into collisions, which is when you would calculate increasingly larger hashes until, in the case of a true duplicate, you would have hashed the whole thing. This gives you much faster index-building.
My points are not that hashing is the end-all, be-all. It is that, for this application, it is very useful, and not a real bottleneck: the true bottleneck is in actually traversing and reading parts of the file-system, which is much, much slower than any hashing or comparing going on with its contents.
The most important thing you're missing is that comparing two or more large files byte-for-byte while reading them from a real spinning disk can cause a lot of seeking, making it vastly slower than hashing each individually and comparing the hashes.
This is, of course, only true if the files actually are equal or close to it, because otherwise a comparison could terminate early. What you call the "usual algorithm" assumes that files of equal size are likely to match. That is often true for large files generally.
But...
When all the files of the same size are small enough to fit in memory, then it can indeed be a lot faster to read them all and compare them without a cryptographic hash. (an efficient comparison will involve a much simpler hash, though).
Similarly when the number of files of a particular length is small enough, and you have enough memory to compare them in chunks that are big enough, then again it can be faster to compare them directly, because the seek penalty will be small compared to the cost of hashing.
When your disk does not actually contain a lot of duplicates (because you regularly clean them up, say), but it does have a lot of files of the same size (which is a lot more likely for certain media types), then again it can indeed be a lot faster to read them in big chunks and compare the chunks without hashing, because the comparisons will mostly terminate early.
Also when you are using an SSD instead of spinning platters, then again it is generally faster to read + compare all the files of the same size together (as long as you read appropriately-sized blocks), because there is no penalty for seeking.
So there are actually a fair number of situations in which you are correct that the "usual" algorithm is not as fast as it could be. A modern de-duping tool should probably detect these situations and switch strategies.
Byte-by-byte comparison may be faster if all file groups of the same size fit in physical memory OR if you have a very fast SSD. It also may still be slower depending on the number and nature of the files, hashing functions used, cache locality and implementation details.
The hashing approach is a single, very simple algorithm that works on all cases (modulo the extremely rare collision case). It scales down gracefully to systems with small amounts of available physical memory. It may be slightly less than optimal in some specific cases, but should always be in the ballpark of optimal.
A few specifics to consider:
1) Did you measure and discover that the comparison within file groups was the expensive part of the operation? For a 2TB HDD walking the entire file system can take a long time on its own. How many hashing operations were actually performed? How big were the file groups, etc?
2) As noted elsewhere, fast hashing doesn't necessarily have to look at the whole file. Hashing some small portions of the file is going to work very well in the case where you have sets of larger files of the same size that aren't expected to be duplicates. It will actually slow things down in the case of a high percentage of duplicates, so it's a heuristic that should be toggled based on knowledge of the files.
3) Using a 128 bit hash is probably sufficient for determining identity. You could hash a million random objects a second for the rest of your life and have better odds of winning the lottery than seeing a collision. It's not perfect, but pragmatically you're far more likely to lose data in your lifetime to a disk failure than a hash collision in the tool.
4) For a HDD in particular (a magnetic disk), sequential access is much faster than random access. This means a sequential operation like hashing n files is going to be much faster than comparing those files block by block (which happens when they don't fit entirely into physical memory).

4 bytes hash algorithm to compare small text (normally less than 2 kb)

I am developing a piece of software that needs to check for duplicate small texts (normally less than 2 kb) using a pre-calculated signature (4 bytes). Currently, I've implemented CRC32 (4 bytes) for this purpose, but I suspect that CRC32 would generate a lot of duplicate values. I know it is impossible to make it really unique, but at least I want to minimize this probability.
-- UPDATE 1 --
NOTE: I can not increase the size of hash bytes. It costs me a lot of storage. I am talking about entries size more than 1,000,000. for example 1,000,000 * 4 byte = 4,000,000 bytes. I cannot use MD5 because it takes 16 bytes!
-- UPDATE 2 --
I did not want to open the whole problem but now I have to do it.
My project is a dictionary engine that can search a lot of independent databases to find the users' asked phrase. All results must be prepared instantly (auto-complete feature). All text data is compressed, so I cannot decompress them to check the duplicated results. I have to store hash values from compressed text in my index. So hash bytes increase index size and disk I/O to read, decompress and decode index blocks (index blocks is also compressed). The hash values are generally un-compressible. The design of this software forced me to compress everything to meet the user's needs (using in embedded systems). Now, I want to remove duplicate text from search result using hash values to avoid (un)compressed text comparison (which is unreasonable in my case because of disk I/O).
It seems that we can design a custom checksum that meets the conditions. For example, I store text length in 2 bytes and generate 2-bytes checksum to check duplicate possibility ?!
I appreciate any suggestion in advance.
-- UPDATE 3 --
After a lot of investigations and using the information that are provided by answers, thanks to all of you, I found that CRC32 is good enough in my case. I ran some statistical benchmarks on my generated CRCs, after checking the duplicate values, the result was satisfying.
thanks to all of you.
I will up-vote all answers.
Without further knowledge about small text, the best you can hope for is each hash value equally probable, and most of 2³² 4-octet-values used. Even then, you are more likely than not to have a collision with just about 77000 texts, let alone a million. With a few exceptions (Adler32 coming to mind), well-known hash functions differ very little in collision probability. (They differ in difficulty to produce collisions/given values on purpose, and in computation/circuit cost.)
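To make the scale concrete, the expected number of colliding pairs among n items with an ideal b-bit hash is about n * (n-1) / 2 / 2^b. A quick sketch; the function names are made up and the formulas are the standard birthday approximation:

```python
import math

def expected_colliding_pairs(n_items: int, bits: int = 32) -> float:
    """Expected number of item pairs sharing a hash value (ideal hash)."""
    return n_items * (n_items - 1) / 2 / 2 ** bits

def probability_of_any_collision(n_items: int, bits: int = 32) -> float:
    """Approximate chance that at least one pair collides."""
    return 1.0 - math.exp(-expected_colliding_pairs(n_items, bits))

print(f"77,000 texts:    P(collision) ~ {probability_of_any_collision(77_000):.2f}")
print(f"1,000,000 texts: ~{expected_colliding_pairs(1_000_000):.0f} colliding pairs expected")
```

At a million entries, collisions in a 4-byte signature are a certainty, so the question becomes how many false matches the application can tolerate.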
→ Choose a compromise between collision probability and storage requirements.
For easily computed checksums, have a look at Fletcher's. Adler32 is very similar, but has an increased collision probability with short inputs.
If you get a hash collision, you have to check whether the texts are equal. The best way would be to count how often collisions happen, gather some statistics, and optimize only if it looks bad. One idea: you could build 2 different hash values, crc32 and md5 (or Luhn or whatever you like), and check for equality only if both hashes have the same values.
I did something very similar in one of my projects, where I used a Bloom filter; it is worth reading up on the whole thing and how to implement it. A Bloom filter massively reduces the chance of hash collisions thanks to its use of several hashing algorithms (it is possible to simulate multiple hash functions using just one hashing function, but that is not what we are here for). Try it out! It worked for me and may work for you as well.
An actual working implementation of a bloom filter
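For readers without access to the linked implementation, here is a minimal Bloom filter sketch. It is not the implementation the answer refers to; the class name, sizes and the double-hashing trick over one SHA-256 digest are assumptions for illustration. Note that a Bloom filter answers "possibly seen" or "definitely not seen", so a positive still needs a real comparison.

```python
import hashlib

class BloomFilter:
    """Set-membership filter with false positives but no false negatives."""

    def __init__(self, size_bits: int = 1 << 20, num_hashes: int = 7):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive k bit positions from one digest via double hashing.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.size for i in range(self.k)]

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
seen.add(b"first phrase")
print(b"first phrase" in seen)   # an added item is always reported as present
```

The filter's memory is fixed up front, which is different from storing one signature per entry, so whether it fits the index design in the question would need measuring.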

Why are hashing algorithms safe to use?

Hashing algorithms today are widely used to check the integrity of data, but why are they safe to use? A 256-bit hashing algorithm generates a 256-bit representation of given data. However, a 256-bit hash only has 2^256 variations, while 1 KB of data has 2^8192 different variations. It's mathematically impossible for every piece of data in the world to have a different hash value. So why are hashing algorithms safe?
The reasons why hashing algorithms are considered safe are due to the following:
They are irreversible. You can't get to the input data by reverse-engineering the output hash value.
A small change in the input will produce a vastly different hash value. i.e. "hello" vs "hellp" will generate completely different values.
The assumption being made with data integrity is that a majority of your input is going to be the same between a good copy of input data and a bad (malicious) copy of input data. The small change in data will make the hash value completely different. Therefore, if I try to inject any malicious code or data, that small change will completely throw-off the value of the hash. When comparison is done with a known hash value, it'll be easily determinable if data has been modified or corrupted.
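The "hello" vs "hellp" example can be checked directly by counting how many output bits differ. A small sketch using SHA-256 from Python's hashlib; the helper name is made up:

```python
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    """Number of bit positions in which two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

h1 = hashlib.sha256(b"hello").digest()
h2 = hashlib.sha256(b"hellp").digest()
print(h1.hex())
print(h2.hex())
print(bit_difference(h1, h2), "of", len(h1) * 8, "output bits differ")
```

For a well-behaved hash the count lands near half of the output width, which is what makes even a small tampering attempt visible.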
You are correct in that there is a risk of collisions between an infinite number of datasets, but when you compare two datasets that are very similar, it is reasonable to assume that the hash values of those two almost-equivalent datasets will be completely different.
Not all hashes are safe. There are good hashes (for some value of "good") where it's sufficiently non-trivial to intentionally create collisions (I think FNV-1a may fall in this category). However, a cryptographic hash, used correctly, would be computationally expensive to generate a collision for.
"Good" hashes generally have the property that small changes in the input cause large changes in the output (the rule of thumb is that a single-bit flip in the input causes roughly b bit flips in the output, for a hash of 2b bits). There are some special-purpose hashes where "close inputs generate close hashes" is actually a feature; you probably would not use those for error detection, but they may be useful for other applications.
A specific use for FNV-1a is to hash large blocks of data, then compare the computed hash to that of other blocks. Only blocks that have a matching hash need to be fully compared to see if they're identical, meaning that a large number of blocks can simply be ignored, speeding up the comparison by orders of magnitude (you can compare one 2 MB block to another in approximately the same time as you can compare its 64-bit hash to the hashes of 256Ki blocks, although you will probably have a few blocks that have colliding hashes).
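A sketch of that block-matching use, with 64-bit FNV-1a written out in pure Python. The constants are the published 64-bit offset basis and prime; the helper candidate_matches is a made-up name, and a real implementation would store the block hashes rather than recompute them.

```python
FNV64_OFFSET_BASIS = 0xcbf29ce484222325
FNV64_PRIME = 0x100000001b3

def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte in, then multiply by the FNV prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

def candidate_matches(blocks, target: bytes):
    """Indices of blocks equal to target; the full compare runs only
    when the cheap 64-bit hashes already match."""
    target_hash = fnv1a_64(target)
    return [i for i, block in enumerate(blocks)
            if fnv1a_64(block) == target_hash and block == target]

print(candidate_matches([b"alpha", b"beta", b"alpha"], b"alpha"))
```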
Note that "just a hash" may not be enough to provide security, you may also need to apply some sort of signing mechanism to ensure that you don't have the problem of someone modifying the hashed-over text as well as the hash.
Simply for ensuring storage integrity (basically "protect against accidental modification" as a threat model), a cryptographic hash without signature, plus the original size, should be good enough. You would need a really really unlikely sequence of random events mutating a fixed-length bit string to another fixed-length bit string of the same length, giving the same hash. Of course, this does not give you any sort of error correction ability, just error detection.

Perfect Hash Building

Why don't we use SHA-1, MD5 and other standard cryptographic hashes for hashing? They are smart enough to avoid collisions and are also not reversible. So rather than coming up with a set of new hash functions, which might have collisions, why don't we use them?
The only reason I can think of is that they require a large key, say 32 bits. But they still avoid collisions, so the lookup will definitely be O(1).
Because they are very slow, for two reasons:
They aim to be cryptographically secure, not just collision-resistant in general
They produce a much larger hash value than what you actually need in a hash table
Also, because they handle unstructured data (octet / byte streams), whereas the objects you need to hash are often structured and would require linearization first
Why don't we use SHA-1, MD5 and other standard cryptographic hashes for hashing? They are smart enough to avoid collisions...
Wrong because:
Two inputs can still happen to have the same hash value. Say the hash value is 32 bits: a great general-purpose hash routine (i.e. one that doesn't utilise insights into the set of actual keys) still has at least a 1/2^32 chance of returning the same hash value for any 2 keys, then (if those two did not collide) a 2/2^32 chance of colliding with one of them as a third key is hashed, 3/2^32 for the fourth, etc.
Having distinct hash values is a very different thing from having the hash values map to distinct hash buckets in a hash table. Hash values are generally modded into the table size to select a bucket, so at best - and again for general-purpose hashing - the chance of a collision when adding an element to a hash table is #preexisting-elements / table-size.
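To put numbers on both points, here is a rough sketch under the assumption of an ideal, uniformly distributed hash. The first function uses the standard birthday approximation 1 - exp(-n(n-1)/2^(b+1)) for the cumulative effect of the 1/2^32, 2/2^32, 3/2^32, ... terms; the second is the per-insert bucket figure. Both function names are invented for the example.

```python
import math

def collision_probability(num_keys: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_keys ideal hashes are equal."""
    space = 2.0 ** hash_bits
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-num_keys * (num_keys - 1) / (2.0 * space))

def bucket_collision_chance(existing: int, table_size: int) -> float:
    """Chance a new element lands in an occupied bucket, assuming the
    existing elements all sit in distinct buckets."""
    return min(1.0, existing / table_size)

for n in (2, 1_000, 77_163, 1_000_000):
    print(n, collision_probability(n, 32))

print(bucket_collision_chance(50, 100))
```

With 32-bit values the chance of some collision passes one half at around 77,000 keys, which is why distinct hash values cannot be assumed even before the values are reduced to bucket indices.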
So rather than coming up with a set of new hash functions, which might have collisions, why don't we use them?
Because speed is often the programmer's goal when choosing to use a hash table over, say, a binary tree. If the hash values are mathematically complicated to calculate, they may take a lot longer than a slightly more (but still not particularly) collision-prone but faster-to-calculate hash function. That said, there are times when more effort on the hashing can pay off - for example, when the hash table exists on magnetic disk and the I/O costs of seeking and reading records dwarf the hash calculation effort.
antti makes an interesting point about data too... general-purpose hashing routines often work on blocks of binary data with a specific starting address and a number of bytes (they may even require that number of bytes to be a multiple of 2 or 4). In many applications, data that needs to be hashed will be intermingled with data that must not be included in the hash - such as cached values, file handles, pointers/references to other data or virtual dispatch tables etc. A common solution is to hash the desired fields separately and combine the hash values - perhaps using exclusive-or. As there can be bit fields that should be hashed in the same byte of memory as other data that should not be hashed, you sometimes need custom code to extract those values. Still, even if some copying and padding were required beforehand, each individual field could eventually be hashed using md5, SHA-1 or whatever and those hash values could be similarly combined, so this complication doesn't categorically rule out the approach you're interested in.
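A small sketch of the "hash the desired fields separately and combine" approach. The `Record` class and its fields are hypothetical, SHA-1 stands in for "md5, SHA-1 or whatever", and the 8-byte truncation and integer encoding are arbitrary choices made for the example.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Record:
    name: str
    account_id: int
    cached_total: float = 0.0  # derived value that must NOT affect the hash

def field_hash(value: bytes) -> int:
    """Hash one linearized field and keep the first 8 bytes as an integer."""
    return int.from_bytes(hashlib.sha1(value).digest()[:8], "big")

def record_hash(record: Record) -> int:
    """Combine the per-field hashes with exclusive-or, skipping cached_total."""
    name_hash = field_hash(record.name.encode("utf-8"))
    id_hash = field_hash(record.account_id.to_bytes(8, "big", signed=True))
    return name_hash ^ id_hash

a = Record("alice", 42, cached_total=0.0)
b = Record("alice", 42, cached_total=99.5)
print(record_hash(a) == record_hash(b))  # cached value is ignored
```

Note that exclusive-or is order-insensitive and cancels out when two fields hash to the same value, so fields of the same type holding the same content may need a per-field prefix or a different combining step.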
The only reason I can think of is that they require a large key, say 32 bits.
All other things being equal, the larger the key the better, though if the hash function is mathematically ideal then any N of its bits - where 2^N >= # hash buckets - will produce minimal collisions.
But they still avoid collisions, so the lookup will definitely be O(1).
Again, wrong as mentioned above.
(BTW... I stress general-purpose in a couple of places above. That's just because there are trivial cases where you might have some insight into the keys you'll need to hash that allows you to position them perfectly within the available hash buckets. For example, if you knew the keys were the numbers 1000, 2000, 3000 etc. up to 100000 and that you had at least 100 hash buckets, you could trivially define your hash function as x/1000 and know you'd have perfect hashing sans collisions. This situation of knowing that all your keys map to distinct hash table buckets is known as "perfect hashing", as per your question title. A good general-purpose hash like md5 is not a perfect hash, and indeed it makes no sense to talk about perfect hashing without knowing the complete set of possible keys.)
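The x/1000 example can be checked directly. This sketch assumes exactly 100 buckets and reduces the quotient modulo the table size, as described earlier for bucket selection; the function name is made up.

```python
def perfect_hash(key: int, num_buckets: int = 100) -> int:
    """Bucket index for keys known to be 1000, 2000, ..., 100000."""
    return (key // 1000) % num_buckets

keys = list(range(1000, 100001, 1000))
buckets = [perfect_hash(k) for k in keys]

# Every key gets its own bucket, so lookups never have to resolve a collision.
print(len(keys), len(set(buckets)))
```

The guarantee holds only for this known key set: a key such as 1500 would share bucket 1 with 1000.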

Resources