Why are hashing algorithms safe to use? - algorithm

Hashing algorithms today are widely used to check for integrity of data, but why are they safe to use? A 256-bit hashing algorithm generates 256 bits representation of given data. However, a 256-bit hash only has 2512 variations. But 1 KB of data has 28192 different variations. It's mathematically impossible for every piece of data in the world to have different hash values. So why are hashing algorithms safe?

The reasons why hashing algorithms are considered safe are due to the following:
They are irreversible. You can't get to the input data by reverse-engineering the output hash value.
A small change in the input will produce a vastly different hash value. i.e. "hello" vs "hellp" will generate completely different values.
The assumption being made with data integrity is that a majority of your input is going to be the same between a good copy of input data and a bad (malicious) copy of input data. The small change in data will make the hash value completely different. Therefore, if I try to inject any malicious code or data, that small change will completely throw-off the value of the hash. When comparison is done with a known hash value, it'll be easily determinable if data has been modified or corrupted.
You are correct in that there is risk of collisions between an infinite number of datasets, but when you compare two datasets that are very similar, it is reasonable to assume that the hash values of those two almost-equivalent datasets with be completely different.

Not all hashes are safe. There are good hashes (for some value of "good") where it's sufficiently non-trivial to intentionally create collisions (I think FNV-1a may fall in this category). However, a cryptographic hash, used correctly, would be computationally expensive to generate a collision for.
"Good" hashes generally have the property that small changes in the input cause large changes in the output (rule of thumb is that a single-bit flip in the input cause roughly b bit flips in the output, for a 2b hash). There are some special-purpose hashes where "close inputs generate close hashes" is actually a feature, but you probably would not use those for error detecting, but they may be useful for other applications.
A specific use for FNV-1a is to hash large blocks of data, then compare the computed hash to that of other blocks. Only blocks that have a matching hash need to be fully compared to see if they're identical, meaning that a large number of blocks can simply be ignored, speeding up the comparison by orders of magnitude (you can compare one 2 MB to another in approximately the same time as you can compare its 64-bit hash to that of the hash of 256Ki blocks; although you will probbaly have a few blocks that have colliding hashes).
Note that "just a hash" may not be enough to provide security, you may also need to apply some sort of signing mechanism to ensure that you don't have the problem of someone modifying the hashed-over text as well as the hash.
Simply for ensuring storage integrity (basically "protect against accidental modification" as a threat model), a cryptographic hash without signature, plus the original size, should be good enough. You would need a really really unlikely sequence of random events mutating a fixed-length bit string to another fixed-length bit string of the same length, giving the same hash. Of course, this does not give you any sort of error correction ability, just error detection.

Related

Do hash functions/checksums without the diffusion property exist?

From Wikipedia:
Diffusion means that if we change a single bit of the plaintext, then (statistically) half of the bits in the ciphertext should change, and similarly, if we change one bit of the ciphertext, then approximately one half of the plaintext bits should change.[2] Since a bit can have only two states, when they are all re-evaluated and changed from one seemingly random position to another, half of the bits will have changed state.
For example, change one bit of a file and the MD5 checksum of the file becomes completely different.
Is there any hash function/checksum that does not have the diffusion property? Ideally, if 20% of the plaintext changes then 20% of the ciphertext should change and if 80% of the plaintext changes then 80% of the ciphertext should change. This way, % change in the plaintext can be tracked via the ciphertext.
I think the closest thing to what you appear to be looking for is "locality sensitive hashing", which attempts to map similar inputs to similar outputs: https://en.wikipedia.org/wiki/Locality-sensitive_hashing
The amount of change in the hash is not really proportional to the amount of change in the input, but maybe it'll do for what you want.
Depending on what the purpose of your hash-function is, you might not need the diffusion property.
E.g for large data blobs you might want to avoid reading all bytes to generate a hash-value (because of performance). So using only the first 1000 Bytes to calculate a hash-value might be good enough for usage in a HashSet, if the 1000 bytes differ enough through your data sets.
Related to your % question:
You can create such a function, but that is quite an unusual extra feature of the hash-function.
E.g. You can split your data in 100 equal sized chunks calculate the md5 for each of them and than take the first n bytes of each md5-value and append them to create your hash-value. this way you can determine the % value out of the of the fact how many n-bytes blocks changed in your hash-value (you can even see at which position your data changed)
Yes, these functions do exist and are called very bad hash functions.
People come up with such nonsense from time to time.
The diffusion property is an essential property of a good hash function.
E.g. you could return the address of the string which does look like a real hash function looking from outside, but when you change the string in place without reallocation, you see no diffusion, no change.
Or language implementors cut off long strings during faster hash processing. These hashes will collide when you change characters at the end of the string. If you are happy with bad hashes and many collisions it's still a valid hash function. This is done in practice. Your miles may vary.

Such a thing as a constant quality (variable bit) digest hashing algorithm?

Problem space: We have a ton of data to digest that can range 6 orders of magnitude in size. Looking for a way to be more efficient, and thus use less disk space to store all of these digests.
So I was thinking about lossy audio encoding, such as MP3. There are two basic approaches - constant bitrate and constant quality (aka variable bitrate). Since my primary interest is quality, I usually go for VBR. Thus, to achieve the same level of quality, a pure sin tone would require significantly lower bitrate than a something like a complex classical piece.
Using the same idea, two very small data chunks should require significantly less total digest bits than two very large data chunks to ensure roughly the same statistical improbability (what I am calling quality in this context) of their digests colliding. This is an assumption that seems intuitively correct to me, but then again, I am not a crypto mathematician. Also note that this is all about identification, not security. It's okay if a small data chunk has a small digest, and thus computationally feasible to reproduce.
I tried searching around the inter-tubes for anything like this. The closest thing I found was a posting somewhere that talked about using a fixed size digest hash, like SHA256, as a initialization vector for AES/CTR acting as a psuedo-random generator. Then taking the first x number of bit off that.
That seems like a totally do-able thing. The only problem with this approach is that I have no idea how to calculate the appropriate value of x as a function of the data chunk size. I think my target quality would be statistical improbability of SHA256 collision between two 1GB data chunks. Does anyone have thoughts on this calculation?
Are there any existing digest hashing algorithms that already do this? Or are there any other approaches that will yield this same result?
Update: Looks like there is the SHA3 Keccak "sponge" that can output an arbitrary number of bits. But I still need to know how many bits I need as a function of input size for a constant quality. It sounded like this algorithm produces an infinite stream of bits, and you just truncate at however many you want. However testing in Ruby, I would have expected the first half of a SHA3-512 to be exactly equal to a SHA3-256, but it was not...
Your logic from the comment is fairly sound. Quality hash functions will not generate a duplicate/previously generated output until the input length is nearly (or has exceeded) the hash digest length.
But, the key factor in collision risk is the size of the input set to the size of the hash digest. When using a quality hash function, the chance of a collision for two 1 TB files not significantly different than the chance of collision for two 1KB files, or even one 1TB and one 1KB file. This is because hash function strive for uniformity; good functions achieve it to a high degree.
Due to the birthday problem, the collision risk for a hash function is is less than the bitwidth of its output. That wiki article for the pigeonhole principle, which is the basis for the birthday problem, says:
The [pigeonhole] principle can be used to prove that any lossless compression algorithm, provided it makes some inputs smaller (as the name compression suggests), will also make some other inputs larger. Otherwise, the set of all input sequences up to a given length L could be mapped to the (much) smaller set of all sequences of length less than L, and do so without collisions (because the compression is lossless), which possibility the pigeonhole principle excludes.
So going to a 'VBR' hash digest is not guaranteed to save you space. The birthday problem provides the math for calculating the chance that two random things will share the same property (a hash code is a property, in a broad sense), but this article gives a better summary, including the following table.
Source: preshing.com
The top row of the table says that in order to have a 50% chance of a collision with a 32-bit hash function, you only need to hash 77k items. For a 64-bit hash function, that number rises to 5.04 billion for the same 50% collision risk. For a 160-bit hash function, you need 1.42 * 1024 inputs before there is a 50% chance that a new input will have the same hash as a previous input.
Note that 1.42 * 1024 160 bit numbers would themselves take up an unreasonably large amount of space; millions of Terabytes, if I'm doing the math right. And that's without counting for the 1024 item values they represent.
The bottom end of that table should convince you that a 160-bit hash function has a sufficiently low risk of collisions. In particular, you would have to have 1021 hash inputs before there is even a 1 in a million chance of a hash collision. That's why your searching turned up so little: it's not worth dealing with the complexity.
No matter what hash strategy you decide upon however, there is a non-zero risk of collision. Any type of ID system that relies on a hash needs to have a fallback comparison. An easy additional check for files is to compare their sizes (works well for any variable length data where the length is known, such as strings). Wikipedia covers several different collision mitigation and detection strategies for hash tables, most of which can be extended to a filesystem with a little imagination. If you require perfect fidelity, then after you've run out of fast checks, you need to fallback to the most basic comparator: the expensive bit-for-bit check of the two inputs.
If I understand the question correctly, you have a number of data items of different lengths, and for each item you are computing a hash (i.e. a digest) so the items can be identified.
Suppose you have already hashed N items (without collisions), and you are using a 64bit hash code.
The next item you hash will take one of 2^64 values and so you will have a N / 2^64 probability of a hash collision when you add the next item.
Note that this probability does NOT depend on the original size of the data item. It does depend on the total number of items you have to hash, so you should choose the number of bits according to the probability you are willing to tolerate of a hash collision.
However, if you have partitioned your data set in some way such that there are different numbers of items in each partition, then you may be able to save a small amount of space by using variable sized hashes.
For example, suppose you use 1TB disk drives to store items, and all items >1GB are on one drive, while items <1KB are on another, and a third is used for intermediate sizes. There will be at most 1000 items on the first drive so you could use a smaller hash, while there could be a billion items on the drive with small files so a larger hash would be appropriate for the same collision probability.
In this case the hash size does depend on file size, but only in an indirect way based on the size of the partitions.

Perfect Hash Building

Why don't we use SHA-1, md5Sum and other standard cryptography hashes for hashing. They are smart enough to avoid collisions and are also not revertible. So rather then coming up with a set of new hash function , which might have collisions , why don't we use them.
Only reason I am able to think is they require say large key say 32bit.But still avoiding collision so the look up will definitely be O(1).
Because they are very slow, for two reasons:
They aim to be crytographically secure, not only collision-resistant in general
They produce a much larger hash value than what you actually need in a hash table
Because they handle unstructured data (octet / byte streams) but the objects you need to hash are often structured and would require linearization first
Why don't we use SHA-1, md5Sum and other standard cryptography hashes for hashing. They are smart enough to avoid collisions...
Wrong because:
Two inputs cam still happen to have the same hash value. Say the hash value is 32 bit, a great general-purpose hash routine (i.e. one that doesn't utilise insights into the set of actual keys) still has at least 1/2^32 chance of returning the same hash value for any 2 keys, then 2/2^32 chance of colliding with one of those as a third key is hashed, 3/2^32 for the fourth etc..
Having distinct hash values is a very different thing from having the hash values map to distinct hash buckets in a hash table. Hash values are generally modded into the table size to select a bucket, so at best - and again for general-purpose hashing - the chance of a collision when adding an element to a hash table is #preexisting-elements / table-size.
So rather then coming up with a set of new hash function , which might have collisions , why don't we use them.
Because speed is often the programmer's goal when choosing to use a hash table over say a binary tree. If the hash values are mathematically complicated to calculate, they may take a lot longer than using a slightly more (but still not particularly) collision prone but faster-to-calculate hash function. That said, there are times when more effort on the hashing can pay off - for example, when the hash table exists on magnetic disk and the I/O costs of seeking & reading records dwarfs hash calculation effort.
antti makes an interesting point about data too... general purpose hashing routines often work on blocks of binary data with a specific starting address and a number of bytes (they may even require that number of bytes to be a multiple of 2 or 4). In many applications, data that needs to be hashed will be intermingled with data that must not be included in the hash - such as cached values, file handles, pointers/references to other data or virtual dispatch tables etc.. A common solution is to hash the desired fields separately and combine the hash keys - perhaps using exclusive-or. As there can be bit fields that should be hashed in the same byte of memory as other data that should not be hashed, you sometimes need custom code to extract those values. Still, even if some copying and padding was required beforehand, each individual field could eventually be hashed using md5, SHA-1 or whatever and those hash values could be similarly combined, so this complication doesn't really categorically rule out the approach you're interested in.
Only reason I am able to think is they require say large key say 32bit.
All other things being equal, the larger the key the better, though if the hash function is mathematically ideal then any N of its bits - where 2^N >= # hash buckets - will produce minimal collisions.
But still avoiding collision so the look up will definitely be O(1).
Again, wrong as mentioned above.
(BTW... I stress general-purpose in a couple places above. That's just because there are trivial cases where you might have some insight into the keys you'll need to hash that allows you to position them perfectly within the available hash buckets. For example, if you knew the keys were the numbers 1000, 2000, 3000 etc. up to 100000 and that you had at least 100 hash buckets, you could trivially define your hash function as x/1000 and know you'd have perfect hashing sans collisions. This situation of knowing that all your keys map to distinct hash table buckets is known as "perfect hashing" - as per your question title - a good general-purpose hash like md5 is not a perfect hash, and indeed it makes no sense to talk about perfect hashing without knowing the complete set of possible keys).

Do cryptographic hashes provide really unique results?

I was wondering whether md5, sha1 and anothers return unique values.
For example, sha1() for test returns a94a8fe5ccb19ba61c4c0873d391e987982fbbd3, which is 40 characters long. So, sha1 for strings larger than 40 chars must be the same (of course it's scrambled, because the given input may contain whitespaces and special chars etc.).
Due to this, when we are storing users' passwords, they can enter either their original password or some super-long one, which nobody knows.
Is this right, or do these hash algorithms provide really unique results - I'm quite sure it's hardly possible.
(Note: You're asking about hashing functions, not encryption).
It's impossible for them to be unique, by definition. They take a large input and reduce its size. It obviously follows, then, that they can't represent all the information they have compressed. So no, they don't provide "truly unique" results.
What they do provide, however, is "collision resistant" results. I.e. they try and show that two slightly different datas produce a significantly different hash.
Hashing algorithms (which is what you are referring to) do not provide unique results. What you are referring to is called the Pigeonhole Principle. The number of inputs exceeds the number of outputs, so multiple inputs must be mapped to the same output. This is why the longer the output hash the better, because there are less number of inputs mapped to an output.
Encrypting something must provide a unique results, because you can encrypt a message and decrypt it and get the same message.
SHA1 is not encryption algorithm, but a cryptographic hash function.
You are right - since it maps arbitrary long input to a fixed size hash there can be collisions. But the idea of a cryptographic hash function is to make it impossible to create such collisions "on demand". That's why we call them one-way hash functions, too.
Quote (source):
The ideal cryptographic hash function has four main or significant properties:
* it is easy to compute the hash value for any given message,
* it is infeasible to find a message that has a given hash,
* it is infeasible to modify a message without changing its hash,
* it is infeasible to find two different messages with the same hash.
Hashing algorithms never guarantee a different result for a different input. That's why hashing is always used as a one-way "encryption".
But you have to be realistic, a 160-bit hashing algorithm can have 2^160 possible combinations, which is... a lot! (1 with 48 zeroes)
These are not encryption functions, but hashing ones.
Hashing, by definition, can have two different strings collide (map to the same value) for the very reasons you mention. But that is usually not relevant because:
Cryptographic hashes (such as SHA1) try hard to make the collision probability for similar strings (very, very) low
You cannot deduce the original string from the hash.
These two mean that you cannot take a hash and easily generate one of the strings that map to it.

When do hashes collide?

I understand that according to pigeonhole principle, if number of items is greater than number of containers, then at least one container will have more than one item. Does it matter which container will it be? How does this apply to MD5, SHA1, SHA2 hashes?
No it doesn't matter which container it is, and in fact this is not that important to cryptographic hashes; much more important is the birthday paradox, which says that you only need to hash sqrt(numberNeededByPigeonHolePrincipal) values, on average, before finding a collision.
Thus, the hash needs to be large enough that the square-root of the search space is too large to brute-force. The square-root-of-search-space for SHA1 is 280, and as of March 2012, no two values have ever been found with the same SHA1-hash (though I predict that will happen within the next year or two..); same with SHA2, a family of hashes which all have an even larger search-space. MD5 has been broken for a while though.
If you have more items to hash than you have slots, then you'll have hash collisions. But if you have a poor hashing algorithm, then you'll see collisions even when the items / slots ratio is very small. A good hashing algorithm (including most of the ones you'll see in the wild) will attempt to spread the resulting hashes over the entire output space as evenly as possible, and thus minimize collisions.
Note that a hash collision is not the end of the world. When used in a hash table, for instance, it just means that more than one item is stored in a slot, and the table code will have to traverse a little bit more to find or add the target item, increasing lookup time slightly.
You'll see people refer to MD5 as a "broken" hashing algorithm, when in reality, it's just a poor one to use as a cryptographic hash. It'll be better than one you build yourself.
The point of a hash function is to randomly distribute items into containers. For any good hash function, it doesn't/shouldn't "matter" which container is which as they must be indistinguishable.
This does not apply to "perfect hash" implementations which attempt to do better than random distribution — unlike the algorithms you mentioned.
As Michael mentioned, collisions happen LONG before there are as many items as slots. You must have graceful collision handling (or a perfect hash) if you want to handle the birthday paradox.
I think which application you're using the hash function for is an important distinction. Frequent collision in hashing containers, for example, can degrade performance. Frequent collision in cryptography will have far more devastating consequences (see: cryptographic hash function on Wikipedia).
Collision happens relatively easily even with "decent" hashing algorithm. For example, in Java,
String s = new String(new char[size]);
always hashes to 0. That is, all strings containing only \0 hash to 0 in Java.
As for "does it matter which container will it be?", again it depends on the application. You can design hash functions that would hash "similar" objects to nearby values. This is useful when you want to search for similar objects, for example. Just hash them all and see where they fall. In this case, collisions or near-collisions are desirable, because it groups objects that are similar.
In other applications, you want even the slightest change in the object to result in an entirely different hash value. This is the case in cryptography, for example, where you want to be as certain as possible that something has not been modified. It is far more difficult to find different objects that hash to the same value in this case.
Depending on your application, cryptographic hashes like MDA, SHA1/2 etc. may not be the ideal choice, precisely because they appear as if entirely random, thus giving you collisions as prediced by the birthday paradox. Traditionally, one reason for using simple hashes based on the remainder operation is that keys were expected to be serial numbers or similar, so that a remainder operation would sustain fewer collisions than expected at random. E.g. if the keys are the integers are 1..1000 you might have no collisions at all in a container of size 1009 if your hash function is the key mod 1009. People would sometimes hand-tune systems by carefully picking container size and hash function to achieve an even split.
Of course, if you have to worry about people maliciously choosing keys that will cause you difficulty, or an upstream system sending you very biassed keys (because e.g. it has its own hash table and decides to process all keys that hash to X at once). you may wish to use a hash based on a keyed cryptographic hash function to defend against this.

Resources