Do hash functions/checksums without the diffusion property exist? - algorithm

From Wikipedia:
Diffusion means that if we change a single bit of the plaintext, then (statistically) half of the bits in the ciphertext should change, and similarly, if we change one bit of the ciphertext, then approximately one half of the plaintext bits should change.[2] Since a bit can have only two states, when they are all re-evaluated and changed from one seemingly random position to another, half of the bits will have changed state.
For example, change one bit of a file and the MD5 checksum of the file becomes completely different.
Is there any hash function/checksum that does not have the diffusion property? Ideally, if 20% of the plaintext changes then 20% of the ciphertext should change and if 80% of the plaintext changes then 80% of the ciphertext should change. This way, % change in the plaintext can be tracked via the ciphertext.

I think the closest thing to what you appear to be looking for is "locality sensitive hashing", which attempts to map similar inputs to similar outputs: https://en.wikipedia.org/wiki/Locality-sensitive_hashing
The amount of change in the hash is not really proportional to the amount of change in the input, but maybe it'll do for what you want.

Depending on what the purpose of your hash-function is, you might not need the diffusion property.
E.g for large data blobs you might want to avoid reading all bytes to generate a hash-value (because of performance). So using only the first 1000 Bytes to calculate a hash-value might be good enough for usage in a HashSet, if the 1000 bytes differ enough through your data sets.
Related to your % question:
You can create such a function, but that is quite an unusual extra feature of the hash-function.
E.g. You can split your data in 100 equal sized chunks calculate the md5 for each of them and than take the first n bytes of each md5-value and append them to create your hash-value. this way you can determine the % value out of the of the fact how many n-bytes blocks changed in your hash-value (you can even see at which position your data changed)

Yes, these functions do exist and are called very bad hash functions.
People come up with such nonsense from time to time.
The diffusion property is an essential property of a good hash function.
E.g. you could return the address of the string which does look like a real hash function looking from outside, but when you change the string in place without reallocation, you see no diffusion, no change.
Or language implementors cut off long strings during faster hash processing. These hashes will collide when you change characters at the end of the string. If you are happy with bad hashes and many collisions it's still a valid hash function. This is done in practice. Your miles may vary.

Related

Why are hashing algorithms safe to use?

Hashing algorithms today are widely used to check for integrity of data, but why are they safe to use? A 256-bit hashing algorithm generates 256 bits representation of given data. However, a 256-bit hash only has 2512 variations. But 1 KB of data has 28192 different variations. It's mathematically impossible for every piece of data in the world to have different hash values. So why are hashing algorithms safe?
The reasons why hashing algorithms are considered safe are due to the following:
They are irreversible. You can't get to the input data by reverse-engineering the output hash value.
A small change in the input will produce a vastly different hash value. i.e. "hello" vs "hellp" will generate completely different values.
The assumption being made with data integrity is that a majority of your input is going to be the same between a good copy of input data and a bad (malicious) copy of input data. The small change in data will make the hash value completely different. Therefore, if I try to inject any malicious code or data, that small change will completely throw-off the value of the hash. When comparison is done with a known hash value, it'll be easily determinable if data has been modified or corrupted.
You are correct in that there is risk of collisions between an infinite number of datasets, but when you compare two datasets that are very similar, it is reasonable to assume that the hash values of those two almost-equivalent datasets with be completely different.
Not all hashes are safe. There are good hashes (for some value of "good") where it's sufficiently non-trivial to intentionally create collisions (I think FNV-1a may fall in this category). However, a cryptographic hash, used correctly, would be computationally expensive to generate a collision for.
"Good" hashes generally have the property that small changes in the input cause large changes in the output (rule of thumb is that a single-bit flip in the input cause roughly b bit flips in the output, for a 2b hash). There are some special-purpose hashes where "close inputs generate close hashes" is actually a feature, but you probably would not use those for error detecting, but they may be useful for other applications.
A specific use for FNV-1a is to hash large blocks of data, then compare the computed hash to that of other blocks. Only blocks that have a matching hash need to be fully compared to see if they're identical, meaning that a large number of blocks can simply be ignored, speeding up the comparison by orders of magnitude (you can compare one 2 MB to another in approximately the same time as you can compare its 64-bit hash to that of the hash of 256Ki blocks; although you will probbaly have a few blocks that have colliding hashes).
Note that "just a hash" may not be enough to provide security, you may also need to apply some sort of signing mechanism to ensure that you don't have the problem of someone modifying the hashed-over text as well as the hash.
Simply for ensuring storage integrity (basically "protect against accidental modification" as a threat model), a cryptographic hash without signature, plus the original size, should be good enough. You would need a really really unlikely sequence of random events mutating a fixed-length bit string to another fixed-length bit string of the same length, giving the same hash. Of course, this does not give you any sort of error correction ability, just error detection.

Such a thing as a constant quality (variable bit) digest hashing algorithm?

Problem space: We have a ton of data to digest that can range 6 orders of magnitude in size. Looking for a way to be more efficient, and thus use less disk space to store all of these digests.
So I was thinking about lossy audio encoding, such as MP3. There are two basic approaches - constant bitrate and constant quality (aka variable bitrate). Since my primary interest is quality, I usually go for VBR. Thus, to achieve the same level of quality, a pure sin tone would require significantly lower bitrate than a something like a complex classical piece.
Using the same idea, two very small data chunks should require significantly less total digest bits than two very large data chunks to ensure roughly the same statistical improbability (what I am calling quality in this context) of their digests colliding. This is an assumption that seems intuitively correct to me, but then again, I am not a crypto mathematician. Also note that this is all about identification, not security. It's okay if a small data chunk has a small digest, and thus computationally feasible to reproduce.
I tried searching around the inter-tubes for anything like this. The closest thing I found was a posting somewhere that talked about using a fixed size digest hash, like SHA256, as a initialization vector for AES/CTR acting as a psuedo-random generator. Then taking the first x number of bit off that.
That seems like a totally do-able thing. The only problem with this approach is that I have no idea how to calculate the appropriate value of x as a function of the data chunk size. I think my target quality would be statistical improbability of SHA256 collision between two 1GB data chunks. Does anyone have thoughts on this calculation?
Are there any existing digest hashing algorithms that already do this? Or are there any other approaches that will yield this same result?
Update: Looks like there is the SHA3 Keccak "sponge" that can output an arbitrary number of bits. But I still need to know how many bits I need as a function of input size for a constant quality. It sounded like this algorithm produces an infinite stream of bits, and you just truncate at however many you want. However testing in Ruby, I would have expected the first half of a SHA3-512 to be exactly equal to a SHA3-256, but it was not...
Your logic from the comment is fairly sound. Quality hash functions will not generate a duplicate/previously generated output until the input length is nearly (or has exceeded) the hash digest length.
But, the key factor in collision risk is the size of the input set to the size of the hash digest. When using a quality hash function, the chance of a collision for two 1 TB files not significantly different than the chance of collision for two 1KB files, or even one 1TB and one 1KB file. This is because hash function strive for uniformity; good functions achieve it to a high degree.
Due to the birthday problem, the collision risk for a hash function is is less than the bitwidth of its output. That wiki article for the pigeonhole principle, which is the basis for the birthday problem, says:
The [pigeonhole] principle can be used to prove that any lossless compression algorithm, provided it makes some inputs smaller (as the name compression suggests), will also make some other inputs larger. Otherwise, the set of all input sequences up to a given length L could be mapped to the (much) smaller set of all sequences of length less than L, and do so without collisions (because the compression is lossless), which possibility the pigeonhole principle excludes.
So going to a 'VBR' hash digest is not guaranteed to save you space. The birthday problem provides the math for calculating the chance that two random things will share the same property (a hash code is a property, in a broad sense), but this article gives a better summary, including the following table.
Source: preshing.com
The top row of the table says that in order to have a 50% chance of a collision with a 32-bit hash function, you only need to hash 77k items. For a 64-bit hash function, that number rises to 5.04 billion for the same 50% collision risk. For a 160-bit hash function, you need 1.42 * 1024 inputs before there is a 50% chance that a new input will have the same hash as a previous input.
Note that 1.42 * 1024 160 bit numbers would themselves take up an unreasonably large amount of space; millions of Terabytes, if I'm doing the math right. And that's without counting for the 1024 item values they represent.
The bottom end of that table should convince you that a 160-bit hash function has a sufficiently low risk of collisions. In particular, you would have to have 1021 hash inputs before there is even a 1 in a million chance of a hash collision. That's why your searching turned up so little: it's not worth dealing with the complexity.
No matter what hash strategy you decide upon however, there is a non-zero risk of collision. Any type of ID system that relies on a hash needs to have a fallback comparison. An easy additional check for files is to compare their sizes (works well for any variable length data where the length is known, such as strings). Wikipedia covers several different collision mitigation and detection strategies for hash tables, most of which can be extended to a filesystem with a little imagination. If you require perfect fidelity, then after you've run out of fast checks, you need to fallback to the most basic comparator: the expensive bit-for-bit check of the two inputs.
If I understand the question correctly, you have a number of data items of different lengths, and for each item you are computing a hash (i.e. a digest) so the items can be identified.
Suppose you have already hashed N items (without collisions), and you are using a 64bit hash code.
The next item you hash will take one of 2^64 values and so you will have a N / 2^64 probability of a hash collision when you add the next item.
Note that this probability does NOT depend on the original size of the data item. It does depend on the total number of items you have to hash, so you should choose the number of bits according to the probability you are willing to tolerate of a hash collision.
However, if you have partitioned your data set in some way such that there are different numbers of items in each partition, then you may be able to save a small amount of space by using variable sized hashes.
For example, suppose you use 1TB disk drives to store items, and all items >1GB are on one drive, while items <1KB are on another, and a third is used for intermediate sizes. There will be at most 1000 items on the first drive so you could use a smaller hash, while there could be a billion items on the drive with small files so a larger hash would be appropriate for the same collision probability.
In this case the hash size does depend on file size, but only in an indirect way based on the size of the partitions.

Is it possible to retrieve the original message from a SHA-1 encrypted message?

Is it possible to retrieve the original message from a SHA-1 encrypted message? If I have an SHA -1 encrypted message, what all paratmeters do i need to get the original message from it?
I answered a similar question already: Python SHA1 DECODE function
In short, no it is not possible. The whole point of hashing is to take some long string and turn it into a small one. Hashing is destructive and you lose data, so it is irreversible.
Also, to make things more fun, infinitely many strings have the same hash1. It is impossible to generate a unique string with a given hash unless you know more information about the input.
1: There are tons of hash functions and some may have "special" hashes that are only generated when you give a specific input to the function. Aside from those rare cases (if they even exist), every other output hash has infinitely many input strings.
http://en.wikipedia.org/wiki/Cryptographic_hash_function
it is infeasible to generate a message that has a given hash
The SHA-1 hash generate a 160-bit output from an arbitrarily sized input. As there is more possible inputs than the 2^160 possible output, there is bound to be collision, ie. different input having the same output.
This mean that it may be possible (via brute-force or by exploiting a weakness in the algorithm — none are known at the moment I think) to find a message corresponding to a given hash, but it may not be the original message.
Even if you fix the size of the input, if it is larger than 160 bits, there will be collision, and no way to invert the hash function.
Hashing is not encryption. Encryption is like shuffling the pieces of a jigsaw puzzle. Hashing is more like putting the pieces in a blender, there's no reasonable way to restore the original picture after that.
If you know the length of the original message (in multiples of 512 bits), you'll only need to test the 2^512 inputs of that size. Apply a SHA1 operation to each, and compare the result. This assumes no salting, and rather significant computational resources.

Is Terra Compression possible? If so, please explain and provide samples

Long Ascii String Text may or may not be crushed and compressed into hash kind of ascii "checksum" by using sophisticated mathematical formula/algorithm. Just like air which can be compressed.
To compress megabytes of ascii text into a 128 or so bytes, by shuffling, then mixing new "patterns" of single "bytes" turn by turn from the first to the last. When we are decompressing it, the last character is extracted first, then we just go on decompression using the formula and the sequential keys from the last to the first. The sequential keys and the last and the first bytes must be exactly known, including the fully updated final compiled string, and the total number of bytes which were compressed.
This is the terra compression I was thinking about. Is this possible? Can you explain examples. I am working on this theory and it is my own thought.
In general? Absolutely not.
For some specific cases? Yup. A megabyte of ASCII text consisting only of spaces is likely to compress extremely well. Real text will generally compress pretty well... but not in the order of several megabytes into 128 bytes.
Think about just how many strings - even just strings of valid English words - can fit into several megabytes. Far more than 256^128. They can't all compress down to 128 bytes, by the pigeon-hole principle...
If you have n possible input strings and m possible compressed strings and m is less than n then two strings must map to the same compressed string. This is called the pigeonhole principle and is the fundemental reason why there is a limit on how much you can compress data.
What you are describing is more like a hash function. Many hash functions are designed so that given a hash of a string it is extremely unlikely that you can find another string that gives the same hash. But there is no way that given a hash you can discover the original string. Even if you are able to reverse the hashing operation to produce a valid input that gives that hash, there are infinitely many other inputs that would give the same hash. You wouldn't know which of them is the "correct" one.
Information theory is the scientific field which addresses questions of this kind. It also provides you the possibility to calculate the minimum amount of bits needed to store a compressed message (with lossless compression). This lower bound is known as the Entropy of the message.
Calculation of the Entropy of a piece of text is possible using a Markov model. Such a model uses information how likely a certain sequence of characters of the alphabet is.
The air analogy is very wrong.
When you compress air you make the molecules come closer to each other, each molecule is given less space.
When you compress data you can not make the bit smaller (unless you put your harddrive in a hydraulic press). The closest you can get of actually making bits smaller is increasing the bandwidth of a network, but that is not compression.
Compression is about finding a reversible formula for calculating data. The "rules" about data compression are like
The algorithm (including any standard start dictionaries) is shared before hand and not included in the compressed data.
All startup parameters must be included in the compressed data, including:
Choice of algorithmic variant
Choice of dictionaries
All compressed data
The algorithm must be able to compress/decompress all possible messages in your domain (like plain text, digits or binary data).
To get a feeling of how compression works you may study some examples, like Run length encoding and Lempel Ziv Welch.
You may be thinking of fractal compression which effectively works by storing a formula and start values. The formula is iterated a certain number of times and the result is an approximation of the original input.
This allows for high compression but is lossy (output is close to input but not exactly the same) and compression can be very slow. Even so, ratios of 170:1 are about the highest achieved at the moment.
This is a bit off topic, but I'm reminded of the Broloid compression joke thread that appeared on USENET ... back in the days when USENET was still interesting.
Seriously, anyone who claims to have a magical compression algorithm that reduces any text megabyte file to a few hundred bytes is either:
a scammer or click-baiter,
someone who doesn't understand basic information theory, or
both.
You can compress test to a certain degree because it doesn't use all the available bits (i.e. a-z and A-Z make up 52 out of 256 values). Repeating patterns allow some intelligent storage (zip).
There is no way to store arbitrary large chunks of text in any fixed length number of bytes.
You can compress air, but you won't remove it's molecules! It's mass keeps the same.

How to test a hash function?

Is there a way to test the quality of a hash function? I want to have a good spread when used in the hash table, and it would be great if this is verifyable in a unit test.
EDIT: For clarification, my problem was that I have used long values in Java in such a way that the first 32 bit encoded an ID and the second 32 bit encoded another ID. Unfortunately Java's hash of long values just XORs the first 32 bit with the second 32 bits, which in my case led to very poor performance when used in a HashMap. So I need a different hash, and would like to have a Unit Test so that this problem cannot creep in any more.
You have to test your hash function using data drawn from the same (or similar) distribution that you expect it to work on. When looking at hash functions on 64-bit longs, the default Java hash function is excellent if the input values are drawn uniformly from all possible long values.
However, you've mentioned that your application uses the long to store essentially two independent 32-bit values. Try to generate a sample of values similar to the ones you expect to actually use, and then test with that.
For the test itself, take your sample input values, hash each one and put the results into a set. Count the size of the resulting set and compare it to the size of the input set, and this will tell you the number of collisions your hash function is generating.
For your particular application, instead of simply XORing them together, try combining the 32-bit values in ways a typical good hash function would combine two indepenet ints. I.e. multiply by a prime, and add.
First I think you have to define what you mean by a good spread to yourself. Do you mean a good spread for all possible input, or just a good spread for likely input?
For example, if you're hashing strings that represent proper full (first+last) names, you're not going to likely care about how things with the numerical ASCII characters hash.
As for testing, your best bet is to probably get a huge or random input set of data you expect, and push it through the hash function and see how the spread ends up. There's not likely going to be a magic program that can say "Yes, this is a good hash function for your use case.". However, if you can programatically generate the input data, you should easily be able to create a unit test that generates a significant amount of it and then verify that the spread is within your definition of good.
Edit: In your case with a 64 bit long, is there even really a reason to use a hash map? Why not just use a balanced tree directly, and use the long as the key directly rather than rehashing it? You pay a little penalty in overall node size (2x the size for the key value), but may end up saving it in performance.
If your using a chaining hash table, what you really care about is the number of collisions. This would be trivial to implement as a simple counter on your hash table. Every time an item is inserted and the table has to chain, increment a chain counter. A better hashing algorithm will result in a lower number of collisions. A good general purpose table hashing function to check out is: djb2
Based on your clarification:
I have used long values in Java in such a way that the first 32 bit encoded an ID and the second 32 bit encoded another ID. Unfortunately Java's hash of long values just XORs the first 32 bit with the second 32 bits, which in my case led to very poor performance when used in a HashMap.
it appears you have some unhappy "resonances" between the way you assign the two ID values and the sizes of your HashMap instances.
Are you explicitly sizing your maps, or using the defaults? A QAD check seems to indicate that a HashMap<Long,String> starts with a 16-bucket structure and doubles on overflow. That would mean that only the low-order bits of the ID values are actually participating in the hash bucket selection. You could try using one of the constructors that takes an initial-size parameter and create your maps with a prime initial size.
Alternately, Dave L's suggestion of defining your own hashing of long keys would allow you to avoid the low-bit-dependency problem.
Another way to look at this is that you're using a primitive type (long) as a way to avoid defining a real class. I'd suggest looking at the benefits you could achieve by defining the business classes and then implementing hash-coding, equality, and other methods as appropriate on your own classes to manage this issue.

Resources