why is hash output fixed in length? - algorithm

Hash functions always produce a fixed-length output regardless of the input (e.g. MD5 -> 128 bits, SHA-256 -> 256 bits), but why?
I know that it is how the designers made them, but why did they design the output to have the same length?
So that hashes can be stored in a consistent fashion? Compared more easily? Kept simple?

Because that is what the definition of a hash is. Refer to Wikipedia:
A hash function is any function that can be used to map digital data
of arbitrary size to digital data of fixed size.
If your question relates to why it is useful for a hash to be a fixed size there are multiple reasons (non-exhaustive list):
Hashes typically encode a larger (often arbitrary size) input into a smaller size, generally in a lossy way, i.e. unlike compression functions, you cannot reconstruct the input from the hash value by "reversing" the process.
Having a fixed size output is convenient, especially for hashes designed to be used as a lookup key.
You can predictably (pre)allocate storage for hash values and index them in a contiguous memory segment such as an array.
For hashes of "native word sizes", e.g. 16, 32 and 64 bit integer values, you can do very fast equality and ordering comparisons.
Any algorithm working with hash values can use a single set of fixed size operations for generating and handling them.
You can predictably combine hashes produced with different hash functions in e.g. a bloom filter.
You don't need to waste any space to encode how big the hash value is.
There do also exist special hash functions capable of producing an output of any requested length, such as so-called sponge functions.
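For a concrete illustration of the fixed-size property, here is a minimal Python sketch (using only the standard hashlib module): whatever the input size, the digest length never changes.

    import hashlib

    # Inputs of wildly different sizes...
    short_input = b"hi"
    long_input = b"x" * 10_000_000  # ten million bytes

    # ...always produce digests of the same fixed length.
    for data in (short_input, long_input):
        digest = hashlib.sha256(data).digest()
        print(len(digest))  # 32 bytes (256 bits) in both cases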

As you can see, it is simply the standard definition.
Truncation to other lengths is also specified in the standard itself (NIST's Secure Hash Standard):
Some application may require a hash function with a message digest
length different than those provided by the hash functions in this
Standard. In such cases, a truncated message digest may be used,
whereby a hash function with a larger message digest length is applied
to the data to be hashed, and the resulting message digest is
truncated by selecting an appropriate number of the leftmost bits.
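As a hedged sketch of what that truncation looks like in practice (Python, with an illustrative helper name that is not part of any standard):

    import hashlib

    def truncated_digest(data: bytes, num_bytes: int) -> bytes:
        # Apply a hash with a larger digest, then keep only the
        # leftmost bytes, as the standard describes.
        full = hashlib.sha256(data).digest()  # 32 bytes
        return full[:num_bytes]

    print(truncated_digest(b"hello", 16).hex())  # a 128-bit truncated digest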

Often it's because you want to use the hash value, or some part of it, to quickly store and look up values in a fixed-size array. (This is how a non-resizable hashtable works, for example.)
And why use a fixed-size array instead of some other, growable data structure (like a linked list or binary tree)? Because accessing them tends to be both theoretically and practically fast: provided that the hash function is good and the fraction of occupied table entries isn't too high, you get O(1) lookups (vs. O(log n) lookups for tree-based data structures or O(n) for lists) on average. And these accesses are fast in practice: after calculating the hash, which usually takes linear time in the size of the key with a low hidden constant, there's often just a bit shift, a bit mask and one or two indirect memory accesses into a contiguous block of memory that (a) makes good use of cache and (b) pipelines well on modern CPUs because few pointer indirections are needed.
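To make the "bit mask plus one or two memory accesses" step concrete, here is a minimal sketch (assuming a power-of-two table size, which is what lets a mask replace the modulo, and ignoring collision handling entirely):

    # Toy fixed-size hash table lookup.
    TABLE_SIZE = 1024                # power of two, so masking == modulo
    table = [None] * TABLE_SIZE

    def slot_for(key) -> int:
        h = hash(key)                # fixed-size integer hash
        return h & (TABLE_SIZE - 1)  # cheap bit mask instead of %

    table[slot_for("apple")] = ("apple", 42)
    print(table[slot_for("apple")])  # O(1) average-case lookup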

Related

Is there a two-way hashing algorithm in PHP?

Disclaimer: I understand that a hash is not supposed to be reversible.
I've seen many people ask if there is a way to "unhash" text that is already hashed. However, I am not seeing a straight answer. Most answers state that MD5 and SHA-1 are one-way hashing algorithms, and therefore irreversible. That's great and all, but it raises the question: are all hashing algorithms one-way and irreversible?
A hash function is any function that can be used to map data of arbitrary size to data of fixed size. (source: Wikipedia)
Because the set of possible input values is infinite and the number of possible distinct output values is finite, the function produces the same output for infinitely many input values. This means a hash is an information-losing function.
Assuming one could "reverse" the hashing, they would get an infinite set of possible original values; it would still be impossible to tell which value was actually used to generate the hash.
In mathematical terms, a hash function is not injective and this property automatically makes it not invertible.
All of the above apply to any hash function, no matter what language or library provides it.
Not really. The one absolutely non-negotiable property of a hash function is that it converts data of arbitrary length to values of a fixed length. This means each possible result of your hashing function has infinitely many possible inputs that could produce it, making reversing the hash function to a single value impossible.
If you can place constraints on the length of your data input, then technically you could define a reversible hash function, but I don't particularly see a use for it.
... are all hashing algorithms one-way and irreversible?
There are some real-world hash functions that can be reversed, such as the not-uncommon implementation of nominally hashing an 8-, 16-, 32- or 64-bit number by returning the input unchanged. Many C++ standard libraries, Python and other languages do exactly that, as it's often good enough for use by hash tables keyed on the numbers - the extra potential for collisions must be weighed up against the time that would have been needed to generate a stronger hash, and indeed even the potential CPU-cache benefits of nearby keys hashing to nearby buckets.
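You can check this trivially reversible integer hashing in CPython, for example (an implementation detail, not a language guarantee):

    # In CPython, small integers hash to themselves:
    print(hash(42))       # 42
    print(hash(123456))   # 123456
    # For these inputs the "hash" is the input, hence perfectly reversible.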
That said, your question starts...
I've seen many people ask if there is a way to "unhash" text that is already hashed.
For very short amounts of text, such as 8-character passwords, brute force attacks using dictionaries and mutation rules (e.g. "try a dictionary word followed by each character from space (ASCII 32) through tilde (ASCII 126)", "try all combinations of replacing letters with similar-looking or -sounding numbers"...) can sometimes find the password likely used (though there's a small chance it's another password with the same hash value).
If the input wasn't based on a dictionary word or something else guessable, it's far less likely to be crackable.
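A toy brute-force sketch (the target and word list are made up for illustration; real attacks use huge dictionaries plus mutation rules):

    import hashlib

    target = hashlib.md5(b"hunter2").hexdigest()  # pretend this leaked

    wordlist = [b"password", b"letmein", b"hunter2", b"qwerty"]
    for word in wordlist:
        if hashlib.md5(word).hexdigest() == target:
            print("candidate found:", word)  # likely the original input
            break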
For longer amounts of text, it's increasingly impractical to find any input with matching hash value, and massively less likely that any such input would actually be the one originally used to generate the hash (with longer inputs, more of them will - on average - map to any given hash value). Once the text input is dozens of times longer than the hash value, it's totally impractical (unless perhaps quantum computing develops significantly). (Note that Microsoft's C++ compiler's std::hash<std::string> only combines 10 characters evenly spaced along any string to form the hash value, so longer strings don't increase the quality of the hash, but on the other hand the hash only provides any insight at all into the max 10 characters chosen to form it).
Most answers state that MD5 and SHA-1 are one-way hashing algorithms, and therefore irreversible.
Hashes suitable for password storage (as distinct from general cryptographic or hash table use) should inherently take a relatively long time to calculate (some goodly fraction of a second on likely hardware), so that the brute-force dictionary attacks mentioned above are prohibitively compute-intensive even for short textual strings. This helps make them practically irreversible. Even reasonable checksum-strength hash functions will be hard to reverse once there are more bytes of input than there are bytes in the hash value, rapidly becoming practically irreversible as the input gets larger and larger.
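This is why password storage uses deliberately slow constructions. A sketch with the standard library's PBKDF2 (the iteration count is illustrative and should be tuned to your hardware):

    import hashlib, os

    salt = os.urandom(16)
    # Hundreds of thousands of iterations make each guess expensive.
    dk = hashlib.pbkdf2_hmac("sha256", b"my password", salt, 600_000)
    print(dk.hex())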

Looking for a one-way function with small input and long output

I'm looking for an algorithm which is a one-way function, like a hash function, but which accepts a small input (several bits, less than 512 bits) and maps it to a long output (1 KB or more). Do you know an algorithm or a function like this?
By Shannon's theorem, you don't gain any security by having a ciphertext of a size bigger than your plaintext, unless the key (or the procedure used to create the ciphertext) is different for every input. Even in this case, you need to assign only one key (or mechanism) to each input x, otherwise you violate the definition of a function. So if you apply an encryption mechanism f: X (set of inputs) -> Y (set of outputs), then |Y| <= |X|.
All this is to say that if your input is less than 512 bits, you gain nothing by producing a 1 KB output. Now, I recommend using one of the functions listed on the one-way function wiki page.
Keccak has variable-length output (although this was not evaluated in the SHA-3 competition); its "security claim is disentangled from the output length. There is a minimum output length..." The Skein hash function likewise has a variable output of up to 16 exabytes.
Whatever your reasons are, you can calculate hashes of the same small data using different algorithms, then concatenate those hashes. If the output is not large enough, calculate hashes of hashes and append them.
As pointed out in other answers, though, this doesn't make much sense from a security perspective.
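For what it's worth, both ideas above can be sketched with Python's hashlib (SHAKE gives genuinely variable-length output; the counter loop is the "hashes of hashes" concatenation):

    import hashlib

    seed = b"small input, under 512 bits"

    # Option 1: an extendable-output function (XOF) such as SHAKE-256.
    long_output = hashlib.shake_256(seed).digest(1024)  # 1 KB of output
    print(len(long_output))  # 1024

    # Option 2: concatenate counter-derived hashes until long enough.
    chunks = []
    counter = 0
    while sum(len(c) for c in chunks) < 1024:
        chunks.append(hashlib.sha256(seed + counter.to_bytes(4, "big")).digest())
        counter += 1
    print(len(b"".join(chunks)[:1024]))  # 1024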

Algorithm to find most efficient base to store large integer

Very large integers are often stored as variable-length arrays of digits in memory, as opposed to a straightforward binary representation as is the case with most primitive 'int' or 'long' types, as in Java or C. With this in mind, I would be interested to know algorithm(s) that can compute:
At what size an integer becomes more efficient to store as a BigInteger (or equivalent arbitrary-precision arithmetic construct) with a given radix for the integer's digits;
Which radix would be most efficient to store the digits of this large integer.
I have mentioned 'efficiency'; by this, I mean I am mainly concerned with the amount of space such a BigInteger would consume, though I would also be interested to hear any comments on processing speed or time complexity.
An integer consumes the least space when stored in a raw binary format (unless it is a small value held in a needlessly wide type, such as storing 1 in a 128-bit long long). Storing it differently does not save any memory; other representations are used to make working with such integers easier.
If stored byte by byte, this translates into radix 256: 256 possible values per digit, as many as a byte can hold.
BigInt is never more efficient than one of the integer types directly supported by hardware. If you can use what's supported directly, use it.
Otherwise, use whatever the hardware supports most efficiently: likely a power of 2 or, often equivalently, binary.
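A quick Python comparison of raw binary storage against a one-byte-per-decimal-digit array (the numbers here are for a 4096-bit value):

    n = 2**4096 - 1  # a very large integer

    # Raw binary (radix 256): one byte per 8 bits.
    print((n.bit_length() + 7) // 8)  # 512 bytes

    # One byte per decimal digit (radix 10): far more space.
    print(len(str(n)))                # 1234 bytes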

How does the md5 hashing algorithm compress data to a fixed length?

I know that MD5 produces a 128-bit digest. My question is, how does it produce this fixed-length output from a message longer than 128 bits?
EDIT:
I now have a greater understanding of hash functions. After reading this article I have realized that hash functions are one-way, meaning that you can't convert the hash back to plaintext. I was under the misimpression that you could because of all the online services that convert hashes back to strings, but I have realized that those are just rainbow tables (collections of strings mapped to their pre-computed hashes).
When you generate an MD5 hash, you're not compressing the input data. Compression implies that you'll be able to decompress it back to its original state. MD5, on the other hand, is a one-way process. This is why it's used for password storage; you ideally have to know the original input string to be able to generate the same MD5 result again.
This page provides a nice graphic-equipped explanation of MD5 and similar hash functions, and how they're used: An Illustrated Guide to Cryptographic Hashes
Consider something like starting with a 128-bit value, and taking input 128 bits at a time, and XORing each of those input blocks with the existing value.
MD5 is considerably more complex than that, but the general idea is the same: input is processed 128 bits at a time. Each input block can change the value of the result, but has no effect on the length.
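A toy sketch of that structure in Python (this is emphatically not MD5; a real compression function mixes each block far more thoroughly than a plain XOR):

    def toy_block_hash(message: bytes) -> bytes:
        state = bytearray(16)  # fixed 128-bit state
        # Pad with zero bytes to a multiple of the block size.
        padded = message + b"\x00" * (-len(message) % 16)
        # Fold the input into the state 16 bytes (128 bits) at a time.
        for i in range(0, len(padded), 16):
            block = padded[i:i + 16]
            state = bytearray(s ^ b for s, b in zip(state, block))
        return bytes(state)  # always 16 bytes, whatever the input size

    print(len(toy_block_hash(b"any length of input at all...")))  # 16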
It has nothing (or, rather, little) to do with compression. There is an algorithm which produces, for every initial state and input byte, a new state. This state is more or less unique to that combination of inputs.
In short, the message is split into many parts and an operation is applied to each part in turn.
If you are wondering about collisions, consider that your message is usually readable text.
The space of all possible bit patterns is much bigger than the space of readable characters.

Are fragments of hashes collision-resistant?

If you only use the first 4 bytes of an MD5 hash, would that mean theoretically only 1 in 255^4 chance of collision? That is, are hashes designed such that you only have to use a small portion of the returned hash (say the hash is of a file of some size)?
Remember that, even without considering a smart attacker deliberately trying to cause collisions, you need to start worrying about accidental collisions once the number of objects you're hashing gets comparable to the square root of the hash space... just a few tens of thousands of objects for a 32-bit hash key. This comes from the so-called birthday paradox.
It is 256, not 255.
Assuming that MD5 is a secure hash function (it turns out it is not secure, but, for the sake of the discussion, let's suppose that it is secure), then it should behave like a random oracle, a mythical object which outputs uniformly random values, under the sole constraint that it "remembers" its previous outputs and returns the same value again, given the same input.
Truncating the output of a random oracle yields another random oracle. Thus, if you keep 32 bits, then the probability of a collision with two distinct input messages is 1 in 2^32 (i.e. 1 in 256^4).
Now there is a thing known as the birthday paradox which says that, with about 2^16 distinct inputs, there are good chances that two of the 2^16 corresponding outputs collide.
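A quick empirical check of that bound (truncating SHA-256 to 32 bits and feeding it 2^16 distinct inputs; per the birthday bound, the chance of seeing at least one collision in a run like this is roughly 40%):

    import hashlib

    seen = set()
    collisions = 0
    for i in range(2**16):  # about sqrt(2^32) inputs
        h = hashlib.sha256(str(i).encode()).digest()[:4]  # keep 32 bits
        if h in seen:
            collisions += 1
        seen.add(h)
    print("collisions:", collisions)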
MD5 has been shown to be insecure for some purposes -- in particular anything which is related to collisions. The current default recommendation is SHA-2 (a family of four functions, with output sizes 224, 256, 384 and 512 bits, respectively). A new (American) standard is currently being defined, through an open competition, under the code name SHA-3. This is a long process; the new function shall be chosen by mid-2012. Some of the remaining candidates (currently 14, out of an initial 51) are substantially faster than SHA-2, some approaching MD5 in performance, while being considerably more secure. But this is a bit new, so right now you should use SHA-2 by default.
Assume we have a pre-determined message1, and hash1 = md5(message1).
Now choose a message2 randomly, and set hash2 = md5(message2).
In theory there is a 1 in 256^4 chance that the first four bytes of hash2 match the first four bytes of the pre-determined hash1.
It is also supposed to be very hard for an attacker that knows message1 to come up with a different message2 that has the same hash. This is called second pre-image resistance. However, even with the full MD5, there are better than theoretical pre-image attacks.
MD5 is completely broken for collisions. This means it is quite feasible for an attacker (in a few hours) to come up with two messages with the same hash (let alone the same first four bytes). The attacker gets to choose both messages, but this can still cause major damage. See for instance the poisoned message example.
If you're generating unique identifiers, you might want to use a UUID instead. These are designed to minimize the chance of collisions so that in practice they should never occur.
If you're worried about filenames being too long, which is a peculiar thing to be concerned about when most operating systems support names as long as 255 characters, you can always split the filename into a path and filename component. This has the advantage of splitting up the files into different directories:
fdadda221fd71619e6c0139730b012577dd4de90
fdadda221fd71619e6c/0139730b012577dd4de90
fdad/da22/1fd7/1619/e6c0/1397/30b0/1257/7dd4/de90
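A small sketch of that kind of split (the grouping is arbitrary):

    digest = "fdadda221fd71619e6c0139730b012577dd4de90"

    # Two-level split: leading hex characters become directories.
    path = "/".join([digest[:4], digest[4:8], digest[8:]])
    print(path)  # fdad/da22/1fd71619e6c0139730b012577dd4de90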
Depends on the purpose of the hash.
Hash functions for use in hash tables tend to have more "randomness" in the lower bits (which are used to find the array index) than in the higher bits. Checksum and cryptographic hash functions are more evenly distributed.
