Is there a two-way hashing algorithm in PHP? - algorithm

Disclaimer: I understand that a hash is not supposed to be reversible.
I've seen many people ask if there is a way to "unhash" text that is already hashed. However, I am not seeing a straight answer. Most answers state that MD5 and SHA-1 are one-way hashing algorithms, and therefore irreversible. That's great and all, but it raises the question: are all hashing algorithms one-way and irreversible?

A hash function is any function that can be used to map data of arbitrary size to data of fixed size. (source: Wikipedia)
Because the set of possible input values is infinite and the number of possible distinct output values is finite, the function produces the same output for an infinite number of input values. This means a hash is a lossy function: it discards information.
Assuming one could "reverse" the hashing, they would get an infinite set of possible original values. It is still impossible to tell what was the value used to generate the hash.
In mathematical terms, a hash function is not injective and this property automatically makes it not invertible.
All of the above apply to any hash function, no matter what language or library provides it.
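A small illustration of the argument above, offered as a sketch rather than anything from the original answer. It truncates SHA-256 to a single byte (an arbitrary choice made here so that collisions show up quickly); the helper name `tiny_hash` is hypothetical.

```python
import hashlib
from collections import defaultdict

def tiny_hash(data: bytes) -> int:
    """Hypothetical 8-bit hash: the first byte of SHA-256 (illustration only)."""
    return hashlib.sha256(data).digest()[0]

# Hash 1000 distinct inputs into at most 256 possible outputs.
buckets = defaultdict(list)
for i in range(1000):
    buckets[tiny_hash(str(i).encode())].append(i)

# By the pigeonhole principle, at least one output has several preimages,
# so the output alone cannot tell us which input was used.
largest = max(buckets.values(), key=len)
print(len(buckets), "distinct outputs;", len(largest), "inputs share one of them")
```

With an unbounded input space the same effect applies to every fixed-size hash, only with far larger numbers.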

Not really. The one absolutely non-negotiable property of a hash function is that it converts data of an arbitrary length to values of a fixed length. This means each possible result of your hashing function has infinitely many possible inputs that could produce it, making it impossible to reverse the hash function to a single value.
If you can place constraints on the length of your data input, then technically you could define a reversible hash function but I don't particularly see a use for it.

... are all hashing algorithms one-way and irreversible?
There are some real-world hash functions that can be reversed, such as the not-uncommon implementation of nominally hashing an 8, 16, 32 or 64-bit number by returning the input unchanged. Many C++ Standard Libraries, Python and other languages do exactly that, as it's often good enough for use by hash tables keyed on the numbers. The extra potential for collisions must be weighed against the time that would have been needed to generate a stronger hash, and indeed even the potential CPU-cache benefits of nearby keys hashing to nearby buckets.
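The identity behaviour is easy to observe in CPython. This is a sketch of that one implementation's behaviour, not a guarantee of the Python language.

```python
import sys

# CPython hashes small non-negative integers to themselves, so this
# particular "hash" is trivially reversible.
for n in (0, 1, 42, 1_000_000):
    assert hash(n) == n

# The identity mapping stops for large values: CPython reduces integers
# modulo a fixed prime (sys.hash_info.modulus), so distinct keys can collide.
m = sys.hash_info.modulus
collide = hash(5) == hash(5 + m)
print(collide)
```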
That said, your question starts...
I've seen many people ask if there is a way to "unhash" text that is already hashed.
For very short amounts of text, such as 8-character passwords, brute-force attacks using dictionaries and mutation rules (e.g. "try a dictionary word followed by each character from space (ASCII 32) through tilde (ASCII 126)", "try all combinations of replacing letters with similar-looking or -sounding numbers"...) can sometimes find the password likely used (though there's a small chance it's another password with the same hash value).
If the input wasn't based on a dictionary word or something else guessable, it's far less likely to be crackable.
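A minimal sketch of such a dictionary attack with two mutation rules, assuming the target was hashed with unsalted MD5. The word list, the substitution table `LEET` and all function names are made up for illustration.

```python
import hashlib

def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode()).hexdigest()

# Hypothetical "similar-looking digit" substitutions.
LEET = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5"})

def candidates(words):
    """Yield each word and its digit-substituted form, bare and with one trailing character."""
    for word in words:
        for base in (word, word.translate(LEET)):
            yield base
            for code in range(32, 127):  # space (32) through tilde (126)
                yield base + chr(code)

def crack(target_hex, words):
    """Return the first candidate whose MD5 matches, or None."""
    for guess in candidates(words):
        if md5_hex(guess) == target_hex:
            return guess
    return None

wordlist = ["letmein", "password", "dragon"]  # toy dictionary
target = md5_hex("p455w0rd!")
print(crack(target, wordlist))
```

A password that is not derived from anything in the word list is never generated as a candidate, so the search returns `None` for it.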
For longer amounts of text, it's increasingly impractical to find any input with matching hash value, and massively less likely that any such input would actually be the one originally used to generate the hash (with longer inputs, more of them will - on average - map to any given hash value). Once the text input is dozens of times longer than the hash value, it's totally impractical (unless perhaps quantum computing develops significantly). (Note that Microsoft's C++ compiler's std::hash<std::string> only combines 10 characters evenly spaced along any string to form the hash value, so longer strings don't increase the quality of the hash, but on the other hand the hash only provides any insight at all into the max 10 characters chosen to form it).
Most answers state that MD5 and SHA-1 are one-way hashing algorithms, and therefore irreversible.
Hashes intended for password storage (as distinct from hash table use, and from general-purpose cryptographic hashes, which are designed to be fast) should deliberately take a relatively long time to calculate (some goodly fraction of a second on likely hardware), so that the brute-force dictionary attacks mentioned above are prohibitively compute-intensive even for short textual strings. This helps make them practically irreversible. Even reasonable checksum-strength hash functions will be hard to reverse once there are more bytes of input than there are bytes in the hash value, rapidly becoming practically irreversible as the input gets larger and larger.

Related

In Hashing, can't we find AT LEAST one original text hashing to the given hash value

I have a basic question about hashing. It is said that hashing is one-way. My doubt is this: if we simply reverse the steps in the program/algorithm/logic, can't we find at least one input which hashes to the given output hash value?
I found 2 related posts, but I am still not completely clear:
How is one way hashing possible?
How do one-way hash functions work? (Edited)
I have the same question as the comment to the accepted answer in the first post:
"Well, but if I want to bypass a password check it suffices to find one string that hashes to the same value as the original password". Does this comment hold water?
What you're describing is a preimage attack: finding some input that produces a given hash value. (A "hash collision" is the related situation in which two different inputs share the same hash.)
And you're right to think that if one could find an efficient method to determine inputs for a given hash function that produce a desired output, this would break a lot of systems (https://en.wikipedia.org/wiki/Preimage_attack)
That's where the meat and bones of cryptographically secure hash functions come in. They are built in such a way that it is very, very difficult to find a preimage that produces a desired hash.
Over time mathematicians and cryptologists chip away at those hashes, and quite a number of hash functions that were used for securing things have been broken (MD4, MD5, SHA-1).
Also it's important to differentiate between hashes that are intended to check the integrity of messages, and hashes that are intended to protect secrets.
For integrity checking you want fast hashes, so that you can put a lot of data through them with minimal effort. MD5, SHA-1, SHA-2 are such hashes.
For secret keeping you want hashes that are slower than molasses, so that one can't easily brute-force through dictionaries or other predictable patterns of a secret. scrypt, bcrypt, Argon2 and many-round PBKDF2 schemes are such hashes.
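To make the distinction concrete, here is a sketch using only Python's standard `hashlib`: a single SHA-256 pass for integrity checking next to PBKDF2 with many rounds for secrets. The iteration count of 200,000 and the helper names are assumptions chosen for illustration, not a recommendation from the answer.

```python
import hashlib
import hmac
import os

def fast_digest(data: bytes) -> bytes:
    """One SHA-256 pass: cheap, suitable for integrity checks on bulk data."""
    return hashlib.sha256(data).digest()

def slow_digest(password: bytes, salt: bytes, iterations: int = 200_000) -> bytes:
    """PBKDF2-HMAC-SHA256: deliberately expensive, suitable for secrets."""
    return hashlib.pbkdf2_hmac("sha256", password, salt, iterations)

def verify(password: bytes, salt: bytes, expected: bytes) -> bool:
    """Recompute the slow digest and compare in constant time."""
    return hmac.compare_digest(slow_digest(password, salt), expected)

salt = os.urandom(16)                      # per-password random salt
stored = slow_digest(b"correct horse", salt)
print(verify(b"correct horse", salt, stored), verify(b"wrong guess", salt, stored))
```

Every guess an attacker makes costs the full iteration count, which is what makes dictionary attacks expensive.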
The operations in a cryptographic hash function are so complex, and there are so many of them, that reversing the function (computing at least one valid input for a given output) is computationally infeasible. It doesn't matter if you do that reversing by hand or with the help of some sort of algorithmic solver. This is called (first) preimage resistance, and this is what cryptographers attack when a new hash function is proposed. If the hash function has stood the test of time, it is considered secure.
On the other hand it is much easier to just generate a bunch of candidate passwords and run the known hash function over them to check for equality with the given output. Humans are pretty bad at generating good passwords or passphrases. Have a look at this talk.
In Hashing, can't we find AT LEAST one original text hashing to the given hash value
In that context, "finding" as in brute forcing the input space is easier than attacking the hash function itself.
There's a very simple way of giving a hash function that is not reversible:
int GetHashCode(byte[] myData)
{
    // Every input maps to the same value, so nothing about the input can be recovered.
    return 1;
}
This is a perfectly valid hash function, as it maps the contents of an arbitrary data set to a much smaller domain (int in this case). It satisfies the condition that the same input data gives the same output data.
It is obvious that this function is not reversible.
(Of course, this hash function is not suitable for securing anything, but that's only one application of hash functions)

Using multiple hash outputs in iterations?

Is there a known or perceived weakness to using the output of other hash algorithms as input for the next hash iteration?
Of course double hashing is not recommended, but this is not the same as double hashing.
Example:
I take a "secret" input and I hash it with SHA256, SHA384, and RIPEMD160 separately. I then combine the output of each into a single long string to use as input for a SHA512 hash. I then repeat this process a number of times.
In my mind, doing this significantly expands the length of the input into the SHA512 and essentially makes brute force even more infeasible.
Additionally, I considered using a 4th hash function merely to generate a value which could then be used to vary the length of the combined input string, by possibly discarding a few bytes in an unpredictable manner, so that the input is not a constant size. I'm not entirely sure that would be of any benefit.
Thoughts?
An answer to this question depends heavily on the attack scenario.
Of course double hashing is not recommended, but this is not the same as double hashing.
I would say: No! If you are storing passwords using a hash function, the attack on the store will be harder, if you use multiple rounds (feeding the output of round n as input for round n+1). Bitcoin as another example uses 2 passes (see here and here). For additional info see Why hashing twice?
by possibly discarding a few bytes in an unpredictable manner, so that the input is not a constant size. I'm not entirely sure that would be of any benefit.
That counteracts the way hash functions are designed. You want the function to produce the same output using the same input. Lifting this relationship basically destroys all use from the function. You could use a random number generator instead. See also: Does the MD5 algorithm always generate the same output for the same string? or Is sha-1 hash always the same?
In my mind, doing [...] essentially makes brute force even more infeasible.
The quoted statement is correct, but the reasoning is flawed. It makes brute force harder, because an attacker has to compute 4 functions instead of one. And she cannot use rainbow tables, because they aren't generated for your setup.
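For reference, the scheme from the question can be sketched as follows. One substitution is made here and should be noted: RIPEMD-160 is not reliably available through Python's `hashlib`, so a 160-bit BLAKE2b digest stands in for it. The function names and the round count are hypothetical.

```python
import hashlib

def combined_round(data: bytes) -> bytes:
    """One round: hash the input three ways, concatenate, then SHA-512 the result."""
    parts = [
        hashlib.sha256(data).digest(),
        hashlib.sha384(data).digest(),
        # Stand-in for RIPEMD-160 (assumption: any 160-bit hash illustrates the idea).
        hashlib.blake2b(data, digest_size=20).digest(),
    ]
    return hashlib.sha512(b"".join(parts)).digest()

def iterate(secret: bytes, rounds: int = 1000) -> bytes:
    """Feed each round's output into the next round."""
    state = secret
    for _ in range(rounds):
        state = combined_round(state)
    return state

print(iterate(b"secret", rounds=10).hex())
```

The construction stays deterministic, which is the property the variable-length idea from the question would break. The extra cost to an attacker comes from the number of hash evaluations per guess, not from the longer SHA-512 input.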
Wild guess: If you are using the mentioned setup to store and verify passwords, don't do it. Use PBKDF2 or bcrypt for that. See Password Storage Cheat Sheet

Are fragments of hashes collision-resistant?

If you only use the first 4 bytes of an MD5 hash, would that mean theoretically only 1 in 255^4 chance of collision? That is, are hashes designed such that you only have to use a small portion of the returned hash (say the hash is of a file of some size)?
Remember that, even without considering a smart attacker deliberately trying to cause collisions, you need to start worrying about accidental collisions once the number of objects you're hashing gets comparable to the square root of the hash space... just a few tens of thousands of objects for a 32-bit hash key. This comes from the so-called birthday paradox.
It is 256, not 255.
Assuming that MD5 is a secure hash function (it turns out it is not secure, but, for the sake of the discussion, let's suppose that it is secure), then it should behave like a random oracle, a mythical object which outputs uniformly random values, under the sole constraint that it "remembers" its previous outputs and returns the same value again, given the same input.
Truncating the output of a random oracle yields another random oracle. Thus, if you keep 32 bits, then the probability of a collision with two distinct input messages is 1 in 2^32 (i.e. 1 in 256^4).
Now there is a thing known as the birthday paradox which says that, with about 2^16 distinct inputs, there are good chances that two of the 2^16 corresponding outputs collide.
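The birthday estimate can be checked numerically. This sketch uses the standard approximation 1 - exp(-n(n-1)/(2N)) for n random values drawn from a space of size N; the function name is hypothetical.

```python
import math

def collision_probability(n_items: int, bits: int) -> float:
    """Approximate chance that at least two of n_items uniformly random
    values of the given bit width collide."""
    space = 2 ** bits
    return 1.0 - math.exp(-n_items * (n_items - 1) / (2 * space))

# 2^16 inputs hashed to 32 bits (the first 4 bytes of a digest).
p = collision_probability(2 ** 16, 32)
print(f"{p:.3f}")  # 0.393
```

So with about 65,000 inputs a 32-bit truncated hash already has roughly a 39% chance of containing at least one collision.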
MD5 has been shown to be insecure for some purposes -- in particular anything which is related to collisions. The current default recommendation is SHA-2 (a family of four functions, with output sizes 224, 256, 384 and 512 bits, respectively). A new (American) standard is currently being defined, through an open competition, under the code name SHA-3. This is a long process; the new function shall be chosen by mid-2012. Some of the remaining candidates (currently 14, out of an initial 51) are substantially faster than SHA-2, some approaching MD5 in performance, while being considerably more secure. But this is a bit new, so right now you should use SHA-2 by default.
Assume we have a pre-determined message1. hash1 = md5(message1)
Now choose a message2 randomly, and set hash2 = md5(message2).
In theory there is a 1/256^4 chance that the first four bytes of hash2 match the first four bytes of the pre-determined hash1.
It is also supposed to be very hard for an attacker that knows message1 to come up with a different message2 that has the same hash. This is called second pre-image resistance. However, even with the full MD5, there are better than theoretical pre-image attacks.
MD5 is completely broken for collisions. This means it is quite feasible for an attacker (in a few hours) to come up with two messages with the same hash (let alone the same first four bytes). The attacker gets to choose both messages, but this can still cause major damage. See for instance the poisoned message example.
If you're generating unique identifiers, you might want to use a UUID instead. These are designed to minimize the chance of collisions so that in practice they should never occur.
If you're worried about filenames being too long, which is a peculiar thing to be concerned about when most operating systems support names as long as 255 characters, you can always split the filename into a path and filename component. This has the advantage of splitting up the files into different directories:
fdadda221fd71619e6c0139730b012577dd4de90
fdadda221fd71619e6c/0139730b012577dd4de90
fdad/da22/1fd7/1619/e6c0/1397/30b0/1257/7dd4/de90
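The two layouts shown above can be produced with a couple of small helpers. This is a sketch; the function names and the split positions are assumptions read off the example.

```python
def split_once(hex_digest: str, at: int) -> str:
    """Split a digest into a directory part and a file part at position `at`."""
    return hex_digest[:at] + "/" + hex_digest[at:]

def shard_path(hex_digest: str, width: int = 4) -> str:
    """Break a digest into fixed-width path components."""
    return "/".join(hex_digest[i:i + width] for i in range(0, len(hex_digest), width))

digest = "fdadda221fd71619e6c0139730b012577dd4de90"
print(split_once(digest, 19))
print(shard_path(digest))
```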
Depends on the purpose of the hash.
Hash functions for use in hash tables tend to have more "randomness" in the lower bits (which are used to find the array index) than in the higher bits. Checksum and cryptographic hash functions are more evenly distributed.

Guessing the hash function?

I'd like to know which algorithm is employed. I strongly assume it's something simple and hopefully common. There's no lag in generating the results, for instance.
Input: any string
Output: 5 hex characters (0-F)
I have access to as many keys and results as I wish, but I don't know how exactly I could harness this to attack the function. Is there any method? If I knew any functions that converted to 5-chars to start with then I might be able to brute force for a salt or something.
I know for example that:
a=06a07
b=bfbb5
c=63447
(in case you have something in mind)
In normal use it converts random 32-char strings into 5-char strings.
The only way to derive a hash function from data is through brute force, perhaps combined with some cleverness. There are an infinite number of hash functions, and the good ones perform what is essentially one-way encryption, so it's a question of trial and error.
It's practically irrelevant that your function converts 32-character strings into 5-character hashes; the output is probably truncated. For fun, here are some perfectly legitimate examples, the last 3 of which are cryptographically terrible:
Use the MD5 hashing algorithm, which generates a 16-byte hash (32 hex characters), and use the 10th through the 14th characters.
Use the SHA-1 algorithm and take the last 5 characters.
If the input string is alphabetic, use the simple substitution A=1, B=2, C=3, ... and take the first 5 digits.
Find each character on your keyboard, measure its distance from the left edge in millimeters, and use every other digit, in reverse order, starting with the last one.
Create a stackoverflow user whose name is the 32-character string, divide 113 by the corresponding user ID number, and take the first 5 digits after the decimal. (But don't tell 'em I told you to do it!)
Depending on what you need this for, if you have access to as many keys and results as you wish, you might want to try a rainbow table approach. 5 hex chars is only about 1 million combinations (16^5 = 1,048,576). You should be able to brute-force generate a map of strings that match all of the resulting hashes in no time. Then you don't need to know the original string, just an equivalent string that generates the same hash; alternatively, you can brute-force entry by iterating over the roughly 1 million input strings.
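A sketch of that lookup-table idea. Since the real function is unknown, `short_hash` is a hypothetical stand-in (the last few hex characters of SHA-1), and the demonstration uses 3 hex characters so that it finishes quickly.

```python
import hashlib
from itertools import count

def short_hash(text: str, hex_chars: int) -> str:
    """Hypothetical stand-in for the unknown function: last hex_chars of SHA-1."""
    return hashlib.sha1(text.encode()).hexdigest()[-hex_chars:]

def build_table(hex_chars: int) -> dict:
    """Map every possible output to one input that produces it."""
    table = {}
    total = 16 ** hex_chars
    for i in count():
        candidate = f"in{i}"  # arbitrary generated inputs
        table.setdefault(short_hash(candidate, hex_chars), candidate)
        if len(table) == total:
            return table

table = build_table(3)        # 16^3 = 4096 outputs
preimage = table["abc"]       # some input whose short hash is "abc"
print(preimage, short_hash(preimage, 3))
```

For 5 hex characters the same loop has to fill 1,048,576 slots, which on average takes on the order of 15 million evaluations when candidates land on outputs at random. That is still a small job for a fast hash.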
Following on from a comment I just made to Pontus Gagge, suppose the hash algorithm is as follows:
Append some long, constant string to the input
Compute the SHA-256 hash of the result
Output the last 5 chars of the hash.
Then I'm pretty sure there's no computationally feasible way from your chosen-plaintext attack to figure out what the hashing function is. To even prove that SHA-256 is in use (assuming it's a good hash function, which as far as we currently know it is), I think you'd need to know the long string, which is only stored inside the "black box".
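The black box described in the three steps above might look like this. It is a sketch; `SECRET_SUFFIX` is a made-up placeholder for the constant that would be hidden inside the box.

```python
import hashlib

# Hypothetical constant; in the scenario above only the black box knows it.
SECRET_SUFFIX = "some long constant string known only to the black box"

def black_box_hash(text: str) -> str:
    """Append the constant, take SHA-256, keep the last 5 hex characters."""
    digest = hashlib.sha256((text + SECRET_SUFFIX).encode()).hexdigest()
    return digest[-5:]

for key in ("a", "b", "c"):
    print(key, black_box_hash(key))
```

Without `SECRET_SUFFIX`, an observer who sees only inputs and 5-character outputs has nothing to recompute and compare against.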
That said, if I knew any published 20-bit hash functions, then I'd be checking those first. But I don't know any: all the usual non-crypto string hashing functions are 32 bit, because that's the expected size of an integer type. You should perhaps compare your results to those of CRC, PJW, and BUZ hash on the same strings, as well as some variants of DJB hash with different primes, and any string hash functions built in to well-known programming languages, like java.lang.String.hashCode. It could be that the 5 output chars are selected from the 8 hex chars generated by one of those.
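Checking a few well-known 32-bit string hashes against the known pairs can be automated. This sketch tests only CRC-32, a Java-style `hashCode` and DJB2, and only looks for the 5 characters as a contiguous run inside the 8 hex digits, which is a narrower test than "selected from". The candidate list and helper names are assumptions.

```python
import zlib

KNOWN = {"a": "06a07", "b": "bfbb5", "c": "63447"}  # pairs quoted in the question

def java_hash_code(text: str) -> int:
    """Java's String.hashCode as an unsigned 32-bit value (BMP characters only)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h

def djb2(text: str, multiplier: int = 33) -> int:
    """DJB2-style hash; vary `multiplier` to try other variants."""
    h = 5381
    for ch in text:
        h = (h * multiplier + ord(ch)) & 0xFFFFFFFF
    return h

CANDIDATES = {
    "crc32": lambda s: zlib.crc32(s.encode()),
    "java_hash_code": java_hash_code,
    "djb2": djb2,
}

def matching_candidates(known):
    """Names of candidates whose 8-digit hex output contains every expected value."""
    return [
        name
        for name, fn in CANDIDATES.items()
        if all(expected in f"{fn(text):08x}" for text, expected in known.items())
    ]

print(matching_candidates(KNOWN))
```

An empty result rules out only these exact variants; a salted or truncated cryptographic hash would not be detected this way.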
Beyond that (and any other well-known string hashes you can find), I'm out of ideas. To cryptanalyse a black box hash, you start by looking for correlations between the bits of the input and the bits of the output. This gives you clues what functions might be involved in the hash. But that's a huge subject and not one I'm familiar with.
This sounds mildly illicit.
Not to rain on your parade or anything, but if the implementors have done their work right, you wouldn't notice lags beyond a few tens of milliseconds on modern CPUs even with strong cryptographic hashes, and knowing the algorithm won't help you if they have used salt correctly. If you don't have access to the code or binaries, your only hope is a trivial mistake, whether caused by technical limitations or carelessness.
There is an uncountable infinity of potential (hash) functions for any given set of inputs and outputs, and if you have no clue better than an upper bound on their computational complexity (from the lag you detect), you have a very long search ahead of you...

Hashtables/Dictionaries that use floats/doubles

I read somewhere about other data structures similar to hashtables, dictionaries but instead of using ints, they were using floats/doubles, etc.
Does anyone know what they are?
If you mean using floats/doubles as keys in your hash, that's easy. For example, in .NET, it's just using Dictionary<double,MyValueType>.
If you're talking about having the hash be based off a double instead of an int....
Technically, you can have any element as your internal hash. Normally, this is done using an int or long, since these are fast, and the hashing algorithm is easy to compute.
However, the hash is really just a BitArray at heart, so anything would work. There really isn't much advantage to making this something other than an int or long, other than potentially allowing a larger set of hash values (ie: if you go to an 8 byte or larger type for your hash).
You mean as keys? That strikes me as tricky.
If you're using them as arbitrary keys, they're no better than integers.
If you expect to calculate a floating-point value and use it to look something up in a hash table, you're living very dangerously. Floating point numbers do not have infinite precision, and calculating the same thing in two slightly different ways can result in very tiny differences in the result. Hash keys rely on getting the exact same thing every time, so you'd have to be careful to round, and round in exactly the same way at all times. This is trickier than it sounds, by the way.
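The hazard is easy to reproduce. The sketch below uses a Python dict as the hash table; the `key` helper and its 9-digit rounding are assumptions for illustration.

```python
table = {0.3: "found"}

# Mathematically 0.3, but binary floating point yields 0.30000000000000004.
computed = 0.1 + 0.2
print(computed in table)  # False

def key(x: float, digits: int = 9) -> float:
    """Normalize a float before using it as a key (one possible convention)."""
    return round(x, digits)

rounded_table = {key(0.3): "found"}
print(key(computed) in rounded_table)  # True
```

Rounding only helps if every producer and consumer of keys applies exactly the same rule, and values that fall close to a rounding boundary can still end up on different sides of it.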
So, what would you do with floating-point hashes?
A hash algorithm is, in general terms, just a function that produces a smaller output from a larger input. Good hash functions have interesting properties like a large change in output for a small change in the input, and an assurance that they produce every possible output value for some input.
It's not hard to write a simple polynomial type hash function that outputs a floating-point value, rather than an integer value, but it's difficult to ensure that the resulting hash function has the desired properties without getting into the details of the particular floating-point representation used.
At least part of the reason that hash functions are nearly always implemented in integer arithmetic is because proving various properties about an integer calculation is easier than doing the same for a floating point calculation.
It's fairly easy to prove that some (sum of prime factors) modulo (another prime) must, necessarily, produce every possible output for some input. Doing the same for a calculation with a bunch of floating-point fractions would be a drag.
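As a sketch of why the integer case is easy to reason about, here is a simple polynomial hash reduced modulo a prime. The base 31 and modulus 251 are arbitrary choices made for illustration. Single-byte inputs alone already reach every residue, which can be checked exhaustively.

```python
def poly_hash(data: bytes, base: int = 31, modulus: int = 251) -> int:
    """Polynomial hash in integer arithmetic: h = (h * base + byte) mod modulus."""
    h = 0
    for byte in data:
        h = (h * base + byte) % modulus
    return h

# Every output value 0..250 is produced by at least one input.
reached = {poly_hash(bytes([b])) for b in range(256)}
print(len(reached) == 251)
```

An equivalent check for a floating-point hash would have to account for rounding at every step, which is the difficulty the answer describes.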
Add to that the relative difficulty of storing and transmitting floating-point values without corruption, and it's just not worth it.
Your question history shows that you use .Net, so I'll answer in that context.
If you want a Dictionary that is type aware, such that you can specify it should use floats or doubles for the keys or values, use System.Collections.Generic.Dictionary<TKey, TValue> http://msdn.microsoft.com/en-us/library/xfhwa508.aspx
If you want a Dictionary that is type blind, such that you can use floats AND doubles for keys and values, use System.Collections.Hashtable http://msdn.microsoft.com/en-us/library/system.collections.hashtable.aspx
