How does the md5 hashing algorithm compress data to a fixed length?

I know that MD5 produces a 128-bit digest. My question is, how does it produce this fixed-length output from a message of 128 bits or more?
EDIT:
I now have a greater understanding of hash functions. After reading this article I have realized that hash functions are one-way, meaning that you can't convert the hash back to plaintext. I was under the mistaken impression that you could, because of all the online services converting hashes back to strings, but I have realised that those are just rainbow tables (collections of strings mapped to pre-computed hashes).

When you generate an MD5 hash, you're not compressing the input data. Compression implies that you'll be able to uncompress it back to its original state. MD5, on the other hand, is a one-way process. This is why it's used for password storage; you ideally have to know the original input string to be able to generate the same MD5 result again.
This page provides a nice graphic-equipped explanation of MD5 and similar hash functions, and how they're used: An Illustrated Guide to Cryptographic Hashes

Consider something like starting with a 128-bit value, and taking input 128 bits at a time, and XORing each of those input blocks with the existing value.
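MD5 is considerably more complex than that (it actually consumes the input in 512-bit blocks and mixes each one into a 128-bit internal state), but the general idea is the same: the input is processed one fixed-size block at a time. Each input block can change the value of the result, but has no effect on its length.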
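The toy XOR scheme described above can be sketched in a few lines. This is only an illustration of why the output length stays fixed; the name xor_fold and the 16-byte block size are assumptions made for the example, and the function has none of MD5's mixing or security properties.

```python
import hashlib


def xor_fold(message: bytes, block_size: int = 16) -> bytes:
    """Toy 'hash': XOR every block of the message into a fixed-size state."""
    state = bytearray(block_size)  # 16 bytes = 128 bits, all zero to start
    for start in range(0, len(message), block_size):
        block = message[start:start + block_size]
        # A short final block simply touches fewer state bytes.
        for i, byte in enumerate(block):
            state[i] ^= byte
    return bytes(state)


if __name__ == "__main__":
    for size in (0, 5, 16, 1000):
        data = b"a" * size
        # Both lengths stay at 16 bytes however long the input is.
        print(size, len(xor_fold(data)), len(hashlib.md5(data).digest()))
```

The state is the only thing that survives from one block to the next, so the result is always exactly as long as the state.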

It has nothing (or, better, little) to do with compression. There is an algorithm which produces, for every initial state and input byte, a new state. This state is more or less unique to this combination of inputs.

In short, it splits the message into many parts and performs an operation on each.
If you are wondering about collisions, consider that your message consists only of readable characters.
The bit space is much bigger than the readable character space.

Related

Using multiple hash outputs in iterations?

Is there a known or perceived weakness to using the output of other hash algorithms as input for the next hash iteration?
Of course double hashing is not recommended, but this is not the same as double hashing.
Example:
I take a "secret" input and I hash it with SHA256, SHA384, and RIPEMD160 separately. I then combine the output of each into a single long string to use as input for a SHA512 hash. I then repeat this process repeatedly for a number of times.
In my mind, doing this significantly expands the length of the input into the SHA512 and essentially makes brute force even more infeasible.
Additionally, I considered using a 4th hash function merely to generate a value which could then be used to vary the length of the combined input string, by possibly discarding a few bytes in an unpredictable manner, so that the input is not a constant size. I'm not entirely sure that would be of any benefit.
Thoughts?

An answer to this question depends heavily on the attack scenario.
Of course double hashing is not recommended, but this is not the same as double hashing.
I would say: No! If you are storing passwords using a hash function, the attack on the store will be harder, if you use multiple rounds (feeding the output of round n as input for round n+1). Bitcoin as another example uses 2 passes (see here and here). For additional info see Why hashing twice?
by possibly discarding a few bytes in an unpredictable manner, so that the input is not a constant size. I'm not entirely sure that would be of any benefit.
That counteracts the way hash functions are designed. You want the function to produce the same output using the same input. Lifting this relationship basically destroys all use from the function. You could use a random number generator instead. See also: Does the MD5 algorithm always generate the same output for the same string? or Is sha-1 hash always the same?
In my mind, doing [...] essentially makes brute force even more infeasible.
The quoted statement is correct, but the reasoning is flawed. It makes brute force harder, because an attacker has to compute 4 functions instead of one. And she cannot use rainbow tables, because they aren't generated for your setup.
Wild guess: If you are using the mentioned setup to store and verify passwords, don't do it. Use PBKDF2 or bcrypt for that. See Password Storage Cheat Sheet
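As a minimal sketch of the PBKDF2 recommendation above, Python's standard library exposes it as hashlib.pbkdf2_hmac. The function names, the 16-byte salt and the iteration count below are assumptions chosen for illustration, not values taken from the answer; consult current guidance before picking real parameters.

```python
import hashlib
import hmac
import os


def hash_password(password: str, iterations: int = 200_000):
    """Return (salt, iterations, digest) for storage alongside the user record."""
    salt = os.urandom(16)  # fresh random salt per password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, iterations, digest


def verify_password(password: str, salt: bytes, iterations: int,
                    expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return hmac.compare_digest(candidate, expected)
```

The iteration count plays the role of the repeated rounds discussed above, while the per-password salt is what defeats pre-computed tables.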

Is there a two-way hashing algorithm in PHP?

Disclaimer: I understand that a hash is not supposed to be reversible.
I've seen many people ask if there is a way to "unhash" text that is already hashed. However, I am not seeing a straight answer. Most answers state that MD5 and SHA-1 are one-way hashing algorithms, and therefore irreversible. That's great and all, but it raises the question: are all hashing algorithms one-way and irreversible?

A hash function is any function that can be used to map data of arbitrary size to data of fixed size. (source: Wikipedia)
Because the set of possible input values is infinite and the number of possible distinct output values is finite, the function produces the same output for an infinite number of input values. This means a hash is a lossy function: it discards information.
Assuming one could "reverse" the hashing, they would get an infinite set of possible original values. It is still impossible to tell what was the value used to generate the hash.
In mathematical terms, a hash function is not injective and this property automatically makes it not invertible.
All of the above apply to any hash function, no matter what language or library provides it.
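The pigeonhole argument above can be made concrete with a deliberately tiny hash. In this sketch tiny_hash is a hypothetical function (the first byte of SHA-256) with only 256 possible outputs, so 257 distinct inputs are guaranteed to contain a collision.

```python
import hashlib


def tiny_hash(data: bytes) -> int:
    """Hypothetical 8-bit hash: the first byte of the SHA-256 digest."""
    return hashlib.sha256(data).digest()[0]


def find_collision():
    """Return two different inputs with the same tiny_hash value."""
    seen = {}
    # 257 distinct inputs into 256 possible outputs: a repeat must occur.
    for n in range(257):
        message = str(n).encode("ascii")
        value = tiny_hash(message)
        if value in seen:
            return seen[value], message
        seen[value] = message
    raise AssertionError("unreachable: pigeonhole principle")
```

Given only the output value, nothing distinguishes the two inputs returned here, which is exactly why the function cannot be inverted.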

Not really. The one absolutely non-negotiable property of a hash function is it converts data of an arbitrary length to values of a fixed length. This means each possible result of your hashing function has infinitely many possible inputs that could produce it, making reversing the hash function to a single value impossible.
If you can place constraints on the length of your data input, then technically you could define a reversible hash function but I don't particularly see a use for it.

... are all hashing algorithms one-way and irreversible?
There are some real-world hash functions that can be reversed, such as the not-uncommon implementation of nominally hashing an 8, 16, 32 or 64-bit number by returning the input unchanged. Many C++ Standard Libraries, Python and other languages do exactly that, as it's often good enough for use by hash tables keyed on the numbers. The extra potential for collisions must be weighed against the time that would have been needed to generate a stronger hash, and indeed even the potential CPU-cache benefits of nearby keys hashing to nearby buckets.
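A quick way to see this identity behaviour, assuming the CPython implementation (other Python implementations may differ, and very large or negative integers are treated differently):

```python
# Small non-negative integers hash to themselves in CPython,
# so this particular "hash" is trivially reversible.
values = (0, 1, 42, 1000)
identity = {n: hash(n) for n in values}

if __name__ == "__main__":
    for n, h in identity.items():
        print(n, h)
```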
That said, your question starts...
I've seen many people ask if there is a way to "unhash" text that is already hashed.
For very short amounts of text, such as 8-character passwords, brute force attacks using dictionaries and mutation rules (e.g. "try a dictionary word followed by each character from space (ASCII 32) through tilde (126)", "try all combinations of replacing letters with similar-looking or -sounding numbers"...) can sometimes find the password likely used (though there's a small chance it's another password with the same hash value).
If the input wasn't based on a dictionary word or something else guessable, it's far less likely to be crackable.
For longer amounts of text, it's increasingly impractical to find any input with matching hash value, and massively less likely that any such input would actually be the one originally used to generate the hash (with longer inputs, more of them will - on average - map to any given hash value). Once the text input is dozens of times longer than the hash value, it's totally impractical (unless perhaps quantum computing develops significantly). (Note that Microsoft's C++ compiler's std::hash<std::string> only combines 10 characters evenly spaced along any string to form the hash value, so longer strings don't increase the quality of the hash, but on the other hand the hash only provides any insight at all into the max 10 characters chosen to form it).
Most answers state that MD5 and SHA-1 are one-way hashing algorithms, and therefore irreversible.
Hashes suitable for password storage (as distinct from hash table use) should inherently take a relatively long time to calculate (some goodly fraction of a second on likely hardware), so that the brute-force dictionary attacks mentioned above are prohibitively compute-intensive even for short textual strings. This helps make them practically irreversible. (General-purpose cryptographic hashes such as MD5 and SHA-1 are themselves fast to compute, which is why they are a poor choice for passwords on their own.) Even reasonable checksum-strength hash functions will be hard to reverse once there are more bytes of input than there are bytes in the hash value, rapidly becoming practically irreversible as the input gets larger and larger.

Chicken/Egg problem: Hash of file (including hash) inside file! Possible?

Thing is I have a file that has room for metadata. I want to store a hash for integrity verification in it. Problem is, once I store the hash, the file and the hash along with it changes.
I perfectly understand that this is by definition impossible with one way cryptographic hash methods like md5/sha.
I am also aware of the possibility of containers that store verification data separated from the content as zip & co do.
I am also aware of the possibility to calculate the hash separately and send it along with the file or to append it at the end or somewhere where the client, when calculating the hash, ignores it.
This is not what I want.
I want to know whether there is an algorithm where it's possible to get the resulting hash from data in which the result of the hash itself is included.
It doesn't need to be cryptographic or fulfill a lot of criteria. It can also be based on some heuristics that deliver the desired result after a realistic amount of time.
I am really not so into mathematics, but couldn't there be some really advanced exponential modulo polynomial cyclic back-reference division stuff that makes this possible?
And if not, what is the proof against it (if there is one)?
The reason why I need this is that I (ultimately) want to store a hash along with MP4 files. It's complicated, but other solutions are not easy to implement as the file passes through a badly designed production pipeline...

It's possible to do this with a CRC, in a way. What I've done in the past is to set aside 4 bytes in a file as a placeholder for a CRC32, filling them with zeros. Then I calculate the CRC of the file.
It is then possible to fill the placeholder bytes to make the CRC of the file equal to an arbitrary fixed constant, by computing numbers in the Galois field of the CRC polynomial.
(Further details possible but not right at this moment. You basically need to compute (CRC_desired - CRC_initial) * 2^(-8*byte_offset) in the Galois field, where byte_offset is the number of bytes between the placeholder bytes and the end of the file.)
Note: as per @KeithS's comments, this solution is not meant to protect against intentional tampering. We used it on one project as a means to tie metadata within an embedded system to the executable used to program it, in order to detect an inadvertent mismatch between an embedded system and the file used to program it. (The embedded system itself does not have direct knowledge of the file(s) used to program it, and therefore cannot calculate a CRC or hash itself.) In later systems I've just used UUIDs.
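The placeholder idea can be sketched without the closed-form Galois-field arithmetic. The version below is a swapped-in technique, not the answerer's formula: it relies on the fact that, for a fixed file length, CRC-32 is affine over GF(2), probes the effect of each of the 32 placeholder bits with zlib.crc32, and solves the resulting linear system by Gaussian elimination. The names solve_gf2 and patch_crc are assumptions made for this example, and the approach costs 33 CRC passes over the file.

```python
import zlib


def solve_gf2(columns, target):
    """Find x such that XOR of columns[i] over the set bits i of x equals target."""
    basis = {}  # leading bit position -> (vector, combination of column indices)
    for i, column in enumerate(columns):
        vector, combo = column, 1 << i
        while vector:
            top = vector.bit_length() - 1
            if top not in basis:
                basis[top] = (vector, combo)
                break
            basis_vector, basis_combo = basis[top]
            vector ^= basis_vector
            combo ^= basis_combo
    x, vector = 0, target
    while vector:
        top = vector.bit_length() - 1
        if top not in basis:
            raise ValueError("target CRC is not reachable")
        basis_vector, basis_combo = basis[top]
        vector ^= basis_vector
        x ^= basis_combo
    return x


def patch_crc(data: bytes, offset: int, desired: int) -> bytes:
    """Fill the 4 placeholder bytes at offset so that crc32(result) == desired."""
    base = bytearray(data)
    base[offset:offset + 4] = bytes(4)  # placeholder zeroed, as in the answer
    base_crc = zlib.crc32(base)

    def crc_with(value: int) -> int:
        trial = bytearray(base)
        trial[offset:offset + 4] = value.to_bytes(4, "little")
        return zlib.crc32(trial)

    # Effect of each single placeholder bit on the CRC (a linear map over GF(2)).
    columns = [crc_with(1 << bit) ^ base_crc for bit in range(32)]
    patch = solve_gf2(columns, desired ^ base_crc)
    base[offset:offset + 4] = patch.to_bytes(4, "little")
    return bytes(base)
```

For example, patch_crc(blob, 6, 0x12345678) returns a copy of blob whose bytes 6 to 9 have been chosen so that the CRC-32 of the whole result is 0x12345678.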

Of course this is possible, in a multitude of ways. However, it cannot prevent intentional tampering.
For example, let
hash(X) = sum of all 32-bit (non-overlapping) blocks of X modulo 65521.
Let
Z = X followed by the 32-bit unsigned integer (hash(X) * 65521)
Then
hash(Z) == hash(X) == (last 32 bits of Z) / 65521
The idea here is just that any 32-bit integer congruent to 0 modulo 65521 will have no effect on the hash of X. Then, since 65521 < 2^16, hash has a range less than 2^16, and there are at least 2^16 values less than 2^32 congruent to 0 modulo 65521. And so we can encode the hash into a 32-bit integer that will not affect the hash. You could actually use any number less than 2^16; 65521 just happens to be the largest such prime number.
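A runnable sketch of this construction follows. The function names, the big-endian word order and the zero padding to a multiple of 4 bytes are assumptions added so the example is complete. The trailer stores hash(X) * 65521, so the hash is recovered by dividing the trailer by 65521.

```python
import struct

MODULUS = 65521  # largest prime below 2**16


def pad_to_words(data: bytes) -> bytes:
    """Zero-pad so the data splits into whole 32-bit blocks (an assumption)."""
    return data + b"\x00" * (-len(data) % 4)


def block_hash(data: bytes) -> int:
    """Sum of all non-overlapping 32-bit blocks, modulo 65521."""
    total = 0
    for (word,) in struct.iter_unpack(">I", data):
        total = (total + word) % MODULUS
    return total


def embed(data: bytes) -> bytes:
    """Return Z = X followed by the 32-bit integer hash(X) * 65521."""
    padded = pad_to_words(data)
    # hash < 65521, so the product is below 65521**2 < 2**32 and fits in 32 bits.
    return padded + struct.pack(">I", block_hash(padded) * MODULUS)


def verify(blob: bytes) -> bool:
    """Check that the trailer encodes the hash of the whole blob, trailer included."""
    (trailer,) = struct.unpack(">I", blob[-4:])
    return trailer % MODULUS == 0 and trailer // MODULUS == block_hash(blob)
```

Because the trailer is a multiple of 65521, it contributes nothing to the sum modulo 65521, which is why verify can hash the file with the trailer still in place.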

I remember an old DOS program that was able to embed in a text file the CRC value of that file. However, this is possible only with simple hash functions.
Although in theory you could create such a file for any kind of hash function (given enough time or the right algorithm), the attacker would be able to use exactly the same approach. What's more, he would have a choice: either use exactly your approach to obtain such a file, or just get rid of the check.
It means that you now have two problems instead of one, and both have to be solved with the same level of complexity. It's up to you to decide whether it is worth it.
EDIT: you could consider hashing some intermediary results (like RAW decoded output, or something specific to your codec). In this way the decoder would have it anyway, but for another program it would be more difficult to compute.

No, not possible. You either use a separate file for hashes, à la md5sum, or the embedded hash covers only the "data" portion of the file.

The way the Nix package manager does this is as follows: when calculating the hash, you pretend that the place in the file where the hash goes contains some fixed value, such as 20 x's, rather than the hash of the file. Then you write the hash over those 20 x's. When you check the hash, you read the stored value and then again ignore it, pretending that it was just the fixed value of 20 x's while hashing.
They do this because the path at which a package is installed depends on the hash of the whole package. Since the hash is of fixed length, they set it to some fixed value and then replace it with the real hash, and when verifying they ignore the value they placed and pretend it's that fixed value.
But if you don't use such a method, it is impossible.
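The placeholder trick described above can be sketched as follows. This is not Nix's actual code: the 20-character placeholder follows the description, while the choice of SHA-256 truncated to 20 hex characters and the function names are assumptions made so the example is self-contained.

```python
import hashlib

PLACEHOLDER = b"x" * 20  # fixed value that stands in for the hash


def digest20(data: bytes) -> bytes:
    """Hypothetical 20-character digest: truncated SHA-256 hex."""
    return hashlib.sha256(data).hexdigest()[:20].encode("ascii")


def embed_hash(template: bytes) -> bytes:
    """Hash the file with the placeholder in place, then overwrite the placeholder."""
    if template.count(PLACEHOLDER) != 1:
        raise ValueError("template must contain the placeholder exactly once")
    return template.replace(PLACEHOLDER, digest20(template))


def verify(blob: bytes, offset: int) -> bool:
    """Put the placeholder back at offset, re-hash, and compare with the stored value."""
    stored = blob[offset:offset + len(PLACEHOLDER)]
    restored = blob[:offset] + PLACEHOLDER + blob[offset + len(PLACEHOLDER):]
    return digest20(restored) == stored
```

Both sides have to agree on the placeholder value and on where it sits in the file; the hash is strictly a hash of the file with the placeholder substituted, not of the file as stored.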

It depends on your definition of "hash". As you state, obviously with any pseudo-random hash this would be impossible (in a reasonable amount of time).
Equally obvious, there are of course trivial "hashes" where you can do this. Data with an odd number of bits set to 1 hash to 00 and an even number of 1s hash to 11, for example. The hash doesn't modify the odd/evenness of the 1 bits, so files hash the same when their hash is included.

Guessing the hash function?

I'd like to know which algorithm is employed. I strongly assume it's something simple and hopefully common. There's no lag in generating the results, for instance.
Input: any string
Output: 5 hex characters (0-F)
I have access to as many keys and results as I wish, but I don't know how exactly I could harness this to attack the function. Is there any method? If I knew any functions that converted to 5-chars to start with then I might be able to brute force for a salt or something.
I know for example that:
a=06a07
b=bfbb5
c=63447
(in case you have something in mind)
In normal use it converts random 32-char strings into 5-char strings.

The only way to derive a hash function from data is through brute force, perhaps combined with some cleverness. There are an infinite number of hash functions, and the good ones perform what is essentially one-way encryption, so it's a question of trial and error.
It's practically irrelevant that your function converts 32-character strings into 5-character hashes; the output is probably truncated. For fun, here are some perfectly legitimate examples, the last 3 of which are cryptographically terrible:
Use the MD5 hashing algorithm, which generates a 16-byte hash (32 hex characters), and use the 10th through the 14th characters.
Use the SHA-1 algorithm and take the last 5 characters.
If the input string is alphabetic, use the simple substitution A=1, B=2, C=3, ... and take the first 5 digits.
Find each character on your keyboard, measure its distance from the left edge in millimeters, and use every other digit, in reverse order, starting with the last one.
Create a Stack Overflow user whose name is the 32-character string, divide 113 by the corresponding user ID number, and take the first 5 digits after the decimal point. (But don't tell 'em I told you to do it!)
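The first two schemes in the list above are easy to try against known input/output pairs. A minimal sketch, assuming the input is encoded as UTF-8 and the digest is rendered as lowercase hex (both are assumptions, as are the function names):

```python
import hashlib


def md5_chars_10_to_14(text: str) -> str:
    """10th through 14th hex characters of the MD5 digest (1-based, inclusive)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()[9:14]


def sha1_last_5(text: str) -> str:
    """Last 5 hex characters of the SHA-1 digest."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()[-5:]


if __name__ == "__main__":
    for sample in ("a", "b", "c"):
        print(sample, md5_chars_10_to_14(sample), sha1_last_5(sample))
```

Comparing such outputs with the observed 5-character values is cheap, but every variation (different slice, different encoding, an added salt) is another guess to test.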

Depending on what you need this for, if you have access to as many keys and results as you wish, you might want to try a rainbow table approach. 5 hex chars is only about 1 million combinations (16^5 = 1,048,576). You should be able to brute-force generate a map of strings that match all of the resulting hashes in no time. Then you don't need to know the original string, just an equivalent string that generates the same hash, or you can brute-force entry by iterating over the roughly 1 million input strings.
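A sketch of building such a lookup table is shown below. Since the real function is unknown, hash5 is a hypothetical stand-in (the last 5 hex characters of SHA-256); in the scenario from the question you would replace it with calls to the black box itself.

```python
import hashlib


def hash5(text: str) -> str:
    """Hypothetical stand-in for the unknown 5-hex-character function."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[-5:]


def build_table(limit: int) -> dict:
    """Map each observed 5-character output to the first input that produced it."""
    table = {}
    for n in range(limit):
        candidate = str(n)
        table.setdefault(hash5(candidate), candidate)
    return table


if __name__ == "__main__":
    table = build_table(200_000)
    # With 16**5 possible outputs, coverage grows as more candidates are tried.
    print(len(table), "of", 16 ** 5, "outputs covered")
```

Any entry in the table gives an input that is equivalent to the original as far as the 5-character output is concerned.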

Following on from a comment I just made to Pontus Gagge, suppose the hash algorithm is as follows:
Append some long, constant string to the input
Compute the SHA-256 hash of the result
Output the last 5 chars of the hash.
Then I'm pretty sure there's no computationally feasible way from your chosen-plaintext attack to figure out what the hashing function is. To even prove that SHA-256 is in use (assuming it's a good hash function, which as far as we currently know it is), I think you'd need to know the long string, which is only stored inside the "black box".
That said, if I knew any published 20-bit hash functions, then I'd be checking those first. But I don't know any: all the usual non-crypto string hashing functions are 32 bit, because that's the expected size of an integer type. You should perhaps compare your results to those of CRC, PJW, and BUZ hash on the same strings, as well as some variants of DJB hash with different primes, and any string hash functions built in to well-known programming languages, like java.lang.String.hashCode. It could be that the 5 output chars are selected from the 8 hex chars generated by one of those.
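The comparison suggested above can be automated. The sketch below checks whether each observed 5-character output appears as a contiguous run inside the 8 hex characters produced by a candidate function. Only two candidates are included (CRC-32 via zlib, and a re-implementation of the java.lang.String.hashCode recurrence); the function names and the contiguous-substring test are assumptions, and a real search would also try other selections of 5 characters.

```python
import zlib


def crc32_hex(text: str) -> str:
    """CRC-32 of the UTF-8 bytes as 8 lowercase hex characters."""
    return format(zlib.crc32(text.encode("utf-8")), "08x")


def java_hashcode_hex(text: str) -> str:
    """h = 31*h + c, truncated to 32 bits (matches Java for BMP characters)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return format(h, "08x")


CANDIDATES = {"crc32": crc32_hex, "java_hashcode": java_hashcode_hex}


def matching_candidates(samples: dict) -> list:
    """Names of candidates whose output contains every observed value."""
    return [
        name
        for name, func in CANDIDATES.items()
        if all(observed.lower() in func(text) for text, observed in samples.items())
    ]
```

Calling matching_candidates({"a": "06a07", "b": "bfbb5", "c": "63447"}) with the pairs from the question returns the names of any candidates consistent with all three samples.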
Beyond that (and any other well-known string hashes you can find), I'm out of ideas. To cryptanalyse a black box hash, you start by looking for correlations between the bits of the input and the bits of the output. This gives you clues what functions might be involved in the hash. But that's a huge subject and not one I'm familiar with.

This sounds mildly illicit.
Not to rain on your parade or anything, but if the implementors have done their work right, you wouldn't notice lags beyond a few tens of milliseconds on modern CPUs even with strong cryptographic hashes, and knowing the algorithm won't help you if they have used a salt correctly. If you don't have access to the code or binaries, your only hope is a trivial mistake, whether caused by technical limitations or carelessness.
There is an uncountable infinity of potential (hash) functions for any given set of inputs and outputs, and if you have no clue better than an upper bound on their computational complexity (from the lag you detect), you have a very long search ahead of you...

Encryption algorithm that output byte by byte based on password and offset

Is there a well-known algorithm (one worth considering) that can encrypt/decrypt an arbitrary byte inside a file based on the password entered and the byte's offset inside the file?
(Databyte, Offset, Password) => EncryptedByte
(EncryptedByte, Offset, Password) => DataByte
And is there some fundamental weakness in this approach, or is it still theoretically possible to build it strong enough?
Update:
More details: any cryptographic algorithm has an input and an output. Many existing ones operate on large blocks. I want to operate on a single byte, but a system based only on that can do no more than remap byte values, which is weak by default. If we also take the position of the byte in the file into account, we could, for example, interpret the bits of the position value as a sequence of operations (0: xor, 1: shift) and produce the encrypted byte that way. But that is too simple; I'm looking for something stronger.

Maybe it's not very efficient but how about this:
for encryption use:
encryptedDataByte = Encrypt(offset,key) ^ dataByte
for decryption use:
dataByte = Encrypt(offset,key) ^ encryptedDataByte
Where Encrypt(offset,key) might be e.g. 3DES or AES (with padding the offset, if needed, and throwing away all but one result bytes)
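A standard-library sketch of this scheme is given below. Python has no built-in AES or 3DES, so HMAC-SHA256 is swapped in as the keyed function that plays the role of Encrypt(offset,key); that substitution, the PBKDF2 key derivation, the function names and the parameters are all assumptions made for illustration, not a vetted design.

```python
import hashlib
import hmac


def derive_key(password: str, salt: bytes, iterations: int = 100_000) -> bytes:
    """Stretch the password into a 32-byte key with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)


def keystream_byte(key: bytes, offset: int) -> int:
    """Stand-in for Encrypt(offset, key): keep one byte of HMAC(key, offset)."""
    block = hmac.new(key, offset.to_bytes(8, "big"), hashlib.sha256).digest()
    return block[0]  # all other result bytes are thrown away


def crypt_byte(data_byte: int, offset: int, key: bytes) -> int:
    """XOR with the keystream byte; the same call encrypts and decrypts."""
    return data_byte ^ keystream_byte(key, offset)
```

Because encryption and decryption are the same XOR, crypt_byte(crypt_byte(b, offset, key), offset, key) returns b. The sketch provides no integrity protection, and reusing the same key for two different files exposes the XOR of their contents.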

If you can live with block sizes of 16 byte, you can try the XTS-mode described in the wikipedia article about Disk encryption theory (the advantage being that some good cryptologists already looked at it).
If you really need byte-wise encryption, I doubt that there is an established solution. In the conference Crypto 2009 there was a talk about How to Encipher Messages on a Small Domain: Deterministic Encryption and the Thorp Shuffle. In your case the domain is a byte, and as this is a power of 2, a Thorp Shuffle corresponds to a maximally unbalanced Feistel network. Maybe one can build something using the position and the password as key, but I'd be surprised if a home-made solution will be secure.

You can use AES in Counter Mode, where you divide your input into blocks of 16 bytes (128 bits) and then basically encrypt a counter based on the block number to get 16 pseudo-random bytes that you can XOR with the plaintext. It is critically important never to use the same counter start value (and/or initialization vector) with the same key again, or you will open yourself up to an easy attack in which a simple XOR of the two ciphertexts cancels the keystream and reveals the XOR of the plaintexts.
You mention that you want to only operate on individual bytes, but this approach would give you that flexibility. Output Feedback Mode is another common one, but you have to be careful in its use.
You might consider using the EAX mode for better security. Also, make sure you're using something like PBKDF2 or scrypt to generate your encryption key from the password.
However, as with most cryptography related issues, it's much better to use a rigorously tested and evaluated library rather than rolling your own.

Basically what you need to do is generate some value X (probably 1 byte) based on the offset and password, and use this to encrypt/decrypt the byte at that offset. We'll call it
X = f(offset,password)
The problem is that an attacker that "knows something" about the file contents (e.g. the file is English text, or a JPEG) can come up with an estimate (or sometimes be certain) of what an X could be. So he has a "rough idea" about many X values, and for each of these he knows what the offset is. There is a lot of information available.
Now, it would be nice if all that information were of little use to the attacker. For most purposes, using a cryptographic hash function (like SHA-1) will give you a reasonable assurance of decent security.
But I must stress that if this is something critical, consult an expert.

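One possibility is a One Time Pad, possibly using the password to seed some pseudo-random number generator. One time pads theoretically achieve perfect secrecy, but there are some caveats: in particular, a pad generated from a password-seeded generator is really a stream cipher, and is only as strong as the generator and the password. It should do what you're looking for though.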
