What does 'salt' refer to in string-to-key (s2k) specifier?

What does 'salt' refer to in string-to-key (s2k) specifier?
It appears to be a random number generator to shake things up, but I would like to know what 'salt' stands for?
For example it is written:
3.6.1.2. Salted S2K
This includes a "salt" value in the S2K specifier -- some arbitrary
data -- that gets hashed along with the passphrase string, to help
prevent dictionary attacks.
Octet 0: 0x01
Octet 1: hash algorithm
Octets 2-9: 8-octet salt value
Salted S2K is exactly like Simple S2K, except that the input to the
hash function(s) consists of the 8 octets of salt from the S2K
specifier, followed by the passphrase.
But salt is not defined, although its meaning seems clear.

From Wikipedia:
In cryptography, a salt consists of random bits that are used as one of the inputs to a key derivation function. The other input is usually a password or passphrase. The output of the key derivation function is stored as the encrypted version of the password.
A salt is just some bits that are used to increase the security of the system. They help prevent pre-computed dictionary attacks.

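The salt can be any value that stays consistent between hashing and verification.
It can be a constant or the user ID; better if it includes both, and better still if it is random and unique per password.
It is used to prevent pre-generated rainbow tables from working.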
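As a rough illustration of the Salted S2K description quoted in the question, here is a minimal Python sketch. It assumes SHA-1 as the hash and a key no longer than one digest; the function name salted_s2k and the 16-byte key length are illustrative choices, not taken from the specification.

```python
import hashlib
import os


def salted_s2k(passphrase: bytes, salt: bytes, key_len: int = 16) -> bytes:
    """Hash the 8-octet salt followed by the passphrase and keep key_len bytes."""
    if len(salt) != 8:
        raise ValueError("salt must be exactly 8 octets")
    digest = hashlib.sha1(salt + passphrase).digest()
    if key_len > len(digest):
        # Assumption: this sketch does not handle keys longer than one digest.
        raise ValueError("key_len exceeds the digest size in this sketch")
    return digest[:key_len]


if __name__ == "__main__":
    # The same passphrase with two different salts yields two different keys,
    # so a table precomputed without the salt is of no use.
    salt_a, salt_b = os.urandom(8), os.urandom(8)
    print(salted_s2k(b"correct horse", salt_a).hex())
    print(salted_s2k(b"correct horse", salt_b).hex())
```

The salt is not secret; it travels in the specifier so that the same key can be derived again later.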

I think you're asking the origin of the term, not the definition.
Time for a round of folk etymology! (Until someone gives the real answer.)
My guess is that it is an analogy from cooking: the salt is an improving additive. And a little bit goes a long way.

Related

Why the most common prefix of hashed (SHA1) passwords is "00000"?

I was reading a post in Troy Hunt's blog (https://www.troyhunt.com/ive-just-launched-pwned-passwords-version-2/), about a feature called "Pwned Passwords" that checks if your password is in a database with more than 1 billion leaked passwords.
To do this check without passing your password, the client code hashes it and passes just the first five chars of this hash; the backend returns all the SHA1 hashes of the passwords that start with the prefix that you passed. Then, to check whether the hash of your password is in the database or not, the comparison is made in client code.
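The client side of that check might look roughly like the following Python sketch. The function names are made up for illustration, and the set of suffixes is a hypothetical stand-in for whatever the backend returns; no network call is made here.

```python
import hashlib
from typing import Set, Tuple


def split_sha1(password: str, prefix_len: int = 5) -> Tuple[str, str]:
    """Return (prefix, suffix) of the uppercase hex SHA-1 of the password."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:prefix_len], digest[prefix_len:]


def is_pwned(password: str, returned_suffixes: Set[str]) -> bool:
    """Compare locally against the suffixes returned for our prefix."""
    _, suffix = split_sha1(password)
    return suffix in returned_suffixes


if __name__ == "__main__":
    prefix, suffix = split_sha1("password")
    # Only `prefix` would be sent; `suffix` never leaves the client.
    print(prefix, len(suffix))
```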
And he put some info about the data of these hashed passwords...
Every hash prefix from 00000 to FFFFF is populated with data (16^5 combinations)
The average number of hashes returned is 478
The smallest is 381 (hash prefixes "E0812" and "E613D")
The largest is 584 (hash prefixes "00000" and "4A4E8")
In the comments, people were wondering if the presence of this "00000" is a coincidence or is math...
Could someone that understands the SHA1 algorithm explain it to us?
Well, since the passwords originally come from data breaches, my best guess is that the password table in one of the breached systems was sorted or clustered by the (unsalted -- those are the kind of folks who get their passwords stolen) SHA1 hash of the password. When the system was breached, the attackers started with the "00000" hashes and just didn't make it all the way through...
Or maybe the list that Troy used includes the first part of an SHA1 rainbow table (https://en.wikipedia.org/wiki/Rainbow_table)...
Or something like that. The basic idea is that the SHA1 hash of the passwords was part of the password selection process.
It's either a coincidence, or (less likely) an artifact/error in acquiring or assembling the results for publication.
Not that it looks like a significant outlier. The spread that's described (381 min, 478 average, 584 max) seems like an even spread for the sample size. A graph of the entire corpus would probably look pretty random.
Like any reasonably constructed hashing algorithm, character frequency in SHA1 results should be randomly distributed. (If SHA1 had some kind of bias, this would be major news in the math and cryptography/cryptology community!)
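A back-of-the-envelope check of the "even spread" claim, as a Python sketch. It assumes the hashes land in prefixes uniformly at random, so each prefix count is roughly Poisson with mean 478 (standard deviation about the square root of 478), and it uses a normal approximation for the tails. Both are simplifying assumptions, not something stated in the blog post.

```python
import math

BUCKETS = 16 ** 5   # hash prefixes 00000 through FFFFF
MEAN = 478          # average hashes per prefix, from the blog post


def z_score(count: int) -> float:
    """Distance from the mean in standard deviations (sigma = sqrt(mean))."""
    return (count - MEAN) / math.sqrt(MEAN)


def expected_buckets_beyond(count: int) -> float:
    """Expected number of prefixes at least this far from the mean, one side."""
    tail = 0.5 * math.erfc(abs(z_score(count)) / math.sqrt(2))
    return BUCKETS * tail


for count in (381, 584):
    print(count, round(z_score(count), 2), round(expected_buckets_beyond(count), 2))
```

Under these assumptions both extremes sit between four and five standard deviations from the mean, which is about where the most extreme of roughly a million buckets would be expected to fall.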
Someone would need to check my guess against the SHA1 algorithm (and Troy may have already debunked it since, as per his blog answer, he "took a peek at the [plain text] passwords"). But since passwords are just letters, digits, and a limited set of symbols as encoded in ASCII, creating a hash will ALWAYS start working with a first bit of zero (ASCII covers 0-127, and the printable letters, digits, and symbols sit in the 32-126 range, so the first bit of every 8 bits is always zero). While it is the function of a hash to gloss over this, I suspect predictable bit positioning isn't as easy to obfuscate as one expects. The two most frequent prefixes tie, and their leading hex digits are 0 (0000 in bit form) and 4 (0100), so both start with a zero bit and are mostly zeros.
Also note that the two least frequent hash prefixes both start with E, which is 1110 in binary, so they are almost exact opposites in construction (1s vs 0s) AND frequency (low vs high). That implies the presence of zero bits may be a side effect of either the algorithm outright (doubtful) or, more probably, of the algorithm operating on a limited subset skewed by convention; in other words, letters and digits occupy only a third to a quarter of the full range covered by ASCII.
Of course we could go "tin foil hat" with this convo, but I'd bet coincidence and ASCII are more to blame than that man on the grassy knoll.

Using multiple hash outputs in iterations?

Is there a known or perceived weakness to using the output of other hash algorithms as input for the next hash iteration?
Of course double hashing is not recommended, but this is not the same as double hashing.
Example:
I take a "secret" input and I hash it with SHA256, SHA384, and RIPEMD160 separately. I then combine the output of each into a single long string to use as input for a SHA512 hash. I then repeat this process a number of times.
In my mind, doing this significantly expands the length of the input into the SHA512 and essentially makes brute force even more infeasible.
Additionally, I considered using a 4th hash function merely to generate a value which could then be used to vary the length of the combined input string, by possibly discarding a few bytes in an unpredictable manner, so that the input is not a constant size. I'm not entirely sure that would be of any benefit.
Thoughts?
An answer to this question depends heavily on the attack scenario.
Of course double hashing is not recommended, but this is not the same as double hashing.
I would say: No! If you are storing passwords using a hash function, the attack on the store will be harder, if you use multiple rounds (feeding the output of round n as input for round n+1). Bitcoin as another example uses 2 passes (see here and here). For additional info see Why hashing twice?
by possibly discarding a few bytes in an unpredictable manner, so that the input is not a constant size. I'm not entirely sure that would be of any benefit.
That counteracts the way hash functions are designed. You want the function to produce the same output using the same input. Lifting this relationship basically destroys all use from the function. You could use a random number generator instead. See also: Does the MD5 algorithm always generate the same output for the same string? or Is sha-1 hash always the same?
In my mind, doing [...] essentially makes brute force even more infeasible.
The quoted statement is correct, but the reasoning is flawed. It makes brute force harder, because an attacker has to compute 4 functions instead of one. And she cannot use rainbow tables, because they aren't generated for your setup.
Wild guess: If you are using the mentioned setup to store and verify passwords, don't do it. Use PBKDF2 or bcrypt for that. See Password Storage Cheat Sheet
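For reference, a minimal PBKDF2 sketch using only Python's standard library. The iteration count, salt length, and function names are assumptions chosen for illustration; pick real parameters from current guidance such as the cheat sheet mentioned above.

```python
import hashlib
import hmac
import os
from typing import Tuple


def hash_password(password: str, iterations: int = 200_000) -> Tuple[bytes, int, bytes]:
    """Return (salt, iterations, derived_key); store all three."""
    salt = os.urandom(16)  # fresh random salt per password
    derived = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, iterations, derived


def verify_password(password: str, salt: bytes, iterations: int, expected: bytes) -> bool:
    """Recompute with the stored salt and compare in constant time."""
    derived = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(derived, expected)
```

The many internal iterations give the "multiple rounds" slowdown discussed above without a hand-built chain of different hash functions.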

How does the md5 hashing algorithm compress data to a fixed length?

I know that MD5 produces a 128-bit digest. My question is, how does it produce this fixed length output from a message of 128bits+?
EDIT:
I now have a greater understanding of hash functions. After reading this article I have realized that hash functions are one-way, meaning that you can't convert the hash back to plaintext. I was under the misimpression that you could, due to all the online services converting them back to strings, but I have realised that that's just rainbow tables (collections of strings mapped to pre-computed hashes).
When you generate an MD5 hash, you're not compressing the input data. Compression implies that you'll be able to uncompress it back to its original state. MD5, on the other hand, is a one-way process. This is why it's used for password storage; you ideally have to know the original input string to be able to generate the same MD5 result again.
This page provides a nice graphic-equipped explanation of MD5 and similar hash functions, and how they're used: An Illustrated Guide to Cryptographic Hashes
Consider something like starting with a 128-bit value, and taking input 128 bits at a time, and XORing each of those input blocks with the existing value.
MD5 is considerably more complex than that, but the general idea is the same: input is processed one fixed-size block at a time (512-bit blocks in MD5's case, folded into a 128-bit state). Each input block can change the value of the result, but has no effect on its length.
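The toy XOR scheme described above can be sketched in a few lines of Python. This is deliberately not MD5 and is not secure in any way; it only shows why the output length stays fixed no matter how long the input is. The name xor_fold is made up for this example.

```python
import hashlib

STATE_BYTES = 16  # 128 bits


def xor_fold(message: bytes) -> bytes:
    """XOR every 16-byte block of the message into a 16-byte state."""
    state = bytearray(STATE_BYTES)  # starting value: all zeros
    for start in range(0, len(message), STATE_BYTES):
        block = message[start:start + STATE_BYTES]
        # A short final block only touches the first len(block) bytes,
        # which is the same as padding it with zeros.
        for i, byte in enumerate(block):
            state[i] ^= byte
    return bytes(state)


if __name__ == "__main__":
    for data in (b"", b"abc", b"x" * 1000):
        # Both digests are always 16 bytes long.
        print(len(xor_fold(data)), len(hashlib.md5(data).digest()))
```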
It has nothing (or, better, little) to do with compression. There is an algorithm which produces a new state for every combination of current state and input byte. This state is more or less unique to this combination of inputs.
In short, the message is split into many parts and an operation is performed on each.
If you are wondering about collisions, consider that your message consists only of readable characters.
The bit space is much bigger than the readable-character space.

Encryption algorithm that output byte by byte based on password and offset

Is there a well-known algorithm (one worth considering) that can encrypt/decrypt any arbitrary byte inside a file based on the password entered and the offset inside the file?
(Databyte, Offset, Password) => EncryptedByte
(EncryptedByte, Offset, Password) => DataByte
And is there some fundamental weakness in this approach, or is it still theoretically possible to build it strong enough?
Update:
More details: Any cryptographic algorithm has an input and an output. Many existing ones operate on large blocks. I want to operate on only one byte, but a system based on that alone can only remap bytes and is weak by default. If we also take the position of the byte in the file, we can, for example, interpret the bits of that position value as operations to apply at each step (0: xor, 1: shift) and produce the encrypted byte that way. But that's too simple; I'm looking for something stronger.
Maybe it's not very efficient but how about this:
for encryption use:
encryptedDataByte = Encrypt(offset,key) ^ dataByte
for decryption use:
dataByte = Encrypt(offset,key) ^ encryptedDataByte
Where Encrypt(offset,key) might be e.g. 3DES or AES (with padding the offset, if needed, and throwing away all but one result bytes)
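A sketch of this construction in Python. Because the standard library has no AES or 3DES, HMAC-SHA256 is swapped in for Encrypt(offset, key) here; that substitution, the PBKDF2 key derivation, the default iteration count, and all function names are assumptions made for illustration.

```python
import hashlib
import hmac


def derive_key(password: str, salt: bytes, iterations: int = 200_000) -> bytes:
    """Turn the password into a key (salt and iteration count are illustrative)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)


def keystream_byte(key: bytes, offset: int) -> int:
    """Stand-in for Encrypt(offset, key): keep one byte of a keyed MAC of the offset."""
    mac = hmac.new(key, offset.to_bytes(8, "big"), hashlib.sha256).digest()
    return mac[0]


def crypt_byte(byte: int, offset: int, key: bytes) -> int:
    """XOR is its own inverse, so this both encrypts and decrypts."""
    return byte ^ keystream_byte(key, offset)
```

Note that this provides no integrity protection, and writing different data at the same offset under the same key reuses the same keystream byte.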
If you can live with block sizes of 16 byte, you can try the XTS-mode described in the wikipedia article about Disk encryption theory (the advantage being that some good cryptologists already looked at it).
If you really need byte-wise encryption, I doubt that there is an established solution. In the conference Crypto 2009 there was a talk about How to Encipher Messages on a Small Domain: Deterministic Encryption and the Thorp Shuffle. In your case the domain is a byte, and as this is a power of 2, a Thorp Shuffle corresponds to a maximally unbalanced Feistel network. Maybe one can build something using the position and the password as key, but I'd be surprised if a home-made solution will be secure.
You can use AES in Counter Mode, where you divide your input into blocks of 16 bytes (128 bits) and then basically encrypt a counter based on the block number to get 16 pseudo-random bytes that you can XOR with the plaintext. It is critically important never to reuse the same counter start value (and/or initialization vector) with the same key, or you open yourself to an easy attack where an attacker can use a simple XOR of two ciphertexts to recover information about the plaintexts.
You mention that you want to only operate on individual bytes, but this approach would give you that flexibility. Output Feedback Mode is another common one, but you have to be careful in its use.
You might consider using the EAX mode for better security. Also, make sure you're using something like PBKDF2 or scrypt to generate your encryption key from the password.
However, as with most cryptography related issues, it's much better to use a rigorously tested and evaluated library rather than rolling your own.
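To see why reusing a counter start value with the same key is dangerous, here is a small Python demonstration. Random bytes stand in for the keystream a counter mode would produce; the function names and the sample messages are made up for this example.

```python
import os


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings of equal length."""
    return bytes(x ^ y for x, y in zip(a, b))


def reuse_demo(p1: bytes, p2: bytes) -> bytes:
    """Encrypt two equal-length messages with the SAME keystream, then
    recover p2 using only the two ciphertexts and knowledge of p1."""
    keystream = os.urandom(len(p1))   # stands in for the counter-mode keystream
    c1 = xor_bytes(p1, keystream)
    c2 = xor_bytes(p2, keystream)
    # The keystream cancels out: c1 ^ c2 == p1 ^ p2.
    return xor_bytes(xor_bytes(c1, c2), p1)


if __name__ == "__main__":
    print(reuse_demo(b"attack at dawn!!", b"retreat at nine!"))
```

The attacker never learns the key or the keystream directly; knowing or guessing one plaintext is enough to read the other.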
Basically what you need to do is generate some value X (probably 1 byte) based on the offset and password, and use this to encrypt/decrypt the byte at that offset. We'll call it
X = f(offset,password)
The problem is that an attacker that "knows something" about the file contents (e.g. the file is English text, or a JPEG) can come up with an estimate (or sometimes be certain) of what an X could be. So he has a "rough idea" about many X values, and for each of these he knows what the offset is. There is a lot of information available.
Now, it would be nice if all that information were of little use to the attacker. For most purposes, using a cryptographic hash function (like SHA-1) will give you a reasonable assurance of decent security.
But I must stress that if this is something critical, consult an expert.
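One possibility is a One Time Pad, possibly using the password to seed some pseudo-random number generator. One time pads theoretically achieve perfect secrecy, but only when the pad is truly random, as long as the data, and never reused; a pad generated from a password-seeded generator is really a stream cipher and is only as strong as that generator. It should do what you're looking for though.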

two-way keyed encryption/hash algorithm

I am in no way experienced in this type of thing so I am not even sure of the keywords (hence the title).
Basically I need a two way function
encrypt(w,x,y) = z
decrypt(z) = w, x, y
Where w = integer
x = string (username)
y = unix timestamp
and z = is an 8 digit number (possibly including letters, spec isn't there yet.)
I would like z to be not easily guessable and easily verifiable. Speed isn't a huge concern, security isn't either. Tracking one-to-one relationship is the main requirement.
Any resources or direction would be appreciated.
EDIT
Thanks for the answers, learning a lot. So to clarify, 8 characters is the only hard requirement, along with the ability to link w <-> z. The username (x) and timestamp (y) would be considered icing on the cake.
I would like to do this mathematically rather than doing some database lookups, if possible.
If I had to finish this tonight, I could just find a fitting hash algorithm and use a lookup table. I am simply trying to expand my understanding of this type of thing and see if I could do it mathematically.
Encryption vs. Hashing
This is an encryption problem, since the original information needs to be recovered. The quality of a cryptographic hash is judged by how difficult it is to reverse the hash and recover the original information, so hashing is not applicable here.
To perform encryption, some key material is needed. There are many encryption algorithms, but they fall into two main groups: symmetric and asymmetric.
Application
The application here isn't clear. But if you are "encrypting" some information and sending it somewhere, then later getting it back and doing something with it, symmetric encryption is the way to go. For example, say you want to encode a user name, an IP address, and some identifier from your application in a parameter that you include in a link in some HTML. When the user clicks the link, that parameter is passed back to your application and you decode it to recover the original information. That's a great fit for symmetric encryption, because the sender and the recipient are the same party, and key exchange is a no-op.
Background
In symmetric encryption, the sender and recipient need to know the same key, but keep it secret from everyone else. As a simple example, two people could meet in person, and decide on a password. Later on, they could use that password to keep their email to each other private. However, anyone who overhears the password exchange will be able to spy on them; the exchange has to happen over a secure channel... but if you had a secure channel to begin with, you wouldn't need to exchange a new password.
In asymmetric encryption, each party creates a pair of keys. One is public, and can be freely distributed to anyone who wants to send a private message. The other is private. Only the message recipient knows that private key.
A big advantage to symmetric encryption is that it is fast. All well-designed protocols use a symmetric algorithm to encrypt large amounts of data. The downside is that it can be difficult to exchange keys securely—what if you can't "meet up" (virtually or physically) in a secure place to agree on a password?
Since public keys can be freely shared, two people can exchange a private message over an insecure channel without having previously agreed on a key. However, asymmetric encryption is much slower, so it's usually used to encrypt a symmetric key or perform "key agreement" for a symmetric cipher. SSL and most cryptographic protocols go through a handshake where asymmetric encryption is used to set up a symmetric key, which is used to protect the rest of the conversation.
You just need to encrypt a serialization of (w, x, y) with a secret (symmetric) key. Use the same secret key to decrypt it.
In any case, the size of z cannot be simply bounded like you did, since it depends on the size of the serialization (since it needs to be two way, there's a bound on the compression you can do, depending on the entropy).
And you are not looking for a hash function, since it would obviously lose some information and you wouldn't be able to reverse it.
EDIT: Since the size of z is a hard limit, you need to restrict the input to 8 bytes and choose an encryption technique that uses a block size of 64 bits (or less). Blowfish and Triple DES use 64-bit blocks, but remember that those algorithms didn't receive the same scrutiny as AES.
If you want something really simple and quite insecure, just XOR your input with a secret key.
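The XOR idea can be sketched in Python as follows, assuming w fits in 32 bits so that the result prints as exactly 8 hex characters. The secret value and the function names are hypothetical.

```python
SECRET = 0x5A17C0DE  # hypothetical 32-bit secret; keep it private


def encode(w: int) -> str:
    """Map a 32-bit integer to an 8-character uppercase hex string."""
    if not 0 <= w < 2 ** 32:
        raise ValueError("w must fit in 32 bits")
    return format(w ^ SECRET, "08X")


def decode(z: str) -> int:
    """Invert encode(): parse the hex string and XOR the secret back out."""
    return int(z, 16) ^ SECRET


if __name__ == "__main__":
    z = encode(42)
    print(z, decode(z))
```

As the answer says, this is insecure: anyone who sees a single (w, z) pair can compute the secret as w XOR z.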
You probably can't.
Let's say that w is 32 bits, x supports at least 8 case-insensitive ASCII chars, so at least 37 bits, and y is 32 bits (gets you to 2038, and 31 bits doesn't even get you to now).
So, that's a total of at least 101 bits of data. You're trying to store it in an 8 digit number. It's mathematically impossible to create an invertible function from a larger set to a smaller set, so you'd need to store more than 12.5 bits per "digit".
Of course if you go to more than 8 characters, or if your characters are 16 bit unicode, then you're at least in with a chance.
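The counting argument can be checked with a few lines of Python. The field sizes are the estimates given in this answer (32 + 37 + 32 bits); the alphabets compared are illustrative.

```python
import math

REQUIRED_BITS = 32 + 37 + 32  # w, x, y as estimated above


def capacity_bits(alphabet_size: int, length: int = 8) -> float:
    """Bits of information that fit in `length` characters of the alphabet."""
    return length * math.log2(alphabet_size)


for name, size in (
    ("decimal digits", 10),
    ("digits + lowercase letters", 36),
    ("16-bit characters", 2 ** 16),
):
    bits = capacity_bits(size)
    print(f"{name}: {bits:.1f} bits, enough: {bits >= REQUIRED_BITS}")
```

Eight decimal digits hold fewer than 27 bits and eight base-36 characters fewer than 42, so neither comes close to 101; only something like 16-bit characters clears the bar.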
Let's formalize your problem, to better study it.
Let k be a key from the set K of possible keys, and (w, x, y) a piece of information, from a set I, that we need to crypt. Let's define the set of "crypted-messages" as A^8, where A is the alphabet from which we extract the characters to our crypted message (A = {0, 1, ..., 9, a, b, ..., z, ... }, depending on your specs, as you said).
We define the two functions:
crypt: I * K --> A^8
decrypt: A^8 * K --> I
The problem here is that the size of the set A^8, of crypted-messages, might be smaller than the set of pieces of information (w, x, y). If this is so, it is simply impossible to achieve what you are looking for, unless we try something different...
Let's say that only YOU (or your server, or your application on your server) have to be able to calculate (w, x, y) from z. That is, you might send z to someone, and you don't care that they will not be able to decrypt it.
In this case, what you can do is use a database on your server. You will crypt the information using a well-known algorithm, then you generate a random number z. You define the table:
Id: char[8]
CryptedInformation: byte[]
You will then store z on the Id column, and the crypted information on the corresponding column.
When you need to decrypt the information, someone will give you z, the index of the crypted information, and then you can proceed to decryption.
However, if this works for you, you might not even need to crypt the information, you could have a table:
Id: char[8]
Integer: int
Username: char[]
Timestamp: DateTime
And use the same method, without crypting anything.
This can be applied to an "e-mail verification system" on a subscription process, for example. The link you would send to the user by mail would contain z.
Hope this helps.
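The unencrypted lookup-table variant could be sketched like this in Python, with an in-memory dictionary standing in for the database table. The alphabet, the function names, and the retry-on-collision loop are assumptions made for illustration.

```python
import secrets
import string
from typing import Dict, Optional, Tuple

ALPHABET = string.ascii_lowercase + string.digits
# Stands in for the database table: Id -> (Integer, Username, Timestamp)
_records: Dict[str, Tuple[int, str, int]] = {}


def issue_id(w: int, username: str, timestamp: int) -> str:
    """Generate a random, unused 8-character z and store the record under it."""
    while True:
        z = "".join(secrets.choice(ALPHABET) for _ in range(8))
        if z not in _records:  # retry on the (unlikely) collision
            _records[z] = (w, username, timestamp)
            return z


def lookup(z: str) -> Optional[Tuple[int, str, int]]:
    """Return the stored (w, username, timestamp), or None if z is unknown."""
    return _records.get(z)
```

Because z is random rather than computed from the data, it reveals nothing about (w, x, y) and can be as short as the format requires.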
I can't tell if you are trying to set this up as a way to store passwords, but if you are, you should not use a two-way (reversible) function.
If you really want to do what you described, you should just concatenate the string and the timestamp (fill in extra spaces with underscores or something). Take that resulting string, convert it to ASCII or UTF-8 or something, and find its value modulo the largest prime less than 10^8.
Encryption or no encryption, I do not think it is possible to pack that much information into an 8 digit number in such a way that you will ever be able to get it out again.
An integer is 4 bytes. Let's assume your username is limited to 8 characters, and that characters are bytes. Then the timestamp is at least another 4 bytes. That's 16 bytes right there. In hex, that will take 32 digits. Base36 or something will be less, but it's not going to be anywhere near 8.
Hashes by definition are one way only, once hashed, it is very difficult to get the original value back again.
For two-way encryption I would look at TripleDES, which .NET has baked right in with TripleDESCryptoServiceProvider.
A fairly straight forward implementation article.
EDIT
It has been mentioned below that you can not cram a lot of information into a small encrypted value. However, for many (not all) situations this is exactly what Bit Masks exist to solve.
