Time-based hash that allows comparison of hashed data - algorithm

I'm trying to hash two different geo positions, (-180.0, 60.59) and (-179.0, 80.40), so that the positions themselves stay hidden while the numeric difference between the two values can still be computed from the hashes. I figured the answer would be to generate a key and store it on the client, and to include a time-based key in the hash.

Cryptographic hash functions are not operation-preserving; that is,
H(a) + H(b) != H(a + b)
Think of + as standing in for any operation. A hash function that did preserve an operation would make finding hash collisions dangerously easy.
What you need is homomorphic encryption, which supports at least one operation on ciphertexts. An example is the Paillier cryptosystem: when you multiply two ciphertexts, you get an encryption of the sum of the plaintexts, i.e.
a + b = Dec(Enc(a) * Enc(b))
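By way of illustration, here is a minimal Java sketch of Paillier's additive homomorphism - a toy example, not a hardened implementation. It uses the common g = n + 1 simplification, and the scaled coordinates 6059 and 8040 are made-up values:

import java.math.BigInteger;
import java.security.SecureRandom;

public class PaillierDemo {
    public static void main(String[] args) {
        SecureRandom rnd = new SecureRandom();
        BigInteger p = BigInteger.probablePrime(512, rnd);
        BigInteger q = BigInteger.probablePrime(512, rnd);
        BigInteger n = p.multiply(q);
        BigInteger nsq = n.multiply(n);
        BigInteger g = n.add(BigInteger.ONE);                 // common choice g = n + 1
        BigInteger lambda = lcm(p.subtract(BigInteger.ONE), q.subtract(BigInteger.ONE));
        BigInteger mu = lambda.modInverse(n);                 // valid for g = n + 1

        BigInteger a = BigInteger.valueOf(6059);              // e.g. a coordinate scaled to an integer
        BigInteger b = BigInteger.valueOf(8040);

        BigInteger ca = encrypt(a, n, nsq, g, rnd);
        BigInteger cb = encrypt(b, n, nsq, g, rnd);

        // multiplying ciphertexts adds the plaintexts: Dec(Enc(a) * Enc(b)) = a + b
        BigInteger sum = decrypt(ca.multiply(cb).mod(nsq), lambda, mu, n);
        System.out.println(sum);                              // prints 14099
    }

    static BigInteger encrypt(BigInteger m, BigInteger n, BigInteger nsq, BigInteger g, SecureRandom rnd) {
        BigInteger r;
        do {
            r = new BigInteger(n.bitLength() - 1, rnd);       // random r in [1, n) with gcd(r, n) = 1
        } while (r.signum() == 0 || !r.gcd(n).equals(BigInteger.ONE));
        return g.modPow(m, nsq).multiply(r.modPow(n, nsq)).mod(nsq);
    }

    static BigInteger decrypt(BigInteger c, BigInteger lambda, BigInteger mu, BigInteger n) {
        BigInteger nsq = n.multiply(n);
        BigInteger l = c.modPow(lambda, nsq).subtract(BigInteger.ONE).divide(n); // L(x) = (x - 1) / n
        return l.multiply(mu).mod(n);
    }

    static BigInteger lcm(BigInteger a, BigInteger b) {
        return a.divide(a.gcd(b)).multiply(b);
    }
}

To obtain a difference rather than a sum, homomorphically add the encryption of the negated value (represented mod n): Dec(Enc(a) * Enc(n - b)) = a - b whenever a >= b.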


hashing vs hash function, don't know the difference

For example, "Consistent hashing" and "Perfect hash function", in wikipedia, I click "hashing" and the link direct to "hash function", so it seems that they have the same meaning, but why does another exist? And is there any difference when using "hashing" or "hash function"? And is it ok to call "consistent hashing" as "consistent hash function"? Thanks!
A hash function takes some input data (typically a bunch of binary bytes, but it could be anything - whatever you design it to accept) and calculates a hash value from it, which is typically an integer (but, again, can be anything). The process of doing this is called hashing.
The hash value is always the same size, no matter what the input looks like. Well, I suppose you could make a hash function with variable-size output, but I haven't seen one in the wild yet; it wouldn't be very practical. Because the output is fixed-size, hashing is by its very nature a one-way calculation: there are many more possible input data combinations than there are possible hash values, so you normally can't get the original data back from the hash value.
The main advantages are:
The hash value is always the same size
The same input will always generate the same output.
If it's a good hash function, different inputs will usually generate different outputs, but it's still possible that two different inputs generate the same output (this is called a hash collision).
If you have a cryptographic hash function, you also get one more advantage:
Given only the hash value, it is infeasible to come up with input data that would hash to this value. Never mind recovering the original input - finding any input at all that hashes to the given output value is impossible in a useful timeframe.
The results of a hash function can be used in various ways. As mentioned in other answers, hash tables are one common use case. Verifying data integrity is another: for example, you download a file, hash it, and check the hash value against the value published on the page you downloaded the file from. If they don't match, the file was not downloaded correctly. If you combine hash values with public-key cryptography, you get digital signatures. And I'm sure there are other uses to which the principle can be put.
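As a concrete sketch of that integrity check in Java (the file name is hypothetical, and the expected digest would come from the download page):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;

public class VerifyDownload {
    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get("download.iso"));       // hypothetical file
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
        StringBuilder hex = new StringBuilder();
        for (byte x : digest) hex.append(String.format("%02x", x));        // digest as hex
        String expected = "…";                                             // digest published on the download page
        System.out.println(hex.toString().equals(expected) ? "download OK" : "download corrupted");
    }
}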
You can write a hash function, and what it does is hash keys to bins.
In other words, the hash function does the hashing.
I hope that clarifies it.
A hash table is a data structure in which values are mapped to particular keys for faster access to elements. The process of populating this data structure is known as hashing.
To do hashing, you need a function that provides the logic for mapping keys to bins. This function is the hash function.
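As a minimal sketch (hypothetical names; Java's built-in hashCode stands in for the mapping logic):

public class Bins {
    // the hash function: maps a key to one of numBuckets bins
    static int hashFunction(String key, int numBuckets) {
        return Math.floorMod(key.hashCode(), numBuckets); // floorMod keeps the index non-negative
    }

    public static void main(String[] args) {
        // hashing: the act of applying the hash function while populating the table
        System.out.println(hashFunction("alice", 16));
    }
}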
I hope this clarifies your doubt.

Perfect Hash Building

Why don't we use SHA-1, md5sum and other standard cryptographic hashes for hashing? They are smart enough to avoid collisions and are also not reversible. So rather than coming up with a new set of hash functions, which might have collisions, why don't we use them?
The only reason I can think of is that they require a large key, say 32 bits. But they still avoid collisions, so the lookup will definitely be O(1).
Because they are very slow, for several reasons:
They aim to be cryptographically secure, not merely collision-resistant in general
They produce a much larger hash value than you actually need in a hash table
They handle unstructured data (octet/byte streams), but the objects you need to hash are often structured and would require linearization first
Why don't we use SHA-1, md5sum and other standard cryptographic hashes for hashing? They are smart enough to avoid collisions...
Wrong because:
Two inputs can still happen to have the same hash value. Say the hash value is 32 bits: a great general-purpose hash routine (i.e. one that doesn't utilise insights into the set of actual keys) still has at least a 1/2^32 chance of returning the same hash value for any 2 keys, then a 2/2^32 chance of colliding with one of those as a third key is hashed, 3/2^32 for the fourth, etc.
Having distinct hash values is a very different thing from having the hash values map to distinct hash buckets in a hash table. Hash values are generally modded into the table size to select a bucket, so at best - and again for general-purpose hashing - the chance of a collision when adding an element to a hash table is #preexisting-elements / table-size (see the sketch below).
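A small hypothetical Java illustration of that second point: distinct hash values do not guarantee distinct buckets once they are modded into the table size:

public class BucketCollision {
    public static void main(String[] args) {
        int tableSize = 8;
        // two keys with distinct 32-bit hash values...
        int h1 = "apple".hashCode();
        int h2 = "grape".hashCode();
        // ...can still land in the same bucket after the mod
        System.out.println(Math.floorMod(h1, tableSize));
        System.out.println(Math.floorMod(h2, tableSize));
    }
}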
So rather than coming up with a new set of hash functions, which might have collisions, why don't we use them?
Because speed is often the programmer's goal when choosing to use a hash table over, say, a binary tree. If the hash values are mathematically complicated to calculate, they may take a lot longer to compute than a slightly more (but still not particularly) collision-prone but faster-to-calculate hash function. That said, there are times when more effort on the hashing can pay off - for example, when the hash table lives on magnetic disk and the I/O costs of seeking and reading records dwarf the hash calculation effort.
antti makes an interesting point about data too... general-purpose hashing routines often work on blocks of binary data with a specific starting address and a number of bytes (they may even require that number of bytes to be a multiple of 2 or 4). In many applications, data that needs to be hashed will be intermingled with data that must not be included in the hash - such as cached values, file handles, pointers/references to other data, or virtual dispatch tables etc. A common solution is to hash the desired fields separately and combine the hash keys - perhaps using exclusive-or. As there can be bit fields that should be hashed in the same byte of memory as other data that should not be hashed, you sometimes need custom code to extract those values. Still, even if some copying and padding were required beforehand, each individual field could eventually be hashed using MD5, SHA-1 or whatever, and those hash values could be similarly combined, so this complication doesn't really categorically rule out the approach you're interested in.
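A hypothetical Java sketch of that approach - hash only the meaningful fields and combine the results, here with exclusive-or as suggested:

public class Record {
    long id;            // should participate in the hash
    String name;        // should participate in the hash
    byte[] cachedBytes; // cached/derived value: must NOT be hashed
    Object fileHandle;  // runtime handle: must NOT be hashed

    @Override
    public int hashCode() {
        // hash the desired fields separately, then combine with XOR
        return Long.hashCode(id) ^ (name == null ? 0 : name.hashCode());
    }
}

One caveat: XOR is symmetric, so two fields that swap values produce the same combined hash; a multiplicative combine (as java.util.Objects.hash performs) avoids that.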
The only reason I can think of is that they require a large key, say 32 bits.
All other things being equal, the larger the key the better, though if the hash function is mathematically ideal then any N of its bits - where 2^N >= # hash buckets - will produce minimal collisions.
But they still avoid collisions, so the lookup will definitely be O(1).
Again, wrong as mentioned above.
(BTW... I stress general-purpose in a couple of places above. That's just because there are trivial cases where you might have some insight into the keys you'll need to hash that allows you to position them perfectly within the available hash buckets. For example, if you knew the keys were the numbers 1000, 2000, 3000 etc. up to 100000 and that you had at least 100 hash buckets, you could trivially define your hash function as x/1000 and know you'd have perfect hashing sans collisions. This situation of knowing that all your keys map to distinct hash table buckets is known as "perfect hashing" - as per your question title. A good general-purpose hash like MD5 is not a perfect hash, and indeed it makes no sense to talk about perfect hashing without knowing the complete set of possible keys.)
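That trivial perfect hash, written out in Java (assuming the keys really are exactly 1000, 2000, ..., 100000):

public class TrivialPerfectHash {
    // the keys are known in advance to be exactly 1000, 2000, ..., 100000
    static int hash(int key) {
        return key / 1000; // yields the distinct values 1..100 - zero collisions by construction
    }

    public static void main(String[] args) {
        System.out.println(hash(1000));   // 1
        System.out.println(hash(100000)); // 100
    }
}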

Salting passwords 101

Could someone please help me understand how salting works?
So far I understand the following:
Validate password
Generate a random string
Hash the password and the random string and concat them, then store them in the password field...
How do we store the salt, or know what it is when a user logs in? Do we store it in its own field? If we don't, how does the application figure out what the salt is? And if we do store it, doesn't it defeat the whole purpose?
Salt is combined with the password before hashing: the cleartext password and salt are concatenated, and the resulting string is hashed. This guarantees that even if two people had the same password, the resulting hashes would differ. (It also makes dictionary attacks and precomputed rainbow-table attacks much more difficult.)
The salt is then stored in its original/clear format along with the hash result. Later, when you want to verify the password, you do the original process again: combine the salt from the record with the password the user provided, hash the result, and compare the hashes.
You probably already know this, but it's important to remember: the salt must be generated randomly each time and must be different for each protected hash. Often a cryptographically secure RNG is used to generate the salt.
So..for example:
user-password: "mypassword"
random salt: "abcdefg12345"
resulting cleartext: "mypassword:abcdefg12345" (how you combine them is up to you, as long as you use the same combination format every time)
hash the resulting cleartext: "somestandardlengthhashbasedonalgorithm"
In your database now you would store the hash and salt used. I've seen it two ways:
method 1:
field1 - salt = "abcdefg12345"
field2 - password_hash = "somestandardlengthhashbasedonalgorithm"
method 2:
field1 - password_hash = "abcdefg12345:somestandardlengthhashbasedonalgorithm"
In either case you have to load the salt and password hash from your database and redo the hash for comparison.
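A minimal Java sketch of the store/verify cycle common to both methods (SHA-256 is used here only for brevity; a deliberately slow password hash such as bcrypt, scrypt or PBKDF2 is preferable in practice):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Base64;

public class SaltedHash {
    // registration: generate a fresh random salt, hash password+salt, store both
    static String[] create(String password) throws Exception {
        byte[] salt = new byte[16];
        new SecureRandom().nextBytes(salt);               // new random salt per user
        return new String[] { Base64.getEncoder().encodeToString(salt), hash(password, salt) };
    }

    // login: recombine the stored salt with the supplied password and compare
    static boolean verify(String password, String storedSalt, String storedHash) throws Exception {
        byte[] salt = Base64.getDecoder().decode(storedSalt);
        String recomputed = hash(password, salt);
        return MessageDigest.isEqual(                     // constant-time comparison
            recomputed.getBytes(StandardCharsets.UTF_8),
            storedHash.getBytes(StandardCharsets.UTF_8));
    }

    static String hash(String password, byte[] salt) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        md.update(password.getBytes(StandardCharsets.UTF_8));
        md.update(salt);
        return Base64.getEncoder().encodeToString(md.digest());
    }

    public static void main(String[] args) throws Exception {
        String[] record = create("mypassword");           // store record[0] (salt) and record[1] (hash)
        System.out.println(verify("mypassword", record[0], record[1])); // true
    }
}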
salt <- random
hash <- hash(password + salt)
store hash:salt
Later
input password
look up hash:salt
hash(password+salt)
compare with stored hash
Got it?
How do we store the salt, or know what it is when a user logs in? Do we store it in its own field?
Yes.
And if we do store it, doesn't it defeat the whole purpose?
No. The purpose of a salt is not to be secret, but merely to prevent an attacker from amortizing the cost of computing rainbow tables across all sites in the world (no salt) or across all users of your site (a single salt used for all users).
According to Practical Cryptography (Niels Ferguson and Bruce Schneier), you should use salted, stretched hashes for maximum security.
x[0] := 0
x[i] := h(x[i-1] || p || s) for i = 1, ..., r
K := x[r]
where
h is the hash (SHA-1, SHA-256, etc.)
K is the generated hashed password
p is the plaintext password
r is the number of rounds
s is the randomly generated salt
|| is the concatenation operator
The salt value is a random number that is stored with the hashed password. It does not need to remain secret.
Stretching is the act of performing the hash multiple times to make it computationally more difficult for an attacker to test many candidate passwords. r should be chosen so that the computation takes about 200-1000 ms on the user's computer, and may need to be increased as computers get faster.
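A sketch of that construction in Java, with SHA-256 standing in for h (names are illustrative):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class StretchedHash {
    static byte[] derive(String password, byte[] salt, int rounds) throws Exception {
        MessageDigest h = MessageDigest.getInstance("SHA-256"); // h
        byte[] x = new byte[h.getDigestLength()];               // x[0] := 0 (an all-zero block)
        byte[] p = password.getBytes(StandardCharsets.UTF_8);   // p
        for (int i = 1; i <= rounds; i++) {                     // x[i] := h(x[i-1] || p || s)
            h.update(x);
            h.update(p);
            h.update(salt);                                     // s
            x = h.digest();                                     // digest() also resets h
        }
        return x;                                               // K := x[r]
    }
}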
If you're using a well-known hashing algorithm, someone could already have a list of many possible passwords hashed with that algorithm, and compare the entries of that list against a hashed password they want to crack (a dictionary attack).
If you "salt" all passwords before hashing them, these dictionaries are useless, because they'd have to be created using your salt.

Do cryptographic hashes provide really unique results?

I was wondering whether md5, sha1 and others return unique values.
For example, sha1() for "test" returns a94a8fe5ccb19ba61c4c0873d391e987982fbbd3, which is 40 characters long. So sha1 values for some strings longer than 40 chars must coincide (of course the output is scrambled, and the input may contain whitespace, special chars, etc.).
Because of this, when we store users' passwords, a user could enter either their original password or some super-long one that happens to collide with it, which nobody knows.
Is this right, or do these hash algorithms provide really unique results? I'm quite sure that's hardly possible.
(Note: You're asking about hashing functions, not encryption).
It's impossible for them to be unique, by definition. They take a large input and reduce its size, so it obviously follows that they can't represent all the information they have compressed. So no, they don't provide "truly unique" results.
What they do provide, however, is "collision resistant" results, i.e. they are designed so that two slightly different inputs produce significantly different hashes.
Hashing algorithms (which is what you are referring to) do not provide unique results. What you are referring to is called the pigeonhole principle: the number of possible inputs exceeds the number of possible outputs, so multiple inputs must be mapped to the same output. This is why the longer the output hash, the better - fewer inputs map to each output.
Encryption, in contrast, must produce unique results, because you have to be able to encrypt a message and then decrypt it back to the same message.
SHA-1 is not an encryption algorithm, but a cryptographic hash function.
You are right - since it maps arbitrarily long input to a fixed-size hash, there must be collisions. But the idea of a cryptographic hash function is to make it infeasible to create such collisions "on demand". That's why we also call them one-way hash functions.
Quote (source):
The ideal cryptographic hash function has four main or significant properties:
* it is easy to compute the hash value for any given message,
* it is infeasible to find a message that has a given hash,
* it is infeasible to modify a message without changing its hash,
* it is infeasible to find two different messages with the same hash.
Hashing algorithms never guarantee a different result for a different input. That's why hashing is always used as one-way "encryption".
But you have to be realistic: a 160-bit hashing algorithm has 2^160 possible output values, which is... a lot! (roughly a 1 with 48 zeroes after it)
These are not encryption functions, but hashing ones.
Hashing, by definition, can have two different strings collide (map to the same value) for the very reasons you mention. But that is usually not relevant because:
Cryptographic hashes (such as SHA-1) try hard to make the collision probability (very, very) low, even for similar strings
You cannot deduce the original string from the hash.
These two mean that you cannot take a hash and easily generate one of the strings that map to it.

When do hashes collide?

I understand that, according to the pigeonhole principle, if the number of items is greater than the number of containers, then at least one container will have more than one item. Does it matter which container it will be? How does this apply to MD5, SHA1 and SHA2 hashes?
No, it doesn't matter which container it is, and in fact this is not that important for cryptographic hashes; much more important is the birthday paradox, which says that you only need to hash about sqrt(numberNeededByPigeonholePrinciple) values, on average, before finding a collision.
Thus, the hash needs to be large enough that the square root of the search space is too large to brute-force. The square root of the search space for SHA1 is 2^80, and as of March 2012, no two values have ever been found with the same SHA1 hash (though I predict that will happen within the next year or two..); same with SHA2, a family of hashes which all have an even larger search space. MD5 has been broken for a while, though.
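To watch the birthday paradox at a scale a laptop can handle, here is a hypothetical Java experiment that truncates SHA-256 to 32 bits; by the square-root rule, a collision should turn up after roughly 2^16 (about 65,000) inputs rather than 2^32:

import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class BirthdayDemo {
    public static void main(String[] args) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        Map<Integer, Long> seen = new HashMap<>();            // truncated hash -> first input
        for (long i = 0; ; i++) {
            byte[] digest = md.digest(Long.toString(i).getBytes());
            int h32 = ByteBuffer.wrap(digest).getInt();       // keep only the first 32 bits
            Long prev = seen.put(h32, i);
            if (prev != null) {
                System.out.println("Collision after " + i + " hashes: inputs " + prev + " and " + i);
                break;
            }
        }
    }
}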
If you have more items to hash than you have slots, then you'll have hash collisions. But if you have a poor hashing algorithm, then you'll see collisions even when the items / slots ratio is very small. A good hashing algorithm (including most of the ones you'll see in the wild) will attempt to spread the resulting hashes over the entire output space as evenly as possible, and thus minimize collisions.
Note that a hash collision is not the end of the world. When used in a hash table, for instance, it just means that more than one item is stored in a slot, and the table code will have to traverse a little bit more to find or add the target item, increasing lookup time slightly.
You'll see people refer to MD5 as a "broken" hashing algorithm, when in reality, it's just a poor one to use as a cryptographic hash. It'll be better than one you build yourself.
The point of a hash function is to randomly distribute items into containers. For any good hash function, it doesn't/shouldn't "matter" which container is which as they must be indistinguishable.
This does not apply to "perfect hash" implementations which attempt to do better than random distribution — unlike the algorithms you mentioned.
As Michael mentioned, collisions happen LONG before there are as many items as slots. You must have graceful collision handling (or a perfect hash) if you want to handle the birthday paradox.
I think the application you're using the hash function for is an important distinction. Frequent collisions in hashing containers, for example, degrade performance. Frequent collisions in cryptography can have far more devastating consequences (see: cryptographic hash function on Wikipedia).
Collisions happen relatively easily even with a "decent" hashing algorithm. For example, in Java,
String s = new String(new char[size]); // a string of `size` '\0' characters
always hashes to 0, for any size. That is, all strings containing only '\0' hash to 0 in Java, because String's polynomial hash (h = 31*h + c, character by character) stays at 0 when every character is 0.
As for "does it matter which container will it be?", again it depends on the application. You can design hash functions that would hash "similar" objects to nearby values. This is useful when you want to search for similar objects, for example. Just hash them all and see where they fall. In this case, collisions or near-collisions are desirable, because it groups objects that are similar.
In other applications, you want even the slightest change in the object to result in an entirely different hash value. This is the case in cryptography, for example, where you want to be as certain as possible that something has not been modified. It is far more difficult to find different objects that hash to the same value in this case.
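A quick Java illustration of that behavior: changing a single character of the input yields a completely different digest:

import java.security.MessageDigest;
import java.util.Arrays;

public class Avalanche {
    public static void main(String[] args) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] d1 = md.digest("message".getBytes());
        byte[] d2 = md.digest("massage".getBytes());  // one character changed
        // the two digests differ in roughly half their bits
        System.out.println(Arrays.equals(d1, d2));    // false
    }
}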
Depending on your application, cryptographic hashes like MD5, SHA-1/2 etc. may not be the ideal choice, precisely because they behave as if entirely random, thus giving you collisions as predicted by the birthday paradox. Traditionally, one reason for using simple hashes based on the remainder operation is that keys were expected to be serial numbers or similar, so that a remainder operation would sustain fewer collisions than expected at random. E.g. if the keys are the integers 1..1000, you might have no collisions at all in a container of size 1009 if your hash function is key mod 1009. People would sometimes hand-tune systems by carefully picking container size and hash function to achieve an even split.
Of course, if you have to worry about people maliciously choosing keys that will cause you difficulty, or about an upstream system sending you very biased keys (because, e.g., it has its own hash table and decides to process all keys that hash to X at once), you may wish to use a hash based on a keyed cryptographic hash function to defend against this.
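A hypothetical Java sketch of that defense: derive bucket indexes from an HMAC under a secret, per-process key, so an outsider cannot predict which keys collide (HmacSHA256 via the standard javax.crypto API):

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.security.SecureRandom;

public class KeyedBucketHash {
    private final Mac mac;

    public KeyedBucketHash() throws Exception {
        byte[] key = new byte[32];
        new SecureRandom().nextBytes(key);            // secret, per-process key
        mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(key, "HmacSHA256"));
    }

    public int bucket(byte[] input, int tableSize) {
        byte[] tag = mac.doFinal(input);              // unpredictable without the key
        int h = ((tag[0] & 0xFF) << 24) | ((tag[1] & 0xFF) << 16)
              | ((tag[2] & 0xFF) << 8)  |  (tag[3] & 0xFF);
        return Math.floorMod(h, tableSize);
    }
}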
