Memcached uses distributed consistent hashing to choose which server to put a key on, but which hashing algorithm does it use to map the string key to the final hash that the Ketama algorithm is applied to for server selection? And how good is that algorithm at spreading similar keys across different servers?
According to the source code in hash.c, memcached uses the following algorithm:
The hash function used here is by Bob Jenkins, 1996:
http://burtleburtle.net/bob/hash/doobs.html
"By Bob Jenkins, 1996. bob_jenkins#burtleburtle.net.
You may use this code any way you wish, private, educational,
or commercial. It's free."
From Bob Jenkins' website:
I offer you a new hash function for hash table lookup that is faster and more thorough than the one you are using now. I also give you a way to verify that it is more thorough.
Also, his requirements are:
The keys are unaligned variable-length byte arrays.
Sometimes keys are several such arrays.
Sometimes a set of independent hash functions were required.
Average key lengths ranged from 8 bytes to 200 bytes.
Keys might be character strings, numbers, bit-arrays, or weirder things.
Table sizes could be anything, including powers of 2.
The hash must be faster than the old one.
The hash must do a good job.
...
The real requirement then is that a good hash function should distribute hash values uniformly for the keys that users actually use.
To get back to your other question, he measured the ability of the algorithm to uniformly distribute hash values, so I would presume that the hash does a good job at spreading similar keys to different servers. If you have concerns, the code is isolated so you should be able to run your own tests.
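As a rough way to run such a test, the sketch below hashes a run of near-identical keys and counts how many land on each server. It makes two simplifying assumptions that are not memcached's actual behavior: it uses Bob Jenkins' small one-at-a-time hash (not necessarily the function in hash.c), and it picks a server with a plain modulo instead of Ketama. The key names and the server count are made up for illustration.

```python
from collections import Counter

MASK32 = 0xFFFFFFFF

def one_at_a_time(key: bytes) -> int:
    """Bob Jenkins' one-at-a-time hash, kept to 32 bits."""
    h = 0
    for byte in key:
        h = (h + byte) & MASK32
        h = (h + (h << 10)) & MASK32
        h ^= h >> 6
    h = (h + (h << 3)) & MASK32
    h ^= h >> 11
    h = (h + (h << 15)) & MASK32
    return h

def spread(keys, num_servers):
    """Count keys per server using hash modulo server count (not Ketama)."""
    return Counter(one_at_a_time(k) % num_servers for k in keys)

# Hypothetical keys that differ only in a trailing number.
similar_keys = [f"user:{i}".encode() for i in range(10_000)]
counts = spread(similar_keys, 4)
for server in sorted(counts):
    print(server, counts[server])
```

Swapping in the real hash and the real server-selection step would give a more faithful picture; the counting harness stays the same.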
Related
Hashing algorithms today are widely used to check the integrity of data, but why are they safe to use? A 256-bit hashing algorithm generates a 256-bit representation of the given data. However, a 256-bit hash only has 2^256 variations, while 1 KB of data has 2^8192 different variations. It's mathematically impossible for every piece of data in the world to have a different hash value. So why are hashing algorithms safe?
The reasons why hashing algorithms are considered safe are due to the following:
They are irreversible. You can't get to the input data by reverse-engineering the output hash value.
A small change in the input will produce a vastly different hash value. For example, "hello" and "hellp" will generate completely different values.
The assumption being made with data integrity is that a majority of your input is going to be the same between a good copy of input data and a bad (malicious) copy of input data. The small change in data will make the hash value completely different. Therefore, if I try to inject any malicious code or data, that small change will completely throw-off the value of the hash. When comparison is done with a known hash value, it'll be easily determinable if data has been modified or corrupted.
You are correct that there is a risk of collisions across an infinite number of datasets, but when you compare two datasets that are very similar, it is reasonable to assume that the hash values of those two almost-equivalent datasets will be completely different.
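The "hello" versus "hellp" claim is easy to check. The sketch below assumes SHA-256 as the 256-bit hash (the text above does not name one) and counts how many output bits differ between the two digests.

```python
import hashlib

def differing_bits(a: bytes, b: bytes) -> int:
    """Number of bit positions where the SHA-256 digests of a and b differ."""
    da = int.from_bytes(hashlib.sha256(a).digest(), "big")
    db = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(da ^ db).count("1")

flipped = differing_bits(b"hello", b"hellp")
print(f"{flipped} of 256 output bits differ")
```

For a well-mixed hash, the count should land somewhere around half of the output bits.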
Not all hashes are safe. There are good hashes (for some value of "good") where it's sufficiently non-trivial to intentionally create collisions (I think FNV-1a may fall in this category). However, a cryptographic hash, used correctly, would be computationally expensive to generate a collision for.
"Good" hashes generally have the property that small changes in the input cause large changes in the output (the rule of thumb is that a single-bit flip in the input causes roughly b bit flips in the output, for a 2b-bit hash). There are some special-purpose hashes where "close inputs generate close hashes" is actually a feature. You probably would not use those for error detection, but they may be useful for other applications.
A specific use for FNV-1a is to hash large blocks of data, then compare the computed hash to that of other blocks. Only blocks that have a matching hash need to be fully compared to see if they're identical, meaning that a large number of blocks can simply be ignored, speeding up the comparison by orders of magnitude (you can compare one 2 MB block to another in approximately the same time as you can compare its 64-bit hash to the hashes of 256Ki blocks, although you will probably have a few blocks with colliding hashes).
Note that "just a hash" may not be enough to provide security; you may also need to apply some sort of signing mechanism to ensure that you don't have the problem of someone modifying the hashed-over text as well as the hash.
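A minimal sketch of that block-comparison idea, assuming 64-bit FNV-1a with its published offset basis and prime. The helper names `build_index` and `find_identical` and the sample blocks are hypothetical.

```python
FNV64_OFFSET = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = 0xFFFFFFFFFFFFFFFF

def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR in each byte, then multiply by the FNV prime."""
    h = FNV64_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64
    return h

def build_index(blocks):
    """Map each block hash to the positions of blocks with that hash."""
    index = {}
    for i, block in enumerate(blocks):
        index.setdefault(fnv1a_64(block), []).append(i)
    return index

def find_identical(candidate, blocks, index):
    """Fully compare only the blocks whose hash matches the candidate's."""
    return [i for i in index.get(fnv1a_64(candidate), [])
            if blocks[i] == candidate]

blocks = [bytes([n]) * 4096 for n in range(8)]
index = build_index(blocks)
matches = find_identical(bytes([3]) * 4096, blocks, index)
print(matches)
```

The final byte-for-byte comparison is what makes occasional hash collisions harmless here.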
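One possible mechanism of that kind is a keyed message authentication code, which is related to but not the same thing as a public-key signature. The sketch below uses HMAC-SHA256 from Python's standard library; the secret and messages are placeholders.

```python
import hashlib
import hmac

def make_tag(secret: bytes, message: bytes) -> bytes:
    """Authentication tag that only holders of the secret can compute."""
    return hmac.new(secret, message, hashlib.sha256).digest()

def verify(secret: bytes, message: bytes, received_tag: bytes) -> bool:
    """Constant-time comparison of the expected and received tags."""
    return hmac.compare_digest(make_tag(secret, message), received_tag)

secret = b"placeholder-shared-secret"
tag = make_tag(secret, b"original text")
print(verify(secret, b"original text", tag))   # tag matches the message
print(verify(secret, b"tampered text", tag))   # attacker cannot fix up the tag
```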
Simply for ensuring storage integrity (basically "protect against accidental modification" as a threat model), a cryptographic hash without signature, plus the original size, should be good enough. You would need a really really unlikely sequence of random events mutating a fixed-length bit string to another fixed-length bit string of the same length, giving the same hash. Of course, this does not give you any sort of error correction ability, just error detection.
Something that has been bugging me about the HyperLogLog algorithm is its reliance on the hash of the keys. The issue I have is that the paper seems to assume that we have a totally random distribution of data on each partition, whereas in the context where it is often used (MapReduce-style jobs) things are often distributed by their hash values, so all duplicated keys will be on the same partition. To me this means that we should actually be adding the cardinalities generated by HyperLogLog rather than using some sort of averaging technique (in the case where we are partitioned by hashing the same thing that HyperLogLog hashes).
So my question is: is this a real issue with HyperLogLog, or have I not read the paper in enough detail?
This is a real issue if you use non-independent hash functions for both tasks.
Let's say the partitioner picks the node from the first b bits of the hashed value. If you use the same hash function for both partitioning and HyperLogLog, the algorithm will still work properly, but precision will be sacrificed. In practice, it will be equivalent to using m/2^b buckets (log2(m') = log2(m) - b), because the first b bits will always be the same on a given node, so only log2(m) - b bits will be used to choose the HLL bucket.
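The effect can be seen in a small experiment. This is only a sketch: it assumes a 64-bit hash taken from SHA-256, a partitioner that uses the top B bits, and an HLL bucket index that uses the top P bits of the same hash. The values of B and P and the key names are arbitrary.

```python
import hashlib

P = 10  # HLL bucket index uses the top P bits, so m = 2**P = 1024 buckets
B = 2   # partitioner uses the top B bits, so there are 2**B = 4 nodes

def hash64(key: str) -> int:
    """64-bit hash taken from the first 8 bytes of SHA-256."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

def node_of(h: int) -> int:
    return h >> (64 - B)

def bucket_of(h: int) -> int:
    return h >> (64 - P)

hashes = [hash64(f"key-{i}") for i in range(100_000)]
used_on_node0 = {bucket_of(h) for h in hashes if node_of(h) == 0}
print(len(used_on_node0), "of", 2 ** P, "buckets ever touched on node 0")
```

Because the top B bits are fixed on each node, at most 2^(P-B) of the m buckets can ever be hit there.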
Why don't we use SHA-1, md5sum and other standard cryptographic hashes for hashing? They are smart enough to avoid collisions and are also not reversible. So rather than coming up with a set of new hash functions, which might have collisions, why don't we use them?
The only reason I can think of is that they require a large key, say 32 bits. But they still avoid collisions, so the lookup will definitely be O(1).
Because they are very slow, for two reasons:
They aim to be cryptographically secure, not only collision-resistant in general
They produce a much larger hash value than what you actually need in a hash table
Because they handle unstructured data (octet / byte streams) but the objects you need to hash are often structured and would require linearization first
Why don't we use SHA-1, md5sum and other standard cryptographic hashes for hashing? They are smart enough to avoid collisions...
Wrong because:
Two inputs can still happen to have the same hash value. Say the hash value is 32 bits: a great general-purpose hash routine (i.e. one that doesn't utilise insights into the set of actual keys) still has at least a 1/2^32 chance of returning the same hash value for any 2 keys, then a 2/2^32 chance of colliding with one of those as a third key is hashed, 3/2^32 for the fourth, etc.
Having distinct hash values is a very different thing from having the hash values map to distinct hash buckets in a hash table. Hash values are generally modded into the table size to select a bucket, so at best - and again for general-purpose hashing - the chance of a collision when adding an element to a hash table is #preexisting-elements / table-size.
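Accumulating those per-key chances (1/2^32, then 2/2^32, then 3/2^32, and so on) gives the probability that at least two of n keys share a hash value. The sketch below assumes an ideal, uniformly distributed hash; the function name is made up.

```python
import math

def collision_probability(num_keys: int, hash_bits: int = 32) -> float:
    """Chance that at least two of num_keys share a hash value."""
    space = 2 ** hash_bits
    if num_keys > space:
        return 1.0  # pigeonhole: more keys than possible hash values
    # P(no collision) = product over i of (1 - i/space); summed in log space.
    log_no_collision = sum(math.log1p(-i / space) for i in range(num_keys))
    return -math.expm1(log_no_collision)

for n in (1_000, 10_000, 77_163, 200_000):
    print(n, round(collision_probability(n), 4))
```

With 32-bit hash values, the probability crosses one half at roughly 77,000 keys, long before the 2^32 possible values run out.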
So rather than coming up with a set of new hash functions, which might have collisions, why don't we use them?
Because speed is often the programmer's goal when choosing to use a hash table over say a binary tree. If the hash values are mathematically complicated to calculate, they may take a lot longer than using a slightly more (but still not particularly) collision prone but faster-to-calculate hash function. That said, there are times when more effort on the hashing can pay off - for example, when the hash table exists on magnetic disk and the I/O costs of seeking & reading records dwarfs hash calculation effort.
antti makes an interesting point about data too... general purpose hashing routines often work on blocks of binary data with a specific starting address and a number of bytes (they may even require that number of bytes to be a multiple of 2 or 4). In many applications, data that needs to be hashed will be intermingled with data that must not be included in the hash - such as cached values, file handles, pointers/references to other data or virtual dispatch tables etc.. A common solution is to hash the desired fields separately and combine the hash keys - perhaps using exclusive-or. As there can be bit fields that should be hashed in the same byte of memory as other data that should not be hashed, you sometimes need custom code to extract those values. Still, even if some copying and padding was required beforehand, each individual field could eventually be hashed using md5, SHA-1 or whatever and those hash values could be similarly combined, so this complication doesn't really categorically rule out the approach you're interested in.
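A rough sketch of the "hash the desired fields separately and combine" approach. The `Record` type, its field names, and the choice of MD5 truncated to 64 bits are all assumptions for illustration. Each field is tagged with its name before hashing so that two fields holding the same value do not cancel each other out under XOR.

```python
import hashlib
from dataclasses import dataclass

def field_hash(tag: str, value) -> int:
    """64-bit hash of one tagged field, taken from an MD5 digest."""
    digest = hashlib.md5(f"{tag}={value!r}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

@dataclass
class Record:
    record_id: int
    name: str
    cached_total: float = 0.0  # derived data that must not affect the hash

    def key_hash(self) -> int:
        # Combine only the identifying fields with exclusive-or.
        return field_hash("id", self.record_id) ^ field_hash("name", self.name)

a = Record(1, "alpha", cached_total=0.0)
b = Record(1, "alpha", cached_total=99.5)
c = Record(2, "alpha")
print(a.key_hash() == b.key_hash(), a.key_hash() == c.key_hash())
```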
The only reason I can think of is that they require a large key, say 32 bits.
All other things being equal, the larger the key the better, though if the hash function is mathematically ideal then any N of its bits - where 2^N >= # hash buckets - will produce minimal collisions.
But they still avoid collisions, so the lookup will definitely be O(1).
Again, wrong as mentioned above.
(BTW... I stress general-purpose in a couple places above. That's just because there are trivial cases where you might have some insight into the keys you'll need to hash that allows you to position them perfectly within the available hash buckets. For example, if you knew the keys were the numbers 1000, 2000, 3000 etc. up to 100000 and that you had at least 100 hash buckets, you could trivially define your hash function as x/1000 and know you'd have perfect hashing sans collisions. This situation of knowing that all your keys map to distinct hash table buckets is known as "perfect hashing" - as per your question title - a good general-purpose hash like md5 is not a perfect hash, and indeed it makes no sense to talk about perfect hashing without knowing the complete set of possible keys).
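The 1000, 2000, ..., 100000 example can be written out directly. One small adjustment is assumed here: 1 is subtracted after the division so that the 100 keys fit exactly into bucket indices 0 through 99.

```python
NUM_BUCKETS = 100
keys = range(1000, 100_001, 1000)  # 1000, 2000, ..., 100000

def perfect_hash(x: int) -> int:
    """Maps each known key to its own bucket index in 0..99."""
    return x // 1000 - 1

buckets = [None] * NUM_BUCKETS
for key in keys:
    slot = perfect_hash(key)
    if buckets[slot] is not None:
        raise ValueError(f"collision at bucket {slot}")
    buckets[slot] = key

print(sum(b is not None for b in buckets), "buckets filled, no collisions")
```

This only works because the full key set is known in advance; any key outside it would break the guarantee.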
I have a hash table where the vast majority of accesses at run-time follow one of the following patterns:
Iterate through all key/value pairs. (The speed of this operation is critical.)
Modify keys (i.e. remove a key/value pair & add another with the same value but a different key. Detect duplicate keys & combine values if necessary.) This is done in a loop, affecting many thousands of keys, but with no other operations intervening.
I would also like it to consume as little memory as possible.
Other standard operations must be available, though they are used less frequently, e.g.
Insert a new key/value pair
Given a key, look up the corresponding value
Change the value associated with an existing key
Of course all "standard" hash table implementations, including standard libraries of most high-level-languages, have all of these capabilities. What I am looking for is an implementation that is optimized for the operations in the first list.
Issues with common implementations:
Most hash table implementations use separate chaining (i.e. a linked list for each bucket.) This works but I am hoping for something that occupies less memory with better locality of reference. Note: my keys are small (13 bytes each, padded to 16 bytes.)
Most open addressing schemes have a major disadvantage for my application: Keys are removed and replaced in large groups. That leaves deletion markers that increase the load factor, requiring the table to be re-built frequently.
Schemes that work, but are less than ideal:
Separate chaining with an array (instead of a linked list) per bucket:
Poor locality of reference, resulting from memory fragmentation as small arrays are reallocated many times
Linear probing/quadratic hashing/double hashing (with or without Brent's Variation):
Table quickly fills up with deletion markers
Cuckoo hashing
Only works for <50% load factor, and I want a high LF to save memory and speed up iteration.
Is there a specialized hashing scheme that would work well for this case?
Note: I have a good hash function that works well with both power-of-2 and prime table sizes, and can be used for double hashing, so this shouldn't be an issue.
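To make the deletion-marker problem concrete, here is a minimal linear-probing set written only for illustration. The class name, capacity, and key values are assumptions; it re-keys 512 entries in a loop and then counts the markers left behind.

```python
EMPTY = object()      # slot that has never held a key
TOMBSTONE = object()  # deletion marker left where a key was removed

class LinearProbingSet:
    def __init__(self, capacity: int):
        self.slots = [EMPTY] * capacity

    def _probe(self, key):
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            yield (start + step) % len(self.slots)

    def add(self, key):
        reusable = None
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                if reusable is None:
                    reusable = i
                break
            if slot is TOMBSTONE:
                if reusable is None:
                    reusable = i  # remember the first marker we can reuse
            elif slot == key:
                return            # already present
        if reusable is None:
            raise RuntimeError("table is full")
        self.slots[reusable] = key

    def remove(self, key):
        for i in self._probe(key):
            slot = self.slots[i]
            if slot is EMPTY:
                return            # key was never stored
            if slot is not TOMBSTONE and slot == key:
                self.slots[i] = TOMBSTONE
                return

table = LinearProbingSet(1024)
for k in range(512):
    table.add(k)
for k in range(512):              # the "modify keys" loop
    table.remove(k)
    table.add(k + 10_000)

tombstones = sum(s is TOMBSTONE for s in table.slots)
live = sum(s is not EMPTY and s is not TOMBSTONE for s in table.slots)
print(f"live={live} tombstones={tombstones} capacity={len(table.slots)}")
```

Even though `add` reuses markers where it can, the table ends the loop with the same number of live keys and a noticeably higher effective load.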
Would Extendable Hashing help? Iterating through the keys by walking the 'directory' should be fast. Not sure if the "modify key for value" operation is any better with this scheme or not.
Based on how you're accessing the data, does it really make sense to use a hash table at all?
Since your main use cases involve iteration, a sorted list or a B-tree might be a better data structure.
It doesn't seem like you really need the constant-time random data access a hash table is built for.
You can do much better than a 50% load factor with cuckoo hashing.
Two hash functions with four items per bucket will get you over 90% with little effort. See this paper:
http://www.ru.is/faculty/ulfar/CuckooHash.pdf
I'm building a pre-computed dictionary using a cuckoo hash and getting a load factor of better than 99% with two hash functions and seven items per bucket.
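A small experiment in that spirit, not the implementation from the paper or from the answer above. It assumes random 64-bit keys, takes the two bucket choices from the low and high halves of the key, and uses random-walk eviction with a bounded number of kicks. All parameter values are arbitrary.

```python
import random

def cuckoo_fill(num_buckets=256, slots_per_bucket=4, max_kicks=500, seed=1):
    """Insert random keys until one cannot be placed; return the load factor."""
    rng = random.Random(seed)
    buckets = [[] for _ in range(num_buckets)]

    def choices(key):
        # Two bucket candidates from independent halves of a random 64-bit key.
        return key % num_buckets, (key >> 32) % num_buckets

    inserted = 0
    while True:
        current = rng.getrandbits(64)
        placed = False
        for _ in range(max_kicks):
            for b in choices(current):
                if len(buckets[b]) < slots_per_bucket:
                    buckets[b].append(current)
                    placed = True
                    break
            if placed:
                break
            # Both candidate buckets are full: evict a random resident
            # and try to re-place the evicted key on the next iteration.
            b = rng.choice(choices(current))
            j = rng.randrange(slots_per_bucket)
            current, buckets[b][j] = buckets[b][j], current
        if not placed:
            break
        inserted += 1
    return inserted / (num_buckets * slots_per_bucket)

load = cuckoo_fill()
print(f"load factor at first failed insert: {load:.3f}")
```

Changing `slots_per_bucket` to 1 shows how much the multi-item buckets contribute.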
I understand that according to the pigeonhole principle, if the number of items is greater than the number of containers, then at least one container will have more than one item. Does it matter which container it will be? How does this apply to MD5, SHA1, SHA2 hashes?
No, it doesn't matter which container it is, and in fact this is not that important to cryptographic hashes; much more important is the birthday paradox, which says that you only need to hash sqrt(numberNeededByPigeonholePrinciple) values, on average, before finding a collision.
Thus, the hash needs to be large enough that the square root of the search space is too large to brute-force. The square root of the search space for SHA1 is 2^80, and as of March 2012, no two values have ever been found with the same SHA1 hash (though I predict that will happen within the next year or two); the same goes for SHA2, a family of hashes which all have an even larger search space. MD5 has been broken for a while, though.
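The square-root effect shows up quickly on a deliberately tiny hash. The sketch below truncates SHA-256 to 16 bits (an assumption made purely so the search finishes instantly) and hashes consecutive integers until two of them collide.

```python
import hashlib

def truncated_hash(n: int, bits: int) -> int:
    """Top `bits` bits of the SHA-256 digest of the decimal string of n."""
    digest = hashlib.sha256(str(n).encode()).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)

def first_collision(bits: int = 16):
    """Hash 0, 1, 2, ... until two inputs share a truncated hash value."""
    seen = {}
    n = 0
    while True:
        h = truncated_hash(n, bits)
        if h in seen:
            return n + 1, seen[h], n  # inputs hashed, and the colliding pair
        seen[h] = n
        n += 1

count, first, second = first_collision(16)
print(f"collision after {count} inputs: {first} and {second}")
```

The pigeonhole principle only guarantees a collision within 2^16 + 1 inputs, while the birthday estimate suggests one should usually appear after a few hundred.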
If you have more items to hash than you have slots, then you'll have hash collisions. But if you have a poor hashing algorithm, then you'll see collisions even when the items / slots ratio is very small. A good hashing algorithm (including most of the ones you'll see in the wild) will attempt to spread the resulting hashes over the entire output space as evenly as possible, and thus minimize collisions.
Note that a hash collision is not the end of the world. When used in a hash table, for instance, it just means that more than one item is stored in a slot, and the table code will have to traverse a little bit more to find or add the target item, increasing lookup time slightly.
You'll see people refer to MD5 as a "broken" hashing algorithm, when in reality, it's just a poor one to use as a cryptographic hash. It'll be better than one you build yourself.
The point of a hash function is to randomly distribute items into containers. For any good hash function, it doesn't/shouldn't "matter" which container is which as they must be indistinguishable.
This does not apply to "perfect hash" implementations which attempt to do better than random distribution — unlike the algorithms you mentioned.
As Michael mentioned, collisions happen LONG before there are as many items as slots. You must have graceful collision handling (or a perfect hash) if you want to handle the birthday paradox.
I think which application you're using the hash function for is an important distinction. Frequent collision in hashing containers, for example, can degrade performance. Frequent collision in cryptography will have far more devastating consequences (see: cryptographic hash function on Wikipedia).
Collisions happen relatively easily even with a "decent" hashing algorithm. For example, in Java,
String s = new String(new char[size]);
always hashes to 0. That is, all strings containing only \0 hash to 0 in Java.
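The behaviour follows from the documented formula for Java's String.hashCode, h = 31*h + c over the characters, in 32-bit signed arithmetic. The sketch below is a Python re-implementation of that formula for illustration, not the JDK source; it assumes characters in the Basic Multilingual Plane, where `ord` matches the UTF-16 code unit.

```python
def java_string_hash(s: str) -> int:
    """Python re-implementation of the Java String.hashCode formula."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Reinterpret the 32-bit pattern as a signed int, as Java would.
    return h - 0x1_0000_0000 if h >= 0x8000_0000 else h

print(java_string_hash("\0" * 10))                      # all-NUL string
print(java_string_hash("Aa"), java_string_hash("BB"))   # a classic colliding pair
```

Since every term is multiplied by the running hash, a string of only \0 characters never leaves 0, whatever its length.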
As for "does it matter which container will it be?", again it depends on the application. You can design hash functions that would hash "similar" objects to nearby values. This is useful when you want to search for similar objects, for example. Just hash them all and see where they fall. In this case, collisions or near-collisions are desirable, because it groups objects that are similar.
In other applications, you want even the slightest change in the object to result in an entirely different hash value. This is the case in cryptography, for example, where you want to be as certain as possible that something has not been modified. It is far more difficult to find different objects that hash to the same value in this case.
Depending on your application, cryptographic hashes like MD5, SHA1/2 etc. may not be the ideal choice, precisely because they appear as if entirely random, thus giving you collisions as predicted by the birthday paradox. Traditionally, one reason for using simple hashes based on the remainder operation is that keys were expected to be serial numbers or similar, so that a remainder operation would sustain fewer collisions than expected at random. E.g. if the keys are the integers 1..1000, you might have no collisions at all in a container of size 1009 if your hash function is the key mod 1009. People would sometimes hand-tune systems by carefully picking container size and hash function to achieve an even split.
Of course, if you have to worry about people maliciously choosing keys that will cause you difficulty, or an upstream system sending you very biased keys (because e.g. it has its own hash table and decides to process all keys that hash to X at once), you may wish to use a hash based on a keyed cryptographic hash function to defend against this.
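The 1..1000 into 1009 slots example can be checked directly. For comparison, the sketch also places the same number of keys at uniformly random slots, standing in for a hash that behaves as if it were random; the seed is arbitrary.

```python
import random

TABLE_SIZE = 1009
serial_keys = range(1, 1001)

# Remainder hashing on serial numbers: every key gets its own slot.
mod_slots = {key % TABLE_SIZE for key in serial_keys}

# A random-looking hash, modelled here as uniformly random slot choices.
rng = random.Random(0)
random_slots = {rng.randrange(TABLE_SIZE) for _ in serial_keys}

print("distinct slots with key mod 1009:", len(mod_slots))
print("distinct slots with random placement:", len(random_slots))
```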
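One way to sketch such a keyed hash is with BLAKE2b's built-in key parameter from Python's hashlib. The per-process random secret, the 8-byte digest size, and the function name are assumptions for illustration.

```python
import hashlib
import os

SECRET = os.urandom(16)  # per-process secret an outside party cannot predict

def keyed_bucket(key: bytes, num_buckets: int, secret: bytes = SECRET) -> int:
    """Bucket index derived from a keyed BLAKE2b digest of the key."""
    digest = hashlib.blake2b(key, key=secret, digest_size=8).digest()
    return int.from_bytes(digest, "big") % num_buckets

print(keyed_bucket(b"example-key", 64))
```

Without knowing the secret, an attacker or a biased upstream system cannot tell in advance which keys will share a bucket.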