Improving query access performance for unordered maps that are unchanging - c++11

I am looking for suggestions in improving the query time access for unordered maps. My code essentially just consists of 2 steps. In the first step, I populate the unordered map. After the first step, no more entries are ever added to the map. In the second step, the unordered map is only queried. Since the map is essentially unchanging, is there something that can be done to speed up the query time?
For instance, does stl provide any function that can adjust the internal allocations in the map to improve query time access? In other words, it is possible that more than one key was mapped to the same bucket in the unordered map. If more memory was allocated to the map, then chances of such a collision occurring can reduce. In that sense, I am curious as to whether there is anything that can be done knowing the fact that the unordered map will remain unchanged.

If measurements show this is important for you, then I'd suggest taking measurements for other hash table implementations outside the Standard Library, e.g. google's. Using closed hashing aka open addressing may well work better for you, especially if your hash table entries are small enough to store directly in the hash table buckets.
More generally, Marshall suggests finding a good hash function. Be careful though - sometimes a generally "bad" hash function performs better than a "good" one, if it works in nicely with some of the properties of your keys. For example, if you tend to have incrementing number, perhaps with a few gaps, then an identity (aka trivial) hash function that just returns the key can select hash buckets with far less collisions than a crytographic hash that pseudo-randomly (but repeatably) scatters keys with as little as a single bit of difference in uncorrelated buckets. Identity hashing can also help if you're looking up several nearby key values, as their buckets are probably nearby too and you'll get better cache utilisation. But, you've told us nothing about your keys, values, number of entries etc. - so I'll leave the rest with you.

You have two knobs that you can twist: The the hash function and number of buckets in the map. One is fixed at compile-time (the hash function), and the other you can modify (somewhat) at run-time.
A good hash function will give you very few collisions (non-equal values that have the same hash value). If you have many collisions, then there's not really much you can do to improve your lookup times. Worst case (all inputs hash to the same value) gives you O(N) lookup times. So that's where you want to focus your effort.
Once you have a good hash function, then you can play games with the number of buckets (via rehash) which can reduce collisions further.

Related

What happens if you don't hash a value before selecting a hash map bucket?

Since a hash map works with a modulus/division operation to select the appropriate bucket to place the value in, it seems that the chance of collision is dependent on the number of buckets, not "how good the hash function is". How good the function function is decides the likelihood of a same-hash return collision. However 'collision' in a hash map is referring to something else, it's referring to the same value AFTER the modulus operation. Assuming the key value is an integer (say 64 bit), what can be expected if the hash function for a hash map is simply the key value itself? I would venture to say that retrieval would be lot faster, as there wouldn't be a need to loop through a number of bytes and do hash operations, with an end result, with respect to hash table collisions, much the same. I mean, the exact values that end up colliding with an already occupied bucket are different values, but if the values are spread all over the place then overall the results should be very similar.
it seems that the chance of collision is dependent on the number of buckets, not "how good the hash function is
No, that is not correct. Keys are not generally distributed evenly across bucket indexes. Hashing keys tends to more evenly distribute the bucket index better than raw key.
index = key%bucket_n;
// vs
index = hash(key)%bucket_n;
Further: A good hash function works well with any bucket_n. A weak hash function improves when bucket_n is a prime.
There is a need to balance the number of entries in a table vs. the table size. If entires_n much less than table_size, OP assertions make some sense. Yet this waste lots of memory
If entires_n much greater than table_size, collisions are common. Often even worse without a hash function.
IMO, the hash table size should exponentially grow with the entry count to maintain a density less than some threshold, say 1/3. A re-hash of the table may be needed to accommodate a size change.
Since a hash map works with a modulus/division operation to select the appropriate bucket to place the value in, it seems that the chance of collision is dependent on the number of buckets, not "how good the hash function is". How good the function function is decides the likelihood of a same-hash return collision
Not quite. A poor hash function can cluster keys or make particular bits more likely than others to be set. That, in turn, can result in some buckets being more likely to be selected by the modulus operator.
Assuming the key value is an integer (say 64 bit), what can be expected if the hash function for a hash map is simply the key value itself?
In general, you can't say. There could very well be patterns in the keys that, if you just used the modulus operator, will cause some buckets to be much more full than others. A good hashing function essentially randomizes the bits so you're more likely to evenly distribute the keys in the buckets.
Assuming the key value is an integer (say 64 bit), what can be expected if the hash function for a hash map is simply the key value itself?
Many languages do exactly that. E.g. Java.
But you have to be careful, if your hash function is too trivial, it would also be trivial for an attacker to exploit hash collisions to cause a DoS in your service. This is known as a Collision Attack. Different libraries deal with that in different ways.
Java HashMap falls back to a red-black tree whenever it detects too many collisions in a single bucket. Other languages introduce randomization in the hash function, so it would be harder for an attack to exploit it.

When to use hash tables?

What are the cases when using hash table can improve performance, and when it does not? and what are the cases when using hash tables are not applicable?
What are the cases when using hash table can improve performance, and when it does not?
If you have reason to care, implement using hash tables and whatever else you're considering, put your actual data through, and measure which performs better.
That said, if the hash tables has the operations you need (i.e. you're not expecting to iterate it in sorted order, or compare it quickly to another hash table), and has millions or more (billions, trillions...) of elements, then it'll probably be your best choice, but a lot depends on the hash table implementation (especially the choice of closed vs. open hashing), object size, hash function quality and calculation cost / runtime), comparison cost, oddities of your computers memory performance at different cache levels... in short: too many things to make even an educated guess a better choice than measuring, when it matters.
and what are the cases when using hash tables are not applicable?
Mainly when:
The input can't be hashed (e.g. you're given binary blobs and don't know which bits in there are significant, but you do have an int cmp(const T&, const T&) function you could use for a std::map), or
the available/possible hash functions are very collision prone, or
you want to avoid worst-case performance hits for:
handling lots of hash-colliding elements (perhaps "engineered" by someone trying to crash or slow down your software)
resizing the hash table: unless presized to be large enough (which can be wasteful and slow when excessive memory's used), the majority of implementations will outgrow the arrays they're using for the hash table every now and then, then allocate a bigger array and copy content across: this can make the specific insertions that cause this rehashing to be much slower than the normal O(1) behaviour, even though the average is still O(1); if you need more consistent behaviour in all cases, something like a balance binary tree may serve
your access patterns are quite specialised (e.g. frequently operating on elements with keys that are "nearby" in some specific sort order), such that cache efficiency is better for other storage models that keep them nearby in memory (e.g. bucket sorted elements), even if you're not exactly relying on the sort order for e.g. iteration
We use Hash Tables to get access time of O(1). Imagine a dictionary. When you are looking for a word, eg "happy", you jump straight to 'H'. Here the hash function is determined by the starting alphabet. And then you look for happy within the H bucket (actually H bucket then HA bucket then HAP bucket anbd so on).
It doesn't make sense to use Hash Tables when your data is ordered or needs ordering like sorted numbers. (Alphabets are ordered ABCD....XYZ but it wouldn't matter if you switched A and Z, provided you know it is switched in your dictionary.)

Perfect Hash Building

Why don't we use SHA-1, md5Sum and other standard cryptography hashes for hashing. They are smart enough to avoid collisions and are also not revertible. So rather then coming up with a set of new hash function , which might have collisions , why don't we use them.
Only reason I am able to think is they require say large key say 32bit.But still avoiding collision so the look up will definitely be O(1).
Because they are very slow, for two reasons:
They aim to be crytographically secure, not only collision-resistant in general
They produce a much larger hash value than what you actually need in a hash table
Because they handle unstructured data (octet / byte streams) but the objects you need to hash are often structured and would require linearization first
Why don't we use SHA-1, md5Sum and other standard cryptography hashes for hashing. They are smart enough to avoid collisions...
Wrong because:
Two inputs cam still happen to have the same hash value. Say the hash value is 32 bit, a great general-purpose hash routine (i.e. one that doesn't utilise insights into the set of actual keys) still has at least 1/2^32 chance of returning the same hash value for any 2 keys, then 2/2^32 chance of colliding with one of those as a third key is hashed, 3/2^32 for the fourth etc..
Having distinct hash values is a very different thing from having the hash values map to distinct hash buckets in a hash table. Hash values are generally modded into the table size to select a bucket, so at best - and again for general-purpose hashing - the chance of a collision when adding an element to a hash table is #preexisting-elements / table-size.
So rather then coming up with a set of new hash function , which might have collisions , why don't we use them.
Because speed is often the programmer's goal when choosing to use a hash table over say a binary tree. If the hash values are mathematically complicated to calculate, they may take a lot longer than using a slightly more (but still not particularly) collision prone but faster-to-calculate hash function. That said, there are times when more effort on the hashing can pay off - for example, when the hash table exists on magnetic disk and the I/O costs of seeking & reading records dwarfs hash calculation effort.
antti makes an interesting point about data too... general purpose hashing routines often work on blocks of binary data with a specific starting address and a number of bytes (they may even require that number of bytes to be a multiple of 2 or 4). In many applications, data that needs to be hashed will be intermingled with data that must not be included in the hash - such as cached values, file handles, pointers/references to other data or virtual dispatch tables etc.. A common solution is to hash the desired fields separately and combine the hash keys - perhaps using exclusive-or. As there can be bit fields that should be hashed in the same byte of memory as other data that should not be hashed, you sometimes need custom code to extract those values. Still, even if some copying and padding was required beforehand, each individual field could eventually be hashed using md5, SHA-1 or whatever and those hash values could be similarly combined, so this complication doesn't really categorically rule out the approach you're interested in.
Only reason I am able to think is they require say large key say 32bit.
All other things being equal, the larger the key the better, though if the hash function is mathematically ideal then any N of its bits - where 2^N >= # hash buckets - will produce minimal collisions.
But still avoiding collision so the look up will definitely be O(1).
Again, wrong as mentioned above.
(BTW... I stress general-purpose in a couple places above. That's just because there are trivial cases where you might have some insight into the keys you'll need to hash that allows you to position them perfectly within the available hash buckets. For example, if you knew the keys were the numbers 1000, 2000, 3000 etc. up to 100000 and that you had at least 100 hash buckets, you could trivially define your hash function as x/1000 and know you'd have perfect hashing sans collisions. This situation of knowing that all your keys map to distinct hash table buckets is known as "perfect hashing" - as per your question title - a good general-purpose hash like md5 is not a perfect hash, and indeed it makes no sense to talk about perfect hashing without knowing the complete set of possible keys).

Hash table optimized for full iteration + key replacement

I have a hash table where the vast majority of accesses at run-time follow one of the following patterns:
Iterate through all key/value pairs. (The speed of this operation is critical.)
Modify keys (i.e. remove a key/value pair & add another with the same value but a different key. Detect duplicate keys & combine values if necessary.) This is done in a loop, affecting many thousands of keys, but with no other operations intervening.
I would also like it to consume as little memory as possible.
Other standard operations must be available, though they are used less frequently, e.g.
Insert a new key/value pair
Given a key, look up the corresponding value
Change the value associated with an existing key
Of course all "standard" hash table implementations, including standard libraries of most high-level-languages, have all of these capabilities. What I am looking for is an implementation that is optimized for the operations in the first list.
Issues with common implementations:
Most hash table implementations use separate chaining (i.e. a linked list for each bucket.) This works but I am hoping for something that occupies less memory with better locality of reference. Note: my keys are small (13 bytes each, padded to 16 bytes.)
Most open addressing schemes have a major disadvantage for my application: Keys are removed and replaced in large groups. That leaves deletion markers that increase the load factor, requiring the table to be re-built frequently.
Schemes that work, but are less than ideal:
Separate chaining with an array (instead of a linked list) per bucket:
Poor locality of reference, resulting from memory fragmentation as small arrays are reallocated many times
Linear probing/quadratic hashing/double hashing (with or without Brent's Variation):
Table quickly fills up with deletion markers
Cuckoo hashing
Only works for <50% load factor, and I want a high LF to save memory and speed up iteration.
Is there a specialized hashing scheme that would work well for this case?
Note: I have a good hash function that works well with both power-of-2 and prime table sizes, and can be used for double hashing, so this shouldn't be an issue.
Would Extendable Hashing help? Iterating though the keys by walking the 'directory' should be fast. Not sure if the "modify key for value" operation is any better with this scheme or not.
Based on how you're accessing the data, does it really make sense to use a hash table at all?
Since you're main use cases involve iteration - a sorted list or a btree might be a better data structure.
It doesnt seem like you really need the constant time random data access a hash table is built for.
You can do much better than a 50% load factor with cuckoo hashing.
Two hash functions with four items will get you over 90% with little effort. See this paper:
http://www.ru.is/faculty/ulfar/CuckooHash.pdf
I'm building a pre-computed dictionary using a cuckoo hash and getting a load factor of better than 99% with two hash functions and seven items per bucket.

Chained Hash Tables vs. Open-Addressed Hash Tables

Can somebody explain the main differences between (advantages / disadvantages) the two implementations?
For a library, what implementation is recommended?
Wikipedia's article on hash tables gives a distinctly better explanation and overview of different hash table schemes that people have used than I'm able to off the top of my head. In fact you're probably better off reading that article than asking the question here. :)
That said...
A chained hash table indexes into an array of pointers to the heads of linked lists. Each linked list cell has the key for which it was allocated and the value which was inserted for that key. When you want to look up a particular element from its key, the key's hash is used to work out which linked list to follow, and then that particular list is traversed to find the element that you're after. If more than one key in the hash table has the same hash, then you'll have linked lists with more than one element.
The downside of chained hashing is having to follow pointers in order to search linked lists. The upside is that chained hash tables only get linearly slower as the load factor (the ratio of elements in the hash table to the length of the bucket array) increases, even if it rises above 1.
An open-addressing hash table indexes into an array of pointers to pairs of (key, value). You use the key's hash value to work out which slot in the array to look at first. If more than one key in the hash table has the same hash, then you use some scheme to decide on another slot to look in instead. For example, linear probing is where you look at the next slot after the one chosen, and then the next slot after that, and so on until you either find a slot that matches the key you're looking for, or you hit an empty slot (in which case the key must not be there).
Open-addressing is usually faster than chained hashing when the load factor is low because you don't have to follow pointers between list nodes. It gets very, very slow if the load factor approaches 1, because you end up usually having to search through many of the slots in the bucket array before you find either the key that you were looking for or an empty slot. Also, you can never have more elements in the hash table than there are entries in the bucket array.
To deal with the fact that all hash tables at least get slower (and in some cases actually break completely) when their load factor approaches 1, practical hash table implementations make the bucket array larger (by allocating a new bucket array, and copying elements from the old one into the new one, then freeing the old one) when the load factor gets above a certain value (typically about 0.7).
There are lots of variations on all of the above. Again, please see the wikipedia article, it really is quite good.
For a library that is meant to be used by other people, I would strongly recommend experimenting. Since they're generally quite performance-crucial, you're usually best off using somebody else's implementation of a hash table which has already been carefully tuned. There are lots of open-source BSD, LGPL and GPL licensed hash table implementations.
If you're working with GTK, for example, then you'll find that there's a good hash table in GLib.
My understanding (in simple terms) is that both the methods has pros and cons, though most of the libraries use Chaining strategy.
Chaining Method:
Here the hash tables array maps to a linked list of items. This is efficient if the number of collision is fairly small. The worst case scenario is O(n) where n is the number of elements in the table.
Open Addressing with Linear Probe:
Here when the collision occurs, move on to the next index until we find an open spot. So, if the number of collision is low, this is very fast and space efficient. The limitation here is the total number of entries in the table is limited by the size of the array. This is not the case with chaining.
There is another approach which is Chaining with binary search trees. In this approach, when the collision occurs, they are stored in binary search tree instead of linked list. Hence, the worst case scenario here would be O(log n). In practice, this approach is best suited when there is a extremely nonuniform distribution.
Since excellent explanation is given, I'd just add visualizations taken from CLRS for further illustration:
Open Addressing:
Chaining:
Open addressing vs. separate chaining
Linear probing, double and random hashing are appropriate if the keys are kept as entries in the hashtable itself...
doing that is called "open addressing"
it is also called "closed hashing"
Another idea: Entries in the hashtable are just pointers to the head of a linked list (“chain”); elements of the linked list contain the keys...
this is called "separate chaining"
it is also called "open hashing"
Collision resolution becomes easy with separate chaining: just insert a key in its linked list if it is not already there
(It is possible to use fancier data structures than linked lists for this; but linked lists work very well in the average case, as we will see)
Let’s look at analyzing time costs of these strategies
Source: http://cseweb.ucsd.edu/~kube/cls/100/Lectures/lec16/lec16-25.html
If the number of items that will be inserted in a hash table isn’t known when the table is created, chained hash table is preferable to open addressing.
Increasing the load factor(number of items/table size) causes major performance penalties in open addressed hash tables, but performance degrades only linearly in chained hash tables.
If you are dealing with low memory and want to reduce memory usage, go for open addressing. If you are not worried about memory and want speed, go for chained hash tables.
When in doubt, use chained hash tables. Adding more data than you anticipated won’t cause performance to slow to a crawl.

Resources