Hashmap hashcode to internal table index conversion - algorithm

Hashmaps are usually implemented with an internal array (table) of buckets. When a hashmap is accessed by key, the key's hashcode is computed with a key-type-specific hash function. That hashcode then has to be mapped to an actual index in the internal bucket table.
key -> (hash function) -> hashcode -> (???) -> index in internal table
Sometimes the internal table shrinks or expands, depending on the hashmap's fill ratio, and the hashcode->index conversion presumably has to change with it.
For example, suppose our hash function returns a 32-bit unsigned integer value and
moment A: the internal table has capacity 10000
moment B: the internal table has capacity 100000
What algorithms or approaches are usually used to perform the hashcode->internal table index conversion? How is the table resizing issue solved for them?

Usually, a simple modulo will do the job.
To take a quick example from Wikipedia, it's as simple as that:
hash = hashfunc(key)
index = hash % array_size
As you said, the resizing happens depending on the hashmap's fill ratio. The array is reallocated (see realloc()), then the indices are recalculated for the new array size and the values are copied over to their new positions.
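As a hedged sketch (in Python, not taken from any particular library's source), the mapping and the resize/rehash step might look roughly like this; the helper names are illustrative:

def index_for(hash_code, table_size):
    # Map an arbitrarily large hash code onto a valid bucket index.
    return hash_code % table_size

def resize(old_table, new_size):
    # Reallocate the bucket array and re-place every entry by its new index.
    new_table = [[] for _ in range(new_size)]
    for bucket in old_table:
        for key, value, hash_code in bucket:
            new_table[index_for(hash_code, new_size)].append((key, value, hash_code))
    return new_table

# Moment A: capacity 10000; moment B: capacity 100000.
# The same hash code maps to different indices before and after the resize.
h = 0xDEADBEEF
print(index_for(h, 10000), index_for(h, 100000))   # -> 8559 28559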

I wrote about this here and here.
When you increase the size of your vector of indices you can be sure that an algorithm that worked well on the shorter vector will work less well on the longer one. It is possible to test beforehand and have new algorithms ready to put in place when you make the vector longer. Or, as the number of occupied indices in the current vector increases, have a background, lower-priority thread test different algorithms on the data.
As the example in one of my answers shows, a "new algorithm" need be nothing more than a different pair of matched prime numbers.

Related

What happens if you don't hash a value before selecting a hash map bucket?

Since a hash map works with a modulus/division operation to select the appropriate bucket to place the value in, it seems that the chance of collision depends on the number of buckets, not on "how good the hash function is". How good the hash function is decides the likelihood of two keys returning the same hash. However, 'collision' in a hash map refers to something else: it refers to the same value AFTER the modulus operation. Assuming the key value is an integer (say 64-bit), what can be expected if the hash function for a hash map is simply the key value itself? I would venture to say that retrieval would be a lot faster, as there wouldn't be a need to loop through a number of bytes and do hash operations, while the end result, with respect to hash table collisions, would be much the same. I mean, the exact values that end up colliding with an already occupied bucket are different values, but if the values are spread all over the place then overall the results should be very similar.
it seems that the chance of collision is dependent on the number of buckets, not "how good the hash function is"
No, that is not correct. Keys are generally not distributed evenly across bucket indexes. Hashing the key tends to distribute the bucket indexes more evenly than using the raw key:
index = key%bucket_n;
// vs
index = hash(key)%bucket_n;
Further: A good hash function works well with any bucket_n. A weak hash function improves when bucket_n is a prime.
There is a need to balance the number of entries in a table against the table size. If entries_n is much less than table_size, the OP's assertions make some sense, yet this wastes lots of memory.
If entries_n is much greater than table_size, collisions are common, and often even worse without a hash function.
IMO, the hash table size should exponentially grow with the entry count to maintain a density less than some threshold, say 1/3. A re-hash of the table may be needed to accommodate a size change.
Since a hash map works with a modulus/division operation to select the appropriate bucket to place the value in, it seems that the chance of collision is dependent on the number of buckets, not "how good the hash function is". How good the hash function is decides the likelihood of a same-hash collision
Not quite. A poor hash function can cluster keys or make particular bits more likely than others to be set. That, in turn, can result in some buckets being more likely to be selected by the modulus operator.
Assuming the key value is an integer (say 64 bit), what can be expected if the hash function for a hash map is simply the key value itself?
In general, you can't say. There could very well be patterns in the keys that, if you just used the modulus operator, will cause some buckets to be much more full than others. A good hashing function essentially randomizes the bits so you're more likely to evenly distribute the keys in the buckets.
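As a quick illustration of that effect (the bit mixer below is invented for the demo, not any library's actual hash function): integer keys that all share a pattern pile up in a couple of buckets when used raw, but spread out once their bits are mixed.

def mix(x):
    # Made-up multiplicative mixer, for demonstration only.
    x = (x * 0x9E3779B97F4A7C15) & 0xFFFFFFFFFFFFFFFF
    return x ^ (x >> 32)

bucket_n = 16
keys = [i * 8 for i in range(1000)]             # patterned keys: all multiples of 8

raw_buckets = {k % bucket_n for k in keys}
mixed_buckets = {mix(k) % bucket_n for k in keys}

print(len(raw_buckets))    # 2  -> only buckets 0 and 8 are ever used
print(len(mixed_buckets))  # typically 16 -> all buckets get used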
Assuming the key value is an integer (say 64 bit), what can be expected if the hash function for a hash map is simply the key value itself?
Many languages do exactly that. E.g. Java.
But you have to be careful, if your hash function is too trivial, it would also be trivial for an attacker to exploit hash collisions to cause a DoS in your service. This is known as a Collision Attack. Different libraries deal with that in different ways.
Java's HashMap falls back to a red-black tree whenever it detects too many collisions in a single bucket. Other languages introduce randomization into the hash function, so it is harder for an attacker to exploit it.
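As a hedged sketch of the randomization idea (the seeding scheme below is invented for illustration; real libraries use stronger keyed hashes such as SipHash): mixing a per-process random seed into the hash makes bucket placement unpredictable to an attacker.

import os

TABLE_SEED = int.from_bytes(os.urandom(8), "little")   # fresh random seed per process

def seeded_hash(key_bytes, seed=TABLE_SEED):
    # Toy FNV-1a-style loop folded with the seed; illustration only.
    h = seed ^ 0xCBF29CE484222325
    for b in key_bytes:
        h = ((h ^ b) * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h

print(seeded_hash(b"some key") % 16)   # bucket index differs from run to run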

Separate chaining in hash map

My understanding of separate chaining is that we convert a large number into a small number so that we can use it as an index. But how do we deal with a large index? For example, the current size of my hash_map is 10, and a new index calculated by the hash function is 55. So do I need to resize my hash_map every time the new index is too large?
Thanks!
A common technique is to have a hash function that computes some integer (typically 32-bit or 64-bit) and then to reduce the number to a valid index in the hash table by modding that integer by the table size. For example, if you have a 10-element hash table and your hash code is 55, you'd compute 55 mod 10 = 5 and place the item at index 5.
Depending on your programming language, there may be some edge cases to handle here (say, if the hash code could be negative, you need to ensure that your index is positive), but this general idea works pretty well and is used in many common hash table implementations.
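As a small sketch of that reduction step, including the sign fix mentioned above (names are illustrative; assumes Java-style signed 32-bit hash codes):

TABLE_SIZE = 10

def bucket_index(hash_code, table_size=TABLE_SIZE):
    # Clearing the sign bit keeps the index non-negative in languages where
    # the % operator can return a negative result for negative operands.
    return (hash_code & 0x7FFFFFFF) % table_size

print(bucket_index(55))    # 55 mod 10 = 5, as in the example above
print(bucket_index(-55))   # still a valid index in [0, 10)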

How does a hash (in a language like Ruby) work "under the hood"?

I've read here and there about hash maps/tables, and can kind of understand the concept that a hash table is essentially a finite-sized array. The function could use the modulus operator to determine which index in the array corresponds to a particular key. If collisions occur, then a linked-list can be implemented to store all the collided values. This is my very-novice understanding, and I hope someone can expound on it/correct it in the context of a Ruby hash. In Ruby, all you really have to do is
hash = {}
hash[key] = value
and this creates a key with the corresponding value. Say that you're just storing a bunch of symbols as keys and numbers as values:
hash[:a] = 1
hash[:b] = 2
...
What exactly is happening under the hood in terms of storing the values in arrays and linked-lists? What would be an example of a collision?
The Ruby Language Specification does not prescribe any particular implementation strategy for the Hash class. Every implementation is allowed to implement it however they want, provided they honor the contract.
For example, here is Rubinius's implementation, which, being written in Ruby, is pretty easy to follow: kernel/common/hash.rb This is a fairly traditional hashtable. (One other cool thing to note about this implementation is that it actually happens to be as fast as YARV's, which proves that Ruby code can be as efficient as hand-optimized C.)
Rubinius also alternatively implements the Hash class with a Hash Array Mapped Trie: kernel/common/hash_hamt.rb [Note: this implementation uses three VM primitives written in C++.]
You can switch between those two implementations using a configuration option. So, not only is the Hash implementation different between different Ruby implementations, it might even be different between two runs of the exact same program on the exact same version of the exact same Ruby implementation!
In IronRuby, Ruby's Hash class simply delegates to a .NET System.Collections.Generic.Dictionary<object, object>: Ruby/Builtins/Hash.cs
In previous versions, it didn't even delegate, it was just simply a subclass: Ruby/Builtins/Hash.cs
If you are hardcore about this you could look at the implementation directly. This is what the hash ends up using:
https://github.com/ruby/ruby/blob/c8b3f1b470e343e7408ab5883f046b1056d94ccc/st.c
The hash itself is here:
https://github.com/ruby/ruby/blob/trunk/hash.c
Most of the time, the article Diego provided in the comments will be more than enough.
In Ruby 2.4 the hash table was moved to an open-addressing model, so I will describe only how the classic hash-table structure works, not how it is implemented in 2.4 and above.
Let's imagine that we store all entries in an array. When we want to find something, we have to go through all the elements to match one. This can take a long time if we have a lot of elements. Using a hash table lets us go directly to the cell with the required value by computing the hash of its key.
The hash table stores all values in groups of bins ("storages"), held in a data structure similar to an array.
How does hash table work
When we add a new key-value pair, we need to work out into which "storage" this pair will be inserted, and we do this using the .hash method (the hash function). The value returned looks pseudo-random, but it is deterministic: the same value always produces the same number.
Roughly speaking, hash returns something equivalent to a reference to the memory location where the object is stored; for strings, however, the hash is calculated from the value itself.
Having received this pseudo-random number, we have to calculate the number of the "storage" where the key-value pair will be stored:
'a'.hash % 16 #=> 9
a - the key
16 - the number of storages
9 - the storage number
So, in Ruby insertion works in the following way:
Take the hash of the key using the internal hash function:
:c.hash #=> 2782
With the help of the modulo operation (2782 % 16) we get the number of the storage where the key-value pair should be kept: :c.hash % 16
Add the key-value pair to the linked list of the proper bin.
The search works much the same way:
Take the hash of the key;
Find the "storage";
Then iterate through that storage's list and retrieve the matching element.
In Ruby, the maximum allowed average number of elements per bin (the density) is 5. As the number of records grows, the density of elements in each storage grows too (remember, the table starts with only 16 storages).
If the density becomes large, say 10_000 elements in one "storage", we have to walk through all the elements of that linked list to find the corresponding record, and we are back to O(n) time, which is pretty bad.
To avoid this, the table is rehashed: the hash-table size is increased to the next size (16, 32, 64, 128, ...) and the positions of all current elements in the "storages" are recalculated.
A rehash occurs when the total number of elements exceeds the maximum density multiplied by the current table size:
num_entries > ST_DEFAULT_MAX_DENSITY * table->num_bins
With 16 bins and a maximum density of 5, 81 > 5 * 16, so the rehash is triggered when we add the 81st element to the table.
Check this article for a more in-depth explanation: Do You Know How Hash Table Works? (Ruby Examples)
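To tie the steps above together, here is a simplified Python sketch of the chained structure and the density-triggered rehash described in this answer; it is not Ruby's actual C implementation, just the same idea in miniature.

MAX_DENSITY = 5

class ChainedHash:
    def __init__(self):
        self.bins = [[] for _ in range(16)]     # start with 16 storages
        self.num_entries = 0

    def _bin(self, key):
        return self.bins[hash(key) % len(self.bins)]

    def __setitem__(self, key, value):
        chain = self._bin(key)
        for pair in chain:
            if pair[0] == key:                  # key already present: update it
                pair[1] = value
                return
        chain.append([key, value])
        self.num_entries += 1
        if self.num_entries > MAX_DENSITY * len(self.bins):
            self._rehash()

    def __getitem__(self, key):
        for k, v in self._bin(key):
            if k == key:
                return v
        raise KeyError(key)

    def _rehash(self):
        old = self.bins
        self.bins = [[] for _ in range(len(old) * 2)]   # 16 -> 32 -> 64 -> ...
        for chain in old:
            for k, v in chain:
                self.bins[hash(k) % len(self.bins)].append([k, v])

h = ChainedHash()
for i in range(100):
    h[i] = i * i            # the 81st insert triggers the first rehash (81 > 5 * 16)
print(h[7], len(h.bins))    # -> 49 32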

How to improve the performance when doing rehash?

At some point we need to increase the size of the hash table, and normally we just rehash, which leads to reconstructing the whole table.
Is there any better solution so that when we increase the size, we don't need to rebuild the whole thing?
You could use http://en.wikipedia.org/wiki/Extendible_hashing, although AFAIK it is used mostly for on-disk databases.
There are also general methods for smoothing out some amortised costs. Starting points for this would be http://en.wikipedia.org/wiki/Static_and_dynamic_data_structures and http://en.wikipedia.org/wiki/Dynamization. One application of this to hash tables would be to always keep two tables, one of size N and one of size 2N or so. When the smaller overflows, start creating a table of size 4N, but don't populate it straight away - populate it incrementally while using the table of size 2N. By the time the table of size 2N is full, the table of size 4N should be ready. For the special case of hash tables, extendible hashing should be better.
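A toy sketch of that two-table, incremental scheme (the migration rate and the growth threshold are arbitrary choices for illustration; inserts and lookups only, no updates or deletes):

class IncrementalHash:
    MIGRATE_PER_OP = 2          # old bins moved per operation

    def __init__(self, size=8):
        self.old = None                              # table being drained, or None
        self.new = [[] for _ in range(size)]
        self.next_old_bin = 0
        self.count = 0

    def _step_migration(self):
        # Move a couple of bins from the old table into the new one.
        if self.old is None:
            return
        for _ in range(self.MIGRATE_PER_OP):
            if self.next_old_bin == len(self.old):
                self.old = None                      # migration finished
                return
            for k, v in self.old[self.next_old_bin]:
                self.new[hash(k) % len(self.new)].append((k, v))
            self.next_old_bin += 1

    def put(self, key, value):
        self._step_migration()
        self.new[hash(key) % len(self.new)].append((key, value))
        self.count += 1
        # Start the next migration only once the previous one has finished.
        if self.old is None and self.count > 4 * len(self.new):
            self.old, self.new = self.new, [[] for _ in range(2 * len(self.new))]
            self.next_old_bin = 0

    def get(self, key):
        self._step_migration()
        for table in (self.new, self.old):           # check the new table first
            if table is None:
                continue
            for k, v in table[hash(key) % len(table)]:
                if k == key:
                    return v
        raise KeyError(key)

t = IncrementalHash()
for i in range(200):
    t.put(i, str(i))
print(t.get(150))   # -> 150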
Any time you resize, there's nothing that says you need to actually re-hash. In fact, all you actually need to do is re-mod (i.e. recompute everything's position from its existing hash).
If you cache the hash (hehe, sounds like the start of a Dr. Seuss book) then you only need to compute it once. So store the hash along with the actual data, and that will save you from needing to calculate the hash again in the future. However, I'm assuming that you're not already doing this; you didn't exactly explain the current process.
// Store these instead of the data directly. This assumes immutable data.
struct hashable_item
{
    data dat;   // the actual (immutable) payload
    int32 hash; // cached hash, computed once at insertion time
};
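For instance, a small sketch of growing a chained table by re-modding the cached hashes only (illustrative names, not the asker's codebase); the hash function itself is never called again:

def grow(buckets):
    new_buckets = [[] for _ in range(len(buckets) * 2)]
    for bucket in buckets:
        for item in bucket:                     # item = (cached_hash, data)
            cached_hash, _data = item
            new_buckets[cached_hash % len(new_buckets)].append(item)
    return new_buckets

table = [[] for _ in range(2)]
for s in ("spam", "eggs", "ham"):
    h = hash(s)
    table[h % len(table)].append((h, s))        # store the hash next to the data
table = grow(table)                             # now 4 buckets, hashes untouched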

Efficient mapping from 2^24 values to a 2^7 index

I have a data structure that stores amongst others a 24-bit wide value. I have a lot of these objects.
To minimize storage cost, I calculated the 2^7 most important values out of the 2^24 possible values and stored them in a static array. Thus I only have to save a 7-bit index to that array in my data structure.
The problem is: I get these 24-bit values and I have to convert them to my 7-bit index on the fly (no preprocessing possible). The computation is basically a search for which one of the 2^7 values fits best. Obviously, this takes some time for a big number of objects.
An obvious solution would be to create a simple mapping array of bytes with the length 2^24. But this would take 16 MB of RAM. Too much.
One observation of the 16 MB array: On average 31 consecutive values are the same. Unfortunately there are also a number of consecutive values that are different.
How would you implement this conversion from a 24-bit value to a 7-bit index saving as much CPU and memory as possible?
Hard to say without knowing what the definition is of "best fit". Perhaps a kd-tree would allow a suitable search based on proximity by some metric or other, so that you quickly rule out most candidates, and only have to actually test a few of the 2^7 to see which is best?
This sounds similar to the problem that an image processor has when reducing to a smaller colour palette. I don't actually know what algorithms/structures are used for that, but I'm sure they're look-up-able, and might help.
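If "best fit" simply means numerically closest (an assumption; for RGB-like values a three-channel metric and a kd-tree, as suggested above, fit better), a sorted copy of the 2^7 important values plus a binary search keeps each lookup to a handful of comparisons. A rough sketch:

import bisect
import random

# Stand-in for the real table of 2^7 important values.
important = sorted(random.sample(range(1 << 24), 128))

def best_index(value24):
    # Index (0..127) of the important value numerically closest to value24.
    pos = bisect.bisect_left(important, value24)
    candidates = [p for p in (pos - 1, pos) if 0 <= p < len(important)]
    return min(candidates, key=lambda p: abs(important[p] - value24))

print(best_index(0x123456))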
As an idea...
Up the index table to 8 bits, then XOR all 3 bytes of the 24-bit word into it.
Your table would then consist of this 8-bit hash value, plus the index back to the original 24-bit value.
Since your data is RGB-like, a more sophisticated hashing method may be needed.
bit24var & 0xff gives you the rightmost byte.
(bit24var >> 8) & 0xff gives you the byte beside it.
(bit24var >> 16) & 0xff gives you the byte beside that.
Yes, you are thinking correctly. It is quite likely that one or more of the 24-bit values will hash to the same index, due to the pigeonhole principle.
One method of resolving a hash clash is to use some sort of chaining.
Another idea would be to put your important values in a different array, then simply search it first. If you don't find an acceptable answer there, then you can, shudder, search the larger array.
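A rough sketch of the XOR-folding idea (illustrative only; the fold and the table layout are assumptions, and RGB-like data may want a better mixing step):

def fold8(value24):
    # Fold the three bytes of a 24-bit value into one 8-bit hash.
    return (value24 ^ (value24 >> 8) ^ (value24 >> 16)) & 0xFF

table = [[] for _ in range(256)]                # chain colliding entries per slot

def insert(value24, index7):
    table[fold8(value24)].append((value24, index7))

def lookup(value24):
    for v, idx in table[fold8(value24)]:
        if v == value24:
            return idx
    return None

insert(0xABCDEF, 42)
print(lookup(0xABCDEF))   # -> 42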
How many of the 2^24 values do you actually have? Can you sort these values and count them by counting runs of consecutive values?
Since you already know which of the 2^24 values you need to keep (i.e. the 2^7 values you have determined to be important), we can simply just filter incoming data and assign a value, starting from 0 and up to 2^7-1, to these values as we encounter them. Of course, we would need some way of keeping track of which of the important values we have already seen and assigned a label in [0,2^7) already. For that we can use some sort of tree or hashtable based dictionary implementation (e.g. std::map in C++, HashMap or TreeMap in Java, or dict in Python).
The code might look something like this (I'm using a much smaller range of values):
import random

def make_mapping(data, important):
    mapping = dict()   # dictionary to hold the final mapping
    next_index = 0     # the next free label that can be assigned to an incoming value
    for elem in data:
        if elem in important:           # check that the element is important
            if elem not in mapping:     # check that this element hasn't been assigned a label yet
                mapping[elem] = next_index
                next_index += 1         # the next new important value will get the next label
    return mapping

if __name__ == '__main__':
    important_values = [1, 5, 200000, 6, 24, 33]
    data = list(range(0, 300000))
    random.shuffle(data)
    answer = make_mapping(data, important_values)
    print(answer)
You can make the search much faster by using a hash- or tree-based set data structure for the set of important values. That would make the entire procedure O(n*log(k)) (or O(n) if it is a hashtable), where n is the size of the input and k is the number of important values.
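Concretely, the only change needed in the snippet above is to pass a set, so that elem in important becomes an average O(1) lookup instead of a scan of the list:

important_set = set(important_values)           # O(1) average membership tests
answer = make_mapping(data, important_set)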
Another idea is to represent the 24BitValue array as a bitmap. An unsigned char holds 8 bits, so one would need 2^21 array elements; that's 2,097,152 bytes (2 MB). If the corresponding bit is set, then you know that that specific 24BitValue is present in the array and needs to be checked.
One would need an iterator to walk through the array and find the next set bit. Some machines actually provide a "find first bit" operation in their instruction set.
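A minimal sketch of that bitmap (sizes as computed above; illustrative only):

PRESENT = bytearray(1 << 21)          # one bit per 24-bit value: 2^24 bits = 2 MB

def mark(value24):
    PRESENT[value24 >> 3] |= 1 << (value24 & 7)

def is_present(value24):
    return bool(PRESENT[value24 >> 3] & (1 << (value24 & 7)))

mark(0xABCDEF)
print(is_present(0xABCDEF), is_present(0xABCDEE))   # -> True False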
Good luck on your quest.
Let us know how things turn out.
Evil.
