Hash table - why should hash function and compression function be separate? - data-structures

I was wondering why there is a need to separate the hash function and compression function when using a hash table?
AFAIK, first the hash function computes the indexes, and the compression function is used to narrow them down.
When the values are inserted into the array, isn't the compressed key (the index) the only thing that counts?

If I understand your terminology correctly, the hash function should work for any array size, while the compression function is specific to the current size. So the hash function might always return the same 32-bit number for a given key, and the compression function will, for example, take that number modulo the array size to decide which index to use. Since most hashtable implementations shrink and grow dynamically as their contents change, it makes sense to separate the two.
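As a rough illustration (a sketch only; the method names hash and compress are made up, not taken from any particular library), the split looks like this in Java:

// Illustrative sketch: hash() depends only on the key, compress() depends on the
// current table capacity, so a resize only changes the compression step.
static int hash(String key) {
    return key.hashCode();                    // same 32-bit value regardless of table size
}
static int compress(int hashCode, int capacity) {
    return Math.floorMod(hashCode, capacity); // maps the hash into [0, capacity)
}
// index = compress(hash(key), table.length);
// After the table grows or shrinks, hash(key) is unchanged; only compress() yields a new index.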

Related

What happens if you don't hash a value before selecting a hash map bucket?

Since a hash map works with a modulus/division operation to select the appropriate bucket to place the value in, it seems that the chance of collision is dependent on the number of buckets, not "how good the hash function is". How good the hash function is decides the likelihood of a same-hash collision. However, 'collision' in a hash map refers to something else: it refers to the same value AFTER the modulus operation. Assuming the key value is an integer (say 64 bit), what can be expected if the hash function for a hash map is simply the key value itself? I would venture to say that retrieval would be a lot faster, as there wouldn't be a need to loop through a number of bytes and do hash operations, with an end result, with respect to hash table collisions, that is much the same. I mean, the exact values that end up colliding with an already occupied bucket are different values, but if the values are spread all over the place then overall the results should be very similar.
it seems that the chance of collision is dependent on the number of buckets, not "how good the hash function is"
No, that is not correct. Keys are generally not distributed evenly across bucket indexes. Hashing the keys tends to distribute the bucket indexes more evenly than using the raw keys.
index = key%bucket_n;
// vs
index = hash(key)%bucket_n;
Further: A good hash function works well with any bucket_n. A weak hash function improves when bucket_n is a prime.
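A tiny experiment (purely illustrative; the class name RawKeyDemo is made up) shows both effects: patterned keys used raw land in very few buckets when bucket_n shares a factor with the pattern, while a prime bucket_n spreads the same raw keys far better.

public class RawKeyDemo {
    public static void main(String[] args) {
        // Keys that follow a pattern: multiples of 4.
        System.out.println("bucket_n = 8: " + bucketsUsed(8) + " buckets used"); // only buckets 0 and 4
        System.out.println("bucket_n = 7: " + bucketsUsed(7) + " buckets used"); // all 7 buckets
    }

    static int bucketsUsed(int bucketN) {
        boolean[] used = new boolean[bucketN];
        for (int key = 0; key < 400; key += 4) {
            used[key % bucketN] = true;          // index = key % bucket_n, no hash applied
        }
        int count = 0;
        for (boolean u : used) if (u) count++;
        return count;
    }
}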
There is a need to balance the number of entries in a table against the table size. If entries_n is much less than table_size, the OP's assertions make some sense, yet this wastes lots of memory.
If entries_n is much greater than table_size, collisions are common, and often even worse without a hash function.
IMO, the hash table size should exponentially grow with the entry count to maintain a density less than some threshold, say 1/3. A re-hash of the table may be needed to accommodate a size change.
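A minimal sketch of that policy (the helper name requiredBuckets and the 1/3 threshold are just the values suggested above, not a standard):

// Grow the bucket count geometrically until density = entries / buckets drops below 1/3.
static int requiredBuckets(int entries, int currentBuckets) {
    int buckets = currentBuckets;
    while (entries * 3 > buckets) {   // density above the 1/3 threshold
        buckets *= 2;                 // exponential growth keeps the amortized cost low
    }
    return buckets;                   // caller then re-hashes all entries into the larger table
}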
Since a hash map works with a modulus/division operation to select the appropriate bucket to place the value in, it seems that the chance of collision is dependent on the number of buckets, not "how good the hash function is". How good the hash function is decides the likelihood of a same-hash collision
Not quite. A poor hash function can cluster keys or make particular bits more likely than others to be set. That, in turn, can result in some buckets being more likely to be selected by the modulus operator.
Assuming the key value is an integer (say 64 bit), what can be expected if the hash function for a hash map is simply the key value itself?
In general, you can't say. There could very well be patterns in the keys that, if you just used the modulus operator, will cause some buckets to be much more full than others. A good hashing function essentially randomizes the bits so you're more likely to evenly distribute the keys in the buckets.
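For instance, a bit-mixing finalizer in the style of MurmurHash3's 64-bit finalizer (shown here only as a sketch, not any library's actual code) makes every output bit depend on every input bit, so keys with patterns in their low or high bits still spread across the buckets:

// Sketch of a 64-bit bit mixer (MurmurHash3 fmix64-style constants and shifts).
static long mix64(long z) {
    z ^= z >>> 33;
    z *= 0xff51afd7ed558ccdL;
    z ^= z >>> 33;
    z *= 0xc4ceb9fe1a85ec53L;
    z ^= z >>> 33;
    return z;
}
// index = (int) Math.floorMod(mix64(key), (long) bucketCount);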
Assuming the key value is an integer (say 64 bit), what can be expected if the hash function for a hash map is simply the key value itself?
Many languages do exactly that. E.g. Java.
But you have to be careful, if your hash function is too trivial, it would also be trivial for an attacker to exploit hash collisions to cause a DoS in your service. This is known as a Collision Attack. Different libraries deal with that in different ways.
Java HashMap falls back to a red-black tree whenever it detects too many collisions in a single bucket. Other languages introduce randomization in the hash function, so it would be harder for an attacker to exploit it.
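A hedged sketch of the seeding idea (illustrative only; the class name SeededHasher is made up, and many runtimes use a keyed hash such as SipHash rather than this FNV-style loop): a secret seed chosen at startup is mixed into every hash, so an attacker cannot precompute a set of keys that all land in the same bucket.

import java.security.SecureRandom;

final class SeededHasher {
    // Secret, per-process seed: different on every run, unknown to the attacker.
    private static final long SEED = new SecureRandom().nextLong();

    static int hash(String key) {
        long h = SEED;
        for (int i = 0; i < key.length(); i++) {
            h = (h ^ key.charAt(i)) * 0x100000001b3L;   // FNV-1a-style mixing step
        }
        return Long.hashCode(h);   // fold 64 bits down to a 32-bit hash code
    }
}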

How to find the value for a given key if a hash function is chosen randomly from Universal family of hash functions?

I am taking a course on data structures on Coursera and I recently read about the universal family of hash functions. If I choose a hash function randomly from a universal family of hash functions, how exactly will I remap it to look up a value? If I have to remember the function chosen for each key, then I would have to maintain a list for it, and this evaluation of finding the correct hash function for a key would itself take linear time, violating the constant-time lookup of hash tables. How should I proceed with implementing it?
When making one hash map, you use one function from the family. When you rehash the entire map (typically because of lack of capacity or too many collisions) or create a separate map, you can then choose a different hashing function from the family. You wouldn't use two different functions to attempt to create the same hash map.
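For integer keys, a common universal family is h_{a,b}(x) = ((a*x + b) mod p) mod m. The sketch below (the class name UniversalHash and the choice p = 2^31 - 1 are illustrative; it assumes keys smaller than p) draws a and b once, stores them, and reuses them for every subsequent operation on that table:

import java.util.Random;

final class UniversalHash {
    private static final long P = 2_147_483_647L;  // the prime 2^31 - 1; keys are assumed < P
    private final long a, b;                        // fixed for the lifetime of this table
    private final int m;                            // current number of buckets

    UniversalHash(int buckets, Random rnd) {
        this.m = buckets;
        this.a = 1L + rnd.nextInt((int) (P - 1));   // a drawn uniformly from [1, P - 1]
        this.b = rnd.nextInt((int) P);              // b drawn uniformly from [0, P - 1]
    }

    int bucketFor(int key) {
        long x = Math.floorMod((long) key, P);      // map the key into [0, P)
        return (int) (((a * x + b) % P) % m);       // no overflow: a and x are both below 2^31
    }
}

When the table is rehashed, a new instance with fresh a and b (and the new bucket count) replaces the old one, exactly as described above.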

Hashmap hashcode to internal table index conversion

Hashmaps are usually implemented using an internal array (table) of buckets. When accessing a hashmap by key, we get the key's hashcode using a key-type-specific hash function. Then we need to map the hashcode to an actual index in the internal buckets table.
key -> (hash function) -> hashcode -> (???) -> index in internal table
Sometimes the internal table shrinks or expands, depending on the hashmap's fill ratio. The hashcode->index conversion method then probably has to change a bit as well.
For example, our hash function returns a 32-bit unsigned integer value, and:
moment A: internal table has capacity 10000
moment B: internal table has capacity 100000
What algorithms or approaches are usually used to perform the hashcode->internal table index conversion? How is the table resizing issue solved for them?
Usually, a simple modulo will do the job.
To take a quick example from Wikipedia, it's as simple as this:
hash = hashfunc(key)
index = hash % array_size
As you said, resizing happens depending on the hashmap's fill ratio. The array is reallocated (see realloc()), then the indices are recalculated given the new array size, and the values are copied to their new indices.
I wrote about this here and here.
When you increase the size of your vector of indices, you can be sure that the algorithm that worked well on the shorter vector will work less well on the longer one. It is possible to test beforehand and have new algorithms to put in place when you make the vector longer. Or, as the number of occupied indices in the current vector increases, have a background, lower-priority thread that tests different algorithms on the data.
As the example in one of my answers shows, a "new algorithm" need be nothing more than a different pair of matched prime numbers.

How to implement a dynamic-size hash table?

I know the basic principle of the hash table data structure. If I have a hash table of size N, I have to distribute my data into these N buckets as evenly as possible.
But in reality, most languages have their built-in hash table types. When I use them, I don't need to know the size of hash table beforehand. I just put anything I want into it. For example, in Ruby:
h = {}
10000000.times{ |i| h[i]=rand(10000) }
How can it do this?
See the Dynamic resizing section of the Hash table article on Wikipedia.
The usual approach is to use the same logic as a dynamic array: have some number of buckets, and when there are too many items in the hash table, create a new hash table with a larger size and move all the items to the new hash table.
Also, depending on the type of hash table, this resizing might not be necessary for correctness (i.e. it would still work even without resizing), but it is certainly necessary for performance.
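A minimal sketch of that approach (illustrative names such as SimpleIntMap and MAX_LOAD; this is not how any particular language's built-in table is written): put() checks the load factor and, when it gets too high, rebuilds the bucket array at twice the size and re-compresses every stored key against the new capacity.

import java.util.ArrayList;
import java.util.List;

class SimpleIntMap {
    private static final double MAX_LOAD = 0.75;   // resize threshold (load factor)
    private List<int[]>[] buckets = newTable(16);  // each stored entry is {key, value}
    private int size = 0;

    @SuppressWarnings("unchecked")
    private static List<int[]>[] newTable(int capacity) {
        List<int[]>[] t = new List[capacity];
        for (int i = 0; i < capacity; i++) t[i] = new ArrayList<>();
        return t;
    }

    public void put(int key, int value) {
        if (size + 1 > MAX_LOAD * buckets.length) {
            resize(buckets.length * 2);             // too full: move everything to a bigger table
        }
        List<int[]> bucket = buckets[index(key, buckets.length)];
        for (int[] e : bucket) {
            if (e[0] == key) { e[1] = value; return; }   // key already present: overwrite
        }
        bucket.add(new int[] { key, value });
        size++;
    }

    private void resize(int newCapacity) {
        List<int[]>[] bigger = newTable(newCapacity);
        for (List<int[]> bucket : buckets) {
            for (int[] e : bucket) {
                bigger[index(e[0], newCapacity)].add(e);   // re-compress against the new size
            }
        }
        buckets = bigger;
    }

    private static int index(int key, int capacity) {
        return Math.floorMod(key, capacity);        // the "hash" of an int key is the key itself here
    }
}

This is why the Ruby example above can keep inserting without ever specifying a size: an insertion occasionally pays for a full rehash, but the amortized cost per insertion stays constant.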

Basics in Universal Hashing, how to ensure accessibility

To my current understanding, universal hashing is a method whereby the hash function is chosen randomly at runtime in order to guarantee reasonable performance for any kind of input.
I understand we may do this in order to prevent manipulation by somebody deliberately choosing malicious input (a possibility if a deterministic hash function is known).
My question is the following: is it not true that we still need to guarantee that a key will be mapped to the same address every time we hash it? For instance, if we want to retrieve information but the hash function is chosen at random, how do we guarantee we can get back to our data?
A universal hash function is a family of different hash functions that have the property that with high probability, two randomly-chosen elements from the universe will not collide no matter which hash function is chosen. Typically, this is implemented by having the implementation pick a random hash function from a family of hash functions to use inside the implementation. Once this hash function is chosen, the hash table works as usual - you use this hash function to compute a hash code for an object, then put the object into the appropriate location. The hash table has to remember the choice of the hash function it made and has to use it consistently throughout the program, since otherwise (as you've noted) it would forget where it mapped each element.
Hope this helps!
