Basics in Universal Hashing, how to ensure accessibility - data-structures

to my current understanding Universal Hashing is a method whereby the hash function is chosen randomly at runtime in order to guarantee reasonable performance for any kind of input.
I understand we may do this in order to prevent manipulation by somebody choosing malicious input deliberately (a possibility of a deterministic hash function is know).
My Question is the following: Is it not true, that we still need to guarantee that a key will be mapped to the same address every time we hash it ? For instance if we want to retrieve information, but the hash function is chosen at random, how do we guarantee we can get back at our data ?

A universal hash function is a family of different hash functions that have the property that with high probability, two randomly-chosen elements from the universe will not collide no matter which hash function is chosen. Typically, this is implemented by having the implementation pick a random hash function from a family of hash functions to use inside the implementation. Once this hash function is chosen, the hash table works as usual - you use this hash function to compute a hash code for an object, then put the object into the appropriate location. The hash table has to remember the choice of the hash function it made and has to use it consistently throughout the program, since otherwise (as you've noted) it would forget where it mapped each element.
Hope this helps!

Related

What happens if you don't hash a value before selecting a hash map bucket?

Since a hash map works with a modulus/division operation to select the appropriate bucket to place the value in, it seems that the chance of collision is dependent on the number of buckets, not "how good the hash function is". How good the function function is decides the likelihood of a same-hash return collision. However 'collision' in a hash map is referring to something else, it's referring to the same value AFTER the modulus operation. Assuming the key value is an integer (say 64 bit), what can be expected if the hash function for a hash map is simply the key value itself? I would venture to say that retrieval would be lot faster, as there wouldn't be a need to loop through a number of bytes and do hash operations, with an end result, with respect to hash table collisions, much the same. I mean, the exact values that end up colliding with an already occupied bucket are different values, but if the values are spread all over the place then overall the results should be very similar.
it seems that the chance of collision is dependent on the number of buckets, not "how good the hash function is
No, that is not correct. Keys are not generally distributed evenly across bucket indexes. Hashing keys tends to more evenly distribute the bucket index better than raw key.
index = key%bucket_n;
// vs
index = hash(key)%bucket_n;
Further: A good hash function works well with any bucket_n. A weak hash function improves when bucket_n is a prime.
There is a need to balance the number of entries in a table vs. the table size. If entires_n much less than table_size, OP assertions make some sense. Yet this waste lots of memory
If entires_n much greater than table_size, collisions are common. Often even worse without a hash function.
IMO, the hash table size should exponentially grow with the entry count to maintain a density less than some threshold, say 1/3. A re-hash of the table may be needed to accommodate a size change.
Since a hash map works with a modulus/division operation to select the appropriate bucket to place the value in, it seems that the chance of collision is dependent on the number of buckets, not "how good the hash function is". How good the function function is decides the likelihood of a same-hash return collision
Not quite. A poor hash function can cluster keys or make particular bits more likely than others to be set. That, in turn, can result in some buckets being more likely to be selected by the modulus operator.
Assuming the key value is an integer (say 64 bit), what can be expected if the hash function for a hash map is simply the key value itself?
In general, you can't say. There could very well be patterns in the keys that, if you just used the modulus operator, will cause some buckets to be much more full than others. A good hashing function essentially randomizes the bits so you're more likely to evenly distribute the keys in the buckets.
Assuming the key value is an integer (say 64 bit), what can be expected if the hash function for a hash map is simply the key value itself?
Many languages do exactly that. E.g. Java.
But you have to be careful, if your hash function is too trivial, it would also be trivial for an attacker to exploit hash collisions to cause a DoS in your service. This is known as a Collision Attack. Different libraries deal with that in different ways.
Java HashMap falls back to a red-black tree whenever it detects too many collisions in a single bucket. Other languages introduce randomization in the hash function, so it would be harder for an attack to exploit it.

hashing vs hash function, don't know the difference

For example, "Consistent hashing" and "Perfect hash function", in wikipedia, I click "hashing" and the link direct to "hash function", so it seems that they have the same meaning, but why does another exist? And is there any difference when using "hashing" or "hash function"? And is it ok to call "consistent hashing" as "consistent hash function"? Thanks!
A hash function takes some input data (typically a bunch of binary bytes, but could be anything - whatever you make it to) and calculates a hash value, which is typically an integer number (but, again, can be anything). The process of doing this is called hashing.
The hash value is always the same size, no matter what the input looks like. Well, I suppose you cold make a hash function that has a variable-size output, but I haven't seen one in the wild yet. It wouldn't be very practical. Thus, by its very nature, hashing is usually a one-way calculation. You can't normally get the original data back from the hash value, because there are many more possible input data combinations than there are possible hash values.
The main advantages are:
The hash value is always the same size
The same input will always generate the same output.
If it's a good hash function, different inputs will usually generate different outputs, but it's still possible that two different inputs generate the same output (this is called a hash collision).
If you have a cryptographical hash function you also get one more advantage:
From having only the hash value, it's impossible (unfeasible) to come up with input data that would hash to this value. Never mind that it's not the original input data, any kind of input data that would hash to the given output value is impossible to find in a useful timeframe.
The results of a hash function can be used in various ways. As mentioned in other answers, hash tables are one common use-case. Verifying data integrity is another case - for example, you download a file, then hash it, then check the hash value against the value that was specified in the webpage where you downloaded the file from. If they don't match, the file was not downloaded correctly. If you combine hash values with public-key cryptography you can get digital signatures. And I'm sure there are other uses to which the principle can be put.
you can write a hash function and what it does is to hash keys to bins.
In other words the hash function is doing the hashing.
I hope that clarifies it.
HashTable is a data Structure in which a given value is mapped with a particular key for faster access of elements. - Process of populating this data structure is known as hashing.
To do hashing , you need a function which will provide logic for mapping values to keys. This function is hash function
I hope this clarifies your doubt.

Hash Table vs Dictionary

As far as I know hash table uses has key to store any item whereas dictionary uses simple key value pair to store item.it means that dictionary is a lot faster than hash table (Which I think. Please correct me if I am wrong).
Does this mean I should never use hash table?
The answer is "it depends".
A Dictionary is merely a way to map a key to a value. You can either use a library or implement one yourself.
A Hash table is a specific way to implement a dictionary where the key based upon a hash function. This function is usually based on modulo arithmetic. This means that two distinct value may end up with the hash key and therefore there will be a collision between the keys. It is then up to you (or whoever implements the hash table) to the determine how to resolve the collision. You could chain the value at the same key, re-hash and use a sub-hash table, or you may even want to start over with a new hash function (which would be expensive).
Depending on the underlying implementation of the dictionary (hash table) will affect your lookup performance.

How to find the value for a given key if a hash function is chosen randomly from Universal family of hash functions?

I am taking a course on data structures in coursera and I read recently about Universal family of hash functions. If i choose a hash function randomly from a universal family of hash functions, How will i exactly remap it to look up for a value. If i have to remember the function chosen for each key, then i should maintain a list for it. And this evaluation of finding the correct hash function for a key itself will take linear time violating the constant time look up of hash tables. How should i proceed implementing it?
When making one hash map, you use one function from the family. When you rehash the entire map (typically because of lack of capacity or too many collisions) or create a separate map, you can then choose a different hashing function from the family. You wouldn't use two different functions to attempt to create the same hash map.

What are the "new hash functions" in cuckoo hashing?

I am reading about cuckoo hashing from Pagh and Rodle and I can't understand the meaning of this paragraph:
It may happen that this process loops, as shown in Fig. 1(b).
Therefore the number of iterations is bounded by a value “MaxLoop” to
be specified in Section 2.3. If this number of iterations is reached,
we rehash the keys in the tables using new hash functions, and try
once again to accommodate the nestless key. There is no need to
allocate new tables for the rehashing: We may simply run through the
tables to delete and perform the usual insertion procedure on all keys
found not to be at their intended position in the table.
What does it mean by using new hash functions?
In the insert algorithm the table is resized. Are we supposed to have a "pool" of hash functions to use somehow? How do we create this pool?
Yes, they're expecting new hash functions, just like they say. Fortunately, they don't require a pile of new algorithms, just slightly different hashing behavior on your current data set.
Take a look at section 2.1 of the paper, and then Appendix A. It discusses the construction of random universal hash functions.
A simple, hopefully illustrative example, is to suppose you've got some normal hash function you like, that operates on blocks of bytes. We'll call it H(x). You want to use it to produce a family of new, slightly different hash functions H_n(x). Well, if H(x) is good, and your requirements are weak, you can just define H_n(x) = H(concat(n,x)). You don't have nice strong guarantees about the behaviors of H_n(x), but you'd expect most of them to be different.

Resources