I know the basic principle of the hash table data structure: if I have a hash table of size N, I have to distribute my data across the N buckets as evenly as possible.
But in reality, most languages have built-in hash table types. When I use them, I don't need to know the size of the hash table beforehand; I just put anything I want into it. For example, in Ruby:
    h = {}
    10_000_000.times { |i| h[i] = rand(10_000) }
How can it do this?
See the Dynamic resizing section of the Hash table article on Wikipedia.
The usual approach is to use the same logic as a dynamic array: have some number of buckets, and when there are too many items in the hash table, create a new hash table with a larger size and move all the items over to it.
Also, depending on the type of hash table, this resizing might not be necessary for correctness (i.e., it would still work even without resizing), but it is certainly necessary for performance.
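For illustration, here is a minimal sketch of that approach in Ruby, assuming separate chaining and a doubling growth policy (TinyHash, the grow-at-load-factor-1.0 trigger, and the factor of 2 are all illustrative choices, not anything a particular language guarantees):

    class TinyHash
      def initialize(capacity = 8)
        @buckets = Array.new(capacity) { [] }  # one chain (array) per bucket
        @count = 0
      end

      def []=(key, value)
        grow if @count >= @buckets.length      # load factor hit 1.0: resize first
        bucket = @buckets[key.hash % @buckets.length]
        pair = bucket.find { |k, _| k.eql?(key) }
        if pair
          pair[1] = value                      # existing key: overwrite the value
        else
          bucket << [key, value]               # new key: append to the chain
          @count += 1
        end
      end

      def [](key)
        pair = @buckets[key.hash % @buckets.length].find { |k, _| k.eql?(key) }
        pair && pair[1]
      end

      private

      def grow
        old_buckets = @buckets
        @buckets = Array.new(old_buckets.length * 2) { [] }  # the new, larger table
        old_buckets.each do |bucket|                         # move every item over,
          bucket.each do |k, v|                              # recomputing its index
            @buckets[k.hash % @buckets.length] << [k, v]
          end
        end
      end
    end

Ruby's built-in Hash performs the same kind of transparent resizing internally (the actual implementation details differ), which is why the loop in the question never has to mention a size.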
I am studying hash tables at the moment, and I have a question about an implementation with a fixed number of buckets.
Suppose we have a hash table with 23 elements (for example). Let's use the simplest hash function (hash_value = key % table_size), with the keys being integers only. If one bucket can hold at most 1 element (no separate chaining), does that mean that once all buckets are full we will no longer be able to insert any element into the table at all? Or would we have to replace an existing element that has the same hash value as the new element?
I understand that I am imposing a lot of constraints, and a real implementation might never look like this, but I want to be sure I understand this particular case.
A real implementation usually allows a hash table to resize, but resizing takes a long time and is undesirable in some contexts. For a truly fixed-size hash table that has filled up, an insert would have to fail: typically by returning an error code or throwing an exception for the caller to handle.
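Under exactly the constraints in the question (integer keys, hash_value = key % table_size, at most one element per bucket, no resizing), an insert has to fail as soon as its target bucket is occupied by a different key. A minimal sketch in Ruby, with SlotTakenError as a made-up error type for this sketch:

    SlotTakenError = Class.new(StandardError)   # made-up error type for this sketch

    def insert(table, key, value)
      index = key % table.length                # the "simplest hash function"
      taken = table[index] && table[index][0] != key
      raise SlotTakenError, "bucket #{index} is already in use" if taken
      table[index] = [key, value]               # same key: only the value is replaced
    end

    table = Array.new(23)                       # 23 buckets, nil marks an empty one
    insert(table, 5, "a")
    insert(table, 5, "z")    # same key 5: value replaced, no error
    insert(table, 28, "b")   # 28 % 23 == 5, different key: raises SlotTakenError

So with these constraints an insert can fail on the very first collision, long before every bucket is full; real tables soften this by probing other slots (e.g., linear probing) and by resizing before they get anywhere near full.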
Or would we have to replace an existing element that has the same hash value as the new element?
In Java's HashMap, if you add a key that is equal to one already present in the hash table, only the value associated with that key is replaced by the new one; a key is never replaced merely because two different keys hash to the same value.
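You can observe the same distinction in Ruby, whose Hash likewise uses hash and eql? to decide key identity; the Collider class below is contrived so that every instance collides on the same hash value:

    class Collider
      attr_reader :name
      def initialize(name)
        @name = name
      end
      def hash
        42                                  # force every instance into one hash value
      end
      def eql?(other)
        other.is_a?(Collider) && name == other.name
      end
    end

    h = {}
    h[Collider.new("a")] = 1
    h[Collider.new("b")] = 2   # same hash, different key: both entries are kept
    h[Collider.new("a")] = 3   # equal key: only the value is replaced
    h.size                     # => 2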
Yes. An "open" hash table, which is what you are describing, has a fixed size, so it can fill up.
However, implementations will usually respond by copying all contents into a new, bigger table. In fact, they normally won't wait for the table to fill entirely, but use some criterion, for example the fraction of slots in use (sometimes called the "load factor"), to decide when it's time to expand.
Some implementations will also "shrink" themselves to a smaller table if the load factor becomes too small due to deletions.
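A small sketch of such a policy; the 0.75 and 0.25 thresholds below are illustrative, not values from any particular library:

    # Decide the next capacity from the current load factor.
    def next_capacity(count, capacity)
      load_factor = count.to_f / capacity
      return capacity * 2 if load_factor > 0.75                  # too full: grow
      return capacity / 2 if load_factor < 0.25 && capacity > 8  # mostly empty: shrink
      capacity                                                   # otherwise keep it
    end

    next_capacity(30, 32)  # => 64
    next_capacity(3, 32)   # => 16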
You'd probably find reading Google's hash table implementation, which includes some documentation of its internals, to be a good learning experience.
I was wondering why there is a need to separate the hash function and the compression function when using a hash table?
AFAIK, first the hash function computes the hash codes, and the compression function is then used to narrow them down to array indexes.
When the values are inserted into the array, isn't the compressed key (the index) the only thing that counts?
If I understand your terminology correctly, the hash function should work for an array of any size, while the compression function is specific to the current size. So the hash function might always return the same 32-bit number, and the compression will, for example, take that number modulo the array length to find which index to use. Since most hash table implementations shrink and grow dynamically, it makes sense to separate the two: when the table resizes, only the compression step changes, and the stored hash values can be reused.
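In Ruby terms, the split looks like this (String#hash plays the role of the size-independent hash function, and the modulo is the compression):

    hash_code = "apple".hash      # size-independent hash function

    index_small = hash_code % 8   # compression for an 8-bucket table
    index_large = hash_code % 32  # recomputed after growing to 32 buckets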
As far as I know, a hash table uses a hash of the key to store items, whereas a dictionary uses simple key-value pairs to store items. That makes me think a dictionary is a lot faster than a hash table (which is what I think; please correct me if I am wrong).
Does this mean I should never use a hash table?
The answer is "it depends".
A Dictionary is merely a way to map a key to a value. You can either use a library or implement one yourself.
A hash table is a specific way to implement a dictionary, where a key's slot is computed by a hash function. This function is usually based on modulo arithmetic, which means two distinct keys may end up with the same hash, so there will be collisions between keys. It is then up to you (or whoever implements the hash table) to determine how to resolve a collision: you could chain the colliding values at the same slot, re-hash and use a sub-hash table, or you may even want to start over with a new hash function (which would be expensive).
The underlying implementation of the dictionary (a hash table or something else) will therefore affect your lookup performance.
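To make the "it depends" concrete, here are two dictionary implementations side by side in Ruby: an association list with linear-time lookup, and the hash-table-backed built-in Hash with expected constant-time lookup:

    # Dictionary #1: association list. Lookup scans the pairs one by one: O(n).
    assoc = [[:one, 1], [:two, 2], [:three, 3]]
    assoc.assoc(:two)          # => [:two, 2], found by linear search

    # Dictionary #2: hash table. Lookup hashes the key to a slot: expected O(1).
    h = { one: 1, two: 2, three: 3 }
    h[:two]                    # => 2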
Can somebody please explain the concept of buckets to me simply? I understand a Dict is an array of arrays, but I cannot for the life of me make sense of this first block of code, and I can't find anything online that explains num_buckets. If you could explain it line by line, that would be great.
    module Dict
      def Dict.new(num_buckets=256)
        # Initializes a Dict with the given number of buckets.
        aDict = []
        (0...num_buckets).each do |i|
          aDict.push([])   # each of the num_buckets buckets starts as an empty array
        end
        return aDict
      end
    end
The code is meant to implement a data structure called a hash table; it is also the data structure behind Ruby's built-in Hash class.
Hash tables use hashes of keys as indexes. Because there is only a limited number of possible indexes, collisions (i.e., different keys producing the same hash index) happen. Separate chaining is one common method of collision resolution: keys are inserted into buckets, and different keys with the same hash index share the same bucket. num_buckets here is simply the number of buckets.
An image illustrating separate chaining can be found in Wikipedia's Hash table article.
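To see how those buckets get used afterwards, here is a hypothetical continuation in the same style (get_bucket and set are illustrative names; they are not necessarily what the exercise you're reading defines next):

    module Dict
      def Dict.get_bucket(aDict, key)
        # Hash the key, then compress it into a bucket index.
        aDict[key.hash % aDict.length]
      end

      def Dict.set(aDict, key, value)
        bucket = Dict.get_bucket(aDict, key)
        pair = bucket.find { |k, _| k == key }
        if pair
          pair[1] = value             # key already present: overwrite its value
        else
          bucket.push([key, value])   # new key: chain it onto the bucket
        end
      end
    end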
I'm attempting to build a simple hash table from scratch. The hash table I currently have uses an array of linked lists. The hashing function takes the hash value of a key-value pair object modulo the size of the array to compute an index. This is all well and good, but I'm wondering if I could dynamically expand my array using an array list once it starts to fill up (tell me why this is not a good idea if you think so). Obviously the hash function would seem to be compromised, since we're computing indexes from the array length. What would be a good hash function that would allow my array of linked lists to expand without compromising its integrity?
If I am understanding your question correctly, the hash function itself does not need to change; you will just have to re-hash all elements after expanding the bucket array. That can be done by iterating over the contents of the old hash table and inserting each entry into the newly expanded hash table.
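For illustration, a minimal sketch of that re-hashing step, assuming an array of buckets holding [key, value] pairs; note that only the modulo is recomputed, while the size-independent key.hash stays the same:

    def rehash(old_buckets, new_size)
      new_buckets = Array.new(new_size) { [] }
      old_buckets.each do |bucket|
        bucket.each do |key, value|
          new_buckets[key.hash % new_size] << [key, value]  # recompute only the index
        end
      end
      new_buckets
    end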