I am reading Data Structures and Algorithms & Software Principles in C to try to wrap my head around some internals of data structures, and two things are really bothering me:
(1) How do hash tables deal with deciding which item in the bucket is the item you are looking up if they all have the same hash?
e.g.
Get Key, Value
use the hash algorithm on the key to find the index to try to put the value into
if the slot is taken but holds only a single entry (no bucket yet), create a bucket, move the existing item into it, and then add the new value to the bucket as well.
now I have a bucket with a bunch of values and a "lost and found" problem: you can't tell which value belongs to which key, because all the keys map to the same hash and the items in the bucket carry no keys to search by.
This would work if the bucket saves keys as well as values for each entry, but I am confused since I can't find a site that confirms that hash tables save keys along with the values for their entries.
(2) How do hash tables tell whether the value at an index is the correct value for the key, or whether probing hit a collision and put the value elsewhere?
e.g.
Get Key, Value
hash the key to find an index (0)
the index is taken, so use a naive probe algorithm: linearly search until an empty slot is found (slot 1 is empty).
now I search for my key and it hashes to index 0. How does the hash table know that index 0 does not hold the correct item for this key, and that the item was probed into slot 1 instead?
Again, this would make sense to me if the table saved a key as well as a value for each entry, but I am not sure whether hash tables save keys along with values, have another way of ensuring that the item at the hashed index or bucket index is the correct one, or whether I am misunderstanding it.
To clarify the question: do hash tables save the key along with the value to disambiguate buckets and probe sequences, or do they use something else to avoid the ambiguity of hashes?
Sorry for the crudely formulated question but I just had to ask.
Thanks ahead of time.
Hash tables save the whole entry. An entry consists of a key and a value.
How do hash tables deal with deciding which item in the bucket is the item you are looking up if they all have the same hash?
Because the query is done by passing the key.
The purpose of hashing is to reduce the time needed to find the index. The key is hashed to find the right bucket. Then, once the candidates have been reduced from a total of N to a very small n, you can simply perform a linear search to find the right item among all the keys having the same hash.
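To make that concrete, here is a minimal sketch of a chained bucket in C; the entry layout, the function names, and the djb2 hash are assumptions of this sketch rather than anything from the book:

    #include <string.h>
    #include <stddef.h>

    #define TABLE_SIZE 101

    struct entry {
        const char   *key;      /* the key is stored alongside the value */
        int           value;
        struct entry *next;     /* next entry in the same bucket */
    };

    static struct entry *buckets[TABLE_SIZE];

    /* A common string hash (djb2); any decent hash works here. */
    static unsigned long hash(const char *s)
    {
        unsigned long h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h;
    }

    /* Hash the key to pick a bucket, then linearly scan that bucket's
       short list, comparing the stored keys against the queried key. */
    static struct entry *table_get(const char *key)
    {
        struct entry *e = buckets[hash(key) % TABLE_SIZE];
        while (e != NULL && strcmp(e->key, key) != 0)
            e = e->next;
        return e;               /* NULL if the key is absent */
    }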
How do hash tables tell whether the value at an index is the correct value for the key, or whether probing hit a collision and put the value elsewhere?
Again, that's because the hash table saves entries instead of just values. If, after a collision, the hash table sees that the key found at this bucket is not the key being queried, it knows the collision occurred earlier and the entry may be in the next bucket. Please note that in this case the bucket stores a single entry, unlike the first answer, where a bucket may store a linked list or a tree of entries.
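A minimal sketch of that open-addressing lookup in C; the slot layout, the NULL-key convention for empty slots, and the djb2 hash are my own assumptions, not taken from the answer above:

    #include <string.h>
    #include <stddef.h>

    #define TABLE_SIZE 101

    struct slot {
        const char *key;        /* NULL marks an empty slot */
        int         value;
    };

    static struct slot slots[TABLE_SIZE];

    static unsigned long hash(const char *s)
    {
        unsigned long h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h;
    }

    /* Start at the hashed index. If the stored key differs from the
       queried key, an earlier collision displaced somebody, so keep
       probing the next slot until a match or an empty slot is found. */
    static struct slot *probe_get(const char *key)
    {
        size_t start = hash(key) % TABLE_SIZE;
        size_t i = start;
        while (slots[i].key != NULL) {
            if (strcmp(slots[i].key, key) == 0)
                return &slots[i];          /* the stored key matches */
            i = (i + 1) % TABLE_SIZE;      /* linear probing */
            if (i == start)
                break;                     /* wrapped around: table full */
        }
        return NULL;                       /* key not present */
    }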
Related
I am studying hash tables at the moment and have a question about an implementation with a fixed number of buckets.
Suppose we have a hash table with 23 buckets (for example). Let's use the simplest hash function (hash_value = key % table_size), with the keys being integers only. If we say that one bucket can hold at most one element (no separate chaining), does that mean that once all the buckets are full we can no longer insert any element into the table at all? Or will we have to replace the element that has the same hash value with a new element?
I do understand that I am imposing a lot of constraints, and a real implementation might never look like that, but I want to be sure I understand this particular case.
A real implementation usually allows a hash table to resize, but resizing takes a long time and is undesirable. With a fixed-size hash table, an insert into a full table would probably return an error code or throw an exception, leaving it to the user to handle that error or not.
Or will we have to replace the element that has the same hash value with a new element?
In Java's HashMap, if you add a key equal to one already present in the table, only the value associated with that key is replaced by the new one; this never happens merely because two keys hash to the same value.
Yes. An "open" hash table - which you are describing - has a fixed size, so it can fill up.
However, implementations will usually respond by copying all the contents into a new, bigger table. In fact, they normally won't wait for the table to fill entirely, but use some criterion - for example, the fraction of all space used (sometimes called the "load factor") - to decide when it's time to expand.
Some implementations will also "shrink" themselves to a smaller table if the load factor becomes too small due to deletions.
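As a rough sketch of that criterion in C (the struct layout and the 0.75 threshold are assumptions of this sketch; real libraries pick their own):

    #include <stddef.h>
    #include <stdbool.h>

    struct table {
        size_t used;        /* entries currently stored */
        size_t capacity;    /* total number of slots */
    };

    /* Grow before the table actually fills: once used/capacity crosses
       the threshold, the implementation allocates a larger slot array
       and re-hashes every entry into it, because each key's slot index
       depends on the table size. */
    static bool needs_resize(const struct table *t)
    {
        return (double)t->used / (double)t->capacity > 0.75;
    }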
You'd probably find reading Google's hash table implementation, which includes some documentation of its internals, to be a good learning experience.
As far as I know, a hash table uses a hash of the key to store any item, whereas a dictionary uses a simple key-value pair to store an item. Does that mean a dictionary is a lot faster than a hash table? (That's what I think; please correct me if I am wrong.)
Does this mean I should never use a hash table?
The answer is "it depends".
A Dictionary is merely a way to map a key to a value. You can either use a library or implement one yourself.
A hash table is a specific way to implement a dictionary, where the key is mapped to a slot by a hash function. This function is usually based on modulo arithmetic. This means that two distinct keys may end up with the same hash, and therefore there will be a collision between the keys. It is then up to you (or whoever implements the hash table) to determine how to resolve the collision. You could chain the values at the same slot, re-hash into a sub-hash table, or even start over with a new hash function (which would be expensive).
The underlying implementation of the dictionary (e.g., a hash table) will therefore affect your lookup performance.
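A tiny C demo of the modulo-arithmetic point; the table size 23 and the keys 5 and 28 are arbitrary choices for illustration:

    #include <stdio.h>

    int main(void)
    {
        int table_size = 23;
        /* 5 and 28 are distinct keys, yet both map to slot 5 because
           28 % 23 == 5 % 23: a collision the implementation has to
           resolve by chaining, probing, or re-hashing. */
        printf("key %d -> slot %d\n", 5, 5 % table_size);
        printf("key %d -> slot %d\n", 28, 28 % table_size);
        return 0;
    }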
I have a giant hashmap in Redis that keeps on growing: around 50k new entries per day, though on subsequent days this number will shrink as the same keys repeat. I want to limit this hashmap to 1 million key-value pairs, evicting on an LRU basis.
I know I could do this with Redis' sorted set, using the timestamp as the score and removing entries that fall out of range, but I need to retain the key-value pair structure. If I move to a sorted set I lose that structure, since the value would be the timestamp, and I would need to perform string operations on the key to achieve the equivalent hash functionality (not feasible).
So my requirements are:
- a key-value pair structure
- getting values for a given key, or for multiple keys
- trimming the structure to 1 million pairs with an LRU policy
Can I achieve this with a hash? I am also open to other suggestions. Thanks in advance.
Why not use both a HASH and a Sorted Set?
Save data in HASH
HSET KEY FIELD VALUE
With your data saved in HASH, you can achieve both "Key value pair structure" and "get values based on the given key or multiple keys".
Implement LRU with Sorted Set
ZADD KEY TIMESTAMP FIELD
With a Sorted Set, you can save the timestamp as the score of a field. Each time you access a field in the HASH, update that field's score with the current timestamp.
If the number of members in the Sorted Set grows beyond a million, get the fields whose scores are smallest (with ZCARD and ZRANGE), then remove those fields from both the HASH and the Sorted Set, as sketched below.
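A hedged sketch of the command flow; the key names mydata and mydata:lru, the fields like user:42, and the timestamps are made up for illustration:

    HSET mydata user:42 "some value"         # store the pair
    ZADD mydata:lru 1700000000 user:42       # score = last-access timestamp

    # on every read, refresh the field's timestamp:
    HGET mydata user:42
    ZADD mydata:lru 1700000100 user:42

    # trim: if ZCARD mydata:lru exceeds 1000000, fetch the oldest extras
    # (lowest scores come first under ZRANGE) and delete them from both:
    ZRANGE mydata:lru 0 99                   # e.g. the 100 oldest fields
    HDEL mydata user:17 user:3 ...
    ZREM mydata:lru user:17 user:3 ...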
I am aware of how a hash table works, but I am not sure about the possible implementation of get(key) when multiple values are stored at the same place with the help of a linked list.
For example:
set(1,'Val1') gets stored at index 7
set(2,'Val2') also gets stored at index 7. (Internally, the implementation creates a linked list and stores a pointer to it at index 7. That's understandable.)
But suppose I now call get(2). How does the hash table know which value to return? My hash function will resolve this to index 7, but at index 7 there are two values.
One possible way is to store both the key and the value in each linked node.
Is there any other implementation possible?
Go through the linked list and do a linear search for the key '2'. The properties of the hash function and the hash table size should guarantee that the length of these lists is O(1) on average.
I think you are missing the fact that hash tables have to store their keys. The hash function is only there to speed up insertion and lookup.
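For completeness, a minimal C sketch of such a node and of set(); the node layout, TABLE_SIZE, and the assumption of non-negative integer keys are choices of this sketch, not of the answer above:

    #include <stdlib.h>

    #define TABLE_SIZE 101

    struct node {
        int          key;    /* kept so get() can tell entries apart */
        const char  *value;
        struct node *next;
    };

    static struct node *buckets[TABLE_SIZE];

    /* Prepend a (key, value) node to the bucket the key hashes to.
       Keys that collide end up in the same list, but each node still
       carries its own key, so a later get() can walk the list and
       compare keys to pick the right value. Keys are assumed >= 0. */
    static void set(int key, const char *value)
    {
        struct node *n = malloc(sizeof *n);
        if (n == NULL)
            return;                           /* allocation failed */
        n->key = key;
        n->value = value;
        n->next = buckets[key % TABLE_SIZE];
        buckets[key % TABLE_SIZE] = n;
    }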
"There can be many keys (and their associated values) in each partition, but the records for any given key are all in a single partition." This is a line of a famous hadoop text book. I am not getting the entire meaning of the second part of it which says "but the records for any given key are all in a single partition." Is this mean that all the records for a single key should be in a single partition or something else.
but the records for any given key are all in a single partition
If you have one key, then that key and its associated value must be on a single partition. Sometimes the value can be rather large, but that is a constraint on the size of a value: it must be small enough to fit on a single partition.
Note that there may be other constraints on both keys and values, depending on what you use for back-end storage; for example, a single key-value pair may be required to fit into a node's memory.
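The mechanism behind that guarantee is simply a deterministic partition function: hash the key and take it modulo the number of partitions (Hadoop's default HashPartitioner does essentially this with key.hashCode()). A rough C sketch, with the djb2 hash as an arbitrary stand-in:

    /* All records with the same key produce the same hash, hence the
       same partition number, so they all land on one partition. */
    unsigned partition_for(const char *key, unsigned num_partitions)
    {
        unsigned long h = 5381;                /* djb2 string hash */
        while (*key)
            h = h * 33 + (unsigned char)*key++;
        return (unsigned)(h % num_partitions);
    }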