Can two items in a hashmap be in different locations but have the same hashcode? - data-structures

Can two items in a hashmap be in different locations but have the same hashcode?
I'm new to hashing, and I've recently learned about hashmaps. I was wondering whether two objects with the same hashcode can possibly go to different locations in a hashmap?
I'm not completely sure and would appreciate any help

As #Dai pointed out in the comments, this will depend on what kind of hash table you're using. (Turns out, there's a bunch of different ways to make a hash table, and no one data structure is "the" way that hash tables work!)
One of more common hash tables uses a strategy called closed addressing. In closed addressing, every item is mapped to a slot based on its hash code and stored with all other items that also end up in that slot. Lookups are then done by finding which bucket to look in, then inspecting all the items in that bucket. In that case, any two items with the same hash code will end up in the same bucket. (They can't literally occupy the same spot within that bucket, though.)
Another strategy for building hash tables uses an approach called open addressing. This is a family of different methods that are all based on the following idea. We require that each slot in the table store at most one element. As before, to do an insertion, we use the element's hash code to figure out which slot to put it in. If the slot is empty, great! We put the element there. If that slot is full, we can't put the item there. Instead, using some predictable strategy, we start looking at other slots until we find a free one, then put the item there. (The simplest way of doing this, linear probing, works by trying the next slot after the desired slot, then the next one, etc., wrapping around if need be.) In this system, since we can't store multiple items in the same spot, no, two elements with the same hash code don't have to (and in fact, can't!) occupy the same spot.
A more recent hashing strategy that's becoming more popular is cuckoo hashing. In cuckoo hashing, we maintain some small number of separate hash tables (typically, two or three), where each slot can only hold one item. To insert an element, we try placing it in the first table at a spot determined by its hash code. If that spot is free, great! We put the item there. If not, we kick out the item there and try putting that item in the next table. This process repeats until eventually everything comes to rest or we get caught in a loop. Like open addressing, this system prevents multiple items from being stored in the same slot, so two elements with the same hash code might go to different places. (There are variations on cuckoo hashing in which each table slot can store a fixed but small number of items, in which case you could have two items with the same hash code in the same spot. But it's not guaranteed.)
There are some other hashing schemes I didn't describe here. FKS perfect hashing works by using two layers of hash tables, along the lines of closed addressing, but periodically rebuilds the whole table to ensure that no one bucket is too overloaded. Extendible hashing uses a trie-like structure to grow overflowing buckets once they become too fully. Hopscotch hashing is a hybrid between linear probing and chained hashing and plays well with concurrency. But hopefully this gives you a sense of how the type of hash table you use influences the answer to your question!

Related

Improving query access performance for unordered maps that are unchanging

I am looking for suggestions in improving the query time access for unordered maps. My code essentially just consists of 2 steps. In the first step, I populate the unordered map. After the first step, no more entries are ever added to the map. In the second step, the unordered map is only queried. Since the map is essentially unchanging, is there something that can be done to speed up the query time?
For instance, does stl provide any function that can adjust the internal allocations in the map to improve query time access? In other words, it is possible that more than one key was mapped to the same bucket in the unordered map. If more memory was allocated to the map, then chances of such a collision occurring can reduce. In that sense, I am curious as to whether there is anything that can be done knowing the fact that the unordered map will remain unchanged.
If measurements show this is important for you, then I'd suggest taking measurements for other hash table implementations outside the Standard Library, e.g. google's. Using closed hashing aka open addressing may well work better for you, especially if your hash table entries are small enough to store directly in the hash table buckets.
More generally, Marshall suggests finding a good hash function. Be careful though - sometimes a generally "bad" hash function performs better than a "good" one, if it works in nicely with some of the properties of your keys. For example, if you tend to have incrementing number, perhaps with a few gaps, then an identity (aka trivial) hash function that just returns the key can select hash buckets with far less collisions than a crytographic hash that pseudo-randomly (but repeatably) scatters keys with as little as a single bit of difference in uncorrelated buckets. Identity hashing can also help if you're looking up several nearby key values, as their buckets are probably nearby too and you'll get better cache utilisation. But, you've told us nothing about your keys, values, number of entries etc. - so I'll leave the rest with you.
You have two knobs that you can twist: The the hash function and number of buckets in the map. One is fixed at compile-time (the hash function), and the other you can modify (somewhat) at run-time.
A good hash function will give you very few collisions (non-equal values that have the same hash value). If you have many collisions, then there's not really much you can do to improve your lookup times. Worst case (all inputs hash to the same value) gives you O(N) lookup times. So that's where you want to focus your effort.
Once you have a good hash function, then you can play games with the number of buckets (via rehash) which can reduce collisions further.

Is there a specific scenario of a hash table that isn't full yet an insertion can't occur?

What I mean to ask is for a hash-table following the standard size of a prime number, is it possible to have some scenario (of inserted keys) where no further insertion of a given element is possible even though there's some empty slots? What kind of hash-function would achieve that?
So, most hash functions allow for collisions ("Hash Collisions" is the phrase you should google to understand this better, by the way.) Collisions are handled by having a secondary data structure, like a list, to store all of the values inserted at keys with the same hash.
Because these data structures can generally store arbitrarily many elements, you will always be able to insert into the hash table, but the performance will get worse and worse, approaching the performance of the backing data structure.
If you do not have a backing data structure, then you can be unable to insert as soon as two things get added to the same position. Since a good hash function distributes things evenly and effectively randomly, this would happen pretty quickly (see "The Birthday Problem").
There are failure-to-insert scenarios for some but not all hash table implementations.
For example, closed hashing aka open addressing implementations use some logic to create a sequence of buckets in which they'll "probe" for values not found at the hashed-to bucket due to collisions. In the real world, sometimes the sequence-creation is pretty basic, for example:
the programmer might have hard-coded N prime numbers, thinking the odds of adding in each of those in turn and still not finding an empty bucket are low (but a malicious user who knows the hash table design may be able to calculate values to make the table fail, or it may simply be so full that the odds are no longer good, or - while emptier - a statistical freak event)
the programmer might have done something like picked a prime number they liked - say 13903 - to add to the last-probed bucket each time until a free one is found, but if the table size happens to be 13903 too it'll keep checking the same bucket.
Still, there are probing approaches such as linear probing that guarantee to try all buckets (unless the implementation goes out of its way to put a limit on retries). It has some other "issues" though, and won't always be the best choice.
If a hash table is implemented using open addressing instead of separate chaining, then it is a good idea to leave at least 1 slot empty to simplify the algorithm.
In open addressing when we are trying to find an element, we first compute the hash index i, then check the table at indexes {i, i + 1, i + 2, ... N - 1, (wrapping around) 0, 1, 2, ...}, until we either find the element we want or hit an empty slot. You can see that in this algorithm, if no slot is empty but the element can't be found, then the search would loop forever.
However, I should emphasize that enforcing merely simplifies the search algorithm. Because alternatively, the search algorithm can remember the starting index i, and halt the search if the entire table has been scanned and it lands back at index i.

Hash table optimized for full iteration + key replacement

I have a hash table where the vast majority of accesses at run-time follow one of the following patterns:
Iterate through all key/value pairs. (The speed of this operation is critical.)
Modify keys (i.e. remove a key/value pair & add another with the same value but a different key. Detect duplicate keys & combine values if necessary.) This is done in a loop, affecting many thousands of keys, but with no other operations intervening.
I would also like it to consume as little memory as possible.
Other standard operations must be available, though they are used less frequently, e.g.
Insert a new key/value pair
Given a key, look up the corresponding value
Change the value associated with an existing key
Of course all "standard" hash table implementations, including standard libraries of most high-level-languages, have all of these capabilities. What I am looking for is an implementation that is optimized for the operations in the first list.
Issues with common implementations:
Most hash table implementations use separate chaining (i.e. a linked list for each bucket.) This works but I am hoping for something that occupies less memory with better locality of reference. Note: my keys are small (13 bytes each, padded to 16 bytes.)
Most open addressing schemes have a major disadvantage for my application: Keys are removed and replaced in large groups. That leaves deletion markers that increase the load factor, requiring the table to be re-built frequently.
Schemes that work, but are less than ideal:
Separate chaining with an array (instead of a linked list) per bucket:
Poor locality of reference, resulting from memory fragmentation as small arrays are reallocated many times
Linear probing/quadratic hashing/double hashing (with or without Brent's Variation):
Table quickly fills up with deletion markers
Cuckoo hashing
Only works for <50% load factor, and I want a high LF to save memory and speed up iteration.
Is there a specialized hashing scheme that would work well for this case?
Note: I have a good hash function that works well with both power-of-2 and prime table sizes, and can be used for double hashing, so this shouldn't be an issue.
Would Extendable Hashing help? Iterating though the keys by walking the 'directory' should be fast. Not sure if the "modify key for value" operation is any better with this scheme or not.
Based on how you're accessing the data, does it really make sense to use a hash table at all?
Since you're main use cases involve iteration - a sorted list or a btree might be a better data structure.
It doesnt seem like you really need the constant time random data access a hash table is built for.
You can do much better than a 50% load factor with cuckoo hashing.
Two hash functions with four items will get you over 90% with little effort. See this paper:
http://www.ru.is/faculty/ulfar/CuckooHash.pdf
I'm building a pre-computed dictionary using a cuckoo hash and getting a load factor of better than 99% with two hash functions and seven items per bucket.

Chained Hash Tables vs. Open-Addressed Hash Tables

Can somebody explain the main differences between (advantages / disadvantages) the two implementations?
For a library, what implementation is recommended?
Wikipedia's article on hash tables gives a distinctly better explanation and overview of different hash table schemes that people have used than I'm able to off the top of my head. In fact you're probably better off reading that article than asking the question here. :)
That said...
A chained hash table indexes into an array of pointers to the heads of linked lists. Each linked list cell has the key for which it was allocated and the value which was inserted for that key. When you want to look up a particular element from its key, the key's hash is used to work out which linked list to follow, and then that particular list is traversed to find the element that you're after. If more than one key in the hash table has the same hash, then you'll have linked lists with more than one element.
The downside of chained hashing is having to follow pointers in order to search linked lists. The upside is that chained hash tables only get linearly slower as the load factor (the ratio of elements in the hash table to the length of the bucket array) increases, even if it rises above 1.
An open-addressing hash table indexes into an array of pointers to pairs of (key, value). You use the key's hash value to work out which slot in the array to look at first. If more than one key in the hash table has the same hash, then you use some scheme to decide on another slot to look in instead. For example, linear probing is where you look at the next slot after the one chosen, and then the next slot after that, and so on until you either find a slot that matches the key you're looking for, or you hit an empty slot (in which case the key must not be there).
Open-addressing is usually faster than chained hashing when the load factor is low because you don't have to follow pointers between list nodes. It gets very, very slow if the load factor approaches 1, because you end up usually having to search through many of the slots in the bucket array before you find either the key that you were looking for or an empty slot. Also, you can never have more elements in the hash table than there are entries in the bucket array.
To deal with the fact that all hash tables at least get slower (and in some cases actually break completely) when their load factor approaches 1, practical hash table implementations make the bucket array larger (by allocating a new bucket array, and copying elements from the old one into the new one, then freeing the old one) when the load factor gets above a certain value (typically about 0.7).
There are lots of variations on all of the above. Again, please see the wikipedia article, it really is quite good.
For a library that is meant to be used by other people, I would strongly recommend experimenting. Since they're generally quite performance-crucial, you're usually best off using somebody else's implementation of a hash table which has already been carefully tuned. There are lots of open-source BSD, LGPL and GPL licensed hash table implementations.
If you're working with GTK, for example, then you'll find that there's a good hash table in GLib.
My understanding (in simple terms) is that both the methods has pros and cons, though most of the libraries use Chaining strategy.
Chaining Method:
Here the hash tables array maps to a linked list of items. This is efficient if the number of collision is fairly small. The worst case scenario is O(n) where n is the number of elements in the table.
Open Addressing with Linear Probe:
Here when the collision occurs, move on to the next index until we find an open spot. So, if the number of collision is low, this is very fast and space efficient. The limitation here is the total number of entries in the table is limited by the size of the array. This is not the case with chaining.
There is another approach which is Chaining with binary search trees. In this approach, when the collision occurs, they are stored in binary search tree instead of linked list. Hence, the worst case scenario here would be O(log n). In practice, this approach is best suited when there is a extremely nonuniform distribution.
Since excellent explanation is given, I'd just add visualizations taken from CLRS for further illustration:
Open Addressing:
Chaining:
Open addressing vs. separate chaining
Linear probing, double and random hashing are appropriate if the keys are kept as entries in the hashtable itself...
doing that is called "open addressing"
it is also called "closed hashing"
Another idea: Entries in the hashtable are just pointers to the head of a linked list (“chain”); elements of the linked list contain the keys...
this is called "separate chaining"
it is also called "open hashing"
Collision resolution becomes easy with separate chaining: just insert a key in its linked list if it is not already there
(It is possible to use fancier data structures than linked lists for this; but linked lists work very well in the average case, as we will see)
Let’s look at analyzing time costs of these strategies
Source: http://cseweb.ucsd.edu/~kube/cls/100/Lectures/lec16/lec16-25.html
If the number of items that will be inserted in a hash table isn’t known when the table is created, chained hash table is preferable to open addressing.
Increasing the load factor(number of items/table size) causes major performance penalties in open addressed hash tables, but performance degrades only linearly in chained hash tables.
If you are dealing with low memory and want to reduce memory usage, go for open addressing. If you are not worried about memory and want speed, go for chained hash tables.
When in doubt, use chained hash tables. Adding more data than you anticipated won’t cause performance to slow to a crawl.

Best way to remove an entry from a hash table

What is the best way to remove an entry from a hashtable that uses linear probing? One way to do this would be to use a flag to indicate deleted elements? Are there any ways better than this?
An easy technique is to:
Find and remove the desired element
Go to the next bucket
If the bucket is empty, quit
If the bucket is full, delete the element in that bucket and re-add it to the hash table using the normal means. The item must be removed before re-adding, because it is likely that the item could be added back into its original spot.
Repeat step 2.
This technique keeps your table tidy at the expense of slightly slower deletions.
It depends on how you handle overflow and whether (1) the item being removed is in an overflow slot or not, and (2) if there are overflow items beyond the item being removed, whether they have the hash key of the item being removed or possibly some other hash key. [Overlooking that double condition is a common source of bugs in deletion implementations.]
If collisions overflow into a linked list, it is pretty easy. You're either popping up the list (which may have gone empty) or deleting a member from the middle or end of the linked list. Those are fun and not particularly difficult. There can be other optimizations to avoid excessive memory allocations and freeings to make this even more efficient.
For linear probing, Knuth suggests that a simple approach is to have a way to mark a slot as empty, deleted, or occupied. Mark a removed occupant slot as deleted so that overflow by linear probing will skip past it, but if an insertion is needed, you can fill the first deleted slot that you passed over [The Art of Computer Programming, vol.3: Sorting and Searching, section 6.4 Hashing, p. 533 (ed.2)]. This assumes that deletions are rather rare.
Knuth gives a nice refinment as Algorithm R6.4 [pp. 533-534] that instead marks the cell as empty rather than deleted, and then finds ways to move table entries back closer to their initial-probe location by moving the hole that was just made until it ends up next to another hole.
Knuth cautions that this will move existing still-occupied slot entries and is not a good idea if pointers to the slots are being held onto outside of the hash table. [If you have garbage-collected- or other managed-references in the slots, it is all right to move the slot, since it is the reference that is being used outside of the table and it doesn't matter where the slot that references the same object is in the table.]
The Python hash table implementation (arguable very fast) uses dummy elements to mark deletions. As you grow or shrink or table (assuming you're not doing a fixed-size table), you can drop the dummies at the same time.
If you have access to a copy, have a look at the article in Beautiful Code about the implementation.
The best general solutions I can think of include:
If you're can use a non-const iterator (ala C++ STL or Java), you should be able to remove them as you encounter them. Presumably, though, you wouldn't be asking this question unless you're using a const iterator or an enumerator which would be invalidated if the underlying collection is modified.
As you said, you could mark a deleted flag within the contained object. This doesn't release any memory or reduce collisions on the key, though, so it's not the best solution. Also requires the addition of a property on the class that probably doesn't really belong there. If this bothers you as much as it would me, or if you simply can't add a flag to the stored object (perhaps you don't control the class), you could store these flags in a separate hash table. This requires the most long-term memory use.
Push the keys of the to-be-removed items into a vector or array list while traversing the hash table. After releasing the enumerator, loop through this secondary list and remove the keys from the hash table. If you have a lot of items to remove and/or the keys are large (which they shouldn't be), this may not be the best solution.
If you're going to end up removing more items from the hash table than you're leaving in there, it may be better to create a new hash table, and as you traverse your original one, add to the new hash table only the items you're going to keep. Then replace your reference(s) to the old hash table with the new one. This saves a secondary list iteration, but it's probably only efficient if the new hash table will have significantly fewer items than the original one, and it definitely only works if you can change all the references to the original hash table, of course.
If your hash table gives you access to its collection of keys, you may be able to iterate through those and remove items from the hash table in one pass.
If your hash table or some helper in your library provides you with predicate-based collection modifiers, you may have a Remove() function to which you can pass a lambda expression or function pointer to identify the items to remove.
A common technique when time is a factor is to have a second table of deleted items, and clean up the main table when you have time. Commonly used in search engines.
How about enhancing the hash table to contain pointers like a linked list?
When you insert, if the bucket is full, create a pointer from this bucket to the bucket where the new field in stored.
While deleting something from the hashtable, the solution will be equivalent to how you write a function to delete a node from linkedlist.

Resources