Relation between the load factor and time complexity in hash tables? - data-structures

Regarding hash tables, we measure the performance of the hash table using the load factor. But I need to understand the relationship between the load factor and the time complexity of the hash table. My understanding is that the relation is directly proportional: we take O(1) for computing the hash function to find the index. If the load factor is low, there aren't many elements in the table, so the chance of finding the key-value pair at its home index is high, the search does little work, and the complexity stays constant. On the other hand, when the load factor is high, the chance of finding the key-value pair at its exact position is low, so we will need to do some searching and the complexity rises toward O(n). The same can be said for the insert operation. Is this right?

This is a great question, and the answer is "it depends on what kind of hash table you're using."
A chained hash table is one where, to store an item, you hash it into a bucket, then store the item in that bucket. If multiple items end up in the same bucket, you simply store a list of all the items that end up in that bucket within the bucket itself. (This is the most commonly taught version of a hash table.) In this kind of hash table, the expected number of elements in a bucket, assuming a good hash function, is O(α), where α denotes the load factor. That makes intuitive sense: if you distribute your items randomly across the buckets, you'd expect roughly α of them to end up in each bucket. As the load factor increases, you will have to do more and more work on average to find an element, since more elements will be in each bucket. The runtime of a lookup won't necessarily reach O(n), though, since the items remain spread across the buckets even when there aren't nearly enough buckets to go around.
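To make that concrete, here's a minimal Java sketch of a chained table (the ChainedTable class and its fixed bucket count are made up for illustration; a real table would also resize):

```java
import java.util.LinkedList;
import java.util.List;

// Minimal chained hash table sketch: each bucket holds a list of entries.
class ChainedTable<K, V> {
    private static class Entry<K, V> {
        final K key; V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final List<Entry<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedTable(int bucketCount) {
        buckets = new List[bucketCount];
        for (int i = 0; i < bucketCount; i++) buckets[i] = new LinkedList<>();
    }

    private int index(K key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    void put(K key, V value) {
        List<Entry<K, V>> bucket = buckets[index(key)];
        for (Entry<K, V> e : bucket) {
            if (e.key.equals(key)) { e.value = value; return; }
        }
        bucket.add(new Entry<>(key, value));   // chains grow as the load factor grows
    }

    // Expected O(alpha) work: only the one bucket the key hashes to is scanned.
    V get(K key) {
        for (Entry<K, V> e : buckets[index(key)]) {
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }
}
```

With n items in m buckets, each get scans one chain of expected length α = n/m, which is why the average cost tracks the load factor.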
A linear probing hash table works by having an array of slots. Whenever you hash an element, you go to its slot, then walk forward in the table until you either find the element or find a free slot. In that case, as the load factor approaches one, more and more table slots fill in, and you can indeed end up in a situation where searches take time O(n) in the worst case, because only a few free slots remain to stop your search. (There's a beautiful and famous analysis by Don Knuth showing that, assuming the hash function behaves like a randomly chosen function, the cost of an unsuccessful lookup or insertion into the hash table is O(1 / (1 - α)²). It's interesting to plot this function and see how the runtime grows as α gets closer and closer to one.)
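And a sketch of the probing loop described above, assuming a toy fixed-size ProbingTable with no deletion support (tombstones are omitted):

```java
// Linear probing sketch: walk forward from the home slot until the key or an
// empty slot is found. As the load factor approaches 1, empty slots become
// rare and probe sequences grow long.
class ProbingTable {
    private final Object[] slots;

    ProbingTable(int capacity) { slots = new Object[capacity]; }

    boolean insert(Object key) {
        int i = (key.hashCode() & 0x7fffffff) % slots.length;
        for (int probes = 0; probes < slots.length; probes++) {
            if (slots[i] == null) { slots[i] = key; return true; }
            if (slots[i].equals(key)) return true;   // already present
            i = (i + 1) % slots.length;              // walk forward, wrapping around
        }
        return false;                                // table is full
    }

    boolean contains(Object key) {
        int i = (key.hashCode() & 0x7fffffff) % slots.length;
        for (int probes = 0; probes < slots.length; probes++) {
            if (slots[i] == null) return false;      // a free slot ends the search
            if (slots[i].equals(key)) return true;
            i = (i + 1) % slots.length;
        }
        return false;
    }
}
```

Notice that contains only stops early at an empty slot, so the emptier the table, the shorter the walk; as α approaches 1 those early stops disappear, which is what the 1/(1-α)² bound captures.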
Hope this helps!

Related

Time Complexity of Hash Map Traversal

What is the best, average, and worst case time complexity for traversing a hash map, under the assumption that the hash map uses chaining with linked lists?
I've read multiple times that the time complexity is O(m+n) for traversal for all three cases (m=number of buckets, n=number of elements). However, this differs from my time complexity analysis: In the worst case all elements are linearly chained in the last bucket which leads to a time complexity of O(m+n). In the best case no hash collisions happen and therefore time complexity should be O(m). In the average case I assume that the elements are uniformly distributed, i.e. each bucket on average has n/m elements. This leads to a time complexity of O(m * n/m) = O(n). Is my analysis wrong?
In practice, a good implementation can always achieve O(n). GCC's C++ Standard Library implementation for the hash table containers unordered_map and unordered_set, for example, maintains a forward/singly linked list through all the elements inserted into the hash table, in which elements that currently hash to the same bucket are grouped together. Each hash table bucket holds an iterator into that singly linked list, pointing at the element just before the bucket's colliding elements (so that when erasing an element, the previous link can be rewired to skip over it).
During traversal, only the singly linked list needs to be consulted; the hash table buckets are not visited. This becomes especially important when the load factor is very low (many elements were inserted, then many were erased, but in C++ the table never reduces its size, so you can end up with a very low load factor).
If instead you have a hash table implementation where each bucket literally maintains a head pointer for its own linked list, then the kind of analysis you attempted comes into play.
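For comparison, a Java sketch of that naive per-bucket traversal (a hypothetical layout, not GCC's): every one of the m buckets gets visited even when empty, which is exactly where the O(m + n) in your analysis comes from. (Java's own LinkedHashMap plays a trick similar to GCC's containers, threading a linked list through its entries so that iteration time depends on the number of entries rather than on the capacity.)

```java
import java.util.List;
import java.util.function.Consumer;

// Naive traversal sketch over a per-bucket-list layout: all m buckets are
// visited (even empty ones), plus one hop per element, hence O(m + n).
class BucketTraversal {
    static <E> void forEachElement(List<E>[] buckets, Consumer<E> action) {
        for (List<E> bucket : buckets) {       // m bucket visits
            if (bucket == null) continue;
            for (E element : bucket) {         // n element visits in total
                action.accept(element);
            }
        }
    }
}
```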
You're right about worst case complexity.
In the best case no hash collisions happen and therefore time complexity should be O(m).
It depends. In C++ for example, values/elements are never stored in the hash table buckets (which would waste a huge amount of memory if the values were large in size and many buckets were empty). If instead the buckets contain the "head" pointer/iterator for the list of colliding elements, then even if there's no collision at a bucket, you still have to follow the pointer to a distinct memory area - that's just as bothersome as following a pointer between nodes on the same linked list, and is therefore normally included in the complexity calculation, so it's still O(m + n).
In the average case I assume that the elements are uniformly distributed, i.e. each bucket on average has n/m elements.
No... elements being uniformly distributed across buckets is the best case for a hash table: see above. An "average" or typical case is one with more variation in the number of elements hashing to any given bucket. For example, if you have 1 million buckets and 1 million values and a cryptographic-strength hash function, you can statistically expect about 1/e (~36.8%) of the buckets to be empty, 1/(1!e) (also ~36.8%) to have 1 element, 1/(2!e) (~18.4%) to have 2 colliding elements, 1/(3!e) (~6.1%) to have 3 colliding elements, and so on (the "!" is for factorial; these are the Poisson probabilities for a mean of 1).
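If you want to sanity-check those fractions, here's a quick Java simulation sketch (java.util.Random stands in for the strong hash; the numbers are only illustrative):

```java
import java.util.Random;

// Throw n items into n buckets uniformly at random and tally bucket sizes;
// the observed fractions should come out near e^-1 / k! (Poisson, mean 1).
public class BucketOccupancy {
    public static void main(String[] args) {
        int n = 1_000_000;
        int[] counts = new int[n];
        Random rng = new Random(42);
        for (int i = 0; i < n; i++) counts[rng.nextInt(n)]++;

        int[] histogram = new int[8];
        for (int c : counts) histogram[Math.min(c, 7)]++;

        for (int k = 0; k <= 4; k++) {
            double observed = (double) histogram[k] / n;
            double poisson = Math.exp(-1) / factorial(k);
            System.out.printf("k=%d observed=%.4f expected=%.4f%n", k, observed, poisson);
        }
    }

    private static double factorial(int k) {
        double f = 1;
        for (int i = 2; i <= k; i++) f *= i;
        return f;
    }
}
```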
Anyway, the key point is that a naive bucket-visiting hash table traversal (as distinct from being able to traverse a list of elements without visiting the buckets) always has to visit all m buckets, and if you imagine each element tacked onto a bucket somewhere, there's always one extra link to traverse to reach it. Hence O(m+n).

Hash table is always O(n) time for lookup?

I don't understand how hash tables are constant time lookup, if there's a constant number of buckets. Say we have 100 buckets, and 1,000,000 elements. This is clearly O(n) lookup, and that's the point of complexity, to understand how things behave for very large values of n. Thus, a hashtable is never constant lookup, it's always O(n) lookup.
Why do people say it's O(1) lookup on average, and only O(n) for worst case?
The purpose of using a hash is to be able to index into the table directly, just like an array. In the ideal case there's only one item per bucket, and we achieve O(1) easily.
A practical hash table will have more buckets than it has elements, so that the odds of having only one element per bucket are high. If the number of elements inserted into the table gets too great, the table will be resized to increase the number of buckets.
There is always a possibility that every element will have the same hash, or that all active hashes will be assigned to the same bucket; in that case the lookup time is indeed O(n). But a good hash table implementation will be designed to minimize the chance of that occurring.
In layman's terms, with some hand waving:
At the one extreme, you can have a hash map that is perfectly distributed with one value per bucket. In this case, your lookup returns the value directly, and cost is 1 operation -- or on the order of one, if you like: O(1).
In the real world, implementations often arrange for that to be the case by expanding the size of the table, etc., to meet the requirements of the data. When you have more items than buckets, complexity starts to increase.
In the worst case, you have one bucket and n items in that one bucket. In this case, it is basically like searching a list linearly. And so if the value happens to be the last one, you need to do n comparisons to find it. Or, on the order of n: O(n).
The latter case is pretty much always possible for a given data set; that's why so much study and effort has been put into coming up with good hashing algorithms. It is theoretically possible to engineer a data set that causes collisions, so there is some way to end up with O(n) performance, unless the implementation tweaks other aspects: table size, hash implementation, etc.
By saying
Say we have 100 buckets, and 1,000,000 elements.
you are basically depriving the hashmap of its real power, rehashing, and also not choosing the initial capacity of the hashmap according to your needs. A hashmap is most efficient when each entry gets its own bucket; a lower percentage of collisions can be achieved with a higher capacity. Each collision means you need to traverse the corresponding list.
The points below should be considered for a hash table implementation.
A hashtable is designed such that it resizes itself once the number of entries grows larger than the number of buckets times a certain threshold value. This is how we should design our own custom hash table as well (a sketch follows after these points).
A good hash function makes sure that entries are well distributed in the buckets of hashtable. This keeps the list in a bucket short.
Together, the above take care that access time remains constant.
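A rough sketch of that resize step, assuming a made-up ResizingSet with a 0.75 threshold and doubling growth:

```java
import java.util.ArrayList;
import java.util.List;

// Resize sketch: when size would exceed THRESHOLD * buckets, double the
// bucket count and re-insert every key so that chains stay short.
class ResizingSet {
    private static final double THRESHOLD = 0.75;   // assumed load-factor limit
    private List<Object>[] buckets;
    private int size;

    @SuppressWarnings("unchecked")
    ResizingSet() { buckets = new List[16]; }

    void add(Object key) {
        if (size + 1 > THRESHOLD * buckets.length) rehash(buckets.length * 2);
        if (insert(buckets, key)) size++;
    }

    private void rehash(int newBucketCount) {
        @SuppressWarnings("unchecked")
        List<Object>[] bigger = new List[newBucketCount];
        for (List<Object> bucket : buckets) {        // O(n) re-insertion, amortized away
            if (bucket == null) continue;
            for (Object key : bucket) insert(bigger, key);
        }
        buckets = bigger;
    }

    private static boolean insert(List<Object>[] table, Object key) {
        int i = (key.hashCode() & 0x7fffffff) % table.length;
        if (table[i] == null) table[i] = new ArrayList<>();
        if (table[i].contains(key)) return false;
        table[i].add(key);
        return true;
    }
}
```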

Why hashmap lookup is O(1) i.e. constant time?

If we look at it from a Java perspective, we can say that hashmap lookup takes constant time. But what about the internal implementation? It still would have to search through the particular bucket (for which the key's hashcode matched) for different matching keys. Then why do we say that hashmap lookup takes constant time? Please explain.
Under the appropriate assumptions on the hash function being used, we can say that hash table lookups take expected O(1) time (assuming you're using a standard hashing scheme like linear probing or chained hashing). This means that on average, the amount of work that a hash table does to perform a lookup is at most some constant.
Intuitively, if you have a "good" hash function, you would expect that elements would be distributed more or less evenly throughout the hash table, meaning that the number of elements in each bucket would be close to the number of elements divided by the number of buckets. If the hash table implementation keeps this number low (say, by adding more buckets every time the ratio of elements to buckets exceeds some constant), then the expected amount of work that gets done ends up being some baseline amount of work to choose which bucket should be scanned, then doing "not too much" work looking at the elements there, because on expectation there will only be a constant number of elements in that bucket.
This doesn't mean that hash tables have guaranteed O(1) behavior. In fact, in the worst case, the hashing scheme will degenerate and all elements will end up in one bucket, making lookups take time Θ(n) in the worst case. This is why it's important to design good hash functions.
For more information, you might want to read an algorithms textbook to see the formal derivation of why hash tables support lookups so efficiently. This is usually included as part of a typical university course on algorithms and data structures, and there are many good resources online.
Fun fact: there are certain types of hash tables (cuckoo hash tables, dynamic perfect hash tables) where the worst case lookup time for an element is O(1). These hash tables work by guaranteeing that each element can only be in one of a few fixed positions, with insertions sometimes scrambling around elements to try to make everything fit.
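For a flavor of why cuckoo hashing gives worst-case O(1) lookups, note that a key can only ever live in one of two slots; the lookup below checks exactly those two. This is only a sketch: the second hash here is a made-up toy, and insertion, which displaces resident keys and occasionally rebuilds the table, is omitted entirely.

```java
// Cuckoo lookup sketch: a key can only live at one of two slots, so a
// lookup inspects at most two positions -- worst-case O(1).
class CuckooLookup {
    private final Object[] table1, table2;

    CuckooLookup(int capacity) {
        table1 = new Object[capacity];
        table2 = new Object[capacity];
    }

    private int h1(Object key) { return (key.hashCode() & 0x7fffffff) % table1.length; }
    private int h2(Object key) { return ((key.hashCode() * 31) & 0x7fffffff) % table2.length; } // toy second hash

    boolean contains(Object key) {
        return key.equals(table1[h1(key)]) || key.equals(table2[h2(key)]);
    }
}
```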
Hope this helps!
The key is in this statement in the docs:
If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large capacity will allow the mappings to be stored more efficiently than letting it perform automatic rehashing as needed to grow the table.
and
The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
http://docs.oracle.com/javase/6/docs/api/java/util/HashMap.html
The internal bucket structure will actually be rebuilt if the load factor is exceeded, allowing for the amortized cost of get and put to be O(1).
Note that if the internal structure is rebuilt, that introduces a performance penalty that is likely to be O(N), so quite a few get and put operations may be required before the amortized cost approaches O(1) again. For that reason, plan the initial capacity and load factor appropriately, so that you neither waste space nor trigger avoidable rebuilding of the internal structure.
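For example, here is a hedged sketch of pre-sizing a java.util.HashMap so that no rehash occurs while it is being populated (the capacity ≥ expected / loadFactor arithmetic follows directly from the threshold rule quoted above):

```java
import java.util.HashMap;
import java.util.Map;

public class PresizedMap {
    public static void main(String[] args) {
        int expectedEntries = 10_000;
        float loadFactor = 0.75f;

        // Capacity large enough that expectedEntries never exceeds
        // capacity * loadFactor, so no rehash happens while populating.
        int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);
        Map<Integer, String> map = new HashMap<>(initialCapacity, loadFactor);

        for (int i = 0; i < expectedEntries; i++) {
            map.put(i, "value-" + i);
        }
        System.out.println("size = " + map.size());
    }
}
```

HashMap rounds the capacity up to a power of two internally, so the computed value is a lower bound rather than the exact table size.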
Hashtables AREN'T O(1).
Via the pigeonhole principle, you cannot be better than O(log(n)) for lookup, because you need log(n) bits per item to uniquely identify n items.
Hashtables seem to be O(1) because they have a small constant factor combined with their 'n' in the O(log(n)) being increased to the point that, for many practical applications, it is independent of the number of actual items you are using. However, big O notation doesn't care about that fact, and it is a (granted, absurdly common) misuse of the notation to call hashtables O(1).
Because while you could store a million or a billion items in a hashtable and still get the same lookup time as a single-item hashtable... you lose that ability if you're talking about a nonillion or googolplex items. The fact that you will never actually use a nonillion or googolplex items doesn't matter for big O notation.
Practically speaking, hashtable performance can be a constant factor worse than array lookup performance. Which, yes, is also O(log(n)), because you CAN'T do better.
Basically, real-world computers make every array lookup for arrays smaller than their chip bit size just as bad as for their biggest theoretically usable array, and as hashtables are clever tricks performed on arrays, that's why you seem to get O(1).
To follow up on templatetypedef's comments as well:
A constant-time implementation of a hash table could be a hashmap with which you implement a boolean array list indicating whether a particular element exists in a bucket. However, if you implement your hashmap with a linked list per bucket, the worst case would require you to go through every bucket and traverse to the ends of the lists.

Hash table runtime complexity (insert, search and delete)

Why do I keep seeing different runtime complexities for these functions on a hash table?
On wiki, search and delete are O(n) (I thought the point of hash tables was to have constant lookup so what's the point if search is O(n)).
In some course notes from a while ago, I see a wide range of complexities depending on certain details including one with all O(1). Why would any other implementation be used if I can get all O(1)?
If I'm using standard hash tables in a language like C++ or Java, what can I expect the time complexity to be?
Hash tables have O(1) average and amortized case complexity; however, they suffer from O(n) worst case time complexity. [And I think this is where your confusion is.]
Hash tables suffer from O(n) worst time complexity due to two reasons:
If too many elements hash to the same key: looking through that key's bucket may take O(n) time.
Once a hash table has exceeded its load factor, it has to rehash [create a new, bigger table and re-insert each element into it].
However, it is said to be O(1) average and amortized case because:
It is very rare that many items hash to the same key [if you choose a good hash function and don't have too big a load factor].
The rehash operation, which is O(n), can happen at most once per n/2 operations, which are all assumed to be O(1). Thus when you sum the average time per operation, you get: (n*O(1) + O(n)) / n = O(1).
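To see that arithmetic play out, here is a small counting sketch (it models a table that doubles whenever it fills up; the policy is illustrative, not any particular library's):

```java
// Count the total work for n insertions into a table that doubles whenever
// it is full: the moves from all rehashes sum to less than 2n, so the
// average work per insertion stays O(1) even though one rehash is O(n).
public class AmortizedRehash {
    public static void main(String[] args) {
        int n = 1_000_000;
        long capacity = 8, size = 0, totalWork = 0;

        for (int i = 0; i < n; i++) {
            if (size == capacity) {        // rehash: move every existing element
                totalWork += size;
                capacity *= 2;
            }
            size++;
            totalWork++;                   // the insertion itself
        }
        System.out.printf("total work = %d, per insertion = %.3f%n",
                totalWork, (double) totalWork / n);
    }
}
```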
Note that because of the rehashing issue, real-time applications and applications that need low latency should not use a hash table as their data structure.
EDIT: Another issue with hash tables: cache
Another issue where you might see a performance loss in large hash tables is cache performance. Hash tables suffer from bad cache locality, and thus for large collections the access time might be longer, since you need to reload the relevant part of the table from memory back into the cache.
Ideally, a hashtable is O(1). The problem is if two keys are not equal, however they result in the same hash.
For example, imagine the strings "it was the best of times it was the worst of times" and "Green Eggs and Ham" both resulted in a hash value of 123.
When the first string is inserted, it's put in bucket 123. When the second string is inserted, it would see that a value already exists for bucket 123. It would then compare the new value to the existing value, and see they are not equal. In this case, an array or linked list is created for that key. At this point, retrieving this value becomes O(n) as the hashtable needs to iterate through each value in that bucket to find the desired one.
For this reason, when using a hash table, it's important to use a key with a really good hash function that's both fast and doesn't often result in duplicate values for different objects.
Make sense?
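Here's a small Java demonstration of that degradation, using a deliberately terrible hashCode so that every key collides. BadKey is invented for this example; note that HashMap since Java 8 partially mitigates pile-ups by treeifying large bins when keys are Comparable, which BadKey deliberately is not.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Every BadKey reports the same hash code (123), so all entries land in one
// bucket and lookups degrade toward a scan of that single overgrown bucket.
public class CollisionDemo {
    static final class BadKey {
        private final String name;
        BadKey(String name) { this.name = name; }

        @Override public int hashCode() { return 123; }          // worst possible hash
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && Objects.equals(name, ((BadKey) o).name);
        }
    }

    public static void main(String[] args) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < 20_000; i++) {
            map.put(new BadKey("key-" + i), i);                   // all collide in bucket 123
        }

        long start = System.nanoTime();
        map.get(new BadKey("key-19999"));
        System.out.printf("lookup took %d us%n", (System.nanoTime() - start) / 1_000);
    }
}
```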
Some hash tables (cuckoo hashing) have guaranteed O(1) lookup
Perhaps you were looking at the space complexity? That is O(n). The other complexities are as expected for hash table entries. The search complexity approaches O(1) as the number of buckets increases. In the worst case, if you have only one bucket in the hash table, then the search complexity is O(n).
Edit in response to comment: I don't think it is correct to say O(1) is the average case. It really is (as the Wikipedia page says) O(1 + n/k), where k is the hash table size. If k is large enough, then the result is effectively O(1). But suppose k is 10 and n is 100. In that case each bucket will have on average 10 entries, so the search time is definitely not O(1); it is a linear search through up to 10 entries.
It depends on how you implement hashing. In the worst case it can go to O(n); in the best case it is O(1) (generally achievable if your data structure isn't too big).

Why does hashtable have constant access time in average?

I don't understand this explanation, which says that if n is the number of elements in the hash table and m is the total number of buckets, then hash tables have constant access time on average only if n is proportional to m (n is in Theta(m)). Why does it have to be proportional?
well actually m should be proportional to n. Otherwise you could, for example, have just 1 bucket and it would be just like an unsorted set.
To be more precise, if m is proportional to n, i.e. m = c * n, then the number of items in each bucket will be n/m = 1/c which is a constant. Going to any bucket is an O(1) operation (just compute the hash code) and then the search through the bucket is constant order (you could just do a linear search through the items in the bucket which would be a constant).
Thus the order of the algorithm is O(1), if m = c * n.
To take a converse example, suppose we had a fixed size table of size tableSize. Then the expected number of items in each bucket is n/tableSize which is a linear function of n. Any kind of search through the bucket is at best O(log(n)) for a tree (I'm assuming you don't stick another hash table inside the bucket or we then have the same argument over that hash table), so it would not be O(1) in this case.
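A tiny numeric illustration of the contrast (the sizes are made up; it just prints the expected chain length n/m under the two schemes):

```java
// Expected chain length n/m: constant when m grows with n (m = n/4 here),
// but linear in n when the table size is fixed.
public class ChainLength {
    public static void main(String[] args) {
        int fixedTableSize = 1_000;
        for (int n = 10_000; n <= 1_000_000; n *= 10) {
            double proportional = (double) n / (n / 4);        // m = c * n with c = 1/4
            double fixed = (double) n / fixedTableSize;        // m fixed at 1,000
            System.out.printf("n=%,10d  m=n/4 -> %.1f per bucket   m=1000 -> %.1f per bucket%n",
                    n, proportional, fixed);
        }
    }
}
```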
Strictly speaking, the average-case time complexity of hash table access is actually in Ω(n^(1/3)). Information can't travel faster than the speed of light, which is a constant. Since space has three dimensions, storing n bits of data requires that some data be located at a distance on the order of n^(1/3) from the CPU.
More detail in my blog.
If m is not kept proportional to n, the chance of collisions is higher, and thus the incidence of having to scan through the list of items with the same hash key is also higher.
Access time is constant because access is based on a calculation of a hash value and then a constant lookup to find the appropriate bucket. Assuming the hash function evenly distributes items amongst buckets, then the time it takes to access any individual item will be equal to the time to access other items, regardless of n.
Constant doesn't necessarily mean constantly low, though. The average access time depends on how evenly the hashing function distributes items and on the number of buckets. If you have thousands of items evenly distributed amongst a small number of buckets, you find the bucket fast but then loop through a lot of items in it. If you have a good proportion of buckets to items but a bad hash function that puts many more items in some buckets than in others, the access time for the items in the larger buckets will be slower than the access time for others.
A reasonably-sized hash table, where there are enough slots for every element you store and plenty of extra space, will have the hashing function doing most of the work choosing slots and very few collisions where different elements have the same hash. A very crowded hash table would have lots of collisions, and would degrade to basically a linear search, where almost every lookup will be a wrong item that had the same hash and you'll have to keep searching for the right one (a hash table lookup still has to check the key once it picks the first slot, because the key it's looking for might have had a collision when it was stored).
What determines the hit-collision ratio is exactly the ratio of number-of-items to size-of-hash (i.e., the percentage chance that a randomly chosen slot will be filled).
