I don't understand how hash tables give constant-time lookup if there's a constant number of buckets. Say we have 100 buckets and 1,000,000 elements. This is clearly O(n) lookup, and that's the point of complexity analysis: to understand how things behave for very large values of n. Thus, a hashtable is never constant-time lookup; it's always O(n) lookup.
Why do people say it's O(1) lookup on average, and only O(n) for worst case?
The purpose of using a hash is to be able to index into the table directly, just like an array. In the ideal case there's only one item per bucket, and we achieve O(1) easily.
A practical hash table will have more buckets than it has elements, so that the odds of having only one element per bucket are high. If the number of elements inserted into the table gets too great, the table will be resized to increase the number of buckets.
There is always a possibility that every element will have the same hash, or that all active hashes will be assigned to the same bucket; in that case the lookup time is indeed O(n). But a good hash table implementation will be designed to minimize the chance of that occurring.
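For illustration, here is a minimal sketch of that "index directly, like an array" step, assuming separate chaining (the names are mine, not from any particular library):

    class BucketIndex {
        // Reduce a key's hash code to a bucket index in constant time,
        // independent of how many entries the table holds.
        static int bucketFor(Object key, int numberOfBuckets) {
            // floorMod keeps the index non-negative even when hashCode() is negative
            return Math.floorMod(key.hashCode(), numberOfBuckets);
        }
    }

The cost of this step never depends on the number of entries; what can grow is the scan inside the chosen bucket, which is why keeping buckets short matters.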
In layman's terms, with some hand-waving:
At the one extreme, you can have a hash map that is perfectly distributed with one value per bucket. In this case, your lookup returns the value directly, and cost is 1 operation -- or on the order of one, if you like: O(1).
In the real world, implementations often arrange for that to be the case by expanding the size of the table, etc., to meet the requirements of the data. When you have more items than buckets, complexity starts to increase.
In the worst case, you have one bucket and n items in that one bucket. In this case, it is basically like searching a list linearly, so if the value happens to be the last one, you need to do n comparisons to find it. Or, on the order of n: O(n).
The latter case is pretty much always /possible/ for a given data set; that's why so much study and effort has gone into coming up with good hashing algorithms. It is theoretically possible to engineer a dataset that causes collisions, so there is always some way to end up with O(n) performance unless the implementation tweaks other aspects: table size, hash implementation, and so on.
By saying
Say we have 100 buckets, and 1,000,000 elements.
you are basically depriving the hashmap of its real power, rehashing, and also not considering the initial capacity of the hashmap being chosen according to need. A hashmap is most efficient when each entry gets its own bucket; a lower percentage of collisions can be achieved with a higher hashmap capacity. Each collision means you need to traverse the corresponding list.
The following points should be considered for a hash table implementation.
A hashtable is designed such that it resizes itself when the ratio of entries to buckets exceeds a certain threshold value. This is how we should design it if we wish to implement our own custom hash table.
A good hash function makes sure that entries are well distributed among the buckets of the hashtable. This keeps the list in each bucket short.
Together, the points above keep the access time constant.
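As a rough, self-contained sketch of the resizing rule just described (illustrative names only, assuming separate chaining and a 0.75 threshold; duplicates are ignored for brevity):

    import java.util.LinkedList;

    class ResizingSet {
        private static final double LOAD_FACTOR = 0.75;
        private LinkedList<Object>[] buckets = newBuckets(16);
        private int size = 0;

        void add(Object key) {
            // Grow the bucket array once the entry count exceeds capacity * LOAD_FACTOR.
            if (size + 1 > buckets.length * LOAD_FACTOR) resize();
            buckets[indexFor(key, buckets.length)].add(key);
            size++;
        }

        private void resize() {
            // Double the bucket array and redistribute every entry by rehashing it.
            LinkedList<Object>[] bigger = newBuckets(buckets.length * 2);
            for (LinkedList<Object> bucket : buckets)
                for (Object key : bucket)
                    bigger[indexFor(key, bigger.length)].add(key);
            buckets = bigger;
        }

        private static int indexFor(Object key, int n) {
            return Math.floorMod(key.hashCode(), n);
        }

        @SuppressWarnings("unchecked")
        private static LinkedList<Object>[] newBuckets(int n) {
            LinkedList<Object>[] b = new LinkedList[n];
            for (int i = 0; i < n; i++) b[i] = new LinkedList<>();
            return b;
        }
    }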
The way I understand it, a hash table is an array of linked lists, so lookup should actually be O(n/array_length).
Isn't saying that it's O(1) just plain wrong?
For example, if I have 1M items in a hash table backed by a 100-element array, the average lookup would have to scan about 5,000 items. That's clearly not O(1), although I assume that most implementations of a hash table use a much bigger array.
And what is usually the array size that is being used in most languages' (JS, Go, etc) hash table implementations?
You are correct. In general, it is not possible to say that all hash tables are O(1). It depends on some design decisions, and on other factors.
In your example, you seem to be talking about a hash table with a fixed number of buckets (100) and an unbounded number of entries N. With that design, each bucket holds N / 100 entries on average, so it will take roughly N / 200 comparisons to find a key that is present, and roughly N / 100 comparisons to discover that a key is not present. Either way, that is O(N).
There is a common implementation strategy to deal with this. As the hash table gets larger, you periodically resize the primary array, and then redistribute the keys / entries. For example, the standard Java HashMap and Hashtable classes track the ratio of the number of entries to the array size. When that ratio exceeds a configurable load factor, the primary array size is doubled. (See the javadoc for an explanation of the load factor.)
The analysis of this is complicated. However, if we can assume that keys are roughly evenly distributed across the buckets, we get the following:
average lookup times that are O(1)
average insertion times that are O(1)
worst-case insertion times that are O(N) ... when the insertion triggers a resize.
What if the key distribution is badly skewed?
This can happen if the hash function is a poor one, or if the process that generates the keys is pathological.
Well, if you don't do anything about it, the worst case occurs when all keys have the same hash value and end up in the same bucket ... irrespective of the primary array size. This results in O(N) lookup and insertion.
But there are a couple of ways to mitigate this. A simple way is to perform a second hashing operation on the keys' hashcodes. This helps in some cases. A more complex way is to replace the hash chains with balanced binary trees. This changes the average behavior of lookup and insertion (for the case of pathological keys) from O(N) to O(log N).
From Java 8 onwards, the HashMap implementation uses either hash chains or trees, depending on the number of keys in a given bucket.
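The "second hashing operation" can be as cheap as mixing the high bits of a hash code into the low bits, so that tables whose size is a power of two still see the whole hash. The snippet below is a sketch of that idea; Java 8's HashMap applies a similar bit-spreading step before choosing a bucket:

    class Spread {
        // Fold the upper 16 bits of the hash code into the lower 16 bits.
        static int spread(int hashCode) {
            return hashCode ^ (hashCode >>> 16);
        }
    }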
And what is usually the array size that is being used in most languages' (JS, Go, etc) hash table implementations?
For Java (and probably for the others) the array size changes over the lifetime of the hash table as described above. In Java, there is an upper limit on the size of the array: Java arrays can only have 2^31 - 1 elements.
In the implementations you mention, the array starts small, and is reallocated with a bigger size as more elements are added.
The array size stays within a constant factor of the element count, so that the average number of elements per slot is bounded.
Regarding hash tables, we measure the performance of a hash table using the load factor, but I need to understand the relationship between the load factor and the time complexity of the hash table. According to my understanding, the relation is directly proportional.

That is, we just take O(1) for the computation of the hash function to find the index. If the load factor is low, there aren't many elements in the table, so the chance of finding the key-value pair at its home index is high, the search effort is minimal, and the complexity is still constant. On the other hand, when the load factor is high, the chance of finding the key-value pair at its exact position is low, so we will need to do some searching, and the complexity rises toward O(n). The same can be said for the insert operation. Is this right?
This is a great question, and the answer is "it depends on what kind of hash table you're using."
In a chained hash table, to store an item, you hash it into a bucket, then store the item in that bucket. If multiple items end up in the same bucket, you simply store a list of all the items that end up in that bucket within the bucket itself. (This is the most commonly taught version of a hash table.) In this kind of hash table, the expected number of elements in a bucket, assuming a good hash function, is O(α), where α denotes the load factor. That makes intuitive sense, since if you distribute your items randomly across the buckets you'd expect that roughly α of them end up in each bucket. In this case, as the load factor increases, you will have to do more and more work on average to find an element, since more elements will be in each bucket. The runtime of a lookup won't necessarily reach O(n), though, since you will still have the items distributed across the buckets even if there aren't nearly enough buckets to go around.
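A minimal sketch of that chained lookup (illustrative names, assuming the buckets are linked lists of key/value entries):

    import java.util.LinkedList;
    import java.util.Objects;

    class ChainedLookup {
        static class Entry {
            Object key;
            Object value;
        }

        static Object get(LinkedList<Entry>[] buckets, Object key) {
            // Hash to a bucket, then linearly scan that bucket's list.
            // The scan is what costs O(α) comparisons on average.
            int index = Math.floorMod(key.hashCode(), buckets.length);
            for (Entry e : buckets[index]) {
                if (Objects.equals(e.key, key)) return e.value;
            }
            return null;  // the key is not in its bucket
        }
    }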
A linear probing hash table works by having an array of slots. Whenever you hash an element, you go to its slot, then walk forward in the table until you either find the element or find a free slot. In that case, as the load factor approaches one, more and more table slots will be filled in, and you'll find yourself in a situation where searches do indeed take time O(n) in the worst case because there will only be a few free slots to stop your search. (There's a beautiful and famous analysis by Don Knuth showing that, assuming the hash function behaves like a randomly-chosen function, the cost of an unsuccessful lookup or insertion into the hash table will take time O(1 / (1 - α)²). It's interesting to plot this function and see how the runtime grows as α gets closer and closer to one.)
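And a corresponding sketch of a linear-probing lookup (illustrative only; it assumes the table always keeps at least one empty slot so the probe loop terminates):

    class ProbingLookup {
        static int indexOf(Object[] slots, Object key) {
            int i = Math.floorMod(key.hashCode(), slots.length);
            while (slots[i] != null) {
                if (slots[i].equals(key)) return i;   // found the key
                i = (i + 1) % slots.length;           // walk forward to the next slot
            }
            return -1;  // hit a free slot: the key is not in the table
        }
    }

As the load factor α approaches one, those long probe runs are exactly what drives the 1 / (1 - α)² cost up.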
Hope this helps!
I know of two ways to implement a queue: using a linked list or using an array. Which one should I use for making buckets in a hash table where the hash table needs to be rehashed when a bucket exceeds its limit of entries? Is it possible for me to get O(1) enqueue and dequeue along with indexing using some other data structure?
Using an array, I can let the bucket size grow to higher values, because indexing into an array lets me use binary search on the keys (inserted in sorted order). Consider the benefit if the bucket size becomes 1000: search becomes about log2(1000) ≈ 10 comparisons vs. 1000. The insert operation becomes O(n), but lookup is more common than insert.
Using a linked list, I get O(1) insert and delete, but lookup is O(n).
My question is: can I get the benefits of both using some other data structure, or is the benefit of using one of these clearly greater than the other?
I think you're asking the wrong question. Rather than worrying about how to handle large numbers of items in a bucket, you should be concerned with why your bucket has become overfull.
Hash tables assume two things:
You've selected a hashing function that provides a good distribution of items among buckets.
You won't let the load factor get too high. A good hash table implementation will provide pretty decent performance with a load factor up to about 0.8, but beyond that performance drops precipitously. I think most implementations like to keep the load factor under 0.7. So if the number of items in your hash table exceeds 70% of the table's capacity, you should consider increasing the capacity. Most hash table implementations automatically increase the capacity when the load factor goes beyond some threshold.
When you elect to use a hash table, you take on the responsibility of ensuring that both conditions hold true. If you pick a poor hashing function or if you exceed the designed load factor, performance will suffer, and no amount of optimizing the bucket structure is going to help you.
The implementation of your bucket's list structure shouldn't matter because your buckets shouldn't be large enough to make a performance difference. A simple linked list gives you O(1) insertion and O(k) lookup (where k is the number of items in the bucket). But k shouldn't be more than 2 or 3, so it doesn't make sense to use an asymptotically more efficient data structure.
Regardless of how you implement the buckets, you're going to pay the price of an O(n) resize from time to time when you exceed the hash table's capacity (or the load factor threshold if your hash table implementation does automatic resizing).
When you implement the buckets for the hash table, you should use linked lists because they are resizable. The only operations you need within a bucket are to traverse it and to append new items, both of which cost O(1) per element. When you use an array, you either allocate too much memory or too little, since you cannot resize it in place. Moreover, you shouldn't use a queue; you'd be better off with an ordinary linked list.
If we look at this from a Java perspective, then we can say that hashmap lookup takes constant time. But what about the internal implementation? It still would have to search through the particular bucket (the one the key's hashcode mapped to) and compare against the different keys stored there. Then why do we say that hashmap lookup takes constant time? Please explain.
Under the appropriate assumptions on the hash function being used, we can say that hash table lookups take expected O(1) time (assuming you're using a standard hashing scheme like linear probing or chained hashing). This means that on average, the amount of work that a hash table does to perform a lookup is at most some constant.
Intuitively, if you have a "good" hash function, you would expect that elements would be distributed more or less evenly throughout the hash table, meaning that the number of elements in each bucket would be close to the number of elements divided by the number of buckets. If the hash table implementation keeps this number low (say, by adding more buckets every time the ratio of elements to buckets exceeds some constant), then the expected amount of work that gets done ends up being some baseline amount of work to choose which bucket should be scanned, then doing "not too much" work looking at the elements there, because on expectation there will only be a constant number of elements in that bucket.
This doesn't mean that hash tables have guaranteed O(1) behavior. In fact, in the worst case, the hashing scheme will degenerate and all elements will end up in one bucket, making lookups take time Θ(n) in the worst case. This is why it's important to design good hash functions.
For more information, you might want to read an algorithms textbook to see the formal derivation of why hash tables support lookups so efficiently. This is usually included as part of a typical university course on algorithms and data structures, and there are many good resources online.
Fun fact: there are certain types of hash tables (cuckoo hash tables, dynamic perfect hash tables) where the worst case lookup time for an element is O(1). These hash tables work by guaranteeing that each element can only be in one of a few fixed positions, with insertions sometimes scrambling around elements to try to make everything fit.
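A sketch of the cuckoo-hashing lookup idea (illustrative only; a real implementation would use two properly independent hash functions and handle the displacement logic on insertion):

    class CuckooLookup {
        static boolean contains(Object[] table1, Object[] table2, Object key) {
            int i1 = Math.floorMod(key.hashCode(), table1.length);
            int i2 = Math.floorMod(secondHash(key), table2.length);
            // Each key can only ever live in one of these two slots,
            // so a lookup inspects at most two positions: worst-case O(1).
            return key.equals(table1[i1]) || key.equals(table2[i2]);
        }

        // Stand-in for a second hash function; illustrative only.
        static int secondHash(Object key) {
            int h = key.hashCode();
            return (h ^ (h >>> 16)) * 0x9E3779B9;
        }
    }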
Hope this helps!
The key is in this statement in the docs:
If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large capacity will allow the mappings to be stored more efficiently than letting it perform automatic rehashing as needed to grow the table.
and
The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
http://docs.oracle.com/javase/6/docs/api/java/util/HashMap.html
The internal bucket structure will actually be rebuilt if the load factor is exceeded, allowing for the amortized cost of get and put to be O(1).
Note that if the internal structure is rebuilt, that introduces a performance penalty that is likely to be O(N), so quite a few get and put may be required before the amortized cost approaches O(1) again. For that reason, plan the initial capacity and load factor appropriately, so that you neither waste space, nor trigger avoidable rebuilding of the internal structure.
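For example, if you expect roughly 10,000 mappings (a made-up figure), you can pre-size the map so it never rehashes while being filled:

    import java.util.HashMap;
    import java.util.Map;

    class PresizedMap {
        public static void main(String[] args) {
            // With the default load factor of 0.75, a capacity of
            // ceil(expected / 0.75) holds that many entries without resizing.
            int expectedSize = 10_000;
            int initialCapacity = (int) Math.ceil(expectedSize / 0.75);
            Map<String, Integer> map = new HashMap<>(initialCapacity, 0.75f);
            map.put("answer", 42);
            System.out.println(map.get("answer"));
        }
    }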
Hashtables AREN'T O(1).
Via the pigeonhole principle, you cannot be better than O(log(n)) for lookup, because you need log(n) bits per item to uniquely identify n items.
Hashtables seem to be O(1) because they have a small constant factor, combined with the fact that the n in their O(log(n)) is, for many practical applications, effectively independent of the number of actual items you are using. However, big O notation doesn't care about that fact, and it is a (granted, absurdly common) misuse of the notation to call hashtables O(1).
Because while you could store a million or a billion items in a hashtable and still get the same lookup time as a single-item hashtable... you lose that ability if you're talking about a nonillion or a googolplex of items. The fact that you will never actually be using a nonillion or a googolplex of items doesn't matter for big O notation.
Practically speaking, hashtable performance can be a constant factor worse than array lookup performance, which, yes, is also O(log(n)), because you CAN'T do better.
Basically, real-world computers make every array lookup, for arrays of any size their word size can address, just as expensive as a lookup into their biggest theoretically usable array, and since hashtables are clever tricks performed on arrays, that's why you seem to get O(1).
To follow up on templatetypedef's comments as well:
A constant-time implementation of a hash table could use a boolean array (a direct-address table) that simply indicates whether a particular element exists in a bucket. However, if your hashmap uses linked lists for its buckets, the worst case would require going through an entire bucket, traversing all the way to the end of its list.
Why do I keep seeing different runtime complexities for these functions on a hash table?
On the wiki page, search and delete are listed as O(n) (I thought the point of hash tables was to have constant lookup, so what's the point if search is O(n)?).
In some course notes from a while ago, I see a wide range of complexities depending on certain details including one with all O(1). Why would any other implementation be used if I can get all O(1)?
If I'm using standard hash tables in a language like C++ or Java, what can I expect the time complexity to be?
Hash tables have O(1) average and amortized case complexity; however, they suffer from O(n) worst-case time complexity. [And I think this is where your confusion is.]
Hash tables suffer from O(n) worst-case time complexity for two reasons:
If too many elements are hashed into the same key, looking inside that key's bucket may take O(n) time.
Once a hash table has passed its load factor, it has to rehash [create a new, bigger table and re-insert each element into it].
However, it is said to be O(1) average and amortized case because:
It is very rare that many items will be hashed to the same key [if you choose a good hash function and don't have too big a load factor].
The rehash operation, which is O(n), can happen at most once per n/2 ops, all of which are assumed O(1). Thus, when you average the time per op, you get (n*O(1) + O(n)) / n = O(1) (see the small simulation below).
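A tiny simulation of that argument (illustrative only), assuming a table that doubles whenever it is full and counting one unit of work per insertion plus one per element moved during a rehash:

    class AmortizedDoubling {
        public static void main(String[] args) {
            int n = 1_000_000;
            int capacity = 16;
            int size = 0;
            long work = 0;
            for (int i = 0; i < n; i++) {
                if (size == capacity) {   // rehash: every element is moved once
                    work += size;
                    capacity *= 2;
                }
                work += 1;                // the insertion itself
                size++;
            }
            // The total stays below 3n, i.e. O(1) amortized per insertion.
            System.out.println("average work per insertion = " + (double) work / n);
        }
    }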
Note that because of the rehashing issue, real-time applications and applications that need low latency should not use a hash table as their data structure.
EDIT: Another issue with hash tables: cache
Another place you might see a performance loss in large hash tables is cache behavior. Hash tables suffer from bad cache performance, and thus for large collections the access time might be longer, since you need to reload the relevant part of the table from memory back into the cache.
Ideally, a hashtable is O(1). The problem arises when two keys are not equal but result in the same hash.
For example, imagine the strings "it was the best of times it was the worst of times" and "Green Eggs and Ham" both resulted in a hash value of 123.
When the first string is inserted, it's put in bucket 123. When the second string is inserted, it would see that a value already exists for bucket 123. It would then compare the new value to the existing value, and see they are not equal. In this case, an array or linked list is created for that key. At this point, retrieving this value becomes O(n) as the hashtable needs to iterate through each value in that bucket to find the desired one.
For this reason, when using a hash table, it's important to use a key with a really good hash function that's both fast and doesn't often result in duplicate values for different objects.
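The hash value of 123 above is invented, but genuine collisions are easy to find. In Java, for example, the distinct strings "Aa" and "BB" really do share a hash code, so in a HashMap they land in the same bucket and must be told apart by equals():

    class CollisionDemo {
        public static void main(String[] args) {
            System.out.println("Aa".hashCode());   // 2112
            System.out.println("BB".hashCode());   // 2112
            System.out.println("Aa".equals("BB")); // false
        }
    }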
Make sense?
Some hash tables (cuckoo hashing) have guaranteed O(1) lookup
Perhaps you were looking at the space complexity? That is O(n). The other complexities are as expected on the hash table entry. The search complexity approaches O(1) as the number of buckets increases. If, in the worst case, you have only one bucket in the hash table, then the search complexity is O(n).
Edit in response to comment: I don't think it is correct to say O(1) is the average case. It really is (as the Wikipedia page says) O(1 + n/k), where k is the number of buckets. If k is large enough, then the result is effectively O(1). But suppose k is 10 and n is 100. In that case each bucket will have on average 10 entries, so the search time is definitely not O(1); it is a linear search through up to 10 entries.
It depends on how you implement hashing: in the worst case it can go to O(n), and in the best case it is O(1) (generally achievable easily if your data structure is not that big).