I understand that insertion for hash tables is O(1), and sometimes O(n) depending on the load factor. This makes sense to me; however, I'm still confused. When talking about insertion, are we including the hash function in that measurement, or is it just placing some value at that index? For ints I could see how it could be O(1), but what about strings or any other objects?
Edit: This seemed to answer my question, sorry about the confusion.
Time complexity of creating hash value of a string in hashtable
Yes, the hash function needs to be included in the cost of lookup in a hash table, just like the comparison function needs to be included in the cost of lookup in a sorted table. If the keys have unbounded size, then the key length must be somehow accounted for.
You could stop computing the hash at a certain fixed key length. (Lua does that, for example). But there is a pathological case where every new key is a suffix of the previously inserted key, which would eventually reduce a bounded-length hash function to a linear search.
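As a rough illustration (in Java, and only a sketch of the idea rather than Lua's actual algorithm), a bounded-length hash might look like the following: the hashing cost no longer grows with the key length, but any keys that share the hashed characters all collide.

    // Sketch only, assuming we cap hashing at the last 32 characters of the key
    // (Lua's real scheme differs in its details). Hashing now costs O(1) in the
    // key length, but keys that share those trailing characters, for example a
    // sequence of keys where each new key is a suffix of the previous one, all
    // collide and degrade lookups in that bucket to a linear scan.
    static int boundedHash(String key) {
        final int MAX_HASHED = 32;                        // fixed cap on hashed characters
        int start = Math.max(0, key.length() - MAX_HASHED);
        int h = 0;
        for (int i = start; i < key.length(); i++) {
            h = 31 * h + key.charAt(i);                   // ordinary polynomial hash, truncated
        }
        return h;
    }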
Regardless of the hash function, the hash table lookup must eventually compare the found key (if there is one) with the target key, to ensure that they are the same. So that must take time proportional to the size of the key.
In short, constant average-time hash table lookup requires that keys have a bounded size, which is not necessarily the case. But since alternative lookup algorithms would also be affected by key size, this fact doesn't generally help in comparing different lookup algorithms.
The way I understand it, a hash table is an array of linked lists, so it should actually be O(n/array_length).
Isn't saying that it's O(1) just plain wrong?
For example, if I have 1M items in a hash table that is based on a 100-element array, the average lookup would take about 5,000 comparisons. Clearly not O(1), although I assume that most implementations of a hash table use a much bigger array.
And what is usually the array size that is being used in most languages' (JS, Go, etc) hash table implementations?
You are correct. In general, it is not possible to say that all hash tables are O(1). It depends on some design decisions, and on other factors.
In your example, you seem to be talking about a hash table with a fixed number of buckets (100) and an unbounded number of entries N. With that design, it will take on average N / 200 comparisons to find a key that is present, and on average N / 100 comparisons to discover that a key is not present. That is O(N).
There is a common implementation strategy to deal with this. As the hash table gets larger, you periodically resize the primary array, and then redistribute the keys / entries. For example, the standard Java HashMap and Hashtable classes track the ratio of the number of entries to the array size. When the ratio exceeds a configurable load factor, the primary array size is doubled. (See the javadoc for an explanation of the load factor.)
The analysis of this is complicated. However, if we can assume that keys are roughly evenly distributed across the buckets, we get the following:
average lookup times that are O(1)
average insertion times that are O(1)
worst-case insertion times that are O(N) ... when the insertion triggers a resize.
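To make that concrete, here is a minimal sketch of a chained hash table with a load-factor-triggered doubling resize. It is a hypothetical toy class written for illustration (SimpleHashMap, indexFor, and so on are names invented here), not the real java.util.HashMap, which is considerably more sophisticated.

    import java.util.LinkedList;

    // Minimal chained hash table with a load-factor-triggered doubling resize.
    // A hypothetical sketch for illustration only, not the real java.util.HashMap.
    class SimpleHashMap<K, V> {

        private static class Entry<K, V> {
            final K key;
            V value;
            Entry(K key, V value) { this.key = key; this.value = value; }
        }

        private LinkedList<Entry<K, V>>[] buckets = newTable(16);
        private int size = 0;
        private static final double LOAD_FACTOR = 0.75;

        @SuppressWarnings("unchecked")
        private static <K, V> LinkedList<Entry<K, V>>[] newTable(int capacity) {
            return (LinkedList<Entry<K, V>>[]) new LinkedList[capacity];
        }

        private int indexFor(Object key, int capacity) {
            return (key.hashCode() & 0x7fffffff) % capacity;  // strip sign bit, then reduce
        }

        // Average O(1); worst case O(n) if everything lands in one bucket.
        public V get(K key) {
            LinkedList<Entry<K, V>> chain = buckets[indexFor(key, buckets.length)];
            if (chain != null) {
                for (Entry<K, V> e : chain) {
                    if (e.key.equals(key)) {
                        return e.value;   // the final equality check is unavoidable
                    }
                }
            }
            return null;
        }

        // Amortized O(1); O(n) on the insertion that triggers a resize.
        public void put(K key, V value) {
            if (size + 1 > LOAD_FACTOR * buckets.length) {
                resize();
            }
            int i = indexFor(key, buckets.length);
            if (buckets[i] == null) {
                buckets[i] = new LinkedList<>();
            }
            for (Entry<K, V> e : buckets[i]) {
                if (e.key.equals(key)) {
                    e.value = value;      // overwrite an existing key
                    return;
                }
            }
            buckets[i].add(new Entry<>(key, value));
            size++;
        }

        // O(n): every entry is redistributed into twice as many buckets.
        private void resize() {
            LinkedList<Entry<K, V>>[] old = buckets;
            buckets = newTable(old.length * 2);
            for (LinkedList<Entry<K, V>> chain : old) {
                if (chain == null) continue;
                for (Entry<K, V> e : chain) {
                    int i = indexFor(e.key, buckets.length);
                    if (buckets[i] == null) buckets[i] = new LinkedList<>();
                    buckets[i].add(e);
                }
            }
        }
    }

Because the capacity doubles each time, the O(N) resize cost is spread over at least N/2 cheap insertions, which is what makes the amortized insertion cost O(1).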
What if the key distribution is badly skewed?
This can happen if the hash function is a poor one, or if the process that generates the keys is pathological.
Well, if you don't do anything about it, the worst case occurs when all keys have the same hash value and end up in the same bucket ... irrespective of the primary array size. This results in O(N) lookup and insertion.
But there are a couple of ways to mitigate this. A simple way is to perform a second hashing operation on the keys' hashcodes. This helps in some cases. A more complex way is to replace the hash chains with balanced binary trees. This changes the average behavior of lookup and insertion (for the case of pathological keys) from O(N) to O(logN).
From Java 8 onwards, the HashMap implementation uses either hash chains or trees, depending on the number of keys in a given bucket.
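For reference, the "second hashing operation" can be as cheap as folding the high bits of a hashcode into the low bits. The snippet below is in the spirit of what OpenJDK 8's HashMap does (the exact code varies by version, so treat it as an approximation):

    // Supplemental hash: XOR the high 16 bits of the hashcode into the low
    // 16 bits, so keys whose hashcodes differ only in the high bits still
    // spread out across a power-of-two sized bucket array.
    static int spread(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }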
And what is usually the array size that is being used in most languages' (JS, Go, etc) hash table implementations?
For Java (and probably for the others) the array size changes over the lifetime of the hash table as described above. In Java, there is an upper limit on the size of the array: Java arrays can only have 2^31 - 1 elements.
In the implementations you mention, the array starts small, and is reallocated with a bigger size as more elements are added.
The array size stays within a constant factor of the element count, so that the average number of elements per slot is bounded.
I don't understand how hash tables are constant time lookup, if there's a constant number of buckets. Say we have 100 buckets, and 1,000,000 elements. This is clearly O(n) lookup, and that's the point of complexity, to understand how things behave for very large values of n. Thus, a hashtable is never constant lookup, it's always O(n) lookup.
Why do people say it's O(1) lookup on average, and only O(n) for worst case?
The purpose of using a hash is to be able to index into the table directly, just like an array. In the ideal case there's only one item per bucket, and we achieve O(1) easily.
A practical hash table will have more buckets than it has elements, so that the odds of having only one element per bucket are high. If the number of elements inserted into the table gets too great, the table will be resized to increase the number of buckets.
There is always a possibility that every element will have the same hash, or that all active hashes will be assigned to the same bucket; in that case the lookup time is indeed O(n). But a good hash table implementation will be designed to minimize the chance of that occurring.
In layman's terms, with some hand-waving:
At the one extreme, you can have a hash map that is perfectly distributed with one value per bucket. In this case, your lookup returns the value directly, and cost is 1 operation -- or on the order of one, if you like: O(1).
In the real world, implementations often arrange for that to be the case by expanding the size of the table, etc., to meet the requirements of the data. When you have more items than buckets, you start increasing complexity.
In the worst case, you have one bucket and n items in the one bucket. In this case, it is basically like searching a list, linearly. And so if the value happens to be the last one, you need to do n comparisons, to find it. Or, on the order of n: O(n).
The latter case is pretty much always /possible/ for a given data set; that's why there has been so much study and effort put into coming up with good hashing algorithms. It is theoretically possible to engineer a dataset that will cause collisions, and thus end up with O(n) performance, unless the implementation tweaks other aspects: table size, hash implementation, and so on.
By saying
Say we have 100 buckets, and 1,000,000 elements.
you are basically depriving the hashmap of its real power of rehashing, and you are not considering the initial capacity of the hashmap in accordance with the need. A hashmap is most efficient when each entry gets its own bucket; a lower collision rate can be achieved with a higher capacity. Each collision means you need to traverse the corresponding list.
The points below should be considered for a hash table implementation.
A hashtable is designed so that it resizes itself once the number of entries exceeds the number of buckets by a certain threshold factor. This is also how we should design our own custom hash table.
A good hash function makes sure that entries are well distributed across the buckets of the hashtable. This keeps the list in each bucket short.
The above ensures that access time remains constant.
I was reading about tries and this topcoder article (https://www.topcoder.com/community/data-science/data-science-tutorials/using-tries/) says:
The tries can insert and find strings in O(L) time (where L represents the length of a single word). This is much faster than set, but it is a bit faster than a hash table.
I had always learned that sets and hash tables were really fast for looking things up and that they had constant lookup time. Is this not true? Why is it "much faster" than a set? And it also seems to imply that hash tables have different lookup time than sets too. I always thought that sets and hash tables were implemented in pretty much the same way except that one stores some object.
The referenced article is not comparing a trie with an abstract "set" datastructure; it is comparing the trie with the C++ standard library std::set, which is a search tree, usually a red-black tree, which allows you to iterate the contents in sorted order. (C++ also has std::unordered_set, which is based on a hash table, but the article may have been written before that was part of the standard library.)
Hash tables are (on average) O(1) only if the hash can be computed in O(1), since the hash of the key must be computed before any lookup is done. For string keys, most hash functions need to look at every character in the key, so they are O(L) in the length of the string. (This rather obvious fact is for some reason often skipped over in discussion of hashtable computational complexity.) Since both the trie and the hashtable must eventually verify that the provided key is equal to the candidate key in the container, there is an O(L) factor in both cases.
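To make the O(L) factor concrete, here is a typical polynomial string hash (the same general shape as Java's String.hashCode): it performs one multiply-and-add per character, so its cost grows linearly with the key length.

    // A typical polynomial string hash: one multiply-add per character,
    // so the cost is O(L) in the length of the string.
    static int stringHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }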
However, tries still have advantages. For example, they can be iterated in lexicographic order, like std::set, but usually faster, whereas hashtables can only be iterated in some non-deterministic order. So if you need to do prefix searches, the hashtable is not an appropriate datastructure.
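For illustration, a minimal trie over a lowercase a-z alphabet (an assumption made here just to keep the sketch short) shows both the O(L) insert/find walk and the prefix query that a hash table cannot offer directly:

    // Minimal trie over the 26 lowercase letters (assumed alphabet).
    // insert and contains both walk at most L nodes for a key of length L.
    class Trie {
        private final Trie[] children = new Trie[26];
        private boolean isWord;

        void insert(String word) {
            Trie node = this;
            for (int i = 0; i < word.length(); i++) {
                int c = word.charAt(i) - 'a';
                if (node.children[c] == null) node.children[c] = new Trie();
                node = node.children[c];
            }
            node.isWord = true;
        }

        boolean contains(String word) {
            Trie node = find(word);
            return node != null && node.isWord;
        }

        boolean hasPrefix(String prefix) {   // the operation a hash table cannot offer directly
            return find(prefix) != null;
        }

        private Trie find(String s) {
            Trie node = this;
            for (int i = 0; i < s.length() && node != null; i++) {
                node = node.children[s.charAt(i) - 'a'];
            }
            return node;
        }
    }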
If we look at it from a Java perspective, then we can say that hashmap lookup takes constant time. But what about the internal implementation? It still would have to search through the particular bucket (the one the key's hashcode mapped to), comparing keys. Then why do we say that hashmap lookup takes constant time? Please explain.
Under the appropriate assumptions on the hash function being used, we can say that hash table lookups take expected O(1) time (assuming you're using a standard hashing scheme like linear probing or chained hashing). This means that on average, the amount of work that a hash table does to perform a lookup is at most some constant.
Intuitively, if you have a "good" hash function, you would expect that elements would be distributed more or less evenly throughout the hash table, meaning that the number of elements in each bucket would be close to the number of elements divided by the number of buckets. If the hash table implementation keeps this number low (say, by adding more buckets every time the ratio of elements to buckets exceeds some constant), then the expected amount of work that gets done ends up being some baseline amount of work to choose which bucket should be scanned, then doing "not too much" work looking at the elements there, because on expectation there will only be a constant number of elements in that bucket.
This doesn't mean that hash tables have guaranteed O(1) behavior. In fact, in the worst case, the hashing scheme will degenerate and all elements will end up in one bucket, making lookups take time Θ(n) in the worst case. This is why it's important to design good hash functions.
For more information, you might want to read an algorithms textbook to see the formal derivation of why hash tables support lookups so efficiently. This is usually included as part of a typical university course on algorithms and data structures, and there are many good resources online.
Fun fact: there are certain types of hash tables (cuckoo hash tables, dynamic perfect hash tables) where the worst case lookup time for an element is O(1). These hash tables work by guaranteeing that each element can only be in one of a few fixed positions, with insertions sometimes scrambling around elements to try to make everything fit.
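Here is a sketch of why lookup is worst-case O(1) in a cuckoo-style table (the helper names and the two hash functions below are illustrative choices, not any particular library's API): each key may only occupy one of two slots, so a lookup never inspects more than two positions.

    // Lookup sketch for a cuckoo-style table: each key is allowed to sit only
    // at one of two positions given by two hash functions, so a lookup checks
    // at most two slots no matter how large the table is. (Insertion is the
    // part that may have to displace existing elements to keep this invariant.)
    static boolean cuckooContains(Object[] table1, Object[] table2, Object key) {
        int i1 = (h1(key) & 0x7fffffff) % table1.length;
        int i2 = (h2(key) & 0x7fffffff) % table2.length;
        return key.equals(table1[i1]) || key.equals(table2[i2]);
    }

    // Two hash functions; these particular choices are illustrative only.
    static int h1(Object key) { return key.hashCode(); }
    static int h2(Object key) { int h = key.hashCode(); return h ^ (h >>> 16) ^ 0x9e3779b9; }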
Hope this helps!
The key is in this statement in the docs:
If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large capacity will allow the mappings to be stored more efficiently than letting it perform automatic rehashing as needed to grow the table.
and
The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
http://docs.oracle.com/javase/6/docs/api/java/util/HashMap.html
The internal bucket structure will actually be rebuilt if the load factor is exceeded, allowing for the amortized cost of get and put to be O(1).
Note that if the internal structure is rebuilt, that introduces a performance penalty that is likely to be O(N), so quite a few get and put may be required before the amortized cost approaches O(1) again. For that reason, plan the initial capacity and load factor appropriately, so that you neither waste space, nor trigger avoidable rebuilding of the internal structure.
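As an example of that planning in Java, the HashMap(int initialCapacity, float loadFactor) constructor lets you size the table up front; a common rule of thumb is to request a capacity of at least expectedEntries / loadFactor (the numbers below are hypothetical):

    import java.util.HashMap;
    import java.util.Map;

    class PreSizedMapExample {
        public static void main(String[] args) {
            // Pre-size the map so the expected number of entries never exceeds
            // loadFactor * capacity, which avoids rehashes while the map is filled.
            int expectedEntries = 1_000_000;   // hypothetical workload size
            float loadFactor = 0.75f;
            int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);
            Map<String, Integer> map = new HashMap<>(initialCapacity, loadFactor);
            System.out.println("requested capacity: " + initialCapacity);
        }
    }

(HashMap rounds the requested capacity up to a power of two internally.)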
Hashtables AREN'T O(1).
Via the pigeonhole principle, you cannot be better than O(log(n)) for lookup, because you need log(n) bits per item to uniquely identify n items.
Hashtables seem to be O(1) because they have a small constant factor combined with their 'n' in the O(log(n)) being increased to the point that, for many practical applications, it is independent of the number of actual items you are using. However, big O notation doesn't care about that fact, and it is a (granted, absurdly common) misuse of the notation to call hashtables O(1).
Because while you could store a million, or a billion, items in a hashtable and still get the same lookup time as a single-item hashtable... you lose that ability if you're talking about a nonillion or a googolplex items. The fact that you will never actually be using a nonillion or googolplex items doesn't matter for big O notation.
Practically speaking, hashtable performance can be a constant factor worse than array lookup performance. Which, yes, is also O(log(n)), because you CAN'T do better.
Basically, real-world computers make every lookup in an array smaller than their maximum addressable size just as costly as a lookup in the biggest theoretically usable array, and since hashtables are clever tricks performed on arrays, that's why you seem to get O(1).
To follow up on templatetypedef's comments as well:
A constant-time implementation of a hash table could be a hashmap with which you implement a boolean array list that indicates whether a particular element exists in a bucket. However, if you are implementing your hashmap with linked lists, the worst case would require going through every bucket and traversing to the ends of the lists.
Why do I keep seeing different runtime complexities for these functions on a hash table?
On wiki, search and delete are O(n) (I thought the point of hash tables was to have constant lookup, so what's the point if search is O(n)?).
In some course notes from a while ago, I see a wide range of complexities depending on certain details including one with all O(1). Why would any other implementation be used if I can get all O(1)?
If I'm using standard hash tables in a language like C++ or Java, what can I expect the time complexity to be?
Hash tables have O(1) average and amortized case complexity; however, they suffer from O(n) worst-case time complexity. [And I think this is where your confusion is]
Hash tables suffer from O(n) worst-case time complexity for two reasons:
If too many elements were hashed into the same bucket, looking inside that bucket may take O(n) time.
Once a hash table has exceeded its load factor, it has to rehash [create a new, bigger table and re-insert each element into it].
However, it is said to be O(1) average and amortized case because:
It is very rare that many items will be hashed to the same bucket [if you choose a good hash function and you don't have too high a load factor].
The rehash operation, which is O(n), can happen at most once per n/2 operations, which are all assumed to be O(1). Thus, when you sum the average time per operation, you get (n*O(1) + O(n)) / n = O(1).
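A quick way to see this (assuming a hypothetical table that doubles its size whenever it fills): inserting n elements triggers rehashes costing roughly 1 + 2 + 4 + ... + n/2 + n < 2n operations in total, so rehashing adds at most about two extra operations per insertion on average.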
Note that because of the rehashing issue, real-time applications and applications that need low latency should not use a hash table as their data structure.
EDIT: Another issue with hash tables: cache
Another issue where you might see a performance loss in large hash tables is cache performance. Hash tables suffer from bad cache behavior, so for a large collection the access time might be longer, since you need to reload the relevant part of the table from memory back into the cache.
Ideally, a hashtable is O(1). The problem arises when two keys are not equal but result in the same hash.
For example, imagine the strings "it was the best of times it was the worst of times" and "Green Eggs and Ham" both resulted in a hash value of 123.
When the first string is inserted, it's put in bucket 123. When the second string is inserted, it would see that a value already exists for bucket 123. It would then compare the new value to the existing value, and see they are not equal. In this case, an array or linked list is created for that key. At this point, retrieving this value becomes O(n) as the hashtable needs to iterate through each value in that bucket to find the desired one.
For this reason, when using a hash table, it's important to use a key with a really good hash function that's both fast and doesn't often result in duplicate values for different objects.
Make sense?
Some hash tables (cuckoo hashing) have guaranteed O(1) lookup
Perhaps you were looking at the space complexity? That is O(n). The other complexities are as expected on the hash table entry. The search complexity approaches O(1) as the number of buckets increases. If at the worst case you have only one bucket in the hash table, then the search complexity is O(n).
Edit in response to comment: I don't think it is correct to say O(1) is the average case. It really is (as the Wikipedia page says) O(1 + n/k), where k is the hash table size. If k is large enough, then the result is effectively O(1). But suppose k is 10 and n is 100. In that case each bucket will have on average 10 entries, so the search time is definitely not O(1); it is a linear search through up to 10 entries.
It depends on how you implement hashing: in the worst case it can go to O(n); in the best case it is O(1) (which you can generally achieve easily if your data structure is not that big).