I don't understand this explanation, which says that if n is the number of elements in the hash table and m is the total number of buckets, then hash tables have constant access time on average only if m is Θ(n). Why does it have to be proportional?
Well, actually m should be proportional to n. Otherwise you could, for example, have just 1 bucket, and the table would behave like an unsorted list.
To be more precise, if m is proportional to n, i.e. m = c * n, then the number of items in each bucket will be n/m = 1/c, which is a constant. Going to any bucket is an O(1) operation (just compute the hash code), and the search through the bucket is then also constant time (a linear search through the 1/c items in the bucket takes a constant number of steps).
Thus the order of the algorithm is O(1), if m = c * n.
To take a converse example, suppose we had a fixed size table of size tableSize. Then the expected number of items in each bucket is n/tableSize which is a linear function of n. Any kind of search through the bucket is at best O(log(n)) for a tree (I'm assuming you don't stick another hash table inside the bucket or we then have the same argument over that hash table), so it would not be O(1) in this case.
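To make the contrast concrete, here is a small sketch (the function name and the sizes are my own, purely illustrative): it scatters n random keys over m buckets and reports the average bucket length, once with m proportional to n and once with m fixed.

```python
import random

def average_bucket_size(n, m, seed=0):
    """Distribute n random keys over m buckets; return the mean bucket length.

    The mean is exactly n/m regardless of how the keys scatter -- the point
    is how that ratio behaves as n grows."""
    rng = random.Random(seed)
    buckets = [0] * m
    for _ in range(n):
        buckets[rng.randrange(m)] += 1
    return sum(buckets) / m

# m proportional to n (m = n/2): the ratio n/m stays at 2.0 as n grows.
for n in (1_000, 10_000, 100_000):
    print(n, average_bucket_size(n, n // 2))   # 2.0 every time

# m fixed at 100: the ratio n/m grows linearly with n.
for n in (1_000, 10_000, 100_000):
    print(n, average_bucket_size(n, 100))      # 10.0, 100.0, 1000.0
```

In the first case the per-bucket work is bounded by a constant; in the second it grows linearly with n, which is exactly the fixed-table argument above.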
Strictly speaking, the average-case time complexity of hash table access is actually in Ω(n^(1/3)). Information can't travel faster than the speed of light, which is a constant. Since space has three dimensions, storing n bits of data requires that some data be located at a distance on the order of n^(1/3) from the CPU.
More detail in my blog.
The chance of collisions is higher and thus the incidence of having to scan through the list of items with the same hash key is also higher.
Access time is constant because access is based on a calculation of a hash value and then a constant lookup to find the appropriate bucket. Assuming the hash function evenly distributes items amongst buckets, then the time it takes to access any individual item will be equal to the time to access other items, regardless of n.
Constant doesn't necessarily mean constantly low, though. The average access time is related to the even distribution of the hashing function and the number of buckets. If you have thousands of items evenly distributed amongst a small number of buckets, you're finding the bucket fast but then looping through a lot of items in the bucket. If you have a good proportion of buckets to items but a bad hash function that puts many more items in some buckets than in others, the access time for the items in the larger buckets will be slower than the access time for the others.
A reasonably-sized hash table, where there are enough slots for every element you store and plenty of extra space, will have the hashing function doing most of the work choosing slots and very few collisions where different elements have the same hash. A very crowded hash table would have lots of collisions, and would degrade to basically a linear search, where almost every lookup will be a wrong item that had the same hash and you'll have to keep searching for the right one (a hash table lookup still has to check the key once it picks the first slot, because the key it's looking for might have had a collision when it was stored).
What determines the hit-collision ratio is exactly the ratio of number-of-items to size-of-hash (i.e., the percentage chance that a randomly chosen slot will be filled).
What is the best, average, and worst case time complexity for traversing a hash map, under the assumption that the hash map uses chaining with linked lists?
I've read multiple times that the time complexity is O(m+n) for traversal for all three cases (m=number of buckets, n=number of elements). However, this differs from my time complexity analysis: In the worst case all elements are linearly chained in the last bucket which leads to a time complexity of O(m+n). In the best case no hash collisions happen and therefore time complexity should be O(m). In the average case I assume that the elements are uniformly distributed, i.e. each bucket on average has n/m elements. This leads to a time complexity of O(m * n/m) = O(n). Is my analysis wrong?
In practice, a good implementation can always achieve O(n). GCC's C++ Standard Library implementation for the hash table containers unordered_map and unordered_set, for example, maintains a forward/singly linked list between the elements inserted into the hash table, wherein elements that currently hash to the same bucket are grouped together in the list. Hash table buckets contain iterators into the singly-linked list for the point where the element before that bucket's colliding elements start (so if erasing an element, the previous link can be rewired to skip over it).
During traversal, only the singly-linked list need be consulted - the hash table buckets are not visited. This becomes especially important when the load factor is very low (many elements were inserted, then many were erased, but in C++ the table never reduces its size, so you can end up with a very low load factor).
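A toy version of that idea, assuming a much-simplified design rather than GCC's actual data structures: keep ordinary chained buckets for lookup, plus one shared insertion-order list, so that traversal touches only the n elements and never the m buckets.

```python
class ChainedTable:
    """Toy sketch (not GCC's real implementation): buckets support lookup,
    while a single shared list supports traversal that skips empty buckets."""

    def __init__(self, num_buckets=1024):
        self.buckets = [[] for _ in range(num_buckets)]
        self.elements = []  # all keys in insertion order, across all buckets

    def add(self, key):
        self.buckets[hash(key) % len(self.buckets)].append(key)
        self.elements.append(key)

    def contains(self, key):
        # Lookup uses the buckets: O(1) expected with a good hash.
        return key in self.buckets[hash(key) % len(self.buckets)]

    def __iter__(self):
        # Traversal walks only the n elements; the 1024 buckets are never
        # visited, so iteration is O(n) even at a very low load factor.
        return iter(self.elements)

t = ChainedTable()
for k in ("a", "b", "c"):
    t.add(k)
print(list(t))   # ['a', 'b', 'c']
```

With a bucket-by-bucket traversal instead, iterating this table would cost O(m + n) = O(1024 + 3), which is exactly the situation described in the question.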
If instead you have a hash table implementation where each bucket literally maintains a head pointer for its own linked list, then the kind of analysis you attempted comes into play.
You're right about worst case complexity.
In the best case no hash collisions happen and therefore time complexity should be O(m).
It depends. In C++ for example, values/elements are never stored in the hash table buckets (which would waste a huge amount of memory if the values were large in size and many buckets were empty). If instead the buckets contain the "head" pointer/iterator for the list of colliding elements, then even if there's no collision at a bucket, you still have to follow the pointer to a distinct memory area - that's just as bothersome as following a pointer between nodes on the same linked list, and is therefore normally included in the complexity calculation, so it's still O(m + n).
In the average case I assume that the elements are uniformly distributed, i.e. each bucket on average has n/m elements.
No... elements being uniformly distributed across buckets is the best case for a hash table: see above. An "average" or typical case is one where there's more variation in the number of elements hashing to any given bucket. For example, if you have 1 million buckets and 1 million values and a cryptographic-strength hash function, you can statistically expect about 1/e (~36.8%) of buckets to be empty, 1/(1!e) (= 1/e, ~36.8%) to have 1 element, 1/(2!e) (~18.4%) to have 2 colliding elements, 1/(3!e) (~6.1%) to have 3 colliding elements, and so on (the "!" is for factorial; this is the Poisson distribution with mean 1).
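Those fractions can be checked with a short simulation (a sketch written for illustration; the bucket count and seed are arbitrary choices): hash n keys into n buckets at random and compare the observed occupancy fractions to 1/(k! e).

```python
import math
import random
from collections import Counter

def occupancy_fractions(n=100_000, seed=1):
    """Hash n keys into n buckets uniformly at random (load factor 1) and
    return the fraction of buckets holding exactly 0, 1, 2, and 3 elements."""
    rng = random.Random(seed)
    hits = [0] * n
    for _ in range(n):
        hits[rng.randrange(n)] += 1
    counts = Counter(hits)
    return [counts[k] / n for k in range(4)]

observed = occupancy_fractions()
for k, frac in enumerate(observed):
    predicted = 1 / (math.factorial(k) * math.e)  # Poisson(1): e^-1 / k!
    print(k, round(frac, 4), "vs predicted", round(predicted, 4))
```

The observed fractions land within a fraction of a percent of the 1/(k! e) predictions, matching the ~36.8% / ~36.8% / ~18.4% / ~6.1% figures above.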
Anyway, the key point is that a naive bucket-visiting hash table traversal (as distinct from actually being able to traverse a list of elements without bucket-visiting), always has to visit all the buckets, then if you imagine each element being tacked onto a bucket somewhere, there's always one extra link to traverse to reach it. Hence O(m+n).
Regarding hash tables, we measure the performance of the hash table using the load factor. But I need to understand the relationship between the load factor and the time complexity of the hash table. My understanding is that the relation is directly proportional. That is, we take O(1) for the computation of the hash function to find the index. If the load factor is low, this means there are not many elements in the table, so the chance of finding the key-value pair at its computed index is high, the searching is minimal, and the complexity stays constant. On the other hand, when the load factor is high, the chance of finding the key-value pair at its exact position is low, so we will need to do some search operations, and the complexity rises towards O(n). The same can be said for the insert operation. Is this right?
This is a great question, and the answer is "it depends on what kind of hash table you're using."
In a chained hash table, to store an item, you hash it into a bucket, then store the item in that bucket. If multiple items end up in the same bucket, you simply store a list of all the items that end up in that bucket within the bucket itself. (This is the most commonly-taught version of a hash table.) In this kind of hash table, the expected number of elements in a bucket, assuming a good hash function, is O(α), where the load factor is denoted by α. That makes intuitive sense, since if you distribute your items randomly across the buckets you'd expect roughly α of them to end up in each bucket. In this case, as the load factor increases, you will have to do more and more work on average to find an element, since more elements will be in each bucket. The runtime of a lookup won't necessarily reach O(n), though, since you will still have the items distributed across the buckets even if there aren't nearly enough buckets to go around.
A linear probing hash table works by having an array of slots. Whenever you hash an element, you go to its slot, then walk forward in the table until you either find the element or find a free slot. In that case, as the load factor approaches one, more and more table slots will be filled in, and indeed you'll find yourself in a situation where searches do indeed take time O(n) in the worst case because there will only be a few free slots to stop your search. (There's a beautiful and famous analysis by Don Knuth showing that, assuming the hash function behaves like a randomly-chosen function, the cost of an unsuccessful lookup or insertion into the hash table will take time O(1 / (1 - α)²). It's interesting to plot this function and see how the runtime grows as α gets closer and closer to one.)
Hope this helps!
I'm having trouble understanding using the load factor for big-O analysis of Chained and Open-Addressing of HashTables.
From my understanding:
LoadFactor = (# of entries in HashTable)/(# of "slots" in HashTable) = (n/m)
Thus, the LoadFactor reflects how much of the HashTable is being utilized by data being entered into the HashTable.
For a Chained HashTable, the Worst-Case Time Complexity is O(n) because a non-uniform distribution of data with all elements hashed to the last slot in the HashTable reduces the problem to a search in a Linked List of size n.
For an Open-Addressed HashTable, the Worst-Case Time Complexity is O(n) because once again, a non-uniform distribution of data with all elements hashed to one hashCode will result in all elements being entered consecutively. Thus, the problem reduces to a search in an array of size n.
For the worst-case scenarios, I assumed n>m.
Now for a small load factor, both the Chained and Open-Addressed HashTables will yield an O(1).
I fail to see the distinction between the n > m and n < m cases.
Why is this the case?
When n is much smaller than m (the number of buckets), then it's likely that every key has hashed to a unique bucket. As n increases, the likelihood of a collision (two keys hashing to the same bucket) increases. When n > m, then by the Pigeonhole principle, it's guaranteed that at least two keys will hash to the same value.
So when n is much smaller than m, the likelihood of collision is smaller, so the expected lookup time is O(1). As the number of items grows, you spend more time resolving collisions.
Empirical evidence suggests that you don't want your load factor to exceed 0.75. Probably closer to 0.70. When n > 0.70*m, the hash table becomes hugely inefficient. The actual number varies depending on your data, of course, but that's a good rule of thumb.
The Birthday problem shows how the collision rate increases as n approaches m. The likelihood of encountering a collision reaches 50% when you have inserted roughly the square root of the number of slots in the table. If you were to create a hash table of size 365 and start inserting items, your chance of seeing a hash collision passes 50% when you insert just 23 items. (That's assuming a good hashing function. A poor hashing function will increase your likelihood of collision overall.)
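The birthday numbers are easy to verify with the exact product formula for the probability of no collision (a small illustrative helper, not part of the answer above):

```python
def collision_probability(k, m=365):
    """Exact probability that at least two of k uniformly hashed items
    collide in a table with m slots (the birthday problem)."""
    p_no_collision = 1.0
    for i in range(k):
        p_no_collision *= (m - i) / m   # the (i+1)-th item must miss i slots
    return 1 - p_no_collision

print(round(collision_probability(22), 3))   # ≈ 0.476
print(round(collision_probability(23), 3))   # ≈ 0.507 -- past 50% at 23 items
```

So with m = 365 slots the collision probability crosses 50% between the 22nd and 23rd insert, close to sqrt(365) ≈ 19 as the rule of thumb predicts.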
The expected-case time complexity of small load factor hash tables is O(1) because the time to access an item in the hash table doesn't depend on the number of items in the hash table.
I don't understand how hash tables are constant time lookup, if there's a constant number of buckets. Say we have 100 buckets, and 1,000,000 elements. This is clearly O(n) lookup, and that's the point of complexity, to understand how things behave for very large values of n. Thus, a hashtable is never constant lookup, it's always O(n) lookup.
Why do people say it's O(1) lookup on average, and only O(n) for worst case?
The purpose of using a hash is to be able to index into the table directly, just like an array. In the ideal case there's only one item per bucket, and we achieve O(1) easily.
A practical hash table will have more buckets than it has elements, so that the odds of having only one element per bucket are high. If the number of elements inserted into the table gets too great, the table will be resized to increase the number of buckets.
There is always a possibility that every element will have the same hash, or that all active hashes will be assigned to the same bucket; in that case the lookup time is indeed O(n). But a good hash table implementation will be designed to minimize the chance of that occurring.
In layman's terms, with some hand-waving:
At the one extreme, you can have a hash map that is perfectly distributed with one value per bucket. In this case, your lookup returns the value directly, and cost is 1 operation -- or on the order of one, if you like: O(1).
In the real world, implementations often arrange for that to be the case by expanding the size of the table, etc., to meet the requirements of the data. When you have more items than buckets, complexity starts to increase.
In the worst case, you have one bucket and n items in the one bucket. In this case, it is basically like searching a list, linearly. And so if the value happens to be the last one, you need to do n comparisons, to find it. Or, on the order of n: O(n).
The latter case is pretty much always /possible/ for a given data set. That's why there has been so much study and effort put into coming up with good hashing algorithms: it is theoretically possible to engineer a dataset that will cause collisions, and so there is some way to end up with O(n) performance, unless the implementation tweaks other aspects - table size, hash implementation, etc.
By saying
Say we have 100 buckets, and 1,000,000 elements.
you are basically depriving the hashmap of its real power of rehashing, and also not considering the initial capacity of the hashmap in accordance with need. A hashmap is most efficient when each entry gets its own bucket. A lower percentage of collisions can be achieved with a higher capacity. Each collision means you need to traverse the corresponding list.
The points below should be considered for a hash table implementation.
A hashtable is designed such that it resizes itself when the number of entries exceeds the number of buckets by a certain threshold value. This is how we should design it if we wish to implement our own custom hash table.
A good hash function makes sure that entries are well distributed across the buckets of the hashtable. This keeps the lists in the buckets short.
The above ensures that access time remains constant.
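Both points can be put together in a minimal sketch of such a custom hash table (all names and the 0.75 threshold are illustrative choices, not a reference implementation): chaining for collisions, plus a doubling resize once the load factor crosses the threshold.

```python
class ResizingHashTable:
    """Minimal chained hash table that doubles its bucket array whenever the
    load factor n/m exceeds a threshold, keeping bucket lists short."""

    LOAD_LIMIT = 0.75  # illustrative threshold, as discussed above

    def __init__(self):
        self.buckets = [[] for _ in range(8)]
        self.count = 0

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite existing key
                return
        bucket.append((key, value))
        self.count += 1
        if self.count / len(self.buckets) > self.LOAD_LIMIT:
            self._resize()

    def _resize(self):
        # O(n) maintenance: rehash every stored pair into 2x the buckets.
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]
        for bucket in old:
            for key, value in bucket:
                self.buckets[self._index(key)].append((key, value))

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

t = ResizingHashTable()
for i in range(100):
    t.put(i, i * i)
print(len(t.buckets), t.get(50))   # table grew from 8 to 256 buckets; 2500
```

The occasional O(n) rehash is the maintenance cost mentioned in the later answers; spread over the inserts that triggered it, the average cost per operation stays constant.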
I was reading the javadocs on HashSet when I came across the interesting statement:
This class offers constant time performance for the basic operations (add, remove, contains and size)
This confuses me greatly, as I don't understand how one could possibly get constant time, O(1), performance for a comparison operation. Here are my thoughts:
If this is true, then no matter how much data I'm dumping into my HashSet, I will be able to access any element in constant time. That is, if I put 1 element in my HashSet, it will take the same amount of time to find it as if I had a googolplex of elements.
However, this wouldn't be possible if I had a constant number of buckets, or a consistent hash function, since for any fixed number of buckets, the number of elements in that bucket will grow linearly (albeit slowly, if the number is big enough) with the number of elements in the set.
Then, the only way for this to work is to have a changing hash function every time you insert an element (or every few times). A simple hash function that never produces any collisions would satisfy this need. One toy example for strings could be: take the ASCII values of the characters and concatenate them together (because adding them could result in a conflict).
However, this hash function, and any other hash function of this sort will likely fail for large enough strings or numbers etc. The number of buckets that you can form is immediately limited by the amount of stack/heap space you have, etc. Thus, skipping locations in memory can't be allowed indefinitely, so you'll eventually have to fill in the gaps.
But if at some point there's a recalculation of the hash function, this can only be as fast as finding a polynomial which passes through N points, or O(n log n).
Thus arrives my confusion. While I will believe that the HashSet can access elements in O(n/B) time, where B is the number of buckets it has decided to use, I don't see how a HashSet could possibly perform add or get functions in O(1) time.
Note: This post and this post both don't address the concerns I listed.
The number of buckets is dynamic, roughly ~2n, where n is the number of elements in the set.
Note that HashSet gives amortized and average time performance of O(1), not worst case. This means, we can suffer an O(n) operation from time to time.
So, when the bins are too packed up, we just create a new, bigger array, and copy the elements to it.
This costs n operations, and it happens when the number of elements in the set exceeds 2n/2 = n, which means the average cost of this operation is bounded by n/n = 1, a constant.
Additionally, the number of collisions a HashMap encounters is also constant on average.
Assume you are adding an element x. The probability of h(x) being occupied by one element is ~n/2n = 1/2. The probability of it being occupied by two elements is ~(n/2n)^2 = 1/4 (for large values of n), and so on.
This gives you an average running time of 1 + 1/2 + 1/4 + 1/8 + .... Since this sum converges to 2, it means this operation takes constant time on average.
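The doubling part of that argument can be checked by simply counting copy operations (a quick illustrative sketch, with names of my own choosing): over n inserts, all the resize copies together stay below 2n, so the amortized cost per insert is constant.

```python
def total_copy_work(n_inserts):
    """Count the element copies performed by doubling resizes over
    n_inserts appends. The total stays below 2 * n_inserts, i.e. the
    amortized cost per insert is O(1)."""
    capacity, size, copies = 1, 0, 0
    for _ in range(n_inserts):
        if size == capacity:
            copies += size      # copy everything into a table twice as big
            capacity *= 2
        size += 1
    return copies

for n in (1_000, 1_000_000):
    print(n, total_copy_work(n), total_copy_work(n) / n)  # ratio stays < 2
```

The copies form the geometric series 1 + 2 + 4 + ... mirrored by the 1 + 1/2 + 1/4 + ... sum above: doubling guarantees the whole series is bounded by twice the number of inserts.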
What I know about hashed structures is that to keep O(1) complexity for insertion and removal, you need a good hash function to avoid collisions, and the structure should not be full (if the structure is full you will have collisions).
Normally hashed structures define a kind of fill limit, for example 70%.
When the number of objects makes the structure filled beyond this limit, you should extend its size to stay below the limit and guarantee performance. Generally you double the size of the structure when reaching the limit, so that the structure size grows faster than the number of elements, reducing the number of resize/maintenance operations to perform.
This is a kind of maintenance operation that consists of rehashing all elements contained in the structure to redistribute them in the resized structure. For sure this has a cost, whose complexity is O(n) with n the number of elements stored in the structure, but this cost is not attributed to the individual add operation that triggered the maintenance.
I think this is what disturbs you.
I also learned that the hash function generally depends on the size of the structure, which is used as a parameter (there was something like the structure size being a prime number to reduce the probability of collision, or something like that), meaning that you don't change the hash function itself, you just change one of its parameters.
To answer your comment: there is no guarantee, if bucket 0 or 1 was filled, that when you resize to 4 buckets the new elements will go into buckets 2 and 3. Perhaps resizing to 4 makes elements A and B end up in buckets 0 and 3.
For sure all of the above is theoretical, and in real life you don't have infinite memory, you can have collisions, and maintenance has a cost, etc. That's why you need an idea of the number of objects you will store, and make a trade-off with available memory to choose an initial size for the hashed structure that limits the need for maintenance operations and lets you stay at O(1) performance.