Looking at the Wikipedia article on hash tables, it says that inserting and searching are O(1). But my concern is that my teacher told me that only the lookup is O(1) and that hashing is O(s), where s is the length of the string. Shouldn't inserting and searching be O(s) instead, since an operation costs hashing(s) + lookup = O(s) + O(1) = O(s)?
Could anyone explain to me the correct way of writing the time complexity of hash table operations in big O notation, and why? Assume perfect hashing, so no collisions occur.
Hash tables are used for more than just strings. The O(1) complexities for insert and lookup are for hash tables in general and only count the known operations.
Hashing and comparison are each counted as O(1), because something must always be done for those operations, even if you're just storing integers, but the general analysis cannot say what that something costs.
If you use a hash table with a data type (like strings) that multiplies the cost of those operations, then it multiplies the overall complexity as well.
It is actually very important to consider this when measuring the complexity of a concrete algorithm that uses hash tables. Many of the string-based algorithms on this site, for example, are given complexities based on the assumption that the length of input strings is bounded by some constant. Thankfully that is usually the case.
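For instance, here is a minimal sketch of a polynomial string hash in Java (similar in spirit to String.hashCode, though not meant as anyone's exact implementation); the loop makes the O(s) cost of hashing a string of length s explicit:

    // A sketch of a polynomial string hash. The loop touches every character
    // once, so computing the hash of a string of length s costs O(s), even if
    // the bucket lookup that follows is only O(1) comparisons.
    static int stringHash(String key) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = 31 * h + key.charAt(i);   // one constant-time step per character
        }
        return h;
    }

So for string keys an insert or lookup really does cost O(s) work plus O(1) comparisons, which is exactly the distinction your teacher was drawing.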
This question is very similar to a question I asked: Is a lookup in a hash table O(1)?
The accepted answer was that for hashtables, "time" is measured in comparisons, and not operations. Here's the full answer, quoted:
What is wrong with your reasoning is the use of conflicting definitions of "time".

When one says that lookup in a hash table takes O(1) time, one usually means that it takes O(1) comparisons, that is, the number of comparisons required to find an item is bounded above by a constant. Under this idea of "time", the actual time (as in the thing you would measure in seconds) used to compute the hash causes no variation.

Measuring time in comparisons is an approximation that, while it may not reflect reality in the same way that measuring it in seconds would, still provides useful information about the behaviour of the hash table.

This sort of thing is true for most asymptotic complexity descriptions of algorithms: people often use "time" with a very abstract meaning that isn't the informal meaning of "time", but more often than not is some variation of "number of operations" (with the kind of operation often left unstated, expected to be obvious, or clear from context).
Related
If a hash table holds N distinct items, and is not overloaded, then the hashes for the N items must have approximately lg(N) bits, otherwise too many items will get the same hash value.
But a hash table lookup is usually said to take O(1) time on average.
It's not possible to generate lg(N) bits in O(1) time, so the standard results for the complexity of hash tables are wrong.
What's wrong with my reasoning?
The analysis is based on the assumption that the hash function is fixed and not related to the actual number of elements stored in the table. Rather than saying that the hash function returns a lg(N)-bit value if there are N elements in the hash table, the analysis assumes a hash function that returns, say, a k-bit value, where k is independent of N. Typical values of k (such as 32 or 64) provide for a hash table far larger than anything you need in practice.
So in one sense, yes, a table holding N elements requires a hash function that returns O(lg N) bits; but in practice, a constant far larger than the anticipated maximum value of lg N is used.
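A sketch of what that looks like in code (bucketIndex is an illustrative helper, not a library function): the hash itself is always a fixed-width value, and only the reduction to a bucket index depends on the table's current capacity.

    // The hash is a fixed 32-bit value regardless of how many elements the
    // table holds; the table merely folds it down to an index for its current
    // number of buckets.
    static int bucketIndex(Object key, int capacity) {
        int h = key.hashCode();                 // k = 32 bits, independent of N
        return (h & 0x7fffffff) % capacity;     // clear the sign bit, then reduce
    }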
Hashtable search is O(1).
I think you are mixing up insertion (which can be O(n) in the worst case, when a rehash is triggered) and search.
If we look at it from a Java perspective, then we can say that a HashMap lookup takes constant time. But what about the internal implementation? It still has to search through the particular bucket (the one the key's hashcode mapped to) for different matching keys. Then why do we say that hashmap lookup takes constant time? Please explain.
Under the appropriate assumptions on the hash function being used, we can say that hash table lookups take expected O(1) time (assuming you're using a standard hashing scheme like linear probing or chained hashing). This means that on average, the amount of work that a hash table does to perform a lookup is at most some constant.
Intuitively, if you have a "good" hash function, you would expect that elements would be distributed more or less evenly throughout the hash table, meaning that the number of elements in each bucket would be close to the number of elements divided by the number of buckets. If the hash table implementation keeps this number low (say, by adding more buckets every time the ratio of elements to buckets exceeds some constant), then the expected amount of work that gets done ends up being some baseline amount of work to choose which bucket should be scanned, then doing "not too much" work looking at the elements there, because on expectation there will only be a constant number of elements in that bucket.
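As a rough sketch of what that work looks like for chained hashing (an illustrative class, not a real library API): one hash computation to pick a bucket, then a scan of that bucket, which on expectation holds only a constant number of elements when the load factor is kept bounded.

    import java.util.LinkedList;

    // Minimal chained hash set showing only the lookup path.
    class ChainedHashSet<K> {
        private final LinkedList<K>[] buckets;

        @SuppressWarnings("unchecked")
        ChainedHashSet(int capacity) {
            buckets = new LinkedList[capacity];
            for (int i = 0; i < capacity; i++) {
                buckets[i] = new LinkedList<>();
            }
        }

        boolean contains(K key) {
            int index = (key.hashCode() & 0x7fffffff) % buckets.length; // choose the bucket
            for (K candidate : buckets[index]) {                        // expected O(1) comparisons
                if (candidate.equals(key)) {
                    return true;
                }
            }
            return false;
        }
    }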
This doesn't mean that hash tables have guaranteed O(1) behavior. In fact, in the worst case, the hashing scheme will degenerate and all elements will end up in one bucket, making lookups take time Θ(n) in the worst case. This is why it's important to design good hash functions.
For more information, you might want to read an algorithms textbook to see the formal derivation of why hash tables support lookups so efficiently. This is usually included as part of a typical university course on algorithms and data structures, and there are many good resources online.
Fun fact: there are certain types of hash tables (cuckoo hash tables, dynamic perfect hash tables) where the worst case lookup time for an element is O(1). These hash tables work by guaranteeing that each element can only be in one of a few fixed positions, with insertions sometimes scrambling around elements to try to make everything fit.
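The lookup side of cuckoo hashing is easy to sketch (illustrative code only, glossing over the insertion and eviction logic): each key has exactly two candidate slots, so a lookup inspects at most two positions no matter what.

    // A key can only live at h1(key) in the first table or h2(key) in the
    // second, so the worst-case lookup cost is two probes. Insertion, which
    // may evict and relocate existing elements, is omitted here.
    class CuckooLookup {
        private final Object[] table1;
        private final Object[] table2;

        CuckooLookup(int capacity) {
            table1 = new Object[capacity];
            table2 = new Object[capacity];
        }

        // Stand-ins for two independent hash functions.
        private int h1(Object key) { return (key.hashCode() & 0x7fffffff) % table1.length; }
        private int h2(Object key) { return (Integer.reverse(key.hashCode()) & 0x7fffffff) % table2.length; }

        boolean contains(Object key) {
            return key.equals(table1[h1(key)]) || key.equals(table2[h2(key)]);
        }
    }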
Hope this helps!
The key is in this statement in the docs:
If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large capacity will allow the mappings to be stored more efficiently than letting it perform automatic rehashing as needed to grow the table.
and
The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
http://docs.oracle.com/javase/6/docs/api/java/util/HashMap.html
The internal bucket structure will actually be rebuilt if the load factor is exceeded, allowing for the amortized cost of get and put to be O(1).
Note that if the internal structure is rebuilt, that introduces a performance penalty that is likely to be O(N), so quite a few get and put may be required before the amortized cost approaches O(1) again. For that reason, plan the initial capacity and load factor appropriately, so that you neither waste space, nor trigger avoidable rebuilding of the internal structure.
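For example (the numbers are made up, but the two-argument constructor is the one the quoted docs describe):

    import java.util.HashMap;
    import java.util.Map;

    class PresizedMapExample {
        public static void main(String[] args) {
            // If you know roughly how many mappings you will store, size the map
            // up front so it never has to rehash while you fill it.
            int expectedEntries = 1_000_000;                       // illustrative guess
            float loadFactor = 0.75f;                              // HashMap's default
            int initialCapacity = (int) (expectedEntries / loadFactor) + 1;

            Map<String, Integer> map = new HashMap<>(initialCapacity, loadFactor);
            map.put("answer", 42);                                 // puts stay amortized O(1), no resizing
        }
    }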
Hashtables AREN'T O(1).
Via the pigeonhole principle, you cannot be better than O(log(n)) for lookup, because you need log(n) bits per item to uniquely identify n items.
Hashtables seem to be O(1) because they have a small constant factor combined with the 'n' in that O(log(n)) being pushed so high that, for many practical applications, it is independent of the number of items you are actually using. However, big O notation doesn't care about that fact, and it is a (granted, absurdly common) misuse of the notation to call hashtables O(1).
Because while you could store a million, or a billion, items in a hashtable and still get the same lookup time as in a single-item hashtable... you lose that ability if you're talking about a nonillion or a googolplex items. The fact that you will never actually be using a nonillion or googolplex items doesn't matter for big O notation.
Practically speaking, hashtable performance can be a constant factor worse than array lookup performance. Which, yes, is also O(log(n)), because you CAN'T do better.
Basically, real-world computers make every lookup in an array smaller than their chip bit size just as expensive as a lookup in their biggest theoretically usable array, and since hashtables are clever tricks performed on arrays, that's why you seem to get O(1).
To follow up on templatetypedef's comments as well:
A constant-time implementation of a hash table could be a hashmap backed by a boolean array list that indicates whether a particular element exists in a bucket. However, if you are implementing your hashmap with linked lists, the worst case would require going through a bucket and traversing its list all the way to the end.
The amortised performance of Hash tables is often said to be O(1) for most operations.
What is the amortized performance for a search operation on say a standard LinkedList implementation? Is it O(n)?
I'm a little confused on how this is computed, since in the worst-case (assuming say a hash function that always collides), a Hash table is pretty much equivalent to a LinkedList in terms of say a search operation (assuming a standard bucket implementation).
I know in practice this would never happen unless the hash function was broken, and so the average performance is almost constant time over a series of operations, since collisions are rare. But when calculating amortized worst-case performance, shouldn't we consider the worst-case sequence with the worst-case implementation?
There is no such thing as "amortized worst-case performance". Amortized performance is a kind of "average" performance over a long sequence of operations.
With a hash table, sometimes the hash table will need to be resized after a long sequence of inserts, which will take O(n) time. But, since it only happens every O(n) inserts, that operation's cost is spread out over all the inserts to get O(1) amortized time.
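As a back-of-the-envelope sketch (assuming the table doubles its capacity whenever it fills up, and a resize re-inserts every element):

    cost of n inserts  ≈  n                      (the inserts themselves)
                        + 1 + 2 + 4 + ... + n    (copies during each doubling)
                        ≤  n + 2n  =  3n

    amortized cost per insert  ≈  3n / n  =  O(1)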
Yes, a hash table could be O(n) for every operation in the worst case of a broken hash function. But, analyzing such a hash table is meaningless because it won't be the case for typical usage.
"Worst case" sometimes depends on "worst case under what constraints".
The case of a hashtable with a valid but stupid hash function mapping all keys to 0 generally isn't a meaningful "worst case", it's not sufficiently interesting. So you can analyse a hashtable's average performance under the minimal assumption that (for practical purposes) the hash function distributes the set of all keys uniformly across the set of all hash values.
If the hash function is reasonably sound but not cryptographically secure there's a separate "worst case" to consider. A malicious or unwitting user could systematically provide data whose hashes collide. You'd come up with a different answer for the "worst case input" vs the "worst case assuming input with well-distributed hashes".
In a given sequence of insertions to a hashtable, one of them might provoke a rehash. Then you would consider that one the "worst case" in that particular sequence. This has very little to do with the input data overall -- if the load factor gets high enough you're going to rehash eventually, but rarely. That's why the "amortised" running time is an interesting measure, whenever you can put a tighter upper bound on the total cost of n operations than just n times the tightest upper bound on one operation.
Even if the hash function is cryptographically secure, there is a negligible probability that you could get input whose hashes all collide. This is where there's a difference between "averaging over all possible inputs" and "averaging over a sequence of operations with worst-case input". So the word "amortised" also comes with small print. In my experience it normally means the average over a series of operations, and the issue of whether the data is a good or a bad case is not part of the amortisation. nneonneo says that "there's no such thing as amortized worst-case performance", but in my experience there certainly is such a thing as worst-case amortised performance. So it's worth being precise, since this might reflect a difference in what we each expect the term to mean.
When hashtables come up with O(1) amortized insertion, they mean that n insertions takes O(n) time, either (a) assuming that nothing pathologically bad happens with the hash function or (b) expected time for n insertions assuming random input. Because you get the same answer for hashtables either way, it's tempting to be lazy about saying which one you're talking about.
Why do I keep seeing different runtime complexities for these functions on a hash table?
On the Wikipedia page, search and delete are listed as O(n) (I thought the point of hash tables was to have constant lookup, so what's the point if search is O(n)?).
In some course notes from a while ago, I see a wide range of complexities depending on certain details including one with all O(1). Why would any other implementation be used if I can get all O(1)?
If I'm using standard hash tables in a language like C++ or Java, what can I expect the time complexity to be?
Hash tables have O(1) average-case and amortized complexity, but they suffer from O(n) worst-case time complexity. [And I think this is where your confusion is.]
Hash tables suffer from O(n) worst-case time complexity for two reasons:
If too many elements hash to the same key: looking inside that key's bucket may take O(n) time.
Once a hash table has exceeded its load factor, it has to rehash [create a new, bigger table, and re-insert each element into the table].
However, it is said to be O(1) average and amortized case because:
It is very rare that many items will hash to the same key [if you choose a good hash function and the load factor doesn't get too high].
The rehash operation, which is O(n), can happen at most once every n/2 ops, which are all assumed O(1). Thus when you sum and average the time per op, you get: (n*O(1) + O(n)) / n = O(1).
Note that because of the rehashing issue, real-time applications and applications that need low latency should not use a hash table as their data structure.
EDIT: Another issue with hash tables: cache
Another issue where you might see a performance loss in large hash tables is cache performance. Hash tables suffer from bad cache performance, and thus for large collections the access time might take longer, since you need to reload the relevant part of the table from memory back into the cache.
Ideally, a hashtable is O(1). The problem arises when two keys are not equal but result in the same hash.
For example, imagine the strings "it was the best of times it was the worst of times" and "Green Eggs and Ham" both resulted in a hash value of 123.
When the first string is inserted, it's put in bucket 123. When the second string is inserted, it would see that a value already exists for bucket 123. It would then compare the new value to the existing value, and see they are not equal. In this case, an array or linked list is created for that key. At this point, retrieving this value becomes O(n) as the hashtable needs to iterate through each value in that bucket to find the desired one.
For this reason, when using a hash table, it's important to use a key with a really good hash function that's both fast and doesn't often result in duplicate values for different objects.
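As a contrived example of how a key type can break this, a legal but terrible hashCode sends every entry to the same bucket, degrading lookups from O(1) to a linear scan:

    // Contrived key class: hashCode is legal (equal objects get equal hashes)
    // but constant, so every entry lands in the same bucket.
    class BadKey {
        private final String value;

        BadKey(String value) { this.value = value; }

        @Override public int hashCode() { return 123; }            // every key collides
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).value.equals(value);
        }
    }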
Make sense?
Some hash tables (cuckoo hashing) have guaranteed O(1) lookup
Perhaps you were looking at the space complexity? That is O(n). The other complexities are as listed in the hash table entry. The search complexity approaches O(1) as the number of buckets increases. If in the worst case you have only one bucket in the hash table, then the search complexity is O(n).
Edit in response to comment: I don't think it is correct to say O(1) is the average case. It really is (as the Wikipedia page says) O(1 + n/k), where k is the hash table size. If k is large enough, then the result is effectively O(1). But suppose k is 10 and n is 100. In that case each bucket will have on average 10 entries, so the search time is definitely not O(1); it is a linear search through up to 10 entries.
It depends on how you implement hashing: in the worst case it can go to O(n), in the best case it is O(1) (which you can generally achieve if your data structure isn't that big).
People say it takes amortized O(1) to put into a hash table. Therefore, putting n elements must be O(n). That's not true for large n, however, since as an answerer said, "All you need to satisfy expected amortized O(1) is to expand the table and rehash everything with a new random hash function any time there is a collision."
So: what is the average running-time of inserting n elements into a hash table? I realize this is probably implementation-dependent, so mention what type of implementation you're talking about.
For example, if there are (log n) equally spaced collisions, and each collision takes O(k) to resolve, where k is the current size of the hashtable, then you'd have this recurrence relation:
T(n) = T(n/2) + n/2 + n/2
(that is, you take the time to insert n/2 elements, then you have a collision, taking n/2 to resolve, then you do the remaining n/2 inserts without a collision). This still ends up being O(n), so yay. But is this reasonable?
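To see why it ends up O(n): each level halves n, so the work forms a geometric series.

    T(n) = T(n/2) + n
         = n + n/2 + n/4 + ... + 1
         ≤ 2n
         = O(n)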
It completely depends on how inefficient your rehashing is. Specifically, if you can properly estimate the expected size of your hashtable the second time, your runtime still approaches O(n). Effectively, you have to specify how inefficient your rehash size calculation is before you can determine the expected order.
People say it takes amortized O(1) to put into a hash table.
From a theoretical standpoint, it is expected amortized O(1).
Hash tables are fundamentally a randomized data structure, in the same sense that quicksort is a randomized algorithm. You need to generate your hash functions with some randomness, or else there exist pathological inputs which are not O(1).
You can achieve expected amortized O(1) using dynamic perfect hashing:
The naive idea I originally posted was to rehash with a new random hash function on every collision. (See also perfect hash functions.) The problem with this is that it requires O(n^2) space, due to the birthday paradox.
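To spell out the birthday-paradox arithmetic behind that O(n^2) figure (the standard back-of-the-envelope bound, not tied to any particular implementation): hashing n keys uniformly into a table of m slots gives

    E[number of colliding pairs] = (n choose 2) / m = n(n-1) / (2m)

    keeping this below 1/2 (so that a freshly chosen random hash function is
    collision-free with probability > 1/2, and you only expect a constant
    number of retries) requires m > n(n-1), i.e. m = Ω(n^2) slots.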
The solution is to have two hash tables, with the second table for collisions; resolve collisions on that second table by rebuilding it. That table will have O(sqrt(n)) elements, so it would grow to O(n) size.
In practice you often just use a fixed hash function because you can assume (or don't care if) your input is pathological, much like you often quicksort without prerandomizing the input.
All O(1) is saying is that the operation is performed in constant time, and it's not dependent on the number of elements in your data structure.
In simple words, this means that you'll have to pay the same cost no matter how big your data structure is.
In practical terms this means that simple data structures such as trees are generally more effective when you don't have to store a lot of data. In my experience I find trees faster up to ~1k elements (32-bit integers), then hash tables take over. But as usual, YMMV.
Why not just run a few tests on your system? Maybe if you post the source, we can test it on our systems and really shape this into a very useful discussion.
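A very rough starting point might look like the sketch below (plain System.nanoTime timing, so treat the numbers as indicative only; they will vary between systems, JVMs, and runs):

    import java.util.HashMap;
    import java.util.Map;
    import java.util.TreeMap;

    // Compares lookup time in a tree-based map and a hash map at one size.
    class MapLookupBenchmark {
        public static void main(String[] args) {
            int size = 1_000;                          // try 1k, 10k, 100k, ...
            Map<Integer, Integer> tree = new TreeMap<>();
            Map<Integer, Integer> hash = new HashMap<>();
            for (int i = 0; i < size; i++) {
                tree.put(i, i);
                hash.put(i, i);
            }

            long sum = 0;
            long t0 = System.nanoTime();
            for (int i = 0; i < size; i++) sum += tree.get(i);
            long treeTime = System.nanoTime() - t0;

            t0 = System.nanoTime();
            for (int i = 0; i < size; i++) sum += hash.get(i);
            long hashTime = System.nanoTime() - t0;

            System.out.println("TreeMap: " + treeTime + " ns, HashMap: " + hashTime
                    + " ns (checksum " + sum + ")");
        }
    }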
It is not just the implementation but also the environment that decides how much time the algorithm actually takes. You can, however, look to see whether any benchmarking samples are available. The problem with me posting my results is that they would be of no use, since people have no idea what else is running on my system, how much RAM is free right now, and so on. You can only ever get a broad idea. And that is about as good as what big-O gives you.