How does hashmap instantiation work under the hood? - data-structures

I was taught that hashmaps are O(1) (ignoring collisions). This was explained to me as follows:
A range in memory is reserved for the hashmap. A key is hashed to a seemingly random address in this range. We store the key-value pair at that address. The possibility that multiple keys can hash to the same address is solved either by rehashing the hash every time such a collision occurs or by having each utilized address store a pointer to a linked list of everything hashed to that address.
A key can later be hashed to check whether a matching key-value pair is found (reutilizing the same collision resolution used during storage if necessary). If the key is found, the value is returned.
But if a range of memory is assigned to a hashmap, there is a chance that the bits already there would mimic a key-value pair being present. So I think the hashmap's memory range must be sanitized on instantiation (or even earlier...?). Since that range cannot be significantly smaller than the number of items to be stored, wouldn't that sanitization be O(n)? Does modern hardware solve this with an instruction that fills a range of memory with a repeated value? If so, did the advent of such an instruction make hashmaps viable? Otherwise, I do not understand how this works. Sure, this O(n) would be a one-time event on instantiation, but other storage methods never need to do anything slower than O(log(n)). Please help me, what am I missing?

A typical HashMap implementation will allocate a small table when it's constructed. Then when the number of items exceeds some constant factor of the table size, it will allocate a table twice as big and move all the keys over to the new one.
This leads to an amortized O(1) insertion time, but an actual worst case insertion of O(n) when the table needs to be allocated. Dealing with collisions degrades that "amortized" O(1) time to "expected" O(1) time.
It's possible, however, to move the keys over to the reallocated table incrementally, spreading out that cost so that it really is O(1) per operation. This is rarely done, though, because it doesn't guarantee O(1) time -- there's that collision problem again, and there is no guarantee that you can even allocate memory in constant time.
That answers your question about hash maps, but your original line of thinking about the cost of initialization doesn't hold up in theory either. See this answer: Initializing an Array in the Context of Studying Data Structures
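To make the growth scheme concrete, here is a minimal Python sketch (the class name ChainedHashMap and all parameters are invented for illustration, not any particular library's implementation): a chaining table that allocates a small bucket array up front and doubles it, rehashing every key, once the element count exceeds a constant fraction of the bucket count. The occasional _grow call is the O(n) step; averaged over all inserts it amortizes to O(1).

class ChainedHashMap:
    def __init__(self, initial_buckets=8, max_load=0.75):
        self._buckets = [[] for _ in range(initial_buckets)]  # small table at construction
        self._count = 0
        self._max_load = max_load

    def _index(self, key, n_buckets):
        return hash(key) % n_buckets

    def put(self, key, value):
        if self._count + 1 > self._max_load * len(self._buckets):
            self._grow()                                  # occasional O(n) rebuild
        bucket = self._buckets[self._index(key, len(self._buckets))]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)                  # overwrite existing key
                return
        bucket.append((key, value))
        self._count += 1

    def get(self, key, default=None):
        bucket = self._buckets[self._index(key, len(self._buckets))]
        for k, v in bucket:                               # expected O(1) chain scan
            if k == key:
                return v
        return default

    def _grow(self):
        old, self._buckets = self._buckets, [[] for _ in range(2 * len(self._buckets))]
        for bucket in old:                                # move every key to the new table
            for k, v in bucket:
                self._buckets[self._index(k, len(self._buckets))].append((k, v))

m = ChainedHashMap()
m.put("answer", 42)
print(m.get("answer"))    # -> 42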

Related

Hash Table Insertion Time Complexity Confusion

I understand that insertion for hash tables is O(1) and sometimes O(n) depending on the load factor. This makes sense to me, however, I'm still confused. When talking about insertion, are we including the hash function in that measurement or is it just placing some value at that index? For ints, I could see how it could be O(1), but what about strings or any other objects?
Edit: This seemed to answer my question, sorry about the confusion.
Time complexity of creating hash value of a string in hashtable
Yes, the hash function needs to be included in the cost of lookup in a hash table, just like the comparison function needs to be included in the cost of lookup in a sorted table. If the keys have unbounded size, then the key length must be accounted for somehow.
You could stop computing the hash at a certain fixed key length. (Lua does that, for example). But there is a pathological case where every new key is a suffix of the previously inserted key, which would eventually reduce a bounded-length hash function to a linear search.
Regardless of the hash function, the hash table lookup must eventually compare the found key --if there is one-- with the target key, to ensure that they are the same. So that must take time proportional to the size of the key.
In short, constant average-time hash table lookup requires that keys have a bounded size, which is not necessarily the case. But since alternative lookup algorithms would also be affected by key size, this fact doesn't generally help in comparing different lookup algorithms.
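As a rough illustration of the "stop at a fixed key length" idea (this is only a sketch of the sampling approach, not Lua's actual string hash), the function below inspects at most a fixed number of characters, so its cost is bounded regardless of key length; the price is that keys agreeing on every sampled character collide, which is exactly the pathological case described above.

def bounded_hash(key: str, max_chars: int = 32) -> int:
    # inspect at most ~max_chars evenly spaced characters of the key
    h = len(key)                              # mix the length in up front
    step = max(1, len(key) // max_chars)
    for i in range(0, len(key), step):
        h = (h * 31 + ord(key[i])) & 0xFFFFFFFF
    return h

print(bounded_hash("short key"))
print(bounded_hash("x" * 1_000_000))          # still cheap, despite the huge key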

Why hashmap lookup is O(1) i.e. constant time?

If we look from the Java perspective then we can say that hashmap lookup takes constant time. But what about the internal implementation? It still would have to search through the particular bucket (for which the key's hashcode matched) for different matching keys. Then why do we say that hashmap lookup takes constant time? Please explain.
Under the appropriate assumptions on the hash function being used, we can say that hash table lookups take expected O(1) time (assuming you're using a standard hashing scheme like linear probing or chained hashing). This means that on average, the amount of work that a hash table does to perform a lookup is at most some constant.
Intuitively, if you have a "good" hash function, you would expect elements to be distributed more or less evenly throughout the hash table, meaning that the number of elements in each bucket would be close to the number of elements divided by the number of buckets. If the hash table implementation keeps this ratio low (say, by adding more buckets every time the ratio of elements to buckets exceeds some constant), then the expected amount of work is some baseline cost to choose which bucket should be scanned, plus "not too much" work looking at the elements there, because in expectation there will only be a constant number of elements in that bucket.
This doesn't mean that hash tables have guaranteed O(1) behavior. In fact, in the worst case, the hashing scheme will degenerate and all elements will end up in one bucket, making lookups take time Θ(n) in the worst case. This is why it's important to design good hash functions.
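To see the "roughly even distribution" intuition empirically, here is a small Python check (key length, key count, and bucket count are arbitrary choices for the experiment): hash random strings into a fixed number of buckets and report how crowded the buckets get.

import random
import string
from collections import Counter

n_keys, n_buckets = 100_000, 131_072
keys = {''.join(random.choices(string.ascii_letters, k=12)) for _ in range(n_keys)}
sizes = Counter(hash(k) % n_buckets for k in keys)      # bucket index for every key
print("largest bucket:", max(sizes.values()))
print("average non-empty bucket:", round(len(keys) / len(sizes), 2))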
For more information, you might want to read an algorithms textbook to see the formal derivation of why hash tables support lookups so efficiently. This is usually included as part of a typical university course on algorithms and data structures, and there are many good resources online.
Fun fact: there are certain types of hash tables (cuckoo hash tables, dynamic perfect hash tables) where the worst case lookup time for an element is O(1). These hash tables work by guaranteeing that each element can only be in one of a few fixed positions, with insertions sometimes scrambling around elements to try to make everything fit.
Hope this helps!
The key is in this statement in the docs:
If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large capacity will allow the mappings to be stored more efficiently than letting it perform automatic rehashing as needed to grow the table.
and
The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
http://docs.oracle.com/javase/6/docs/api/java/util/HashMap.html
The internal bucket structure will actually be rebuilt if the load factor is exceeded, allowing for the amortized cost of get and put to be O(1).
Note that if the internal structure is rebuilt, that introduces a performance penalty that is likely to be O(N), so quite a few get and put may be required before the amortized cost approaches O(1) again. For that reason, plan the initial capacity and load factor appropriately, so that you neither waste space, nor trigger avoidable rebuilding of the internal structure.
Hashtables AREN'T O(1).
Via the pigeonhole principle, you cannot be better than O(log(n)) for lookup, because you need log(n) bits per item to uniquely identify n items.
Hashtables seem to be O(1) because they have a small constant factor, combined with the threshold at which their O(log(n)) behavior would matter being pushed so high that, for many practical applications, the cost is independent of the number of items you are actually using. However, big O notation doesn't care about that fact, and it is a (granted, absurdly common) misuse of the notation to call hashtables O(1).
Because while you could store a million, or a billion, items in a hashtable and still get the same lookup time as in a single-item hashtable... you lose that ability if you're talking about a nonillion or a googolplex of items. The fact that you will never actually use a nonillion or a googolplex of items doesn't matter for big O notation.
Practically speaking, hashtable performance can be a constant factor worse than array lookup performance. Which, yes, is also O(log(n)), because you CAN'T do better.
Basically, real-world computers make every lookup in an array small enough to be indexed within their word size cost the same as a lookup in the biggest array they could theoretically use, and since hashtables are clever tricks performed on arrays, that's why you seem to get O(1).
To follow up on templatetypedef's comments as well:
A constant-time hash table could be implemented with a direct-address scheme: a boolean array whose entries indicate whether a particular element is present in a bucket. However, if your hashmap uses linked lists for its buckets, the worst case requires traversing to the end of a list that may hold every element.

what is the implementation and complexity of operations of C# collections?

I want to cache 10,000+ key/value pairs (both strings) and started thinking about which .NET (2.0, bound to MS Studio 2005 :( ) structure would be best. All items will be added in one shot, then there will be a few hundred queries for particular keys.
I've read the MSDN descriptions referenced in the other question but I still miss some details about the implementation / complexity of operations on the various collections.
E.g. in the above-mentioned question, there is a quote from MSDN saying that SortedList is based on a tree and SortedDictionary "has similar object model" but different complexity.
The other question: are HashTable and Dictionary implemented in the same way?
For HashTable, they write:
If Count is less than the capacity of the Hashtable, this method is an O(1) operation. If the capacity needs to be increased to accommodate the new element, this method becomes an O(n) operation, where n is Count.
But when is the capacity increased? With every "Add"? Then adding a series of key/value pairs would have quadratic complexity. The same as with SortedList.
Not to mention OrderedDictionary, where nothing is said about implementation / complexity.
Maybe someone knows some good article about implementation of .NET collections?
The capacity of the HashTable is different from the Count.
Normally the capacity -- the maximum number of items that can be stored, normally related to the number of underlying hash buckets -- doubles when a "grow" is required, although this is implementation-dependent. The Count simply refers to the number of items actually stored, which must be less than or equal to the capacity but is otherwise not related.
Because of the exponentially increasing interval (between the O(n), n = Count, resizing), most hash implementations claim O(1) amortized access. The quote is just saying: "Hey! It's amortized and isn't always true!".
Happy coding.
If you are adding that many pairs, you can/should use this Dictionary constructor to specify the capacity in advance. Then every add and lookup will be O(1).
If you really want to see how these classes are implemented, you can look at the Rotor source or use .NET Reflector to look at System.Collections (not sure of the legality of the latter).
The HashTable and Dictionary are implemented in the same way. Dictionary is the generic replacement for the HashTable.
When the capacity of collections like List and Dictionary has to increase, it grows at a certain rate. For List the rate is 2.0, i.e. the capacity is doubled. I don't know the exact rate for Dictionary, but it works the same way.
For a List, the way the capacity is increased means that each item gets copied on average about 1.3 extra times. As that value stays constant as the list grows, the Add method is still an O(1) operation on average.
Dictionary is a kind of hashtable; I never use the original Hashtable since it only holds "objects". Don't worry about the fact that insertion is O(N) when the capacity is increased; Dictionary always doubles the capacity when the hashtable is full, so the average (amortized) complexity is O(1).
You should almost never use SortedList (which is basically an array), since complexity is O(N) for each insert or delete (assuming the data is not already sorted; if it is, inserts are O(1), but then you don't need SortedList anyway because an ordinary List would suffice). Instead of SortedList, use SortedDictionary, which offers O(log N) insert, delete, and search. However, SortedDictionary is slower than Dictionary, so use it only if your data needs to be sorted.
You say you want to cache 10,000 key-value pairs. If you want to do all the inserts before you do any queries, an efficient method is to create an unsorted List, then Sort it, and use BinarySearch for queries. This approach saves a lot of memory compared to using SortedDictionary, and it creates less work for the garbage collector.
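The "build once, sort once, binary-search the queries" idea translates directly to other languages; here is a hedged Python sketch using the bisect module (the key/value strings are made-up sample data), standing in for .NET's List.Sort plus List.BinarySearch.

import bisect

pairs = [("key%05d" % i, "value%05d" % i) for i in range(10_000)]   # made-up sample data
pairs.sort(key=lambda kv: kv[0])           # one O(n log n) sort up front
keys = [k for k, _ in pairs]               # parallel key list for bisect

def lookup(key):
    i = bisect.bisect_left(keys, key)      # O(log n) per query
    if i < len(keys) and keys[i] == key:
        return pairs[i][1]
    return None

print(lookup("key00042"))                  # -> value00042
print(lookup("missing"))                   # -> None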

Can hash tables really be O(1)?

It seems to be common knowledge that hash tables can achieve O(1), but that has never made sense to me. Can someone please explain it? Here are two situations that come to mind:
A. The value is an int smaller than the size of the hash table. Therefore, the value is its own hash, so there is no hash table. But if there was, it would be O(1) and still be inefficient.
B. You have to calculate a hash of the value. In this situation, the order is O(n) for the size of the data being looked up. The lookup might be O(1) after you do O(n) work, but that still comes out to O(n) in my eyes.
And unless you have a perfect hash or a large hash table, there are probably several items per bucket. So, it devolves into a small linear search at some point anyway.
I think hash tables are awesome, but I do not get the O(1) designation unless it is just supposed to be theoretical.
Wikipedia's article for hash tables consistently references constant lookup time and totally ignores the cost of the hash function. Is that really a fair measure?
Edit: To summarize what I learned:
It is technically true because the hash function is not required to use all the information in the key and so could be constant time, and because a large enough table can bring collisions down to near constant time.
It is true in practice because over time it just works out as long as the hash function and table size are chosen to minimize collisions, even though that often means not using a constant time hash function.
You have two variables here, m and n, where m is the length of the input and n is the number of items in the hash.
The O(1) lookup performance claim makes at least two assumptions:
Your objects can be equality compared in O(1) time.
There will be few hash collisions.
If your objects are variable size and an equality check requires looking at all bits then performance will become O(m). The hash function however does not have to be O(m) - it can be O(1). Unlike a cryptographic hash, a hash function for use in a dictionary does not have to look at every bit in the input in order to calculate the hash. Implementations are free to look at only a fixed number of bits.
For sufficiently many items the number of items will become greater than the number of possible hashes, and then you will get collisions, causing the performance to rise above O(1), for example O(n) for a simple linked list traversal (or O(n*m) if both assumptions are false).
In practice, though, the O(1) claim, while technically false, is approximately true for many real-world situations, and in particular those situations where the above assumptions hold.
You have to calculate the hash, so the order is O(n) for the size of the data being looked up. The lookup might be O(1) after you do O(n) work, but that still comes out to O(n) in my eyes.
What? To hash a single element takes constant time. Why would it be anything else? If you're inserting n elements, then yes, you have to compute n hashes, and that takes linear time... to look an element up, you compute a single hash of what you're looking for, then find the appropriate bucket with that. You don't re-compute the hashes of everything that's already in the hash table.
And unless you have a perfect hash or a large hash table there are probably several items per bucket so it devolves into a small linear search at some point anyway.
Not necessarily. The buckets don't necessarily have to be lists or arrays, they can be any container type, such as a balanced BST. That means O(log n) worst case. But this is why it's important to choose a good hashing function to avoid putting too many elements into one bucket. As KennyTM pointed out, on average, you will still get O(1) time, even if occasionally you have to dig through a bucket.
The trade off of hash tables is of course the space complexity. You're trading space for time, which seems to be the usual case in computing science.
You mention using strings as keys in one of your other comments. You're concerned about the amount of time it takes to compute the hash of a string, because it consists of several chars? As someone else pointed out again, you don't necessarily need to look at all the chars to compute the hash, although it might produce a better hash if you did. In that case, if there are on average m chars in your key, and you used all of them to compute your hash, then I suppose you're right, that lookups would take O(m). If m >> n then you might have a problem. You'd probably be better off with a BST in that case. Or choose a cheaper hashing function.
The hash is fixed size - looking up the appropriate hash bucket is a fixed cost operation. This means that it is O(1).
Calculating the hash does not have to be a particularly expensive operation - we're not talking cryptographic hash functions here. But that's by the by. The hash function calculation itself does not depend on the number n of elements; while it might depend on the size of the data in an element, this is not what n refers to. So the calculation of the hash does not depend on n and is also O(1).
Hashing is O(1) only if there is only a constant number of keys in the table and some other assumptions are made. But in such cases it has an advantage.
If your key has an n-bit representation, your hash function can use 1, 2, ... or all n of these bits. Think about a hash function that uses 1 bit. Evaluation is O(1) for sure. But you are only partitioning the key space into 2, so you are mapping as many as 2^(n-1) keys into the same bin. Using BST search this takes up to n-1 steps to locate a particular key if the table is nearly full.
You can extend this to see that if your hash function uses K bits your bin size is 2^(n-k).
So a K-bit hash function ==> no more than 2^K effective bins ==> up to 2^(n-K) n-bit keys per bin ==> (n-K) steps (BST) to resolve collisions. Actually most hash functions are much less "effective" and need/use more than K bits to produce 2^K bins. So even this is optimistic.
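Plugging in concrete numbers (chosen purely for illustration), for 64-bit keys and a 20-bit hash the arithmetic above works out as follows:

n_bits, k_bits = 64, 20
bins = 2 ** k_bits                       # at most 2^K effective bins
keys_per_bin = 2 ** (n_bits - k_bits)    # up to 2^(n-K) keys can share a bin
bst_steps = n_bits - k_bits              # ~(n-K) comparisons to resolve them with a BST
print(bins, keys_per_bin, bst_steps)     # 1048576 17592186044416 44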
You can view it this way -- you will need ~n steps to be able to uniquely distinguish a pair of keys of n bits in the worst case. There is really no way to get around this information theory limit, hash table or not.
However, this is NOT how/when you use hash table!
The complexity analysis assumes that for n-bit keys, you could have O(2^n) keys in the table (e.g. 1/4 of all possible keys). But most if not all of the time we use a hash table, we have only a constant number of n-bit keys in the table. If you only want a constant number of keys in the table, say C is your maximum number, then you could form a hash table of O(C) bins, which guarantees expected constant collisions (with a good hash function), and a hash function that uses ~log C of the n bits in the key. Then every query is O(log C) = O(1). This is how people claim "hash table access is O(1)".
There are a couple of catches here. First, saying you don't need all the bits may only be a billing trick. You cannot really pass the key value to the hash function, because that would mean moving n bits in memory, which is O(n). So you pass, e.g., a reference. But you still had to store the key somewhere, which was already an O(n) operation; you just don't bill it to the hashing, even though your overall computation cannot avoid it. Second, you do the hashing, find the bin, and find more than one key; your cost depends on your resolution method: if you do it comparison-based (BST or list), you will have an O(n) operation (recall the key is n-bit); if you do a second hash, you have the same issue if the second hash has a collision. So O(1) is not 100% guaranteed unless you have no collisions (you can improve the chances by having a table with more bins than keys, but still).
Consider the alternative, e.g. a BST, in this case. There are C keys, so a balanced BST will be O(log C) in depth, and a search takes O(log C) steps. However, each comparison in this case is an O(n) operation... so it appears hashing is a better choice in this case.
TL;DR: Hash tables guarantee O(1) expected worst case time if you pick your hash function uniformly at random from a universal family of hash functions. Expected worst case is not the same as average case.
Disclaimer: I don't formally prove hash tables are O(1), for that have a look at this video from coursera [1]. I also don't discuss the amortized aspects of hash tables. That is orthogonal to the discussion about hashing and collisions.
I see a surprisingly great deal of confusion around this topic in other answers and comments, and will try to rectify some of them in this long answer.
Reasoning about worst case
There are different types of worst case analysis. The analysis that most answers have made here so far is not worst case, but rather average case [2]. Average case analysis tends to be more practical. Maybe your algorithm has one bad worst case input, but actually works well for all other possible inputs. The bottom line is that your runtime depends on the dataset you're running on.
Consider the following pseudocode of the get method of a hash table. Here I'm assuming we handle collision by chaining, so each entry of the table is a linked list of (key,value) pairs. We also assume the number of buckets m is fixed but is O(n), where n is the number of elements in the input.
function get(a: Table with m buckets, k: Key being looked up)
    bucket <- compute hash(k) modulo m
    for each (key,value) in a[bucket]
        return value if k == key
    return not_found
As other answers have pointed out, this runs in average O(1) and worst case O(n). We can make a little sketch of a proof by challenge here. The challenge goes as follows:
(1) You give your hash table algorithm to an adversary.
(2) The adversary can study it and prepare as long as he wants.
(3) Finally the adversary gives you an input of size n for you to insert in your table.
The question is: how fast is your hash table on the adversary input?
From step (1) the adversary knows your hash function; during step (2) the adversary can craft a list of n elements with the same hash modulo m, e.g. by computing the hashes of lots of random elements and keeping those that collide; and then in (3) they can give you that list. But lo and behold, since all n elements hash to the same bucket, your algorithm will take O(n) time to traverse the linked list in that bucket. No matter how many times we retry the challenge, the adversary always wins, and that's how bad your algorithm is: worst case O(n).
How come hashing is O(1)?
What threw us off in the previous challenge was that the adversary knew our hash function very well, and could use that knowledge to craft the worst possible input.
What if instead of always using one fixed hash function, we actually had a set of hash functions, H, that the algorithm can randomly choose from at runtime? In case you're curious, H is called a universal family of hash functions [3]. Alright, let's try adding some randomness to this.
First suppose our hash table also includes a seed r, and r is assigned to a random number at construction time. We assign it once and then it's fixed for that hash table instance. Now let's revisit our pseudocode.
function get(a: Table with m buckets and seed r, k: Key being looked up)
    rHash <- H[r]
    bucket <- compute rHash(k) modulo m
    for each (key,value) in a[bucket]
        return value if k == key
    return not_found
If we try the challenge one more time: from step (1) the adversary can know all the hash functions we have in H, but now the specific hash function we use depends on r. The value of r is private to our structure, the adversary cannot inspect it at runtime, nor predict it ahead of time, so he can't concoct a list that's always bad for us. Let's assume that in step (2) the adversary chooses one function hash in H at random, he then crafts a list of n collisions under hash modulo m, and sends that for step (3), crossing fingers that at runtime H[r] will be the same hash they chose.
This is a serious bet for the adversary: the list he crafted collides under hash, but will just be a random input under any other hash function in H. If he wins this bet, our run time will be worst case O(n) like before, but if he loses then we're just being given a random input, which takes the average O(1) time. And indeed most times the adversary will lose; he wins only once in every |H| challenges, and we can make |H| very large.
Contrast this result to the previous algorithm where the adversary always won the challenge. Handwaving here a bit, but since most times the adversary will fail, and this is true for all possible strategies the adversary can try, it follows that although the worst case is O(n), the expected worst case is in fact O(1).
Again, this is not a formal proof. The guarantee we get from this expected worst case analysis is that our run time is now independent of any specific input. This is a truly random guarantee, as opposed to the average case analysis where we showed a motivated adversary could easily craft bad inputs.
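As a hedged sketch of the seeded construction described above, the snippet below uses the classic Carter-Wegman family h(x) = ((a*x + b) mod p) mod m for integer keys smaller than the prime p; the random a and b play the role of the seed r that the adversary cannot see. The class name RandomizedIntMap and all parameters are invented for illustration.

import random

P = (1 << 61) - 1                          # a Mersenne prime larger than any key we expect

class RandomizedIntMap:
    def __init__(self, m=1024):
        self.m = m
        self.a = random.randrange(1, P)    # the private random "seed"
        self.b = random.randrange(0, P)
        self.buckets = [[] for _ in range(m)]

    def _bucket(self, key):
        return self.buckets[((self.a * key + self.b) % P) % self.m]

    def put(self, key, value):
        b = self._bucket(key)
        for i, (k, _) in enumerate(b):
            if k == key:
                b[i] = (key, value)
                return
        b.append((key, value))

    def get(self, key, default=None):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default

t = RandomizedIntMap()
t.put(12345, "x")
print(t.get(12345))    # -> x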
TL;DR: usually hash() is O(m), where m is the length of the key.
My three cents.
24 years ago, when Sun released JDK 1.2, they fixed a bug in String.hashCode(): instead of computing the hash based only on some portion of the string, since JDK 1.2 it reads every single character of the string. This change was intentional and IMHO very wise.
In most languages the builtin hash works similarly. It processes the whole object to compute the hash, because keys are usually small while collisions can cause serious issues.
There are a lot of theoretical arguments confirming and denying the O(1) hash lookup cost. A lot of them are reasonable and educative.
Let us skip the theory and do an experiment instead:
import timeit

samples = [tuple("LetsHaveSomeFun!")]  # the effect is easier to see with tuples
# samples = ["LetsHaveSomeFun!"]       # hashing a string is much faster; increase the sample size to see the effect
for _ in range(25 if isinstance(samples[0], str) else 20):
    samples.append(samples[-1] * 2)    # double the key length each round

empty = {}
for i, s in enumerate(samples):
    t = timeit.timeit(lambda: s in empty, number=2000)
    print(f"{i}. For element of length {len(s)} it took {t:0.3f} time to lookup in empty hashmap")
When I run it I get:
0. For element of length 16 it took 0.000 time to lookup in empty hashmap
1. For element of length 32 it took 0.000 time to lookup in empty hashmap
2. For element of length 64 it took 0.001 time to lookup in empty hashmap
3. For element of length 128 it took 0.001 time to lookup in empty hashmap
4. For element of length 256 it took 0.002 time to lookup in empty hashmap
5. For element of length 512 it took 0.003 time to lookup in empty hashmap
6. For element of length 1024 it took 0.006 time to lookup in empty hashmap
7. For element of length 2048 it took 0.012 time to lookup in empty hashmap
8. For element of length 4096 it took 0.025 time to lookup in empty hashmap
9. For element of length 8192 it took 0.048 time to lookup in empty hashmap
10. For element of length 16384 it took 0.094 time to lookup in empty hashmap
11. For element of length 32768 it took 0.184 time to lookup in empty hashmap
12. For element of length 65536 it took 0.368 time to lookup in empty hashmap
13. For element of length 131072 it took 0.743 time to lookup in empty hashmap
14. For element of length 262144 it took 1.490 time to lookup in empty hashmap
15. For element of length 524288 it took 2.900 time to lookup in empty hashmap
16. For element of length 1048576 it took 5.872 time to lookup in empty hashmap
17. For element of length 2097152 it took 12.003 time to lookup in empty hashmap
18. For element of length 4194304 it took 25.176 time to lookup in empty hashmap
19. For element of length 8388608 it took 50.399 time to lookup in empty hashmap
20. For element of length 16777216 it took 99.281 time to lookup in empty hashmap
Clearly the hash is O(m) where m is the length of a key.
You can run similar experiments in other mainstream languages, and I expect you'll get similar results.
It seems, based on the discussion here, that if X is the ceiling of (# of elements in the table / # of bins), then a better answer is O(log(X)), assuming an efficient implementation of bin lookup.
There are two settings under which you can get O(1) worst-case times.
If your setup is static, then FKS hashing will get you worst-case O(1) guarantees. But as you indicated, your setting isn't static.
If you use Cuckoo hashing, then queries and deletes are O(1) worst-case, but insertion is only O(1) expected. Cuckoo hashing works quite well if you have an upper bound on the total number of inserts, and set the table size to be roughly 25% larger.
Copied from here
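For a feel of how cuckoo hashing achieves worst-case O(1) lookups, here is a toy Python sketch (not part of the answer copied above; the class name CuckooSet and all parameters are invented, and a real implementation would rehash or grow instead of raising an error when evictions cycle): every key lives in one of exactly two slots, so membership tests probe at most two places, while insertion may bounce existing keys between the two tables.

import random

class CuckooSet:
    def __init__(self, size=1024):
        self.size = size
        self.t = [[None] * size, [None] * size]              # two tables
        self.seeds = [random.getrandbits(64), random.getrandbits(64)]

    def _pos(self, which, key):
        return hash((self.seeds[which], key)) % self.size    # seeded position in table `which`

    def __contains__(self, key):
        return (self.t[0][self._pos(0, key)] == key or
                self.t[1][self._pos(1, key)] == key)         # at most two probes

    def add(self, key, max_kicks=64):
        if key in self:
            return
        which = 0
        for _ in range(max_kicks):
            i = self._pos(which, key)
            self.t[which][i], key = key, self.t[which][i]    # place key, maybe evict a resident
            if key is None:
                return
            which ^= 1                                       # evicted key tries its other table
        raise RuntimeError("eviction cycle: a real table would rehash or grow here")

s = CuckooSet()
for x in range(300):
    s.add(x)
print(150 in s, 999 in s)    # -> True False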
A. The value is an int smaller than the size of the hash table. Therefore, the value is its own hash, so there is no hash table. But if there was, it would be O(1) and still be inefficient.
This is a case where you could trivially map the keys to distinct buckets, so an array seems a better choice of data structure than a hash table. Still, the inefficiencies don't grow with the table size.
(You might still use a hash table because you don't trust the ints to remain smaller than the table size as the program evolves, you want to make the code potentially reusable when that relationship doesn't hold, or you just don't want people reading/maintaining the code to have to waste mental effort understanding and maintaining the relationship).
B. You have to calculate a hash of the value. In this situation, the order is O(n) for the size of the data being looked up. The lookup might be O(1) after you do O(n) work, but that still comes out to O(n) in my eyes.
We need to distinguish between the size of the key (e.g. in bytes), and the size of the number of keys being stored in the hash table. Claims that hash tables provide O(1) operations mean that operations (insert/erase/find) don't tend to slow down further as the number of keys increases from hundreds to thousands to millions to billions (at least not if all the data is accessed/updated in equally fast storage, be that RAM or disk - cache effects may come into play but even the cost of a worst-case cache miss tends to be some constant multiple of best-case hit).
Consider a telephone book: you may have names in there that are quite long, but whether the book has 100 names, or 10 million, the average name length is going to be pretty consistent, and the worst case in history...
Guinness world record for the Longest name used by anyone ever was set by Adolph Blaine Charles David Earl Frederick Gerald Hubert Irvin John Kenneth Lloyd Martin Nero Oliver Paul Quincy Randolph Sherman Thomas Uncas Victor William Xerxes Yancy Wolfeschlegelsteinhausenbergerdorff, Senior
...wc tells me that's 215 characters - that's not a hard upper-bound to the key length, but we don't need to worry about there being massively more.
That holds for most real world hash tables: the average key length doesn't tend to grow with the number of keys in use. There are exceptions, for example a key creation routine might return strings embedding incrementing integers, but even then every time you increase the number of keys by an order of magnitude you only increase the key length by 1 character: it's not significant.
It's also possible to create a hash from a fixed-size amount of key data. For example, Microsoft's Visual C++ ships with a Standard Library implementation of std::hash<std::string> that creates a hash incorporating just ten bytes evenly spaced along the string, so if the strings only vary at other indices you get collisions (and hence in practice non O(1) behaviours on the post-collision searching side), but the time to create the hash has a hard upper bound.
And unless you have a perfect hash or a large hash table, there are probably several items per bucket. So, it devolves into a small linear search at some point anyway.
Generally true, but the awesome thing about hash tables is that the number of keys visited during those "small linear searches" is - for the separate chaining approach to collisions - a function of the hash table load factor (ratio of keys to buckets).
For example, with a load factor of 1.0 there's an average length of ~1.58 for those linear searches, regardless of the number of keys (see my answer here). For closed hashing it's a bit more complicated, but not much worse when the load factor isn't too high.
It is technically true because the hash function is not required to use all the information in the key and so could be constant time, and because a large enough table can bring collisions down to near constant time.
This kind of misses the point. Any kind of associative data structure ultimately has to do operations across every part of the key sometimes (inequality may sometimes be determined from just a part of the key, but equality generally requires every bit be considered). At a minimum, it can hash the key once and store the hash value, and if it uses a strong enough hash function - e.g. 64-bit MD5 - it might practically ignore even the possibility of two keys hashing to the same value (a company I worked for did exactly that for the distributed database: hash-generation time was still insignificant compared to WAN-wide network transmissions). So, there's not too much point obsessing about the cost to process the key: that's inherent in storing keys regardless of the data structure, and as said above - doesn't tend to grow worse on average with there being more keys.
As for large enough hash tables bringing collisions down, that's missing the point too. For separate chaining, you still have a constant average collision chain length at any given load factor - it's just higher when the load factor is higher, and that relationship is non-linear. The SO user Hans comments on my answer also linked above that:
average bucket length conditioned on nonempty buckets is a better measure of efficiency. It is a/(1-e^{-a}) [where a is the load factor, e is 2.71828...]
So, the load factor alone determines the average number of colliding keys you have to search through during insert/erase/find operations. For separate chaining, it doesn't just approach being constant when the load factor is low - it's always constant. For open addressing though your claim has some validity: some colliding elements are redirected to alternative buckets and can then interfere with operations on other keys, so at higher load factors (especially > .8 or .9) collision chain length gets more dramatically worse.
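The a/(1-e^{-a}) figure is easy to check empirically; this small Python simulation (bucket and key counts chosen arbitrarily) throws n keys into n buckets uniformly at random, standing in for a good hash at load factor 1.0, and compares the measured average non-empty bucket length with the formula.

import math
import random
from collections import Counter

n = m = 200_000                                           # load factor a = n / m = 1.0
buckets = Counter(random.randrange(m) for _ in range(n))  # stand-in for hash(key) % m
avg_nonempty = n / len(buckets)                           # average length of non-empty buckets
print(f"simulated: {avg_nonempty:.3f}  formula: {1 / (1 - math.exp(-1)):.3f}")   # both ~1.582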
It is true in practice because over time it just works out as long as the hash function and table size are chosen to minimize collisions, even though that often means not using a constant time hash function.
Well, the table size should result in a sane load factor given the choice of closed hashing or separate chaining, but also, if the hash function is a bit weak and the keys aren't very random, having a prime number of buckets often helps reduce collisions too (hash-value % table-size then wraps around such that changes only to a high-order bit or two in the hash-value still resolve to buckets spread pseudo-randomly across different parts of the hash table).
Leaving other considerations aside, the O(1) claim hinges on a constant time access model of memory, which is a good enough approximation for most practical computer science but not strictly justifiable from a theoretical point of view.
For starters, any memory addressing scheme necessarily requires multiplexing at the circuit level, which in turn requires a circuit depth at least proportional to O(log N). Since clock frequency is inversely proportional to the longest path (in number of traversed gates) of a circuit, this implies no general memory access scheme can run in less than O(log N) for fast enough CPUs or large enough memories.
Then, at a more fundamental level, you can only stack so many bits of memory within a finite distance of the processor, and given the finite speed of light this means your worst-case time for a random access into N bits of memory is at least O(N^(1/3)), and more likely O(N^(1/2)) if we take into account that integrated circuits are two-dimensional.
But of course in practice computers operate far from reaching these limits... or do they? This is when cache hierarchies enter the game, and why no good implementation of an algorithm or data structure can afford to ignore the actual details of the use case or the hardware implementation.
Either way the absolute worst case for a random memory access timing is given by the ping latency between your computer and some server at the opposite side of the planet, which can be in the 100s of ms and is, for the record, a lot worse than the best case scenario of having the data cached in L1 or -even better- already loaded in the registers.
As for the cost of hashing, you are correct that it cannot be truly constant, or even bounded by a set number of operations, when applied to a potentially unbounded set of arbitrary-size keys such as strings. Such keys can only be dealt with efficiently in the randomized case, and they often share arbitrarily long common prefixes that require reading and processing a number of bits larger than the size of the prefix.
For such cases it may be advisable to use a specialized data structure such as a z-fast trie or similar, which can simultaneously disambiguate prefixes and perform random memory access in amortized O(lg lg lg N).

Are there O(1) random access data structures that don't rely on contiguous storage?

The classic O(1) random access data structure is the array. But an array relies on the programming language supporting guaranteed contiguous memory allocation (since the array relies on being able to take a simple offset from the base to find any element).
This means that the language must have semantics regarding whether or not memory is contiguous, rather than leaving this as an implementation detail. Thus it could be desirable to have a data structure that has O(1) random access yet doesn't rely on contiguous storage.
Is there such a thing?
How about a trie where the length of keys is limited to some constant K (for example, 4 bytes so you can use 32-bit integers as indices)? Then lookup time will be O(K), i.e. O(1), with non-contiguous memory. Seems reasonable to me.
Recalling our complexity classes, don't forget that every big-O has a constant factor, i.e. O(n) + C. This approach will certainly have a much larger C than a real array.
EDIT: Actually, now that I think about it, it's O(K*A) where A is the size of the "alphabet". Each node has to have a list of up to A child nodes, which will have to be a linked list to keep the implementation non-contiguous. But A is still constant, so it's still O(1).
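A hedged Python sketch of this fixed-depth trie (names like TrieNode, trie_put and trie_get are invented for illustration): 32-bit keys are consumed 8 bits at a time over K = 4 levels, and each node keeps its children in a small association list as a stand-in for the linked list mentioned above, so a lookup touches at most K*A entries and no large contiguous block.

class TrieNode:
    def __init__(self):
        self.children = []        # list of (byte_value, child_node) pairs
        self.value = None

def trie_put(root, key: int, value):
    node = root
    for shift in (24, 16, 8, 0):               # K = 4 fixed levels, one byte each
        b = (key >> shift) & 0xFF
        for sym, child in node.children:
            if sym == b:
                node = child
                break
        else:                                  # no child for this byte yet
            child = TrieNode()
            node.children.append((b, child))
            node = child
    node.value = value

def trie_get(root, key: int):
    node = root
    for shift in (24, 16, 8, 0):
        b = (key >> shift) & 0xFF
        for sym, child in node.children:
            if sym == b:
                node = child
                break
        else:
            return None
    return node.value

root = TrieNode()
trie_put(root, 123456, "hello")
print(trie_get(root, 123456))    # -> hello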
In practice, for small datasets using contiguous storage is not a problem, and for large datasets O(log(n)) is just as good as O(1); the constant factor is rather more important.
In fact, for REALLY large datasets, O(root3(n)) random access is the best you can get in a 3-dimensional physical universe.
Edit:
Assuming log10 and the O(log(n)) algorithm being twice as fast as the O(1) one at a million elements, it will take a trillion elements for them to become even, and a septillion for the O(1) algorithm to become twice as fast - rather more than even the biggest databases on earth have.
All current and foreseeable storage technologies require a certain physical space (let's call it v) to store each element of data. In a 3-dimensional universe, this means for n elements there is a minimum distance of root3(n*v*3/4/pi) between at least some of the elements and the place that does the lookup, because that's the radius of a sphere of volume n*v. And then, the speed of light gives a physical lower boundary of root3(n*v*3/4/pi)/c for the access time to those elements - and that's O(root3(n)), no matter what fancy algorithm you use.
Apart from a hashtable, you can have a two-level array-of-arrays (see the sketch after this list):
Store the first 10,000 elements in the first sub-array
Store the next 10,000 elements in the next sub-array
etc.
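A minimal Python sketch of that two-level layout (ChunkedArray and CHUNK are invented names; note that Python lists are themselves contiguous internally, so this only illustrates the indexing scheme): index i lives in chunk i // CHUNK at slot i % CHUNK, giving O(1) access with two hops and no single block that has to hold all n elements.

CHUNK = 10_000

class ChunkedArray:
    def __init__(self):
        self.chunks = []          # top-level directory of sub-arrays
        self.length = 0

    def append(self, item):
        if self.length % CHUNK == 0:
            self.chunks.append([None] * CHUNK)   # start a new sub-array
        self.chunks[self.length // CHUNK][self.length % CHUNK] = item
        self.length += 1

    def __getitem__(self, i):
        return self.chunks[i // CHUNK][i % CHUNK]

a = ChunkedArray()
for x in range(25_000):
    a.append(x * x)
print(a[12345])    # -> 152399025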
Thus it could be desirable to have a data structure that has O(1) random access, yet doesn't rely on contiguous storage.
Is there such a thing?
No, there is not. Sketch of proof:
If you have a limit on your contiguous block size, then obviously you'll have to use indirection to get to your data items. A fixed depth of indirection with a limited block size gets you only a fixed-size graph (although its size grows exponentially with the depth), so as your data set grows, the indirection depth will grow (only logarithmically, but not O(1)).
Hashtable?
Edit:
An array is O(1) lookup because a[i] is just syntactic sugar for *(a+i). In other words, to get O(1), you need either a direct pointer or an easily-calculated pointer to every element (along with a good feeling that the memory you're about to look up belongs to your program). In the absence of a pointer to every element, it's not likely you can have an easily-calculated pointer (and know the memory is reserved for you) without contiguous memory.
Of course, it's plausible (if terrible) to have a Hashtable implementation where each lookup's memory address is simply *(a + hash(i)), not backed by an array, i.e. dynamically created at the specified memory location, if you have that sort of control. The point is that the most efficient implementation is going to be an underlying array, but it's certainly plausible to take hits elsewhere to do a WTF implementation that still gets you constant-time lookup.
Edit2:
My point is that an array relies on contiguous memory because it's syntactic sugar, but a Hashtable chooses an array because it's the best implementation method, not because it's required. Of course I must be reading the DailyWTF too much, since I'm imagining overloading C++'s array-index operator to also do it without contiguous memory in the same fashion...
Aside from the obvious nested structures to finite depth noted by others, I'm not aware of a data structure with the properties you describe. I share others' opinions that with a well-designed logarithmic data structure, you can have non-contiguous memory with fast access times to any data that will fit in main memory.
I am aware of an interesting and closely related data structure:
Cedar ropes are immutable strings that provide logarithmic rather than constant-time access, but they do provide a constant-time concatenation operation and efficient insertion of characters. The paper is copyrighted but there is a Wikipedia explanation.
This data structure is efficient enough that you can represent the entire contents of a large file using it, and the implementation is clever enough to keep bits on disk unless you need them.
Surely what you're talking about here is not contiguous memory storage as such, but more the ability to index a containing data structure. It is common to internally implement a dynamic array or list as an array of pointers with the actual content of each element elsewhere in memory. There are a number of reasons for doing this - not least that it enables each entry to be a different size. As others have pointed out, most hashtable implementations also rely on indexing too. I can't think of a way to implement an O(1) algorithm that doesn't rely on indexing, but this implies contiguous memory for the index at least.
Distributed hash maps have such a property. Well, actually, not quite: basically a hash function tells you which storage bucket to look in, and in there you'll probably need to rely on traditional hash maps. It doesn't completely cover your requirements, as the list containing the storage areas / nodes (in a distributed scenario) is usually itself a hash map (essentially making it a hash table of hash tables), although you could use some other algorithm, e.g. if the number of storage areas is known.
EDIT:
Forgot a little tidbit: you'd probably want to use different hash functions for the different levels, otherwise you'll end up with a lot of similar hash values within each storage area.
A bit of a curiosity: the hash trie saves space by interleaving in memory the key-arrays of trie nodes that happen not to collide. That is, if node 1 has keys A,B,D while node 2 has keys C,X,Y,Z, for example, then you can use the same contiguous storage for both nodes at once. It's generalized to different offsets and an arbitrary number of nodes; Knuth used this in his most-common-words program in Literate Programming.
So this gives O(1) access to the keys of any given node, without reserving contiguous storage to it, albeit using contiguous storage for all nodes collectively.
It's possible to allocate a memory block not for the whole data, but only for a reference array pointing to pieces of the data. This brings a dramatic decrease in the length of contiguous memory required.
Another option: if the elements can be identified with keys and these keys can be uniquely mapped to the available memory locations, it is possible not to place all the objects contiguously, leaving spaces between them. This requires control over memory allocation so you can still distribute free memory and relocate 2nd-priority objects somewhere else when you have to use a memory location for a 1st-priority object. They would still be contiguous in a sub-dimension, though.
Can I name a common data structure which answers your question? No.
Some pseudo-O(1) answers:
A VList is O(1) access (on average), and doesn't require that the whole of the data is contiguous, though it does require contiguous storage in small blocks. Other data structures based on numerical representations are also amortized O(1).
A numerical representation applies the same 'cheat' that a radix sort does, yielding an O(k) access structure: if there is an upper bound on the index, such as it being a 64-bit int, then a binary tree where each level corresponds to a bit in the index takes constant time. Of course, that constant k is greater than ln N for any N that can be used with the structure, so it's not likely to be a performance improvement (radix sort can get performance improvements if k is only a little greater than ln N and the implementation of the radix sort better exploits the platform).
If you use the same representation of a binary tree that is common in heap implementations, you end up back at an array.
