I want to cache 10,000+ key/value pairs (both strings) and have started thinking about which .NET (2.0, bound to MS Studio 2005 :( ) structure would be best. All items will be added in one shot, then there will be a few hundred queries for particular keys.
I've read the MSDN descriptions referenced in the other question, but I still miss some details about the implementation / complexity of operations on the various collections.
E.g. in the above-mentioned question, there is a quote from MSDN saying that SortedList is based on a tree and that SortedDictionary "has a similar object model" but different complexity.
The other question: are Hashtable and Dictionary implemented in the same way?
For Hashtable, they write:
If Count is less than the capacity of the Hashtable, this method is an O(1) operation. If the capacity needs to be increased to accommodate the new element, this method becomes an O(n) operation, where n is Count.
But when is the capacity increased? With every Add? Then adding a series of key/value pairs would be quadratic, the same as with SortedList.
Not to mention OrderedDictionary, for which nothing is said about implementation / complexity.
Maybe someone knows some good article about implementation of .NET collections?
The capacity of the Hashtable is different from the Count.
The capacity -- the maximum number of items that can be stored, normally related to the number of underlying hash buckets -- doubles when a "grow" is required, although this is implementation-dependent. The Count simply refers to the number of items actually stored, which must be less than or equal to the capacity but is otherwise not related to it.
Because of the exponentially increasing interval between the O(n) resizes (n = Count), most hash implementations claim O(1) amortized access. The quote is just saying: "Hey! It's amortized, and isn't always true!"
Happy coding.
If you are adding that many pairs, you can/should use this Dictionary constructor to specify the capacity in advance. Then every add and lookup will be O(1).
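For example, a minimal sketch of that bulk-load-then-query pattern (the key/value naming here is purely illustrative):

    using System;
    using System.Collections.Generic;

    class CacheExample
    {
        static void Main()
        {
            // Capacity known up front: the buckets are allocated once,
            // so no rehashing happens during the bulk load.
            Dictionary<string, string> cache = new Dictionary<string, string>(10000);

            for (int i = 0; i < 10000; i++)
            {
                cache.Add("key" + i, "value" + i);   // O(1): no resize needed
            }

            string value;
            if (cache.TryGetValue("key42", out value))   // O(1) lookup
            {
                Console.WriteLine(value);
            }
        }
    }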
If you really want to see how these classes are implemented, you can look at the Rotor source or use .NET Reflector to look at System.Collections (not sure of the legality of the latter).
Hashtable and Dictionary are implemented in essentially the same way; Dictionary is the generic replacement for Hashtable.
When the capacity of a collection like List or Dictionary has to increase, it grows at a certain rate. For List the rate is 2.0, i.e. the capacity is doubled. I don't know the exact rate for Dictionary, but it works the same way.
For a List, the way the capacity is increased means that each item has, on average, been copied about 1.3 extra times. Since that value stays constant as the list grows, the Add method is still an O(1) operation on average.
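If you want to watch that happen, a quick sketch like the one below prints List<T>'s Capacity whenever it changes (the exact sequence is an implementation detail of the runtime, but you should see it doubling: 4, 8, 16, ...):

    using System;
    using System.Collections.Generic;

    class ListGrowth
    {
        static void Main()
        {
            List<int> list = new List<int>();
            int lastCapacity = list.Capacity;
            for (int i = 0; i < 1000; i++)
            {
                list.Add(i);
                if (list.Capacity != lastCapacity)   // a "grow" just happened
                {
                    lastCapacity = list.Capacity;
                    Console.WriteLine("Count=" + list.Count + " Capacity=" + lastCapacity);
                }
            }
        }
    }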
Dictionary is a kind of hashtable; I never use the original Hashtable since it only holds "objects". Don't worry about the fact that insertion is O(N) when the capacity is increased; Dictionary roughly doubles the capacity when the hashtable is full, so the average (amortized) complexity is O(1).
You should almost never use SortedList (which is basically a sorted array), since complexity is O(N) for each insert or delete, assuming the data is not already sorted. (If the data is already sorted, each insert is O(1), but then you don't need SortedList anyway; an ordinary List would suffice.) Instead of SortedList, use SortedDictionary, which offers O(log N) insert, delete, and search. However, SortedDictionary is slower than Dictionary, so use it only if your data needs to be sorted.
You say you want to cache 10,000 key-value pairs. If you want to do all the inserts before you do any queries, an efficient method is to create an unsorted List, then Sort it, and use BinarySearch for queries. This approach saves a lot of memory compared to using SortedDictionary, and it creates less work for the garbage collector.
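A hedged sketch of that approach, using parallel arrays rather than a List (Array.Sort can sort the keys and reorder the values to match); the class name is illustrative:

    using System;

    class SortedArrayCache
    {
        private readonly string[] keys;
        private readonly string[] values;

        public SortedArrayCache(string[] keys, string[] values)
        {
            this.keys = keys;
            this.values = values;
            // One-time O(n log n): sorts 'keys' and reorders 'values' in step.
            Array.Sort(this.keys, this.values, StringComparer.Ordinal);
        }

        public string Lookup(string key)
        {
            // O(log n) per query, with no per-entry object overhead.
            int i = Array.BinarySearch(keys, key, StringComparer.Ordinal);
            return i >= 0 ? values[i] : null;
        }
    }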
I don't understand how hash tables give constant-time lookup if there's a constant number of buckets. Say we have 100 buckets and 1,000,000 elements. That's clearly O(n) lookup, and the whole point of complexity analysis is to understand how things behave for very large values of n. Thus, a hashtable is never constant lookup; it's always O(n) lookup.
Why do people say it's O(1) lookup on average, and only O(n) for worst case?
The purpose of using a hash is to be able to index into the table directly, just like an array. In the ideal case there's only one item per bucket, and we achieve O(1) easily.
A practical hash table will have more buckets than it has elements, so that the odds of having only one element per bucket are high. If the number of elements inserted into the table gets too great, the table will be resized to increase the number of buckets.
There is always a possibility that every element will have the same hash, or that all active hashes will be assigned to the same bucket; in that case the lookup time is indeed O(n). But a good hash table implementation will be designed to minimize the chance of that occurring.
In layman's terms, with some hand-waving:
At the one extreme, you can have a hash map that is perfectly distributed with one value per bucket. In this case, your lookup returns the value directly, and cost is 1 operation -- or on the order of one, if you like: O(1).
In the real world, implementations often arrange for that to be the case, by expanding the size of the table, etc., to meet the requirements of the data. When you have more items than buckets, complexity starts to increase.
In the worst case, you have one bucket with all n items in it. In this case, it is basically like searching a list, linearly. So if the value happens to be the last one, you need to do n comparisons to find it. Or, on the order of n: O(n).
The latter case is pretty much always /possible/ for a given data set, which is why so much study and effort has been put into coming up with good hashing algorithms. It is theoretically possible to engineer a dataset that will cause collisions, so there is some way to end up with O(n) performance, unless the implementation tweaks other aspects: table size, hash implementation, etc.
By saying
Say we have 100 buckets, and 1,000,000 elements.
you are basically depriving the hashmap of its real power of rehashing, and you are not considering the initial capacity of the hashmap in accordance with your needs. A hashmap is most efficient when each entry gets its own bucket; a lower collision rate can be achieved with a higher hashmap capacity. Each collision means you need to traverse the corresponding list.
The points below should be considered for a hash table implementation.
A hashtable is designed so that it resizes itself when the number of entries exceeds the number of buckets by a certain threshold value. This is how we should design our own custom hashtable if we wish to implement one.
A good hash function makes sure that entries are well distributed across the buckets of the hashtable. This keeps the lists in the buckets short.
Together, the two points above keep the access time constant.
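A minimal sketch of both points together -- chained buckets plus a threshold-triggered resize. All names are illustrative, and real implementations differ in details (prime-sized tables, open addressing, etc.):

    using System.Collections.Generic;

    class TinyHashTable<TKey, TValue>
    {
        private List<KeyValuePair<TKey, TValue>>[] buckets =
            new List<KeyValuePair<TKey, TValue>>[16];
        private int count;
        private const double LoadFactor = 0.75;   // resize threshold

        public void Add(TKey key, TValue value)
        {
            if (count + 1 > buckets.Length * LoadFactor)
                Resize();   // O(n), but rare: amortized O(1) per Add

            int i = Index(key, buckets.Length);
            if (buckets[i] == null)
                buckets[i] = new List<KeyValuePair<TKey, TValue>>();
            buckets[i].Add(new KeyValuePair<TKey, TValue>(key, value));
            count++;
        }

        public bool TryGet(TKey key, out TValue value)
        {
            int i = Index(key, buckets.Length);
            if (buckets[i] != null)
            {
                // A good hash function keeps this list short.
                foreach (KeyValuePair<TKey, TValue> pair in buckets[i])
                {
                    if (EqualityComparer<TKey>.Default.Equals(pair.Key, key))
                    {
                        value = pair.Value;
                        return true;
                    }
                }
            }
            value = default(TValue);
            return false;
        }

        private void Resize()
        {
            List<KeyValuePair<TKey, TValue>>[] old = buckets;
            buckets = new List<KeyValuePair<TKey, TValue>>[old.Length * 2];
            foreach (List<KeyValuePair<TKey, TValue>> bucket in old)
            {
                if (bucket == null) continue;
                foreach (KeyValuePair<TKey, TValue> pair in bucket)
                {
                    int i = Index(pair.Key, buckets.Length);   // rehash under the new size
                    if (buckets[i] == null)
                        buckets[i] = new List<KeyValuePair<TKey, TValue>>();
                    buckets[i].Add(pair);
                }
            }
        }

        private static int Index(TKey key, int bucketCount)
        {
            // Strip the sign bit so the modulus is non-negative; assumes non-null keys.
            return (key.GetHashCode() & 0x7FFFFFFF) % bucketCount;
        }
    }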
When implementing a hashtable using an array, we inherit the constant-time indexing of the array. What are the reasons for implementing a hashtable with a binary search tree, since it offers search in O(log n)? Why not just use a binary search tree directly?
If the elements don't have a total order (i.e. "greater than" and "less than" are not defined for all pairs, or are not consistent between elements), you can't compare all pairs, and thus you can't use a BST directly. But nothing's stopping you from indexing the BST by the hash value: since this is an integral value, it obviously has a total order (although you'd still need to resolve collisions, that is, have a way to handle elements with the same hash value).
However, one of the biggest advantages of a BST over a hash table is the fact that the elements are in order - if we order it by hash value, the elements will have an arbitrary order instead, and this advantage would no longer be applicable.
As for why one might consider implementing a hash table using a BST instead of an array, it would:
Not have the disadvantage of needing to resize the array - with an array, you typically mod the hash value with the array size and resize the array if it gets full, reinserting all elements, but with a BST, you can just directly insert the unchanging hash value into the BST.
This might be relevant if we want any individual operation to never take more than a certain amount of time (which could very well happen if we need to resize the array), with the overall performance being secondary, but there might be better ways to solve this problem.
Have a reduced risk of hash collisions since you don't mod with the array size and thus the number of possible hashes could be significantly bigger. This would reduce the risk of getting the worst-case performance of a hash table (which is when a significant portion of the elements hash to the same value).
What the actual worst-case performance is depends on how you're resolving collisions. This is typically done with linked lists, for O(n) worst-case performance. But we can also achieve O(log n) performance with BSTs (as is done in Java's hash table implementation when the number of elements with some hash is above a threshold): that is, have your hash table array where each element points to a BST in which all elements have the same hash value (see the sketch after this list).
Possibly use less memory - with an array you'd inevitably have some empty indices, but with a BST, these simply won't need to exist. Although this is not a clear-cut advantage, if it's an advantage at all.
If we assume we use the less common array-based BST implementation, this array will also have some empty indices and would also require occasional resizing, but that is simply a memory copy as opposed to needing to reinsert all elements with updated hashes.
If we use the typical pointer-based BST implementation, the added cost for the pointers would seemingly outweigh the cost of having a few empty indices in an array (unless the array is particularly sparse, which tends to be a bad sign for a hash table anyway).
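To make the layout concrete, here is a hedged sketch that borrows .NET's SortedDictionary (a red-black tree) as the per-bucket BST. The names are illustrative, and note the constraint: it assumes comparable keys, which is precisely what a plain hash table does not require. A degenerate bucket of k colliding items now costs O(log k) instead of O(k):

    using System;
    using System.Collections.Generic;

    class TreeBucketTable<TKey, TValue> where TKey : IComparable<TKey>
    {
        private readonly SortedDictionary<TKey, TValue>[] buckets;

        public TreeBucketTable(int bucketCount)
        {
            buckets = new SortedDictionary<TKey, TValue>[bucketCount];
        }

        public void Add(TKey key, TValue value)
        {
            int i = (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
            if (buckets[i] == null)
                buckets[i] = new SortedDictionary<TKey, TValue>();
            buckets[i][key] = value;   // tree insert: O(log k) for a bucket of k items
        }

        public bool TryGet(TKey key, out TValue value)
        {
            int i = (key.GetHashCode() & 0x7FFFFFFF) % buckets.Length;
            if (buckets[i] != null)
                return buckets[i].TryGetValue(key, out value);   // O(log k), never O(k)
            value = default(TValue);
            return false;
        }
    }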
But, since I haven't personally heard of this being done, presumably the benefits are not worth the increased cost of operations, from expected O(1) to O(log n).
Typically the choice is indeed between using a BST directly (without hash values) and using a hash table (with an array).
Pros:
Potentially use less space b/c we don't allocate a large array
Can iterate through the keys in order, sometimes useful
Cons:
You'd have O(log N) lookup time, which is worse than the expected O(1) of a chained hash table.
Since the requirements of a Hash Table are O(1) lookup, it's not a Hash Table if it has logarithmic lookup times. Granted, since collision is an issue with the array implementation (well, not likely an issue), using a BST could offer benefits in that regard. Generally, though, it's not worth the tradeoff - I can't think of a situation where you wouldn't want guaranteed O(1) lookup time when using a Hash Table.
Alternatively, there is the possibility of an underlying structure to guarantee logarithmic insertion and deletion via a BST variant, where each index in the array has a reference to the corresponding node in the BST. A structure like that could get sort of complex, but would guarantee O(1) lookup and O(logn) insertion/deletion.
I found this looking to see if anyone had done it. I guess maybe not.
I came up with an idea this morning of implementing a binary tree as an array consisting of rows stored by index. Row 1 has 1 node, row 2 has 2, row 3 has 4 (yes, powers of two). The advantage of this structure is that a bit shift and an addition or subtraction can be used to walk the tree, instead of using extra memory to store bi- or uni-directional references.
This would allow you to rapidly search for a hash value based on some sort of hashable input, to discover if the value exists in some other store. Or for a hash collision (or partial collision) search. I can't think of many other uses for it but for these it would be phenomenally fast. Very likely a lot of the rotation operations would happen entirely in cpu cache and be written out in nice linear blobs to main memory.
Its main utility would be with sorting input values of a random nature. If the blobs in the array were two parts, like a hash, and an identifier for another store, you could do the comparisons very fast and insert very fast to discover where an item bearing a hash value is kept in another location (like the UUID of a filesystem node or maybe even the filename, or other short identifiable string).
I'll leave it to others to dream of other ways to use it but I'm using it for a graph theoretic proof of work search table for identifying partial collisions for a variant of Cuckoo Cycle.
I am just now working on the walk formula, and here it is:
i = index of array element
Walk Up (go to parent):
(i >> 1) - ((i + 1) % 2)
(Obviously you probably need to test if i is zero)
Walk Left (down and left):
(i << 1) + 2
(this and the next would also need to test against 2^depth of the structure, so it doesn't walk off the edge and fall back to the root)
Walk Right (down and right):
(i << 1) + 1
As you can see, each walk is a short formula based on the index: a bit shift and an addition for going left or right, and a bit shift, addition, and modulus for ascending. Two instructions to move down, 4 to move up (in assembler, or as above in C and other HLL operator notation; the parentheses matter, since in C the shifts bind more loosely than + and %).
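For illustration, a hedged C# sketch of the three walks with the boundary tests folded in; 'depth' (the number of rows) is an assumed parameter, and a tree of 'depth' rows holds (1 << depth) - 1 nodes:

    // Parent of node i; the root (i == 0) is its own parent here.
    static int Parent(int i)
    {
        if (i == 0) return 0;                      // test i against zero, as noted above
        return (i >> 1) - ((i + 1) % 2);           // shift, add, mod, subtract: 4 ops
    }

    static int Left(int i, int depth)
    {
        int child = (i << 1) + 2;                  // shift and add: 2 ops
        return child < (1 << depth) - 1 ? child : -1;   // -1: would walk off the edge
    }

    static int Right(int i, int depth)
    {
        int child = (i << 1) + 1;
        return child < (1 << depth) - 1 ? child : -1;
    }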
edit:
I can see from further commentary that slashing the insert time would definitely be a benefit. But I don't think a conventional vector-based binary tree would provide nearly as much benefit as a dense version. A dense version, where all the nodes are in a contiguous array, will naturally travel through memory in a linear fashion when searched, which should help reduce cache misses and thus cut the latency of searches significantly, given that accessing memory randomly carries a latency hit compared to streaming through blocks sequentially.
https://github.com/calibrae-project/bast/blob/master/pkg/bast/bast.go
This is the current state of my WiP implementation of what I am calling a Bifurcation Array Search Tree. For fast insert/delete and a not-horribly-slow search through a sorted collection of hashes, I think this would be of quite large benefit in cases where there is a lot of data coming and going through the structure, or, more to the point, for more realtime applications.
I need to frequently find the minimum value object in a set that's being continually updated. I need to have a priority queue type of functionality. What's the best algorithm or data structure to do this? I was thinking of having a sorted tree/heap, and every time the value of an object is updated, I can remove the object, and re-insert it into the tree/heap. Is there a better way to accomplish this?
A binary heap is hard to beat for simplicity, but it has the disadvantage that decrease-key takes O(n) time. I know, the standard references say that it's O(log n), but first you have to find the item. That's O(n) for a standard binary heap.
By the way, if you do decide to use a binary heap, changing an item's priority doesn't require a remove and re-insert. You can change the item's priority in-place and then either bubble it up or sift it down as required.
If the performance of decrease-key is important, a good alternative is a pairing heap, which is theoretically slower than a Fibonacci heap, but is much easier to implement and in practice is faster than the Fibonacci heap due to lower constant factors. In practice, pairing heap compares favorably with binary heap, and outperforms binary heap if you do a lot of decrease-key operations.
You could also marry a binary heap and a dictionary or hash map, and keep the dictionary updated with the position of the item in the heap. This gives you faster decrease-key at the cost of more memory and increased constant factors for the other operations.
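A hedged sketch of that marriage (all names illustrative; it assumes items are distinct, since the dictionary keys on the item itself). The dictionary is kept in sync inside Swap, which is what makes DecreaseKey's lookup O(1); it also demonstrates the in-place bubble-up from two paragraphs up, rather than a remove and re-insert:

    using System;
    using System.Collections.Generic;

    class IndexedMinHeap<T>
    {
        private readonly List<T> items = new List<T>();
        private readonly List<double> keys = new List<double>();
        private readonly Dictionary<T, int> position = new Dictionary<T, int>();

        public void Insert(T item, double key)
        {
            items.Add(item);
            keys.Add(key);
            position[item] = items.Count - 1;
            BubbleUp(items.Count - 1);
        }

        // O(1) to find the item, O(log n) to restore heap order.
        public void DecreaseKey(T item, double newKey)
        {
            int i = position[item];
            keys[i] = newKey;
            BubbleUp(i);
        }

        public T ExtractMin()
        {
            T min = items[0];
            Swap(0, items.Count - 1);
            position.Remove(min);
            items.RemoveAt(items.Count - 1);
            keys.RemoveAt(keys.Count - 1);
            if (items.Count > 0) SiftDown(0);
            return min;
        }

        private void BubbleUp(int i)
        {
            while (i > 0 && keys[i] < keys[(i - 1) / 2])
            {
                Swap(i, (i - 1) / 2);
                i = (i - 1) / 2;
            }
        }

        private void SiftDown(int i)
        {
            while (true)
            {
                int smallest = i, l = 2 * i + 1, r = 2 * i + 2;
                if (l < keys.Count && keys[l] < keys[smallest]) smallest = l;
                if (r < keys.Count && keys[r] < keys[smallest]) smallest = r;
                if (smallest == i) return;
                Swap(i, smallest);
                i = smallest;
            }
        }

        private void Swap(int a, int b)
        {
            T ti = items[a]; items[a] = items[b]; items[b] = ti;
            double tk = keys[a]; keys[a] = keys[b]; keys[b] = tk;
            // Keeping the dictionary in sync on every swap is the whole trick.
            position[items[a]] = a;
            position[items[b]] = b;
        }
    }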
Quoting Wikipedia:
To improve performance, priority queues typically use a heap as their backbone, giving O(log n) performance for inserts and removals, and O(n) to build initially. Alternatively, when a self-balancing binary search tree is used, insertion and removal also take O(log n) time, although building trees from existing sequences of elements takes O(n log n) time; this is typical where one might already have access to these data structures, such as with third-party or standard libraries.
If you are looking for a better way, there must be something special about the objects in your priority queue. For example, if the keys are numbers from 1 to 10, a countsort-based approach may outperform the usual ones.
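For instance, with keys limited to 1..10, a hedged sketch of a bucket queue (names illustrative) gives O(1) Enqueue and O(1) DequeueMin, because the scan is bounded by a constant:

    using System;
    using System.Collections.Generic;

    class BucketPriorityQueue<T>
    {
        // One FIFO queue per priority; keys are assumed to be 1..10.
        private readonly Queue<T>[] buckets = new Queue<T>[11];

        public void Enqueue(T item, int key)
        {
            if (buckets[key] == null) buckets[key] = new Queue<T>();
            buckets[key].Enqueue(item);              // O(1)
        }

        public T DequeueMin()
        {
            for (int k = 1; k <= 10; k++)            // bounded scan: O(1)
            {
                if (buckets[k] != null && buckets[k].Count > 0)
                    return buckets[k].Dequeue();
            }
            throw new InvalidOperationException("The queue is empty.");
        }
    }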
If your application looks anything like repeatedly choosing the next scheduled event in a discrete event simulation, you might consider the options listed in e.g. http://en.wikipedia.org/wiki/Discrete_event_simulation and http://www.acm-sigsim-mskr.org/Courseware/Fujimoto/Slides/FujimotoSlides-03-FutureEventList.pdf. The latter summarizes results from different implementations in this domain, including many of the options considered in other comments and answers, and a search will find a number of papers in this area. Priority queue overhead really does make a difference in how many multiples of real time you can get your simulation to run at, and if you wish to simulate something that takes weeks of real time this can be important.
Is it theoretically possible to have a data-structure that has
O(1) access, insertion, deletion times
and dynamic length?
I'm guessing one hasn't been invented yet, or we would entirely forego the use of arrays and linked lists (separately) and instead opt to use one of these.
Is there a proof that this cannot happen, and therefore some relationship between access time, insertion time, and deletion time (like conservation of energy) suggesting that if one of the times becomes constant, the others have to be linear, or something along those lines?
No such data structure exists on current architectures.
Informal reasoning:
To get better than O(n) time for insertion/deletion, you need a tree data structure of some sort
To get O(1) random access, you can't afford to traverse a tree
The best you can do is get O(log n) for all these operations. That's a fairly good compromise, and there are plenty of data structures that achieve this (e.g. a Skip List).
You can also get "close to O(1)" by using trees with high branching factors. For example, Clojure's persistent data structures use 32-way trees, which gives you O(log32 n) operations. For practical purposes that's fairly close to O(1): log32 of one billion is about 6, for example, so for realistic sizes of n that you are likely to encounter in real-world collections, the depth stays tiny.
If you are willing to settle for amortized constant time, it is called a hash table.
The closest such data structure is a B+ tree, which can easily answer questions like "what is the kth item?", but performs the requisite operations in O(log n) time. Notably, iteration (and access of close elements), especially with a cursor implementation, can be very close to array speeds.
Throw in an extra factor, C, as our "block size" (which should be a multiple of a cache line), and we can get something like insertion time ~ log_C(n) + log_2(C) + C. For C = 256 and 32-bit integers, log_C(n) = 3 implies our structure is 64GB. Beyond this point you're probably looking for a hybrid datastructure and are more worried about network cache effects than local ones.
Let's enumerate your requirements instead of mentioning a single possible data structure first.
Basically, you want constant operation time for...
Access
If you know exactly where the entity that you're looking for is, this is easily accomplished. A hashed value or an indexed location is something that can be used to uniquely identify entities, and provide constant access time. The chief drawback with this approach is that you will not be able to have truly identical entities placed into the same data structure.
Insertion
If you can insert at the very end of a list without having to traverse it, then you can accomplish constant insertion time. The chief drawback with this approach is that you have to have a reference pointing to the end of your list at all times, which must be modified at update time (which, in theory, should be a constant-time operation as well). If you decide to hash every value for fast access later, then there's a cost for both calculating the hash and adding it to some backing structure for quick indexing.
Deletion
The main principle here is that there can't be too many moving parts: you delete from a fixed, well-defined location. Something like a Stack, Queue, or Deque can provide that for the most part, in that they delete only one element, in either LIFO or FIFO order. The chief drawback with this approach is that you can't scan the collection to find an arbitrary element, since that would take O(n) time. If you went the route of using a hash, you could probably do it in O(1) time at the cost of some multiple of O(n) storage space (for the hashes).
Dynamic Length
If you're chaining references, that shouldn't be such a big deal; LinkedList already has an internal Node class. The chief drawback to this approach is that your memory is not infinite. If you were going the hashing route, then the more stuff you have to hash, the higher the probability of a collision (which takes you out of O(1) time and puts you into amortized O(1) time).
Given all this, there's really no single, perfect data structure that gives you absolutely constant runtime performance with dynamic length. I'm also unsure of any value that would be provided by writing a proof for such a thing, since the general use of data structures is to make use of their positives and live with their negatives (in the case of hashed collections: love the access time; no duplicates is an ouchie).
Although, if you were willing to live with some amortized performance, a set is likely your best option.
If we look at it from a Java perspective, then we can say that HashMap lookup takes constant time. But what about the internal implementation? It would still have to search through the particular bucket (the one the key's hashcode mapped to), comparing the keys it finds there. Then why do we say that HashMap lookup takes constant time? Please explain.
Under the appropriate assumptions on the hash function being used, we can say that hash table lookups take expected O(1) time (assuming you're using a standard hashing scheme like linear probing or chained hashing). This means that on average, the amount of work that a hash table does to perform a lookup is at most some constant.
Intuitively, if you have a "good" hash function, you would expect that elements would be distributed more or less evenly throughout the hash table, meaning that the number of elements in each bucket would be close to the number of elements divided by the number of buckets. If the hash table implementation keeps this number low (say, by adding more buckets every time the ratio of elements to buckets exceeds some constant), then the expected amount of work that gets done ends up being some baseline amount of work to choose which bucket should be scanned, then doing "not too much" work looking at the elements there, because on expectation there will only be a constant number of elements in that bucket.
This doesn't mean that hash tables have guaranteed O(1) behavior. In fact, in the worst case, the hashing scheme will degenerate and all elements will end up in one bucket, making lookups take time Θ(n) in the worst case. This is why it's important to design good hash functions.
For more information, you might want to read an algorithms textbook to see the formal derivation of why hash tables support lookups so efficiently. This is usually included as part of a typical university course on algorithms and data structures, and there are many good resources online.
Fun fact: there are certain types of hash tables (cuckoo hash tables, dynamic perfect hash tables) where the worst case lookup time for an element is O(1). These hash tables work by guaranteeing that each element can only be in one of a few fixed positions, with insertions sometimes scrambling around elements to try to make everything fit.
Hope this helps!
The key is in this statement in the docs:
If many mappings are to be stored in a HashMap instance, creating it with a sufficiently large capacity will allow the mappings to be stored more efficiently than letting it perform automatic rehashing as needed to grow the table.
and
The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
http://docs.oracle.com/javase/6/docs/api/java/util/HashMap.html
The internal bucket structure will actually be rebuilt if the load factor is exceeded, allowing for the amortized cost of get and put to be O(1).
Note that if the internal structure is rebuilt, that introduces a performance penalty that is likely to be O(N), so quite a few get and put may be required before the amortized cost approaches O(1) again. For that reason, plan the initial capacity and load factor appropriately, so that you neither waste space, nor trigger avoidable rebuilding of the internal structure.
Hashtables AREN'T O(1).
Via the pigeonhole principle, you cannot be better than O(log(n)) for lookup, because you need log(n) bits per item to uniquely identify n items.
Hashtables seem to be O(1) because they have a small constant factor, combined with the 'n' in their O(log(n)) being increased to the point that, for many practical applications, it is independent of the number of items you are actually using. However, big O notation doesn't care about that fact, and it is a (granted, absurdly common) misuse of the notation to call hashtables O(1).
Because while you could store a million or a billion items in a hashtable and still get the same lookup time as with a single-item hashtable, you lose that ability if you're talking about a nonillion or a googolplex of items. The fact that you will never actually use a nonillion or googolplex items doesn't matter for big O notation.
Practically speaking, hashtable performance can be a constant factor worse than array lookup performance. Which, yes, is also O(log(n)), because you CAN'T do better.
Basically, real-world computers make every array lookup for arrays smaller than their chip bit size just as bad as for their biggest theoretically usable array; since hashtables are clever tricks performed on arrays, that's why you seem to get O(1).
To follow up on templatetypedef's comments as well:
The constant-time implementation of a hash table could be a hashmap with which you maintain a boolean array list indicating whether a particular element exists in a bucket. However, if you implement your hashmap's buckets as linked lists, the worst case would require you to go through a bucket and traverse to the end of its list.