Data structure with O(1) insertion time and O(log m) lookup? - performance

Backstory (skip to second-to-last paragraph for data structure part): I'm working on a compression algorithm (of the LZ77 variety). The algorithm boils down to finding the longest match between a given string and all strings that have already been seen.
To do this quickly, I've used a hash table (with separate chaining) as recommended in the DEFLATE spec: I insert every string seen so far one at a time (one per input byte) with m slots in the chain for each hash code. Insertions are fast (constant-time with no conditional logic), but searches are slow because I have to look at O(m) strings to find the longest match. Because I do hundreds of thousands of insertions and tens of thousands of lookups in a typical example, I need a highly efficient data structure if I want my algorithm to run quickly (currently it's too slow for m > 4; I'd like an m closer to 128).
I've implemented a special case where m is 1, which runs very fast but offers only so-so compression. Now I'm working on an algorithm for those who'd prefer improved compression ratio over speed, where the larger m is, the better the compression gets (to a point, obviously). Unfortunately, my attempts so far are too slow for the modest gains in compression ratio as m increases.
So, I'm looking for a data structure that allows very fast insertion (since I do more insertions than searches), but still fairly fast searches (better than O(m)). Does an O(1) insertion and O(log m) search data structure exist? Failing that, what would be the best data structure to use? I'm willing to sacrifice memory for speed. I should add that on my target platform, jumps (ifs, loops, and function calls) are very slow, as are heap allocations (I have to implement everything myself using a raw byte array in order to get acceptable performance).
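For reference, here is a minimal sketch of the kind of flat, branch-free insertion described above (a ring buffer of m slots per bucket inside one raw array); the names and sizes are illustrative assumptions, not part of the original design:

#include <stdint.h>

#define NUM_BUCKETS (1 << 15)      /* illustrative bucket count (power of two) */
#define M 128                      /* slots per bucket (power of two) */

static uint32_t slots[NUM_BUCKETS * M]; /* positions of previously seen strings */
static uint8_t  cursor[NUM_BUCKETS];    /* next slot to overwrite, per bucket */

static void insert(uint32_t hash, uint32_t pos)
{
    uint32_t b = hash & (NUM_BUCKETS - 1);
    slots[b * M + cursor[b]] = pos;         /* overwrite the oldest entry */
    cursor[b] = (cursor[b] + 1) & (M - 1);  /* wrap around without a branch */
}

Insertion stays constant-time with no conditional logic; the cost is that a lookup still scans up to m slots, which is exactly the problem being asked about.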
So far, I've thought of storing the m strings in sorted order, which would allow O(log m) searches using a binary search, but then the insertions also become O(log m).
Thanks!

You might be interested in this match-finding structure:
http://encode.ru/threads/1393-A-proposed-new-fast-match-searching-structure
It has O(1) insertion time and O(m) lookup, but m can be many times smaller than in a standard hash table for an equivalent match-finding result. As an example, with m = 4, this structure gets results equivalent to an 80-probe hash table.

You might want to consider using a trie (aka prefix tree) instead of a hash table.
For your particular application, you might be able to additionally optimize insertion. If you know that after inserting ABC you're likely to insert ABCD, then keep a reference to the entry created for ABC and just extend it with D; there's no need to repeat the lookup of the prefix.
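A minimal sketch of that extension trick, assuming a plain 256-way trie (the node layout is illustrative; on a platform where heap allocation is slow, the nodes would come from a preallocated pool instead of calloc):

#include <stdlib.h>

typedef struct TrieNode {
    struct TrieNode *child[256];    /* one slot per possible next byte */
} TrieNode;

/* Extend the entry reached for some prefix (e.g. "ABC") by one byte
   (e.g. 'D'), returning the node for the longer string. No walk from
   the root is needed. */
static TrieNode *extend(TrieNode *node, unsigned char c)
{
    if (node->child[c] == NULL)
        node->child[c] = calloc(1, sizeof(TrieNode)); /* pool in practice */
    return node->child[c];          /* keep this for the next extension */
}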

One common optimization in hash tables is to move the item you just found to the head of the list (with the idea that it's likely to be used again soon for that bucket). Perhaps you can use a variation of this idea.
If you do all of your insertions before you do your lookups, you can add a bit to each bucket that says whether the chain for that bucket is sorted. On each lookup, you can check the bit to see if the bucket is sorted. If not, you would sort the bucket and set the bit. Once the bucket is sorted, each lookup is O(lg m).
If you interleave insertions and lookups, you could have 2 lists for each bucket: one that's sorted and one that isn't. Inserts would always go to the non-sorted list. A lookup would first check the sorted list, and only if it's not there would it look in the non-sorted list. When it's found in the non-sorted list you would remove it and put it in the sorted list. This way you only pay to sort items that you look up.
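A sketch of that two-list bucket, assuming fixed-capacity integer buckets purely for illustration (capacity checks omitted for brevity):

#include <string.h>

typedef struct {
    int sorted[64];   int n_sorted;    /* binary-searchable portion */
    int unsorted[64]; int n_unsorted;  /* cheap-insert portion */
} Bucket;

static int lookup(Bucket *b, int key)
{
    /* binary search the sorted portion: O(log m) */
    int lo = 0, hi = b->n_sorted;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (b->sorted[mid] < key) lo = mid + 1; else hi = mid;
    }
    if (lo < b->n_sorted && b->sorted[lo] == key)
        return 1;

    /* fall back to a linear scan of the unsorted portion */
    for (int i = 0; i < b->n_unsorted; i++) {
        if (b->unsorted[i] == key) {
            /* promote the hit into the sorted portion (lo is the
               insertion point found by the binary search above) */
            memmove(&b->sorted[lo + 1], &b->sorted[lo],
                    (size_t)(b->n_sorted - lo) * sizeof(int));
            b->sorted[lo] = key;
            b->n_sorted++;
            b->unsorted[i] = b->unsorted[--b->n_unsorted];
            return 1;
        }
    }
    return 0;
}

Inserts just append to the unsorted portion in O(1); only items that are actually looked up pay the promotion cost.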

Related

Hashtable with chaining efficiency if linked lists are sorted

I am currently working on an exercise from CLRS; here is the problem:
11.2-3
Professor Marley hypothesizes that he can obtain substantial performance gains by modifying the chaining scheme to keep each list in sorted order. How does the professor's modification affect the running time for successful searches, unsuccessful searches, insertions, and deletions?
I saw on the internet that the answer is the following:
I do not understand why the result is like this. My answer is that since the linked lists are sorted, we can use a dichotomy for the search, so the expected search time (as well as the worst-case time) is Θ(log₂ α) (α being the load factor n/m, where n is the number of keys actually stored in the table and m its capacity).
I accept that deletion still takes Θ(1) time (if the lists are doubly linked), and I said insertion will now take Θ(log₂ α) because you need to determine the correct place for the element you are adding to the list. Why is this not the correct answer?
A technical point: If you store the buckets as a linked list, then you can't use binary search to look over the items in that linked list in time O(log b), where b is the number of items in the bucket.
But let's suppose that instead of doing this you use a dynamic array for each bucket. Then binary search would drop the per-bucket search time from linear to logarithmic, in an asymptotic sense. However, if you were to do that:
You're now using dynamic arrays rather than linked lists for your buckets. If you're storing large elements in your buckets, since most buckets won't be very loaded, the memory overhead of the unused slots in the array will start to add up.
From a practical perspective, you now need some way of comparing the elements you're hashing from lowest to highest. In Theoryland, that's not a problem. In practice, though, this could be a bit of a nuisance.
But more importantly, you might want to ask whether this is worthwhile in the first place. Remember that in a practical hash table the choice of α you'll be using is probably going to be very small (say, α ≤ 5 or something like that). For small α, a binary search might actually be slower than a linear scan, even if in theory for sufficiently large α it's faster.
So generally, you don't see this approach used in practice. If you're looking to speed up a hash table, it's probably better to change hashing strategies (say, use open addressing rather than chaining) or to try to squeeze performance out in other ways.

Why "delete" operation is considered to be "slow" on a sorted array?

I am currently studying algorithms and data structures with the help of the famous Stanford course by Tim Roughgarden. In video 13-1, when explaining Balanced Binary Search Trees, he compared them to sorted arrays and mentioned that we do not do deletion on a sorted array because it is too slow (I believe he meant "slow in comparison with the other operations, which run in constant [Select, Min/Max, Pred/Succ], O(log n) [Search, Rank], and O(n) [Output/print] time").
I cannot stop thinking about this statement. Namely I cannot wrap my mind around the following:
Let's say we are given an order statistic or a value of the item we want to delete from a sorted (ascending) array.
We can most certainly find its position in the array using Select or Search, in constant or O(log n) time respectively.
We can then remove this item and shift all the items to the right of the deleted one left by one position (decrementing their indices by one), which will take O(n) time.
The whole operation will take linear time, O(n), in the worst-case scenario.
Key question: am I thinking about this the wrong way? If not, why is deletion considered slow and undesirable?
You are correct: deleting from an array is slow because you have to move all elements after it one position to the left, so that you can cover the hole you created.
Whether O(n) is considered slow depends on the situation. Deleting from an array is most likely part of a larger, more complex algorithm, e.g. inside a loop. This then would add a factor of n to your final complexity, which is usually bad. Using a tree would only add a factor of log n, and O(n log n) is much better than O(n^2) (asymptotically).
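For concreteness, the delete-with-shift on a sorted array is a single memmove; a minimal sketch:

#include <string.h>

/* Remove a[pos] from a sorted array of *n elements by shifting the
   tail one position to the left: O(n - pos) element moves. */
static void array_delete(int *a, int *n, int pos)
{
    memmove(&a[pos], &a[pos + 1], (size_t)(*n - pos - 1) * sizeof(int));
    (*n)--;
}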
The statement is relative to the specific data structure which is being used to hold the sorted values: A sorted array. This specific data structure would be selected for simplicity, for efficient storage, and for quick searches, but is slow for adding and removing elements from the data structure.
Other data structures which hold sorted values may be selected. For example, a binary tree, or a balanced binary tree, or a trie. Each has different characteristics in terms of operation performance and storage efficiency, and would be selected based on the intended usage.
A sorted array is slow for additions and removals because, on average, these operations require shifting half of the array to make room for a new element (or, respectively, to fill in an emptied cell).
However, on many architectures, the simplicity of the data structure and the speed of shifting means that the data structure is fine for "small" data sets.

Why implement a Hashtable with a Binary Search Tree?

When implementing a Hashtable using an array, we inherit the constant-time indexing of the array. What are the reasons for implementing a Hashtable with a Binary Search Tree, since it offers search in O(log n)? Why not just use a Binary Search Tree directly?
If the elements don't have a total order (i.e. "greater than" and "less than" are not defined for all pairs, or are not consistent between elements), you can't compare all pairs, so you can't use a BST directly. Nothing stops you from indexing the BST by the hash value, though: since this is an integral value, it obviously has a total order (although you'd still need to resolve collisions, that is, have a way to handle elements with the same hash value).
However, one of the biggest advantages of a BST over a hash table is the fact that the elements are in order - if we order it by hash value, the elements will have an arbitrary order instead, and this advantage would no longer be applicable.
As for why one might consider implementing a hash table using a BST instead of an array, it would:
Not have the disadvantage of needing to resize the array - with an array, you typically mod the hash value with the array size and resize the array if it gets full, reinserting all elements, but with a BST, you can just directly insert the unchanging hash value into the BST.
This might be relevant if we want any individual operation to never take more than a certain amount of time (which could very well happen if we need to resize the array), with the overall performance being secondary, but there might be better ways to solve this problem.
Have a reduced risk of hash collisions since you don't mod with the array size and thus the number of possible hashes could be significantly bigger. This would reduce the risk of getting the worst-case performance of a hash table (which is when a significant portion of the elements hash to the same value).
What the actual worst-case performance is would depend on how you're resolving collisions. This is typically done with linked lists, for O(n) worst-case performance. But we can also achieve O(log n) performance with BSTs (as is done in Java's hash table implementation when the number of elements with some hash is above a threshold): that is, have your hash table array where each element points to a BST in which all elements have the same hash value (a sketch of this bucket-as-BST layout follows this answer).
Possibly use less memory - with an array you'd inevitably have some empty indices, but with a BST, these simply won't need to exist. Although this is not a clear-cut advantage, if it's an advantage at all.
If we assume we use the less common array-based BST implementation, this array will also have some empty indices and will also require the occasional resizing, but that is simply a memory copy, as opposed to needing to reinsert all elements with updated hashes.
If we use the typical pointer-based BST implementation, the added cost for the pointers would seemingly outweigh the cost of having a few empty indices in an array (unless the array is particularly sparse, which tends to be a bad sign for a hash table anyway).
But, since I haven't personally heard of this being done, presumably the benefits are not worth the increased cost of operations from expected O(1) to O(log n).
Typically the choice is indeed between using a BST directly (without hash values) and using a hash table (with an array).
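A minimal sketch of the bucket-as-BST layout mentioned in the answer above: each array slot points to a BST ordered by the full (un-modded) hash value, so a crowded bucket degrades to roughly O(log b) rather than O(b). All names are illustrative; a production version (like Java's) would balance the tree and handle equal hashes properly:

#include <stdint.h>
#include <stdlib.h>

typedef struct Node {
    uint64_t hash;                 /* full (un-modded) hash: the BST key */
    void *value;
    struct Node *left, *right;
} Node;

#define TABLE_SIZE 1024            /* illustrative */
static Node *buckets[TABLE_SIZE];

static void bst_bucket_insert(uint64_t hash, void *value)
{
    Node **p = &buckets[hash % TABLE_SIZE];
    while (*p)                     /* descend, ordered by full hash value */
        p = (hash < (*p)->hash) ? &(*p)->left : &(*p)->right;
    *p = calloc(1, sizeof(Node));  /* equal hashes would still need chaining */
    (*p)->hash = hash;
    (*p)->value = value;
}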
Pros:
Potentially use less space because we don't allocate a large array
Can iterate through the keys in order, sometimes useful
Cons:
You'd have O(log N) lookup time, which is worse than the expected O(1) of a chained hash table.
Since the requirement of a Hash Table is O(1) lookup, it's arguably not a Hash Table if it has logarithmic lookup times. Granted, since collisions are an issue with the array implementation (though not likely a big one), using a BST could offer benefits in that regard. Generally, though, it's not worth the tradeoff: I can't think of a situation where you wouldn't want expected O(1) lookup time when using a Hash Table.
Alternatively, there is the possibility of an underlying structure that guarantees logarithmic insertion and deletion via a BST variant, where each index in the array has a reference to the corresponding node in the BST. A structure like that could get sort of complex, but would guarantee O(1) lookup and O(log n) insertion/deletion.
I found this looking to see if anyone had done it. I guess maybe not.
I came up with an idea this morning of implementing a binary tree as an array consisting of rows stored by index. Row 1 has 1 node, row 2 has 2, row 3 has 4 (yes, powers of two). The advantage of this structure is that a bit shift and an addition or subtraction can be used to walk the tree, instead of using extra memory to store bi- or uni-directional references.
This would allow you to rapidly search for a hash value based on some sort of hashable input, to discover whether the value exists in some other store, or to search for a hash collision (or partial collision). I can't think of many other uses for it, but for these it would be phenomenally fast. Very likely a lot of the rotation operations would happen entirely in CPU cache and be written out in nice linear blobs to main memory.
Its main utility would be with sorting input values of a random nature. If the entries in the array had two parts, a hash and an identifier into another store, you could do the comparisons very fast and insert very fast, to discover where an item bearing a hash value is kept in another location (like the UUID of a filesystem node, or maybe even the filename, or another short identifiable string).
I'll leave it to others to dream of other ways to use it but I'm using it for a graph theoretic proof of work search table for identifying partial collisions for a variant of Cuckoo Cycle.
I am just now working on the walk formula, and here it is:
i = index of array element (root at index 0)
Walk up (go to parent):
(i >> 1) - ((i + 1) % 2)
(Obviously you need to test whether i is zero first.)
Walk left (down and left):
(i << 1) + 2
(This and the next also need to be tested against 2^depth of the structure, so the walk doesn't run off the edge and fall back to the root.)
Walk right (down and right):
(i << 1) + 1
As you can see, each walk is a short formula based on the index: a bit shift and an addition to go left or right, and a bit shift, a modulus, and a subtraction to ascend. Two instructions to move down, four to move up (in assembler, or as above in C-style operator notation). Note that the parentheses are required in C, where +, - and % bind more tightly than the shifts.
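A sketch of the same walks as C helpers, keeping the post's own child convention (left = 2i + 2, right = 2i + 1; callers must check i != 0 before walking up, and check against the tree's size before walking down):

static unsigned walk_up(unsigned i)    { return (i >> 1) - ((i + 1) % 2); }
static unsigned walk_left(unsigned i)  { return (i << 1) + 2; }
static unsigned walk_right(unsigned i) { return (i << 1) + 1; }

For example, walk_right(2) == 5 and walk_left(2) == 6, and correspondingly walk_up(5) == 2 and walk_up(6) == 2.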
edit:
I can see from further commentary that slashing the insert time would definitely be a benefit. But I don't think a conventional vector-based binary tree would provide nearly as much benefit as a dense version. A dense version, where all the nodes are in a contiguous array, naturally travels through memory in a linear fashion when searched, which should help reduce cache misses and thus reduce search latency significantly; there is also a latency penalty for accessing memory randomly compared to streaming through blocks sequentially.
https://github.com/calibrae-project/bast/blob/master/pkg/bast/bast.go
This is the current state of my work in progress to implement what I am calling a Bifurcation Array Search Tree. For fast insert/delete and a not-horribly-slow search through a sorted collection of hashes, I think this would be of quite large benefit for cases where a lot of data is coming and going through the structure, or, more to the point, for more realtime applications.

Which search data structure works best for sorted integer data?

I have over a billion sorted integers. Which data structure do you think can exploit the sorted behavior? The main goal is to search items faster...
Options I can think of --
1) A regular binary search tree, built by recursively splitting in the middle.
2) Any other balanced binary search tree should work well, but it does not exploit the sorted property...
Thanks in advance..
[Edit]
Insertions and deletions are very rare...
Also, apart from the integers I have to store some other information in the nodes; I think plain arrays can't do that unless it is a list, right?
This really depends on what operations you want to do on the data.
If you are just searching the data and never inserting or deleting anything, just storing the data in a giant sorted array may be perfectly fine. You could then use binary search to look up elements efficiently in O(log n) time. However, insertions and deletions can be expensive since with a billion integers O(n) will hurt. You could store auxiliary information inside the array itself, if you'd like, by just placing it next to each of the integers.
However, with a billion integers, this may be too memory-intensive and you may want to switch to using a bit vector. You could then do a binary search over the bit vector in time O(log U), where U is the number of bits. With a billion integers, I assume that U and n would be close, so this isn't much of a penalty. Depending on the machine word size, this could save you anywhere from 32x to 128x memory without causing too much of a performance hit. Plus, it will increase the locality of the binary searches, which can improve performance as well. This does make it much slower to actually iterate over the numbers in the list, but it makes insertions and deletions take O(1) time. In order to do this, you'd need to store some secondary structure (perhaps a hash table?) containing the data associated with each of the integers. This isn't too bad, since you can use the sorted bit vector for sorted queries and the unsorted hash table once you've found what you're looking for.
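A minimal sketch of the bit-vector representation for 32-bit values (2³² bits = 512 MiB); membership tests, insertions, and deletions are all O(1) bit operations:

#include <stdint.h>
#include <stdlib.h>

static uint64_t *bits;   /* 2^32 bits packed into 2^26 64-bit words */

static void bv_init(void)         { bits = calloc((size_t)1 << 26, sizeof(uint64_t)); }
static void bv_insert(uint32_t v) { bits[v >> 6] |=  (UINT64_C(1) << (v & 63)); }
static void bv_delete(uint32_t v) { bits[v >> 6] &= ~(UINT64_C(1) << (v & 63)); }
static int  bv_test(uint32_t v)   { return (bits[v >> 6] >> (v & 63)) & 1; }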
If you also need to add and remove values from the list, a balanced BST can be a good option. However, because you specifically know that you're storing integers, you may want to look at the more complex van Emde Boas tree structure, which supports insertion, deletion, predecessor, successor, find-max, and find-min all in O(log log U) time (U being the size of the integer universe), which is exponentially faster than binary search trees. The implementation cost of this approach is high, though, since the data structure is notoriously tricky to get right.
Another data structure you might want to explore is a bitwise trie, which has the same time bounds as the sorted bit vector but allows you to store auxiliary data along with each integer. Plus, it's super easy to implement!
Hope this helps!
The best data structure for searching sorted integers is an array.
You can search it with log(N) operations, and it is more compact (less memory overhead) than a tree.
And you don't even have to write any code (so less chance of a bug) -- just use bsearch from your standard library.
With a sorted array the best you can achieve is with an interpolation search, which gives you O(log log n) average time. It is essentially a binary search, but instead of dividing the array into two subarrays of the same size, it probes where the key is expected to lie given the values at the endpoints.
It's really fast and extraordinarily easy to implement.
http://en.wikipedia.org/wiki/Interpolation_search
Don't let the worst-case O(n) bound scare you, because with 1 billion integers it's practically impossible to encounter.
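A minimal implementation over a sorted int array (the probe position is interpolated from the key's value relative to a[lo] and a[hi]):

/* Returns the index of key in a[0..n-1], or -1 if absent. */
static int interpolation_search(const int *a, int n, int key)
{
    int lo = 0, hi = n - 1;
    while (lo <= hi && key >= a[lo] && key <= a[hi]) {
        if (a[hi] == a[lo])                 /* avoid dividing by zero */
            return (a[lo] == key) ? lo : -1;
        /* probe proportionally to where key sits between a[lo] and a[hi] */
        int mid = lo + (int)((long long)(key - a[lo]) * (hi - lo)
                             / (a[hi] - a[lo]));
        if (a[mid] < key)      lo = mid + 1;
        else if (a[mid] > key) hi = mid - 1;
        else                   return mid;
    }
    return -1;
}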
O(1) solutions:
Assuming 32-bit integers and a lot of RAM:
A lookup table of roughly 2³² entries (4 billion), where each index corresponds to the count of integers with that value.
Assuming larger integers:
A really big hash table. The usual modulus hash function would be appropriate if you have a decent distribution of the values; if not, you might want to combine the 32-bit strategy with a hash lookup.

Advantages of Binary Search Trees over Hash Tables

What are the advantages of binary search trees over hash tables?
Hash tables can look up any element in Θ(1) time and it is just as easy to add an element... but I'm not sure of the advantages going the other way around.
One advantage that no one else has pointed out is that binary search trees allow you to do range searches efficiently.
To illustrate the idea, consider an extreme case: say you want to get all the elements whose keys are between 0 and 5000, and there is only one such element along with 10000 other elements whose keys are not in the range. A BST can do range searches quite efficiently, since it does not descend into a subtree that cannot possibly contain the answer.
Meanwhile, how can you do range searches in a hash table? You either need to iterate over every bucket, which is O(n), or you have to check whether each of 1, 2, 3, 4 ... up to 5000 exists.
(And what if the keys between 0 and 5000 form an infinite set? For example, keys can be decimals.)
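A sketch of the pruned range search being described (unbalanced BST, recursion kept for clarity):

typedef struct Node {
    int key;
    struct Node *left, *right;
} Node;

/* Visit every key in [lo, hi]; subtrees that cannot contain such keys
   are never entered. */
static void range_search(const Node *t, int lo, int hi,
                         void (*visit)(const Node *))
{
    if (t == NULL) return;
    if (lo < t->key)                  /* left subtree may hold keys >= lo */
        range_search(t->left, lo, hi, visit);
    if (lo <= t->key && t->key <= hi)
        visit(t);
    if (t->key < hi)                  /* right subtree may hold keys <= hi */
        range_search(t->right, lo, hi, visit);
}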
Remember that Binary Search Trees (reference-based) are memory-efficient. They do not reserve more memory than they need to.
For instance, if a hash function has a range R(h) = 0...100, then you need to allocate an array of 100 (pointers-to) elements, even if you are just hashing 20 elements. If you were to use a binary search tree to store the same information, you would only allocate as much space as you needed, as well as some metadata about links.
One "advantage" of a binary tree is that it may be traversed to list off all elements in order. This is not impossible with a Hash table but is not a normal operation one design into a hashed structure.
In addition to all the other good comments:
Hash tables in general have better cache behavior, requiring fewer memory reads compared to a binary tree. For a hash table you normally incur only a single read before you have access to a reference holding your data. The binary tree, if it is a balanced variant, requires something on the order of k * lg(n) memory reads for some constant k.
On the other hand, if an enemy knows your hash function, the enemy can force your hash table to make collisions, greatly hampering its performance. The workaround is to choose the hash function randomly from a family, but a BST does not have this disadvantage. Also, when the hash table pressure grows too high, you often tend to enlarge and reallocate the hash table, which may be an expensive operation. The BST has simpler behavior here and does not tend to suddenly allocate a lot of data and do a rehashing operation.
Trees tend to be the ultimate average data structure. They can act as lists, can easily be split for parallel operation, have fast removal, insertion and lookup on the order of O(lg n). They do nothing particularly well, but they don't have any excessively bad behavior either.
Finally, BSTs are much easier to implement in (pure) functional languages compared to hash-tables and they do not require destructive updates to be implemented (the persistence argument by Pascal above).
The main advantages of a binary tree over a hash table are that the binary tree gives you two additional operations you can't do (easily, quickly) with a hash table:
find the element closest to (not necessarily equal to) some arbitrary key value (or closest above/below)
iterate through the contents of the tree in sorted order
The two are connected -- the binary tree keeps its contents in a sorted order, so things that require that sorted order are easy to do.
A (balanced) binary search tree also has the advantage that its asymptotic complexity is actually an upper bound, while the "constant" times for hash tables are amortized times: if you have an unsuitable hash function, you could end up degrading to linear time, rather than constant.
A binary tree is slower to search and insert into, but has the very nice feature of the infix traversal which essentially means that you can iterate through the nodes of the tree in a sorted order.
Iterating through the entries of a hash table just doesn't make a lot of sense because they are all scattered in memory.
A hashtable would take up more space when it is first created - it will have available slots for the elements that are yet to be inserted (whether or not they are ever inserted), a binary search tree will only be as big as it needs to be. Also, when a hash-table needs more room, expanding to another structure could be time-consuming, but that might depend on the implementation.
A binary search tree can be implemented with a persistent interface, where a new tree is returned but the old tree continues to exist. Implemented carefully, the old and new trees shares most of their nodes. You cannot do this with a standard hash table.
BSTs also provide the "findPredecessor" and "findSuccessor" operations (to find the next smallest and next largest elements) in O(log n) time, which might also be very handy. A hash table can't provide that time efficiency.
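A sketch of findSuccessor on a plain BST without parent pointers: the smallest key strictly greater than the target, found in O(h) time by remembering the last node where the search branched left. findPredecessor is symmetric.

#include <stddef.h>

typedef struct Node {
    int key;
    struct Node *left, *right;
} Node;

static const Node *find_successor(const Node *t, int key)
{
    const Node *best = NULL;        /* last node seen with key > target */
    while (t != NULL) {
        if (key < t->key) { best = t; t = t->left; }
        else              { t = t->right; }
    }
    return best;                    /* NULL if no successor exists */
}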
From Cracking the Coding Interview, 6th Edition
We can implement the hash table with a balanced binary search tree (BST) . This gives us an O(log n) lookup time. The advantage of this is potentially using less space, since we no longer allocate a large array. We can also iterate through the keys in order, which can be useful sometimes.
GCC C++ case study
Let's also get some insight from one of the most important implementations in the world. As we will see, it actually matches our theory perfectly!
As shown at What is the underlying data structure of a STL set in C++?, in GCC 6.4:
std::map uses BST
std::unordered_map uses hashmap
So this already points to the fact that you can't traverse a hashmap in order efficiently, which is perhaps the main advantage of a BST.
And then, I also benchmarked insertion times in hash map vs BST vs heap at Heap vs Binary Search Tree (BST) which clearly highlights the key performance characteristics:
BST insertion is O(log n), hashmap insertion is O(1). And in this particular implementation, the hashmap is almost always faster than the BST, even for relatively small sizes.
hashmap, although much faster in general, has some extremely slow insertions visible as single points in the zoomed out plot.
These happen when the implementation decides that it is time to increase its size, and it needs to be copied over to a larger one.
In more precise terms, this is because only its amortized complexity is O(1), not the worst case, which is actually O(n) during the array copy.
This might make hashmaps inadequate for certain real-time applications, where you need stronger time guarantees.
Related:
Binary Trees vs. Linked Lists vs. Hash Tables
https://cs.stackexchange.com/questions/270/hash-tables-versus-binary-trees
If you want to access the data in a sorted manner, then a sorted list has to be maintained in parallel to the hash table. A good example is Dictionary in .Net. (see http://msdn.microsoft.com/en-us/library/3fcwy8h6.aspx).
This has the side effect of not only slowing inserts, but also consuming a larger amount of memory than a b-tree.
Further, since a b-tree is sorted, it is simple to find ranges of results, or to perform unions or merges.
It also depends on the use: a hash allows you to locate an exact match. If you want to query for a range, then a BST is the choice. Suppose you have a lot of data e1, e2, e3, ..., en.
With hash table you can locate any element in constant time.
If you want to find the range of values greater than e8 and less than e41, a BST can quickly find that.
The key thing is the hash function used to avoid collisions. Of course, we cannot totally avoid collisions, in which case we resort to chaining or other methods. This makes retrieval no longer constant time in the worst case.
Once full, a hash table has to increase its bucket count and copy over all the elements again. This is an additional cost not present with a BST.
Binary search trees are a good choice for implementing a dictionary if the keys have some total order defined on them (the keys are comparable) and you want to preserve the order information.
As BST preserves the order information, it provides you with four additional dynamic set operations that cannot be performed (efficiently) using hash tables. These operations are:
Maximum
Minimum
Successor
Predecessor
All these operations, like every BST operation, have time complexity O(H), where H is the height of the tree. Additionally, all the stored keys remain sorted in the BST, enabling you to get the sorted sequence of keys just by traversing the tree in order.
In summary, if all you want is the operations insert, search, and delete, then a hash table is unbeatable (most of the time) in performance. But if you want any or all of the operations listed above, you should use a BST, preferably a self-balancing one.
A hashmap works like a set-associative array: your input values get pooled into buckets. In an open addressing scheme, you have a pointer into the table, and each time you add a new value you have to find a free slot. There are a few ways to do this: you can start at the hashed position and advance one slot at a time, testing whether each slot is occupied; this is called linear probing. Alternatively, you can grow the step between successive probes each time you are searching for a free slot; this is called quadratic probing.
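A sketch of linear probing as just described (for quadratic probing the step would grow instead of staying at one); EMPTY and the sizes are illustrative:

#include <stdint.h>

#define TABLE_SIZE 1024
#define EMPTY UINT32_MAX

static uint32_t table[TABLE_SIZE];

static void probe_init(void)
{
    for (int i = 0; i < TABLE_SIZE; i++)
        table[i] = EMPTY;
}

static void probe_insert(uint32_t key)   /* assumes the table never fills */
{
    uint32_t i = key % TABLE_SIZE;
    while (table[i] != EMPTY)
        i = (i + 1) % TABLE_SIZE;        /* linear probing: step by one */
    table[i] = key;
}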
OK. Now the problem with both of these methods is that when the table fills up, you need to:
Double the table size: malloc(N buckets) and/or change the hash function. Time required: depends on the malloc implementation.
Transfer/copy each element of the old table into the new table. This is an O(N) operation, where N represents the whole data set.
OK, but if you use a linked list there shouldn't be such a problem, right? Yes, with linked lists you don't have this problem. But if each bucket begins with a linked list and you have 100 elements in a bucket, you must traverse those 100 elements to reach the end of the list. Hence List.add(Element E) will take time to:
Hash the element to a bucket: normal, as in all implementations.
Find the last element in said bucket: an O(N) operation.
The advantage of the linked-list implementation is that you don't need the memory allocation operation and the O(N) transfer/copy of all buckets, as in the case of the open addressing implementation.
So, the way to minimize the O(N) operation is to convert the implementation to that of a binary search tree, where find operations are O(log N) and you add the element in its position based on its value. The added feature of a BST is that it comes sorted!
Hash Tables are not good for indexing. When you are searching for a range, BSTs are better. That's the reason why most database indexes use B+ trees instead of Hash Tables
Binary search trees can be faster when used with string keys. Especially when strings are long.
Binary search trees use less/greater comparisons, which are fast for strings (when they are not equal, the comparison can stop at the first differing character). So a BST can quickly answer when a string is not found.
When it is found, it will need to do only one full comparison.
In a hash table, you need to calculate the hash of the string, and this means you need to go through all of its bytes at least once to compute the hash, and then again to compare the string when a matching entry is found.
