Sets and hashmaps don't have constant lookup time? - algorithm

I was reading about tries and this topcoder article (https://www.topcoder.com/community/data-science/data-science-tutorials/using-tries/) says:
The tries can insert and find strings in O(L) time (where L represent the length of a single word). This is much faster than set , but is it a bit faster than a hash table.
I had always learned that sets and hash tables were really fast for looking things up and that they had constant lookup time. Is this not true? Why is it "much faster" than a set? And it also seems to imply that hash tables have different lookup time than sets too. I always thought that sets and hash tables were implemented in pretty much the same way except that one stores some object.

The referenced article is not comparing a trie with an abstract "set" datastructure; it is comparing the trie with the C++ standard library std::set, which is a search tree, usually a red-black tree, which allows you to iterate the contents in sorted order. (C++ also has std::unordered_set, which is based on a hash table, but the article may have been written before that was part of the standard library.)
Hash tables are (on average) O(1) only if the hash can be computed in O(1), since the hash of the key must be computed before any lookup is done. For string keys, most hash functions need to look at every character in the key, so they are O(L) in the length of the string. (This rather obvious fact is for some reason often skipped over in discussion of hashtable computational complexity.) Since both the trie and the hashtable must eventually verify that the provided key is equal to the candidate key in the container, there is an O(L) factor in both cases.
However, tries still have advantages. For example, they can be iterated in lexicographic order, like std::set, but usually faster, whereas hashtables can only be iterated in some non-deterministic order. So if you need to do prefix searches, the hashtable is not an appropriate datastructure.

Related

Hash Table Insertion Time Complexity Confusion

I understand that insertion for hash tables is O(1) and sometimes O(n) depending on the load factor. This makes sense to me, however, I'm still confused. When talking about insertion, are we including the hash function in that measurement or is it just placing some value at that index? For ints, I could see how it could be O(1), but what about strings or any other objects?
Edit: This seemed to answer my question, sorry about the confusion.
Time complexity of creating hash value of a string in hashtable
Yes, the hash function needs to be included in the cost of lookup in a hash table, just the like comparison function needs to be included in the cost of lookup in a sorted table. If the keys have unbounded size, then the key length must be somehow accounted for.
You could stop computing the hash at a certain fixed key length. (Lua does that, for example). But there is a pathological case where every new key is a suffix of the previously inserted key, which would eventually reduce a bounded-length hash function to a linear search.
Regardless of the hash function, the hash table lookup must eventually compare the found key --if there is one-- with the target key, to ensure that they are the same. So that must take time proportional to the size of the key.
In short, constant average-time hash table lookup requires that keys have a bounded size, which is not necessarily the case. But since alternative lookup algorithms would also be affected by key size, this fact doesn't generally help in comparing different lookup up algorithms.

Why implement a Hashtable with a Binary Search Tree?

When implementing a Hashtable using an array, we inherit the constant time indexing of the array. What are the reasons for implementing a Hashtable with a Binary Search Tree since it offers search with O(logn)? Why not just use a Binary Search Tree directly?
If the elements don't have a total order (i.e. the "greater than" and "less than" is not be defined for all pairs or it is not consistent between elements), you can't compare all pairs, thus you can't use a BST directly, but nothing's stopping you from indexing the BST by the hash value - since this is an integral value, it obviously has a total order (although you'd still need to resolve collision, that is have a way to handle elements with the same hash value).
However, one of the biggest advantages of a BST over a hash table is the fact that the elements are in order - if we order it by hash value, the elements will have an arbitrary order instead, and this advantage would no longer be applicable.
As for why one might consider implementing a hash table using a BST instead of an array, it would:
Not have the disadvantage of needing to resize the array - with an array, you typically mod the hash value with the array size and resize the array if it gets full, reinserting all elements, but with a BST, you can just directly insert the unchanging hash value into the BST.
This might be relevant if we want any individual operation to never take more than a certain amount of time (which could very well happen if we need to resize the array), with the overall performance being secondary, but there might be better ways to solve this problem.
Have a reduced risk of hash collisions since you don't mod with the array size and thus the number of possible hashes could be significantly bigger. This would reduce the risk of getting the worst-case performance of a hash table (which is when a significant portion of the elements hash to the same value).
What the actual worst-case performance is would depend on how you're resolving collisions. This is typically done with linked-lists for O(n) worst case performance. But we can also achieve O(log n) performance with BST's (as is done in Java's hash table implementation if the number of elements with some hash are above a threshold) - that is, have your hash table array where each element points to a BST where all elements have the same hash value.
Possibly use less memory - with an array you'd inevitably have some empty indices, but with a BST, these simply won't need to exist. Although this is not a clear-cut advantage, if it's an advantage at all.
If we assume we use the less common array-based BST implementation, this array will also have some empty indices and this would also require the occasional resizing, but this is a simply memory copy as opposed to needing to reinsert all elements with updated hashes.
If we use the typical pointer-based BST implementation, the added cost for the pointers would seemingly outweigh the cost of having a few empty indices in an array (unless the array is particularly sparse, which tends to be a bad sign for a hash table anyway).
But, since I haven't personally ever heard of this ever being done, presumably the benefits are not worth the increased cost of operations from expected O(1) to O(log n).
Typically the choice is indeed between using a BST directly (without hash values) and using a hash table (with an array).
Pros:
Potentially use less space b/c we don't allocate a large array
Can iterate through the keys in order, sometimes useful
Cons:
You'd have O(log N) lookup time, which is worse than the guaranteed O(1) for a chained hash table.
Since the requirements of a Hash Table are O(1) lookup, it's not a Hash Table if it has logarithmic lookup times. Granted, since collision is an issue with the array implementation (well, not likely an issue), using a BST could offer benefits in that regard. Generally, though, it's not worth the tradeoff - I can't think of a situation where you wouldn't want guaranteed O(1) lookup time when using a Hash Table.
Alternatively, there is the possibility of an underlying structure to guarantee logarithmic insertion and deletion via a BST variant, where each index in the array has a reference to the corresponding node in the BST. A structure like that could get sort of complex, but would guarantee O(1) lookup and O(logn) insertion/deletion.
I found this looking to see if anyone had done it. I guess maybe not.
I came up with an idea this morning of implementing a binary tree as an array consisting of rows stored by index. Row 1 has 1, row 2 has 2, row 3 has 4 (yes, powers of two). The advantage of this structure is a bit shift and addition or subtraction can be used to walk the tree instead of using extra memory to store bi- or uni-directional references.
This would allow you to rapidly search for a hash value based on some sort of hashable input, to discover if the value exists in some other store. Or for a hash collision (or partial collision) search. I can't think of many other uses for it but for these it would be phenomenally fast. Very likely a lot of the rotation operations would happen entirely in cpu cache and be written out in nice linear blobs to main memory.
Its main utility would be with sorting input values of a random nature. If the blobs in the array were two parts, like a hash, and an identifier for another store, you could do the comparisons very fast and insert very fast to discover where an item bearing a hash value is kept in another location (like the UUID of a filesystem node or maybe even the filename, or other short identifiable string).
I'll leave it to others to dream of other ways to use it but I'm using it for a graph theoretic proof of work search table for identifying partial collisions for a variant of Cuckoo Cycle.
I am just now working on the walk formula, and here it is:
i = index of array element
Walk Up (go to parent):
i>>1-(i+1)%2
(Obviously you probably need to test if i is zero)
Walk Left (down and left):
i<<1+2
(this and the next would also need to test against 2^depth of the structure, so it doesn't walk off the edge and fall back to the root)
Walk Right (down and right):
i<<1+1
As you can see, each walk is a short formula based on the index. A bit shift and addition for going left and right, and a bit shift, addition and modulus for ascending. Two instructions to move down, 4 to move up (in assembler, or as above in C and other HLL operator notation)
edit:
I can see from further commentary that the benefit of slashing the insert time definitely would be of benefit. But I don't think that a conventional vector based binary tree would provide nearly as much benefit as a dense version. A dense version, where all the nodes are in a contiguous array, when it is searched, naturally will travel in a linear fashion through the memory, which should help reduce cache misses and thus reduce the latency of the searches significantly, as well as the fact that there is a latency hit with memory in accessing randomly compared to streaming through blocks sequentially.
https://github.com/calibrae-project/bast/blob/master/pkg/bast/bast.go
This is my current state of a WiP to implement what I am calling a Bifurcation Array Search Tree. For the purpose of a fast insert/delete and not horribly slow search through a sorted collection of hashes, I think that this would be of quite large benefit for cases where there is a lot of data coming and going through the structure, or more to the point, beneficial for more realtime applications.

Why not use hashing/hash tables for everything?

In computer science, it is said that the insert, delete and searching operations for hash tables have a complexity of O(1), which is the best. So, I was wondering, why do we need to use other data structures since hashing operations are so fast? Why can't we just simply use hashing/hash tables for everything?
Hash tables, on average, do have excellent time complexity for insertion, retrieval, and deletion. BUT:
Big-O complexity isn't everything. The constant factor is also very important. You could use hashtables in place of arrays, with the array indexes as hash keys. In either case, the time complexity of retrieving an item is O(1). But the constant factor is way higher for the hash table as opposed to the array.
Memory consumption may be much higher. This is certainly true if you use hash tables to replace arrays. (Of course, if the array is sparse, then the hash table may take less memory.)
There are some operations which are not efficiently supported by hash tables, such as iterating over all the elements whose keys are within a certain range, finding the element with the largest key or smallest key, and so on.
The O(n) complexity is on average. For some extreme cases (for example, all data fall into the same bucket), it would be inefficient.
All of that aside, you do still have a good point. Hashtables have an extraordinarily broad range of suitable use cases. That's why they are the primary built-in data structure in some scripting languages, like Lua.
You may use Hash to search the element, but you cannot use it to do the things like find the largest number quickly, you should use the data strutcture for the specified problem. Hash cannot solve all the problem.
HashTable is not answer for all. If your hash function does not distribute your key well than hashMap may turn into a linkedList in worst case for which the insertion, deletion, search will take O(N) in worst case.
HashMap has significant memory footprint so there are some use cases where you memory is too precious than time complexity then you HashMap may not be the best choice.
HashMap is not an answer for range queries or prefix queries. So that is why most of the database vendor do implement indexing by Btree rather than only by hashing for range or prefix queries.
HashTable in general exhibit poor locality of reference that is, the data to be accessed is distributed seemingly at random in memory.
For certain string processing applications, such as spellchecking, hash tables may be less efficient than tries, finite automata, or Judy arrays. Also, if each key is represented by a small enough number of bits, then, instead of a hash table, one may use the key directly as the index into an array of values. Note that there are no collisions in this case.
Hash Tables are not sorted (map)
Hash Tables are not best for head/tail insert (link list/deque)
Hash Tables have overhead to support searching (vector/array)
The potential security issues of hash tables on the web should also be pointed out. If someone knows the hash function, that person may perform a denial-of-service attack by creating lots of items with the same hashcode.
I don't get it, enum/symbol-keys not wasteful enough? ;) What about just using the raw string pointer as key? I must have overlooked some obvious advantage in hashing... but now thinking about it, it makes less and less sense.
It's all just local representation anyway, right? I mean, I could share the data everywhere... API's, IPC or RPC - but not sure how helpful those hashed keys are unless the full string is embedded too.
Meaning you just spent a lot of time hashing strings back and forth for your own amusement.
I'll just leave this here...

Is a data structure implementation with O(1) search possible without using arrays?

I am currently taking a university course in data structures, and this topic has been bothering me for a while now (this is not a homework assignment, just a purely theoretical question).
Let's assume you want to implement a dictionary. The dictionary should, of course, have a search function, accepting a key and returning a value.
Right now, I can only imagine 2 very general methods of implementing such a thing:
Using some kind of search tree, which would (always?) give an O(log n) worst case running time for finding the value by the key, or,
Hashing the key, which essentially returns a natural number which corresponds to an index in an array of values, giving an O(1) worst case running time.
Is O(1) worst case running time possible for a search function, without the use of arrays?
Is random access available only through the use of arrays?
Is it possible through the use of a pointer-based data structure (such as linked lists, search trees, etc.)?
Is it possible when making some specific assumptions, for example, the keys being in some order?
In other words, can you think of an implementation (if one is possible) for the search function and the dictionary that will receive any key in the dictionary and return its value in O(1) time, without using arrays for random access?
Here's another answer I made on that general subject.
Essentially, algorithms reach their results by processing a certain number of bits of information. The length of time they take depends on how quickly they can do that.
A decision point having only 2 branches cannot process more than 1 bit of information. However, a decision point having n branches can process up to log(n) bits (base 2).
The only mechanism I'm aware of, in computers, that can process more than 1 bit of information, in a single operation, is indexing, whether it is indexing an array or executing a jump table (which is indexing an array).
It is not the use of an array that makes the lookup O(1), it's the fact that the lookup time is not dependent upon the size of the data storage. Hence any method that accesses data directly, without a search proportional in some way to the data sotrage size, would be O(1).
you could have a hash implemented with a trie tree. The complexity is O(max(length(string))), if you have strings of limited size, then you could say it runs in O(1), it doesn't depend on the number of strings you have in the structure. http://en.wikipedia.org/wiki/Trie

Advantages of Binary Search Trees over Hash Tables

What are the advantages of binary search trees over hash tables?
Hash tables can look up any element in Theta(1) time and it is just as easy to add an element....but I'm not sure of the advantages going the other way around.
One advantage that no one else has pointed out is that binary search tree allows you to do range searches efficiently.
In order to illustrate my idea, I want to make an extreme case. Say you want to get all the elements whose keys are between 0 to 5000. And actually there is only one such element and 10000 other elements whose keys are not in the range. BST can do range searches quite efficiently since it does not search a subtree which is impossible to have the answer.
While, how can you do range searches in a hash table? You either need to iterate every bucket space, which is O(n), or you have to look for whether each of 1,2,3,4... up to 5000 exists.
(what about the keys between 0 and 5000 are an infinite set? for example keys can be decimals)
Remember that Binary Search Trees (reference-based) are memory-efficient. They do not reserve more memory than they need to.
For instance, if a hash function has a range R(h) = 0...100, then you need to allocate an array of 100 (pointers-to) elements, even if you are just hashing 20 elements. If you were to use a binary search tree to store the same information, you would only allocate as much space as you needed, as well as some metadata about links.
One "advantage" of a binary tree is that it may be traversed to list off all elements in order. This is not impossible with a Hash table but is not a normal operation one design into a hashed structure.
In addition to all the other good comments:
Hash tables in general have better cache behavior requiring less memory reads compared to a binary tree. For a hash table you normally only incur a single read before you have access to a reference holding your data. The binary tree, if it is a balanced variant, requires something in the order of k * lg(n) memory reads for some constant k.
On the other hand, if an enemy knows your hash-function the enemy can enforce your hash table to make collisions, greatly hampering its performance. The workaround is to choose the hash-function randomly from a family, but a BST does not have this disadvantage. Also, when the hash table pressure grows too much, you often tend to enlargen and reallocate the hash table which may be an expensive operation. The BST has simpler behavior here and does not tend to suddenly allocate a lot of data and do a rehashing operation.
Trees tend to be the ultimate average data structure. They can act as lists, can easily be split for parallel operation, have fast removal, insertion and lookup on the order of O(lg n). They do nothing particularly well, but they don't have any excessively bad behavior either.
Finally, BSTs are much easier to implement in (pure) functional languages compared to hash-tables and they do not require destructive updates to be implemented (the persistence argument by Pascal above).
The main advantages of a binary tree over a hash table is that the binary tree gives you two additional operations you can't do (easily, quickly) with a hash table
find the element closest to (not necessarily equal to) some arbitrary key value (or closest above/below)
iterate through the contents of the tree in sorted order
The two are connected -- the binary tree keeps its contents in a sorted order, so things that require that sorted order are easy to do.
A (balanced) binary search tree also has the advantage that its asymptotic complexity is actually an upper bound, while the "constant" times for hash tables are amortized times: If you have a unsuitable hash function, you could end up degrading to linear time, rather than constant.
A binary tree is slower to search and insert into, but has the very nice feature of the infix traversal which essentially means that you can iterate through the nodes of the tree in a sorted order.
Iterating through the entries of a hash table just doesn't make a lot of sense because they are all scattered in memory.
A hashtable would take up more space when it is first created - it will have available slots for the elements that are yet to be inserted (whether or not they are ever inserted), a binary search tree will only be as big as it needs to be. Also, when a hash-table needs more room, expanding to another structure could be time-consuming, but that might depend on the implementation.
A binary search tree can be implemented with a persistent interface, where a new tree is returned but the old tree continues to exist. Implemented carefully, the old and new trees shares most of their nodes. You cannot do this with a standard hash table.
BSTs also provide the "findPredecessor" and "findSuccessor" operations (To find the next smallest and next largest elements) in O(logn) time, which might also be very handy operations. Hash Table can't provide in that time efficiency.
From Cracking the Coding Interview, 6th Edition
We can implement the hash table with a balanced binary search tree (BST) . This gives us an O(log n) lookup time. The advantage of this is potentially using less space, since we no longer allocate a large array. We can also iterate through the keys in order, which can be useful sometimes.
GCC C++ case study
Let's also get some insight from one of the most important implementations in the world. As we will see, it actually matches out theory perfectly!
As shown at What is the underlying data structure of a STL set in C++?, in GCC 6.4:
std::map uses BST
std::unordered_map uses hashmap
So this already points out to the fact that you can't transverse a hashmap efficiently, which is perhaps the main advantage of a BST.
And then, I also benchmarked insertion times in hash map vs BST vs heap at Heap vs Binary Search Tree (BST) which clearly highlights the key performance characteristics:
BST insertion is O(log), hashmap is O(1). And in this particular implementation, hashmap is almost always faster than BST, even for relatively small sizes
hashmap, although much faster in general, has some extremely slow insertions visible as single points in the zoomed out plot.
These happen when the implementation decides that it is time to increase its size, and it needs to be copied over to a larger one.
In more precise terms, this is because only its amortized complexity is O(1), not the worst case, which is actually O(n) during the array copy.
This might make hashmaps inadequate for certain real-time applications, where you need stronger time guarantees.
Related:
Binary Trees vs. Linked Lists vs. Hash Tables
https://cs.stackexchange.com/questions/270/hash-tables-versus-binary-trees
If you want to access the data in a sorted manner, then a sorted list has to be maintained in parallel to the hash table. A good example is Dictionary in .Net. (see http://msdn.microsoft.com/en-us/library/3fcwy8h6.aspx).
This has the side-effect of not only slowing inserts, but it consumes a larger amount of memory than a b-tree.
Further, since a b-tree is sorted, it is simple to find ranges of results, or to perform unions or merges.
It also depends on the use, Hash allows to locate exact match. If you want to query for a range then BST is the choice. Suppose you have a lots of data e1, e2, e3 ..... en.
With hash table you can locate any element in constant time.
If you want to find range values greater than e41 and less than e8, BST can quickly find that.
The key thing is the hash function used to avoid a collision. Of course, we cannot totally avoid a collision, in which case we resort to chaining or other methods. This makes retrieval no longer constant time in worst cases.
Once full, hash table has to increase its bucket size and copy over all the elements again. This is an additional cost not present over BST.
Binary search trees are good choice to implement dictionary if the keys have some total order (keys are comparable) defined on them and you want to preserve the order information.
As BST preserves the order information, it provides you with four additional dynamic set operations that cannot be performed (efficiently) using hash tables. These operations are:
Maximum
Minimum
Successor
Predecessor
All these operations like every BST operation have time complexity of O(H). Additionally all the stored keys remain sorted in the BST thus enabling you to get the sorted sequence of keys just by traversing the tree in in-order.
In summary if all you want is operations insert, delete and remove then hash table is unbeatable (most of the time) in performance. But if you want any or all the operations listed above you should use a BST, preferably a self-balancing BST.
A hashmap is a set associative array. So, your array of input values gets pooled into buckets. In an open addressing scheme, you have a pointer to a bucket, and each time you add a new value into a bucket, you find out where in the bucket there are free spaces. There are a few ways to do this- you start at the beginning of the bucket and increment the pointer each time and test whether its occupied. This is called linear probing. Then, you can do a binary search like add, where you double the difference between the beginning of the bucket and where you double up or back down each time you are searching for a free space. This is called quadratic probing.
OK. Now the problems in both these methods is that if the bucket overflows into the next buckets address, then you need to-
Double each buckets size- malloc(N buckets)/change the hash function-
Time required: depends on malloc implementation
Transfer/Copy each of the earlier buckets data into the new buckets data. This is an O(N) operation where N represents the whole data
OK. but if you use a linkedlist there shouldn't be such a problem right? Yes, In linked lists you don't have this problem. Considering each bucket to begin with a linked list, and if you have 100 elements in a bucket it requires you to traverse those 100 elements to reach the end of the linkedlist hence the List.add(Element E) will take time to-
Hash the element to a bucket- Normal as in all implementations
Take time to find the last element in said bucket- O(N) operation.
The advantage of the linkedlist implementation is that you don't need the memory allocation operation and O(N) transfer/copy of all buckets as in the case of the open addressing implementation.
So, the way to minimize the O(N) operation is to convert the implementation to that of a Binary Search Tree where find operations are O(log(N)) and you add the element in its position based on it's value. The added feature of a BST is that it comes sorted!
Hash Tables are not good for indexing. When you are searching for a range, BSTs are better. That's the reason why most database indexes use B+ trees instead of Hash Tables
Binary search trees can be faster when used with string keys. Especially when strings are long.
Binary search trees using comparisons for less/greater which are fast for strings (when they are not equal). So a BST can quickly answer when a string is not found.
When it's found it will need to do only one full comparison.
In a hash table. You need to calculate the hash of the string and this means you need to go through all bytes at least once to compute the hash. Then again, when a matching entry is found.

Resources