As far as I know, hash tables and double-array tries are two of the fastest data structures for searching a dictionary. Are there any other data structures or algorithms that can beat them?
A hash table is not always a fast data structure for searching. It really depends on how good your hash function is. A poor hash function can map multiple keys to the same index, causing collisions and degrading the hash table to O(n) lookup time.
Self-balancing trees are considered fast data structures as well, since they guarantee O(log n) operations.
Related
In computer science, it is said that the insert, delete, and search operations for hash tables have O(1) complexity, which is the best possible. So I was wondering: why do we need other data structures at all, since hashing operations are so fast? Why can't we simply use hashing/hash tables for everything?
Hash tables, on average, do have excellent time complexity for insertion, retrieval, and deletion. BUT:
Big-O complexity isn't everything. The constant factor is also very important. You could use hash tables in place of arrays, with the array indexes as hash keys. In either case, the time complexity of retrieving an item is O(1), but the constant factor is way higher for the hash table than for the array (see the timing sketch below).
Memory consumption may be much higher. This is certainly true if you use hash tables to replace arrays. (Of course, if the array is sparse, then the hash table may take less memory.)
There are some operations which are not efficiently supported by hash tables, such as iterating over all the elements whose keys are within a certain range, finding the element with the largest key or smallest key, and so on.
The O(1) complexity only holds on average. In some extreme cases (for example, when all the data falls into the same bucket), a hash table degrades to O(n) and becomes inefficient.
All of that aside, you do still have a good point. Hashtables have an extraordinarily broad range of suitable use cases. That's why they are the primary built-in data structure in some scripting languages, like Lua.
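To make the constant-factor point above concrete, here is a minimal Python timing sketch (illustrative only; absolute numbers depend on the machine and interpreter):

```python
import timeit

setup = """
data_list = list(range(1_000_000))
data_dict = {i: i for i in range(1_000_000)}
"""

# Both lookups are O(1), but the dict has to hash the key and probe
# buckets, while the list does a single base-plus-offset calculation.
list_time = timeit.timeit("data_list[500_000]", setup=setup, number=1_000_000)
dict_time = timeit.timeit("data_dict[500_000]", setup=setup, number=1_000_000)
print(f"list index: {list_time:.3f}s  dict lookup: {dict_time:.3f}s")
```

On a typical run the dict lookup is noticeably slower per access, even though both are constant time.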
You may use a hash table to search for an element, but you cannot use it for things like quickly finding the largest number; you should use the data structure suited to the specific problem. Hashing cannot solve every problem.
A hash table is not the answer for everything. If your hash function does not distribute your keys well, a hash map may degenerate into a linked list, for which insertion, deletion, and search all take O(n) in the worst case.
A hash map has a significant memory footprint, so in use cases where memory is more precious than time complexity, a hash map may not be the best choice.
A hash map is not an answer for range queries or prefix queries. That is why most database vendors implement indexing with B-trees rather than hashing alone: to support range and prefix queries.
Hash tables in general exhibit poor locality of reference; that is, the data to be accessed is distributed seemingly at random in memory.
For certain string processing applications, such as spellchecking, hash tables may be less efficient than tries, finite automata, or Judy arrays. Also, if each key is represented by a small enough number of bits, then, instead of a hash table, one may use the key directly as the index into an array of values. Note that there are no collisions in this case.
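A minimal sketch of that direct-addressing idea, assuming 8-bit keys purely for illustration:

```python
# Direct addressing: the key itself is the array index, so there is
# no hash function and there can be no collisions. Feasible only when
# the key space is small (here: one slot per possible 8-bit key).
table = [None] * 256

def put(key: int, value) -> None:
    table[key] = value

def get(key: int):
    return table[key]

put(42, "answer")
print(get(42))  # -> answer
```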
Hash tables are not sorted (use a map)
Hash tables are not best for head/tail inserts (use a linked list/deque)
Hash tables have overhead to support searching (compared to a plain vector/array)
The potential security issues of hash tables on the web should also be pointed out. If someone knows the hash function, that person can mount a denial-of-service attack by creating lots of items with the same hash code.
I don't get it, enum/symbol keys not wasteful enough? ;) What about just using the raw string pointer as the key? I must have overlooked some obvious advantage of hashing... but now, thinking about it, it makes less and less sense.
It's all just local representation anyway, right? I mean, I could share the data everywhere: APIs, IPC, or RPC. But I'm not sure how helpful those hashed keys are unless the full string is embedded too.
Meaning you just spent a lot of time hashing strings back and forth for your own amusement.
I'll just leave this here...
Given a hash table with collisions, the generic hash table implementation will cause lookups within a bucket to run in O(n), assuming that a linked list is used.
If we switch the linked list for a binary search tree, we go down to O(log n). Is this the best we can do, or is there a better data structure for this use case?
Using hash tables for the buckets themselves would bring the lookup time to O(1), but that would require clever revisions of the hash function.
There is a trade-off between insertion time and lookup time in your solution (keeping each bucket sorted).
If you keep every bucket sorted, you get O(log n) lookup time using binary search. However, when you insert a new element, you have to place it in the right position so the bucket stays sorted, which costs an O(log n) search per insertion.
So in your solution, you get O(log n) total complexity for both insertion and lookup.
(In contrast, the traditional solution takes O(n) for lookup in the worst case, and O(1) for insertion.)
EDIT:
If you choose to use a sorted bucket, of course you can't use a linked list any more; you can switch to any other suitable data structure.
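As a rough Python sketch of the sorted-bucket idea (class and method names are made up for illustration; note that a Python list insert still shifts elements, so a per-bucket balanced tree would be needed for a true O(log k) insertion):

```python
import bisect

class SortedBucketHash:
    """Hash table whose buckets are kept as sorted lists, so a bucket
    of size k is searched in O(log k) with binary search. Assumes keys
    that hash to the same bucket are mutually comparable."""

    def __init__(self, num_buckets=64):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        i = bisect.bisect_left(bucket, (key,))  # binary search for the slot
        if i < len(bucket) and bucket[i][0] == key:
            bucket[i] = (key, value)            # key exists: update in place
        else:
            bucket.insert(i, (key, value))      # keep the bucket sorted

    def lookup(self, key):
        bucket = self._bucket(key)
        i = bisect.bisect_left(bucket, (key,))
        if i < len(bucket) and bucket[i][0] == key:
            return bucket[i][1]
        return None
```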
Perfect hashing is known to achieve collision-free O(1) hashing of a limited set of keys known at the time the hash function is constructed. The Wikipedia article mentions several approaches for applying those ideas to a dynamic set of keys, such as dynamic perfect hashing and cuckoo hashing, which might be of interest to you.
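For a feel of how cuckoo hashing gets its worst-case O(1) lookups, here is a minimal, non-production sketch (the class name, seeding scheme, and rehash policy are assumptions made for illustration):

```python
import random

class CuckooHash:
    """Two tables, two hash functions; every key sits in one of its two
    candidate slots, so a lookup probes at most two cells."""

    def __init__(self, size=16):
        self.size = size
        self.tables = [[None] * size, [None] * size]
        self.seeds = [random.getrandbits(64), random.getrandbits(64)]

    def _slot(self, which, key):
        return hash((self.seeds[which], key)) % self.size

    def lookup(self, key):
        for which in (0, 1):                      # at most two probes
            entry = self.tables[which][self._slot(which, key)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def insert(self, key, value, max_kicks=32):
        for which in (0, 1):                      # update if key already stored
            i = self._slot(which, key)
            e = self.tables[which][i]
            if e is not None and e[0] == key:
                self.tables[which][i] = (key, value)
                return
        entry, which = (key, value), 0
        for _ in range(max_kicks):
            i = self._slot(which, entry[0])
            self.tables[which][i], entry = entry, self.tables[which][i]
            if entry is None:                     # landed in an empty slot
                return
            which = 1 - which                     # evicted entry tries the other table
        self._rehash()                            # probable cycle: rebuild and retry
        self.insert(*entry)

    def _rehash(self):
        entries = [e for t in self.tables for e in t if e is not None]
        self.size *= 2
        self.seeds = [random.getrandbits(64), random.getrandbits(64)]
        self.tables = [[None] * self.size, [None] * self.size]
        for k, v in entries:
            self.insert(k, v)
```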
You've pretty much answered your own question. Since a hash table is just an array of other data structures, your lookup time is just dependent on the lookup time of the secondary data structure and how well your hash function distributes items across the buckets.
While reading some material on data structure design for sparse vectors, I came across the following statements by the authors.
A hash table could be used to implement a simple index-to-value mapping. Accessing an index value is slower than with direct array access, but not by much.
Why is accessing an index value slower when using a hash table?
Further, the authors state that
The problem with a hash-backed implementation is that it becomes relatively slow to iterate through all values in order by index. An ordered mapping based on a tree structure or similar can address this problem, since it maintains keys in order. The price of this feature is longer access time.
Why does a hash-based implementation perform badly when iterating through all values in order? Is that due to the slower operation of accessing an index?
How can a tree structure help with this kind of issue?
Accessing a hash table index is just a bit slower because of the calculation overhead.
In a hash table, if you request item 452345435, it doesn't mean it's in cell 452345435. The hash table performs a series of calculations to find the right cell; this is implementation dependent.
Hash table Performance analysis
Hash tables don't store sorted data. So if you want to get the items in the right order, a sorting algorithm will need to be called.
To solve that, you can use a tree, or any other sorted data structure.
But that will increase the insertion complexity from O(1) (hash table) to O(log n) (inserting into a tree or other sorted structure).
That's because each index will be added to both data structures, and the total complexity will be O(1) + O(log n) = O(log n).
It will still take only O(1) to retrieve the data, because it's enough to request it from the hash table.
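A minimal Python sketch of that two-structure idea (names are illustrative; bisect.insort does an O(log n) search but an O(n) memory shift, so a real balanced tree is what actually gets insertion down to O(log n)):

```python
import bisect

class OrderedIndexMap:
    """dict for O(1) value access, plus a sorted key list so iteration
    in index order needs no separate sorting pass."""

    def __init__(self):
        self.values = {}
        self.sorted_keys = []

    def insert(self, index, value):
        if index not in self.values:
            bisect.insort(self.sorted_keys, index)  # keep keys sorted
        self.values[index] = value

    def get(self, index):
        return self.values.get(index)               # O(1): hash table only

    def items_in_order(self):
        for index in self.sorted_keys:              # ordered iteration
            yield index, self.values[index]
```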
I will never be deleting from this data structure, but will be doing a huge number of lookups and insertions (~a trillion lookups and insertions). What is the best data structure for handling this?
Red-black and AVL trees seem decent, but are there any better suited for this situation?
A hash table would seem to be ideal if you are only doing insertions and lookup by exact key.
Try Splay trees if you are doing insertions, and find/find-next on ordered keys.
I assume that most of your operations are going to be lookups, or you're going to need a whole heap of memory.
I would choose a red-black tree or a hash table.
Operations on a red-black tree are O(log₂ n).
If implemented right, a hash table can achieve O(1 + k/n), where k is the number of keys and n the number of buckets. If implemented wrong, it can be as bad as O(k). If what you are trying to do is make it as fast as possible, I would go with the hash table and do the extra work. Otherwise I would go with the red-black tree: it is fairly simple and you know your running time.
If all of the queries are successful (i.e., to elements that are actually stored in the table), then hashing is probably best, and you could experiment with various types of collision resolution, such as cuckoo hashing, which provides worst-case performance guarantees on lookups (see http://en.wikipedia.org/wiki/Hash_table).
If some queries are in between the stored keys, I would use van Emde Boas trees, y-fast trees, or fusion trees, which offer better performance than binary search trees (see http://courses.csail.mit.edu/6.851/spring10/scribe/lec09.pdf and http://courses.csail.mit.edu/6.851/spring10/scribe/lec10.pdf).
Is a trie the most recommended data structure for designing something like a dictionary for storing words? Are there any other alternatives that improve either time or memory performance?
I believe a hash may be good if there are no collisions, but then memory requirements start getting bad for overlapping words: over, overlap, overlaps, overlapped, and overlapping all occupy exclusive storage, while we could share space in a trie.
EDIT: Thanks @Moron and all of you for the very useful answers. I agree: generating the hash key is O(n), and so is a trie search. However, for hashing things can be worse, with chaining adding to the time, while for a trie this will not happen. My concern remains that for every node in a trie I need to keep a pointer, which may blow things up if the dictionary size is small.
A trie has the following advantages over a Hash table:
Looking up data in a trie is faster in the worst case, O(m) time, compared to an imperfect hash table. An imperfect hash table can have key collisions. A key collision is the hash function mapping of different keys to the same position in a hash table. The worst-case lookup speed in an imperfect hash table is O(N) time, but far more typically is O(1), with O(m) time spent evaluating the hash.
There are no collisions of different keys in a trie.
Buckets in a trie, which are analogous to the hash table buckets that store key collisions, are necessary only if a single key is associated with more than one value.
There is no need to provide a hash function or to change hash functions as more keys are added to a trie.
A trie can provide an alphabetical ordering of the entries by key.
Tries have the following drawbacks:
Tries can be slower in some cases than hash tables for looking up data, especially if the data is directly accessed on a hard disk drive or some other secondary storage device where the random access time is high compared to main memory.
It is not easy to represent all keys as strings. For example, for floating-point numbers, a straightforward encoding using their bitstring leads to long chains and prefixes that are not particularly meaningful.
If the drawbacks are something that you can live with, I'd suggest going with the trie.
Source: Wikipedia: Trie#As a replacement of other data structures
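To make the O(m) lookup and prefix sharing concrete, here is a minimal trie sketch in Python (the node layout is an illustrative assumption):

```python
class TrieNode:
    __slots__ = ("children", "is_word")   # the per-node pointer overhead the OP worries about

    def __init__(self):
        self.children = {}                # char -> TrieNode
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:                   # O(m) in the word length m
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word

t = Trie()
for w in ("over", "overlap", "overlaps"):
    t.insert(w)                           # the "over" nodes are shared by all three
print(t.contains("overlap"), t.contains("overl"))  # True False
```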
You can try considering a directed acyclic word graph (DAWG), which is basically a trie but with better memory usage; according to the wiki, for English the memory consumption is much lower than for a trie.
Time-wise, it is like a trie and likely better than a hash. Not sure where you got the O(log n) time for the hash; it should be O(n) for reasonable hash functions, where n is the length of the word being searched.
I guess that is the big question, eh? Maybe try looking at a Bloom filter?
http://en.wikipedia.org/wiki/Bloom_filter
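In case it helps, a minimal Bloom filter sketch in Python (the size, hash count, and double-hashing scheme are illustrative assumptions; a Bloom filter only answers "possibly present" or "definitely absent", with a tunable false-positive rate, so it complements rather than replaces the dictionary):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1 << 20, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                       # a big int used as a bit array

    def _positions(self, word):
        # Derive k bit positions from one digest (double-hashing trick).
        digest = hashlib.sha256(word.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits |= 1 << pos

    def might_contain(self, word):
        return all((self.bits >> pos) & 1 for pos in self._positions(word))

bf = BloomFilter()
bf.add("overlap")
print(bf.might_contain("overlap"))  # True
print(bf.might_contain("zzz"))      # almost certainly False
```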