How Do I Choose Between a Hash Table and a Trie (Prefix Tree)?

If I have to choose between a hash table and a prefix tree (trie), what are the discriminating factors that would lead me to choose one over the other? From my own naive point of view, it seems as though using a trie has some extra overhead since it isn't stored as an array, but in terms of run time (assuming the longest key is the longest English word) it can be essentially O(1) relative to that upper bound. Maybe the longest English word is 50 characters?
Hash tables are instant lookup once you have the index. Hashing the key to get the index, however, seems like it could easily take close to 50 steps.
Can someone provide a more experienced perspective on this? Thanks!

Advantages of tries:
The basics:
Predictable O(k) lookup time, where k is the size of the key
Lookup can take less than k time if the key isn't there
Supports ordered traversal
No need for a hash function
Deletion is straightforward
New operations:
You can quickly look up prefixes of keys, enumerate all entries with a given prefix, and so on (see the sketch after these lists).
Advantages of linked structure:
If there are many common prefixes, the space they require is shared.
Immutable tries can share structure. Instead of updating a trie in place, you can build a new one that differs only along one branch, elsewhere pointing into the old trie. This can be useful for concurrency, multiple simultaneous versions of a table, etc.
An immutable trie is compressible: it can share structure on the suffixes as well, by hash-consing.
Advantages of hash tables:
Everyone knows hash tables, right? Your system will already have a nice, well-optimized implementation, faster than tries for most purposes.
Your keys need not have any special structure.
More space-efficient than the obvious linked trie structure.
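
To make the prefix operations concrete, here is a minimal linked-trie sketch in Python; the names (TrieNode, words_with_prefix) are illustrative, not from any particular library.

class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def contains(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:          # bail out early: fewer than k steps
            return False
    return node.is_word

def words_with_prefix(root, prefix):
    """Enumerate all stored words starting with `prefix`."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return
    stack = [(node, prefix)]
    while stack:
        node, word = stack.pop()
        if node.is_word:
            yield word
        for ch, child in node.children.items():
            stack.append((child, word + ch))

root = TrieNode()
for w in ["farm", "farmer", "farm animals", "fox"]:
    insert(root, w)
print(sorted(words_with_prefix(root, "farm")))
# ['farm', 'farm animals', 'farmer']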

It all depends on what problem you're trying to solve. If all you need to do is insertions and lookups, go with a hash table. If you need to solve more complex problems such as prefix-related queries, then a trie might be the better solution.

Everyone knows hash tables and their uses, but lookup is not exactly constant time: it depends on how big the hash table is and on the computational complexity of the hash function.
Creating huge hash tables for efficient lookup is not an elegant solution in most industrial scenarios where even small latency/scalability matters (e.g., high-frequency trading). You also have to care about optimizing your data structures for the space they take up in memory, to reduce cache misses.
A very good example where a trie better suits the requirements is messaging middleware. You have a million subscribers and publishers of messages to various categories (in JMS terms, topics or exchanges). In such cases, if you want to filter messages based on topics (which are actually strings), you definitely do not want to create a hash table for the million subscriptions with a million topics. A better approach is to store the topics in a trie, so when filtering is done based on a topic match, its complexity is independent of the number of topics/subscriptions/publishers (it depends only on the length of the string). I like it because you can be creative with this data structure to optimize space requirements and hence have fewer cache misses.
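
As a sketch of that idea, here is a toy topic trie keyed on dot-separated topic segments; the names are made up for illustration, and a real broker would also handle wildcards, which this omits.

class TopicNode:
    def __init__(self):
        self.children = {}
        self.subscribers = []

def subscribe(root, topic, subscriber):
    node = root
    for segment in topic.split("."):
        node = node.children.setdefault(segment, TopicNode())
    node.subscribers.append(subscriber)

def match(root, topic):
    """Walk the trie once; cost depends on topic length, not subscriber count."""
    node = root
    for segment in topic.split("."):
        node = node.children.get(segment)
        if node is None:
            return []
    return node.subscribers

root = TopicNode()
subscribe(root, "sports.football", "alice")
subscribe(root, "sports.tennis", "bob")
print(match(root, "sports.football"))   # ['alice']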

Use a trie:
If you need an autocomplete feature
To find all words beginning with 'a' or 'axe', and so on
A suffix tree is a special kind of trie. Suffix trees have a whole list of advantages that a hash cannot cover.

Insertion and lookup on a trie are linear in the length of the input string, O(s).
A hash gives you O(1) for lookup and insertion, but first you have to compute the hash from the input string, which again is O(s).
In conclusion, the asymptotic time complexity is linear in both cases.
The trie has some more overhead from a data perspective, but you can choose a compressed trie, which will put you again more or less on a tie with the hash table.
To break the tie, ask yourself this question: do I need to look up full words only, or do I need to return all words matching a prefix (as in a predictive-text input system)? For the first case, go for a hash. It is simpler and cleaner code, and easier to test and maintain. For the more elaborate use case where prefixes or suffixes matter, go for a trie.
And if you do it just for fun, implementing a trie would put a Sunday afternoon to good use.

There's something I haven't seen anyone mention explicitly that I think is important to keep in mind. Both hash tables and tries of various kinds will typically have O(k) operations, where k is the length of the string in bits (or equivalently in chars).
This assumes you have a good hash function. If you don't want "farm" and "farm animals" to hash to the same value, then the hash function will have to use all the bits of the key, and so hashing "farm animals" should take about twice as long as hashing "farm" (unless you're in some sort of rolling-hash scenario, but there are somewhat similar operation-saving scenarios with tries too). And with a vanilla trie, it's clear why inserting "farm animals" will take about twice as long as just "farm". In the long run it's true with compressed tries as well.
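
To see this concretely, here is the (real) 64-bit FNV-1a hash written out in Python; its inner loop touches every byte of the key, so the cost grows with key length.

def fnv1a_64(key: str) -> int:
    h = 14695981039346656037              # FNV-1a 64-bit offset basis
    for byte in key.encode("utf-8"):      # one step per byte of the key
        h ^= byte
        h = (h * 1099511628211) % 2**64   # FNV-1a 64-bit prime
    return h

# hashing "farm animals" does roughly twice the work of hashing "farm"
print(fnv1a_64("farm") != fnv1a_64("farm animals"))  # True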

A hash table implementation is space-efficient compared to a basic trie implementation. But with strings, ordering is necessary in most practical applications, and a hash table totally disturbs the lexicographical order. If your application does operations based on lexicographical order (like partial search, all strings with a given prefix, all words in sorted order), you should use tries. For lookup only, a hash table should be used (as, arguably, it gives the minimum lookup time).
P.S.: Other than these, ternary search trees (TSTs) would be an excellent choice. Their lookup time is worse than a hash table's, but they are time-efficient in all other operations. They are also more space-efficient than tries.
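
For reference, here is a minimal TST sketch, unbalanced for brevity; the names are illustrative, and real implementations often randomize insertion order or balance the tree.

class TSTNode:
    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_word = False

def tst_insert(node, word, i=0):
    ch = word[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.lo = tst_insert(node.lo, word, i)
    elif ch > node.ch:
        node.hi = tst_insert(node.hi, word, i)
    elif i + 1 < len(word):
        node.eq = tst_insert(node.eq, word, i + 1)
    else:
        node.is_word = True
    return node

def tst_contains(node, word, i=0):
    if node is None:
        return False
    ch = word[i]
    if ch < node.ch:
        return tst_contains(node.lo, word, i)
    if ch > node.ch:
        return tst_contains(node.hi, word, i)
    if i + 1 < len(word):
        return tst_contains(node.eq, word, i + 1)
    return node.is_word

root = None
for w in ["cat", "cap", "car"]:
    root = tst_insert(root, w)
print(tst_contains(root, "cap"), tst_contains(root, "cow"))  # True False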

Some (usually embedded, real-time) applications require that the processing time be independent of the data. In that case, a hash table can guarantee a known execution time, while a trie varies based on the data.

Related

How does a HashMap retrieve a value with a hashed key?

I'm confused about the HashMap and Hashtable concepts, as when people say a HashMap is faster than a List. I'm clear on the hashing concept, in which a hash code is computed for the given key and the value is stored accordingly.
But when I want to retrieve the data, how does it work?
For example, say I'm storing n strings under n different keys in a HashMap.
If I want to retrieve the specific value associated with a specific key, how can it be returned in O(1) time? Won't the hashed key be compared with all the other keys?
Let's go on a word journey. Say you have a bunch of weird M&M's with all the letters.
Now your job is to vend people M&M's in the letter/color combo of their choosing.
You have some choices about how to organize your shop. (This act of organization will be, metaphorically, our hash function.)
You can sort your M&M's into buckets by color, or by letter, or by both. The question follows: what provides the fastest retrieval of a specific request?
The answer is rather intuitive: the sorting that leaves the fewest different M&M's in each bucket facilitates the most efficient querying.
Let's say someone asked if you had any green Qs. If all your M&M's are in a single bin or list or bucket or otherwise unstructured container, the answer will be far from accessible in O(1), compared to keeping an organized shop.
This analogy relies on the concept of separate chaining, where each hash key corresponds to a container of multiple elements.
Without this concept, the idea of hashing is, more generally, to spread keys uniformly throughout an array such that the amortized performance is constant. Collisions can be resolved through a variety of method variations, and the Wikipedia article will tell you all about it.
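
Here is a minimal sketch of the separate-chaining idea in Python: a lookup hashes the key to pick one bucket and compares only the handful of keys that landed there, not all n keys. The class name is made up for illustration.

class ChainedHashMap:
    def __init__(self, num_buckets=16):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # the hash picks exactly one bucket
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self._bucket(key):  # compares only keys in this bucket
            if k == key:
                return v
        raise KeyError(key)

m = ChainedHashMap()
m.put("green Q", "in stock")
print(m.get("green Q"))  # 'in stock'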
http://en.wikipedia.org/wiki/Hash_table
"If the set of key-value pairs is fixed and known ahead of time (so insertions and deletions are not allowed), one may reduce the average lookup cost by a careful choice of the hash function, bucket table size, and internal data structures. In particular, one may be able to devise a hash function that is collision-free, or even perfect "

Data structure for non-overlapping ranges of integers?

I remember learning a data structure that stored a set of integers as ranges in a tree, but it's been 10 years and I can't remember the name of the data structure, and I'm a bit fuzzy on the details. If it helps, it's a functional data structure that was taught at CMU, I believe in 15-212 (Principles of Programming) in 2002.
Basically, I want to store a set of integers, most of which are consecutive. I want to be able to query for set membership efficiently, add a range of integers efficiently, and remove a range of integers efficiently. In particular, I don't care to preserve what the original ranges are. It's better if adjacent ranges are coalesced into a single larger range.
A naive implementation would be to simply use a generic set data structure such as a HashSet or TreeSet, and add all integers in a range when adding a range, or remove all integers in a range when removing a range. But of course, that would waste a lot of memory in addition to making add and remove slow.
I'm thinking of a purely functional data structure, but for my current use I don't need it to be. IIRC, lookup, insertion, and deletion were all O(log N), where N was the number of ranges in the set.
So, can you tell me the name of the data structure I'm trying to remember, or a suitable alternative?
I found the old homework, and the data structure I had in mind was the Discrete Interval Encoding Tree, or diet for short. It is described in detail in "Diets for Fat Sets", Martin Erwig, Journal of Functional Programming, Vol. 8, No. 6, 627-632, 1998. It is basically a tree of intervals with the invariant that all of the intervals are non-overlapping and non-touching. There is a Haskell implementation on Hackage. I was hoping there would be an existing implementation for Scala, but I'm not seeing any.
The homework also included another data structure they called a Recursive Interval-Occluding Tree (RIOT), which, rather than keeping only an interval at each node, keeps an interval and another (possibly empty) RIOT of things removed from the interval. The assignment included benchmarks showing it did better than diets for random insertions and deletions. AFAICT it is simply something the TAs made up and never published, as it no longer seems to exist anywhere on the Internet, at least not under that name.
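
For illustration, here is a flat, sorted-list analogue of the diet invariant in Python. A real diet keeps the intervals in a balanced tree so updates are O(log n); this sketch is O(n) per insert but shows the coalescing behavior.

import bisect

class IntervalSet:
    def __init__(self):
        self.ivals = []  # sorted, disjoint, non-touching (lo, hi), inclusive

    def __contains__(self, x):
        # find the last interval whose lower bound is <= x
        i = bisect.bisect_right(self.ivals, (x, float("inf"))) - 1
        return i >= 0 and self.ivals[i][0] <= x <= self.ivals[i][1]

    def add(self, lo, hi):
        merged = []
        for a, b in self.ivals:
            if b + 1 < lo:            # entirely left of the new range
                merged.append((a, b))
            elif hi + 1 < a:          # entirely right: nothing more can touch
                break
            else:                     # overlaps or touches: coalesce
                lo, hi = min(lo, a), max(hi, b)
        merged.append((lo, hi))
        merged += [(a, b) for a, b in self.ivals if a > hi + 1]
        self.ivals = sorted(merged)

s = IntervalSet()
s.add(1, 5); s.add(8, 10); s.add(6, 7)    # 6..7 bridges the two ranges
print(s.ivals, 9 in s, 11 in s)           # [(1, 10)] True False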
You are probably looking for segment trees. This might be helpful: http://www.topcoder.com/tc?d1=tutorials&d2=lowestCommonAncestor&module=Static
You can also use binary search trees for the same purpose, where each node has two data fields: min_val and max_val.
During insertion, you just need to call an additional merging operation that checks whether the left child, parent, and right child form a contiguous sequence, so they can be merged into a single node. This will take O(log n) time.
Other operations like deletion and lookup will take O(log n) time as usual, but special measures need to be taken during deletion.

Data Structure for tuple indexing

I need a data structure that stores tuples and allows me to do this query: given a tuple (x,y,z) of integers, find the next one (an upper bound for it), considering the natural ordering (a,b,c) <= (d,e,f) <=> a<=d and b<=e and c<=f. I have tried MSD radix sort, which splits items into buckets and sorts them (doing this recursively for all positions in the tuples). Does anybody have any other suggestion? Ideally, I would like the above query to run in O(log n), where n is the number of tuples.
Two options.
Use binary search on a sorted array. If you build the keys (assuming 32-bit ints) as (a<<64)|(b<<32)|c and hold them in a simple array, packed one beside the other, you can use binary search to locate the value you are searching for (in C there is even a library function to do this), and the next one is simply one position along. Worst-case performance is O(log N), and if you can use interpolation search (http://en.wikipedia.org/wiki/Interpolation_search) you might even approach O(log log N).
The problem with packed binary keys is that it might be tricky to add new values, and you may need gyrations if you exceed available memory. But it is fast, with only a few random memory accesses on average.
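
Here is a sketch of the packed-key idea in Python, where arbitrary-precision integers make the shifts safe; note that packing gives the lexicographic order on (a, b, c). The function names are made up for illustration.

import bisect

def pack(a, b, c):
    # one integer key per tuple; Python ints are arbitrary precision,
    # so the 96-bit value is safe
    return (a << 64) | (b << 32) | c

def unpack(k):
    return (k >> 64, (k >> 32) & 0xFFFFFFFF, k & 0xFFFFFFFF)

keys = sorted(pack(a, b, c) for (a, b, c) in
              [(1, 2, 3), (1, 2, 7), (2, 0, 0), (5, 5, 5)])

def next_tuple(a, b, c):
    """Smallest stored tuple strictly greater than (a, b, c), in O(log n)."""
    i = bisect.bisect_right(keys, pack(a, b, c))
    return unpack(keys[i]) if i < len(keys) else None

print(next_tuple(1, 2, 3))  # (1, 2, 7)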
Alternatively, you could build a hash table by generating a key from a, b, and c in some form, and then have the hash data point to a structure containing the next value, whatever that might be. This is possibly a little harder to create in the first place, since when generating the table you need to know the next value already.
The problems with the hash approach are that it will likely use more memory than the binary search method, and performance is great until you get hash collisions, at which point it starts to drop off, although there are variations of the algorithm to help in some cases. The hash approach probably makes it much easier to insert new values.
I also see you had a similar question along these lines, so I guess the gist of what I am saying is: combine a, b, and c to produce a single long key, and use that with binary search, a hash, or even a B-tree. If the length of the key is your problem (what language are you using?), could you treat it as a string?
If this answer is completely off base, let me know and I will see if I can delete it, so your question remains unanswered rather than answered uselessly.

What's the best way to traverse a large dictionary of words?

Let's say I'm looking for a word that may or may not be in a dictionary of 95k words, and I cannot use word length to facilitate the search. My question is about the fastest way to find the word without doing an O(n) lookup.
Here are my two thoughts:
First, store the words in a hash table; lookup of a word is then O(1), which seems the best scenario in my mind. But going through different websites, using a trie was also suggested. My question regarding this is whether it's practical to have a trie that holds so many words.
The lookup would be O(k) in this case.
So what is the most optimal way of finding a word in a large dictionary?
Optimality depends on your use case: do you care about lookup time or space? (Also, do you care about inserting new words?)
The best you can do time-wise is to use a hash table, but for a dictionary it is space-inefficient. A trie compresses the space requirement because it shares common prefixes instead of storing every word in full, but lookups take longer. So, to answer your question, it is more space-efficient to have a trie with a large number of words than a hash table.
If you are just searching for a single word, the cost of setting up a hash table or tree structure would exceed that of a linear search. These structures become (very) efficient when their costs are amortized over (very) many uses.
If the dictionary is sorted (and why wouldn't a dictionary be?), then you can look for a single word in log(n) time with a binary search through the file, no additional structures needed.
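
As a minimal sketch of that suggestion, using Python's bisect over an in-memory sorted list (the word list here is a stand-in for the real 95k-word dictionary):

import bisect

words = sorted(["apple", "axe", "banana", "zebra"])  # your 95k words here

def in_dictionary(word):
    # O(log n) binary search; no extra structure beyond the sorted list
    i = bisect.bisect_left(words, word)
    return i < len(words) and words[i] == word

print(in_dictionary("axe"), in_dictionary("axle"))  # True False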
I think the best way to find a word in a dictionary is a B+ tree, and let me explain the reason why.
Let's say you have a root block of 10 strings. The strings in the block are sorted, and each of these 10 strings is followed by a pointer to another block of 10 strings, and so on. So the only thing you have to do is string-compare your keyword, starting with the first entry, until you find a word that is smaller in comparison.
If we take it as given that each string has, next to it, a pointer to a block of words that compare smaller, it will take you 5 steps and 5 comparisons to reach the final block of data that may or may not contain your keyword.
In 5 comparisons plus the comparisons in the final block, you are searching a dictionary of 10*10*10*10*10 words.
The algorithm is of logarithmic speed: log of 100,000 with base the number of strings in a block. If each block has 10 words, you need 5 steps.
I should mention that only the root of the tree needs to be stored in RAM. All the other blocks can be stored on the hard drive without significant loss in performance, because of the few steps.
Hope I explained it right :D At least I tried! Have fun.
A trie is preferable because this data structure can be faster than a hash table. Hash tables are O(1) only in the ideal case; in real-world applications collisions can occur. The various trie data structures don't suffer from this.
Another case is compression. Tries can be much more compact than hash tables. A hash table requires spare space for efficient insert operations; if the load factor of the hash table is close to 100%, inserts take a very long time.
With a hash table you must compare your key with at least one key from the dictionary, and key comparison in that case takes O(k), where k is the key length. With a trie you are doing the same thing: your lookup operation is O(k).
Tries allow ordered traversal; hash tables don't.
There are many types of tries out there; for example, the ternary search trie is very good in this particular case. Array mapped tries are also very fast compared to a regular hash table.

Hash table vs Balanced binary tree

What factors should I take into account when I need to choose between a hash table or a balanced binary tree in order to implement a set or an associative array?
This question cannot be answered, in general, I fear.
The issue is that there are many types of hash tables and balanced binary trees, and their performances vary widely.
So, the naive answer is: it depends on the functionality you need. Use a hash table if you do not need ordering and a balanced binary tree otherwise.
For a more elaborate answer, let's consider some alternatives.
Hash Table (see Wikipedia's entry for some basics)
Not all hash tables use a linked-list as a bucket. A popular alternative is to use a "better" bucket, for example a binary tree, or another hash table (with another hash function), ...
Some hash tables do not use buckets at all: see Open Addressing (they come with other issues, obviously)
There is something called linear re-hashing (it's a quality-of-implementation detail) that avoids the "stop-the-world-and-rehash" pitfall: during the migration phase you insert only into the "new" table, and you also move one "old" entry into the "new" table per operation. Of course, the migration phase means double lookups, etc. (see the sketch just below).
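
Here is a toy sketch of that migration protocol, using two plain dicts as stand-ins for the old and new bucket arrays; the class and names are made up for illustration.

class IncrementalMap:
    def __init__(self, old_entries=None):
        self.old = dict(old_entries or {})  # entries still awaiting migration
        self.new = {}

    def put(self, key, value):
        self.new[key] = value
        if self.old:                        # migrate one old entry per insert
            k, v = self.old.popitem()
            self.new.setdefault(k, v)       # don't clobber a newer value

    def get(self, key):
        if key in self.new:                 # double lookup during migration
            return self.new[key]
        return self.old[key]

m = IncrementalMap({"a": 1, "b": 2})
m.put("c", 3)                  # also moves one of "a"/"b" across
print(m.get("a"), m.get("c"))  # 1 3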
Binary Tree
Re-balancing is costly, you may consider a Skip-List (also better for multi-threaded accesses) or a Splay Tree.
A good allocator can "pack" nodes together in memory (better caching behavior), even though this does not alleviate the pointer-look-up issue.
B-Tree and variants also offer "packing"
Let's not forget that O(1) is an asymptotic complexity. For few elements, the constant factor is usually more important (performance-wise), which is especially true if your hash function is slow...
Finally, for sets, you may also wish to consider probabilistic data structures, like Bloom Filters.
Hash tables are generally better if there isn't any need to keep the data in any sort of sequence. Binary trees are better if the data must be kept sorted.
A worthy point on a modern architecture: a hash table will usually, if its load factor is low, have fewer memory reads than a binary tree. Since memory accesses tend to be rather costly compared to burning CPU cycles, the hash table is often faster.
In the following, the binary tree is assumed to be self-balancing, like a red-black tree, an AVL tree, or a treap.
On the other hand, if you need to rehash everything in the hash table when you decide to extend it, that can be a costly operation (though its cost is amortized over insertions). Binary trees do not have this limitation.
Binary trees are easier to implement in purely functional languages.
Binary trees have a natural sort order and a natural way to walk the tree for all elements.
When the load factor in the hash table is low, you may be wasting a lot of memory space; on the other hand, with two pointers per node, binary trees tend to take up more space per element.
Hash tables are nearly O(1) (depending on how you handle the load factor) vs. binary trees' O(lg n).
Trees tend to be the "average performer": there is nothing they do particularly well, but nothing they do particularly badly either.
Hash tables have faster lookups:
You need a hash function that gives an even distribution (otherwise you'll collide a lot and have to fall back on something other than the hash, like a linear search).
Hashes can use a lot of empty space. You may reserve 256 entries but only need 8 (so far).
Binary trees:
Deterministic lookup, O(log n).
Don't need the extra space hash tables can waste.
Must be kept ordered; an insertion has to find its place in the tree and may trigger rebalancing.
A binary search tree requires a total order relationship among the keys. A hash table requires only an equivalence or identity relationship with a consistent hash function.
If a total order relationship is available, then a sorted array has lookup performance comparable to binary trees, worst-case insert performance in the order of hash tables, and less complexity and memory use than both.
The worst-case insertion complexity for a hash table can be kept at O(1)/O(log K) (with K the number of elements sharing the same hash) if it's acceptable to increase the worst-case lookup complexity to O(K), or to O(log K) if the colliding elements can be kept sorted.
Invariants for both trees and hash tables are expensive to restore if the keys change, but still cheaper than the O(N log N) re-sort a sorted array would need.
These are factors to take into account in deciding which implementation to use:
Availability of a total order relationship.
Availability of a good hashing function for the equivalence relationship.
A priori knowledge of the number of elements.
Knowledge about the rate of insertions, deletions, and lookups.
Relative complexity of the comparison and hashing functions.
If you only need to access single elements, hash tables are better. If you need to retrieve a range of elements, you simply have no option other than binary trees.
To add to the other great answers above, I'd say:
Use a hash table if the amount of data will not change (e.g., storing constants), but if the amount of data will change, use a tree. This is due to the fact that, in a hash table, once the load factor has been reached, the table must resize, and the resize operation can be very slow.
One point that I don't think has been addressed is that trees are much better for persistent data structures, that is, immutable structures. A standard hash table (i.e., one that uses a single array of linked lists) cannot be nondestructively updated without copying the whole table. One situation in which this is relevant is if two concurrent functions both have a reference to the same hash table and one of them changes it (if the table is mutable, that change will be visible to the other as well). Another situation would be something like the following:
def bar(table):
    # some intern stuck this line of code in
    table["hello"] = "world"
    return table["the answer"]

def foo(x, y, table):
    z = bar(table)
    if "hello" in table:
        raise Exception("failed catastrophically!")
    return x + y + z

important_result = foo(1, 2, {
    "the answer": 5,
    "this table": "doesn't contain hello",
    "so it should": "be ok",
})
# catastrophic failure occurs
With a mutable table, we can't guarantee that the table a function call receives will remain that table throughout its execution, because other function calls might modify it.
So, mutability is sometimes not a pleasant thing. Now, a way around this would be to keep the table immutable, and have updates return a new table without modifying the old one. But with a hash table this would often be a costly O(n) operation, since the entire underlying array would need to be copied. On the other hand, with a balanced tree, a new tree can be generated with only O(log n) nodes needing to be created (the rest of the tree being identical).
This means that an efficient tree can be very convenient when immutable maps are desired.
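
Here is a sketch of that path-copying idea on an immutable (and, for brevity, unbalanced) BST in Python; insert returns a new root and shares every subtree it didn't touch.

class Node:
    __slots__ = ("key", "value", "left", "right")
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

def insert(node, key, value):
    """Return a new root; the old tree is never mutated."""
    if node is None:
        return Node(key, value)
    if key < node.key:
        return Node(node.key, node.value, insert(node.left, key, value), node.right)
    if key > node.key:
        return Node(node.key, node.value, node.left, insert(node.right, key, value))
    return Node(key, value, node.left, node.right)  # replace existing key

t1 = None
for k in ["m", "f", "t", "c", "h"]:
    t1 = insert(t1, k, k.upper())
t2 = insert(t1, "a", "A")     # copies only the path m -> f -> c
print(t2.right is t1.right)   # True: the whole "t" subtree is shared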
If you'll have many slightly different instances of sets, you'll probably want them to share structure. This is easy with trees (if they're immutable or copy-on-write). I'm not sure how well you can do it with hash tables; it's at least less obvious.
In my experience, hash tables are always faster because trees suffer too much from cache effects.
To see some real data, you can check the benchmark page of my TommyDS library: http://tommyds.sourceforge.net/
There you can see the comparative performance of the most common hash table, tree, and trie libraries.
One point to note concerns traversal and the minimum and maximum items. Hash tables don't support any kind of ordered traversal, or access to the minimum or maximum item. If these capabilities are important, the binary tree is a better choice.
