Under which circumstance should I use tries instead of binary trees/hash tables? [duplicate] - algorithm

This question already has answers here:
How Do I Choose Between a Hash Table and a Trie (Prefix Tree)?
(8 answers)
Closed 9 years ago.
Does the fact that the keys are usually strings, make it more useful for collections of string data? I know that a hash table uses less space, because it has a chunk of memory allocated to it, rather than for each character of each string.
In terms of search, O(m) is the worst case, where m is the length of a key. Binary tree search is O(log n), so I guess I should compare which is more efficient depending on a situation?
P.S. Before you vote to close, this not an opinion question. I need real facts regarding data structures to make the optimal choice.
Thank you

You have to decide what you are looking for in terms of use cases.
As far as facts are considered, here are the points we should keep in mind.
HashTable stores data for key, and it can be used only if you want to search a particular string.
So, if you want to search all strings starting with K, then you will have to iterate whole Hashtable, and order information is also lost while inserting data in table.
As far as BST is considered, it is easy to store strings in it and it will store strings are per it's natural ordering, but at each node it will have to match all the characters , and that is not good from search time point of view.
Now coming to Trie, unlike Hashtable and BST, Trie is not good from storage point of view, and it will take too much space, but from search point of view, it is much faster.
Once again, it all depends what you want to buy and at what price , based on this you can go for Hashtable,BST,Trie or SuffixTree.

Related

Data Structure for fast searching

If I have to develop an application for a data grid station of an institute. The purpose of application is to receive the data from data GRID station once in a week between 10 A.M to 10:30 A.M and then store it into a data structure and data is consist of digits only but the numbers could be very long for one entry then which data structure will be the best for given scenario from array, list, linked list, doubly linked list, queue, priority queue, stack, binary search tree, AVL trees, threaded binary tree, heap, sorted sequential array and skip list
I want to store sorted digits. The sorted data can be in ascending or descending order and the main concern is "fast and efficient searching".
From your description I gather that you don't store any other data with the digits or numbers. So basically you want to know if a number is in the set or not.
Fastest way to know this, is to have an array of flags for each number. Let's say you deal with numbers from 1 to 1000. You want to know if number 200 is in the set. Look at position 200 wether the flag is true or false. You see, this is the fastest method, because you only look up one place.
As we are talking about boolean flags here, a bit is sufficient for storage. You would decide wether to store the booleans in bits, bytes, words or whatever, depending on the number of numbers, the available memory and the machine's architecture.
Having said this, you may have to deal with so many numbers that above approach is no more feasible. It would be fastest in theory, but with limited memory, swaps to hard disk, many, many reads from it, other algorithms may prove better. You would have the choice between:
storing the numbers contiguously and perform a binary search on them
storing the numbers in a binary tree
using a hash algorithm
Which of these proves most efficient, again depends on your data and the machine.
It depends what type of searching you want to do. If you just want to know if a number is within your dataset, then a hash will be extremely fast and independent of the size of your dataset. And there is no need to sort, or even any concept of order.
If I may quote Larry Wall, author of Perl:
Doing linear scans over an associative array is like trying to club
someone to death with a loaded Uzi.
(An associative array is synonymous with a hash.)

Data Structure for tuple indexing

I need a data structure that stores tuples and would allow me to do a query like: given tuple (x,y,z) of integers, find the next one (an upped bound for it). By that I mean considering the natural ordering (a,b,c)<=(d,e,f) <=> a<=d and b<=e and c<=f. I have tried MSD radix sort, which splits items into buckets and sorts them (and does this recursively for all positions in the tuples). Does anybody have any other suggestion? Ideally I would like the abouve query to happen within O(log n) where n is the number of tuples.
Two options.
Use binary search on a sorted array. If you build the keys ( assuming 32bit int)' with (a<<64)|(b<<32)|c and hold them in a simple array, packed one beside the other, you can use binary search to locate the value you are searching for ( if using C, there is even a library function to do this), and the next one is simply one position along. Worst case Performance is O(logN), and if you can do http://en.wikipedia.org/wiki/Interpolation_search then you might even approach O(log log N)
Problem with binary keys is might be tricky to add new values, might need gyrations if you will exceed available memory. But it is fast, only a few random memory accesses on average.
Alternatively, you could build a hash table by generating a key with a|b|c in some form, and then have the hash data pointing to a structure that contains the next value, whatever that might be. Possibly a little harder to create in the first place as when generating the table you need to know the next value already.
Problems with hash approach are it will likely use more memory than binary search method, performance is great if you don't get hash collisions, but then starts to drop off, although there a variations around this algorithm to help in some cases. Hash approach is possibly much easier to insert new values.
I also see you had a similar question along these lines, so I guess the guts of what I am saying is combine A,b,c to produce a single long key, and use that with binary search, hash or even b-tree. If the length of the key is your problem (what language), could you treat it as a string?
If this answer is completely off base, let me know and I will see if I can delete this answer, so you questions remains unanswered rather than a useless answer.

whats the best way to traverse a large dictionary of words?

lets say I'm looking for a word that may or may not be in a dictionary of 95k words - I Cannot use word length to facilitate search. My question is in regards to the fastest way to find the word without doing a O(n) look up.
Here are my two thoughts:
first, store the words in a hast table, look up of the word is O(1), this seems the best scenario in my mind, but going through different websites using Trie was also suggested, my question regarding this is whether its practical to have a Trie that holds so many words.
The lookup would be O(k) in this case.
So what is the most optimal way of finding a word in a large dictionary?
Optimality depends on your use case - do you care about look up-time or space? (also, do you care about inserting new words?).
The best you can do time-wise is to use a hash table, but for a dictionary, it is space-inefficient. A trie compresses the space requirement because it stores prefixes, not the entire word, but takes longer to look up. So, to answer your question, it is more space efficient to have a trie with a large number of words than a hash table.
If you are just searching for a single word, the cost of setting up a hash table or tree structure would exceed a linear search. These structures become (very) efficient when their costs are amortized over (very) many uses.
If the dictionary is sorted (and why wouldn't a dictionary be?), then you can look for a single word in log(n) time with a binary search through the file, no additional structures needed.
I think the best way to find a word in a dictionary is a B+ tree.And let me explain you the reason.
Lets say you have a root block of 10 strings.The strings in the block are sorted.These 10 strings are followed by a pointer to another cell of 10 strings and that goes one.So the only thing you have to do is just String compare your Key word starting by the First one until you find a word smaller in comparison (StringCompare).
If we take it as standard that each string has next to it a pointer that shows to a cell with words that are smaller in comparison,it will take you 5 steps and 5 comparisons to end to the final bracket of data that will may or may not contain your Key word.
in 5 comparisons + the comparisons in the final bracket you are searching a dictionary of 10*10*10*10*10 words.
The algorithm is of logarithmic speed Log 100000 with base the number of strings in the cell.If each cell has 10 words you need 5 steps.
I must mention that only the Root of the tree must be stored in the Ram memory.All the other blocks can be stored in the hard drive without significant loss in performance because of the few steps.
Hope i explained right :D At least i tried! have fun
Trie is preferable because this data-structure can be faster than hash-table. Hash tables is O(1) only in ideal case, in real world applications collisions can occur. Different types of trie data structure doesn't suffer from this.
Another case is compression. Trie are much more compact than hash table. Hash table require some space for efficient insert operations. If load factor of the hash table are colse to 100% than insert operations takes very long time.
With hash tables you must compare your key with at least one key from the dictionary, key comparison in this case takes O(k) where k in key length. With trie you are doing the same thing, your lookup operations is O(k).
Tries allow ordered traversal, hash tables - don't.
There is many types of tries out there, for example ternary search trie is verty good in this particular case. Array mapped trie are also very fast, compared to regular hash table.

Hash table vs Balanced binary tree [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
What factors should I take into account when I need to choose between a hash table or a balanced binary tree in order to implement a set or an associative array?
This question cannot be answered, in general, I fear.
The issue is that there are many types of hash tables and balanced binary trees, and their performances vary widely.
So, the naive answer is: it depends on the functionality you need. Use a hash table if you do not need ordering and a balanced binary tree otherwise.
For a more elaborate answer, let's consider some alternatives.
Hash Table (see Wikipedia's entry for some basics)
Not all hash tables use a linked-list as a bucket. A popular alternative is to use a "better" bucket, for example a binary tree, or another hash table (with another hash function), ...
Some hash tables do not use buckets at all: see Open Addressing (they come with other issues, obviously)
There is something called Linear re-hashing (it's a quality of implementation detail), which avoids the "stop-the-world-and-rehash" pitfall. Basically during the migration phase you only insert in the "new" table, and also move one "old" entry into the "new" table. Of course, migration phase means double look-up etc...
Binary Tree
Re-balancing is costly, you may consider a Skip-List (also better for multi-threaded accesses) or a Splay Tree.
A good allocator can "pack" nodes together in memory (better caching behavior), even though this does not alleviate the pointer-look-up issue.
B-Tree and variants also offer "packing"
Let's not forget that O(1) is an asymptotic complexity. For few elements, the coefficient is usually more important (performance-wise). Which is especially true if your hash function is slow...
Finally, for sets, you may also wish to consider probabilistic data structures, like Bloom Filters.
Hash tables are generally better if there isn't any need to keep the data in any sort of sequence. Binary trees are better if the data must be kept sorted.
A worthy point on a modern architecture: A Hash table will usually, if its load factor is low, have fewer memory reads than a binary tree will. Since memory access tend to be rather costly compared to burning CPU cycles, the Hash table is often faster.
In the following Binary tree is assumed to be self-balancing, like a red black tree, an AVL tree or like a treap.
On the other hand, if you need to rehash everything in the hash table when you decide to extend it, this may be a costly operation which occur (amortized). Binary trees does not have this limitation.
Binary trees are easier to implement in purely functional languages.
Binary trees have a natural sort order and a natural way to walk the tree for all elements.
When the load factor in the hash table is low, you may be wasting a lot of memory space, but with two pointers, binary trees tend to take up more space.
Hash tables are nearly O(1) (depending on how you handle the load factor) vs. Bin trees O(lg n).
Trees tend to be the "average performer". There are nothing they do particularly well, but then nothing they do particularly bad.
Hash tables are faster lookups:
You need a key that generates an even distribution (otherwise you'll miss a lot and have to rely on something other than hash; like a linear search).
Hash's can use a lot of empty space. You may reserve 256 entries but only need 8 (so far).
Binary trees:
Deterministic. O(log n) I think...
Don't need extra space like hash tables can
Must be kept sorted. Adding an element in the middle means moving the rest around.
A binary search tree requires a total order relationship among the keys. A hash table requires only an equivalence or identity relationship with a consistent hash function.
If a total order relationship is available, then a sorted array has lookup performance comparable to binary trees, worst-case insert performance in the order of hash tables, and less complexity and memory use than both.
The worst-case insertion complexity for a hash table can be left at O(1)/O(log K) (with K the number of elements with the same hash) if it's acceptable to increase the worst-case lookup complexity to O(K) or O(log K) if the elements can be sorted.
Invariants for both trees and hash tables are expensive to restore if the keys change, but less than O(n log N) for sorted arrays.
These are factors to take into account in deciding which implementation to use:
Availability of a total order relationship.
Availability of a good hashing function for the equivalence relationship.
A-priory knowledge of the number of elements.
Knowledge about the rate of insertions, deletions, and lookups.
Relative complexity of the comparison and hashing functions.
If you only need to access single elements, hashtables are better. If you need a range of elements, you simply have no other option than binary trees.
To add to the other great answers above, I'd say:
Use a hash table if the amount of data will not change (e.g. storing constants); but, if the amount of data will change, use a tree. This is due to the fact that, in a hash table, once the load factor has been reached, the hash table must resize. The resize operation can be very slow.
One point that I don't think has been addressed is that trees are much better for persistent data structures. That is, immutable structures. A standard hash table (i.e. one that uses a single array of linked lists) cannot be modified without modifying the whole table. One situation in which this is relevant is if two concurrent functions both have a copy of a hash table, and one of them changes the table (if the table is mutable, that change will be visible to the other one as well). Another situation would be something like the following:
def bar(table):
# some intern stuck this line of code in
table["hello"] = "world"
return table["the answer"]
def foo(x, y, table):
z = bar(table)
if "hello" in table:
raise Exception("failed catastrophically!")
return x + y + z
important_result = foo(1, 2, {
"the answer": 5,
"this table": "doesn't contain hello",
"so it should": "be ok"
})
# catastrophic failure occurs
With a mutable table, we can't guarantee that the table a function call receives will remain that table throughout its execution, because other function calls might modify it.
So, mutability is sometimes not a pleasant thing. Now, a way around this would be to keep the table immutable, and have updates return a new table without modifying the old one. But with a hash table this would often be a costly O(n) operation, since the entire underlying array would need to be copied. On the other hand, with a balanced tree, a new tree can be generated with only O(log n) nodes needing to be created (the rest of the tree being identical).
This means that an efficient tree can be very convenient when immutable maps are desired.
If you''ll have many slightly-different instances of sets, you'll probably want them to share structure. This is easy with trees (if they're immutable or copy-on-write). I'm not sure how well you can do it with hashtables; it's at least less obvious.
In my experience, hastables are always faster because trees suffer too much of cache effects.
To see some real data, you can check the benchmark page of my TommyDS library http://tommyds.sourceforge.net/
Here you can see compared the performance of the most common hashtable, tree and trie libraries available.
One point to note is about the traversal, minimum and maximum item. Hash tables don’t support any kind of ordered traversal, or access to the minimum or maximum items. If these capabilities are important, the binary tree is a better choice.

How Do I Choose Between a Hash Table and a Trie (Prefix Tree)?

So if I have to choose between a hash table or a prefix tree what are the discriminating factors that would lead me to choose one over the other. From my own naive point of view it seems as though using a trie has some extra overhead since it isn't stored as an array but that in terms of run time (assuming the longest key is the longest english word) it can be essentially O(1) (in relation to the upper bound). Maybe the longest english word is 50 characters?
Hash tables are instant look up once you get the index. Hashing the key to get the index however seems like it could easily take near 50 steps.
Can someone provide me a more experienced perspective on this? Thanks!
Advantages of tries:
The basics:
Predictable O(k) lookup time where k is the size of the key
Lookup can take less than k time if it's not there
Supports ordered traversal
No need for a hash function
Deletion is straightforward
New operations:
You can quickly look up prefixes of keys, enumerate all entries with a given prefix, etc.
Advantages of linked structure:
If there are many common prefixes, the space they require is shared.
Immutable tries can share structure. Instead of updating a trie in place, you can build a new one that's different only along one branch, elsewhere pointing into the old trie. This can be useful for concurrency, multiple simultaneous versions of a table, etc.
An immutable trie is compressible. That is, it can share structure on the suffixes as well, by hash-consing.
Advantages of hashtables:
Everyone knows hashtables, right? Your system will already have a nice well-optimized implementation, faster than tries for most purposes.
Your keys need not have any special structure.
More space-efficient than the obvious linked trie structure (see comments below)
It all depends on what problem you're trying to solve. If all you need to do is insertions and lookups, go with a hash table. If you need to solve more complex problems such as prefix-related queries, then a trie might be the better solution.
Everyone knows hash table and its uses but it is not exactly constant look up time , it depends on how big the hash table is , the computational complexity of the hash function.
Creating huge hash tables for efficient lookup is not an elegant solution in most of the industrial scenarios where even small latency/scalability matters (e.g.: high frequency trading). You have to care about the data structures to be optimized for space it takes up in memory too to reduce cache miss.
A very good example where trie better suits the requirements is messaging middleware . You have a million subscribers and publishers of messages to various categories (in JMS terms - Topics or exchanges) , in such cases if you want to filter out messages based on topics (which are actually strings) , you definitely do not want create hash table for the million subscriptions with million topics . A better approach is store the topics in trie , so when filtering is done based on topic match , its complexity is independent of number of topics/subscriptions/publishers (only depends on the length of string). I like it because you can be creative with this data structure to optimize space requirements and hence have lower cache miss.
Use a tree:
If you need auto complete feature
Find all words beginning with 'a' or 'axe' so on.
A suffix tree is a special form of a tree. Suffix trees have a whole list of advantages that hash cannot cover.
Insertion and lookup on a trie is linear with the lengh of the input string O(s).
A hash will give you a O(1) for lookup ans insertion, but first you have to calculate the hash based on the input string which again is O(s).
Conclussion, the asymptotic time complexity is linear in both cases.
The trie has some more overhead from data perspective, but you can choose a compressed trie which will put you again, more or less on a tie with the hash table.
To break the tie ask yourself this question: Do i need to lookup for full words only? Or do I need to return all words matching a prefix? (As in a predictive text input system ). For the first case, go for a hash. It is simpler and cleaner code. Easier to test and maintain. For a more ellaborated use case where prefixes or sufixes matter, go for a trie.
And if you do it just for fun, implementing a trie would put a Sunday afternoon to a good use.
There's something I haven't seen anyone mention explicitly that I think is important to keep in mind. Both hash tables and tries of various kinds will typically have O(k) operations, where k is the length of the string in bits (or equivalently in chars).
This is assuming you have a good hash function. If you don't want "farm" and "farm animals" to hash to the same value, then the hash function will have to use all the bits of the key, and so hashing "farm animals" should take about twice as long as "farm" (unless you're in some sort of rolling hash scenario, but there are somewhat similar operation-saving scenarios with tries too). And with a vanilla trie, it's clear why inserting "farm animals" will take about twice as long as just "farm". In the long run it's true with compressed tries as well.
HashTable implementation is space efficient as compared to basic Trie implementation. But with strings, ordering is necessary in most of the practical applications. But HashTable totally disturbs the lexographical order. Now, if your application is doing operations based on lexographical order (like partial search, all strings with given prefix, all words in sorted order), you should use Tries. For only lookup, HashTable should be used (as arguably, it gives minimum lookup time).
P.S.: Other than these, Ternary Search Trees (TSTs) would be an excellent choice. Its lookup time is more than HashTable, but is time-efficient in all other operations. Also, its more space efficient than tries.
Some (usually embedded, real-time) applications require that the processing time be independent of the data. In that case, a hash table can guarantee a known execution time, while a trie varies based on the data.

Resources