What's the best way to traverse a large dictionary of words? - algorithm

Let's say I'm looking for a word that may or may not be in a dictionary of 95k words, and I cannot use word length to narrow the search. My question is about the fastest way to find the word without doing an O(n) lookup.
Here are my two thoughts:
First thought: store the words in a hash table; lookup of a word is then O(1), which seems like the best scenario to me. But on various websites a trie was also suggested, and my question there is whether it's practical to have a trie that holds so many words.
The lookup would be O(k) in that case, where k is the word length.
So what is the most optimal way of finding a word in a large dictionary?

Optimality depends on your use case: do you care about lookup time or space? (And do you care about inserting new words?)
The best you can do time-wise is a hash table, but for a dictionary it is space-inefficient. A trie compresses the space requirement because it stores shared prefixes once rather than every word in full, but it takes longer to look up. So, to answer your question: it is more space-efficient to hold a large number of words in a trie than in a hash table.

If you are just searching for a single word, the cost of setting up a hash table or tree structure would exceed a linear search. These structures become (very) efficient when their costs are amortized over (very) many uses.
If the dictionary is sorted (and why wouldn't a dictionary be?), then you can look for a single word in log(n) time with a binary search through the file, no additional structures needed.
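To make the binary-search suggestion concrete, here is a minimal sketch in Python, assuming the dictionary is available as one lowercase word per line, already sorted (the file name is purely illustrative):

import bisect

def load_sorted_words(path):
    # Assumes one lowercase word per line, already in sorted order.
    with open(path) as f:
        return [line.strip() for line in f]

def contains(sorted_words, word):
    # Binary search: O(log n) comparisons, no extra structure needed.
    i = bisect.bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word

# Hypothetical usage:
# words = load_sorted_words("dictionary.txt")
# print(contains(words, "zephyr"))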

I think the best way to find a word in a dictionary is a B+ tree, and let me explain why.
Let's say you have a root block of 10 strings. The strings in the block are sorted, and each of those 10 strings carries a pointer to another block of 10 strings, and so on. So all you have to do is string-compare your key word against the strings in the block, starting with the first one, until you reach a string that your key is smaller than (a plain string comparison).
Taking as given that each string has, next to it, a pointer to a block of words that compare smaller than it, it will take you 5 such steps to reach the final block of data, which may or may not contain your key word.
In those 5 steps plus the comparisons in the final block, you are searching a dictionary of 10*10*10*10*10 = 100,000 words.
The algorithm runs in logarithmic time: log of 100,000 with the number of strings per block as the base. If each block holds 10 words, you need 5 steps.
I should mention that only the root of the tree has to be kept in RAM. All the other blocks can stay on disk without a significant loss in performance, because so few steps are needed.
Hope I explained it right :D At least I tried! Have fun.
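Here is a toy sketch of the block-descent idea described above (not a real B+ tree: no insertion, balancing, or disk blocks), assuming each node holds a small sorted block of keys plus one child pointer per key position:

import bisect

class Node:
    def __init__(self, keys, children=None):
        self.keys = keys          # small sorted "block" of strings
        self.children = children  # None for leaves, else len(keys) + 1 child blocks

def search(node, word):
    while node is not None:
        i = bisect.bisect_left(node.keys, word)
        if i < len(node.keys) and node.keys[i] == word:
            return True
        if node.children is None:
            return False
        node = node.children[i]   # child i covers words smaller than keys[i]
    return False

# Tiny illustrative tree:
root = Node(["cat", "hen"],
            [Node(["ant", "bee"]), Node(["dog", "fox"]), Node(["owl", "yak"])])
print(search(root, "fox"), search(root, "emu"))   # True False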

A trie is preferable because this data structure can be faster than a hash table. Hash tables are O(1) only in the ideal case; in real-world applications collisions occur. The various trie data structures don't suffer from this.
Another point is compression. Tries can be much more compact than hash tables. A hash table needs spare space for efficient insert operations: if the load factor gets close to 100%, inserts take a very long time.
With a hash table you must compare your key against at least one key from the dictionary, and that key comparison takes O(k), where k is the key length. With a trie you are doing the same amount of work: a lookup is O(k).
Tries allow ordered traversal; hash tables don't.
There are many kinds of tries out there; for example, a ternary search trie is very good in this particular case. Array-mapped tries are also very fast compared to a regular hash table.
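For a sense of scale, a plain uncompressed trie over ~95k words is quite workable; here is a minimal dict-of-dicts sketch (a real implementation would more likely use a ternary search trie or an array-mapped trie, as mentioned above):

class Trie:
    def __init__(self):
        self.root = {}

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True              # end-of-word marker

    def contains(self, word):
        node = self.root
        for ch in word:               # O(k), k = word length
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

t = Trie()
for w in ("car", "care", "cart"):
    t.insert(w)
print(t.contains("car"), t.contains("ca"))   # True False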

Related

What is the fastest way to lookup an item from a small set of items by key?

Say I have a class with a fields array. The fields each have a name. Basically, like a SQL table.
class X {
foo: String
bar: String
...
}
What is the way to construct a data structure and algorithm to fetch a field by key such that it is (a) fast, in terms of number of operations, and (b) minimal, in terms of memory / data-structure size?
Obviously if you know the index of the field the fastest would be to lookup the field by index in the array. But I need to find these by key.
Now, the number of keys will be relatively small for each class. In this example there are only 2 keys/fields.
One way to do this would be to create a hash table, such as this one in JS. You give it the key, and it iterates through each character of the key, running it through some mixing function. But this is, for one thing, dependent on the size of the key. That's not too bad for the kinds of field names I'm expecting, which shouldn't be too large; say they usually aren't longer than 100 characters.
Another way to do this would be to create a trie. You first have to build the trie; then, when you do a lookup, each node of the trie holds one character, so the lookup takes name.length steps to find the field.
But I'm wondering, since the number of fields will be small, why do we need to iterate over the keys in the string? A possibly simpler approach, as long as the number of fields is small, is to just iterate through the fields and do a direct string match against each field name.
But all three of these techniques would be roughly the same in terms of number of iterations.
Is there any other type of magic that will give you the fewest number of iterations/steps?
It seems like there could be a hashing algorithm that takes advantage of the fact that the number of items in the hash table will be small. You would create a new hash table for each class, giving it a "size" (the number of fields on the specific class used for this hash table). Maybe it could somehow use this size information to construct a simple hashing algorithm that minimizes the number of iterations.
Is anything like that possible? If so, how would you do it? If not, it would be interesting to know why it's not possible to do any better than these.
How "small" is the field list?
If you keep field-list sorted by key, you can use binary search.
For a very small number of fields (e.g. 4) it will perform about the same number of iterations and key comparisons as a linear search, if you consider the worst case of linear search. (A linear search would be very efficient, in both speed and memory, for this case.)
To beat the average case of linear search, you'd need more fields (e.g. 8).
This is as memory-efficient as your linear-search solution, and more memory-efficient than the trie solution.
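As a rough sketch of the two options being compared (the field names here are hypothetical), keeping the names in a sorted list lets you switch between a linear scan and a binary search with essentially no extra memory:

import bisect

# Hypothetical field names, sorted once when the class is defined.
field_names = sorted(["bar", "baz", "foo", "qux"])

def find_linear(name):
    for i, f in enumerate(field_names):   # fine for a handful of fields
        if f == name:
            return i
    return -1

def find_binary(name):
    i = bisect.bisect_left(field_names, name)
    return i if i < len(field_names) and field_names[i] == name else -1

print(find_linear("foo"), find_binary("foo"))   # 2 2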

Under which circumstance should I use tries instead of binary trees/hash tables? [duplicate]

This question already has answers here:
How Do I Choose Between a Hash Table and a Trie (Prefix Tree)?
(8 answers)
Closed 9 years ago.
Does the fact that the keys are usually strings make it more useful for collections of string data? I know that a hash table uses less space, because it has one chunk of memory allocated to it, rather than an allocation for each character of each string.
In terms of search, O(m) is the worst case, where m is the length of a key. Binary tree search is O(log n), so I guess I should compare which is more efficient depending on a situation?
P.S. Before you vote to close, this is not an opinion question. I need real facts about data structures to make the optimal choice.
Thank you
You have to decide what you are looking for in terms of use cases.
As far as facts are concerned, here are the points to keep in mind.
A hash table stores data by key, and it is only useful if you want to look up a particular string.
So, if you want to find all strings starting with K, you will have to iterate over the whole hash table, and the ordering information is also lost when data is inserted into the table.
As for a BST, it is easy to store strings in it, and it keeps the strings in their natural ordering, but at each node it has to compare all the characters of the key, and that is not good from a search-time point of view.
Now coming to the trie: unlike the hash table and the BST, a trie is not good from a storage point of view and will take up a lot of space, but from a search point of view it is much faster.
Once again, it all depends on what you want to buy and at what price; based on that, you can go for a hash table, a BST, a trie, or a suffix tree.
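To illustrate the ordering point (using a sorted list with binary search rather than an actual BST, which behaves similarly for this purpose), finding all strings with a given prefix only touches the matching range, whereas a hash table would need a full scan. The words below are hypothetical:

import bisect

words = sorted(["kale", "kayak", "kiwi", "lemon", "mango"])

def with_prefix(prefix):
    # Jump to the first candidate, then walk while the prefix still matches.
    i = bisect.bisect_left(words, prefix)
    out = []
    while i < len(words) and words[i].startswith(prefix):
        out.append(words[i])
        i += 1
    return out

print(with_prefix("ka"))   # ['kale', 'kayak']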

Data Structure for tuple indexing

I need a data structure that stores tuples and would allow me to do a query like: given a tuple (x,y,z) of integers, find the next one (an upper bound for it). By that I mean considering the natural ordering (a,b,c) <= (d,e,f) <=> a<=d and b<=e and c<=f. I have tried MSD radix sort, which splits items into buckets and sorts them (and does this recursively for all positions in the tuples). Does anybody have any other suggestion? Ideally I would like the above query to run in O(log n), where n is the number of tuples.
Two options.
Use binary search on a sorted array. If you build the keys (assuming 32-bit ints) as (a<<64)|(b<<32)|c and hold them in a simple array, packed one beside the other, you can use binary search to locate the value you are searching for (if using C, there is even a library function to do this), and the next one is simply one position along. Worst-case performance is O(log N), and if you can do interpolation search (http://en.wikipedia.org/wiki/Interpolation_search) you might even approach O(log log N).
The problem with packed binary keys is that adding new values can be tricky, and you may need some gymnastics if you will exceed available memory. But it is fast, with only a few random memory accesses on average.
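A small sketch of the packed-key idea in Python (Python integers are arbitrary precision, so the 96-bit key is not a problem; components are assumed non-negative and fitting in 32 bits, and "next" here means the next key in the packed lexicographic order):

import bisect

def pack(a, b, c):
    # Assumes non-negative components that each fit in 32 bits.
    return (a << 64) | (b << 32) | c

tuples = [(1, 2, 3), (1, 2, 5), (2, 0, 0), (3, 7, 1)]
keys = sorted(pack(*t) for t in tuples)

def next_after(a, b, c):
    # First stored key strictly greater than (a, b, c).
    i = bisect.bisect_right(keys, pack(a, b, c))
    if i == len(keys):
        return None
    k = keys[i]
    return (k >> 64, (k >> 32) & 0xFFFFFFFF, k & 0xFFFFFFFF)

print(next_after(1, 2, 3))   # (1, 2, 5)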
Alternatively, you could build a hash table by generating a key from a, b, and c in some form, and have the hash data point to a structure that contains the next value, whatever that might be. It is possibly a little harder to create in the first place, since when generating the table you need to know the next value already.
The problems with the hash approach are that it will likely use more memory than the binary-search method, and that performance is great as long as you don't get hash collisions but starts to drop off when you do, although there are variations of the algorithm that help in some cases. The hash approach is possibly much easier when inserting new values.
I also see you had a similar question along these lines, so I guess the gist of what I am saying is: combine a, b, and c to produce a single long key, and use that with binary search, a hash, or even a B-tree. If the length of the key is your problem (what language are you using?), could you treat it as a string?
If this answer is completely off base, let me know and I will see if I can delete it, so your question remains unanswered rather than getting a useless answer.

Hash table - why is it faster than arrays?

In cases where I have a key for each element and I don't know the index of the element in an array, hash tables perform better than arrays (O(1) vs O(n)).
Why is that? I mean: I have a key, I hash it, I have the hash; shouldn't the algorithm compare this hash against every element's hash? I think there's some trick behind the memory disposition, isn't there?
In cases where I have a key for each element and I don't know the index of the element into an array, hashtables perform better than arrays (O(1) vs O(n)).
The hash table search performs O(1) in the average case. In the worst case, the hash table search performs O(n): when you have collisions and the hash function always returns the same slot. One may think "this is a remote situation," but a good analysis should consider it. In that case you have to iterate through all the elements, as in an array or linked list (O(n)).
Why is that? I mean: I have a key, I hash it.. I have the hash.. shouldn't the algorithm compare this hash against every element's hash? I think there's some trick behind the memory disposition, isn't it?
You have a key, you hash it, and you have the hash: the index in the hash table where the element sits (if it has been stored before). At this point you can access the hash table record in O(1). If the load factor is small, it's unlikely you'll see more than one element there, so the first element you see should be the one you are looking for. Otherwise, if there is more than one element, you must compare the elements you find at that position with the element you are looking for. In this case you have O(1) + O(number_of_elements).
In the average case, the hash table search complexity is O(1) + O(load_factor) = O(1 + load_factor).
Remember that in the worst case the load factor is proportional to n. So the search complexity is O(n) in the worst case.
I don't know what you mean with "trick behind the memory disposition". Under some points of view, the hash table (with its structure and collisions resolution by chaining) can be considered a "smart trick".
Of course, the hash table analysis results can be proven by math.
With arrays: if you know the value, you have to search on average half the values (unless sorted) to find its location.
With hashes: the location is generated based on the value. So, given that value again, you can calculate the same hash you calculated when inserting. Sometimes, more than 1 value results in the same hash, so in practice each "location" is itself an array (or linked list) of all the values that hash to that location. In this case, only this much smaller (unless it's a bad hash) array needs to be searched.
Hash tables are a bit more complex. They put elements in different buckets based on their hash % some value. In an ideal situation, each bucket holds very few items and there aren't many empty buckets.
Once you know the key, you compute the hash. Based on the hash, you know which bucket to look for. And as stated above, the number of items in each bucket should be relatively small.
Hash tables are doing a lot of magic internally to make sure buckets are as small as possible while not consuming too much memory for empty buckets. Also, much depends on the quality of the key -> hash function.
Wikipedia provides a very comprehensive description of hash tables.
A hash table does not have to compare every element in the table. It calculates the hash code from the key. For example, if the key is 4, then the hash code might be something like 4*x*y. Now the lookup knows exactly which element to pick.
Whereas if it had been an array, it would have had to traverse the whole array to search for this element.
Why is [it] that [hashtables perform lookups by key better than arrays (O(1) vs O(n))]? I mean: I have a key, I hash it.. I have the hash.. shouldn't the algorithm compare this hash against every element's hash? I think there's some trick behind the memory disposition, isn't it?
Once you have the hash, it lets you calculate an "ideal" or expected location in the array of buckets, commonly:
ideal bucket = hash % num_buckets
The problem is then that another value may have already hashed to that bucket, in which case the hash table implementation has two main choices:
1) try another bucket
2) let several distinct values "belong" to one bucket, perhaps by making the bucket hold a pointer into a linked list of values
For implementation 1, known as open addressing or closed hashing, you jump around to other buckets: if you find your value, great; if you find a never-used bucket, then you can store your value there if inserting, or you know you'll never find your value when searching. There's a potential for the searching to be even worse than O(n) if the way you traverse alternative buckets ends up searching the same bucket multiple times; for example, with quadratic probing you try the ideal bucket index +1, then +4, then +9, then +16 and so on, but you must avoid out-of-bounds bucket access using e.g. % num_buckets, so if there are say 12 buckets then ideal+4 and ideal+16 search the same bucket. It can be expensive to track which buckets have been searched, so it can be hard to know when to give up too: the implementation can be optimistic and assume it will always find either the value or an unused bucket (risking spinning forever), or it can keep a counter and, after a threshold of tries, either give up or fall back to a linear bucket-by-bucket search.
For implementation 2, known as closed addressing or separate chaining, you have to search inside the container/data structure of values that all hashed to the ideal bucket. How efficient this is depends on the type of container used. It's generally expected that the number of elements colliding at one bucket will be small, which is true of a good hash function with non-adversarial inputs, and typically true enough of even a mediocre hash function, especially with a prime number of buckets. So a linked list or contiguous array is often used despite the O(n) search properties: linked lists are simple to implement and operate on, and arrays pack the data together for better memory-cache locality and access speed. The worst possible case, though, is that every value in your table hashed to the same bucket, and the container at that bucket now holds all the values: your entire hash table is then only as efficient as the bucket's container. Some Java hash table implementations have started using binary trees if the number of elements hashing to the same bucket passes a threshold, to make sure complexity is never worse than O(log2 n).
Python hashes are an example of 1 = open addressing = closed hashing. C++ std::unordered_set is an example of closed addressing = separate chaining.
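A toy separate-chaining (closed addressing) sketch, just to make the bucket mechanics concrete; the bucket count is fixed and small here, whereas a real table would also resize and tune its hash:

class ChainedSet:
    def __init__(self, num_buckets=8):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # The "ideal bucket": hash modulo the number of buckets.
        return self.buckets[hash(key) % len(self.buckets)]

    def add(self, key):
        bucket = self._bucket(key)
        if key not in bucket:          # linear scan of one (hopefully short) chain
            bucket.append(key)

    def __contains__(self, key):
        return key in self._bucket(key)

s = ChainedSet()
s.add("farm")
s.add("farm animals")
print("farm" in s, "barn" in s)   # True False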
The purpose of hashing is to produce an index into the underlying array, which enables you to jump straight to the element in question. This is usually accomplished by dividing the hash by the size of the array and taking the remainder: index = hash % capacity.
The type/size of the hash is typically that of the smallest integer large enough to index all of RAM. On a 32-bit system this is a 32-bit integer; on a 64-bit system, a 64-bit integer. In C++ this corresponds to unsigned int and unsigned long long respectively. (To be pedantic, C++ technically only specifies minimum sizes for its primitives, but that's beside the point.) For the sake of making code portable, C++ also provides a size_t primitive which corresponds to the appropriate unsigned integer; you'll see that type a lot in for loops which index into arrays, in well-written code. In the case of a language like Python, the integer primitive grows to whatever size it needs to be; this is typically implemented in the standard libraries of other languages under the name "big integer". To deal with this, the Python programming language simply truncates whatever value you return from the __hash__() method down to the appropriate size.
On this score I think it's worth giving a word to the wise. The result of the arithmetic is the same regardless of whether you compute the remainder at the end or at each step along the way. Truncation is equivalent to computing the remainder modulo 2^n, where n is the number of bits you leave intact. Now, you might think that computing the remainder at each step would be foolish, since you're incurring an extra computation at every step along the way. That is not the case, for two reasons. First, computationally speaking, truncation is extraordinarily cheap, far cheaper than generalized division. Second, and this is the real reason (the first alone is insufficient, and the claim would generally hold even without it), taking the remainder at each step keeps the numbers relatively small. So instead of something like product = 31*product + hash(array[index]), you'll want something like product = hash(31*product + hash(array[index])). The primary purpose of the inner hash() call is to take something which might not be a number and turn it into one, whereas the primary purpose of the outer hash() call is to take a potentially oversized number and truncate it. Lastly, I'll note that in languages like C++, where integer primitives have a fixed size, this truncation step is automatically performed after every operation.
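A sketch of the same point, with an explicit modulus standing in for the truncating outer hash (the constants 31 and 2**64 are illustrative): reducing at every step keeps the running value small, yet gives the same result as reducing only once at the end.

MOD = 2 ** 64                      # stands in for fixed-width truncation

def poly_hash(items):
    h = 0
    for x in items:
        # Reduce at every step; modular arithmetic guarantees the same
        # final value as computing the whole expression first and reducing once.
        h = (31 * h + hash(x)) % MOD
    return h

print(poly_hash(["foo", "bar", "baz"]))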
Now for the elephant in the room. You've probably realized that, since hash codes are generally speaking smaller than the objects they correspond to (and the indices derived from them are generally speaking smaller still), it's entirely possible for two objects to hash to the same index. This is called a hash collision. Data structures backed by a hash table, like Python's set or dict or C++'s std::unordered_set or std::unordered_map, primarily handle this in one of two ways. The first is called separate chaining, and the second is called open addressing. In separate chaining, the array functioning as the hash table is itself an array of lists (or, in some cases where the developer feels like getting fancy, some other data structure like a binary search tree), and every time an element hashes to a given index it gets added to the corresponding list. In open addressing, if an element hashes to an index which is already occupied, the data structure probes over to the next index (or, in some cases where the developer feels like getting fancy, an index defined by some other function, as is the case in quadratic probing) and so on until it finds an empty slot, of course wrapping around when it reaches the end of the array.
Next, a word about load factor. There is of course an inherent space/time trade-off when it comes to increasing or decreasing the load factor. The higher the load factor, the less wasted space the table consumes; however, this comes at the expense of increasing the likelihood of performance-degrading collisions. Generally speaking, hash tables implemented with separate chaining are less sensitive to load factor than those implemented with open addressing. This is due to the phenomenon known as clustering, whereby clusters in an open-addressed hash table tend to become larger and larger in a positive feedback loop, because the larger they become the more likely they are to contain the preferred index of a newly added element. This is actually the reason why the aforementioned quadratic probing scheme, which progressively increases the jump distance, is often preferred. In the extreme case of load factors greater than 1, open addressing can't work at all, as the number of elements exceeds the available space. That being said, load factors greater than 1 are exceedingly rare in general. At the time of writing, Python's set and dict classes employ a max load factor of 2/3, Java's java.util.HashSet and java.util.HashMap use 3/4, and C++'s std::unordered_set and std::unordered_map take the cake with a max load factor of 1. Unsurprisingly, Python's hash-table-backed data structures handle collisions with open addressing, whereas their Java and C++ counterparts do it with separate chaining.
Last, a comment about table size. When the max load factor is exceeded, the size of the hash table must of course be grown. Because this requires that every element therein be reindexed, it's highly inefficient to grow the table by a fixed amount; doing so would incur on the order of size operations every time a new element is added. The standard fix for this problem is the same as that employed by most dynamic array implementations: at every point where we need to grow the table, we simply increase its size by its current size. This, unsurprisingly, is known as table doubling.
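A rough sketch of the doubling step for a separate-chaining table like the one sketched earlier (the 2/3 threshold mirrors the load factor mentioned for Python, but the exact policy here is only illustrative); because the capacity doubles each time, the cost of rehashing is amortized to O(1) per insertion:

def maybe_grow(buckets, count, max_load=2/3):
    # Double the bucket array and rehash every key once the load factor is exceeded.
    if count / len(buckets) <= max_load:
        return buckets
    new_buckets = [[] for _ in range(2 * len(buckets))]
    for chain in buckets:
        for key in chain:
            new_buckets[hash(key) % len(new_buckets)].append(key)
    return new_buckets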
I think you answered your own question there: "shouldn't the algorithm compare this hash against every element's hash?" That's roughly what it has to do when it doesn't know the index location of what you're searching for: it compares each element to find the one you're looking for.
E.g. let's say you're looking for an item called "Car" inside an array of strings. You need to go through every item and check item.Hash() == "Car".Hash() to find out that that is the item you're looking for. (Obviously a search doesn't always use the hash, but the example stands.) Then you have a hash table. What a hash table does is create a sparse array, or sometimes an array of buckets as mentioned above. It then uses "Car".Hash() to deduce where in the sparse array your "Car" item actually is. This means it doesn't have to search through the entire array to find your item.

How Do I Choose Between a Hash Table and a Trie (Prefix Tree)?

So if I have to choose between a hash table and a prefix tree, what are the discriminating factors that would lead me to choose one over the other? From my own naive point of view it seems as though using a trie has some extra overhead since it isn't stored as an array, but in terms of run time (assuming the longest key is the longest English word) it can be essentially O(1) (in relation to the upper bound). Maybe the longest English word is 50 characters?
Hash tables give instant lookup once you have the index. Hashing the key to get the index, however, seems like it could easily take close to 50 steps.
Can someone provide me a more experienced perspective on this? Thanks!
Advantages of tries:
The basics:
Predictable O(k) lookup time where k is the size of the key
Lookup can take less than k time if it's not there
Supports ordered traversal
No need for a hash function
Deletion is straightforward
New operations:
You can quickly look up prefixes of keys, enumerate all entries with a given prefix, etc.
Advantages of linked structure:
If there are many common prefixes, the space they require is shared.
Immutable tries can share structure. Instead of updating a trie in place, you can build a new one that's different only along one branch, elsewhere pointing into the old trie. This can be useful for concurrency, multiple simultaneous versions of a table, etc.
An immutable trie is compressible. That is, it can share structure on the suffixes as well, by hash-consing.
Advantages of hashtables:
Everyone knows hashtables, right? Your system will already have a nice well-optimized implementation, faster than tries for most purposes.
Your keys need not have any special structure.
More space-efficient than the obvious linked trie structure
It all depends on what problem you're trying to solve. If all you need to do is insertions and lookups, go with a hash table. If you need to solve more complex problems such as prefix-related queries, then a trie might be the better solution.
Everyone knows hash tables and their uses, but lookup is not exactly constant time; it depends on how big the hash table is and on the computational complexity of the hash function.
Creating huge hash tables for efficient lookup is not an elegant solution in most industrial scenarios where even small latency/scalability matters (e.g. high-frequency trading). You also have to care about data structures being optimized for the space they take up in memory, to reduce cache misses.
A very good example where a trie better suits the requirements is messaging middleware. You have a million subscribers and publishers of messages to various categories (in JMS terms, topics or exchanges). In such cases, if you want to filter messages based on topics (which are actually strings), you definitely do not want to create a hash table for the million subscriptions with a million topics. A better approach is to store the topics in a trie, so that when filtering is done based on a topic match, its complexity is independent of the number of topics/subscriptions/publishers (it depends only on the length of the string). I like it because you can be creative with this data structure to optimize space requirements and hence have fewer cache misses.
Use a trie:
If you need an autocomplete feature
To find all words beginning with 'a' or 'axe', and so on.
A suffix tree is a special form of a trie. Suffix trees have a whole list of advantages that a hash cannot cover.
Insertion and lookup on a trie are linear in the length of the input string, O(s).
A hash will give you O(1) for lookup and insertion, but first you have to calculate the hash from the input string, which again is O(s).
Conclusion: the asymptotic time complexity is linear in both cases.
The trie has some more overhead from a data perspective, but you can choose a compressed trie, which will put you again more or less on a tie with the hash table.
To break the tie, ask yourself this question: do I need to look up full words only, or do I need to return all words matching a prefix (as in a predictive text input system)? For the first case, go for a hash. It is simpler and cleaner code, and easier to test and maintain. For a more elaborate use case where prefixes or suffixes matter, go for a trie.
And if you do it just for fun, implementing a trie would put a Sunday afternoon to good use.
There's something I haven't seen anyone mention explicitly that I think is important to keep in mind. Both hash tables and tries of various kinds will typically have O(k) operations, where k is the length of the string in bits (or equivalently in chars).
This is assuming you have a good hash function. If you don't want "farm" and "farm animals" to hash to the same value, then the hash function will have to use all the bits of the key, and so hashing "farm animals" should take about twice as long as "farm" (unless you're in some sort of rolling hash scenario, but there are somewhat similar operation-saving scenarios with tries too). And with a vanilla trie, it's clear why inserting "farm animals" will take about twice as long as just "farm". In the long run it's true with compressed tries as well.
A hash table implementation is space-efficient compared to a basic trie implementation. But with strings, ordering is necessary in most practical applications, and a hash table totally disturbs the lexicographical order. If your application does operations based on lexicographical order (like partial search, all strings with a given prefix, all words in sorted order), you should use tries. For lookup only, a hash table should be used (as, arguably, it gives the minimum lookup time).
P.S.: Other than these, ternary search trees (TSTs) would be an excellent choice. Their lookup time is longer than a hash table's, but they are time-efficient in all other operations. They are also more space-efficient than tries.
Some (usually embedded, real-time) applications require that the processing time be independent of the data. In that case, a hash table can guarantee a known execution time, while a trie varies based on the data.
