Efficiently calculating container hash codes - algorithm

The algorithm I know for calculating the hash code of a container works by recursively combining the hashes of all the elements it contains. How the hashes are combined is irrelevant to my question, but because the algorithm recurses, the calculation can become very expensive: O(n), where n is the total number of elements reachable.
My question is whether there are any more efficient methods. For example, if you have an array with 100k elements, you could calculate its hash by combining the hashes of only 100 of the elements it contains. That would make the calculation 1000 times faster, while still being a good hash function, wouldn't it?
The 100 elements you pick could be the first 100, or every 1000th (in the above example), or chosen by some other deterministic formula.
So, to answer my question, you could either tell me why this idea can't work or point me to where it has already been investigated. For example, has any programming language implemented "sub O(n) sequence hashing" like I'm proposing?

In general, designing an appropriate hash function requires trading off computation time against quality, and this will be particularly true for very large objects.
Hashing only a fixed-size subset of a large object is a valid strategy (Lua uses this strategy for hashing large strings, for example), but it can obviously lead to problems if the hashed objects have few differences and it happens that the differences are not in the hashed subset. That opens the possibility of denial-of-service attacks (or inputs which accidentally trigger the same problem), so it is not generally a good idea if you are hashing uncontrolled inputs. (And if you're using the hash as part of a cryptographic exercise, then omitting part of the object makes falsification trivial, so in that context it's a really bad idea.)
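As a rough illustration of that sampling strategy, here is a minimal sketch in Python; the stride, the 64-bit mask and the 31-based combining step are all arbitrary choices for illustration, not any particular language's implementation:

def sampled_hash(seq, max_samples=100):
    """Hash a sequence by combining the hashes of at most max_samples evenly
    spaced elements instead of every element."""
    n = len(seq)
    if n == 0:
        return hash(())
    stride = max(1, n // max_samples)
    h = hash(n)   # mix in the length so sequences of different sizes tend to differ
    for i in range(0, n, stride):
        h = (31 * h + hash(seq[i])) & 0xFFFFFFFFFFFFFFFF   # arbitrary combining step
    return h

As noted above, any elements that the deterministic sampling formula skips contribute nothing to the result, which is exactly where the collision and denial-of-service concerns come from.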
Assuming you're using the hash as part of a database indexing strategy (that is, a hash table), remember that in the end you will need to compare the value being looked up with each potential match in the table; those comparisons are necessarily O(n) (unless you believe that almost all lookups will fail). Each false positive requires an additional comparison, so the quality-versus-computation-time tradeoff may turn out to be a false economy.
But, in the end, there is no definitive answer; you will have to decide based on the precise use case you have, including a consideration of what you are using the hash for, what the distribution of the data is (or is likely to be) and so on.

Related

Hash table - why is it faster than arrays?

In cases where I have a key for each element and I don't know the index of the element into an array, hashtables perform better than arrays (O(1) vs O(n)).
Why is that? I mean: I have a key, I hash it.. I have the hash.. shouldn't the algorithm compare this hash against every element's hash? I think there's some trick behind the memory disposition, isn't it?
In cases where I have a key for each element and I don't know the index of the element into an array, hashtables perform better than arrays (O(1) vs O(n)).
The hash table search performs O(1) in the average case. In the worst case, the hash table search performs O(n): when you have collisions and the hash function always returns the same slot. One may think "this is a remote situation," but a good analysis should consider it. In that case you have to iterate through all the elements, just like with an array or a linked list (O(n)).
Why is that? I mean: I have a key, I hash it.. I have the hash.. shouldn't the algorithm compare this hash against every element's hash? I think there's some trick behind the memory disposition, isn't it?
You have a key, you hash it, and you have the hash: the index in the hash table where the element lives (if it has been inserted before). At this point you can access the hash table slot in O(1). If the load factor is small, it's unlikely that you'll see more than one element there, so the first element you find should be the one you're looking for. Otherwise, if there is more than one element, you must compare the elements you find at that position with the element you're looking for. In that case you have O(1) + O(number_of_elements).
In the average case, the hash table search complexity is O(1) + O(load_factor) = O(1 + load_factor).
Remember, load_factor = n in the worst case. So, the search complexity is O(n) in the worst case.
I don't know what you mean by "trick behind the memory disposition". From some points of view, the hash table (with its structure and collision resolution by chaining) can be considered a "smart trick".
Of course, these hash table analysis results can be proven mathematically.
With arrays: if you know the value, you have to search on average half the values (unless sorted) to find its location.
With hashes: the location is generated based on the value. So, given that value again, you can calculate the same hash you calculated when inserting. Sometimes, more than 1 value results in the same hash, so in practice each "location" is itself an array (or linked list) of all the values that hash to that location. In this case, only this much smaller (unless it's a bad hash) array needs to be searched.
Hash tables are a bit more complex. They put elements in different buckets based on their hash % some value. In an ideal situation, each bucket holds very few items and there aren't many empty buckets.
Once you know the key, you compute the hash. Based on the hash, you know which bucket to look for. And as stated above, the number of items in each bucket should be relatively small.
Hash tables are doing a lot of magic internally to make sure buckets are as small as possible while not consuming too much memory for empty buckets. Also, much depends on the quality of the key -> hash function.
Wikipedia provides a very comprehensive description of hash tables.
A hash table will not have to compare every element in the table. It calculates the hash code from the key. For example, if the key is 4, then the hash code might be something like 4*x*y. The table then knows exactly which slot to pick from.
Whereas if it had been an array, it would have to traverse the whole array to search for this element.
Why is [it] that [hashtables perform lookups by key better than arrays (O(1) vs O(n))]? I mean: I have a key, I hash it.. I have the hash.. shouldn't the algorithm compare this hash against every element's hash? I think there's some trick behind the memory disposition, isn't it?
Once you have the hash, it lets you calculate an "ideal" or expected location in the array of buckets, commonly:
ideal bucket = hash % num_buckets
The problem is then that another value may have already hashed to that bucket, in which case the hash table implementation has two main choices:
1) try another bucket
2) let several distinct values "belong" to one bucket, perhaps by making the bucket hold a pointer into a linked list of values
For implementation 1, known as open addressing or closed hashing, you jump around to other buckets: if you find your value, great; if you find a never-used bucket, then you can store your value there if inserting, or you know you'll never find your value when searching. There's a potential for the searching to be even worse than O(n) if the way you traverse alternative buckets ends up searching the same bucket multiple times; for example, with quadratic probing you try the ideal bucket index +1, then +4, then +9, then +16 and so on, but you must avoid out-of-bounds bucket access using e.g. % num_buckets, so if there are say 12 buckets then ideal+4 and ideal+16 search the same bucket. It can be expensive to track which buckets have been searched, so it can be hard to know when to give up too: the implementation can be optimistic and assume it will always find either the value or an unused bucket (risking spinning forever), or it can keep a counter and, after a threshold of tries, either give up or start a linear bucket-by-bucket search.
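To make the open-addressing description concrete, here is a minimal Python sketch with hypothetical names (fixed bucket count, no deletion or resizing) of insertion and lookup using quadratic probing with a bounded number of tries:

class OpenAddressingSet:
    """Toy open-addressing (closed hashing) set using quadratic probing."""
    _EMPTY = object()   # sentinel marking a never-used bucket

    def __init__(self, num_buckets=16):
        self.buckets = [self._EMPTY] * num_buckets

    def _probe(self, value):
        # Visit ideal, ideal+1, ideal+4, ideal+9, ... (mod num_buckets), and
        # simply give up after num_buckets tries instead of tracking visited slots.
        ideal = hash(value) % len(self.buckets)
        for i in range(len(self.buckets)):
            yield (ideal + i * i) % len(self.buckets)

    def add(self, value):
        for idx in self._probe(value):
            if self.buckets[idx] is self._EMPTY or self.buckets[idx] == value:
                self.buckets[idx] = value
                return True
        return False    # gave up: no free slot found along the probe sequence

    def __contains__(self, value):
        for idx in self._probe(value):
            if self.buckets[idx] is self._EMPTY:
                return False           # hit a never-used bucket: the value can't be here
            if self.buckets[idx] == value:
                return True
        return False                   # bounded number of probes exhausted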
For implementation 2, known as closed addressing or separate chaining, you have to search inside the container/data structure of values that all hashed to the ideal bucket. How efficient this is depends on the type of container used. It's generally expected that the number of elements colliding at one bucket will be small, which is true of a good hash function with non-adversarial inputs, and typically true enough of even a mediocre hash function, especially with a prime number of buckets. So a linked list or contiguous array is often used, despite the O(n) search properties: linked lists are simple to implement and operate on, and arrays pack the data together for better memory cache locality and access speed. The worst possible case, though, is that every value in your table hashed to the same bucket, and the container at that bucket now holds all the values: your entire hash table is then only as efficient as that bucket's container. Some Java hash table implementations have started using binary trees if the number of elements hashing to the same bucket passes a threshold, to make sure the complexity is never worse than O(log n).
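And a comparable sketch of closed addressing / separate chaining (again Python, names hypothetical), where each bucket is simply a small list of colliding values that gets scanned linearly:

class ChainedSet:
    """Toy separate-chaining set: each bucket is a list of the values that collided there."""

    def __init__(self, num_buckets=16):
        self.buckets = [[] for _ in range(num_buckets)]

    def _chain(self, value):
        return self.buckets[hash(value) % len(self.buckets)]

    def add(self, value):
        chain = self._chain(value)
        if value not in chain:          # linear scan, but only over this bucket's chain
            chain.append(value)

    def __contains__(self, value):
        return value in self._chain(value)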
Python hashes are an example of 1 = open addressing = closed hashing. C++ std::unordered_set is an example of closed addressing = separate chaining.
The purpose of hashing is to produce an index into the underlying array, which enables you to jump straight to the element in question. This is usually accomplished by dividing the hash by the size of the array and taking the remainder: index = hash % capacity.
The type/size of the hash is typically that of the smallest integer large enough to index all of RAM. On a 32-bit system this is a 32-bit integer; on a 64-bit system it is a 64-bit integer. In C++ this corresponds to unsigned int and unsigned long long respectively. To be pedantic, C++ technically specifies minimum sizes for its primitives, i.e. at least 32 bits and at least 64 bits, but that's beside the point. For the sake of making code portable, C++ also provides a size_t primitive which corresponds to the appropriate unsigned integer. You'll see that type a lot in for loops that index into arrays, in well-written code. In the case of a language like Python, the integer primitive grows to whatever size it needs to be. This is typically implemented in the standard libraries of other languages under the name "Big Integer". To deal with this, the Python programming language simply truncates whatever value you return from the __hash__() method down to the appropriate size.
On this score I think it's worth giving a word to the wise. The result of the arithmetic is the same regardless of whether you compute the remainder at the end or at each step along the way. Truncation is equivalent to computing the remainder modulo 2^n, where n is the number of bits you leave intact. Now you might think that computing the remainder at each step would be foolish, since you're incurring an extra computation at every step along the way. However this is not the case, for two reasons. First, computationally speaking, truncation is extraordinarily cheap, far cheaper than generalized division. Second, and this is the real reason (the first alone would be insufficient, and the claim would generally hold even in its absence), taking the remainder at each step keeps the numbers (relatively) small. So instead of something like product = 31*product + hash(array[index]), you'll want something like product = hash(31*product + hash(array[index])). The primary purpose of the inner hash() call is to take something which might not be a number and turn it into one, whereas the primary purpose of the outer hash() call is to take a potentially oversized number and truncate it. Lastly I'll note that in languages like C++, where integer primitives have a fixed size, this truncation step is automatically performed after every operation.
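A minimal sketch of that incremental combine-and-truncate approach, in Python; the 64-bit mask here plays the role of the outer hash()/truncation described above, which fixed-width integers give you for free in C++ or Java:

MASK = (1 << 64) - 1   # keep 64 bits: truncation is the remainder modulo 2**64

def combine_hashes(items):
    """Polynomial-style hash combine that truncates after every step, so the
    running value never grows beyond a machine word."""
    h = 17   # arbitrary non-zero seed
    for item in items:
        h = (31 * h + hash(item)) & MASK   # truncate at each step, as argued above
    return h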
Now for the elephant in the room. You've probably realized that hash codes being generally speaking smaller than the objects they correspond to, not to mention that the indices derived from them are again generally speaking even smaller still, it's entirely possible for two objects to hash to the same index. This is called a hash collision. Data structures backed by a hash table like Python's set or dict or C++'s std::unordered_set or std::unordered_map primarily handle this in one of two ways. The first is called separate chaining, and the second is called open addressing. In separate chaining the array functioning as the hash table is itself an array of lists (or in some cases where the developer feels like getting fancy, some other data structure like a binary search tree), and every time an element hashes to a given index it gets added to the corresponding list. In open addressing if an element hashes to an index which is already occupied the data structure probes over to the next index (or in some cases where the developer feels like getting fancy, an index defined by some other function as is the case in quadratic probing) and so on until it finds an empty slot, of course wrapping around when it reaches the end of the array.
Next a word about load factor. There is of course an inherent space/time trade-off when it comes to increasing or decreasing the load factor. The higher the load factor, the less wasted space the table consumes; however, this comes at the expense of increasing the likelihood of performance-degrading collisions. Generally speaking, hash tables implemented with separate chaining are less sensitive to load factor than those implemented with open addressing. This is due to the phenomenon known as clustering, whereby clusters in an open-addressed hash table tend to become larger and larger in a positive feedback loop, as a result of the fact that the larger they become the more likely they are to contain the preferred index of a newly added element. This is actually the reason why the aforementioned quadratic probing scheme, which progressively increases the jump distance, is often preferred. In the extreme case of load factors greater than 1, open addressing can't work at all, as the number of elements exceeds the available space. That being said, load factors greater than 1 are exceedingly rare in general. At the time of writing, Python's set and dict classes employ a max load factor of 2/3, whereas Java's java.util.HashSet and java.util.HashMap use 3/4, with C++'s std::unordered_set and std::unordered_map taking the cake with a max load factor of 1. Unsurprisingly, Python's hash-table-backed data structures handle collisions with open addressing, whereas their Java and C++ counterparts do it with separate chaining.
Last, a comment about table size. When the max load factor is exceeded, the size of the hash table must of course be grown. Because this requires that every element therein be reindexed, it's highly inefficient to grow the table by a fixed amount; to do so would incur on the order of size operations every time a new element is added. The standard fix for this problem is the same as that employed by most dynamic array implementations: at every point where we need to grow the table, we simply increase its size by its current size. This, unsurprisingly, is known as table doubling.
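A sketch of what that growth policy might look like for a chained table (Python, hypothetical names, and an illustrative 0.75 threshold rather than any particular library's); every element has to be rehashed because its bucket depends on hash % len(buckets):

MAX_LOAD_FACTOR = 0.75   # illustrative threshold, not any particular library's

def maybe_grow(buckets, num_elements):
    """Double the bucket array and reindex every element once the load factor
    is exceeded; each element's bucket depends on hash(value) % len(buckets)."""
    if num_elements / len(buckets) <= MAX_LOAD_FACTOR:
        return buckets
    new_buckets = [[] for _ in range(2 * len(buckets))]   # table doubling
    for chain in buckets:
        for value in chain:
            new_buckets[hash(value) % len(new_buckets)].append(value)
    return new_buckets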
I think you answered your own question there: "shouldn't the algorithm compare this hash against every element's hash". That's kind of what it does when it doesn't know the index location of what you're searching for. It compares each element to find the one you're looking for:
E.g. Let's say you're looking for an item called "Car" inside an array of strings. You need to go through every item and check item.Hash() == "Car".Hash() to find out that it is the item you're looking for. Obviously it doesn't always use the hash when searching, but the example stands. Then you have a hash table. What a hash table does is create a sparse array, or sometimes an array of buckets as mentioned above. Then it uses "Car".Hash() to deduce where in the sparse array your "Car" item actually is. This means that it doesn't have to search through the entire array to find your item.

Fuzzy matching deduplication in less than exponential time?

I have a large database (potentially in the millions of records) with relatively short strings of text (on the order of street address, names, etc).
I am looking for a strategy to remove inexact duplicates, and fuzzy matching seems to be the method of choice. My issue: many articles and SO questions deal with matching a single string against all records in a database. I am looking to deduplicate the entire database at once.
The former would be a linear time problem (comparing a value against a million other values, calculating some similarity measure each time). The latter is an exponential time problem (compare every record's values against every other record's value; for a million records, that's approx 5 x 10^11 calculations vs the 1,000,000 calculations for the former option).
I'm wondering if there is another approach than the "brute-force" method I mentioned. I was thinking of possibly generating a string to compare each record's value against, then grouping strings that had roughly equal similarity measures, and then running the brute-force method through these groups. I wouldn't achieve linear time, but it might help. Also, if I'm thinking through this properly, this could miss a potential fuzzy match between strings A and B because their similarity to string C (the generated check-string) is very different despite their being very similar to each other.
Any ideas?
P.S. I realize I may have used the wrong terms for time complexity - it is a concept that I have a basic grasp of, but not well enough so I could drop an algorithm into the proper category on the spot. If I used the terms wrong, I welcome corrections, but hopefully I got my point across at least.
Edit
Some commenters have asked, given fuzzy matches between records, what my strategy was to choose which ones to delete (i.e. given "foo", "boo", and "coo", which would be marked the duplicate and deleted). I should note that I am not looking for an automatic delete here. The idea is to flag potential duplicates in a 60+ million record database for human review and assessment purposes. It is okay if there are some false positives, as long as it is a roughly predictable / consistent amount. I just need to get a handle on how pervasive the duplicates are. But if the fuzzy matching pass-through takes a month to run, this isn't even an option in the first place.
Have a look at http://en.wikipedia.org/wiki/Locality-sensitive_hashing. One very simple approach would be to divide up each address (or whatever) into a set of overlapping n-grams. Thus STACKOVERFLOW becomes the set {STACK, TACKO, ACKOV, CKOVE, ..., RFLOW}. Then use a large hash table or sort-merge to find colliding n-grams and check collisions with a fuzzy matcher. Thus STACKOVERFLOW and SXACKOVRVLOX will collide because both are associated with the colliding n-gram ACKOV.
A next level up in sophistication is to pick a random hash function, e.g. HMAC with an arbitrary key, and, of the n-grams you find, keep only the one with the smallest hashed value. Then you have to keep track of fewer n-grams, but you will only see a match if the smallest hashed value in both cases is ACKOV. There is obviously a trade-off here between the length of the n-gram and the probability of false hits. In fact, what people seem to do is to make n quite small and get higher precision by concatenating the results from more than one hash function in the same record, so you need to get a match in multiple different hash functions at the same time - I presume the probabilities work out better this way. Try googling for "duplicate detection minhash".
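A minimal sketch of that n-gram / minhash blocking idea (Python; the n-gram length, the number of hash functions, and the keyed SHA-1 used here as a stand-in for HMAC are all illustrative choices):

import hashlib

def ngrams(s, n=5):
    """Overlapping character n-grams, e.g. STACKOVERFLOW -> {'STACK', 'TACKO', ..., 'RFLOW'}."""
    s = s.upper()
    return {s[i:i + n] for i in range(max(1, len(s) - n + 1))}

def minhash_signature(s, num_hashes=4, n=5):
    """For each of num_hashes keyed hash functions, keep only the n-gram with the
    smallest hashed value; records sharing a whole signature become candidate pairs."""
    signature = []
    for k in range(num_hashes):
        key = str(k).encode()                      # stand-in for an HMAC key
        smallest = min(ngrams(s, n),
                       key=lambda g: hashlib.sha1(key + g.encode()).digest())
        signature.append(smallest)
    return tuple(signature)

# Bucket records by signature (a big dict or a sort-merge), then run the
# expensive fuzzy matcher only within each bucket of candidates.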
I think you may have miscalculated the complexity of all the combinations. If comparing one string with all other strings is linear, this means that, due to the small lengths, each comparison is O(1). The process of comparing each string with every other string is not exponential but quadratic, which is not all that bad. In simpler terms, you are comparing nC2 or n(n-1)/2 pairs of strings, so it's just O(n^2).
I couldn't think of a way to sort them in order, as you can't write an objective comparator, but even if you did, sorting would take O(n log n) for merge sort, and since you have so many records and probably would prefer using no extra memory, you would use quicksort, which takes O(n^2) in the worst case - no improvement over the worst-case time of brute force.
You could use a Levenshtein transducer, which "accept[s] a query term and return[s] all terms in a dictionary that are within n spelling errors away from it". Here's a demo.
Pairwise comparison of all the records is O(N^2), not exponential. There are basically two ways to cut down on that complexity.
The first is blocking, where you only compare records that already have something in common that's easy to compute, like the first three letters or a common n-gram. This is basically the same idea as locality-sensitive hashing. The dedupe python library implements a number of blocking techniques and the documentation gives a good overview of the general approach.
In the worst case, pairwise comparison with blocking is still O(N^2). In the best case it is O(N). Neither the best nor the worst case is really met in practice. Typically, blocking reduces the number of pairs to compare by over 99.9%.
There are some interesting alternative paradigms for record linkage that are not based on pairwise comparisons. These have better worst-case complexity guarantees. See the work of Beka Steorts and Michael Wick.
I assume this is a one-time cleanup. I think the problem won't be having to do so many comparisons, it'll be having to decide what comparisons are worth making. You mention names and addresses, so see this link for some of the comparison problems you'll have.
It's true you have to do almost 500 billion brute-force compares for comparing a million records against themselves, but that's assuming you never skip any records previously declared a match (ie, never doing the "break" out of the j-loop in the pseudo-code below).
My pokey E-machines T6532 2.2gHz manages to do 1.4m seeks and reads per second of 100-byte text file records, so 500 billion compares would take about 4 days. Instead of spending 4 days researching and coding up some fancy solution (only to find I still need another x days to actually do the run), and assuming my comparison routine can't compute and save the keys I'd be comparing, I'd just let it brute-force all those compares while I find something else to do:
for i = 1 to LASTREC-1
    seektorec(i)
    getrec(i) into a
    for j = i+1 to LASTREC
        getrec(j) into b
        if similarrecs(a, b) then [gotahit(); break]
Even if a given run only locates easy-to-define matches, hopefully it reduces the remaining unmatched records to a more reasonable smaller set for which further brute-force runs aren't so time-consuming.
But it seems unlikely that similarrecs() couldn't independently compute and save the portions of a and b being compared, and if it can, the much more efficient approach is:
for i = 1 to LASTREC
    getrec(i) into a
    write fuzzykey(a) into scratchfile
sort scratchfile
for i = 1 to LASTREC-1
    if scratchfile(i) = scratchfile(i+1) then gothit()
Most databases can do the above in one command line, if you're allowed to invoke your own custom code for computing each record's fuzzykey().
In any case, the hard part is going to be figuring out what makes two records a duplicate, per the link above.
Equivalence relations are particularly nice kinds of matching; they satisfy three properties:
reflexivity: for any value A, A ~ A
symmetry: if A ~ B, then necessarily B ~ A
transitivity: if A ~ B and B ~ C, then necessarily A ~ C
What makes these nice is that they allow you to partition your data into disjoint sets such that each pair of elements in any given set are related by ~. So, what you can do is apply the union-find algorithm to first partition all your data, then pick out a single representative element from each set in the partition; this completely de-duplicates the data (where "duplicate" means "related by ~"). Moreover, this solution is canonical in the sense that no matter which representatives you happen to pick from each partition, you get the same number of final values, and each of the final values are pairwise non-duplicate.
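For reference, a minimal union-find sketch of that partition-and-pick-a-representative step (Python; is_match stands for whatever pairwise relation you have, and the nested loop could be restricted to blocked candidate pairs):

def dedupe_by_relation(records, is_match):
    """Union-find over pairwise matches, then keep one representative per group.
    Only canonical when is_match really is an equivalence relation."""
    parent = list(range(len(records)))

    def find(i):
        # find the root representative, compressing the path as we go
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(records)):
        for j in range(i + 1, len(records)):       # or only blocked candidate pairs
            if is_match(records[i], records[j]):
                parent[find(i)] = find(j)          # union the two groups

    groups = {}
    for i, rec in enumerate(records):
        groups.setdefault(find(i), []).append(rec)
    return [group[0] for group in groups.values()]  # any representative will do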
Unfortunately, fuzzy matching is not an equivalence relation, since it is presumably not transitive (though it's probably reflexive and symmetric). The result of this is that there isn't a canonical way to partition the data; you might find that any way you try to partition the data, some values in one set are equivalent to values from another set, or that some values from within a single set are not equivalent.
So, what behavior do you want, exactly, in these situations?

how to create a collection with O(1) complexity

I would like to create a data structure or collection which will have O(1) complexity in adding, removing and calculating no. of elements. How am I supposed to start?
I have thought of a solution: I will use a Hashtable and for each key / value pair inserted, I will have only one hash code, that is: my hash code algorithm will generate a unique hash value every time, so the index at which the value is stored will be unique (i.e. no collisions).
Will that give me O(1) complexity?
Yes that will work, but as you mentioned your hashing function needs to be 100% unique. Any duplicates will result in you having to use some sort of conflict resolution. I would recommend linear chaining.
edit: Hashmap.size() allows for O(1) access
edit 2: Response to the confusion Larry has caused =P
Yes, hashing is O(k) where k is the key length. Everyone can agree on that. However, if you do not have a perfect hash, you simply cannot get O(1) time. Your claim was that you do not need uniqueness to achieve O(1) deletion of a specific element. I guarantee you that is wrong.
Consider a worst-case scenario: every element hashes to the same thing. You end up with a single linked list, which, as everyone knows, does not have O(1) deletion. I would hope, as you mentioned, nobody is dumb enough to make a hash like this.
Point is, uniqueness of the hash is a prerequisite for O(1) runtime.
Even then, though, it is technically not O(1) Big-O efficiency. Only by using amortized analysis will you achieve constant-time efficiency in the worst case. As noted in Wikipedia's article on amortized analysis:
The basic idea is that a worst case operation can alter the state in such a way that the worst case cannot occur again for a long time, thus "amortizing" its cost.
That is referring to the idea that resizing your hashtable (altering the state of your data structure) at certain load factors can ensure a smaller chance of collisions etc.
I hope this clears everything up.
Adding, removing and size (provided it is tracked separately, using a simple counter) can be provided by a linked list, unless you need to remove a specific item. You should be more specific about your requirements.
Writing a totally non-clashing hash function is quite tricky even when you know exactly the space of things being hashed, and it's impossible in general. It also depends deeply on the size of the array that you're hashing into. That is, you need to know exactly what you're doing to make that work.
But if you instead relax that a bit so that identical hash codes don't imply equality1, then you can use the existing Java HashMap framework for all the other parts. All you need to do is to plug in your own hashCode() implementation in your key class, which is something that Java has always supported. And make sure that you've got equality defined right too. At that point, you've got the various operations being not much more expensive than O(1), especially if you've got a good initial estimation for the capacity and load factor.
1 Equality must imply equal hash codes, of course.
Even if your hash codes are unique, this doesn't guarantee a collision-free collection. This is because your hash map is not of an unlimited size. The hash code has to be reduced to the number of buckets in your hash map, and after this reduction you can still get collisions.
e.g. Say I have three objects A (hash: 2), B (hash: 18), C (hash: 66), all unique.
Say you put them in a HashMap with a capacity of 16 (the default). If they were mapped to a bucket with % 16 (it's actually more complex than this), then after reducing the hash codes we have A (2 % 16 = 2), B (18 % 16 = 2), C (66 % 16 = 2).
HashMap is likely to be faster than Hashtable, unless you need thread safety. (In which case I suggest you use ConcurrentHashMap.)
IMHO, Hashtable has been a legacy collection for 12 years, and I would suggest you only use it if you have to.
What functionality do you need that a linked list won't give you?
Surprisingly, your idea will work, if you know all the keys you want to put in the collection in advance. The idea is to generate a special hash function which maps each key to a unique value in the range (1, n). Then our "hash table" is just a simple array (+ an integer to cache the number of elements)
Implementing this is not trivial, but it's not rocket science either. I'll leave it to Steve Hanov to explain the ins-and-outs, as he gives a much better explanation than I ever could.
It's simple. Just use a hash map. You don't need to do anything special. Hashmap itself is O(1) for insertion, deletion, calculating number of elements.
Even if the keys are not unique, the algorithm will still be O(1) as long as the Hashmap is automatically expanded in size if the collection gets too large (most implementations will do this for you automatically).
So, just use the Hash map according to the given documentation, and all will be well. Don't think up anything more complicated, it will just be a waste of time.
Avoiding collisions is really impossible with a hash... if it were possible, then it would basically just be an array or a mapping to an array, not a hash. But it isn't necessary to avoid collisions; it will still be O(1) on average with collisions.

Efficiently estimating the number of unique elements in a large list

This problem is a little similar to that solved by reservoir sampling, but not the same. I think it's also a rather interesting problem.
I have a large dataset (typically hundreds of millions of elements), and I want to estimate the number of unique elements in this dataset. There may be anywhere from a few, to millions of unique elements in a typical dataset.
Of course the obvious solution is to maintain a running hashset of the elements you encounter and count them at the end. This would yield an exact result, but would require me to carry a potentially large amount of state with me as I scan through the dataset (i.e. all unique elements encountered so far).
Unfortunately, in my situation this would require more RAM than is available to me (noting that the dataset may be far larger than available RAM).
I'm wondering if there would be a statistical approach to this that would allow me to do a single pass through the dataset and come up with an estimated unique element count at the end, while maintaining a relatively small amount of state while I scan the dataset.
The input to the algorithm would be the dataset (an Iterator in Java parlance), and it would return an estimated unique object count (probably a floating point number). It is assumed that these objects can be hashed (ie. you can put them in a HashSet if you want to). Typically they will be strings, or numbers.
You could use a Bloom Filter for a reasonable lower bound. You just do a pass over the data, counting and inserting items which were definitely not already in the set.
This problem is well-addressed in the literature; a good review of various approaches is http://www.edbt.org/Proceedings/2008-Nantes/papers/p618-Metwally.pdf. The simplest approach (and most compact for very high accuracy requirements) is called Linear Counting. You hash elements to positions in a bitvector just like you would a Bloom filter (except only one hash function is required), but at the end you estimate the number of distinct elements by the formula D = -total_bits * ln(unset_bits/total_bits). Details are in the paper.
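A minimal Linear Counting sketch (Python; the bitvector size is a tuning parameter that has to be chosen relative to the expected number of distinct elements, as discussed in the paper):

import math

def linear_count(items, total_bits=1 << 20):
    """Single-pass Linear Counting: set one bit per hashed item, then estimate
    D = -total_bits * ln(unset_bits / total_bits)."""
    bits = bytearray(total_bits // 8)
    for item in items:
        pos = hash(item) % total_bits
        bits[pos // 8] |= 1 << (pos % 8)
    set_bits = sum(bin(b).count("1") for b in bits)
    unset_bits = total_bits - set_bits
    if unset_bits == 0:
        return float(total_bits)   # bitvector saturated: choose a larger one
    return -total_bits * math.log(unset_bits / total_bits)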
If you have a hash function that you trust, then you could maintain a hashset just like you would for the exact solution, but throw out any item whose hash value is outside of some small range. E.g., use a 32-bit hash, but only keep items where the first two bits of the hash are 0. Then multiply by the appropriate factor at the end to approximate the total number of unique elements.
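A sketch of that hash-range sampling idea (Python; keeping values whose top two hash bits are zero retains roughly a quarter of the distinct elements, so the kept count is scaled back up by four):

def sampled_distinct_estimate(items, prefix_bits=2):
    """Keep only items whose 32-bit hash starts with prefix_bits zero bits,
    count those exactly, then scale back up by 2**prefix_bits."""
    kept = set()
    for item in items:
        h = hash(item) & 0xFFFFFFFF            # reduce to a 32-bit hash
        if h >> (32 - prefix_bits) == 0:       # top bits all zero: keep it
            kept.add(h)                        # storing the hash keeps the state small
    return len(kept) * (1 << prefix_bits)      # e.g. multiply by 4 for two bits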
Nobody has mentioned the approximate algorithm designed specifically for this problem: HyperLogLog.

How Do I Choose Between a Hash Table and a Trie (Prefix Tree)?

So if I have to choose between a hash table and a prefix tree, what are the discriminating factors that would lead me to choose one over the other? From my own naive point of view it seems as though using a trie has some extra overhead since it isn't stored as an array, but in terms of run time (assuming the longest key is the longest English word) it can be essentially O(1) (in relation to the upper bound). Maybe the longest English word is 50 characters?
Hash tables are instant lookup once you get the index. Hashing the key to get the index, however, seems like it could easily take near 50 steps.
Can someone provide me a more experienced perspective on this? Thanks!
Advantages of tries:
The basics:
Predictable O(k) lookup time where k is the size of the key
Lookup can take less than k time if it's not there
Supports ordered traversal
No need for a hash function
Deletion is straightforward
New operations:
You can quickly look up prefixes of keys, enumerate all entries with a given prefix, etc.
Advantages of linked structure:
If there are many common prefixes, the space they require is shared.
Immutable tries can share structure. Instead of updating a trie in place, you can build a new one that's different only along one branch, elsewhere pointing into the old trie. This can be useful for concurrency, multiple simultaneous versions of a table, etc.
An immutable trie is compressible. That is, it can share structure on the suffixes as well, by hash-consing.
Advantages of hashtables:
Everyone knows hashtables, right? Your system will already have a nice well-optimized implementation, faster than tries for most purposes.
Your keys need not have any special structure.
More space-efficient than the obvious linked trie structure
It all depends on what problem you're trying to solve. If all you need to do is insertions and lookups, go with a hash table. If you need to solve more complex problems such as prefix-related queries, then a trie might be the better solution.
Everyone knows hash tables and their uses, but it is not exactly constant lookup time; it depends on how big the hash table is and on the computational complexity of the hash function.
Creating huge hash tables for efficient lookup is not an elegant solution in most industrial scenarios where even small latency/scalability matters (e.g. high-frequency trading). You also have to care about the data structures being optimized for the space they take up in memory, to reduce cache misses.
A very good example where a trie better suits the requirements is messaging middleware. You have a million subscribers and publishers of messages to various categories (in JMS terms, topics or exchanges). In such cases, if you want to filter out messages based on topics (which are actually strings), you definitely do not want to create a hash table for the million subscriptions with a million topics. A better approach is to store the topics in a trie, so that when filtering is done based on a topic match, its complexity is independent of the number of topics/subscriptions/publishers (it only depends on the length of the string). I like it because you can be creative with this data structure to optimize space requirements and hence have fewer cache misses.
Use a trie:
If you need an auto-complete feature
To find all words beginning with 'a' or 'axe', and so on.
A suffix tree is a special form of a trie. Suffix trees have a whole list of advantages that a hash cannot cover.
Insertion and lookup on a trie are linear in the length of the input string, O(s).
A hash will give you O(1) for lookup and insertion, but first you have to calculate the hash based on the input string, which again is O(s).
Conclusion: the asymptotic time complexity is linear in both cases.
The trie has some more overhead from a data perspective, but you can choose a compressed trie, which will put you, again, more or less on a par with the hash table.
To break the tie, ask yourself this question: do I need to look up full words only, or do I need to return all words matching a prefix (as in a predictive text input system)? For the first case, go for a hash. It is simpler and cleaner code, and easier to test and maintain. For a more elaborate use case where prefixes or suffixes matter, go for a trie.
And if you do it just for fun, implementing a trie would put a Sunday afternoon to good use.
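If you do take that Sunday afternoon, a minimal trie with prefix enumeration might look like this (Python; the node layout and method names are just one possible choice):

class Trie:
    """Minimal trie supporting insert, exact lookup, and prefix enumeration."""

    def __init__(self):
        self.children = {}   # char -> Trie
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:                                  # O(len(word))
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def with_prefix(self, prefix):
        """Yield all stored words starting with prefix (the trie's extra trick)."""
        node = self._walk(prefix)
        if node is None:
            return
        stack = [(node, prefix)]
        while stack:
            node, word = stack.pop()
            if node.is_word:
                yield word
            for ch, child in node.children.items():
                stack.append((child, word + ch))

    def _walk(self, s):
        node = self
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return None
        return node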
There's something I haven't seen anyone mention explicitly that I think is important to keep in mind. Both hash tables and tries of various kinds will typically have O(k) operations, where k is the length of the string in bits (or equivalently in chars).
This is assuming you have a good hash function. If you don't want "farm" and "farm animals" to hash to the same value, then the hash function will have to use all the bits of the key, and so hashing "farm animals" should take about twice as long as "farm" (unless you're in some sort of rolling hash scenario, but there are somewhat similar operation-saving scenarios with tries too). And with a vanilla trie, it's clear why inserting "farm animals" will take about twice as long as just "farm". In the long run it's true with compressed tries as well.
A hash table implementation is space-efficient compared to a basic trie implementation. But with strings, ordering is necessary in most practical applications, and a hash table totally disturbs the lexicographical order. Now, if your application does operations based on lexicographical order (like partial search, all strings with a given prefix, all words in sorted order), you should use tries. For lookup only, a hash table should be used (as, arguably, it gives minimum lookup time).
P.S.: Other than these, ternary search trees (TSTs) would be an excellent choice. Their lookup time is longer than a hash table's, but they are time-efficient in all other operations. They are also more space-efficient than tries.
Some (usually embedded, real-time) applications require that the processing time be independent of the data. In that case, a hash table can guarantee a known execution time, while a trie varies based on the data.

Resources