Substring search algorithms (very large haystack, small needle) - algorithm

I know there are already several similar questions here, but I need some recommendations for my case (couldn't find anything similar).
I have to search a very large amount of data for a substring that would be about a billion times smaller (10 bytes in 10 billion bytes). The haystack doesn't change, so I can bear with large precomputation if needed. I just need the searching part to be as fast as possible.
I found algorithms which take O(n) time (n = haystack size, m = needle size), and the naive search takes O(n+m). As the m in this particular case would be very small, is there any other algorithm I could look into?
Edit:
Thanks everyone for your suggestions!
Some more info -
The data could be considered random bits, so I don't think any kind of indexing / sorting would be possible. The data to be searched can be anything, not english words or anything predictable.

You are looking for the data structure called the Trie or "prefix tree". In short, this data structure encodes all the possible string prefixes which can be found in your corpus.
Here is a paper which searches a DNA sequences for a small substring, using a prefix tree. I imagine that might help you, since your case sounds similar.
If you know a definite limit on the length of the input search string, you can limit the growth of your Trie so that it does not store any prefixes longer than this length max. In this way, you may be able to fit a Trie representing all 10G into less than 10G of memory. Especially for highly repetitive data, any sort of Trie is a compressed data representation. (Or should be, if implemented sanely.) Limiting the Trie depth to the max input search string allows you to limit the memory consumption still further.

It's worth looking at suffix arrays and trees. They both require precomputation and significant memory, but they are better than reverse indexes in the sense that you can search for arbitrary substrings in O(m) (for suffix trees) and O(m + log n) (for suffix arrays with least common prefix info).
If you have a lot of time on your hands, you can look into compressed suffix arrays and succinct CSAs that are compressed versions of your data that are also self-indexing (i.e. the data is also the index, there is no external index). This is really the best of all worlds because not only do you have a compressed version of your data (you can throw the original data away), but it's also indexed! The problem is understanding the research papers and translating them into code :)
If you do not need perfect substring matching, but rather general searching capabilities, check out Lucene.

Prefix/suffix tree are generally the standard, best and most cautious solution for this sort of thing in my opinion. You can't go wrong with them.
But here is a different idea, which resorts to Bloom filters. You probably know what these are, but just in case (and for other people reading this answer): Bloom filters are very small, very compact bit-vectors which approximate set inclusion. If you have a set S and a Bloom filter for that set B(S), then
x ∈ S ⇒ x ∈ B(S)
but the reciprocate is false. This is what is probabilistic about the structure: there is a (quantifiable) probability that the Bloom filter will return a false positive. But approximating inclusion with the Bloom filter is wildly faster than testing it exactly on the set.
(A simple case use: in a lot of applications, the Bloom filter is used, well, as a filter. Checking cache is expensive, because you have to do a hard drive access, so programs like Squid will first check a small Bloom filter in memory, and if the Bloom filter returns a positive result, then Squid will go check the cache. If it was false positive, then that's OK, because Squid will find that out when actually visiting the cache---but the advantage is that the Bloom filter will have spared Squid having to check the cache for a lot of requests where it would have been useless.)
Bloom filters were used with some success in string search. Here is a sketch (I may remember some of the details wrong) of this application. A text file is a sequence of N lines. You are looking for a word composed of M letters (and no word can be spread accross two lines). A preprocessing phase will build ONE Bloom filter for each line, by adding every subsequence of the line to the Bloom filter; for instance, for this line
Touching this dreaded sight, twice seene of vs,
And the corresponding Bloom filter will be created with "T", "To", "Tou" ... "o", "ou", ... "vs,", "s", "s,", ",". (I may have this part wrong. Or you might want to optimize.)
Then when searching for the subword of size M, simply do one very fast check on each of the Bloom filters, and when there is a hit, examine the line closely with the KMP algorithm, for instance. In practice, if you tune your Bloom filters well, the trade-off is remarkable. Searching is incredibly fast because you eliminate all useless lines.
I believe from this concept you could derive a useful scheme for your situation. Right now, I see two evident adaptation:
either cut your data set in many blocks of size K (each with its Bloom filter, like the lines in the previous example);
or use a sort-of dichotomy where you split the set into two subset, each with a Bloom filter, then each subset split into two sub-subsets with their own Bloom filter, etc. (if you are going to add all substrings as suggested with the method I described, this second idea would be a bit prohibitive---except you don't have to add all substrings, only substrings ranging size 1 to 10).
Both ideas can be combined in inventive ways to create multi-layered schemes.

Given Knuth–Morris–Pratt or Boyer–Moore you're not going to do any better, what you should consider is parallelization of your search process.

If you can afford the space (a lot of space!) to create an index, it'd definitely be worth your while indexing small chunks (e.g. four byte blocks) and storing these with their offsets within the haystack - then searches for 10 bytes involve searching for all four byte blocks for the first four bytes of the search term and checking the next six bytes.

Related

Compressing words into one word consisting of them as subwords [duplicate]

I bet somebody has solved this before, but my searches have come up empty.
I want to pack a list of words into a buffer, keeping track of the starting position and length of each word. The trick is that I'd like to pack the buffer efficiently by eliminating the redundancy.
Example: doll dollhouse house
These can be packed into the buffer simply as dollhouse, remembering that doll is four letters starting at position 0, dollhouse is nine letters at 0, and house is five letters at 3.
What I've come up with so far is:
Sort the words longest to shortest: (dollhouse, house, doll)
Scan the buffer to see if the string already exists as a substring, if so note the location.
If it doesn't already exist, add it to the end of the buffer.
Since long words often contain shorter words, this works pretty well, but it should be possible to do significantly better. For example, if I extend the word list to include ragdoll, then my algorithm comes up with dollhouseragdoll which is less efficient than ragdollhouse.
This is a preprocessing step, so I'm not terribly worried about speed. O(n^2) is fine. On the other hand, my actual list has tens of thousands of words, so O(n!) is probably out of the question.
As a side note, this storage scheme is used for the data in the `name' table of a TrueType font, cf. http://www.microsoft.com/typography/otspec/name.htm
This is the shortest superstring problem: find the shortest string that contains a set of given strings as substrings. According to this IEEE paper (which you may not have access to unfortunately), solving this problem exactly is NP-complete. However, heuristic solutions are available.
As a first step, you should find all strings that are substrings of other strings and delete them (of course you still need to record their positions relative to the containing strings somehow). These fully-contained strings can be found efficiently using a generalised suffix tree.
Then, by repeatedly merging the two strings having longest overlap, you are guaranteed to produce a solution whose length is not worse than 4 times the minimum possible length. It should be possible to find overlap sizes quickly by using two radix trees as suggested by a comment by Zifre on Konrad Rudolph's answer. Or, you might be able to use the generalised suffix tree somehow.
I'm sorry I can't dig up a decent link for you -- there doesn't seem to be a Wikipedia page, or any publicly accessible information on this particular problem. It is briefly mentioned here, though no suggested solutions are provided.
I think you can use a Radix Tree. It costs some memory because of pointers to leafs and parents, but it is easy to match up strings (O(k) (where k is the longest string size).
My first thought here is: use a data structure to determine common prefixes and suffixes of your strings. Then sort the words under consideration of these prefixes and postfixes. This would result in your desired ragdollhouse.
Looks similar to the Knapsack problem, which is NP-complete, so there is not a "definitive" algorithm.
I did a lab back in college where we tasked with implementing a simple compression program.
What we did was sequentially apply these techniques to text:
BWT (Burrows-Wheeler transform): helps reorder letters into sequences of identical letters (hint* there are mathematical substitutions for getting the letters instead of actually doing the rotations)
MTF (Move to front transform): Rewrites the sequence of letters as a sequence of indices of a dynamic list.
Huffman encoding: A form of entropy encoding that constructs a variable-length code table in which shorter codes are given to frequently encountered symbols and longer codes are given to infrequently encountered symbols
Here, I found the assignment page.
To get back your original text, you do (1) Huffman decoding, (2) inverse MTF, and then (3) inverse BWT. There are several good resources on all of this on the Interwebs.
Refine step 3.
Look through current list and see whether any word in the list starts with a suffix of the current word. (You might want to keep the suffix longer than some length - longer than 1, for example).
If yes, then add the distinct prefix to this word as a prefix to the existing word, and adjust all existing references appropriately (slow!)
If no, add word to end of list as in current step 3.
This would give you 'ragdollhouse' as the stored data in your example. It is not clear whether it would always work optimally (if you also had 'barbiedoll' and 'dollar' in the word list, for example).
I would not reinvent this wheel yet another time. There has already gone an enormous amount of manpower into compression algorithms, why not take one of the already available ones?
Here are a few good choices:
gzip for fast compression / decompression speed
bzip2 for a bit bitter compression but much slower decompression
LZMA for very high compression ratio and fast decompression (faster than bzip2 but slower than gzip)
lzop for very fast compression / decompression
If you use Java, gzip is already integrated.
It's not clear what do you want to do.
Do you want a data structure that lets to you store in a memory-conscious manner the strings while letting operations like search possible in a reasonable amount of time?
Do you just want an array of words, compressed?
In the first case, you can go for a patricia trie or a String B-Tree.
For the second case, you can just adopt some index compression techinique, like that:
If you have something like:
aaa
aaab
aasd
abaco
abad
You can compress like that:
0aaa
3b
2sd
1baco
2ad
The number is the length of the largest common prefix with the preceding string.
You can tweak that schema, for ex. planning a "restart" of the common prefix after just K words, for a fast reconstruction

How to print frequency k words from more than 1TB of data?

This question was asked in my interview. I wanted to know about the solution.
Give a text file which contains one word in each line and size of the file is more than 1TB.
The task is to print only those words whose frequency is k in the file.
I didnt answer the question fully. But i guess, I started it in the right way.
You have use hashing technique and the code take atleast O(n) time (since it has to read through the file)
Can anyone answer me who this can be done efficiently.
In general this class of problems is the topic of "Top K" or "selection" algorithms. Here's a Wikipedia article on the general topic: Wikipedia: Selection algorithm. It seems to have come into vogue with "Big data" systems generally, and perhaps to get past a previous generation of interviews which focused on sorting algorithms long past the time when every serious candidate had memorized quicksort and heapsort code.
As a practical matter this is just about the textbook problem for which "Big Data" (Hadoop, and other Map/Reduce systems) were built. If the data is distributed over N nodes then each can compute separate partial histograms (mapping their histogram function over the entire data set) and merge their results (reducing their subtotals into grand totals).
For an interview scenario this is a popular question because there is no simple trick. You could enumerate over several published approaches in the academic literature or you could tackle the problem de novo.
If the "vocabulary" is relatively small (for example there are only a few tens of thousands of words in a typical English lexicon --- so a quarter million word vocabulary is pretty extensive). In that case we expect that the count could fit in RAM for typical modern hardware. If the "words" in this data set are more extensive --- more than tens or hundreds of millions --- then such an approach is no longer feasible.
Conceivably one could try an adaptive or statistical approach. If we know that there are no major clusters of any single "word" ... that any statistically significant sample of the data set is roughly similar to any other ... than we could build our histograms and throw away those "words" (and their counts) which are significantly more rare than others. If the data will only be presented as a stream and we are not given any hard guarantee about the distribution of terms then this is not a feasible approach. But if we have the data in some random access filesystem then we could possibly sample the data set sparsely and randomly to build a very likely set of top K * M (where M is some arbitrary multiple of our desired K elements such that it will all fit in RAM).
Hashing might help us locate our counters for each word, but we have to consider the chances of collisions if we try to keep counts of only the hashes without keeping the "words" themselves in the data structure. In general I'd think a heap would be better (possibly including dropping things from the bottom of the in-memory heap out into a storage heap or trie).
I said "adaptive" earlier because one could use caching (ultimately statistical modeling) to keep the currently most frequent "words" in RAM and shuffle the least frequent out to storage (to protect against some degenerate data set where the initially frequent "words" give way to some initially rare word which becomes more frequent as one digs deeper through the data set).
While a conversational exploration of these considerations might work well in some interviews, I'd suggest being familiar with the various sections of the Wikipedia article I've cited so that you can sketch out the psuedo-code to at least one or two of them and show that you do have some academic background in this material.
Absolutely do not neglect to discuss distributed processing in an interview where "Top K" class of questions is being posed. Do so even if only to clarify the constraints of the question being posed and to acknowledge that such problems have been the driving force for modern "big data" distributed processing systems.
Also here's a question on the same general topic: StackOverflow: The Most Efficient Way To Find Top K Frequent Words In A Big Word Sequence.
Answer to this question totally depend on the size of unique words, If the unique word count is small then you can use any string->number mapping data-structure (e.g. Trie tree) to count word frequency. Complexity will be n log(m) (m is the length of individual words), easy to implement. But the way problem was described, most probably unique word count is big enough to be able to store in the memory. In that case this following approach can be used:
1 TB data means there are about 1.0*10^12 bytes of data in the input file. 1 byte is one character and lets say on average a single word has 4 characters then we have about 2.5*10^11 words. We will divide this word list in to 50k different word lists. So, every time we will read about 5m unread words from the input file, sort this 5m word list and write this sorted list in a file. We will use 50k array of numbers (lets call it Parray) to store the starting location of all sorted list in the file (initially Parray will have numbers like: 0, 5m+1, 10m+1 etc). Now read the top 50k words from all list, put them in a min heap, you will get the smallest word on top of the heap. After getting the current smallest word (lets call it cur_small) from all sorted list read out that word from each list (after this operation your Parray will point to the next smallest word in each list). Here you will get the count of cur_small - so take the decision based on K then remove all entries of cur_small from the heap and lastly add a new word from each list to the heap from where at-least one word was cur_small. Continue this process until you read out all the sorted lists. Over all complexity is n log(n)

Fuzzy matching deduplication in less than exponential time?

I have a large database (potentially in the millions of records) with relatively short strings of text (on the order of street address, names, etc).
I am looking for a strategy to remove inexact duplicates, and fuzzy matching seems to be the method of choice. My issue: many articles and SO questions deal with matching a single string against all records in a database. I am looking to deduplicate the entire database at once.
The former would be a linear time problem (comparing a value against a million other values, calculating some similarity measure each time). The latter is an exponential time problem (compare every record's values against every other record's value; for a million records, that's approx 5 x 10^11 calculations vs the 1,000,000 calculations for the former option).
I'm wondering if there is another approach than the "brute-force" method I mentioned. I was thinking of possibly generating a string to compare each record's value against, and then group strings that had roughly equal similarity measures, and then run the brute-force method through these groups. I wouldn't achieve linear time, but it might help. Also, if I'm thinking through this properly, this could miss a potential fuzzy match between strings A and B because the their similarity to string C (the generated check-string) is very different despite being very similar to each other.
Any ideas?
P.S. I realize I may have used the wrong terms for time complexity - it is a concept that I have a basic grasp of, but not well enough so I could drop an algorithm into the proper category on the spot. If I used the terms wrong, I welcome corrections, but hopefully I got my point across at least.
Edit
Some commenters have asked, given fuzzy matches between records, what my strategy was to choose which ones to delete (i.e. given "foo", "boo", and "coo", which would be marked the duplicate and deleted). I should note that I am not looking for an automatic delete here. The idea is to flag potential duplicates in a 60+ million record database for human review and assessment purposes. It is okay if there are some false positives, as long as it is a roughly predictable / consistent amount. I just need to get a handle on how pervasive the duplicates are. But if the fuzzy matching pass-through takes a month to run, this isn't even an option in the first place.
Have a look at http://en.wikipedia.org/wiki/Locality-sensitive_hashing. One very simple approach would be to divide up each address (or whatever) into a set of overlapping n-grams. This STACKOVERFLOW becomes the set {STACKO, TACKO, ACKOV, CKOVE... , RFLOW}. Then use a large hash-table or sort-merge to find colliding n-grams and check collisions with a fuzzy matcher. Thus STACKOVERFLOW and SXACKOVRVLOX will collide because both are associated with the colliding n-gram ACKOV.
A next level up in sophistication is to pick an random hash function - e.g. HMAC with an arbitrary key, and of the n-grams you find, keep only the one with the smallest hashed value. Then you have to keep track of fewer n-grams, but will only see a match if the smallest hashed value in both cases is ACKOV. There is obviously a trade-off here between the length of the n-gram and the probability of false hits. In fact, what people seem to do is to make n quite small and get higher precision by concatenating the results from more than one hash function in the same record, so you need to get a match in multiple different hash functions at the same time - I presume the probabilities work out better this way. Try googling for "duplicate detection minhash"
I think you may have mis-calculated the complexity for all the combinations. If comparing one string with all other strings is linear, this means due to the small lengths, each comparison is O(1). The process of comparing each string with every other string is not exponential but quadratic, which is not all bad. In simpler terms you are comparing nC2 or n(n-1)/2 pairs of strings, so its just O(n^2)
I couldnt think of a way you can sort them in order as you cant write an objective comparator, but even if you do so, sorting would take O(nlogn) for merge sort and since you have so many records and probably would prefer using no extra memory, you would use quick sort, which takes O(n^2) in worst case, no improvement over the worst case time in brute force.
You could use a Levenshtein transducer, which "accept[s] a query term and return[s] all terms in a dictionary that are within n spelling errors away from it". Here's a demo.
Pairwise comparisons of all the records is O(N^2) not exponential. There basically two ways to go to cut down on that complexity.
The first is blocking, where you only compare records that already have something in common that's easy to compute, like the first three letters or a common n-gram. This is basically the same idea as Locally Sensitive Hashing. The dedupe python library implements a number of blocking techniques and the documentation gives a good overview of the general approach.
In the worse case, pairwise comparisons with blocking is still O(N^2). In the best case it is O(N). Neither best or worst case are really met in practice. Typically, blocking reduces the number of pairs to compare by over 99.9%.
There are some interesting, alternative paradigms for record linkage that are not based on pairwise comparisons. These have better worse case complexity guarantees. See the work of Beka Steorts and Michael Wick.
I assume this is a one-time cleanup. I think the problem won't be having to do so many comparisons, it'll be having to decide what comparisons are worth making. You mention names and addresses, so see this link for some of the comparison problems you'll have.
It's true you have to do almost 500 billion brute-force compares for comparing a million records against themselves, but that's assuming you never skip any records previously declared a match (ie, never doing the "break" out of the j-loop in the pseudo-code below).
My pokey E-machines T6532 2.2gHz manages to do 1.4m seeks and reads per second of 100-byte text file records, so 500 billion compares would take about 4 days. Instead of spending 4 days researching and coding up some fancy solution (only to find I still need another x days to actually do the run), and assuming my comparison routine can't compute and save the keys I'd be comparing, I'd just let it brute-force all those compares while I find something else to do:
for i = 1 to LASTREC-1
seektorec(i)
getrec(i) into a
for j = i+1 to LASTREC
getrec(j) into b
if similarrecs(a, b) then [gotahit(); break]
Even if a given run only locates easy-to-define matches, hopefully it reduces the remaining unmatched records to a more reasonable smaller set for which further brute-force runs aren't so time-consuming.
But it seems unlikely similarrecs() can't independently compute and save the portions of a + b being compared, in which case the much more efficient approach is:
for i = 1 to LASTREC
getrec(i) in a
write fuzzykey(a) into scratchfile
sort scratchfile
for i = 1 to LASTREC-1
if scratchfile(i) = scratchfile(i+1) then gothit()
Most databases can do the above in one command line, if you're allowed to invoke your own custom code for computing each record's fuzzykey().
In any case, the hard part is going to be figuring out what makes two records a duplicate, per the link above.
Equivalence relations are particularly nice kinds of matching; they satisfy three properties:
reflexivity: for any value A, A ~ A
symmetry: if A ~ B, then necessarily B ~ A
transitivity: if A ~ B and B ~ C, then necessarily A ~ C
What makes these nice is that they allow you to partition your data into disjoint sets such that each pair of elements in any given set are related by ~. So, what you can do is apply the union-find algorithm to first partition all your data, then pick out a single representative element from each set in the partition; this completely de-duplicates the data (where "duplicate" means "related by ~"). Moreover, this solution is canonical in the sense that no matter which representatives you happen to pick from each partition, you get the same number of final values, and each of the final values are pairwise non-duplicate.
Unfortunately, fuzzy matching is not an equivalence relation, since it is presumably not transitive (though it's probably reflexive and symmetric). The result of this is that there isn't a canonical way to partition the data; you might find that any way you try to partition the data, some values in one set are equivalent to values from another set, or that some values from within a single set are not equivalent.
So, what behavior do you want, exactly, in these situations?

Is there an edit distance algorithm that takes "chunk transposition" into account?

I put "chunk transposition" in quotes because I don't know whether or what the technical term should be. Just knowing if there is a technical term for the process would be very helpful.
The Wikipedia article on edit distance gives some good background on the concept.
By taking "chunk transposition" into account, I mean that
Turing, Alan.
should match
Alan Turing
more closely than it matches
Turing Machine
I.e. the distance calculation should detect when substrings of the text have simply been moved within the text. This is not the case with the common Levenshtein distance formula.
The strings will be a few hundred characters long at most -- they are author names or lists of author names which could be in a variety of formats. I'm not doing DNA sequencing (though I suspect people that do will know a bit about this subject).
In the case of your application you should probably think about adapting some algorithms from bioinformatics.
For example you could firstly unify your strings by making sure, that all separators are spaces or anything else you like, such that you would compare "Alan Turing" with "Turing Alan". And then split one of the strings and do an exact string matching algorithm ( like the Horspool-Algorithm ) with the pieces against the other string, counting the number of matching substrings.
If you would like to find matches that are merely similar but not equal, something along the lines of a local alignment might be more suitable since it provides a score that describes the similarity, but the referenced Smith-Waterman-Algorithm is probably a bit overkill for your application and not even the best local alignment algorithm available.
Depending on your programming environment there is a possibility that an implementation is already available. I personally have worked with SeqAn lately, which is a bioinformatics library for C++ and definitely provides the desired functionality.
Well, that was a rather abstract answer, but I hope it points you in the right direction, but sadly it doesn't provide you with a simple formula to solve your problem.
Have a look at the Jaccard distance metric (JDM). It's an oldie-but-goodie that's pretty adept at token-level discrepancies such as last name first, first name last. For two string comparands, the JDM calculation is simply the number of unique characters the two strings have in common divided by the total number of unique characters between them (in other words the intersection over the union). For example, given the two arguments "JEFFKTYZZER" and "TYZZERJEFF," the numerator is 7 and the denominator is 8, yielding a value of 0.875. My choice of characters as tokens is not the only one available, BTW--n-grams are often used as well.
One of the easiest and most effective modern alternatives to edit distance is called the Normalized Compression Distance, or NCD. The basic idea is easy to explain. Choose a popular compressor that is implemented in your language such as zlib. Then, given string A and string B, let C(A) be the compressed size of A and C(B) be the compressed size of B. Let AB mean "A concatenated with B", so that C(AB) means "The compressed size of "A concatenated with B". Next, compute the fraction (C(AB) - min(C(A),C(B))) / max(C(A), C(B)) This value is called NCD(A,B) and measures similarity similar to edit distance but supports more forms of similarity depending on which data compressor you choose. Certainly, zlib supports the "chunk" style similarity that you are describing. If two strings are similar the compressed size of the concatenation will be near the size of each alone so the numerator will be near 0 and the result will be near 0. If two strings are very dissimilar the compressed size together will be roughly the sum of the compressed sizes added and so the result will be near 1. This formula is much easier to implement than edit distance or almost any other explicit string similarity measure if you already have access to a data compression program like zlib. It is because most of the "hard" work such as heuristics and optimization has already been done in the data compression part and this formula simply extracts the amount of similar patterns it found using generic information theory that is agnostic to language. Moreover, this technique will be much faster than most explicit similarity measures (such as edit distance) for the few hundred byte size range you describe. For more information on this and a sample implementation just search Normalized Compression Distance (NCD) or have a look at the following paper and github project:
http://arxiv.org/abs/cs/0312044 "Clustering by Compression"
https://github.com/rudi-cilibrasi/libcomplearn C language implementation
There are many other implementations and papers on this subject in the last decade that you may use as well in other languages and with modifications.
I think you're looking for Jaro-Winkler distance which is precisely for name matching.
You might find compression distance useful for this. See an answer I gave for a very similar question.
Or you could use a k-tuple based counting system:
Choose a small value of k, e.g. k=4.
Extract all length-k substrings of your string into a list.
Sort the list. (O(knlog(n) time.)
Do the same for the other string you're comparing to. You now have two sorted lists.
Count the number of k-tuples shared by the two strings. If the strings are of length n and m, this can be done in O(n+m) time using a list merge, since the lists are in sorted order.
The number of k-tuples in common is your similarity score.
With small alphabets (e.g. DNA) you would usually maintain a vector storing the count for every possible k-tuple instead of a sorted list, although that's not practical when the alphabet is any character at all -- for k=4, you'd need a 256^4 array.
I'm not sure that what you really want is edit distance -- which works simply on strings of characters -- or semantic distance -- choosing the most appropriate or similar meaning. You might want to look at topics in information retrieval for ideas on how to distinguish which is the most appropriate matching term/phrase given a specific term or phrase. In a sense what you're doing is comparing very short documents rather than strings of characters.

How Do I Choose Between a Hash Table and a Trie (Prefix Tree)?

So if I have to choose between a hash table or a prefix tree what are the discriminating factors that would lead me to choose one over the other. From my own naive point of view it seems as though using a trie has some extra overhead since it isn't stored as an array but that in terms of run time (assuming the longest key is the longest english word) it can be essentially O(1) (in relation to the upper bound). Maybe the longest english word is 50 characters?
Hash tables are instant look up once you get the index. Hashing the key to get the index however seems like it could easily take near 50 steps.
Can someone provide me a more experienced perspective on this? Thanks!
Advantages of tries:
The basics:
Predictable O(k) lookup time where k is the size of the key
Lookup can take less than k time if it's not there
Supports ordered traversal
No need for a hash function
Deletion is straightforward
New operations:
You can quickly look up prefixes of keys, enumerate all entries with a given prefix, etc.
Advantages of linked structure:
If there are many common prefixes, the space they require is shared.
Immutable tries can share structure. Instead of updating a trie in place, you can build a new one that's different only along one branch, elsewhere pointing into the old trie. This can be useful for concurrency, multiple simultaneous versions of a table, etc.
An immutable trie is compressible. That is, it can share structure on the suffixes as well, by hash-consing.
Advantages of hashtables:
Everyone knows hashtables, right? Your system will already have a nice well-optimized implementation, faster than tries for most purposes.
Your keys need not have any special structure.
More space-efficient than the obvious linked trie structure (see comments below)
It all depends on what problem you're trying to solve. If all you need to do is insertions and lookups, go with a hash table. If you need to solve more complex problems such as prefix-related queries, then a trie might be the better solution.
Everyone knows hash table and its uses but it is not exactly constant look up time , it depends on how big the hash table is , the computational complexity of the hash function.
Creating huge hash tables for efficient lookup is not an elegant solution in most of the industrial scenarios where even small latency/scalability matters (e.g.: high frequency trading). You have to care about the data structures to be optimized for space it takes up in memory too to reduce cache miss.
A very good example where trie better suits the requirements is messaging middleware . You have a million subscribers and publishers of messages to various categories (in JMS terms - Topics or exchanges) , in such cases if you want to filter out messages based on topics (which are actually strings) , you definitely do not want create hash table for the million subscriptions with million topics . A better approach is store the topics in trie , so when filtering is done based on topic match , its complexity is independent of number of topics/subscriptions/publishers (only depends on the length of string). I like it because you can be creative with this data structure to optimize space requirements and hence have lower cache miss.
Use a tree:
If you need auto complete feature
Find all words beginning with 'a' or 'axe' so on.
A suffix tree is a special form of a tree. Suffix trees have a whole list of advantages that hash cannot cover.
Insertion and lookup on a trie is linear with the lengh of the input string O(s).
A hash will give you a O(1) for lookup ans insertion, but first you have to calculate the hash based on the input string which again is O(s).
Conclussion, the asymptotic time complexity is linear in both cases.
The trie has some more overhead from data perspective, but you can choose a compressed trie which will put you again, more or less on a tie with the hash table.
To break the tie ask yourself this question: Do i need to lookup for full words only? Or do I need to return all words matching a prefix? (As in a predictive text input system ). For the first case, go for a hash. It is simpler and cleaner code. Easier to test and maintain. For a more ellaborated use case where prefixes or sufixes matter, go for a trie.
And if you do it just for fun, implementing a trie would put a Sunday afternoon to a good use.
There's something I haven't seen anyone mention explicitly that I think is important to keep in mind. Both hash tables and tries of various kinds will typically have O(k) operations, where k is the length of the string in bits (or equivalently in chars).
This is assuming you have a good hash function. If you don't want "farm" and "farm animals" to hash to the same value, then the hash function will have to use all the bits of the key, and so hashing "farm animals" should take about twice as long as "farm" (unless you're in some sort of rolling hash scenario, but there are somewhat similar operation-saving scenarios with tries too). And with a vanilla trie, it's clear why inserting "farm animals" will take about twice as long as just "farm". In the long run it's true with compressed tries as well.
HashTable implementation is space efficient as compared to basic Trie implementation. But with strings, ordering is necessary in most of the practical applications. But HashTable totally disturbs the lexographical order. Now, if your application is doing operations based on lexographical order (like partial search, all strings with given prefix, all words in sorted order), you should use Tries. For only lookup, HashTable should be used (as arguably, it gives minimum lookup time).
P.S.: Other than these, Ternary Search Trees (TSTs) would be an excellent choice. Its lookup time is more than HashTable, but is time-efficient in all other operations. Also, its more space efficient than tries.
Some (usually embedded, real-time) applications require that the processing time be independent of the data. In that case, a hash table can guarantee a known execution time, while a trie varies based on the data.

Resources