What would be a good data structure to store a dictionary (of words) to optimize the search time?

Provided a list of valid words and a search word, I want to find whether the search word is a valid word or not, ALLOWING 2 typo characters.
What would be a good data structure to store a dictionary of words (assume it contains a million words), and what algorithm would find whether a word exists in the dictionary (allowing 2 typo characters)?
If no typo characters were allowed, then a Trie would be a good way to store the words, but I'm not sure it stays the best way to store the dictionary when typos are allowed. I'm also not sure what the complexity of a backtracking algorithm (searching for a word in the Trie while allowing 2 typos) would be. Any idea about it?

You might want to check out the Directed Acyclic Word Graph, or DAWG. It has more of an automaton structure than a tree or graph structure. Multiple possibilities from one place may provide you with your solution.

If there is no need to also store all mistyped words, I would consider a two-step approach for this problem.
1. Build a set containing hashes of all valid words (not including typos). So we are probably talking about some 10,000 entries here, which should still allow quite fast lookups with a binary search. If the hash of a word is found in the set, it is typed correctly.
2. If a word's hash is not found in the set, the word is probably mistyped. So calculate the Damerau-Levenshtein distance between the word and all known words to figure out what the user might have meant. To gain some performance here, modify the DL algorithm to abort the calculation as soon as the distance gets bigger than your allowed threshold of 2 typos.
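As a rough illustration of that two-step idea (not the original poster's code), here is a sketch in Java. It uses a plain HashSet for the exact-match step and, for brevity, the restricted (optimal string alignment) variant of Damerau-Levenshtein with an early abort once a whole row of the DP table exceeds the threshold; all names are illustrative.

import java.util.*;

public class TypoLookup {
    // Step 1: exact lookup; step 2: scan the dictionary with a bounded edit distance.
    static boolean isAcceptable(String word, Set<String> dictionary, int maxTypos) {
        if (dictionary.contains(word)) return true;
        for (String candidate : dictionary)
            if (boundedDistance(word, candidate, maxTypos) <= maxTypos) return true;
        return false;
    }

    // Restricted Damerau-Levenshtein (substitution, insertion, deletion,
    // adjacent transposition), aborting once no cell in the current row can
    // still lead to a distance within 'max'.
    static int boundedDistance(String a, String b, int max) {
        if (Math.abs(a.length() - b.length()) > max) return max + 1;
        int[] prev2 = null;
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            int rowMin = cur[0];
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(prev[j] + 1, cur[j - 1] + 1), prev[j - 1] + cost);
                if (i > 1 && j > 1 && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1))
                    cur[j] = Math.min(cur[j], prev2[j - 2] + 1); // transposition
                rowMin = Math.min(rowMin, cur[j]);
            }
            if (rowMin > max) return max + 1; // early abort
            prev2 = prev;
            prev = cur;
        }
        return prev[b.length()];
    }
}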

Related

Compressing words into one word consisting of them as subwords [duplicate]

I bet somebody has solved this before, but my searches have come up empty.
I want to pack a list of words into a buffer, keeping track of the starting position and length of each word. The trick is that I'd like to pack the buffer efficiently by eliminating the redundancy.
Example: doll dollhouse house
These can be packed into the buffer simply as dollhouse, remembering that doll is four letters starting at position 0, dollhouse is nine letters at 0, and house is five letters at 3.
What I've come up with so far is:
1. Sort the words longest to shortest: (dollhouse, house, doll).
2. Scan the buffer to see if the string already exists as a substring; if so, note the location.
3. If it doesn't already exist, add it to the end of the buffer.
Since long words often contain shorter words, this works pretty well, but it should be possible to do significantly better. For example, if I extend the word list to include ragdoll, then my algorithm comes up with dollhouseragdoll which is less efficient than ragdollhouse.
This is a preprocessing step, so I'm not terribly worried about speed. O(n^2) is fine. On the other hand, my actual list has tens of thousands of words, so O(n!) is probably out of the question.
As a side note, this storage scheme is used for the data in the 'name' table of a TrueType font, cf. http://www.microsoft.com/typography/otspec/name.htm
This is the shortest superstring problem: find the shortest string that contains a set of given strings as substrings. According to this IEEE paper (which you may not have access to unfortunately), solving this problem exactly is NP-complete. However, heuristic solutions are available.
As a first step, you should find all strings that are substrings of other strings and delete them (of course you still need to record their positions relative to the containing strings somehow). These fully-contained strings can be found efficiently using a generalised suffix tree.
Then, by repeatedly merging the two strings having longest overlap, you are guaranteed to produce a solution whose length is not worse than 4 times the minimum possible length. It should be possible to find overlap sizes quickly by using two radix trees as suggested by a comment by Zifre on Konrad Rudolph's answer. Or, you might be able to use the generalised suffix tree somehow.
I'm sorry I can't dig up a decent link for you -- there doesn't seem to be a Wikipedia page, or any publicly accessible information on this particular problem. It is briefly mentioned here, though no suggested solutions are provided.
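A minimal sketch of that greedy heuristic follows (illustrative only: it computes overlaps naively instead of using the suffix/radix-tree machinery mentioned above, and it omits the position bookkeeping the question needs).

import java.util.*;

public class GreedySuperstring {

    // Length of the longest suffix of a that is also a prefix of b.
    static int overlap(String a, String b) {
        for (int len = Math.min(a.length(), b.length()); len > 0; len--)
            if (a.regionMatches(a.length() - len, b, 0, len)) return len;
        return 0;
    }

    static String pack(Collection<String> words) {
        List<String> pool = new ArrayList<>(new LinkedHashSet<>(words));
        // First drop strings fully contained in some other string.
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < pool.size(); i++) {
            boolean contained = false;
            for (int j = 0; j < pool.size(); j++)
                if (i != j && pool.get(j).length() > pool.get(i).length()
                        && pool.get(j).contains(pool.get(i))) {
                    contained = true;
                    break;
                }
            if (!contained) kept.add(pool.get(i));
        }
        pool = kept;
        // Greedily merge the pair with the largest overlap until one string remains.
        while (pool.size() > 1) {
            int bestI = 0, bestJ = 1, bestLen = -1;
            for (int i = 0; i < pool.size(); i++)
                for (int j = 0; j < pool.size(); j++)
                    if (i != j) {
                        int len = overlap(pool.get(i), pool.get(j));
                        if (len > bestLen) { bestLen = len; bestI = i; bestJ = j; }
                    }
            String a = pool.get(bestI), b = pool.get(bestJ);
            String merged = a + b.substring(bestLen);
            pool.remove(a);
            pool.remove(b);
            pool.add(merged);
        }
        return pool.isEmpty() ? "" : pool.get(0);
    }

    public static void main(String[] args) {
        // Prints "ragdollhouse" for the example from the question.
        System.out.println(pack(List.of("doll", "dollhouse", "house", "ragdoll")));
    }
}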
I think you can use a Radix Tree. It costs some memory because of pointers to leaves and parents, but it is easy to match strings in O(k), where k is the length of the longest string.
My first thought here is: use a data structure to determine common prefixes and suffixes of your strings. Then sort the words taking these prefixes and suffixes into account. This would result in your desired ragdollhouse.
Looks similar to the Knapsack problem, which is NP-complete, so there is no known efficient exact algorithm.
I did a lab back in college where we were tasked with implementing a simple compression program.
What we did was sequentially apply these techniques to text:
BWT (Burrows-Wheeler transform): helps reorder letters into sequences of identical letters (hint: there are mathematical substitutions for getting the letters instead of actually doing the rotations)
MTF (Move to front transform): Rewrites the sequence of letters as a sequence of indices of a dynamic list.
Huffman encoding: A form of entropy encoding that constructs a variable-length code table in which shorter codes are given to frequently encountered symbols and longer codes are given to infrequently encountered symbols
Here, I found the assignment page.
To get back your original text, you do (1) Huffman decoding, (2) inverse MTF, and then (3) inverse BWT. There are several good resources on all of this on the Interwebs.
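To make the middle step concrete, here is a small sketch of the move-to-front transform in Java (a byte alphabet is assumed for illustration; the BWT and Huffman stages are left out).

import java.util.*;

public class MoveToFront {
    static int[] encode(byte[] data) {
        List<Byte> table = new ArrayList<>();
        for (int b = 0; b < 256; b++) table.add((byte) b);
        int[] out = new int[data.length];
        for (int i = 0; i < data.length; i++) {
            int idx = table.indexOf(data[i]);
            out[i] = idx;
            table.add(0, table.remove(idx)); // move the symbol to the front
        }
        return out;
    }

    static byte[] decode(int[] indices) {
        List<Byte> table = new ArrayList<>();
        for (int b = 0; b < 256; b++) table.add((byte) b);
        byte[] out = new byte[indices.length];
        for (int i = 0; i < indices.length; i++) {
            byte sym = table.get(indices[i]);
            out[i] = sym;
            table.add(0, table.remove(indices[i])); // mirror the encoder's table update
        }
        return out;
    }
}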
Refine step 3.
Look through the current list and see whether any word in the list starts with a suffix of the current word. (You might want to require the suffix to be longer than some length - longer than 1, for example.)
If yes, then prepend the current word's distinct prefix to that existing word, and adjust all existing references appropriately (slow!)
If no, add the word to the end of the list as in the current step 3.
This would give you 'ragdollhouse' as the stored data in your example. It is not clear whether it would always work optimally (if you also had 'barbiedoll' and 'dollar' in the word list, for example).
I would not reinvent this wheel yet another time. An enormous amount of effort has already gone into compression algorithms; why not use one of the already available ones?
Here are a few good choices:
gzip for fast compression / decompression speed
bzip2 for a bit better compression but much slower decompression
LZMA for very high compression ratio and fast decompression (faster than bzip2 but slower than gzip)
lzop for very fast compression / decompression
If you use Java, gzip is already integrated.
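For example, with the JDK's built-in java.util.zip support, writing a gzip-compressed word list takes only a few lines (the file name and contents here are just placeholders):

import java.io.*;
import java.util.zip.GZIPOutputStream;

public class GzipWords {
    public static void main(String[] args) throws IOException {
        // Wrap a file stream in a GZIPOutputStream and write text through it.
        try (Writer out = new OutputStreamWriter(
                new GZIPOutputStream(new FileOutputStream("words.gz")), "UTF-8")) {
            out.write("doll\ndollhouse\nhouse\nragdoll\n");
        }
    }
}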
It's not clear what you want to do.
Do you want a data structure that stores the strings in a memory-conscious manner while still making operations like search possible in a reasonable amount of time?
Do you just want an array of words, compressed?
In the first case, you can go for a patricia trie or a String B-Tree.
For the second case, you can just adopt an index compression technique, like this:
If you have something like:
aaa
aaab
aasd
abaco
abad
You can compress it like this:
0aaa
3b
2sd
1baco
3d
The number is the length of the largest common prefix with the preceding string.
You can tweak that scheme, for example by planning a "restart" of the common prefix after every K words, for faster reconstruction.
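Here is a small sketch of that prefix-compression (front-coding) scheme in Java; it assumes the words are sorted and contain no digits, and it leaves out the periodic "restart" tweak.

import java.util.*;

public class FrontCoding {
    static List<String> encode(List<String> sortedWords) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String w : sortedWords) {
            // Length of the longest prefix shared with the previous word.
            int shared = 0;
            int max = Math.min(prev.length(), w.length());
            while (shared < max && prev.charAt(shared) == w.charAt(shared)) shared++;
            out.add(shared + w.substring(shared));
            prev = w;
        }
        return out;
    }

    static List<String> decode(List<String> encoded) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (String e : encoded) {
            int i = 0;
            while (i < e.length() && Character.isDigit(e.charAt(i))) i++;
            int shared = Integer.parseInt(e.substring(0, i));
            prev = prev.substring(0, shared) + e.substring(i); // rebuild the full word
            out.add(prev);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> words = List.of("aaa", "aaab", "aasd", "abaco", "abad");
        System.out.println(encode(words)); // [0aaa, 3b, 2sd, 1baco, 3d]
    }
}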

Finding a list of adjacent words between two words

I am working on a programming challenge for practice and am having trouble finding a good data structure/algorithm to use to implement a solution.
Background:
Call two words “adjacent” if you can change one word into the other by adding, deleting, or changing a single letter.
A “word list” is an ordered list of unique words where successive words are adjacent.
The problem:
Write a program which takes two words as inputs and walks through the dictionary and creates a list of words between them.
Examples:
hate → love: hate, have, hove, love
dogs → wolves: dogs, does, doles, soles, solves, wolves
man → woman: man, ran, roan, roman, woman
flour → flower: flour, lour, dour, doer, dower, lower, flower
I am not quite sure how to approach this problem; my first attempt involved creating permutations of the first word and then trying to replace letters in it. My second thought was maybe something like a suffix tree.
Any thoughts or ideas toward at least breaking the problem down would be appreciated. Keep in mind that this is not homework, but a programming challenge I am working on myself.
This puzzle was first stated by Charles Dodgson, who wrote Alice's Adventures in Wonderland under his pseudonym Lewis Carroll.
The basic idea is to create a graph structure in which the nodes are words in a dictionary and the edges connect words that are one letter apart, then do a breadth-first search through the graph, starting at the first word, until you find the second word.
I discuss this problem, and give an implementation that includes a clever algorithm for identifying "adjacent to" words, at my blog.
I have done this myself and used it to create a (not very good) Windows game.
I used the approach recommended by others of implementing this as a graph, where each node is a word and they are connected if they differ in one letter. This means you can use well known graph theory results to find paths between words (eg simple recursion where knowing the words at distance 1 allows you to find the words at distance 2).
The tricky part is building up the graph. The bad news is that it is O(n^2). The good news is that it doesn't have to be done in real time - rather than your program reading the dictionary words from a file, it reads in the data structure you baked earlier.
The key insight is that the order doesn't matter, in fact it gets in the way. You need to construct another form in which to hold the words which strips out the order information and allows words to be compared more easily. You can do this in O(n). You have lots of choices; I will show two.
For word puzzles I quite often use an encoding which I call an anagram dictionary. A word is represented by another word which has the same letters but in alphabetical order. So "cars" becomes "acrs". Both "lists" and "slits" become "ilsst". This is a better structure for comparison than the original word, but much better comparisons exist (however, it is a very useful structure for other word puzzles).
Letter counts. An array of 26 values which show the frequency of that letter in the word. So for "cars" it starts 1,0,1,0,0... as there is one "a" and one "c". Hold an external list of the non-zero entries (which letters appear in the word) so you only have to check 5 or 6 values at most instead of 26. Very simple to compare two words held in this form quickly by ensuring at most two counts are different. This is the one I would use.
So, this is how I did it.
I wrote a program which implemented the data structure up above.
It had a class called WordNode. This contains the original word; a List of all other WordNodes which are one letter different; an array of 26 integers giving the frequency of each letter; and a list of the non-zero entries in the letter-count array.
The initialiser populates the letter frequency array and the corresponding list of non-zero entries. It sets the list of connected WordNodes to empty.
After I have created an instance of the WordNode class for every word, I run a compare method which checks whether the frequency counts differ in no more than two places. That normally takes slightly fewer comparisons than there are letters in the words; not too bad. If they differ in exactly two places the words differ by one letter, and I add that WordNode into the list of WordNodes differing in only one letter.
This means we now have a graph of all the words one letter different.
You can export either the whole data structure or strip out the letter frequency and other stuff you don't need and save it (I used serialized XML. If you go that way, make sure you check it handles the List of WordNodes as references and not embedded objects).
Your actual game then only has to read in this data structure (instead of a dictionary) and it can find the words one letter different with a direct lookup, in essentially zero time.
Pity my game was crap.
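A sketch of that frequency-count comparison in Java, for the same-length "change one letter" case. The count check is used as the fast filter described above, and a direct position-by-position comparison then confirms the single substitution (a verification step the answer glosses over); lowercase a-z is assumed and the names are illustrative.

public class OneLetterApart {
    static int[] counts(String w) {
        int[] c = new int[26];
        for (char ch : w.toCharArray()) c[ch - 'a']++;
        return c;
    }

    static boolean oneLetterDifferent(String a, String b) {
        if (a.length() != b.length()) return false;
        int[] ca = counts(a), cb = counts(b);
        int differing = 0;
        for (int i = 0; i < 26; i++)
            if (ca[i] != cb[i]) differing++;
        if (differing != 2) return false;        // cheap frequency-count filter
        int mismatches = 0;
        for (int i = 0; i < a.length(); i++)     // confirm exactly one substitution
            if (a.charAt(i) != b.charAt(i)) mismatches++;
        return mismatches == 1;
    }
}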
I don't know if this is the type of solution that you're looking for, but an active area of research is in constructing "edit distance 1" dictionaries for quickly looking up adjacent words (to use your parlance) for search term suggestions, data entry correction, and bioinformatics (e.g. finding similarities in chromosomes). See for example this research paper. Short of indexing your entire dictionary, at the very least this might suggest a search heuristic that you can use.
The simplest (recursive) algorithm I can think of (well, the only one I can think of at the moment) is:
Initialize an empty blacklist
Take all words from your dictionary that are a valid step from the current word
Remove the ones that are in the blacklist
Check if you can find the target word.
If not, repeat the algorithm for all words you found in the last step
If yes, you found it. Return from the recursion, printing all words in the path you found.
Maybe someone with a bit more time can add the ruby code for this?
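Not Ruby, but here is a rough Java sketch of that search, done as a breadth-first search so the first path found is also a shortest one. It assumes the word-adjacency graph from the other answers has already been built; the names are illustrative.

import java.util.*;

public class WordLadder {
    static List<String> shortestPath(String start, String target,
                                     Map<String, List<String>> neighbors) {
        Map<String, String> parent = new HashMap<>(); // visited set doubles as the "blacklist"
        Deque<String> queue = new ArrayDeque<>();
        parent.put(start, null);
        queue.add(start);
        while (!queue.isEmpty()) {
            String word = queue.poll();
            if (word.equals(target)) {
                // Walk the parent links back to the start to recover the path.
                LinkedList<String> path = new LinkedList<>();
                for (String w = word; w != null; w = parent.get(w)) path.addFirst(w);
                return path;
            }
            for (String next : neighbors.getOrDefault(word, List.of()))
                if (!parent.containsKey(next)) {
                    parent.put(next, word);
                    queue.add(next);
                }
        }
        return List.of(); // no ladder between the two words
    }
}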
Try this
x = 'hate'
puts x = x.next until x == 'love'
And if you couple it with a dictionary lookup, you will get a list of all the valid words in between that are in that dictionary.

How can we optimise the creation of a trie if we know the input is in alphabetical order?

I am implementing a prefix tree, with a standard insertion mechanism. If we know we will be given a list of words in alphabetical order, is there any way we can change the insertion to skip a few steps? I am coding in Java, although I'm not looking for code in any particular language. I have considered adding the Nodes for each word to a queue, then hopping backwards through it until we're at a prefix of the next word, but this may be circumventing the whole point of the prefix tree!
Any thoughts on something like this? I'm finding it hard to come up with an implementation that's of any use unless the input is many many very similar words ("aaaaaaaaaab", "aaaaaaaaaac", "aaaaaaaaaad", ...) or something. But even then doing a string comparison on the prefixes is probably a similar cost to just using the prefix tree normally.
There is no way that you can avoid looking at all the characters in the input strings from which you're building the tree. If there was a way to do this, then I could make your algorithm incorrect. In particular, suppose that there is a word w and you don't look at one of its characters (say, the kth character). Then when your algorithm runs and tries to place the word somewhere in the trie, it must be able to place it without knowing all the characters. Therefore, if I change the kth character of the word to something else, your algorithm would put it in exactly the same place as before, which is incorrect because one of the characters in the word won't be correct.
Since the normal algorithm for constructing a trie takes time proportional to the number of characters in the input, you won't be able to asymptotically outperform it without doing some crazy tricks like parallelizing the construction code or packing the characters into machine words and hitting them with your Hammer of Bit Hackery.
However, you could potentially get a constant factor speedup. Following large numbers of pointers in a linked structure can be slow due to cache performance, so you could speed up the algorithm by minimizing the number of pointers you have to follow. One thing you could do would be to maintain the position of the end of the last string that you inserted, along with a list (preferably as a dynamic array) of nodes tracing the path back up to the root. To insert a new string, you could do the following:
Find the longest prefix of the string that matches the last string you inserted.
Jump to the pointer in the array marking where that would take you.
Trace the rest of the path down as normal, adding all the nodes that you trace out to the array and overwriting the previous pointers.
This way, if you insert a lot of words with a common prefix of a reasonable length, you can avoid doing a bunch of pointer-chasing back through a shared part of the structure. This could conceivably give you a performance boost if you have lots of words with the same prefix. It's not asymptotically better than before (and, in fact, uses more memory), but the savings from not following pointers could add up. I haven't tested this, but it seems like it might work.
Hope this helps!
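As a rough illustration of that last-path idea (a sketch, not a tested implementation), here is a Java version that caches the node path of the previously inserted word and re-enters the trie at the end of the shared prefix. Words must arrive in sorted order for the shortcut to pay off.

import java.util.*;

public class SortedTrieBuilder {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean isWord;
    }

    private final Node root = new Node();
    private String lastWord = "";
    // lastPath.get(i) is the node reached after the first i characters of lastWord.
    private final List<Node> lastPath = new ArrayList<>();

    public SortedTrieBuilder() {
        lastPath.add(root);
    }

    void insert(String word) {
        // 1. Find the longest prefix shared with the previously inserted word.
        int shared = 0;
        int max = Math.min(lastWord.length(), word.length());
        while (shared < max && lastWord.charAt(shared) == word.charAt(shared)) shared++;

        // 2. Re-enter the trie at the end of that prefix instead of at the root.
        Node cur = lastPath.get(shared);
        lastPath.subList(shared + 1, lastPath.size()).clear();

        // 3. Trace the rest of the word down as normal, recording the new path.
        for (int i = shared; i < word.length(); i++) {
            cur = cur.children.computeIfAbsent(word.charAt(i), c -> new Node());
            lastPath.add(cur);
        }
        cur.isWord = true;
        lastWord = word;
    }
}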

Algorithm to find multiple string matches

I'm looking for suggestions for an efficient algorithm for finding all matches in a large body of text. Terms to search for will be contained in a list and can have 1000+ possibilities. The search terms may be 1 or more words.
Obviously I could make multiple passes through the text comparing against each search term. Not too efficient.
I've thought about ordering the search terms and combining common sub-segments. That way I could eliminate large numbers of terms quickly. Language is C++ and I can use boost.
An example of search terms could be a list of Fortune 500 company names.
Ideas?
Don't reinvent the wheel
This problem has been intensively researched. Curiously, the best algorithms for searching ONE pattern/string do not extrapolate easily to multi-string matching.
The "grep" family implement the multi-string search in a very efficient way. If you can use them as external programs, do it.
In case you really need to implement the algorithm, I think the fastest way is to reproduce what agrep does (agrep excels in multi-string matching!). Here are the source and executable files.
And here you will find a paper describing the algorithms used, the theoretical background, and a lot of information and pointers about string matching.
A note of caution: multiple-string matching has been heavily researched by people like Knuth, Boyer, Moore, Baeza-Yates, and others. If you need a really fast algorithm, don't hesitate to stand on their broad shoulders. Don't reinvent the wheel.
As in the case of single patterns, there are several algorithms for multiple-pattern matching, and you will have to find the one that fits your purpose best. The paper A fast algorithm for multi-pattern searching (archived copy) does a review of most of them, including Aho-Corasick (which is kind of the multi-pattern version of the Knuth-Morris-Pratt algorithm, with linear complexity) and Commentz-Walter (a combination of Boyer-Moore and Aho-Corasick), and introduces a new one, which uses ideas from Boyer-Moore for the task of matching multiple patterns.
An alternative, hash-based algorithm not mentioned in that paper is the Rabin-Karp algorithm, which has a worse worst-case complexity than the other algorithms, but compensates by reducing the linear factor via hashing. Which one is better ultimately depends on your use case. You may need to implement several of them and compare them in your application if you want to choose the fastest one.
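For reference, here is a compact (illustrative, not production-grade) Aho-Corasick sketch in Java: build a trie of the patterns, add failure links with a breadth-first pass, then scan the text once, reporting every match with its starting position.

import java.util.*;

class AhoCorasick {
    static class Node {
        Map<Character, Node> next = new HashMap<>();
        Node fail;                                    // failure link
        List<String> output = new ArrayList<>();      // patterns ending here
    }

    private final Node root = new Node();

    void addPattern(String p) {
        Node cur = root;
        for (char c : p.toCharArray())
            cur = cur.next.computeIfAbsent(c, k -> new Node());
        cur.output.add(p);
    }

    void build() {
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node cur = queue.poll();
            for (Map.Entry<Character, Node> e : cur.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = cur.fail;
                while (f != null && !f.next.containsKey(c)) f = f.fail;
                child.fail = (f == null) ? root : f.next.get(c);
                child.output.addAll(child.fail.output); // inherit matches via the failure link
                queue.add(child);
            }
        }
    }

    // Returns (starting position, pattern) pairs for every match in the text.
    List<Map.Entry<Integer, String>> search(String text) {
        List<Map.Entry<Integer, String>> hits = new ArrayList<>();
        Node cur = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (cur != root && !cur.next.containsKey(c)) cur = cur.fail;
            cur = cur.next.getOrDefault(c, root);
            for (String p : cur.output)
                hits.add(Map.entry(i - p.length() + 1, p));
        }
        return hits;
    }
}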
Assuming that the large body of text is static English text and you need to match whole words, you can try the following (you should really clarify what exactly counts as a 'match', what kind of text you are looking at, etc. in your question).
First preprocess the whole document into a Trie or a DAWG.
A trie/DAWG has the following property:
Given a trie/dawg and a search term of length K, you can in O(K) time lookup the data associated with the word (or tell if there is no match).
Using a DAWG could save you more space as compared to a trie. Tries exploit the fact that many words will have a common prefix and DAWGs exploit the common prefix as well as the common suffix property.
In the trie, also maintain the list of positions of each word. For example, if the text is
That is that and so it is.
The node for the last t in that will have the list {1,3} and the node for s in is will have the list {2,7} associated.
Now when you get a single word search term, you can walk the trie and get the list of matches for that word easily.
If you get a multiple word search term, you can do the following.
Walk the trie with the first word in the search term. Get the list of matches and insert into a hashTable H1.
Now walk the trie with the second word in the search term. Get the list of matches. For each match position x, check if x-1 exists in the HashTable H1. If so, add x to new hashtable H2.
Walk the trie with the third word, and get its list of matches. For each match position y, check if y-1 exists in H2; if so, add y to a new hashtable H3.
Continue in this fashion.
At the end you get a list of matches for the search phrase, which give the positions of the last word of the phrase.
You could potentially optimize the phrase matching step by maintaining a sorted list of positions and doing a binary search: e.g., for each key k in H2, binary search for k+1 in the sorted position list for search term 3 and add k+1 to H3 if you find it, and so on.
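A stripped-down sketch of that phrase-matching step in Java, using a HashMap from word to positions in place of the trie/DAWG for brevity (positions are 0-based word indices; tokenization and case handling are left out, and the names are illustrative):

import java.util.*;

public class PhraseSearch {
    static List<Integer> findPhrase(String[] textWords, String[] phrase) {
        // Index: word -> positions where it occurs in the text.
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < textWords.length; i++)
            index.computeIfAbsent(textWords[i], k -> new ArrayList<>()).add(i);

        // Start with the positions of the first word, then repeatedly keep only
        // positions whose predecessor survived the previous round.
        Set<Integer> current = new HashSet<>(index.getOrDefault(phrase[0], List.of()));
        for (int w = 1; w < phrase.length; w++) {
            Set<Integer> next = new HashSet<>();
            for (int pos : index.getOrDefault(phrase[w], List.of()))
                if (current.contains(pos - 1)) next.add(pos);
            current = next;
        }
        // Returned positions are those of the last word of the phrase.
        List<Integer> result = new ArrayList<>(current);
        Collections.sort(result);
        return result;
    }
}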
An optimal solution for this problem is to use a suffix tree (or a suffix array). It's essentially a trie of all suffixes of a string. For a text of length O(N), this can be built in O(N).
Then all k occurrences of a string of length m can be answered optimally in O(m + k).
Suffix trees can also be used to efficiently find e.g. the longest palindrome, the longest common substring, the longest repeated substring, etc.
This is the typical data structure to use when analyzing DNA strings which can be millions/billions of bases long.
See also
Wikipedia/Suffix tree
Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology (Dan Gusfield).
So you have lots of search terms and want to see if any of them are in the document?
Purely algorithmically, you could sort all your possibilities in alphabetical order, join them with pipes, and use them as a regular expression, if the regex engine will look at /ant|ape/ and properly short-circuit the a in "ape" if it didn't find it in "ant". If not, you could do a "precompile" of a regex and "squish" the results down to their minimum overlap. I.e. in the above case /a(nt|pe)/ and so on, recursively for each letter.
However, doing the above is pretty much like putting all your search strings in a 26-ary tree (26 characters, more if also numbers). Push your strings onto the tree, using one level of depth per character of length.
You can do this with your search terms to make a hyper-fast "does this word match anything in my list of search terms" check if your list of search terms is large.
You could theoretically do the reverse as well -- pack your document into the tree and then use the search terms on it -- if your document is static and the search terms change a lot.
Depends on how much optimization you need...
Are the search terms single words that you are looking for, or can they be full sentences too?
If it's only words, then I would suggest building a Red-Black Tree from all the words, and then searching for each word in the tree.
If it could be sentences, then it could get a lot more complex... (?)

Efficient word scramble algorithm

I'm looking for an efficient algorithm for scrambling a set of letters into a permutation containing the maximum number of words.
For example, say I am given the list of letters: {e, e, h, r, s, t}. I need to order them in such a way as to contain the maximum number of words. If I order those letters into "theres", it contains the words "the", "there", "her", "here", and "ere". So that example could have a score of 5, since it contains 5 words. I want to order the letters in such a way as to have the highest score (contain the most words).
A naive algorithm would be to try and score every permutation. I believe this is O(n!), so 720 different permutations would be tried for just the 6 letters above (including some duplicates, since the example has e twice). For more letters, the naive solution quickly becomes impossible, of course.
The algorithm doesn't have to actually produce the very best solution, but it should find a good solution in a reasonable amount of time. For my application, simply guessing (Monte Carlo) at a few million permutations works quite poorly, so that's currently the mark to beat.
I am currently using the Aho-Corasick algorithm to score permutations. It searches for each word in the dictionary in just one pass through the text, so I believe it's quite efficient. This also means I have all the words stored in a trie, but if another algorithm requires different storage that's fine too. I am not worried about setting up the dictionary, just the run time of the actual ordering and searching. Even a fuzzy dictionary could be used if needed, like a Bloom Filter.
For my application, the list of letters given is about 100, and the dictionary contains over 100,000 entries. The dictionary never changes, but several different lists of letters need to be ordered.
I am considering trying a path finding algorithm. I believe I could start with a random letter from the list as a starting point. Then each remaining letter would be used to create a "path." I think this would work well with the Aho-Corasick scoring algorithm, since scores could be built up one letter at a time. I haven't tried path finding yet though; maybe it's not even a good idea? I don't know which path finding algorithm might be best.
Another algorithm I thought of also starts with a random letter. Then the dictionary trie would be searched for "rich" branches containing the remaining letters. Dictionary branches containing unavailable letters would be pruned. I'm a bit foggy on the details of how this would work exactly, but it could completely eliminate scoring permutations.
Here's an idea, inspired by Markov Chains:
Precompute the letter transition probabilities in your dictionary. Create a table with the probability that some letter X is followed by another letter Y, for all letter pairs, based on the words in the dictionary.
Generate permutations by randomly choosing each next letter from the remaining pool of letters, based on the previous letter and the probability table, until all letters are used up. Run this many times.
You can experiment by increasing the "memory" of your transition table - don't look only one letter back, but say 2 or 3. This increases the size of the probability table, but gives you a better chance of creating valid words.
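A rough sketch of that idea in Java (single-letter memory only, lowercase a-z assumed, names illustrative): learn the transition counts from the dictionary, then repeatedly draw the next letter from the remaining pool weighted by the table.

import java.util.*;

public class MarkovScramble {
    // transitions[p][c] = P(letter c follows letter p); row 26 means "start of word".
    private final double[][] transitions = new double[27][26];
    private final Random rng = new Random();

    MarkovScramble(Collection<String> dictionary) {
        for (String word : dictionary) {
            int prev = 26;
            for (char ch : word.toCharArray()) {
                int cur = ch - 'a';
                transitions[prev][cur]++;
                prev = cur;
            }
        }
        for (double[] row : transitions) {
            double sum = 0;
            for (double v : row) sum += v;
            if (sum > 0) for (int i = 0; i < 26; i++) row[i] /= sum; // counts -> probabilities
        }
    }

    // Draw letters from the pool, weighted by P(next | previous), until it is empty.
    String generate(String letters) {
        List<Character> pool = new ArrayList<>();
        for (char ch : letters.toCharArray()) pool.add(ch);
        StringBuilder out = new StringBuilder();
        int prev = 26;
        while (!pool.isEmpty()) {
            double[] weights = new double[pool.size()];
            double total = 0;
            for (int i = 0; i < pool.size(); i++) {
                // Small floor so letters with an unseen transition can still be picked.
                weights[i] = transitions[prev][pool.get(i) - 'a'] + 1e-6;
                total += weights[i];
            }
            double r = rng.nextDouble() * total;
            int pick = pool.size() - 1;               // fallback to the last letter
            for (int i = 0; i < pool.size(); i++) {
                r -= weights[i];
                if (r <= 0) { pick = i; break; }
            }
            prev = pool.get(pick) - 'a';
            out.append(pool.remove(pick).charValue());
        }
        return out.toString();
    }
}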
You might try simulated annealing, which has been used successfully for complex optimization problems in a number of domains. Basically you do randomized hill-climbing while gradually reducing the randomness. Since you already have the Aho-Corasick scoring you've done most of the work already. All you need is a way to generate neighbor permutations; for that something simple like swapping a pair of letters should work fine.
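A minimal simulated-annealing sketch in Java along those lines; the scorer is passed in as a function, standing in for the Aho-Corasick word counter already in place, and the cooling schedule is just a simple linear one.

import java.util.*;
import java.util.function.ToIntFunction;

public class AnnealScramble {
    static char[] anneal(char[] letters, ToIntFunction<char[]> score,
                         int iterations, double startTemp) {
        Random rng = new Random();
        char[] current = letters.clone();
        char[] best = current.clone();
        int currentScore = score.applyAsInt(current);
        int bestScore = currentScore;

        for (int step = 0; step < iterations; step++) {
            double temp = startTemp * (1.0 - (double) step / iterations); // linear cooling
            // Neighbor move: swap two letters.
            int i = rng.nextInt(current.length), j = rng.nextInt(current.length);
            swap(current, i, j);
            int delta = score.applyAsInt(current) - currentScore;
            // Always accept improvements; accept worse moves with probability exp(delta/T).
            if (delta >= 0 || rng.nextDouble() < Math.exp(delta / Math.max(temp, 1e-9))) {
                currentScore += delta;
                if (currentScore > bestScore) { bestScore = currentScore; best = current.clone(); }
            } else {
                swap(current, i, j); // reject: undo the swap
            }
        }
        return best;
    }

    private static void swap(char[] a, int i, int j) {
        char t = a[i]; a[i] = a[j]; a[j] = t;
    }
}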
Have you thought about using a genetic algorithm? You have the beginnings of your fitness function already. You could experiment with the mutation and crossover (thanks Nathan) algorithms to see which do the best job.
Another option would be for your algorithm to build the smallest possible word from the input set, and then add one letter at a time so that the new string is or contains a new word. Start with a few different starting words for each input set and see where it leads.
Just a few idle thoughts.
It might be useful to check how others solved this:
http://sourceforge.net/search/?type_of_search=soft&words=anagram
On this page you can generate anagrams online. I've played around with it for a while and it's great fun. It doesn't explain in detail how it does its job, but the parameters give some insight.
http://wordsmith.org/anagram/advanced.html
With JavaScript and Node.js I implemented a jumble solver that uses a dictionary to create a tree and then traverses the tree; after that you can get all possible words. I explained the algorithm in detail in this article and put the source code on GitHub:
Scramble or Jumble Word Solver with Express and Node.js
