Find all (english word) substrings of a given string - algorithm

This is an interview question: find all (English word) substrings of a given string. (For example, "every" contains "every", "ever", and "very".)
Obviously, we can loop over all substrings and check each one against an English dictionary, organized as a set. I believe the dictionary is small enough to fit in RAM. How should the dictionary be organized? As far as I remember, the original spell command loaded the words file into a bitmap representing a set of word hash values. I would start from that.
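A quick sketch of that brute-force check, with a three-word set standing in for a real English dictionary:

dictionary = {"every", "ever", "very"}          # stand-in for the real word list

def find_words(text):
    hits = set()
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            if text[i:j] in dictionary:         # O(1) expected set lookup
                hits.add(text[i:j])
    return hits

print(find_words("every"))                      # {'every', 'ever', 'very'}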
Another solution is a trie built from the dictionary. Using the trie, we can loop over all starting positions of the string and walk the trie from each one. I guess the complexity of this solution would be the same in the worst case (O(n^2)).
Does it make sense? Would you suggest other solutions?

One option is the Aho-Corasick string matching algorithm, which "constructs a finite state machine that resembles a trie with additional links between the various internal nodes."
But all things considered, "build a trie from the English dictionary and do a simultaneous search on it for all suffixes of the given string" should be pretty good for an interview.
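A minimal sketch of that approach, with the same three-word stand-in dictionary: build a trie of the dictionary words and walk it from every starting position of the string.

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node['$'] = True                         # marks the end of a dictionary word
    return root

def word_substrings(text, trie):
    found = set()
    for start in range(len(text)):               # one trie walk per suffix of the text
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break                            # no dictionary word continues here
            if '$' in node:
                found.add(text[start:end + 1])
    return found

trie = build_trie(["every", "ever", "very"])
print(word_substrings("every", trie))            # {'every', 'ever', 'very'}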

I'm not sure a trie will easily match subwords that begin in the middle of the string.
Another solution with a similar concept is to use a state machine or regular expression.
The regular expression is just word1|word2|....
I'm not sure whether standard regular expression engines can handle an expression covering the whole English language, but it shouldn't be hard to build the equivalent state machine given the dictionary.
Once the regular expression is compiled / the state machine is built, the complexity of analyzing a specific string is O(n).
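As a small illustration of the alternation idea, here is a sketch using Python's re module, with three words standing in for the full dictionary; note that finditer reports non-overlapping matches, so overlapping hits are not all listed:

import re

words = ["every", "ever", "very"]                # stand-in dictionary
pattern = re.compile("|".join(re.escape(w) for w in words))

print([m.group() for m in pattern.finditer("he is very clever")])
# ['very', 'ever']  -- the 'ever' comes from inside 'clever'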

The first solution can be refined to have a different hash map for each word length (to reduce collisions) but other than that I can't think of anything significantly better.

Related

What would be a good data structure to store a dictionary(of words) to optimize the search time?

Provided a list of valid words and a search word, I want to find whether the search word is a valid word or not, ALLOWING 2 typo characters.
What would be a good data structure to store a dictionary of words (assume it contains a million words), and what algorithm would find whether a word exists in the dictionary (allowing 2 typo characters)?
If no typo characters were allowed, then a trie would be a good way to store the words, but I'm not sure it remains the best way to store the dictionary when typos are allowed. I'm also not sure what the complexity of a backtracking algorithm (to search for a word in the trie allowing 2 typos) would be. Any idea about it?
You might want to check out the Directed Acyclic Word Graph, or DAWG. It has more of an automaton structure than a tree or graph structure. Having multiple possibilities branching out from one node may provide you with your solution.
If there is no need to also store all mistyped words, I would consider using a two-step approach for this problem.
1. Build a set containing hashes of all valid words (not including typos). So probably we are talking here about some 10,000 entries, which should still allow quite fast lookups with a binary search. If the hash of a word is found in the set, it is typed correctly.
2. If a word's hash is not found in the set, the word is probably mistyped. So calculate the Damerau-Levenshtein distance between the word and all known words to figure out what the user might have meant. To gain some performance here, modify the DL algorithm to abort the calculation once the distance gets bigger than your allowed threshold of 2 typos.
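A rough sketch of step 2, assuming the restricted (optimal string alignment) form of Damerau-Levenshtein and a tiny stand-in word list; the cap makes the routine give up as soon as every alignment already needs more than the allowed 2 edits.

def osa_distance_capped(a, b, cap=2):
    # Restricted Damerau-Levenshtein (optimal string alignment) distance,
    # aborting early once the distance must exceed `cap`.
    if abs(len(a) - len(b)) > cap:
        return cap + 1
    prev2, prev = None, list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i] + [0] * len(b)
        for j, cb in enumerate(b, 1):
            cur[j] = min(prev[j] + 1,                   # deletion
                         cur[j - 1] + 1,                # insertion
                         prev[j - 1] + (ca != cb))      # substitution / match
            if prev2 is not None and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # transposition
        if min(cur) > cap:
            return cap + 1       # no alignment can finish within `cap` edits
        prev2, prev = prev, cur
    return prev[-1]

dictionary = ["receive", "believe", "separate"]          # stand-in word list
word = "recieve"
print([w for w in dictionary if osa_distance_capped(word, w) <= 2])   # ['receive']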

Ukkonen's Algorithm and Generalized Suffix Trees

I understand Ukkonen's algorithm. I am only curious how to extend it to hold more than one string (each ending with a special character, say "$").
I read somewhere that, given strings s1 (say "abcddefx$") and s2 (say "abddefgh$"), I should insert s1 normally with Ukkonen's algorithm, then traverse down the tree with s2. That is, I should search for s2 in the tree.
Once I get to a point where the search ends ("ab", after 'b'), I should resume Ukkonen's algorithm from there.
I understand the basic logic behind this. But what I am curious about is what happens to the old suffix links. Are they still valid?
Also, I am confused about my triple (active_node, active_length, remainder): should it be (node representing "ab", 0, 0) as I start the new pass?
For dealing with special characters you can use the Unicode Private Use Areas. These are a few special ranges of characters reserved for your own use; however, the ranges are only around 4000 characters in size. Depending on the Unicode support of the language you are using, this can be really easy or difficult.
If that does not work, instead of inserting characters into your tree, wrap them in some other sort of variable (struct, object, dictionary) to 'extend' their meaning. That way you can provide the extra information needed (is this the end of a string? which string is this the end of?). Then you can provide custom operators for equality on this new wrapper instead of using characters directly.
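A minimal sketch of the wrapper idea in Python; the class name EndMarker and the reuse of s1/s2 from the question are just illustrative, and the tree itself is assumed to be built over lists of these symbols rather than raw strings.

class EndMarker:
    # A per-string terminator: it records which string it ends, and it never
    # compares equal to an ordinary character or to a marker of another string.
    def __init__(self, string_id):
        self.string_id = string_id
    def __eq__(self, other):
        return isinstance(other, EndMarker) and other.string_id == self.string_id
    def __hash__(self):
        return hash(("$", self.string_id))
    def __repr__(self):
        return "$_" + str(self.string_id)

s1 = list("abcddefx") + [EndMarker(1)]
s2 = list("abddefgh") + [EndMarker(2)]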

What are the main differences between the Knuth-Morris-Pratt and Boyer-Moore search algorithms?

What are the main differences between the Knuth-Morris-Pratt search algorithm and the Boyer-Moore search algorithm?
I know KMP searches for Y in X, trying to define a pattern in Y, and saves the pattern in a vector. I also know that BM works better for small words, like DNA (ACTG).
What are the main differences in how they work? Which one is faster? Which one is less computer-greedy? In which cases?
Moore's UTexas webpage walks through both algorithms in a step-by-step fashion (he also provides various technical sources):
Knuth-Morris-Pratt
Boyer-Moore
According to the man himself,
The classic Boyer-Moore algorithm suffers from the phenomenon that it tends not to work so efficiently on small alphabets like DNA. The skip distance tends to stop growing with the pattern length because substrings re-occur frequently. By remembering more of what has already been matched, one can get larger skips through the text. One can even arrange "perfect memory" and thus look at each character at most once, whereas the Boyer-Moore algorithm, while linear, may inspect a character from the text multiple times. This idea of remembering more has been explored in the literature by others. It suffers from the need for very large tables or state machines.
However, there have been some modifications of BM that have made small-alphabet searching viable.
As a rough explanation:
Boyer-Moore's approach is to try to match the last character of the pattern instead of the first one, with the assumption that if there's no match at the end there's no need to try to match at the beginning. This allows for "big jumps", so BM works better when the pattern and the text you are searching resemble "natural text" (i.e. English).
Knuth-Morris-Pratt searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters. (Source: Wiki)
This means KMP is better suited for small alphabets like DNA (ACTG).
The Boyer-Moore technique matches characters from right to left and works well on long patterns.
Knuth-Morris-Pratt matches characters from left to right and works fast on short patterns.
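As a concrete view of the "word embodies sufficient information" observation quoted above, here is a minimal sketch of the KMP failure (prefix) table; the pattern "ababc" is just an arbitrary example.

def prefix_table(pattern):
    # table[i] = length of the longest proper prefix of pattern[:i+1] that is
    # also a suffix of it; this is what lets KMP shift after a mismatch
    # without re-reading text characters.
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table

print(prefix_table("ababc"))   # [0, 0, 1, 2, 0]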

Algorithm to find multiple string matches

I'm looking for suggestions for an efficient algorithm for finding all matches in a large body of text. Terms to search for will be contained in a list and can have 1000+ possibilities. The search terms may be 1 or more words.
Obviously I could make multiple passes through the text comparing against each search term. Not too efficient.
I've thought about ordering the search terms and combining common sub-segments. That way I could eliminate large numbers of terms quickly. Language is C++ and I can use boost.
An example of search terms could be a list of Fortune 500 company names.
Ideas?
Don't reinvent the wheel
This problem has been intensively researched. Curiously, the best algorithms for searching ONE pattern/string do not extrapolate easily to multi-string matching.
The "grep" family implement the multi-string search in a very efficient way. If you can use them as external programs, do it.
In case you really need to implement the algorithm, I think the fastest way is to reproduce what agrep does (agrep excels in multi-string matching!). Here are the source and executable files.
And here you will find a paper describing the algorithms used, the theoretical background, and a lot of information and pointers about string matching.
A note of caution: multiple-string matching has been heavily researched by people like Knuth, Boyer, Moore, Baeza-Yates, and others. If you need a really fast algorithm, don't hesitate to stand on their broad shoulders. Don't reinvent the wheel.
As in the case of single patterns, there are several algorithms for multiple-pattern matching, and you will have to find the one that fits your purpose best. The paper A fast algorithm for multi-pattern searching (archived copy) does a review of most of them, including Aho-Corasick (which is kind of the multi-pattern version of the Knuth-Morris-Pratt algorithm, with linear complexity) and Commentz-Walter (a combination of Boyer-Moore and Aho-Corasick), and introduces a new one, which uses ideas from Boyer-Moore for the task of matching multiple patterns.
An alternative hash-based algorithm not mentioned in that paper is the Rabin-Karp algorithm, which has a worse worst-case complexity than the other algorithms but compensates for it by reducing the linear factor via hashing. Which one is better depends ultimately on your use case. You may need to implement several of them and compare them in your application if you want to choose the fastest one.
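For a feel of how Aho-Corasick, mentioned above, works in practice, here is a compact sketch with dict-based nodes and the classic he/she/his/hers example; a production version needs more care, but the failure-link construction and the single pass over the text are the essential ideas.

from collections import deque

def build_automaton(patterns):
    goto, fail, out = [{}], [0], [[]]             # node 0 is the trie root
    for pat in patterns:                          # 1) build a trie of the patterns
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({}); fail.append(0); out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)
    queue = deque(goto[0].values())               # 2) add failure links by BFS
    while queue:
        node = queue.popleft()
        for ch, nxt in goto[node].items():
            queue.append(nxt)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] += out[fail[nxt]]            # inherit matches ending here
    return goto, fail, out

def search(text, goto, fail, out):
    node, hits = 0, []
    for i, ch in enumerate(text):                 # 3) single pass over the text
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))  # (start position, pattern)
    return hits

automaton = build_automaton(["he", "she", "his", "hers"])
print(search("ushers", *automaton))   # [(1, 'she'), (2, 'he'), (2, 'hers')]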
Assuming that the large body of text is static English text and you need to match whole words, you can try the following (you should really clarify in your question what exactly counts as a 'match', what kind of text you are looking at, etc.).
First preprocess the whole document into a Trie or a DAWG.
A trie/DAWG has the following property:
Given a trie/dawg and a search term of length K, you can in O(K) time lookup the data associated with the word (or tell if there is no match).
Using a DAWG could save you more space as compared to a trie. Tries exploit the fact that many words will have a common prefix and DAWGs exploit the common prefix as well as the common suffix property.
In the trie, also maintain the list of positions of each word. For example, if the text is
That is that and so it is.
then the node for the last t in "that" will have the list {1,3} associated, and the node for the s in "is" will have the list {2,7}.
Now when you get a single word search term, you can walk the trie and get the list of matches for that word easily.
If you get a multiple word search term, you can do the following.
Walk the trie with the first word in the search term. Get the list of matches and insert it into a hashtable H1.
Now walk the trie with the second word in the search term. Get the list of matches. For each match position x, check if x-1 exists in hashtable H1. If so, add x to a new hashtable H2.
Walk the trie with the third word and get its list of matches. For each match position y, check if y-1 exists in H2; if so, add y to a new hashtable H3.
Continue in this fashion.
At the end you get a list of matches for the search phrase, which gives the positions of the last word of the phrase.
You could potentially optimize the phrase matching step by maintaining a sorted list of positions and doing a binary search: e.g., for each key k in H2, binary search for k+1 in the sorted position list for search term 3, and add k+1 to H3 if you find it, and so on.
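A small sketch of the position-list phrase matching, using a plain dict of word -> positions in place of the trie for brevity; the text is the "That is that and so it is" example above.

from collections import defaultdict

text = "That is that and so it is".lower().split()
positions = defaultdict(list)
for idx, word in enumerate(text, 1):            # 1-based word positions, as above
    positions[word].append(idx)

def phrase_matches(phrase):
    words = phrase.lower().split()
    current = set(positions[words[0]])          # H1: positions of the first word
    for word in words[1:]:                      # H2, H3, ...: keep chained positions
        current = {x for x in positions[word] if x - 1 in current}
    return sorted(current)                      # positions of the phrase's last word

print(positions["is"])              # [2, 7]
print(phrase_matches("that is"))    # [2] -> "That is" at the start of the text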
An optimal solution for this problem is to use a suffix tree (or a suffix array). It's essentially a trie of all suffixes of a string. For a text of length N, this can be built in O(N).
Then all k occurrences of a string of length m can be answered optimally in O(m + k).
Suffix trees can also be used to efficiently find e.g. the longest palindrome, the longest common substring, the longest repeated substring, etc.
This is the typical data structure to use when analyzing DNA strings which can be millions/billions of bases long.
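A naive sketch of the suffix-array side of this: sorting the suffix start positions directly is O(N^2 log N) rather than the linear-time constructions referred to above, and the query below materialises its comparison keys purely for brevity, but it shows the binary-search lookup that yields all occurrences. The text "banana" and the pattern "ana" are placeholders.

import bisect

text = "banana"
sa = sorted(range(len(text)), key=lambda i: text[i:])   # naive suffix array

def occurrences(pattern):
    # All suffixes starting with `pattern` form one contiguous block in sa.
    keys = [text[i:i + len(pattern)] for i in sa]
    lo = bisect.bisect_left(keys, pattern)
    hi = bisect.bisect_right(keys, pattern)
    return sorted(sa[lo:hi])

print(occurrences("ana"))   # [1, 3]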
See also
Wikipedia/Suffix tree
Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology (Dan Gusfield).
So you have lots of search terms and want to see if any of them are in the document?
Purely algorithmically, you could sort all your possibilities in alphabetical order, join them with pipes, and use them as a regular expression, provided the regex engine will look at /ant|ape/ and properly short-circuit on the "a" of "ape" when it didn't find a match in "ant". If not, you could do a "precompile" of the regex and "squish" the results down to their minimum overlap, i.e. in the above case /a(nt|pe)/, and so on recursively for each letter.
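A hedged sketch of that "squish" idea: build a trie of the terms and emit a collapsed regular expression; the words ant/ape/apex are placeholders.

import re

def build_term_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[""] = {}                     # empty key marks end-of-word
    return root

def trie_to_regex(node):
    # Emit longer alternatives before the empty (end-of-word) branch so the
    # regex prefers the longest term at each position.
    branches = []
    for ch, child in sorted(node.items(), key=lambda kv: kv[0] == ""):
        branches.append("" if ch == "" else re.escape(ch) + trie_to_regex(child))
    return branches[0] if len(branches) == 1 else "(?:" + "|".join(branches) + ")"

pattern = re.compile(trie_to_regex(build_term_trie(["ant", "ape", "apex"])))
print(pattern.pattern)   # a(?:nt|pe(?:x|))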
However, doing the above is pretty much like putting all your search strings in a 26-ary tree (26 characters, more if you also need digits). Push your strings onto the tree, using one level of depth per character of length.
You can do this with your search terms to get a hyper-fast "does this word match anything in my list of search terms" check if you have a large number of search terms.
You could theoretically do the reverse as well -- pack your document into the tree and then use the search terms on it -- if your document is static and the search terms change a lot.
Depends on how much optimization you need...
Are the search terms single words that you are looking for, or can they be full sentences too?
If it's only words, then I would suggest building a red-black tree from all the words, and then searching for each word in the tree.
If it could be sentences, then it could get a lot more complex...

What algorithm can you use to find duplicate phrases in a string?

Given an arbitrary string, what is an efficient method of finding duplicate phrases? We can say that phrases must be longer than a certain length to be included.
Ideally, you would end up with the number of occurrences for each phrase.
In theory
A suffix array is the 'best' answer since it can be implemented to use linear space and time to detect any duplicate substrings. However, the naive implementation actually takes O(n^2 log n) time to sort the suffixes, and it's not completely obvious how to reduce this to O(n log n), let alone O(n), although you can read the related papers if you want to.
A suffix tree can take slightly more memory (still linear, though) than a suffix array, but it is easier to implement so that it builds quickly, since you can use something like a radix-sort idea as you add things to the tree (see the Wikipedia article on suffix trees for details).
The KMP algorithm is also good to be aware of; it is specialized for searching for a particular substring within a longer string very quickly. If you only need this special case, just use KMP and there's no need to bother building an index of suffixes first.
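A rough sketch of the suffix-array route to duplicates: sort the suffixes naively (fine for short inputs), then look at the common prefix of each pair of adjacent suffixes; any common prefix of at least min_len characters is a repeated phrase. Counting occurrences properly needs the full LCP machinery, so this sketch only reports which substrings repeat.

def repeated_substrings(text, min_len=4):
    suffixes = sorted(text[i:] for i in range(len(text)))
    found = set()
    for a, b in zip(suffixes, suffixes[1:]):
        lcp = 0                                   # longest common prefix length
        while lcp < min(len(a), len(b)) and a[lcp] == b[lcp]:
            lcp += 1
        for length in range(min_len, lcp + 1):
            found.add(a[:length])
    return found

print(repeated_substrings("that is that"))   # {'that'}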
In practice
I'm guessing you're analyzing a document of actual natural language (e.g. English) words, and you actually want to do something with the data you collect.
In this case, you might just want to do a quick n-gram analysis for some small n, such as just n=2 or 3. For example, you could tokenize your document into a list of words by stripping out punctuation, capitalization, and stemming words (running, runs both -> 'run') to increase semantic matches. Then just build a hash map (such as hash_map in C++, a dictionary in python, etc) of each adjacent pair of words to its number of occurrences so far. In the end you get some very useful data which was very fast to code, and not crazy slow to run.
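A quick sketch of the bigram (n=2) counting step described above; the tokenizer only lower-cases and strips punctuation, and the stemming step is left out for brevity.

import re
from collections import Counter

def bigram_counts(text):
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))         # adjacent word pair -> count

print(bigram_counts("That is that and so it is, so it is.").most_common(2))
# [(('so', 'it'), 2), (('it', 'is'), 2)]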
As the earlier folks mention, a suffix tree is the best tool for the job. My favorite site for suffix trees is http://www.allisons.org/ll/AlgDS/Tree/Suffix/. It enumerates all the nifty uses of suffix trees on one page and has a test JS application embedded to try strings and work through examples.
Suffix trees are a good way to implement this. The bottom of that article has links to implementations in different languages.
Like jmah said, you can use suffix trees/suffix arrays for this.
There is a description of an algorithm you could use here (see Section 3.1).
You can find a more in-depth description in the book they cite (Gusfield, 1997), which is on google books.
Suppose you are given a sorted array A with n entries. Because equal values sit next to each other in a sorted array, one left-to-right pass is enough to mark every duplicate:

def mark_duplicates(A):
    # A is sorted, so any duplicate value is adjacent to a copy of itself.
    duplicates = set()
    for i in range(len(A) - 1):
        if A[i] == A[i + 1]:
            duplicates.add(A[i])      # A[i] and A[i+1] are duplicates
    return duplicates

print(mark_duplicates(["and", "is", "is", "that", "that"]))   # {'is', 'that'}

This algorithm runs in O(n) time.
