Suffix vs Prefix Trie - algorithm

I know a similar question has been asked (Prefix vs Suffix Trie in String Matching) but the accepted answer did not help me understand my query.
The question is: what advantage does a suffix trie have over a prefix trie?

A suffix trie lets you start a match at any position of the indexed string and see how far it extends, so it supports arbitrary substring queries rather than just prefix queries. This is probably similar to the accepted answer on the original question, but that's the best I can do.
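For illustration, here is a rough Python sketch of a naive suffix trie built as nested dicts (only practical for short texts; real implementations use suffix trees or suffix arrays). Because every suffix of the text is inserted, a lookup can begin matching at any position of the text:

```python
# Naive suffix trie as nested dicts: insert every suffix of the text.
def build_suffix_trie(text):
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def is_substring(trie, pattern):
    """The match can begin at any position of the original text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("banana")
print(is_substring(trie, "nan"))   # True: starts in the middle of the text
print(is_substring(trie, "band"))  # False
```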

You can also look at the Aho-Corasick algorithm. It builds a finite state machine: essentially a prefix trie of the patterns augmented with failure links that point from each node to the node for the longest suffix of its string that also occurs in the trie. The failure links are computed with a breadth-first traversal of the trie. Aho-Corasick is used for fast multiple-pattern matching.
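Here is a minimal, hedged Python sketch of Aho-Corasick along those lines: a nested-dict prefix trie of the patterns plus failure links computed by a breadth-first traversal. The pattern set is just an example:

```python
from collections import deque

def build_automaton(patterns):
    # Each node: transition dict, failure link, list of patterns ending here.
    nodes = [{"next": {}, "fail": 0, "out": []}]
    for pat in patterns:
        cur = 0
        for ch in pat:
            if ch not in nodes[cur]["next"]:
                nodes.append({"next": {}, "fail": 0, "out": []})
                nodes[cur]["next"][ch] = len(nodes) - 1
            cur = nodes[cur]["next"][ch]
        nodes[cur]["out"].append(pat)

    # Failure links via breadth-first traversal of the trie.
    queue = deque(nodes[0]["next"].values())
    while queue:
        cur = queue.popleft()
        for ch, child in nodes[cur]["next"].items():
            queue.append(child)
            f = nodes[cur]["fail"]
            while f and ch not in nodes[f]["next"]:
                f = nodes[f]["fail"]
            nodes[child]["fail"] = nodes[f]["next"].get(ch, 0)
            nodes[child]["out"] += nodes[nodes[child]["fail"]]["out"]
    return nodes

def search(nodes, text):
    matches, cur = [], 0
    for i, ch in enumerate(text):
        while cur and ch not in nodes[cur]["next"]:
            cur = nodes[cur]["fail"]
        cur = nodes[cur]["next"].get(ch, 0)
        for pat in nodes[cur]["out"]:
            matches.append((i - len(pat) + 1, pat))
    return matches

auto = build_automaton(["he", "she", "his", "hers"])
print(search(auto, "ushers"))  # [(1, 'she'), (2, 'he'), (2, 'hers')]
```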

Related

Partial String Matching Algorithm

I am trying to find out if there is an algorithm that exists that is capable of the following:
Given a list of strings:
{"56B99Z", "78K80F", "50B49J", "28F11F"}
And given an input string of:
"??B?9?"
Then the algorithm should output:
{"56B99Z", "50B49J"}
Where ? are unknown characters.
I think some sort of trie with additional links between nodes could work, but I don't want to reinvent the wheel if this has been done before.
Your question is really vague and you need to be more specific: do the strings all have the same length? If so, you can just check the positions that aren't question marks in your query against each candidate string. If you are looking for string-matching algorithms in general, I suggest you read about the KMP algorithm, which has linear complexity for the given input => https://en.m.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm
Use a regular expression that matches the unknown positions (1, 2, 4 and 6) with \w.
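A quick Python sketch of that regex idea, assuming all strings share a length and the unknown characters are alphanumeric (hence \w):

```python
import re

# Translate '?' into a single-character wildcard and anchor the pattern.
def match_with_wildcards(pattern, candidates):
    regex = re.compile("^" + pattern.replace("?", r"\w") + "$")
    return [s for s in candidates if regex.match(s)]

strings = ["56B99Z", "78K80F", "50B49J", "28F11F"]
print(match_with_wildcards("??B?9?", strings))  # ['56B99Z', '50B49J']
```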

Given a prefix, how to find the most frequent words efficiently

This is an interview question extended from this one: http://www.geeksforgeeks.org/find-the-k-most-frequent-words-from-a-file/
But this question requires one more thing: a GIVEN PREFIX.
For example, given "Bl", return the most frequent words, such as "bloom, blame, bloomberg", etc.
So using a TRIE is a must. But then, how to efficiently construct the heap? It's not practical to construct a heap for each prefix at run time. What could be a good solution?
[suppose this TRIE or data structure is static and pre-built]
thanks!
Keep a trie of all the words appearing in the file along with their counts. If you are asked to return all the words with the prefix "Bl", you can do that efficiently using this trie. But since you are interested in the most frequently occurring words with the prefix "Bl", you can then use a min-heap to produce the answer efficiently, just as in the post you linked to.
Also note that since the space usage of a trie can grow very large, you can suggest to the interviewer using a ternary search tree instead of a trie.
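A small Python sketch of that answer, assuming a pre-built nested-dict trie storing (count, word) at terminal nodes; the word counts are made up for illustration:

```python
import heapq

def build_trie(word_counts):
    root = {}
    for word, count in word_counts.items():
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = (count, word)   # '$' marks the end of a word
    return root

def top_k_with_prefix(root, prefix, k):
    node = root
    for ch in prefix.lower():       # walk down to the prefix node
        if ch not in node:
            return []
        node = node[ch]
    entries = []
    stack = [node]
    while stack:                    # collect all words in the subtree
        cur = stack.pop()
        for key, child in cur.items():
            if key == "$":
                entries.append(child)
            else:
                stack.append(child)
    # A size-k min-heap under the hood picks the k most frequent words.
    return [w for _, w in heapq.nlargest(k, entries)]

counts = {"bloom": 12, "blame": 7, "bloomberg": 9, "blue": 3, "apple": 20}
trie = build_trie(counts)
print(top_k_with_prefix(trie, "Bl", 3))  # ['bloom', 'bloomberg', 'blame']
```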

Find all (english word) substrings of a given string

This is an interview question: find all (English word) substrings of a given string (e.g. "every" contains "every", "ever", "very").
Obviously, we can loop over all substrings and check each one against an English dictionary, organized as a set. I believe the dictionary is small enough to fit in RAM. How should the dictionary be organized? As far as I remember, the original spell command loaded the words file into a bitmap representing the set of word hash values. I would start from that.
Another solution is a trie built from the dictionary. Using the trie we can walk it from each starting position of the string, character by character. I guess the complexity of this solution would be the same in the worst case (O(n^2)).
Does it make sense? Would you suggest other solutions?
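For reference, the first approach is a few lines in Python; the dictionary here is a tiny stand-in for a real word list:

```python
# Put the dictionary in a set and test every substring against it.
dictionary = {"every", "ever", "very", "eve"}

def word_substrings(s):
    found = set()
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            if s[i:j] in dictionary:
                found.add(s[i:j])
    return found

print(word_substrings("every"))  # {'every', 'ever', 'very', 'eve'}
```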
Look at the Aho-Corasick string matching algorithm, which "constructs a finite state machine that resembles a trie with additional links between the various internal nodes."
But all things considered, "build a trie from the English dictionary and do a simultaneous search on it for all suffixes of the given string" should be pretty good for an interview.
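A rough Python sketch of that suggestion: build a nested-dict trie from the dictionary and walk it from every starting position (i.e. every suffix) of the input string. The word list is a stand-in:

```python
def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["$"] = w               # '$' marks a complete word
    return root

def find_word_substrings(text, trie):
    found = set()
    for start in range(len(text)):  # i.e. for every suffix of the text
        node = trie
        for ch in text[start:]:
            if ch not in node:
                break
            node = node[ch]
            if "$" in node:
                found.add(node["$"])
    return found

trie = build_trie(["every", "ever", "very", "eve"])
print(find_word_substrings("every", trie))  # {'every', 'ever', 'very', 'eve'}
```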
I'm not sure a trie will easily match subwords that begin in the middle of the string.
Another solution with a similar concept is to use a state machine or regular expression.
the regular expression is just word1|word2|....
I'm not sure if standard regular expression engines can handle an expression covering the whole English language, but it shouldn't be hard to build the equivalent state machine given the dictionary.
Once the regular expression is compiled / the state machine is built, the complexity of analyzing a specific string is O(n).
The first solution can be refined to have a different hash map for each word length (to reduce collisions) but other than that I can't think of anything significantly better.
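A small Python sketch of the regex variant, with the alternation wrapped in a lookahead so every start position is visited even when matches overlap. Note that each start position only reports its first (here: longest) matching alternative, and the four-word list is a stand-in for a real dictionary:

```python
import re

words = ["every", "ever", "very", "eve"]
# Longer alternatives first, so each position reports its longest hit.
pattern = re.compile("(?=(" + "|".join(sorted(words, key=len, reverse=True)) + "))")

def regex_word_substrings(text):
    found = set()
    for m in pattern.finditer(text):   # zero-width matches, one per position
        found.add(m.group(1))
    return found

# Only the longest word starting at each position is reported.
print(regex_word_substrings("every"))  # {'every', 'very'}
```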

Algorithm to find multiple string matches

I'm looking for suggestions for an efficient algorithm for finding all matches in a large body of text. Terms to search for will be contained in a list and can have 1000+ possibilities. The search terms may be 1 or more words.
Obviously I could make multiple passes through the text comparing against each search term. Not too efficient.
I've thought about ordering the search terms and combining common sub-segments. That way I could eliminate large numbers of terms quickly. Language is C++ and I can use boost.
An example of search terms could be a list of Fortune 500 company names.
Ideas?
Don't reinvent the wheel
This problem has been intensively researched. Curiously, the best algorithms for searching ONE pattern/string do not extrapolate easily to multi-string matching.
The "grep" family implement the multi-string search in a very efficient way. If you can use them as external programs, do it.
In case you really need to implement the algorithm, I think the fastest way is to reproduce what agrep does (agrep excels in multi-string matching!). Here are the source and executable files.
And here you will find a paper describing the algorithms used, the theoretical background, and a lot of information and pointers about string matching.
A note of caution: multiple-string matching has been heavily researched by people like Knuth, Boyer, Moore, Baeza-Yates, and others. If you need a really fast algorithm, don't hesitate to stand on their broad shoulders. Don't reinvent the wheel.
As in the case of single patterns, there are several algorithms for multiple-pattern matching, and you will have to find the one that fits your purpose best. The paper A fast algorithm for multi-pattern searching (archived copy) does a review of most of them, including Aho-Corasick (which is kind of the multi-pattern version of the Knuth-Morris-Pratt algorithm, with linear complexity) and Commentz-Walter (a combination of Boyer-Moore and Aho-Corasick), and introduces a new one, which uses ideas from Boyer-Moore for the task of matching multiple patterns.
An alternative hash-based algorithm not mentioned in that paper is the Rabin-Karp algorithm, which has a worse worst-case complexity than the other algorithms but compensates by reducing the linear factor via hashing. Which one is better ultimately depends on your use case. You may need to implement several of them and compare them in your application if you want to choose the fastest one.
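For comparison, here is a hedged Python sketch of Rabin-Karp extended to several patterns by grouping them by length and running one rolling hash per distinct length; the text and patterns are just illustrative:

```python
def _poly_hash(s, base, mod):
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h

def rabin_karp_multi(text, patterns, base=256, mod=(1 << 61) - 1):
    by_len = {}
    for p in patterns:
        if p:
            by_len.setdefault(len(p), set()).add(p)

    matches = []
    for length, group in by_len.items():
        if length > len(text):
            continue
        targets = {_poly_hash(p, base, mod) for p in group}
        high = pow(base, length - 1, mod)        # weight of the char leaving the window
        h = _poly_hash(text[:length], base, mod)
        for i in range(len(text) - length + 1):
            window = text[i:i + length]
            if h in targets and window in group:  # verify to rule out collisions
                matches.append((i, window))
            if i + length < len(text):            # roll the hash one char forward
                h = ((h - ord(text[i]) * high) * base + ord(text[i + length])) % mod
    return sorted(matches)

print(rabin_karp_multi("ushers say she is here", ["she", "he", "is"]))
# [(1, 'she'), (2, 'he'), (11, 'she'), (12, 'he'), (15, 'is'), (18, 'he')]
```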
Assuming that the large body of text is static English text and you need to match whole words, you can try the following (you should really clarify in your question what exactly counts as a 'match', what kind of text you are looking at, etc.).
First preprocess the whole document into a Trie or a DAWG.
A trie/DAWG has the following property:
Given a trie/dawg and a search term of length K, you can in O(K) time lookup the data associated with the word (or tell if there is no match).
Using a DAWG could save you more space as compared to a trie. Tries exploit the fact that many words will have a common prefix and DAWGs exploit the common prefix as well as the common suffix property.
In the trie, also maintain the list of positions at which each word occurs. For example, if the text is
That is that and so it is.
The node for the last 't' in "that" will have the list {1,3} associated, and the node for the 's' in "is" will have the list {2,7}.
Now when you get a single word search term, you can walk the trie and get the list of matches for that word easily.
If you get a multiple word search term, you can do the following.
Walk the trie with the first word in the search term. Get the list of matches and insert into a hashTable H1.
Now walk the trie with the second word in the search term. Get the list of matches. For each match position x, check if x-1 exists in the HashTable H1. If so, add x to new hashtable H2.
Walk the trie with the third word and get its list of matches. For each match position y, check if y-1 exists in H2; if so, add y to a new hashtable H3.
Continue so forth.
At the end you get a list of matches for the search phrase, which give the positions of the last word of the phrase.
You could potentially optimize the phrase-matching step by keeping each position list sorted and doing a binary search: e.g. for each key k in H2, binary-search for k+1 in the sorted position list of search term 3 and add k+1 to H3 if you find it, and so on.
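A compact Python sketch of the phrase-matching step; a plain dict of word -> set of positions stands in for the trie/DAWG, since the successive "x-1 matched the previous word" filtering is the part being illustrated:

```python
def index_text(text):
    index = {}
    for pos, word in enumerate(text.lower().split(), start=1):
        word = word.strip(".,")
        index.setdefault(word, set()).add(pos)
    return index

def find_phrase(index, phrase):
    words = phrase.lower().split()
    current = index.get(words[0], set())          # positions of the first word
    for word in words[1:]:
        nxt = index.get(word, set())
        current = {x for x in nxt if x - 1 in current}
    return sorted(current)                        # positions of the last word

index = index_text("That is that and so it is.")
print(index["that"])                 # {1, 3}
print(index["is"])                   # {2, 7}
print(find_phrase(index, "that is")) # [2]: the phrase ends at word position 2
```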
An optimal solution for this problem is to use a suffix tree (or a suffix array). It's essentially a trie of all suffixes of a string. For a text of length N, it can be built in O(N) time.
Then a query for all k occurrences of a string of length m can be answered optimally in O(m + k).
Suffix trees can also be used to efficiently find e.g. the longest palindrome, the longest common substring, the longest repeated substring, etc.
This is the typical data structure to use when analyzing DNA strings which can be millions/billions of bases long.
See also
Wikipedia/Suffix tree
Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology (Dan Gusfield).
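As a hedged illustration of the same idea with a simpler structure, here is a suffix array built naively in Python (quadratic-ish construction, fine only for short texts; linear-time construction is much more involved) and queried with binary search. Note that bisect's key= argument needs Python 3.10+:

```python
from bisect import bisect_left, bisect_right

def build_suffix_array(text):
    # Naive construction: sort suffix start indices by the suffix itself.
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    # All suffixes starting with `pattern` are contiguous in sorted order.
    key = lambda i: text[i:i + len(pattern)]
    lo = bisect_left(sa, pattern, key=key)
    hi = bisect_right(sa, pattern, key=key)
    return sorted(sa[lo:hi])

text = "banana"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ana"))  # [1, 3]
```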
So you have lots of search terms and want to see if any of them are in the document?
Purely algorithmically, you could sort all your possibilities in alphabetical order, join them with pipes, and use the result as a regular expression, provided the regex engine, when looking at /ant|ape/, properly short-circuits by reusing the "a" it already matched in "ant" when it falls back to trying "ape". If not, you could "precompile" the regex and "squish" the alternatives down to their minimum overlap: /a(nt|pe)/ in the above case, and so on recursively for each letter.
However, doing the above is pretty much like putting all your search strings in a 26-ary tree (26 characters, more if also numbers). Push your strings onto the tree, using one level of depth per character of length.
You can do this with your search terms to build a hyper-fast "does this word match anything in my list of search terms" check if the number of search terms is large.
You could theoretically do the reverse as well -- pack your document into the tree and then use the search terms on it -- if your document is static and the search terms change a lot.
Depends on how much optimization you need...
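A small Python sketch of the "squish common prefixes" idea: push the terms into a nested-dict trie and emit a prefix-factored regex from it. The three-word term list is just illustrative:

```python
import re

def build_trie(terms):
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node["$"] = True            # end-of-term marker
    return root

def trie_to_regex(node):
    branches = []
    end = "$" in node
    for ch in sorted(k for k in node if k != "$"):
        branches.append(re.escape(ch) + trie_to_regex(node[ch]))
    if not branches:
        return ""
    body = branches[0] if len(branches) == 1 else "(?:" + "|".join(branches) + ")"
    return "(?:" + body + ")?" if end else body

terms = ["ant", "ape", "and"]
pattern = trie_to_regex(build_trie(terms))
print(pattern)                              # a(?:n(?:d|t)|pe)
print(bool(re.fullmatch(pattern, "ape")))   # True
print(bool(re.fullmatch(pattern, "ante")))  # False
```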
Are the search terms single words, or can they be full sentences too?
If it's only words, then I would suggest building a red-black tree from all the words, and then searching for each word in the tree.
If it could be sentences, then it could get a lot more complex...

substring algorithm

Can anyone point to the best algorithm for finding a substring within another string?
Or for finding a char array within another char array?
The best from what point of view? Knuth-Morris-Pratt is a good one. You can find more of them discussed on the Wikipedia entry for string searching algorithms.
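For reference, a short Python sketch of Knuth-Morris-Pratt; the text and pattern are just illustrative:

```python
# KMP: precompute, for each prefix of the pattern, the length of its longest
# proper prefix that is also a suffix, then scan the text without backtracking.
def kmp_search(text, pattern):
    if not pattern:
        return []
    fail = [0] * len(pattern)       # failure table
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    matches, k = [], 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)   # start index of this occurrence
            k = fail[k - 1]
    return matches

print(kmp_search("abababca", "abab"))  # [0, 2]
```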
It depends on what types of searching you are doing. Specific substring over a specific string? Specific substring over many different strings? Many different substrings over a specific string?
Here's a popular algorithm for a specific substring over many different strings.
Boyer-Moore algorithm: http://en.wikipedia.org/wiki/Boyer–Moore_string_search_algorithm
This strstr() implementation seems pretty slick.
