Which is perfect Data Structure for exact string pattern matching? - data-structures

I am working on web content filtering where there is 10000 of words coming on a page. I have to match this with my 1500-2500words dictionary. And I have to find out whether any words present in the page or not.
Please suggest me best data structure to store my pattern faster searching.
I have studied Tree structure. But lets take a word (abc) that may have 26possibilities for a next character. I have to keep 26pointers for next node. (It consumes 26x4 Bytes). I cant spend that much memory for storing my patterns each word.
Suggest me best searching and best in memory.
I am beginner in this field.

Your problem is solved exactly by Aho-Corasick. After some pre-processing you can process each webpage in O(n) time, where n is the size of that page. You will need an amount of auxiliary storage roughly as large as your dictionary.
Your memory constraints seem pretty severe, but if you really need to reduce your memory footprint you can use a list of all characters present at a given state rather than having an array of 26 characters at each state. You will need to scan through these characters as you process a web page, which will slow you down by a fairly significant constant factor, but you will save on space.

For best search is trie http://en.wikipedia.org/wiki/Trie
And for best memory and a god complexity for a search i suggest you a http://en.wikipedia.org/wiki/Suffix_array or a http://en.wikipedia.org/wiki/Suffix_tree
Another approach is by sorting your dictionary(O(NlogN)), and your words O(MlogM) and than with one traversal you check if there is a match for each element O(N + M).
You start with 2 index , and at each step you increase one of them based on the result of comparing the string of the dictionary at one index with the word you have at second index , if they match you have a match and go to the next word you have, else if your word is lower than the dictionary word you go to the next word ( because you allready went through all the dictionary word before that and did not find a match ) else you go to the next element in to the dictionary( try to find a word in to the dictionary that is not lower than your word)

Related

Storing arrangement of same word efficiently

I had a question related to Dictionary storage.
I was reading about Trie Data-structures and so far I have read that it works pretty well as prefix tree. But, I came to Trie - DS in efforts to see if it can reduce the storage of arrangement of letters formed through same word efficiently.
For ex : words "ANT", "TAN" and NAT have same letters but according to Trie it goes on to create two separate paths for these words. I can understand that Trie is meant for prefix storage and reduce redundancy. But can anyone help me in reducing the redundancy here.
One way I was thinking was to change the behavior of Trie as to each node has a status of 'word complete'; In addition if I put 'word start' status too I can make this work as below :
A
N - A - T
T - A - N
Now, every time I can check if the word is starting form here and go till the end.
Does this makes sense ? and if this is feasible ?
Or is their any better method to do this ?
Thanks
You can use 2 tries and also store the reverse trie. Then you can use a wildcard expansion everywhere in the search for example you can split the search word into 2 half and search for one half by the prefix and the other half by its suffix:http://phpir.com/tries-and-wildcards/. When you concatenate the 2 you can efficient search with a wildcard.
If you add a status field to each node you will increase the memory cost of your tree (assuming 8-bit chars) by a possibly not insignificant portion.
I understand that you want to reduce the number of letters in the DS, but you have to consider what happens if some contents are subsets of other contents, e.g. how ANTAN would be represented. Think about the minimal number of chars (128) as nodes of a fully connected graph. Obviously all words are stored in this graph, however it is not suitable to store any specific words. There is no way of telling where words end. The information stored in a trie is not just letters, but complete and properly terminated words.
If you add a marker as you suggest, how will you be able to encode this: SUPERCHARGED, SUPER, PERCH. You would set word_starts at S and P and word_ends at R and H. How would you know that SUPERCH and PER are not contained? You could instead use a non-zero label and assign number-pairs to the beginning and end of words: S:1 P:2 R:1 H:2. To make sure that start and end can occur at the same letter, you would have to use specific bits as labels.
You could then use NATANT as minimal flat representation and N:001 A:000 T:011 A:100 N:010 T: 100. This requires #words bit for the marker in the worst case: A, AA, AAA.... If you would store that in a tree however, you would have to look for the other marker, which is not an operation supported by trees. So I see no good way of using a marker.
From an information theoretical point I think the critical issue here is to properly encode the length, ordering and contents of a word in a unique way for each possible combination of these.
I originally meant to just comment, but it got a bit lengthy. I am not sure if this answers your question, but I hope it helps.
Are you hoping that any search for "ant" also brings up "tan" and "nat"?
If so, then use a TrieMap, always sort keys before reads/writes and map to a container of all words in that "anagram class."
If you're just looking for ideas to reduce the space overhead of using a Trie, then look no further. I've found burst trie's to be very space efficient. I wrote my own burst trie in Scala that also re-uses some ideas that I found in GWT's trie implementation.

Search a string as you type the character

I have contacts stored in my mobile. Lets say my contacts are
Ram
Hello
Hi
Feat
Eat
At
When I type letter 'A' I should get all the matching contacts say "Ram, Feat, Eat, At".
Now I type one more letter T. Now my total string is "AT" now my program should reuse the results of previous search for "A". Now it should return me "Feat, Eat, At"
Design and develop a program for this.
This is interview question at Samsung mobile development
I tried solving with Trie data structures. Could not get good solution for reusing already searched string results. I also tried solution with dictionary data structure, solution has same disadvantage as Trie.
question is how do I search the contacts for each letter typed reusing the search results of earlier searched string? What data structure and algorithm should be used for efficiently solving the problem.
I am not asking for program. So programming language is immaterial for me.
State machine appears to be good solution. Does anyone have suggestion?
Solution should be fast enough for million contacts.
It kind of depends on how many items you're searching. If it's a relatively small list, you can do a string.contains check on everything. So when the user types "A", you search the entire list:
for each contact in contacts
if contact.Name.Contains("A")
Add contact to results
Then the user types "T", and you sequentially search the previous returned results:
for each contact in results
if contact.Name.Contains("AT")
Add contact to new search results
Things get more interesting if the list of contacts is huge, but for the number of contacts that you'd normally have in a phone (a thousand would be a lot!), this is going to work very well.
If the interviewer said, "use the results from the previous search for the new search," then I suspect that this is the answer he was looking for. It would take longer to create a new suffix tree than to just sequentially search the previous result set.
You could optimize this a little bit by storing the position of the substring along with the contact so that all you have to do the next time around is check to see if the next character is as expected, but doing so complicates the algorithm a bit (you have to treat the first search as a special case, and you have to explicitly check string lengths, etc.), and is unlikely to provide much benefit after the first few characters because the size of the list to be searched would be pretty small. The pure sequential search with contains check is going to be plenty fast. Users wouldn't notice the few microseconds you'd save with that optimization.
Update after edit to question
If you want to do this with a million contacts, sequential search might not be the best way to go at the start. Although I'd still give it a try. "Fast enough for a million contacts" raises the question of what exactly "fast enough" means. How long does it take to search one million contacts for the existence of a single letter? How long is the user willing to wait? Remember also that you only have to show one page of contacts before the user takes another action. And you can almost certainly to that before the user presses the second key. Especially if you have a background thread doing the search while the foreground thread handles input and writing the first page of matched strings to the display.
Anyway, you could speed up the initial search by creating a bigram index. That is, for each bigram (sequence of two characters), build a list of names that contain that bigram. You'll also want to create a list of strings for each single character. So, given your list of names, you'd have:
r - ram
a - ram, feat, eat, a
m - ram
h - hello, hi
...
ra - ram
am - ram
...
at - feat, eat, at
...
etc.
I think you get the idea.
That bigram index gets stored in a dictionary or hash map. There are only 325 possible bigrams in the English language, and of course the 26 letters, so at most your dictionary is going to have 351 entries.
So you have almost instant lookup of 1- and 2-character names. How does this help you?
An analysis of Project Gutenberg text shows that the most common bigram in the English language occurs only 3.8% of the time. I realize that names won't share exactly that distribution, but that's a pretty good rough number. So after the first two characters are typed, you'll probably be working with less than 5% of the total names in your list. Five percent of a million is 50,000. With just 50,000 names, you can start using the sequential search algorithm that I described originally.
The cost of this new structure isn't too bad, although it's expensive enough that I'd certainly try the simple sequential search first, anyway. This is going to cost you an extra 2 million references to the names, in the worst case. You could reduce that to a million extra references if you build a 2-level trie rather than a dictionary. That would take slightly longer to lookup and display the one-character search results, but not enough to be noticeable by the user.
This structure is also very easy to update. To add a name, just go through the string and make entries for the appropriate characters and bigrams. To remove a name, go through the name extracting bigrams, and remove the name from the appropriate lists in the bigram index.
Look up "generalized suffix tree", e.g. https://en.wikipedia.org/wiki/Generalized_suffix_tree . For a fixed alphabet size this data structure gives asymptotically optimal solution to find all z matches of a substring of length m in a set of strings in O(z + m) time. Thus you get the same sort of benefit as if you restricted your search to the matches for the previous prefix. Also the structure has optimal O(n) space and build time where n is the total length of all your contacts. I believe you can modify the structure so that you just find the k strings that contain the substring in O(k + m) time, but in general you probably shouldn't have too many matches per contact that have a match, so this may not even be necessary.
What I'm thinking to do is, keeping track of the so far matched string. Suppose in the first step, we identify the strings those have "A" in them and we keep trace of the positions of 'A". Then in the next step we only iterate over these strings and instead of searching them full we only check for occurrence of "T" as the next character to "A" we kept trace in the previous step and so on.

Most frequent words in a terabyte of data

I came across a problem where we have to find say the most 10 frequent words in a terabyte of file or string.
One solution I could think was using a hash table (word, count) along with a max heap. But fitting all the words if the words are unique might cause a problem.
I thought of another solution using Map-Reduce by splitting the chunks on different nodes.
Another solution would be to build a Trie for all the words and update the count of each word as we scan through the file or string.
Which one of the above would be a better solution? I think the first solution is pretty naive.
Split your available memory into two halves. Use one as a 4-bit counting Bloom filter and the other half as a fixed size hash table with counts. The role of the counting Bloom filter is to filter out rarely occuring words with high memory efficiency.
Check your 1 TB of words against the initially empty Bloom filter; if a word is already in and all buckets are set to the maximum value of 15 (this may be partly or wholly a false positive), pass it through. If it is not, add it.
Words that passed through get counted; for a majority of words, this is every time but the first 15 times you see them. A small percentage will start to get counted even sooner, bringing a potential inaccuracy of up to 15 occurrences per word into your results. That's a limitation of Bloom filters.
When the first pass is over, you can correct the inaccuracy with a second pass if desired. Deallocate the Bloom filter, deallocate also all counts that are not within 15 occurrences behind the tenth most frequent word. Go through the input again, this time accurately counting words (using a separate hash table), but ignoring words that have not been retained as approximate winners from the first pass.
Notes
The hash table used in the first pass may theoretically overflow with certain statistical distributions of the input (e.g., each word exactly 16 times) or with extremely limited RAM. It is up to you to calculate or try out whether this can realistically happen to you or not.
Note also that the bucket width (4 bits in the above description) is just a parameter of the construction. A non-counting Bloom filter (bucket width of 1) would filter out most unique words nicely, but do nothing to filter out other very rarely occuring words. A wider bucket size will be more prone to cross-talk between words (because there will be fewer buckets), and it will also reduce guaranteed accuracy level after the first pass (15 occurrences in the case of 4 bits). But these downsides will be quantitatively insignificant until some point, while I'm imagining the more aggresive filtering effect as completely crucial for keeping the hash table in sub-gigabyte sizes with non-repetitive natural language data.
As for the order of magnitude memory needs of the Bloom filter itself; these people are working way below 100 MB, and with a much more challenging application ("full" n-gram statistics, rather than threshold 1-gram statistics).
Sort the terabyte file alphabetically using mergesort. In the initial pass, use quick sort using all available physical RAM to pre-sort long runs of words.
When doing so, represent a continuous sequence of identical words by just one such word and a count. (That is, you are adding the counts during the merges.)
Then resort the file, again using mergesort with quick sort presorting, but this time by the counts rather than alphabetically.
This is slower but simpler to implement than my other answer.
The best I could think of:
Split data to parts you can store in memory.
For each part get N most frequent words, you will get N * partsNumber words.
Read all data again counting words you got before.
It won't always give you correct answer, but it will work in fixed memory and linear time.
And why do you think a building of the Trie structure is not the best decision? Mark all of child nodes by a counter and that's it! Maximum memory complexity will be O(26 * longest_word_length), and time complexity should be O(n), that's not bad, is it?

Creating a suggested words algorithm

I'm designing a cool spell checker (I know I know, modern browsers already have this), anyway, I am wondering what kind of effort would it take to develop a fairly simple but decent suggest-word algorithm.
My idea is that I would first look through the misspelled word's characters and count the amount of characters it matches in each word in the dictionary (sounds resources intensive), and then pick the top 5 matches (so if the misspelled word matches the most characters with 7 words from the dictionary, it will randomly display 5 of those words as suggested spelling).
Obviously to get more advanced, we would look at "common words" and have a dictionary file that is numbered with 'frequency of that word used in English language' ranking. I think that's taking it a bit overboard maybe.
What do you think? Anyone have ideas for this?
First of all you will have to consider the complexity in finding the "nearer" words to the misspelled word. I see that you are using a dictionary, a hash table perhaps. But this may not be enough. The best and cooler solution here is to go for a TRIE datastructure. The complexity of finding these so called nearer words will take linear order timing and it is very easy to exhaust the tree.
A small example
Take the word "njce". This is a level 1 example where one word is clearly misspelled. The obvious suggestion expected would be nice. The first step is very obvious to see whether this word is present in the dictionary. Using the search function of a TRIE, this could be done O(1) time, similar to a dictionary. The cooler part is finding the suggestions. You would obviously have to exhaust all the words that start with 'a' to 'z' that has words like ajce bjce cjce upto zjce. Now to find the occurences of this type is again linear depending on the character count. You should not carried away by multiplying this number with 26 the length of words. Since TRIE immediately diminishes as the length grows. Coming back to the problem. Once that search is done for which no result was found, you go the next character. Now you would be searching for nace nbce ncce upto nzce. In fact you wont have explore all the combinations as the TRIE data structure by itself will not be having the intermediate characters. Perhaps it will have na ni ne no nu characters and the search space becomes insanely simple. So are the further occurrences. You could develop on this concept further based on second and third order matches. Hope this helped.
I'm not sure how much of the wheel you're trying to reinvent, so you may want to check out Lucene.
Apache Lucene Coreā„¢ (formerly named Lucene Java), our flagship sub-project, provides a Java-based indexing and search implementation, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.

Efficent methods for finding most common phrases in a body of text AKA trending topics

I previously asked a similar question on this topic, I ended up deriving several solutions which worked, one based on bloom filters + ngrams, the other based on hash tables + ngrams. Both solutions perform fine with small data sets (<1000 texts, usually tweets) but the computation time grew exponentially meaning doing 10,000 could take hours.
I am currently working in Ruby and perhaps, that is the problem but are there any other solutions or approaches I could attempt to solve this problem?
If you are looking to do text searching in large sets of data, you might have to look into something like solr. There is a really easy to setup solr gem called sunspot http://outoftime.github.com/sunspot/
Your problem can be solved by following the steps below:
(Optional, for performance purpose) Run through all the documents, create a mapping between the a unique word and an integer. Also, it is better to create a special mapping for sentence termination (.!? etc.). This is to facilitate the check of phrases that do not cross sentence boundary.
Concatenate all the documents into a huge array of mapped integers (in previous step). This can be done online (to save space) as we go through the next steps.
Constructing a suffix array of the string in previous step, augmented with the longest common prefix array. The fastest implementation known is SA-IS that runs in O(n) worst-case time. See here. Some special handling is required to be sure that each common prefix does not cross the sentence boundary.
LCP array is basically the result you need. You can do whatever you want with it, such as: sort it to find the longest repeated phrases among the documents, find all 5-words, 4 words, 3-words phrases, etc. The most common phrases (I assume at least 2-word phrases here) can be found by looking at both the LCP and suffix array.
Quick Google search show that this library contains a Ruby suffix array implementation. You can generate LCP array from there in O(n) Reference.

Resources