prefix similarity search - algorithm

I am trying to build a fuzzy search where both the text database and the queries may contain spelling variants. In particular, the text database is material collected from the web and would likely not benefit from a full-text engine's prep phase (word stemming).
I could imagine using pg_trgm as a starting point and then validating hits with Levenshtein distance.
However, people tend to type prefix queries. E.g., in the realm of music, I would expect "beetho symphony" to be a reasonable search term. So, if someone were typing "betho symphony", is there a reasonable way (using PostgreSQL, perhaps with Tcl or Perl scripting) to discover that the "betho" part should be compared with "beetho" (returning an edit distance of 1)?

What I ended up with is a simple modification of the common algorithm: normally one would just pick the last value from the matrix (or from the final vector, in the two-vector variant). Referring to the "iterative" algorithm at http://en.wikipedia.org/wiki/Levenshtein_distance, I pass the string to be probed as the first argument and the query string as the second. When the algorithm finishes, the minimum value in the last column gives the proper result.
Sample results:
query "fantas", words in database "fantasy", "fantastic" => 0
query "fantas", wor in database "fan" => 3
The inputs to the edit distance are words pre-selected from the word list by trigram similarity.
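A minimal Python sketch of this modification (my own illustration, not code from the original post):

def prefix_edit_distance(word, query):
    # Standard iterative Levenshtein DP with the database word on the
    # rows and the query on the columns; instead of returning the
    # bottom-right cell, take the minimum of the last column, i.e. the
    # distance from the query to the best-matching prefix of the word.
    m, n = len(word), len(query)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if word[i - 1] == query[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return min(row[n] for row in d)

print(prefix_edit_distance("fantasy", "fantas"))  # 0
print(prefix_edit_distance("fan", "fantas"))      # 3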

You can modify the edit distance algorithm to give a lower weight to the latter part of the string.
E.g., make the cost Match(i, j) = 1/max(i, j)^2 instead of Match(i, j) = 1 for every i and j, where i and j are the positions of the symbols being compared.
What this does is ensure that dist('ABCD', 'ABCE') < dist('ABCD', 'EBCD').
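A sketch of one way to realize this weighting in Python (the 1/max(i, j)^2 scheme is the one proposed above; the rest of the DP formulation is my assumption):

def weighted_edit_distance(a, b):
    # Like Levenshtein, but every operation at position (i, j) costs
    # 1/max(i, j)^2 instead of 1, so differences near the end of the
    # strings are penalized far less than differences at the start.
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + 1.0 / i ** 2
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + 1.0 / j ** 2
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = 1.0 / max(i, j) ** 2
            cost = 0.0 if a[i - 1] == b[j - 1] else w
            d[i][j] = min(d[i - 1][j] + w, d[i][j - 1] + w, d[i - 1][j - 1] + cost)
    return d[m][n]

print(weighted_edit_distance("ABCD", "ABCE"))  # 0.0625 (mismatch at the end)
print(weighted_edit_distance("ABCD", "EBCD"))  # 1.0    (mismatch at the start)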


How does OEIS do subsequence search?

The On-Line Encyclopedia of Integer Sequences supports searching for sequences that contain your query as a subsequence; e.g., searching for subseq:212,364,420,428 will return the 8*n+4 sequence (http://oeis.org/search?q=subseq:212,364,420,428).
This amazing feature was apparently implemented by Russ Cox, according to http://oeis.org/wiki/User:Russ_Cox/OEIS_Server_Features, but that page does not specify what algorithm is used.
I'm wondering how it is done. Clearly, going through nearly a million sequences for every search is impractical for a search engine. Just keeping an index of the first number (the way the same Russ Cox did Google Code Regex Search) and brute-forcing the rest doesn't work either, since a number like 0 appears in nearly all sequences. In fact, some queries like 0 1 match a high percentage of the total database, so the algorithm needs a running time sensitive to the desired output size.
Does anyone happen to know how this feature is implemented?
My guess is that part of the data is stored in an inverted index. That is, each number is linked to the set of sequences containing it, and when multiple numbers are entered, the intersection of those sets is shown. This is extremely fast and is used by almost every search engine.
Storing the data as suffix trees or other linked structures seems useless for this application.
At least for some sets of sequences (e.g. ax+b), I think it would be better to store them parametrically rather than storing the actual terms.
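A minimal Python sketch of that inverted-index guess, with an in-order verification pass appended (all names and the verification step are my own illustration):

from collections import defaultdict

def build_index(sequences):
    # sequences: dict of seq_id -> list of integers
    index = defaultdict(set)
    for sid, seq in sequences.items():
        for value in seq:
            index[value].add(sid)
    return index

def subseq_search(query, sequences, index):
    # Intersect the posting sets, then verify that the query values
    # appear in order (assumes a non-empty query).
    candidates = set.intersection(*(index.get(v, set()) for v in query))
    hits = []
    for sid in candidates:
        it = iter(sequences[sid])
        if all(v in it for v in query):  # consuming iterator checks order
            hits.append(sid)
    return hits

seqs = {"8n": [0, 8, 16, 24], "8n+4": [4, 12, 20, 28]}
index = build_index(seqs)
print(subseq_search([12, 20], seqs, index))  # ['8n+4']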
First of all, that online search only seems to work with numbers up to 1000; does it work for larger numbers too? Secondly, just out of curiosity: for the example you provided, OEIS for some reason does not list A000027 (the natural numbers), even though it obviously should match.
Database Based Solution
If this were implemented purely in a database, a 4-item search would look something like this.
Tables
sequence {seqid, seqname, etc..}
seqitem {value, seqid, location }
Query
select si1.seqid, si1.location, si2.location, si3.location, si4.location
from seqitem si1, seqitem si2, seqitem si3, seqitem si4
where si1.seqid = si2.seqid and si2.seqid = si3.seqid and si3.seqid = si4.seqid
  and si1.location < si2.location and si2.location < si3.location and si3.location < si4.location
  and si1.value = $v1 and si2.value = $v2 and si3.value = $v3 and si4.value = $v4

Algorithm to search for a list of words in a text

I have a list of words, fairly small, about 1000 or so. I want to check if any of the words in that list occur in an input text, and if so, which ones. Each input text is a few hundred words, and these are text paragraphs from the web, meaning there are a lot of them from different sites. I am trying to find the best algorithm for this.
I can see two obvious ways to do this --
A brute force way of searching for each word from the list in the text.
Create a hash table of words from the input text and then search for each word from the list in the hash table. This is fast.
Is there a better solution?
I am using python though I am not sure if that changes the algorithm anyway.
Also, as an optimization to solution 2 above, I would like to store the generated hash table in persistent storage (a DB), so that if the list of words changes I can reuse the hash table without having to create it again. Of course, if the input text changes, I have to regenerate the hash table. Is it possible to save a hash table to a DB? Any recommendations? I am currently using MongoDB for my project, and I can only store JSON documents in it. I am new to MongoDB, have only just started working with it, and do not yet fully understand its potential.
I have searched SO and see two questions along similar lines and one of them suggests a hash table but I would like to get any pointers towards the optimization I have in mind.
Here are the previously asked questions on SO -
Is there an efficient algorithm to perform inverted full text search?
Searching a large list of words in another large list
EDIT: I just found another question on SO which is about the same problem.
Algorithm for multiple word matching in text
I guess there is no better solution than a hash table. But I would really like to optimize it so that, when the word list changes, I can quickly re-run the algorithm on all the text I have stored up. Should I change the tags on the question to include some database technologies?
There is a better solution than a hash table. If you have a fixed set of words that you want to search for over a large body of text, the way you do it is with the Aho-Corasick string matching algorithm.
The algorithm builds a state machine from the words you want to search, and then runs the input text through that state machine, outputting matches as they're found. Because it takes some amount of time to build the state machine, the algorithm is best suited for searching very large bodies of text.
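A compact Python sketch of Aho-Corasick (a from-scratch illustration, not a production implementation):

from collections import deque

def build_automaton(words):
    # goto is a list of dicts (char -> state); out[s] holds the words
    # that end at state s; fail holds the failure links.
    goto, out = [{}], [set()]
    for w in words:
        s = 0
        for c in w:
            if c not in goto[s]:
                goto.append({})
                out.append(set())
                goto[s][c] = len(goto) - 1
            s = goto[s][c]
        out[s].add(w)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:                          # BFS: shallower states first
        s = queue.popleft()
        for c, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and c not in goto[f]:
                f = fail[f]
            fail[t] = goto[f][c] if c in goto[f] and goto[f][c] != t else 0
            out[t] |= out[fail[t]]        # inherit matches from the fail state
    return goto, fail, out

def find_matches(text, goto, fail, out):
    s, matches = 0, []
    for i, c in enumerate(text):
        while s and c not in goto[s]:
            s = fail[s]
        s = goto[s].get(c, 0)
        for w in out[s]:
            matches.append((i - len(w) + 1, w))
    return matches

goto, fail, out = build_automaton(["dog", "dogma", "cat"])
print(find_matches("My karma ate your dogma.", goto, fail, out))
# [(18, 'dog'), (18, 'dogma')]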
You can do something similar with regular expressions. For example, you might want to find the words "dog", "cat", "horse", and "skunk" in some text. You can build a regular expression:
"dog|cat|horse|skunk"
And then run a regular expression match on the text. How you get all matches will depend on your particular regular expression library, but it does work. For very large lists of words, you'll want to write code that reads the words and generates the regex, but it's not terribly difficult to do and it works quite well.
There is a difference, though, between the results from a regex and the results from the Aho-Corasick algorithm. For example, if you're searching for the words "dog" and "dogma" in the string "My karma ate your dogma.", the regex library search will report finding only "dogma", whereas the Aho-Corasick implementation will report finding both "dog" and "dogma" at the same starting position.
If you want the Aho-Corasick algorithm to report whole words only, you have to modify the algorithm slightly.
Regex, too, will report matches on partial words: if you're searching for "dog", it will find it in "dogma". But you can modify the regex to report whole words only. Typically, that's done with \b, as in:
"\b(cat|dog|horse|skunk)\b"
The algorithm you choose depends a lot on how large the input text is. If the input text isn't too large, you can create a hash table of the words you're looking for, then go through the input text, breaking it into words and checking each word against the hash table. In Python, using a set as the hash table:

target_words = set(word_list)        # hash table of the words to find
for word in input_text.split():
    if word in target_words:
        print(word)
Or, if you want the set of matching words that occur in the input text:

target_words = set(word_list)
found_words = set()
for word in input_text.split():
    if word in target_words:
        found_words.add(word)

Search a string as you type the character

I have contacts stored in my mobile. Let's say my contacts are:
Ram
Hello
Hi
Feat
Eat
At
When I type the letter 'A', I should get all the matching contacts: "Ram, Feat, Eat, At".
Now I type one more letter, 'T', making the full string "AT". My program should now reuse the results of the previous search for "A" and return "Feat, Eat, At".
Design and develop a program for this.
This is an interview question from Samsung mobile development.
I tried solving it with a trie data structure, but could not find a good way to reuse the results for the already-searched string. I also tried a solution with a dictionary data structure; it has the same disadvantage as the trie.
The question is: how do I search the contacts for each letter typed while reusing the search results of the earlier search string? What data structure and algorithm should be used to solve the problem efficiently?
I am not asking for a program, so the programming language is immaterial to me.
A state machine appears to be a good solution. Does anyone have a suggestion?
The solution should be fast enough for a million contacts.
It kind of depends on how many items you're searching. If it's a relatively small list, you can do a substring-containment check on everything. So when the user types "A", you search the entire list:

results = [name for name in contacts if "a" in name.lower()]  # case-insensitive
Then the user types "T", and you sequentially search the previously returned results:

results = [name for name in results if "at" in name.lower()]
Things get more interesting if the list of contacts is huge, but for the number of contacts that you'd normally have in a phone (a thousand would be a lot!), this is going to work very well.
If the interviewer said, "use the results from the previous search for the new search," then I suspect that this is the answer he was looking for. It would take longer to create a new suffix tree than to just sequentially search the previous result set.
You could optimize this a little bit by storing the position of the substring along with the contact so that all you have to do the next time around is check to see if the next character is as expected, but doing so complicates the algorithm a bit (you have to treat the first search as a special case, and you have to explicitly check string lengths, etc.), and is unlikely to provide much benefit after the first few characters because the size of the list to be searched would be pretty small. The pure sequential search with contains check is going to be plenty fast. Users wouldn't notice the few microseconds you'd save with that optimization.
Update after edit to question
If you want to do this with a million contacts, sequential search might not be the best way to go at the start, although I'd still give it a try. "Fast enough for a million contacts" raises the question of what exactly "fast enough" means. How long does it take to search one million contacts for the existence of a single letter? How long is the user willing to wait? Remember also that you only have to show one page of contacts before the user takes another action, and you can almost certainly do that before the user presses the second key, especially if a background thread does the search while the foreground thread handles input and writes the first page of matched strings to the display.
Anyway, you could speed up the initial search by creating a bigram index. That is, for each bigram (sequence of two characters), build a list of names that contain that bigram. You'll also want to create a list of strings for each single character. So, given your list of names, you'd have:
r - ram
a - ram, feat, eat, at
m - ram
h - hello, hi
...
ra - ram
am - ram
...
at - feat, eat, at
...
etc.
I think you get the idea.
That bigram index gets stored in a dictionary or hash map. There are 26 × 26 = 676 possible bigrams of English letters, plus of course the 26 single letters, so at most your dictionary is going to have 702 entries.
So you have almost instant lookup of 1- and 2-character names. How does this help you?
An analysis of Project Gutenberg text shows that the most common bigram in the English language occurs only 3.8% of the time. I realize that names won't share exactly that distribution, but that's a pretty good rough number. So after the first two characters are typed, you'll probably be working with less than 5% of the total names in your list. Five percent of a million is 50,000. With just 50,000 names, you can start using the sequential search algorithm that I described originally.
The cost of this new structure isn't too bad, although it's expensive enough that I'd certainly try the simple sequential search first anyway. In the worst case, it's going to cost you an extra 2 million references to the names. You could reduce that to a million extra references if you build a 2-level trie rather than a dictionary. That would take slightly longer to look up and display the one-character search results, but not enough to be noticeable by the user.
This structure is also very easy to update. To add a name, just go through the string and make entries for the appropriate characters and bigrams. To remove a name, go through the name extracting bigrams, and remove the name from the appropriate lists in the bigram index.
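A quick Python sketch of building such an index (my own illustration):

from collections import defaultdict

def build_bigram_index(names):
    index = defaultdict(set)
    for name in names:
        key = name.lower()
        for ch in set(key):                 # single-character entries
            index[ch].add(name)
        for i in range(len(key) - 1):       # bigram entries
            index[key[i:i + 2]].add(name)
    return index

index = build_bigram_index(["Ram", "Hello", "Hi", "Feat", "Eat", "At"])
print(sorted(index["at"]))  # ['At', 'Eat', 'Feat']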
Look up "generalized suffix tree", e.g. https://en.wikipedia.org/wiki/Generalized_suffix_tree . For a fixed alphabet size this data structure gives asymptotically optimal solution to find all z matches of a substring of length m in a set of strings in O(z + m) time. Thus you get the same sort of benefit as if you restricted your search to the matches for the previous prefix. Also the structure has optimal O(n) space and build time where n is the total length of all your contacts. I believe you can modify the structure so that you just find the k strings that contain the substring in O(k + m) time, but in general you probably shouldn't have too many matches per contact that have a match, so this may not even be necessary.
What I'm thinking of doing is keeping track of the string matched so far. In the first step, we identify the strings that have "A" in them and record the positions of "A". In the next step, we iterate only over those strings, and instead of searching them in full we only check whether "T" occurs as the character immediately following one of the positions recorded in the previous step, and so on.
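A sketch of that idea in Python (function and variable names are mine):

def start_search(contacts, first_char):
    # map each matching name to the positions where the match ends
    c = first_char.lower()
    state = {}
    for name in contacts:
        positions = [i for i, ch in enumerate(name.lower()) if ch == c]
        if positions:
            state[name] = positions
    return state

def extend_search(state, next_char):
    # keep only names where next_char directly follows a recorded position
    c = next_char.lower()
    new_state = {}
    for name, positions in state.items():
        low = name.lower()
        ends = [p + 1 for p in positions if p + 1 < len(low) and low[p + 1] == c]
        if ends:
            new_state[name] = ends
    return new_state

state = start_search(["Ram", "Hello", "Hi", "Feat", "Eat", "At"], "A")
print(sorted(state))                 # ['At', 'Eat', 'Feat', 'Ram']
state = extend_search(state, "T")
print(sorted(state))                 # ['At', 'Eat', 'Feat']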

Use of indexes for multi-word queries in full-text search (e.g. web search)

I understand that a fundamental aspect of full-text search is the use of inverted indexes. So, with an inverted index a one-word query becomes trivial to answer. Assuming the index is structured like this:
some-word -> [doc385, doc211, doc39977, ...] (sorted by rank, descending)
To answer the query for that word the solution is just to find the correct entry in the index (which takes O(log n) time) and present some given number of documents (e.g. the first 10) from the list specified in the index.
But what about queries which return documents that match, say, two words? The most straightforward implementation would be the following:
set A to be the set of documents which have word 1 (by searching the index).
set B to be the set of documents which have word 2 (ditto).
compute the intersection of A and B.
Now, step three probably takes O(n log n) time to perform. For very large A and B, that could make the query slow to answer. But search engines like Google always return their answer in a few milliseconds, so that can't be the full answer.
One obvious optimization is that since a search engine like Google doesn't return all the matching documents anyway, we don't have to compute the whole intersection. We can start with the smallest set (e.g. B) and find enough entries which also belong to the other set (e.g. A).
But can't we still have the following worst case? If set A is the set of documents matching a common word, and set B is the set of documents matching another common word, there might still be cases where A ∩ B is very small (i.e. the combination is rare). That means the search engine has to linearly go through all elements x of B, checking whether they are also elements of A, to find the few that match both conditions.
Linear isn't fast. And you can have way more than two words to search for, so just employing parallelism surely isn't the whole solution. So, how are these cases optimized? Do large-scale full-text search engines use some kind of compound indexes? Bloom filters? Any ideas?
Regarding "some-word -> [doc385, doc211, doc39977, ...] (sorted by rank, descending)": I think the search engine may not do this. Instead, the doc list is usually sorted by doc ID, and each doc has a rank for the word.
When a query comes in, it contains several keywords. For each word, you can find a doc list. For all keywords, you do merge operations and compute the relevance of each doc to the query, finally returning the top-ranked docs to the user.
And the query process can be distributed to gain better performance.
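For illustration, here is the classic linear-time merge of two doc-ID-sorted postings lists in Python (a sketch; real engines add skip pointers, galloping search, and compression):

def intersect_postings(a, b):
    # a and b are lists of doc IDs sorted ascending
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

print(intersect_postings([39, 211, 385, 39977], [211, 385, 500]))
# [211, 385]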
Even without ranking, I wonder how the intersection of two sets is computed so fast by Google.
Obviously, the worst-case scenario for computing the intersection for some words A, B, C is when their indexes are very big and the intersection very small. A typical case would be a search for some very common ("popular" in DB terms) words in different languages.
Let's try "concrete", 位置 ("site", "location") in Chinese, and 極端な ("extreme") in Japanese.
A Google search for 位置 returns "About 1,500,000,000 results (0.28 seconds)".
A Google search for "concrete" returns "About 2,020,000,000 results (0.46 seconds)".
A Google search for 極端な returns "About 7,590,000 results (0.25 seconds)".
It is extremely improbable that all three terms would ever appear in the same document, but let's google them:
Google search for "concrete 位置 極端な" returns "About 174,000 results (0.13 seconds)"
Adding the Russian word "игра" ("game"):
A search for игра returns "About 212,000,000 results (0.37 seconds)".
A search for all of them, "игра concrete 位置 極端な", returns "About 12,600 results (0.33 seconds)".
Of course the returned search results are nonsense and they do not contain all the search terms.
But looking at the query times for the composed searches, I wonder if any intersection is computed on the word indexes at all. Even if everything is in RAM and heavily sharded, computing the intersection of two sets with 1,500,000,000 and 2,020,000,000 entries is O(n) and can hardly be done in under 0.5 seconds, since the data is on different machines that have to communicate.
There must be some join computation, but at least for popular words this is surely not done on the whole word index. Adding the fact that the results are fuzzy, it seems evident that Google uses some optimization of the kind "give back some high-ranked results, and stop after 0.5 sec".
How this is implemented, I don't know. Any ideas?
Most systems implement TF-IDF in one way or another. TF-IDF is the product of two functions: term frequency and inverse document frequency.
The IDF function relates the document frequency to the total number of documents in a collection. The common intuition for this function is that it should give a higher value for terms that appear in few documents, and a lower value for terms that appear in all documents, making them irrelevant.
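One common variant, sketched in Python (the exact weighting differs between systems):

import math

def tf_idf(term_count, doc_length, docs_with_term, total_docs):
    tf = term_count / doc_length                 # term frequency in the doc
    idf = math.log(total_docs / docs_with_term)  # rarer terms score higher
    return tf * idf

# A term occurring 3 times in a 100-word doc, found in 10 of 1000 docs:
print(tf_idf(3, 100, 10, 1000))  # ~0.138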
You mention Google, but Google optimises search with PageRank (links in/out) as well as term frequency and proximity. Google distributes the data and uses Map/Reduce to parallelise operations - to compute PageRank+TF-IDF.
There's a great explanation of the theory behind this in chapter 2 of Information Retrieval: Implementing and Evaluating Search Engines. Another idea to investigate further is how Solr implements this.
Google does not need to actually find all results, only the top ones.
The index can be sorted by grade first and only then by ID. Since the same ID always has the same grade, this does not hurt set intersection time.
So Google runs the intersection until it finds 10 results, then does a statistical estimation to tell you how many more results it found.
A worst case is almost impossible.
If all words are "common", then intersection will give the first 10 results very fast. If there is a rare word, then intersection is fast because the complexity is O(N log M), where N is the size of the smallest group.
You need to remember that Google keeps its indexes in memory and uses parallel computing. For example, you can split the problem into two searches, each searching only half of the web, then merge the results and take the best. Google has millions of computers.

Approximate string matching algorithms

Here at work, we often need to find a string from a list of strings that is the closest match to some other input string. Currently, we are using the Needleman-Wunsch algorithm. The algorithm often returns a lot of false positives (if we set the minimum score too low), sometimes it doesn't find a match when it should (when the minimum score is too high), and most of the time we need to check the results by hand. We thought we should try other alternatives.
Do you have any experiences with the algorithms?
Do you know how the algorithms compare to one another?
I'd really appreciate some advice.
PS: We're coding in C#, but you shouldn't care about it - I'm asking about the algorithms in general.
Oh, I'm sorry I forgot to mention that.
No, we're not using it to match duplicate data. We have a list of strings that we are looking for; we call it the search-list. We then need to process texts from various sources (RSS feeds, web sites, forums, etc.): we extract parts of those texts (there are entire sets of rules for that, but that's irrelevant) and need to match those against the search-list. If a string matches one of the strings in the search-list, we need to do some further processing of it (which is also irrelevant).
We cannot perform a normal comparison, because the strings extracted from the outside sources usually include some extra words, etc.
Anyway, it's not for duplicate detection.
OK, Needleman-Wunsch (NW) is a classic end-to-end ("global") aligner from the bioinformatics literature. It was long ago available as "align" and "align0" in the FASTA package. The difference was that the "0" version wasn't as biased about avoiding end-gapping, which often made it easier to favor high-quality internal matches. Smith-Waterman, as I suspect you're aware, is a local aligner and is the original basis of BLAST. FASTA had its own local aligner as well that was slightly different. All of these are essentially heuristic methods for estimating Levenshtein distance relative to a scoring metric for individual character pairs (in bioinformatics, often given by Dayhoff/"PAM", Henikoff & Henikoff, or other matrices; when applied to natural language, these are usually replaced with something simpler that more reasonably reflects replacements in linguistic word morphology).
Let's not be precious about labels: Levenshtein distance, as referenced in practice at least, is basically edit distance, and you have to estimate it because it's not feasible to compute it generally, and it's expensive to compute exactly even in interesting special cases. The water gets deep quickly there, and thus we have heuristic methods of long and good repute.
Now as to your own problem: several years ago, I had to check the accuracy of short DNA reads against reference sequence known to be correct and I came up with something I called "anchored alignments".
The idea is to take your reference string set and "digest" it by finding all locations where a given N-character substring occurs. Choose N so that the table you build is not too big but also so that substrings of length N are not too common. For small alphabets like DNA bases, it's possible to come up with a perfect hash on strings of N characters and make a table and chain the matches in a linked list from each bin. The list entries must identify the sequence and start position of the substring that maps to the bin in whose list they occur. These are "anchors" in the list of strings to be searched at which an NW alignment is likely to be useful.
When processing a query string, you take the N characters starting at some offset K in the query string, hash them, look up their bin, and if the list for that bin is nonempty then you go through all the list records and perform alignments between the query string and the search string referenced in the record. When doing these alignments, you line up the query string and the search string at the anchor and extract a substring of the search string that is the same length as the query string and which contains that anchor at the same offset, K.
If you choose a long enough anchor length N, and a reasonable set of values of offset K (they can be spread across the query string or be restricted to low offsets) you should get a subset of possible alignments and often will get clearer winners. Typically you will want to use the less end-biased align0-like NW aligner.
This method tries to boost NW a bit by restricting its input, and it yields a performance gain because you do fewer alignments and they are more often between similar sequences. Another good thing to do with your NW aligner is to allow it to give up after some amount or length of gapping occurs, to cut costs, especially if you know you're not going to see or be interested in middling-quality matches.
Finally, this method was used on a system with small alphabets, with K restricted to the first 100 or so positions in the query string and with search strings much larger than the queries (the DNA reads were around 1000 bases and the search strings were on the order of 10000, so I was looking for approximate substring matches justified by an estimate of edit distance specifically). Adapting this methodology to natural language will require some careful thought: you lose on alphabet size but you gain if your query strings and search strings are of similar length.
Either way, allowing more than one anchor from different ends of the query string to be used simultaneously might be helpful in further filtering the data fed to NW. If you do this, be prepared to possibly send overlapping strings, each containing one of the two anchors, to the aligner and then reconcile the alignments... or possibly further modify NW to emphasize keeping your anchors mostly intact during an alignment, using penalty modification during the algorithm's execution.
Hope this is helpful or at least interesting.
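A rough Python sketch of the anchoring idea (all names and details are my reading of the description above, not the original code):

from collections import defaultdict

def build_anchor_table(search_strings, n):
    # map every N-character substring to (string index, position) pairs
    table = defaultdict(list)
    for idx, s in enumerate(search_strings):
        for pos in range(len(s) - n + 1):
            table[s[pos:pos + n]].append((idx, pos))
    return table

def anchored_candidates(query, search_strings, table, n, offsets):
    # for each offset K, look up the anchor and yield the aligned window
    # of the search string; each window would then go to the NW aligner
    for k in offsets:
        anchor = query[k:k + n]
        for idx, pos in table.get(anchor, ()):
            start = pos - k
            if 0 <= start and start + len(query) <= len(search_strings[idx]):
                yield idx, search_strings[idx][start:start + len(query)]

refs = ["the beethoven symphonies", "the beatles anthology"]
table = build_anchor_table(refs, 4)
for idx, window in anchored_candidates("beethoven symph", refs, table, 4, [0, 5]):
    print(idx, repr(window))  # the same refs[0] window appears once per anchor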
Related to the Levenshtein distance: you might wish to normalize it by dividing the result by the length of the longer string, so that you always get a number between 0 and 1 and can compare the distances of pairs of strings in a meaningful way (the expression L(A, B) > L(A, C), for example, is meaningless unless you normalize the distance).
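In Python, assuming some levenshtein(a, b) function is available (e.g. the DP sketched earlier in this document):

def normalized_distance(a, b):
    # 0.0 = identical, 1.0 = maximally different
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))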
We are using the Levenshtein distance method to check for duplicate customers in our database. It works quite well.
Alternative algorithms to look at are agrep (see the Wikipedia entry on agrep) and the FASTA and BLAST biological sequence matching algorithms. These are special cases of approximate string matching, also covered in the Stony Brook algorithm repository. If you can specify the ways the strings differ from each other, you could probably focus on a tailored algorithm. For example, aspell uses some variant of "soundslike" (soundex-metaphone) distance in combination with a "keyboard" distance to accommodate bad spellers and bad typists alike.
Use an FM-index with backtracking, similar to the one in the Bowtie fuzzy aligner.
In order to minimize mismatches due to slight variations or errors in spelling, I've used the Metaphone algorithm, then Levenshtein distance (scaled to 0-100 as a percentage match) on the Metaphone encodings for a measure of closeness. That seems to have worked fairly well.
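A sketch of that pipeline, assuming the third-party jellyfish package (the scaling formula is my guess at what "scaled to 0-100" means):

import jellyfish  # third-party: pip install jellyfish

def phonetic_match(a, b):
    # encode both strings phonetically, then score the encodings
    ma, mb = jellyfish.metaphone(a), jellyfish.metaphone(b)
    longest = max(len(ma), len(mb))
    if longest == 0:
        return 100.0
    dist = jellyfish.levenshtein_distance(ma, mb)
    return 100.0 * (1.0 - dist / longest)

print(phonetic_match("Levenshtein", "Levenstine"))  # high percentage match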
To expand on Cd-MaN's answer, it sounds like you're facing a normalization problem. It isn't obvious how to handle scores between alignments with varying lengths.
Given what you are interested in, you may want to obtain p-values for your alignment. If you are using Needleman-Wunsch, you can obtain these p-values using Karlin-Altschul statistics http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
BLAST can perform local alignments and evaluate them using these statistics. If you are concerned about speed, this would be a good tool to use.
Another option is to use HMMER. HMMER uses Profile Hidden Markov Models to align sequences. Personally, I think this is a more powerful approach since it also provides positional information. http://hmmer.janelia.org/
