How to find the best fuzzy match for a string in a large string database

I have a database of strings (arbitrary length) which holds more than one million items (and potentially many more).
I need to compare a user-provided string against the whole database and retrieve an identical string if it exists or otherwise return the closest fuzzy match(es) (60% similarity or better). The search time should ideally be under one second.
My idea is to use edit distance for comparing each db string to the search string after narrowing down the candidates from the db based on their length.
However, as I will need to perform this operation very often, I'm thinking about building an index of the db strings to keep in memory and query the index, not the db directly.
Any ideas on how to approach this problem differently or how to build the in-memory index?
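A minimal sketch of that idea in Python, assuming an in-memory index keyed by string length (the 60% threshold comes from the question; difflib's ratio is used here as a convenient stand-in for an edit-distance-based similarity):

    from collections import defaultdict
    from difflib import SequenceMatcher

    class LengthBucketIndex:
        """In-memory index: strings grouped by length, so only candidates whose
        length is compatible with the similarity threshold get compared."""

        def __init__(self, strings):
            self.buckets = defaultdict(list)
            for s in strings:
                self.buckets[len(s)].append(s)

        def search(self, query, threshold=0.60):
            n = len(query)
            # Under edit-distance similarity, a candidate shorter than threshold*n
            # or longer than n/threshold cannot reach the threshold, so those
            # buckets are skipped (the bound is approximate for difflib's ratio).
            lo, hi = int(n * threshold), int(n / threshold)
            matches = []
            for length in range(lo, hi + 1):
                for candidate in self.buckets.get(length, ()):
                    score = SequenceMatcher(None, query, candidate).ratio()
                    if score >= threshold:
                        matches.append((score, candidate))
            return sorted(matches, reverse=True)

    index = LengthBucketIndex(["apple pie", "apple pies", "banana bread"])
    print(index.search("aple pie"))   # exact and near matches first

This only illustrates the length-based candidate pruning from the question; the ranking metric and threshold are easy to swap out.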

This paper seems to describe exactly what you want.
Lucene (http://lucene.apache.org/) also implements Levenshtein edit distance.

You didn't mention your database system, but for PostgreSQL you could use the following contrib module: pg_trgm - Trigram matching for PostgreSQL
The pg_trgm contrib module provides functions and index classes for determining the similarity of text based on trigram matching.
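For intuition, here is a rough Python sketch of the trigram similarity idea (pg_trgm pads each word and defines similarity as the ratio of shared trigrams to all trigrams; this approximation is illustrative, not pg_trgm's exact algorithm):

    def trigrams(text):
        # Pad each word with two leading spaces and one trailing space,
        # roughly mirroring pg_trgm's padding convention.
        grams = set()
        for word in text.lower().split():
            padded = "  " + word + " "
            grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
        return grams

    def trigram_similarity(a, b):
        ta, tb = trigrams(a), trigrams(b)
        return len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0

    print(trigram_similarity("word", "two words"))

With pg_trgm you would instead create a GIN or GiST index on the column and use the % operator / similarity() function, which keeps the matching inside the database.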

If your database supports it, you should use full-text search. Otherwise, you can use an indexer like Lucene or one of its various implementations.

Compute the SOUNDEX hash (which is built into many SQL database engines) and index by it.
SOUNDEX is a hash based on the sound of the words, so spelling errors of the same word are likely to have the same SOUNDEX hash.
Then find the SOUNDEX hash of the search string, and match on it.
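For reference, a simplified sketch of classic American Soundex in Python (most SQL engines ship a built-in SOUNDEX() that you would use instead; this version skips some edge cases):

    def soundex(word):
        """Simplified American Soundex: first letter plus three digits."""
        codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}
        word = word.lower()
        if not word:
            return ""
        digits, prev = [], codes.get(word[0], "")
        for ch in word[1:]:
            if ch in "hw":                 # h and w do not reset the previous code
                continue
            code = codes.get(ch, "")
            if code and code != prev:
                digits.append(code)
            prev = code
        return (word[0].upper() + "".join(digits) + "000")[:4]

    print(soundex("Robert"), soundex("Rupert"))   # R163 R163 - same code despite the spelling difference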

Since the amount of data is large, when inserting a record I would compute and store the phonetic hash in an indexed column, and then constrain my SELECT queries (via a WHERE clause) to a range on that column.

A very extensive explanation of relevant algorithms is in the book Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology by Dan Gusfield.

https://en.wikipedia.org/wiki/Levenshtein_distance
The Levenshtein algorithm has been implemented in some DBMSs
(e.g. PostgreSQL: http://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html)
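For reference, the textbook dynamic-programming version of Levenshtein distance (essentially what fuzzystrmatch's levenshtein() computes), as a Python sketch:

    def levenshtein(a, b):
        """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
        if len(a) < len(b):
            a, b = b, a                              # keep the rolling row short
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            current = [i]
            for j, cb in enumerate(b, 1):
                current.append(min(previous[j] + 1,                  # deletion
                                   current[j - 1] + 1,               # insertion
                                   previous[j - 1] + (ca != cb)))    # substitution
            previous = current
        return previous[-1]

    print(levenshtein("kitten", "sitting"))   # 3

Inside PostgreSQL you would simply call levenshtein() from the fuzzystrmatch extension rather than reimplementing it.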

Related

How does Lucene (Solr/ElasticSearch) so quickly do filtered term counts?

From a data structure point of view, how does Lucene (Solr/ElasticSearch) so quickly do filtered term counts? For example, given all documents that contain the word "bacon", find the counts for all words in those documents.
First, for background, I understand that Lucene relies upon a compressed bit array data structure akin to CONCISE. Conceptually this bit array holds a 0 for every document that doesn't match a term and a 1 for every document that does match a term. But the cool/awesome part is that this array can be highly compressed and is very fast at boolean operations. For example if you want to know which documents contain the terms "red" and "blue" then you grab the bit array corresponding to "red" and the bit array corresponding to "blue" and AND them together to get a bit array corresponding to matching documents.
But how then does Lucene quickly determine counts for all words in documents that match "bacon"? In my naive understanding, Lucene would have to take the bit array associated with bacon and AND it with the bit arrays for every single other word. Am I missing something? I don't understand how this can be efficient. Also, do these bit arrays have to be pulled off of disk? That sounds all the worse!
How's the magic work?
You may already know this, but it won't hurt to say that Lucene uses inverted indexing. In this indexing technique, a dictionary of each word occurring in all documents is made, and against each word, information about that word's occurrences is stored, something like this image:
To achieve this, Lucene stores documents, indexes, and their metadata in different file formats. Follow this link for file details: http://lucene.apache.org/core/3_0_3/fileformats.html#Overview
If you read the document numbers section, each document is given an internal ID, so when documents are found containing the word 'consign', the Lucene engine has a reference to their metadata.
Refer to the overview section to see what data gets saved in different lucene indexes.
Now that we have a pointer to the stored documents, Lucene may be getting the count in one of the following ways:
Really count the number of words if the document is stored
Use Term Dictionary, frequency, and proximity data to get the count.
Finally, which API are you using to "quickly determine counts for all words"?
Image credit to http://leanjavaengineering.wordpress.com/
Check the index file format here: http://lucene.apache.org/core/8_2_0/core/org/apache/lucene/codecs/lucene80/package-summary.html#package.description
There are no bitsets involved: it's an inverted index. Each term maps to a list of documents. In Lucene the algorithms work on iterators of these "lists", so items from the iterator are read on demand, not all at once.
This diagram shows a very simple conjunction algorithm that just uses a next() operation: http://nlp.stanford.edu/IR-book/html/htmledition/processing-boolean-queries-1.html
Behind the scenes, it is much like this diagram in Lucene. Our lists are delta-encoded and bit-packed, and augmented with a skip list, which allows us to intersect more efficiently (via the additional advance() operation) than the above algorithm.
DocIdSetIterator is this "enumerator" in Lucene. It has the two main methods, next() and advance(). And yes, it is true that you can decide to read the entire list, convert it into a bitset in memory, and implement this iterator over that in-memory bitset. This is what happens if you use CachingWrapperFilter.
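To make the iterator model concrete, here is a toy Python sketch of conjunction over two sorted doc-id lists using next/advance operations, in the spirit of the diagram above (Lucene's real DocIdSetIterator is far more sophisticated: delta-encoded, bit-packed, skip-list backed):

    class PostingsIterator:
        """Naive stand-in for an iterator over a sorted postings list."""

        def __init__(self, doc_ids):
            self.doc_ids = sorted(doc_ids)
            self.pos = -1

        def next_doc(self):
            self.pos += 1
            return self.doc_ids[self.pos] if self.pos < len(self.doc_ids) else None

        def advance(self, target):
            # Move to the first doc id >= target (a skip list makes this cheaper).
            self.pos += 1
            while self.pos < len(self.doc_ids) and self.doc_ids[self.pos] < target:
                self.pos += 1
            return self.doc_ids[self.pos] if self.pos < len(self.doc_ids) else None

    def intersect(a, b):
        """Leapfrog intersection of two postings iterators."""
        result = []
        doc_a, doc_b = a.next_doc(), b.next_doc()
        while doc_a is not None and doc_b is not None:
            if doc_a == doc_b:
                result.append(doc_a)
                doc_a, doc_b = a.next_doc(), b.next_doc()
            elif doc_a < doc_b:
                doc_a = a.advance(doc_b)
            else:
                doc_b = b.advance(doc_a)
        return result

    # docs containing "red" AND docs containing "blue" -> [3, 8]
    print(intersect(PostingsIterator([1, 3, 5, 8, 13]), PostingsIterator([2, 3, 8, 21])))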

Indexing full text search queries for efficient fanout

While thinking about the design of various applications I might like to build some day, in several cases I have had a need to fan out a stream of incoming events based on whether or not they match a large selection of full text search queries provided by users.
A simple example of this problem is the implementation of a tool like Twitter streaming search: given many thousands of new tweets every second, efficiently select only the streaming subscribers whose search query is likely to match an incoming tweet.
A statement of the problem would be something like, "inverse full text search", where the full text is the query, and the search results are the search queries that would match that text.
For single-term queries an implementation is obvious: simply tokenize the incoming document, then search a map of term->(list of subscribers). But things become more difficult when boolean queries are possible. In fact the problem is more general than full text search, but it is most easily understood in that context. There are many other examples where a large set of boolean terms needs to be combined in some way to optimize the cost of evaluating them.
For example, imagine 3 search subscriptions:
Google AND Glass
Google AND Analytics
((Glass AND Google) NOT Knol) OR Twitter
One possibility is to parse the query into a tree, then visit each node, extracting the term and using the "map of term" approach; however, this would require re-evaluating the subscriber's query against the incoming document for each term. With enough subscribers, this is going to get slow very quickly.
Instead I am wondering if there is a well-documented approach to rewrite the queries, perhaps into a single query, where the result can be evaluated once, and tree nodes are annotated with a list of subscriber queries known to either exactly or almost certainly match any document at that point in the tree.
For example, the above queries might be rewritten so that a map of term->(query tree) exists, such as:
Google -> (Analytics[2]
           Glass[1,3])
Twitter -> ([3])
Is there any existing publicly documented system that does something like this? Ideally the solution would allow incrementally adding and removing subscribers, without some expensive step to rewrite the entire structure.
One way to do this is with a simple dictionary that maps terms to queries. So given these four queries:
Query1: Google AND Glass
Query2: Google AND Analytics
Query3: ((Glass AND Google) NOT Knol) OR Twitter
Query4: Quick AND red AND fox
You build a dictionary, keyed by the term:
Google: Query1, Query2, Query3
Glass: Query1, Query3
Analytics: Query2
Knol: Query3
Twitter: Query3
Quick: Query4
red: Query4
fox: Query4
Now, consider a sentence like "The red glass on the knol is from Google."
Parse each word and look it up in the dictionary. For each word in the dictionary, add its list of queries to your master list of queries. Also, for every word that is found in the dictionary, add it to a hash table of relevant words. At the end of this step you'll have two structures: the list of queries to check, and the list of relevant words:
Queries list: Query1, Query2, Query3, Query4
Relevant words: Google, Glass, Knol, red
Now it's a matter of processing each query, checking to see if the words are in the relevant words list.
For Query1, for example, you'd check to see if the relevant words list contains Google and Glass.
The complexity of this isn't too bad. You have an O(1) lookup for each parsed word in the text. For each query identified during the parsing phase, you have some number, N, of O(1) lookups against the relevant-words hash table. There's some very small amount of logic involved in doing the Boolean evaluation, but most queries will be simple "all words" or "any word" type queries (i.e. "this AND that", or "this OR that").
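A minimal Python sketch of those two passes, using the four example queries above (the hand-written predicates stand in for a real Boolean-expression evaluator):

    # Each query: (terms it mentions, predicate over the set of relevant words).
    queries = {
        "Query1": ({"google", "glass"},     lambda w: {"google", "glass"} <= w),
        "Query2": ({"google", "analytics"}, lambda w: {"google", "analytics"} <= w),
        "Query3": ({"glass", "google", "knol", "twitter"},
                   lambda w: ({"glass", "google"} <= w and "knol" not in w) or "twitter" in w),
        "Query4": ({"quick", "red", "fox"}, lambda w: {"quick", "red", "fox"} <= w),
    }

    # Index build: term -> queries that mention it.
    term_to_queries = {}
    for qid, (terms, _) in queries.items():
        for term in terms:
            term_to_queries.setdefault(term, set()).add(qid)

    def match(text):
        words = {w.strip(".,!?").lower() for w in text.split()}
        # Pass 1: relevant words and the candidate queries they point to.
        relevant = words & term_to_queries.keys()
        candidates = set().union(*(term_to_queries[w] for w in relevant)) if relevant else set()
        # Pass 2: evaluate only the candidate queries against the relevant-word set.
        return sorted(qid for qid in candidates if queries[qid][1](relevant))

    print(match("The red glass on the knol is from Google."))   # ['Query1']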
The nice thing about this model is that it's pretty easy to farm out to multiple processors. You can parse the words in a single thread, pushing them to a concurrent queue. Multiple threads service that queue, doing the lookups and building their own lists of queries that need to be checked. When all those lookups are done, you merge the queries lists from the multiple threads and again put them on a concurrent queue that multiple threads can service.
Say you have a million queries, averaging five words each (which would likely be a big average). Absolute worst case here is that some text comes in that contains at least one word from each query. So you have a list of a million queries to check in pass 2. At worst, that's 5 million dictionary lookups.
The first pass of this algorithm is O(n), where n is the number of words in the incoming text. That will create a list of k queries. The second pass is O(km), where m is the average number of words per query.
The beauty of this approach is its simplicity, and it will perform well for moderately large numbers of queries, depending on the size of the text you're feeding it. There is a potentially faster way, but it's much more involved.
Rather than building a dictionary that maps terms to queries, you use a modified Aho-Corasick string search algorithm that is very similar to what the Unix fgrep program uses to match multiple literal strings in a single pass over the text. The details of that are way beyond my ability to explain in a short note here. You might want to track down an old Dr. Dobb's Journal article called something like "Parallel Pattern Matching and fgrep", which as I recall had a reasonably good explanation of how this is done. (A quick search didn't find the article text, but you might have better luck.) You'll also want to read the original Aho-Corasick paper: Efficient String Matching: an Aid to Bibliographic Search. That discusses parallel matching of literal strings, but the basic idea works for matching regular expressions or Boolean search queries.
If you can parse your query into boolean expressions, what you have is a set of rules, with the presence or absence of terms in the search text as the input variables. For each search text you could use parsing + table lookup or Aho-Corasick to work out which terms are present, and then use an implementation of the Rete algorithm such as http://en.wikipedia.org/wiki/Drools to work out which rules to fire given that input.
(Alternately, you could batch up your input texts, build a small text search database from them, and then run your queries. My guess is that this stops being stupidly inefficient when you can afford to wait long enough between query runs for the text search database size to be comparable with the size of the combined queries).

Fuzzy match of an English sentence with a set of English sentences stored in a database

There are about 1000 records in a database table. There is a column named title which is used to store the titles of articles. Before inserting a record, I need to check whether an article with a similar title already exists in that table. If so, I will skip the insert.
What's the fastest way to perform this kind of fuzzy matching? Assume all words in the sentences can be found in an English dictionary. If 70% of the words in sentence #1 can be found in sentence #2, we consider them a match. Ideally, the algorithm can pre-compute a value for each sentence so that the value can be stored in the database.
For 1000 records, doing the dumb thing and just iterating over all the records could work (assuming that the strings aren't too long and you aren't getting hit with too many queries). Just pull all of the titles out of your database, and then sort them by their distance to your given string (for example, you could use Levenshtein distance for this metric).
A fancier way to do approximate string matching would be to precompute n-grams of all your strings and store them in your database (some systems support this feature natively). This will definitely scale better performance-wise, but it could mean more work:
http://en.wikipedia.org/wiki/N-gram
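A rough sketch of the dumb approach, using the question's 70% word-overlap rule as the similarity metric (Levenshtein distance or n-gram similarity could be dropped in instead):

    def word_overlap(title_a, title_b):
        """Fraction of the words in title_a that also appear in title_b."""
        a = set(title_a.lower().split())
        b = set(title_b.lower().split())
        return len(a & b) / len(a) if a else 0.0

    def find_similar(new_title, existing_titles, threshold=0.70):
        """Brute-force scan over all stored titles - fine for ~1000 records."""
        return [t for t in existing_titles
                if word_overlap(new_title, t) >= threshold
                or word_overlap(t, new_title) >= threshold]

    titles = ["How to parse JSON in Python", "Fast fuzzy string matching"]
    print(find_similar("Parse JSON in Python quickly", titles))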
You can read up on forward/reverse indexing of token-value storage for getting faster search results. I personally prefer reverse indexing, which stores a hash map of token (key) to value (here, the title).
Whenever you write a new article, like a new Stack Overflow question, the tokens in the title would be looked up and mapped against all the titles available.
To rank the results, i.e. to get the fuzzy behaviour, you can sort the titles by the number of searched tokens they occur under. E.g., if t1, t2 and t3 refer to the tokens 'what', 'is' and 'love', and the title 'what this love is for?' exists in all three token mappings, it would be placed at the top.
You can play around with this more. I hope this approach is more simple and appealing.
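A sketch of that reverse (token -> title) index, ranking candidates by how many of the query's tokens they contain (illustrative only):

    from collections import defaultdict

    class TitleIndex:
        def __init__(self):
            self.token_to_titles = defaultdict(set)   # token -> titles containing it

        def add(self, title):
            for token in title.lower().split():
                self.token_to_titles[token].add(title)

        def search(self, query):
            """Stored titles sorted by the number of query tokens they share."""
            hits = defaultdict(int)
            for token in set(query.lower().split()):
                for stored in self.token_to_titles.get(token, ()):
                    hits[stored] += 1
            return sorted(hits, key=hits.get, reverse=True)

    index = TitleIndex()
    index.add("what this love is for?")
    index.add("love in the time of cholera")
    print(index.search("what is love"))   # 'what this love is for?' comes first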

PHP MYSQL search engine using keywords

I am trying to implement a search engine based on keyword search.
Can anyone tell me which is the best (fastest) algorithm to implement a search for keywords?
What I need is:
My keywords:
search, faster, profitable
Their synonyms:
search: grope, google, identify, search
faster: smart, quick, faster
profitable: gain, profit
Now I need to search all possible combinations of the above synonyms in the database to identify the best-matching words.
The best solution would be to use an existing search engine, like Lucene or one of its alternatives (see "Which are the best alternatives to Lucene?").
Now, if you want to implement that yourself (it's really a great and exciting problem), you should have a look at the concept of an inverted index. That's what Google and other search engines use. Of course, they have a LOT of additional systems on top of it, but that's the basic idea.
The idea of an inverted index is that, for each keyword (and its synonyms), you store the ids of the documents that contain the keyword. It's then very easy to look up the matching documents for a set of keywords, because you just calculate an intersection (or a union, depending on what you want to do) of their lists in the inverted index. Example:
Let's assume this is your inverted index:
smart: [42,35]
gain: [42]
profit: [55]
Now if you have a query "smart, gain", your matching documents are the intersection (or the union) of [42, 35] and [42].
To handle synonyms, you just need to extend your query to include all synonyms for the words in the initial query. Based on your example, your query would become "faster, quick, gain, profit, profitable".
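A minimal sketch of that expansion-plus-lookup in Python, reusing the synonym lists and the example inverted index from above (the document ids are illustrative):

    synonyms = {
        "search": ["grope", "google", "identify", "search"],
        "faster": ["smart", "quick", "faster"],
        "profitable": ["gain", "profit"],
    }

    inverted_index = {
        "smart": {42, 35},
        "gain": {42},
        "profit": {55},
    }

    def search(keywords, require_all=True):
        # For each original keyword, OR together the postings of the keyword and
        # its synonyms, then AND (intersect) across keywords - or OR everything.
        per_keyword = []
        for term in keywords:
            docs = set()
            for t in [term] + synonyms.get(term, []):
                docs |= inverted_index.get(t, set())
            per_keyword.append(docs)
        if not per_keyword:
            return set()
        return set.intersection(*per_keyword) if require_all else set.union(*per_keyword)

    print(search(["faster", "profitable"]))                      # AND: {42}
    print(search(["faster", "profitable"], require_all=False))   # OR: {35, 42, 55}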
Once you've implemented that, a nice improvement is to add TFIDF weighting to your keywords. That's basically a way to weight rare words (programming) more than common ones (the).
The other approach is to just go through all your documents and find the ones that contain your words (or their synonyms). The inverted index will be MUCH faster though, because you don't have to go through all your documents every time. The time-consuming operation is building the index, which only has to be done once.

Efficient methods for finding the most common phrases in a body of text, AKA trending topics

I previously asked a similar question on this topic. I ended up deriving several solutions which worked, one based on Bloom filters + n-grams, the other based on hash tables + n-grams. Both solutions perform fine with small data sets (<1000 texts, usually tweets), but the computation time grew exponentially, meaning doing 10,000 could take hours.
I am currently working in Ruby and perhaps that is the problem, but are there any other solutions or approaches I could attempt to solve this problem?
If you are looking to do text searching in large sets of data, you might have to look into something like Solr. There is a really easy-to-set-up Solr gem called Sunspot: http://outoftime.github.com/sunspot/
Your problem can be solved by following the steps below:
(Optional, for performance purposes) Run through all the documents and create a mapping between each unique word and an integer. Also, it is better to create a special mapping for sentence terminators (. ! ? etc.). This is to facilitate checking that phrases do not cross sentence boundaries.
Concatenate all the documents into one huge array of the mapped integers (from the previous step). This can be done online (to save space) as we go through the next steps.
Construct a suffix array of the string from the previous step, augmented with the longest common prefix (LCP) array. The fastest implementation known is SA-IS, which runs in O(n) worst-case time. See here. Some special handling is required to make sure that each common prefix does not cross a sentence boundary.
The LCP array is basically the result you need. You can do whatever you want with it, such as sort it to find the longest repeated phrases among the documents, or find all 5-word, 4-word, 3-word phrases, etc. The most common phrases (I assume at least 2-word phrases here) can be found by looking at both the LCP array and the suffix array.
A quick Google search shows that this library contains a Ruby suffix array implementation. You can generate the LCP array from it in O(n) (reference).
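A rough Python sketch of those steps (a naive O(n² log n) suffix array construction stands in for SA-IS, and Kasai's algorithm builds the LCP array; purely illustrative):

    def repeated_phrases(documents, min_words=2):
        # Steps 1-2: map words to integers and concatenate, inserting a unique
        # negative sentinel between documents so phrases never cross document
        # boundaries (sentence terminators could be handled the same way).
        vocab, tokens = {}, []
        for i, doc in enumerate(documents):
            for word in doc.lower().split():
                tokens.append(vocab.setdefault(word, len(vocab)))
            tokens.append(-(i + 1))
        n = len(tokens)

        # Step 3a: suffix array (naive sort; swap in SA-IS for O(n)).
        sa = sorted(range(n), key=lambda i: tokens[i:])

        # Step 3b: LCP array via Kasai's algorithm; unique sentinels guarantee
        # that common prefixes stop at document boundaries.
        rank = [0] * n
        for r, i in enumerate(sa):
            rank[i] = r
        lcp, h = [0] * n, 0
        for i in range(n):
            if rank[i] == 0:
                h = 0
                continue
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and tokens[i + h] == tokens[j + h]:
                h += 1
            lcp[rank[i]] = h
            h = max(h - 1, 0)

        # Step 4: any LCP value >= min_words marks a phrase that occurs at least twice.
        words = {v: k for k, v in vocab.items()}
        phrases = set()
        for r in range(1, n):
            if lcp[r] >= min_words:
                start = sa[r]
                phrases.add(" ".join(words[t] for t in tokens[start:start + lcp[r]]))
        return phrases

    print(repeated_phrases(["the quick brown fox", "a quick brown dog"]))   # {'quick brown'}

Counting how often each repeated phrase occurs (for the "most common" ranking) means grouping consecutive suffix-array entries whose LCP stays at or above the phrase length, which the same LCP array supports.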
