Indexing full text search queries for efficient fanout - algorithm

While thinking about the design of various applications I might like to build some day, in several cases I have had a need to fan out a stream of incoming events based on whether or not they match a large selection of full text search queries provided by users.
A simple example of this problem is the implementation of a tool like Twitter streaming search: given many thousands of new tweets every second, efficiently select only the streaming subscribers whose search query is likely to match an incoming tweet.
A statement of the problem would be something like, "inverse full text search", where the full text is the query, and the search results are the search queries that would match that text.
For single-term queries an implementation is obvious: simply tokenize the incoming document, then search a map of term->(list of subscribers), but things become more difficult when boolean queries are possible. In fact the problem is more general than full text search, but it is easiest to understand in that context. There are many other settings where a large set of boolean terms needs to be combined in some way that optimizes the cost of evaluating them.
For example, imagine 3 search subscriptions:
Google AND Glass
Google AND Analytics
((Glass AND Google) NOT Knol) OR Twitter
One possibility is to parse the query into a tree, then visit each node, extracting the term and using the "map of term" approach; however, this would require re-evaluating the subscriber's query against the incoming document for each term. With enough subscribers, this is going to get slow very quickly.
Instead I am wondering if there is a well documented approach to rewrite the queries, perhaps into a single query, where the result can be evaluated once, and tree nodes are annotated with a list of subscriber queries known to either exactly or almost certainly match any document reaching that point in the tree.
For example, the above queries might be rewritten so that a map of term->(query tree) exists, such as:
Google -> (Analytics[2], Glass[1,3])
Twitter -> ([3])
Is there any existing publicly documented system that does something like this? Ideally the solution would allow incrementally adding and removing subscribers, without some expensive step to rewrite the entire structure.

One way to do this is with a simple dictionary that maps terms to queries. So given these four queries:
Query1: Google AND Glass
Query2: Google AND Analytics
Query3: ((Glass AND Google) NOT Knol) OR Twitter
Query4: Quick AND red AND fox
You build a dictionary, keyed by the term:
Google: Query1, Query2, Query3
Glass: Query1, Query3
Analytics: Query2
Knol: Query3
Twitter: Query3
Quick: Query4
red: Query4
fox: Query4
Now, consider a sentence like "The red glass on the knol is from Google."
Parse each word and look it up in the dictionary. For each word in the dictionary, add its list of queries to your master list of queries. Also, for every word that is found in the dictionary, add it to a hash table of relevant words. At the end of this step you'll have two structures: the list of queries to check, and the list of relevant words:
Queries list: Query1, Query2, Query3, Query4
Relevant words: Google, Glass, Knol, red
Now it's a matter of processing each query, checking to see if the words are in the relevant words list.
For Query1, for example, you'd check to see if the relevant words list contains Google and Glass.
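A minimal sketch of this two-pass approach in Python (the query names are the ones above, but the queries are simplified to AND-only word lists; a real system would store parsed boolean trees and evaluate them in the second pass):

from collections import defaultdict
import re

# AND-only queries for illustration (Query3's NOT/OR would need a real boolean tree).
queries = {
    "Query1": ["google", "glass"],
    "Query2": ["google", "analytics"],
    "Query4": ["quick", "red", "fox"],
}

# At subscription time: term -> set of query ids.
term_to_queries = defaultdict(set)
for qid, terms in queries.items():
    for term in terms:
        term_to_queries[term].add(qid)

def match(text):
    words = set(re.findall(r"\w+", text.lower()))
    # Pass 1: relevant words and candidate queries.
    relevant = words & term_to_queries.keys()
    candidates = set()
    for w in relevant:
        candidates |= term_to_queries[w]
    # Pass 2: evaluate each candidate (here: all of its terms must be present).
    return [qid for qid in candidates if all(t in relevant for t in queries[qid])]

print(match("The quick red fox wears Google Glass"))   # ['Query1', 'Query4'] in some order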
The complexity of this isn't too bad. You have an O(1) lookup for each parsed word in the text. For each query identified during the parsing phase, you have some number, N, of O(1) lookups against the relevant words hash table. There's some very small amount of logic involved in doing the Boolean evaluation, but most queries will be simple "all words" or "any word" type queries (i.e. "this AND that", or "this OR that").
The nice thing about this model is that it's pretty easy to farm out to multiple processors. You can parse the words in a single thread, pushing them to a concurrent queue. Multiple threads service that queue, doing the lookups and building their own lists of queries that need to be checked. When all those lookups are done, you merge the queries lists from the multiple threads and again put them on a concurrent queue that multiple threads can service.
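A rough sketch of that shape, assuming the term_to_queries and queries structures from the previous sketch (in CPython you would want processes rather than threads for real parallelism on pure-Python lookups; this only shows the partition/merge structure):

from concurrent.futures import ThreadPoolExecutor

def scan_chunk(words_chunk):
    # Look up one chunk of words; return (candidate query ids, relevant words found).
    found_queries, found_words = set(), set()
    for w in words_chunk:
        qids = term_to_queries.get(w)
        if qids:
            found_words.add(w)
            found_queries |= qids
    return found_queries, found_words

def match_parallel(words, workers=4):
    chunks = [words[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(scan_chunk, chunks))
    candidates = set().union(*(q for q, _ in results))
    relevant = set().union(*(w for _, w in results))
    # The per-query checks could be farmed out the same way; done serially here.
    return [q for q in candidates if all(t in relevant for t in queries[q])]

print(match_parallel("the quick red fox wears google glass".split()))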
Say you have a million queries, averaging five words each (which would likely be a big average). Absolute worst case here is that some text comes in that contains at least one word from each query. So you have a list of a million queries to check in pass 2. At worst, that's 5 million dictionary lookups.
The first pass of this algorithm is O(n), where n is the number of words in the incoming text. That will create a list of k queries. The second pass is O(km), where m is the average number of words per query.
The beauty of this approach is its simplicity, and it will perform well for moderately large numbers of queries, depending on the size of the text you're feeding it. There is a potentially faster way, but it's much more involved.
Rather than building a dictionary that maps terms to queries, you use a modified Aho-Corasick string search algorithm, very similar to what the Unix fgrep program uses to match multiple fixed strings in a single pass over the text. The details of that are way beyond my ability to explain in a short note here. You might want to track down an old Dr. Dobb's Journal article called something like "Parallel Pattern Matching and fgrep", which as I recall had a reasonably good explanation of how this is done. (A quick search didn't find the article text, but you might have better luck.) You'll also want to read the original Aho-Corasick paper: Efficient String Matching: an Aid to Bibliographic Search. That discusses parallel matching of literal strings, but the basic idea works for matching regular expressions or Boolean search queries.
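For reference, a minimal, self-contained Aho-Corasick sketch. Note it matches raw substrings; a production version would respect token boundaries and would typically store query ids rather than the pattern strings:

from collections import deque

class AhoCorasick:
    # Minimal automaton: find every occurrence of many patterns in one pass over the text.
    def __init__(self, patterns):
        self.goto = [{}]      # state -> {char: next state}
        self.fail = [0]       # failure links
        self.out = [[]]       # patterns ending at each state
        for pat in patterns:
            self._add(pat)
        self._build_failure_links()

    def _add(self, pat):
        state = 0
        for ch in pat:
            if ch not in self.goto[state]:
                self.goto[state][ch] = len(self.goto)
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            state = self.goto[state][ch]
        self.out[state].append(pat)

    def _build_failure_links(self):
        queue = deque(self.goto[0].values())          # depth-1 states keep fail = root
        while queue:
            s = queue.popleft()
            for ch, t in self.goto[s].items():
                queue.append(t)
                f = self.fail[s]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[t] = self.goto[f].get(ch, 0)
                self.out[t] += self.out[self.fail[t]]

    def terms_in(self, text):
        state, found = 0, set()
        for ch in text:
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            found.update(self.out[state])
        return found

ac = AhoCorasick(["google", "glass", "analytics", "knol", "twitter"])
print(ac.terms_in("google unveils a new glass prototype"))   # {'google', 'glass'}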

If you can parse your query into boolean expressions, what you have is a set of rules, with the input variables being the presence or absence of terms in the search text. For each search text you could use parsing plus table lookup, or Aho-Corasick, to work out which terms are present, and then use an implementation of the Rete algorithm such as http://en.wikipedia.org/wiki/Drools to work out which rules to fire given that input.
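This is not a Rete network, but a hedged sketch of the end result: once the present terms are known, each subscription is just a boolean expression over those facts (the expression trees below are hypothetical encodings of the three example queries):

# Leaves are terms; inner nodes are ("AND", ...), ("OR", ...), ("NOT", x).
subscriptions = {
    1: ("AND", "google", "glass"),
    2: ("AND", "google", "analytics"),
    3: ("OR", ("AND", ("AND", "glass", "google"), ("NOT", "knol")), "twitter"),
}

def evaluate(node, present):
    if isinstance(node, str):                       # a term: true if it occurs in the text
        return node in present
    op, *args = node
    if op == "AND":
        return all(evaluate(a, present) for a in args)
    if op == "OR":
        return any(evaluate(a, present) for a in args)
    if op == "NOT":
        return not evaluate(args[0], present)
    raise ValueError(op)

present = {"google", "glass", "knol"}               # terms found by the lookup/Aho-Corasick pass
fired = [sid for sid, expr in subscriptions.items() if evaluate(expr, present)]
print(fired)   # [1]: subscription 3 is blocked by NOT knol, subscription 2 lacks analytics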
(Alternately, you could batch up your input texts, build a small text search database from them, and then run your queries. My guess is that this stops being stupidly inefficient when you can afford to wait long enough between query runs for the text search database size to be comparable with the size of the combined queries).

Related

Why are Lucene/Elasticsearch prefix queries slower than term queries?

I've been recently reading about Lucene and Elasticsearch and it seems the following are true (correct me if I'm wrong):
prefix queries are slower than term queries
suffix queries (*ing) are slower than prefix queries (ing*)
This seems like a strange combination of properties. Perhaps I need to broaden my scope of data structures I'm considering, but if a segment were structured like a hash table, I could easily see that 1 would be true (the term query would be O(1) and a prefix query would require a full scan) however 2 would not be true (both prefix and suffix would require a full scan). If the segment were laid out like a sorted array, I could easily see that 2 would be true (a prefix query could be performed with a binary search O(log n) and the suffix would require a full scan) however 1 would no longer be true (both a term and prefix query would require a binary search).
My only other thought is that there might be some combination of both hash and sort going on to account for this behavior (ex. hash to some partition and sort within that partition). However my understanding is that Elasticsearch partitions by a document identifier but the inverted index key is a term. So a query for a term still requires the request being sent to all partitions.
Can anyone provide me with some intuition as to how/why this behavior exists?
Note:
https://www.youtube.com/watch?v=T5RmMNDR5XI would suggest that a segment is structured similar to a sorted array rather than a hash table.
The reason I believe 1 is true is https://medium.com/@mourjo_sen/a-detailed-comparison-between-autocompletion-strategies-in-elasticsearch-66cb9e9c62c4 mentions "The most important reason why prefix-like queries should never be used in production environments is that they are extremely slow. The reason for this is that the tokens in ES are not directly prefix-able"
The reason I believe 2 is true is https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html mentions "Allowing a wildcard at the beginning of a word (eg "*ing") is particularly heavy, because all terms in the index need to be examined, just in case they match."
I'm not that familiar with ES-specific details, so they might be doing something different from plain Solr - but #1 is not usually the case.
A prefix match will be more expensive than looking up a single term, but it's not that much more expensive. It can be compared to doing a range search (which you can perform if you want to - field:[aa TO ab) could be compared to doing field:aa* (in theory); effectively retrieving all tokens that lie within that range, then resolving the document set that matches those tokens.
The fact that there are more tokens that match means that you can't simply take the list attached to a single token (a matching term) and retrieve those documents; you have to retrieve a possibly large set of matching tokens and then compute the document set for that. However, this is not a very expensive computation, just more expensive than a single exact match. The lookup can be done by finding the start and end indices of the matching tokens in the index, retrieving all terms between those two, and finding the set of matching document ids.
A query of foo* against an index with the following terms:
bar, baz, foo, foobar, spam
          ^---------^
will collect the lists of documents attached to foo and foobar, merge them and then retrieve the documents.
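A toy sketch of that range lookup over a sorted term dictionary (not Lucene's actual on-disk format, just the idea), using Python's bisect:

import bisect

terms = ["bar", "baz", "foo", "foobar", "spam"]                  # sorted term dictionary
postings = {"bar": [1], "baz": [2], "foo": [3, 4], "foobar": [4], "spam": [5]}

def prefix_query(prefix):
    lo = bisect.bisect_left(terms, prefix)                       # first term >= prefix
    hi = bisect.bisect_left(terms, prefix + "\uffff")            # first term past the prefix range
    docs = set()
    for t in terms[lo:hi]:                                       # all terms sharing the prefix
        docs.update(postings[t])
    return docs

print(prefix_query("foo"))   # {3, 4}: union of the postings for foo and foobar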
Slower does not mean that it's catastrophic or not optimised in any way; just that it's more expensive than a direct match where the set of documents has already been determined. However, you probably have more than one term in your query already, so the same process (albeit slightly higher up in the hierarchy) happens there as well.
A postfix match (your #2) - i.e. matching a wildcard at the beginning of the token - is expensive, since all tokens in the index usually have to be considered. The index has the terms sorted alphanumerically, so when you want to only look at the end of the string you have to consider that each token could match, regardless of where it's located in the index - so you get a full index scan. However, if this is a use case you see happening often, you can use the reverse wildcard filter. This works by reversing the string and indexing tokens that match the terms in reverse order, so that foo is indexed as oof and a wildcard search for *foo gets turned into oof* instead.
A query of *ar against an index with the following terms:
bar, baz, foo, foobar, spam
?!   ?    ?    ?!      ?
will have to look at each term to decide if it ends with ar.
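A toy sketch of the reverse wildcard trick mentioned above, reusing the bisect range lookup from the previous sketch: reversing each token at index time turns the leading wildcard into a cheap prefix range:

import bisect

terms = ["bar", "baz", "foo", "foobar", "spam"]
reversed_terms = sorted(t[::-1] for t in terms)        # ['maps', 'oof', 'rab', 'raboof', 'zab']

def suffix_query(suffix):
    rev = suffix[::-1]                                 # "*ar" becomes a prefix query for "ra"
    lo = bisect.bisect_left(reversed_terms, rev)
    hi = bisect.bisect_left(reversed_terms, rev + "\uffff")
    return [t[::-1] for t in reversed_terms[lo:hi]]    # back to the original spelling

print(suffix_query("ar"))   # ['bar', 'foobar'], with no full scan of the original field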
The reason for using an EdgeNGramFilter (your comment / #3) is that you move as much of the required processing as possible to indexing time (doing the work that you'd otherwise do at query time; even if prefix queries aren't really expensive, they still have a cost), and additionally: wildcard queries do not support most analysis. So many people end up with wildcard queries against a set of tokens that have been stemmed or otherwise processed, and are then surprised when their wildcard queries don't generate a match. Only a small subset of filters can be applied to wildcard queries (such as the LowercaseFilter). Those filters are known as being "multi term aware", since the terms they process can end up being expanded to multiple terms before collection of documents happens.
Another reason is that using an EdgeNGramFilter will give you proper frequency scores for each prefix, giving you effective scoring for prefixed terms as well.
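For illustration, roughly what an edge n-gram expansion produces at index time (the parameter names here are illustrative, not the exact Lucene/ES settings):

def edge_ngrams(term, min_gram=2, max_gram=10):
    # Index-time expansion: every leading prefix becomes its own indexed term,
    # so a prefix "query" is just an exact term lookup with its own document frequency.
    return [term[:n] for n in range(min_gram, min(max_gram, len(term)) + 1)]

print(edge_ngrams("google"))   # ['go', 'goo', 'goog', 'googl', 'google']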

Minhashing on Strings with K-length

I have an application where I should implement Bloom Filters and Minhashing to find similar items.
I have the Bloom Filter implemented, but I need to make sure I understand the Minhashing part to do it:
The application generates a number of k-length Strings and stores them in a document; all of those are then inserted in the Bloom filter.
Where I want to use MinHash is in giving the user the option to insert a String, which is then compared against the document to find the most similar strings in it.
Do I have to shingle all the Strings in the document? The problem is that I can't really find anything to help me with this; all I find is about comparing two documents, never one String to a set of Strings.
So: the user enters a string and the application finds the most similar strings within a single document. By "similarity", do you mean something like Levenshtein distance (whereby "cat" is deemed similar to "rat" and "cart"), or some other measure? And are you (roughly speaking) looking for similar paragraphs, similar sentences, similar phrases or similar words? These are important considerations.
Also, you say you are comparing one string to a set of strings. What are these strings? Sentences? Paragraphs? If you are sure you don't want to find any similarities spanning multiple paragraphs (or multiple sentences, or what-have-you) then it makes sense to think of the document as multiple separate strings; otherwise, you should think of it as a single long string.
The MinHash algorithm is for comparing many documents to each other, when it's impossible to store all documents in memory simultaneously and individually comparing every document to every other would be an n-squared problem. MinHash overcomes these problems by storing hashes for only some shingles, and it sacrifices some accuracy as a result. You don't need MinHash, as you could simply store every shingle in memory, using, say, 4-character-grams for your shingles. But if you don't expect word orderings to be switched around, you may find the Smith-Waterman algorithm more suitable.
If you're expecting the user to enter long strings of words, you may get better results basing your shingles on words; so 3-word-grams, for instance, ignoring differences in whitespacing, case and punctuation.
Generating 4-character-grams is simple: "The cat sat on the mat" would yield "The ", "he c", "e ca", " cat", etc. Each of these would be stored in memory, along with the paragraph number it appeared in. When the user enters a search string, that would be shingled in identical manner, and the paragraphs containing the greatest number of shared shingles can be retrieved. For efficiency of comparison, rather than storing the shingles as strings, you can store them as hashes using FNV1a or a similar cheap hash.
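A small sketch along those lines: 4-character shingles hashed with 32-bit FNV-1a, with paragraphs ranked by the number of shared shingles (case and whitespace normalisation is left out for brevity):

def fnv1a(s):
    # 32-bit FNV-1a: a cheap, non-cryptographic string hash.
    h = 0x811C9DC5
    for byte in s.encode("utf-8"):
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF
    return h

def shingles(text, k=4):
    return {fnv1a(text[i:i + k]) for i in range(len(text) - k + 1)}

paragraphs = ["The cat sat on the mat", "A dog sat on a log"]     # toy "document"
index = {i: shingles(p) for i, p in enumerate(paragraphs)}        # paragraph -> shingle hashes

def most_similar(query):
    q = shingles(query)
    return max(index, key=lambda i: len(index[i] & q))            # most shared shingles

print(most_similar("the cat sat"))   # 0 (case differences would need normalising first)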
Shingles can also be built up from words rather than characters (e.g. "the cat sat", "cat sat on", "sat on the"). This tends to be better with larger pieces of text: say, 30 words or more. I would typically ignore all differences in whitespace, case and punctuation if taking this approach.
If you want to find matches that can span paragraphs as well, it becomes quite a bit more complex, as you have to store the character positions for every shingle and consider many different configurations of possible matches, penalizing them according to how widely scattered their shingles are. That could end up being quite complex code, and I would seriously consider just sticking with a Levenshtein-based solution such as Smith-Waterman, even if it doesn't deal well with inversions of word order.
I don't think a bloom filter is likely to help you much, though I'm not sure how you're using it. Bloom filters might be useful if your document is highly structured: a limited set of possible strings and you're searching for the existence of one of them. For natural language, though, I doubt it will be very useful.

Use of indexes for multi-word queries in full-text search (e.g. web search)

I understand that a fundamental aspect of full-text search is the use of inverted indexes. So, with an inverted index a one-word query becomes trivial to answer. Assuming the index is structured like this:
some-word -> [doc385, doc211, doc39977, ...] (sorted by rank, descending)
To answer the query for that word the solution is just to find the correct entry in the index (which takes O(log n) time) and present some given number of documents (e.g. the first 10) from the list specified in the index.
But what about queries which return documents that match, say, two words? The most straightforward implementation would be the following:
set A to be the set of documents which have word 1 (by searching the index).
set B to be the set of documents which have word 2 (ditto).
compute the intersection of A and B.
Now, step three probably takes O(n log n) time to perform. For very large A and Bs that could make the query slow to answer. But search engines like Google always return their answer in a few milliseconds. So that can't be the full answer.
One obvious optimization is that since a search engine like Google doesn't return all the matching documents anyway, we don't have to compute the whole intersection. We can start with the smallest set (e.g. B) and find enough entries which also belong to the other set (e.g. A).
But can't we still have the following worst case? If we have set A be the set of documents matching a common word, and set B be the set of documents matching another common word, there might still be cases where A ∩ B is very small (i.e. the combination is rare). That means that the search engine has to linearly go through all elements x of B, checking whether they are also elements of A, to find the few that match both conditions.
Linear isn't fast. And you can have way more than two words to search for, so just employing parallelism surely isn't the whole solution. So, how are these cases optimized? Do large-scale full-text search engines use some kind of compound indexes? Bloom filters? Any ideas?
You said some-word -> [doc385, doc211, doc39977, ...] (sorted by rank, descending), but I think the search engine may not do this; the doc list should be sorted by doc ID, and each doc has a rank for the word.
When a query comes in, it contains several keywords. For each word, you can find a doc list. For all keywords, you can do merge operations and compute the relevance of each doc to the query. Finally, return the top-ranked docs to the user.
And the query process can be distributed to gain better performance.
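A sketch of that merge step: with both postings lists sorted by doc ID, the intersection is a single linear two-pointer pass, O(|A| + |B|) (skip lists or galloping can make it faster still):

def intersect(a, b):
    # Intersect two doc-ID lists that are sorted ascending.
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

print(intersect([2, 5, 9, 14, 77], [5, 14, 20, 77, 81]))   # [5, 14, 77]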
Even without ranking, I wonder how the intersection of two sets is computed so fast by google.
Obviously the worst-case scenario for computing the intersection for some words A, B, C is when their indexes are very big and the intersection very small. A typical case would be a search for some very common ("popular" in DB terms) words in different languages.
Let's try "concrete" and 位置 ("site", "location") in chinese and 極端な ("extreme") in japanese.
Google search for 位置 returns "About 1,500,000,000 results (0.28 seconds) "
Google search for "concrete" returns "About 2,020,000,000 results (0.46 seconds) "
Google search for "極端な" About 7,590,000 results (0.25 seconds)
It is extremely improbable that all three terms would ever appear in the same document, but let's google them:
Google search for "concrete 位置 極端な" returns "About 174,000 results (0.13 seconds)"
Adding a Russian word "игра" (game):
Search игра: About 212,000,000 results (0.37 seconds)
Search for all of them: " игра concrete 位置 極端な " returns About 12,600 results (0.33 seconds)
Of course the returned search results are nonsense and they do not contain all the search terms.
But looking at the query time for the composed ones, I wonder if there is some intersection computed on the word indexes at all. Even if everything is in RAM and heavily sharded, computing the intersection of two sets with 1,500,000,000 and 2,020,000,000 entries is O(n) and can hardly be done in <0.5 sec, since the data is on different machines and they have to communicate.
There must be some join computation, but at least for popular words, this is surely not done on the whole word index. Adding the fact that the results are fuzzy, it seems evident that Google uses an optimization of the kind "give back some high-ranked results, and stop after 0.5 sec".
How this is implemented, I don't know. Any ideas?
Most systems implement TF-IDF in one way or another. TF-IDF is a product of the term frequency and inverse document frequency functions.
The IDF function relates the document frequency to the total number of documents in a collection. The common intuition for this function is that it should give a higher value for terms that appear in few documents and a lower value for terms that appear in all documents, making them irrelevant.
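A toy illustration of that weighting (the plain log-IDF form; real systems add smoothing, length normalisation and other refinements):

import math

def tf_idf(term_freq, doc_freq, num_docs):
    # Frequent in this document, rare in the collection => high score.
    idf = math.log(num_docs / doc_freq)
    return term_freq * idf

# A term appearing 3 times in a doc, but in only 10 of 1,000,000 docs, scores highly:
print(tf_idf(3, 10, 1_000_000))        # ~34.5
# A stopword-like term present in 900,000 of 1,000,000 docs scores close to zero:
print(tf_idf(3, 900_000, 1_000_000))   # ~0.32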
You mention Google, but Google optimises search with PageRank (links in/out) as well as term frequency and proximity. Google distributes the data and uses Map/Reduce to parallelise operations - to compute PageRank+TF-IDF.
There's a great explanation of the theory behind this in Information Retrieval: Implementing and Evaluating Search Engines, chapter 2. Another idea to investigate further is to look at how Solr implements this.
Google does not need to actually find all results, only the top ones.
The index can be sorted by grade first and only then by id. Since the same ID always has the same grade this does not hurt sets intersection time.
So Google starts the intersection until it finds 10 results, and then does a statistical estimation to tell you how many more results it found.
A worst case is almost impossible.
If all words are "common" then intersection will give the first 10 results very fast. If there is a rare word, then intersection is fast because complexity is O(N long M) where N is the smallest group.
You need to remember that Google keeps its indexes in memory and uses parallel computing. For example, you can split the problem into two searches, each searching only half of the web, then merge the results and take the best. Google has millions of computers.

Efficient methods for finding most common phrases in a body of text AKA trending topics

I previously asked a similar question on this topic; I ended up deriving several solutions which worked, one based on bloom filters + ngrams, the other based on hash tables + ngrams. Both solutions perform fine with small data sets (<1000 texts, usually tweets) but the computation time grew exponentially, meaning doing 10,000 could take hours.
I am currently working in Ruby and perhaps, that is the problem but are there any other solutions or approaches I could attempt to solve this problem?
If you are looking to do text searching in large sets of data, you might have to look into something like Solr. There is a really easy-to-set-up Solr gem called sunspot: http://outoftime.github.com/sunspot/
Your problem can be solved by following the steps below:
(Optional, for performance purposes) Run through all the documents and create a mapping between each unique word and an integer. Also, it is better to create a special mapping for sentence terminators (.!? etc.). This is to facilitate checking for phrases that do not cross sentence boundaries.
Concatenate all the documents into a huge array of mapped integers (in previous step). This can be done online (to save space) as we go through the next steps.
Construct a suffix array of the string from the previous step, augmented with the longest common prefix (LCP) array. The fastest known implementation is SA-IS, which runs in O(n) worst-case time. Some special handling is required to be sure that each common prefix does not cross a sentence boundary.
The LCP array is basically the result you need. You can do whatever you want with it, such as: sort it to find the longest repeated phrases among the documents, find all 5-word, 4-word, 3-word phrases, etc. The most common phrases (I assume at least 2-word phrases here) can be found by looking at both the LCP and suffix arrays (see the sketch below).
A quick Google search shows that this library contains a Ruby suffix array implementation; you can generate the LCP array from it in O(n).
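A rough Python sketch of the concatenate / suffix-array / LCP steps. It uses a naive O(n² log n) suffix array rather than SA-IS, plus Kasai's LCP algorithm, and fakes the word-to-integer mapping with a tiny hand-built stream:

def suffix_array(s):
    # Naive construction, fine for a sketch; SA-IS would be O(n).
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s, sa):
    # Kasai's algorithm: LCP of each suffix with its predecessor in suffix-array order.
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp, h = [0] * n, 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1
        else:
            h = 0
    return lcp

# Words mapped to integers and documents concatenated; 9 is a sentence-end marker so
# phrases cannot cross the boundary. Stands in for "the quick red fox the quick red . the quick".
text = (1, 2, 3, 4, 1, 2, 3, 9, 1, 2)
sa = suffix_array(text)
lcp = lcp_array(text, sa)
# An LCP value of 3 means a 3-word phrase repeats; scan the LCP array for the maxima.
print(max(lcp))   # 3, i.e. "the quick red" occurs at least twice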

How do you Index Files for Fast Searches?

Nowadays, Microsoft and Google will index the files on your hard drive so that you can search their contents quickly.
What I want to know is how do they do this? Can you describe the algorithm?
The simple case is an inverted index.
The most basic algorithm is simply:
scan the file for words, creating a list of unique words
normalize and filter the words
place an entry tying that word to the file in your index
The details are where things get tricky, but the fundamentals are the same.
By "normalize and filter" the words, I mean things like converting everything to lowercase, removing common "stop words" (the, if, in, a etc.), possibly "stemming" (removing common suffixes for verbs and plurals and such).
After that, you've got a unique list of words for the file and you can build your index off of that.
There are optimizations for reducing storage, techniques for checking locality of words (is "this" near "that" in the document, for example).
But, that's the fundamental way it's done.
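A minimal sketch of that basic algorithm (the stop-word list is a tiny illustrative one, and there is no stemming):

import re
from collections import defaultdict

STOP_WORDS = {"the", "if", "in", "a", "on", "is"}

def tokenize(text):
    # Scan for words, lowercase them, drop stop words.
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]

def build_index(files):
    # files: {filename: contents}. Returns word -> set of filenames.
    index = defaultdict(set)
    for name, contents in files.items():
        for word in set(tokenize(contents)):
            index[word].add(name)
    return index

index = build_index({
    "a.txt": "The cat sat on the mat",
    "b.txt": "A cat chased the dog",
})
print(sorted(index["cat"]))   # ['a.txt', 'b.txt']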
Here's a really basic description; for more details, you can read this textbook (free online): http://informationretrieval.org/¹
1) For all files, create an index. The index consists of all unique words that occur in your dataset (called a "corpus"). With each word, a list of document ids is associated; each document id refers to a document that contains the word.
Variations: sometimes when you generate the index you want to ignore stop words ("a", "the", etc). You have to be careful, though ("to be or not to be" is a real query composed of stopwords).
Sometimes you also stem the words. This has more impact on search quality in non-English languages that use suffixes and prefixes to a greater extent.
2) When a user enters a query, look up the corresponding lists, and merge them. If it's a strict boolean query, the process is pretty straightforward -- for AND, a docid has to occur in all the word lists, for OR, in at least one wordlist, etc.
3) If you want to rank your results, there are a number of ways to do that, but the basic idea is to use the frequency with which a word occurs in a document, as compared to the frequency you expect it to occur in any document in the corpus, as a signal that the document is more or less relevant. See textbook.
4) You can also store word positions to infer phrases, etc.
Most of that is irrelevant for desktop search, as you are more interested in recall (all documents that include the term) than ranking.
¹ previously on http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html, accessible via wayback machine
You could always look into something like Apache Lucene.
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
