How do you Index Files for Fast Searches? - algorithm

Nowadays, Microsoft and Google will index the files on your hard drive so that you can search their contents quickly.
What I want to know is how do they do this? Can you describe the algorithm?

The simple case is an inverted index.
The most basic algorithm is simply:
1. Scan the file for words, creating a list of unique words.
2. Normalize and filter the words.
3. Place an entry tying each word to the file in your index.
The details are where things get tricky, but the fundamentals are the same.
By "normalize and filter" the words, I mean things like converting everything to lowercase, removing common "stop words" (the, if, in, a etc.), possibly "stemming" (removing common suffixes for verbs and plurals and such).
After that, you've got a unique list of words for the file and you can build your index off of that.
There are optimizations for reducing storage and techniques for checking the locality of words (is "this" near "that" in the document, for example).
But, that's the fundamental way it's done.
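As a rough illustration of those fundamentals, here is a minimal sketch in Python; the stop-word set, the crude suffix-stripping "stemmer", and the in-memory dictionary are stand-ins for what a real indexer would use:

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "if", "in", "a", "an", "of", "to"}  # illustrative subset only

def tokenize(text):
    """Scan text for words: lowercase, alphanumeric runs only."""
    return re.findall(r"[a-z0-9]+", text.lower())

def normalize(words):
    """Filter stop words and apply a crude plural-stripping 'stemmer'."""
    for w in words:
        if w in STOP_WORDS:
            continue
        yield w[:-1] if w.endswith("s") and len(w) > 3 else w

def build_index(files):
    """Map each normalized word to the set of files that contain it."""
    index = defaultdict(set)
    for path in files:
        with open(path, encoding="utf-8", errors="ignore") as f:
            for word in set(normalize(tokenize(f.read()))):
                index[word].add(path)
    return index
```

A real desktop indexer would also persist this map to disk and update it incrementally as files change, rather than rebuilding it from scratch.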

Here's a really basic description; for more details, you can read this textbook (free online): http://informationretrieval.org/¹
1) For all files, create an index. The index consists of all unique words that occur in your dataset (called a "corpus"). With each word, a list of document ids is associated; each document id refers to a document that contains the word.
Variations: sometimes when you generate the index you want to ignore stop words ("a", "the", etc). You have to be careful, though ("to be or not to be" is a real query composed of stopwords).
Sometimes you also stem the words. This has more impact on search quality in non-English languages that use suffixes and prefixes to a greater extent.
2) When a user enters a query, look up the corresponding lists and merge them. If it's a strict boolean query, the process is pretty straightforward: for AND, a docid has to occur in all of the word lists; for OR, in at least one word list; and so on (a minimal merge sketch appears below).
3) If you want to rank your results, there are a number of ways to do that, but the basic idea is to use the frequency with which a word occurs in a document, as compared to the frequency you expect it to occur in any document in the corpus, as a signal that the document is more or less relevant. See textbook.
4) You can also store word positions to infer phrases, etc.
Most of that is irrelevant for desktop search, as you are more interested in recall (all documents that include the term) than ranking.
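As mentioned in step 2 above, here is a minimal sketch of the boolean merge, assuming the index maps each term to a set of document ids (real engines walk sorted posting lists with skip pointers instead of intersecting whole sets):

```python
def search_and(index, terms):
    """AND: a doc id must occur in every term's posting list."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def search_or(index, terms):
    """OR: a doc id must occur in at least one term's posting list."""
    return set.union(*(index.get(t, set()) for t in terms)) if terms else set()
```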
¹ previously on http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html, accessible via wayback machine

You could always look into something like Apache Lucene.
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Related

MinHashing on k-length Strings

I have an application where I should implement Bloom filters and MinHashing to find similar items.
I have the Bloom filter implemented, but I need to make sure I understand the MinHashing part in order to do it:
The application generates a number of k-length strings and stores them in a document; then all of those are inserted into the Bloom filter.
Where I want to implement the MinHash is by giving the user the option to enter a string, then comparing it against the document and trying to find the most similar strings in it.
Do I have to shingle all the strings in the document? The problem is that I can't really find anything to help me with this; all I find concerns comparing two documents, never one string against a set of strings.
So: the user enters a string and the application finds the most similar strings within a single document. By "similarity", do you mean something like Levenshtein distance (whereby "cat" is deemed similar to "rat" and "cart"), or some other measure? And are you (roughly speaking) looking for similar paragraphs, similar sentences, similar phrases or similar words? These are important considerations.
Also, you say you are comparing one string to a set of strings. What are these strings? Sentences? Paragraphs? If you are sure you don't want to find any similarities spanning multiple paragraphs (or multiple sentences, or what-have-you) then it makes sense to think of the document as multiple separate strings; otherwise, you should think of it as a single long string.
The MinHash algorithm is for comparing many documents to each other, when it's impossible to store all the documents in memory simultaneously and individually comparing every document to every other would be an n-squared problem. MinHash overcomes these problems by storing hashes for only some shingles, and it sacrifices some accuracy as a result. You don't need MinHash, as you could simply store every shingle in memory, using, say, 4-character-grams for your shingles. But if you don't expect word orderings to be switched around, you may find the Smith-Waterman algorithm more suitable.
If you're expecting the user to enter long strings of words, you may get better results basing your shingles on words; so 3-word-grams, for instance, ignoring differences in whitespace, case and punctuation.
Generating 4-character-grams is simple: "The cat sat on the mat" would yield "The ", "he c", "e ca", " cat", etc. Each of these would be stored in memory, along with the paragraph number it appeared in. When the user enters a search string, that would be shingled in the same manner, and the paragraphs containing the greatest number of shared shingles can be retrieved. For efficiency of comparison, rather than storing the shingles as strings, you can store them as hashes using FNV-1a or a similar cheap hash.
Shingles can also be built up from words rather than characters (e.g. "the cat sat", "cat sat on", "sat on the"). This tends to be better with larger pieces of text: say, 30 words or more. I would typically ignore all differences in whitespace, case and punctuation if taking this approach.
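Here is a sketch of both shingling styles plus the cheap-hash idea; the 4-character and 3-word sizes are just the defaults suggested above, and the FNV-1a constants are the standard 64-bit ones:

```python
import re

def char_shingles(text, n=4):
    """Overlapping character n-grams, e.g. 'The cat' -> 'The ', 'he c', 'e ca', ..."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def word_shingles(text, n=3):
    """Overlapping word n-grams, ignoring case, punctuation and extra whitespace."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def fnv1a_64(shingle):
    """Cheap 64-bit FNV-1a hash so shingles can be stored as integers."""
    h = 0xcbf29ce484222325
    for byte in shingle.encode("utf-8"):
        h ^= byte
        h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h
```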
If you want to find matches that can span paragraphs as well, it becomes quite a bit more complex, as you have to store the character positions for every shingle and consider many different configurations of possible matches, penalizing them according to how widely scattered their shingles are. That could end up being quite complex code, and I would seriously consider just sticking with a Levenshtein-based solution such as Smith-Waterman, even if it doesn't deal well with inversions of word order.
I don't think a bloom filter is likely to help you much, though I'm not sure how you're using it. Bloom filters might be useful if your document is highly structured: a limited set of possible strings and you're searching for the existence of one of them. For natural language, though, I doubt it will be very useful.

Predicting phrases instead of just next word

For an application that we built, we are using a simple statistical model for word prediction (like Google Autocomplete) to guide search.
It uses a sequence of ngrams gathered from a large corpus of relevant text documents. By considering the previous N-1 words, it suggests the 5 most likely "next words" in descending order of probability, using Katz back-off.
We would like to extend this to predict phrases (multiple words) instead of a single word. However, when we are predicting a phrase, we would prefer not to display its prefixes.
For example, consider the input the cat.
In this case we would like to make predictions like the cat in the hat, but not the cat in and not the cat in the.
Assumptions:
We do not have access to past search statistics
We do not have tagged text data (for instance, we do not know the parts of speech)
What is a typical way to make these kinds of multi-word predictions? We've tried multiplicative and additive weighting of longer phrases, but our weights are arbitrary and overfit to our tests.
For this question, you need to define what it is you consider to be a valid completion -- then it should be possible to come up with a solution.
In the example you've given, "the cat in the hat" is much better than "the cat in the". I could interpret this as, "it should end with a noun" or "it shouldn't end with overly common words".
You've restricted the use of "tagged text data", but you could use a pretrained model (e.g. NLTK, spaCy, StanfordNLP) to guess the parts of speech and make an attempt to restrict predictions to only complete noun phrases (or sequences ending in a noun). Note that you would not necessarily need to tag all documents fed into the model, only the phrases you're keeping in your autocomplete db.
Alternatively, you could avoid completions that end in stopwords (or very high-frequency words). Both "in" and "the" occur in almost all English documents, so you could experimentally find a frequency cutoff (e.g. can't end in a word that occurs in more than 50% of documents) that helps you filter. You could also look at phrases: if the end of the phrase is drastically more common as a shorter phrase, then it doesn't make sense to tack it on, as the user could come up with it on their own.
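As a hedged sketch of that document-frequency cutoff: the candidate completions, the doc_freq table and the 50% threshold below are placeholders you would tune on your own corpus:

```python
def filter_completions(candidates, doc_freq, n_docs, max_df=0.5):
    """Drop completions whose final word is too common (a proxy for
    'ends in a stopword'), e.g. 'the cat in the' vs 'the cat in the hat'."""
    kept = []
    for phrase in candidates:
        last = phrase.split()[-1].lower()
        if doc_freq.get(last, 0) / n_docs <= max_df:
            kept.append(phrase)
    return kept

# filter_completions(["the cat in the hat", "the cat in the"],
#                    doc_freq={"the": 990, "hat": 40}, n_docs=1000)
# -> ["the cat in the hat"]
```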
Ultimately, you could create a labeled set of good and bad instances and try to create a supervised re-ranker based on word features -- both ideas above could be strong features in a supervised model (document frequency = 2, pos tag = 1). This is typically how search engines with data can do it. Note that you don't need search statistics or users for this, just a willingness to label the top-5 completions for a few hundred queries. Building a formal evaluation (that can be run in an automated manner) would probably help when trying to improve the system in the future. Any time you observe a bad completion, you could add it to the database and do a few labels -- over time, a supervised approach would get better.

Matching words and valid sub-words in Elasticsearch

I've been working with ElasticSearch within an existing code base for a few days, so I expect that the answer is easy once I know what I'm doing. I want to extend a search to yield the same results when I search with a compound word, like "eyewitness", or its component words separated by a whitespace, like "eye witness".
For example, I have a catalog of toy cars that includes both "firetruck" toys and "fire truck" toys. I would like to ensure that if someone searched on either of these terms, the results would include both the "firetruck" and the "fire truck" entries.
I attempted to do this at first with the "fuzziness" of a match, hoping that "fire truck" would be considered one transform away from "firetruck", but that does not work: ES fuzziness is per-word and will not add or remove whitespace characters as a valid transformation.
I know that I could do some brute-forcing before generating the query by trying to come up with additional search terms by breaking big words into smaller words and also joining smaller words into bigger words and checking all of them against a dictionary, but that falls apart pretty quickly when "fuzziness" and proper names are part of the task.
It seems like this is exactly the kind of thing that ES should do well, and that I simply don't have the right vocabulary yet for searching for the solution.
Thanks, everyone.
There are two things you could do:
You could split words into their compounds, i.e. firetruck would be split into the two tokens fire and truck; see Elasticsearch's compound word token filters.
You could use n-grams, i.e. for 4-grams the original firetruck gets split into the tokens fire, iret, retr, etru, truc, ruck. At query time, the scoring function helps you end up with pretty decent results; check out Elasticsearch's ngram tokenizer.
Always remember to do the same tokenization on both the analysis and the query side.
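To make the n-gram route concrete, index settings along these lines define a 4-gram analyzer (shown as the body you would send when creating the index; the field name is invented and the exact client call depends on your Elasticsearch version):

```python
# Index body for a custom 4-gram analyzer applied to a catalog "name" field.
# The same analyzer is used at index and search time, per the advice above.
ngram_index_body = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "gram4_tokenizer": {"type": "ngram", "min_gram": 4, "max_gram": 4}
            },
            "analyzer": {
                "gram4_analyzer": {
                    "type": "custom",
                    "tokenizer": "gram4_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {"name": {"type": "text", "analyzer": "gram4_analyzer"}}
    },
}
```

With these settings, "firetruck" and "fire truck" share most of their 4-grams, so either query scores both catalog entries.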
I would start with the ngrams and if that is not good enough you should go with the compounds and split them yourself - but that's a lot of work depending on the vocabulary you have under consideration.
hope the concepts and the links help, fricke

Indexing full text search queries for efficient fanout

While thinking about the design of various applications I might like to build some day, in several cases I have had a need to fan out a stream of incoming events based on whether or not they match a large selection of full text search queries provided by users.
A simple example of this problem is the implementation of a tool like Twitter streaming search: given many thousands of new tweets every second, efficiently select only the streaming subscribers whose search query is likely to match an incoming tweet.
A statement of the problem would be something like, "inverse full text search", where the full text is the query, and the search results are the search queries that would match that text.
For single-term queries an implementation is obvious: simply tokenize the incoming document, then search a map of term -> (list of subscribers). Things become more difficult when boolean queries are possible. In fact, the problem is more general than full text search, but it is easiest to understand in that context. There are many other examples where a large set of boolean terms needs to be combined in some way to optimize the cost of evaluating them.
For example, imagine 3 search subscriptions:
Google AND Glass
Google AND Analytics
((Glass AND Google) NOT Knol) OR Twitter
One possibility is to parse the query into a tree, then visit each node, extracting the term and using the "map of term" approach; however, this would require re-evaluating the subscriber's query against the incoming document for each term. With enough subscribers, this is going to get slow very quickly.
Instead I am wondering if there is a well-documented approach to rewrite the queries, perhaps into a single query, where the result can be evaluated once and tree nodes are annotated with a list of subscriber queries known to either exactly or almost certainly match any document reaching that point in the tree.
For example, the above queries might be rewritten so that a map of term->(query tree) exists, such as:
Google -> (Analytics[2], Glass[1,3])
Twitter -> ([3])
Is there any existing publicly documented system that does something like this? Ideally the solution would allow incrementally adding and removing subscribers, without some expensive step to rewrite the entire structure.
One way to do this is with a simple dictionary that maps terms to queries. So given these four queries:
Query1: Google AND Glass
Query2: Google AND Analytics
Query3: ((Glass AND Google) NOT Knol) OR Twitter
Query4: Quick AND red AND fox
You build a dictionary, keyed by the term:
Google: Query1, Query2, Query3
Glass: Query1, Query3
Analytics: Query2
Knol: Query3
Twitter: Query3
Quick: Query4
red: Query4
fox: Query4
Now, consider a sentence like "The red glass on the knol is from Google."
Parse each word and look it up in the dictionary. For each word in the dictionary, add its list of queries to your master list of queries. Also, for every word that is found in the dictionary, add it to a hash table of relevant words. At the end of this step you'll have two structures: the list of queries to check, and the list of relevant words:
Queries list: Query1, Query2, Query3, Query4
Relevant words: Google, Glass, Knol, red
Now it's a matter of processing each query, checking to see if the words are in the relevant words list.
For Query1, for example, you'd check to see if the relevant words list contains Google and Glass.
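A minimal sketch of those two passes, assuming each query has already been parsed into the set of terms it mentions plus a boolean predicate over the relevant-words set (the Query class and the example predicate are purely illustrative):

```python
import re
from collections import defaultdict

class Query:
    def __init__(self, name, terms, predicate):
        self.name = name            # e.g. "Query1"
        self.terms = terms          # every term the query mentions
        self.predicate = predicate  # callable: set of relevant words -> bool

def build_term_map(queries):
    """Dictionary keyed by term, listing every query that mentions it."""
    term_map = defaultdict(list)
    for q in queries:
        for term in q.terms:
            term_map[term].append(q)
    return term_map

def match(text, term_map):
    """Pass 1: collect candidate queries and relevant words.
    Pass 2: evaluate each candidate's boolean predicate."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    relevant, candidates = set(), {}
    for w in words:
        if w in term_map:
            relevant.add(w)
            for q in term_map[w]:
                candidates[q.name] = q
    return [q.name for q in candidates.values() if q.predicate(relevant)]

# query1 = Query("Query1", {"google", "glass"},
#                lambda ws: "google" in ws and "glass" in ws)
```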
The complexity of this isn't too bad. You have an O(1) lookup for each parsed word in the text. For each query identified during the parsing phase, you have some number, N, of O(1) lookups against the relevant-words hash table. There's some very small amount of logic involved in doing the Boolean evaluation, but most queries will be simple "all words" or "any word" type queries (i.e. "this AND that", or "this OR that").
The nice thing about this model is that it's pretty easy to farm out to multiple processors. You can parse the words in a single thread, pushing them to a concurrent queue. Multiple threads service that queue, doing the lookups and building their own lists of queries that need to be checked. When all those lookups are done, you merge the queries lists from the multiple threads and again put them on a concurrent queue that multiple threads can service.
Say you have a million queries, averaging five words each (which would likely be a big average). Absolute worst case here is that some text comes in that contains at least one word from each query. So you have a list of a million queries to check in pass 2. At worst, that's 5 million dictionary lookups.
The first pass of this algorithm is O(n), where n is the number of words in the incoming text. That will create a list of k queries. The second pass is O(km), where m is the average number of words per query.
The beauty of this approach is its simplicity, and it will perform well for moderately large numbers of queries, depending on the size of the text you're feeding it. There is a potentially faster way, but it's much more involved.
Rather than building a dictionary that maps terms to queries, you use a modified Aho-Corasick string search algorithm that is very similar to what the Unix fgrep program uses to match multiple regular expressions in a single pass over the text. The details of that are way beyond my ability to explain in a short note here. You might want to track down an old Dr. Dobb's Journal article called something like "Parallel Pattern Matching and fgrep", which as I recall had a reasonably good explanation of how this is done. (A quick search didn't find the article text, but you might have better luck.) You'll also want to read the original Aho-Corasick paper: Efficient String Matching: an Aid to Bibliographic Search. That discusses parallel pattern matching literal strings, but the basic idea works for matching regular expressions or Boolean search queries.
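Purely to make the idea concrete, here is a toy sketch of an Aho-Corasick automaton (goto transitions, failure links built by BFS, merged output sets) and a single-pass scan; it matches literal terms only, not boolean queries or regular expressions:

```python
from collections import defaultdict, deque

def build_automaton(patterns):
    """Build goto transitions, failure links and output sets for the patterns."""
    goto = [{}]                 # state -> {char: next state}
    output = defaultdict(set)   # state -> patterns that end at this state
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto[state][ch] = len(goto)
                goto.append({})
            state = goto[state][ch]
        output[state].add(pat)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())        # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            output[t] |= output[fail[t]]   # inherit matches from the suffix state
    return goto, fail, output

def scan(text, goto, fail, output):
    """Single pass over the text, reporting (end_position, pattern) pairs."""
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in output[state]:
            hits.append((i, pat))
    return hits

# scan("ushers", *build_automaton(["he", "she", "hers"])) finds all three terms.
```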
If you can parse your query into boolean expressions, what you have is a set of rules, with the input variables being the presence or absence of terms in the search text. For each search text you could use parsing plus table lookup, or Aho-Corasick, to work out which terms are present, and then use an implementation of the Rete algorithm such as http://en.wikipedia.org/wiki/Drools to work out which rules to fire given that input.
(Alternately, you could batch up your input texts, build a small text search database from them, and then run your queries. My guess is that this stops being stupidly inefficient when you can afford to wait long enough between query runs for the text search database size to be comparable with the size of the combined queries).

Search Query Tokenizer

We're trying to add a simple search functionality to our website that lists restaurants. We try to detect the place name, location, and place features from the search string, something like "cheap restaurants near cairo" or "chinese and high-end food in virginia".
What we are doing right now is tokenizing the query and searching in the tables with the least performance cost first (the table of prices (cheap, budget, expensive, high-end) is smaller than the table of places). Is this the right approach?
--
Regards.
Yehia
I'd say you should build sets of synonyms (e.g. cheap, low budget, etc. go into synset:1) and map each token from the search string to one of those groups.
Btw, it will be easy to handle spelling mistakes here since this is generally a pretty small search space. Edit distance, common k-grams, ... anything should be alright.
As a next step you should build inverted index lists for each of those syn-groups that map to a sorted list of restaurants associated with that property. For each syngroup from a query, get all those lists and simply intersect them.
Words that cannot be mapped to one of those synsets will probably have to be ignored, unless you have some sort of full text about the restaurants that you could index as well. In that case you can also build such restaurant lists for "normal" words and intersect them as well. But this would already be quite close to a classical search engine, and it might be a good idea to use a technology like Apache Lucene. Without full texts I don't think you'd need such a thing, because an inverted index of syngroups is really easy to process on your own.
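A rough sketch of the synset mapping plus list intersection; the synonym groups, restaurant ids and index contents are invented for illustration, and plain sets stand in for the sorted lists described above:

```python
# Token -> synset (invented examples).
SYNSETS = {
    "cheap": "price:budget", "budget": "price:budget", "low-cost": "price:budget",
    "chinese": "cuisine:chinese", "high-end": "price:high-end",
    "cairo": "location:cairo", "virginia": "location:virginia",
}

# Inverted index: synset -> restaurant ids (illustrative data).
SYNSET_INDEX = {
    "price:budget": {1, 4, 7},
    "cuisine:chinese": {4, 9},
    "location:cairo": {1, 4},
}

def search(query):
    """Map each recognized token to its synset, then intersect the id sets;
    unmapped tokens ("restaurants", "in", ...) are ignored."""
    synsets = {SYNSETS[t] for t in query.lower().split() if t in SYNSETS}
    postings = [SYNSET_INDEX.get(s, set()) for s in synsets]
    return set.intersection(*postings) if postings else set()

# search("cheap chinese restaurants in cairo") -> {4}
```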
Seems you may be missing how misspelled queries are handled.
