Which data structure to use for phrase match?

Given a phrase, I want to check whether any entry in a predefined list of phrases is contained in it ("phrase match"). The predefined list of phrases may be huge, and I'm reading it from a file.
If I were looking for an exact match, I would have read the list into a hash. But since I'm not looking for an exact match, I don't know which data structure to use.
Do I have to go over the entire list for each new phrase? Do you know of a data structure that fits a phrase match purpose?
Thanks,
Li

Related

Search for multiple words by prefix (trie data structure)

How can I use a trie (or another data structure or algorithm) to efficiently search for multiple words by prefix?
For example: suppose this is my data set:
Alice Jones
Bob Smith
Bobby Walker
John Doe
(10000 names in total)
A trie data structure allows me to efficiently retrieve all names starting with "Bo" (without iterating over all the names). But I also want to search on the last name by prefix, so searching for "Wa" should find "Bobby Walker". And to complicate things: when the user searches for "Bo Wa", this should also find the same name. How can I implement this? Should I use a separate trie structure for each part of the name? (And how do I combine the results?)
Background: I'm writing the search functionality for a big address book (10000+ names). I want a really fast autocomplete function that shows results while people are typing the first few letters of the first and last name. I already have a solution that uses a regex, but it needs to iterate over all the names, which is too slow.
A very good data structure would be a Burst Trie
There's a Scala implementation.
You could try a second trie with the reversed string and a wildcard search: http://phpir.com/tries-and-wildcards/
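To make the trie idea concrete, here is a minimal Python sketch (the class and function names are my own, not from any library) that puts every name token, first and last, into one trie, records at each node which full names pass through it, and answers a query like "Bo Wa" by intersecting the per-prefix result sets:

class TokenTrie:
    def __init__(self):
        self.root = {}          # nested dicts; the "#" key holds the set of matching names

    def insert(self, token, full_name):
        node = self.root
        for ch in token.lower():
            node = node.setdefault(ch, {})
            node.setdefault("#", set()).add(full_name)

    def names_with_prefix(self, prefix):
        node = self.root
        for ch in prefix.lower():
            if ch not in node:
                return set()
            node = node[ch]
        return node.get("#", set())

names = ["Alice Jones", "Bob Smith", "Bobby Walker", "John Doe"]
trie = TokenTrie()
for name in names:
    for token in name.split():          # index both first and last name tokens
        trie.insert(token, name)

def search(query):
    # "Bo Wa" -> names matching every prefix in the query
    results = [trie.names_with_prefix(p) for p in query.split()]
    return set.intersection(*results) if results else set()

print(search("Bo Wa"))   # {'Bobby Walker'}

This trades memory (each node stores a set of names) for very fast prefix lookups, which is usually acceptable for an address book of this size.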
I think a sorted array will also fit your requirements: an array containing Person objects (each with a firstName and a lastName field). Say you have a prefix and want to find all the values that match it. Run a binary search to find the first position where your prefix appears on firstName (call it firstIndex), and one more to find the last position (lastIndex). Now you can retrieve your values in O(lastIndex - firstIndex). The same goes when you want to find them by lastName.
When you have both a prefixFirstName and a prefixLastName, you can search for the interval where values match prefixFirstName and then, within that interval, check for the values matching prefixLastName. In conclusion, when you have one or two prefixes you run 4 binary searches (around 17 iterations per search for 100k names), which is fast enough, and you can retrieve the matches in linear time. Even if it isn't the fastest solution, I suggest it because it's easy to understand and easy to code.
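A rough Python sketch of this sorted-array approach, using the standard bisect module (the tuples below stand in for the Person objects, and the candidate interval is simply filtered by the last-name prefix):

import bisect

people = [("Alice", "Jones"), ("Bob", "Smith"), ("Bobby", "Walker"), ("John", "Doe")]

def prefix_range(sorted_keys, prefix):
    # first/last index of keys starting with prefix, via two binary searches
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return lo, hi

by_first = sorted(people, key=lambda p: p[0].lower())      # one copy sorted by first name
first_keys = [p[0].lower() for p in by_first]

def search(first_prefix, last_prefix):
    lo, hi = prefix_range(first_keys, first_prefix.lower())
    candidates = by_first[lo:hi]                            # everyone matching the first-name prefix
    return [p for p in candidates if p[1].lower().startswith(last_prefix.lower())]

print(search("Bo", "Wa"))   # [('Bobby', 'Walker')]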

How does Lucene (Solr/ElasticSearch) so quickly do filtered term counts?

From a data structure point of view how does Lucene (Solr/ElasticSearch) so quickly do filtered term counts? For example given all documents that contain the word "bacon" find the counts for all words in those documents.
First, for background, I understand that Lucene relies upon a compressed bit array data structure akin to CONCISE. Conceptually this bit array holds a 0 for every document that doesn't match a term and a 1 for every document that does match a term. But the cool/awesome part is that this array can be highly compressed and is very fast at boolean operations. For example if you want to know which documents contain the terms "red" and "blue" then you grab the bit array corresponding to "red" and the bit array corresponding to "blue" and AND them together to get a bit array corresponding to matching documents.
But how then does Lucene quickly determine counts for all words in documents that match "bacon"? In my naive understanding, Lucene would have to take the bit array associated with bacon and AND it with the bit arrays for every single other word. Am I missing something? I don't understand how this can be efficient. Also, do these bit arrays have to be pulled off of disk? That sounds all the worse!
How's the magic work?
You may already know this, but it won't hurt to say: Lucene uses inverted indexing. In this technique, a dictionary of every word occurring across all documents is built, and against each word, information about that word's occurrences is stored, something like the image credited below.
To achieve this, Lucene stores documents, indexes, and their metadata in different file formats. Follow this link for the file details: http://lucene.apache.org/core/3_0_3/fileformats.html#Overview
If you read the document numbers section: each document is given an internal ID, so when documents are found containing the word 'consign', the Lucene engine has a reference to their metadata.
Refer to the overview section to see what data gets saved in the different Lucene indexes.
Now that we have a pointer to the stored documents, Lucene may be getting the count in one of the following ways:
Really count the number of words if the document is stored
Use Term Dictionary, frequency, and proximity data to get the count.
Finally, which API are you using to "quickly determine counts for all words"?
Image credit to http://leanjavaengineering.wordpress.com/
Check the index file format here: http://lucene.apache.org/core/8_2_0/core/org/apache/lucene/codecs/lucene80/package-summary.html#package.description
There are no bitsets involved: it's an inverted index. Each term maps to a list of documents. In Lucene, the algorithms work on iterators over these "lists", so items from the iterator are read on demand, not all at once.
This diagram shows a very simple conjunction algorithm that just uses a next() operation: http://nlp.stanford.edu/IR-book/html/htmledition/processing-boolean-queries-1.html
Behind the scenes, Lucene works much like that diagram, though our lists are delta-encoded and bit-packed, and augmented with a skip list that lets us intersect more efficiently (via an additional advance() operation) than the algorithm above.
DocIDSetIterator is this "enumerator" in Lucene. It has two main methods, next() and advance(). And yes, you can decide to read the entire list, convert it into a bitset in memory, and implement this iterator over that in-memory bitset. This is what happens if you use CachingWrapperFilter.
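For illustration only, here is a toy Python sketch of that iterator-style conjunction over sorted doc-id lists. It is not Lucene's code, just the shape of the algorithm: an advance(target) that skips forward, and a loop that leapfrogs between the two iterators.

class PostingsIterator:
    """Iterates a sorted list of doc ids, loosely modelled on Lucene's doc-id iterators."""
    def __init__(self, doc_ids):
        self.doc_ids = doc_ids
        self.pos = -1

    def next_doc(self):
        self.pos += 1
        return self.doc_ids[self.pos] if self.pos < len(self.doc_ids) else None

    def advance(self, target):
        # skip forward to the first remaining doc id >= target (a skip list makes this sub-linear)
        self.pos += 1
        while self.pos < len(self.doc_ids) and self.doc_ids[self.pos] < target:
            self.pos += 1
        return self.doc_ids[self.pos] if self.pos < len(self.doc_ids) else None

def intersect(a, b):
    """Leapfrog intersection of two postings iterators."""
    matches = []
    doc_a, doc_b = a.next_doc(), b.next_doc()
    while doc_a is not None and doc_b is not None:
        if doc_a == doc_b:
            matches.append(doc_a)
            doc_a, doc_b = a.next_doc(), b.next_doc()
        elif doc_a < doc_b:
            doc_a = a.advance(doc_b)
        else:
            doc_b = b.advance(doc_a)
    return matches

red = PostingsIterator([1, 4, 7, 9, 12])
blue = PostingsIterator([4, 5, 9, 20])
print(intersect(red, blue))   # [4, 9]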

Algorithm to search for a list of words in a text

I have a list of words, fairly small, about 1000 or so. I want to check if any of the words in that list occur in an input text, and if so, which ones. Each input text is a few hundred words, and these are text paragraphs from the web, meaning there are a lot of them from different sites. I am trying to find the best algorithm for this.
I can see two obvious ways to do this --
A brute force way of searching for each word from the list in the text.
Create a hash table of words from the input text and then search for each word from the list in the hash table. This is fast.
Is there a better solution?
I am using Python, though I am not sure whether that changes the algorithm in any way.
Also, as an optimization to solution 2 above, I would like to store the generated hash table in persistent storage (a DB), so that if the list of words changes I can reuse the hash table without having to create it again. Of course, if the input text changes, I have to regenerate the hash table. Is it possible to save a hash table to a DB? Any recommendations? I am currently using MongoDB for my project, and I can only store JSON documents in it. I am new to MongoDB, have only just started working with it, and still do not fully understand its full potential.
I have searched SO and see two questions along similar lines; one of them suggests a hash table, but I would like to get pointers towards the optimization I have in mind.
Here are the previously asked questions on SO -
Is there an efficient algorithm to perform inverted full text search?
Searching a large list of words in another large list
EDIT: I just found another question on SO which is about the same problem.
Algorithm for multiple word matching in text
I guess there is no better solution than a hash table. But I would really like to optimize it so that, when the word list changes, I can quickly re-run the algorithm on all the text I have stored up. Should I change the tags on the question to also include some database technologies?
There is a better solution than a hash table. If you have a fixed set of words that you want to search for over a large body of text, the way you do it is with the Aho-Corasick string matching algorithm.
The algorithm builds a state machine from the words you want to search, and then runs the input text through that state machine, outputting matches as they're found. Because it takes some amount of time to build the state machine, the algorithm is best suited for searching very large bodies of text.
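For example, with the third-party pyahocorasick package (assuming it is installed; the word list and text below are made up), the usage looks roughly like this:

import ahocorasick   # third-party: pip install pyahocorasick

words = ["dog", "dogma", "cat", "horse", "skunk"]

automaton = ahocorasick.Automaton()
for idx, word in enumerate(words):
    automaton.add_word(word, (idx, word))   # payload returned with each match
automaton.make_automaton()                  # build the state machine (failure links)

text = "My karma ate your dogma."
for end_index, (idx, word) in automaton.iter(text):
    start_index = end_index - len(word) + 1
    print(word, "at", start_index)          # prints both "dog" and "dogma" at position 18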
You can do something similar with regular expressions. For example, you might want to find the words "dog", "cat", "horse", and "skunk" in some text. You can build a regular expression:
"dog|cat|horse|skunk"
And then run a regular expression match on the text. How you get all matches will depend on your particular regular expression library, but it does work. For very large lists of words, you'll want to write code that reads the words and generates the regex, but it's not terribly difficult to do and it works quite well.
There is a difference, though, between the results from a regex and the results from the Aho-Corasick algorithm. For example, if you're searching for the words "dog" and "dogma" in the string "My karma ate your dogma.", the regex library search will report finding "dogma". The Aho-Corasick implementation will report finding both "dog" and "dogma" at the same position.
If you want the Aho-Corasick algorithm to report whole words only, you have to modify the algorithm slightly.
Regex, too, will report matches on partial words. That is, if you're searching for "dog", it will find it in "dogma". But you can modify the regex to only give whole words. Typically, that's done with the \b, as in:
"\b(cat|dog|horse|skunk)\b"
The algorithm you choose depends a lot on how large the input text is. If the input text isn't too large, you can create a hash table of the words you're looking for, then go through the input text, breaking it into words and checking the hash table to see if each word is in the table. In Python, for example:
hash_table = set(target_words)          # the words you're looking for
for word in input_text.split():
    if word in hash_table:
        print(word)
Or, if you want a list of matching words that are in the input text:
hash_table = set(target_words)
found_words = set()                     # matching words found in the input text
for word in input_text.split():
    if word in hash_table:
        found_words.add(word)

Fast filtering of a string collection by substring?

Do you know of a method for quickly filtering a list of strings to obtain the subset that contain a specified string? The obvious implementation is to just iterate through the list, checking each string for whether it contains the search string. Is there a way to index the string list so that the search can be done faster?
The Wikipedia article on substring indexes lists a few ways to index substrings. You've got:
Suffix tree
Suffix array
N-gram index, an inverted file for all N-grams of the text
Compressed suffix array
FM-index
LZ-index
Yes, you could, for example, create an index for all two-character combinations in the strings. A string like "hello" would be added to the indexes for "he", "el", "ll" and "lo". To search for the string "hell", you would get the set of all strings that appear in each of the "he", "el" and "ll" indexes, then loop through those to check the actual content of the strings.
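A small Python sketch of that idea (the sample strings are made up, and the whole index is assumed to fit in memory):

from collections import defaultdict

strings = ["hello", "shell", "hollow", "world"]

# index: bigram -> set of positions of strings containing that bigram
index = defaultdict(set)
for i, s in enumerate(strings):
    for j in range(len(s) - 1):
        index[s[j:j + 2]].add(i)

def substring_search(query):
    bigrams = [query[j:j + 2] for j in range(len(query) - 1)]
    if not bigrams:                       # 0- or 1-character query: nothing to filter on
        return [s for s in strings if query in s]
    candidates = set.intersection(*(index[g] for g in bigrams))
    # verify candidates: matching all bigrams doesn't guarantee a real substring match
    return [strings[i] for i in sorted(candidates) if query in strings[i]]

print(substring_search("hell"))   # ['hello', 'shell']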
If you can preprocess the collection then you can do a lot of different things.
For example, you could build a trie including all your strings' suffixes, then use that to do very fast matching.
If you're going to be repeatedly searching the same text, then a suffix tree is probably worthwhile. If carefully applied, you can achieve linear time processing for most string problems. If not, then in practice you won't be able to do much better than Rabin-Karp, which is based on hashing, and is linear in expected time.
There are many freely available implementations of suffix trees. See for example, this C implementation, or for Java, check out the Biojava framework.
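For the suffix route, here is a minimal, memory-hungry Python sketch (my own toy code, not one of those library implementations): every suffix of every string is inserted into one dict-based trie, each node remembers which strings pass through it, and a substring query is then just a walk from the root.

def build_suffix_trie(strings):
    root = {}
    for sid, s in enumerate(strings):
        for start in range(len(s)):          # every suffix of every string
            node = root
            for ch in s[start:]:
                node = node.setdefault(ch, {"#": set()})
                node["#"].add(sid)           # strings whose text passes through this node
    return root

def containing(root, strings, query):
    node = root
    for ch in query:
        if ch not in node:
            return []
        node = node[ch]
    return [strings[i] for i in sorted(node.get("#", set()))]

strings = ["hello", "shell", "world"]
trie = build_suffix_trie(strings)
print(containing(trie, strings, "ell"))   # ['hello', 'shell']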
Not really anything that's viable, no, unless you have additional a priori knowledge of your data and/or search term. For instance, if you're only searching for matches at the beginning of your strings, then you could sort the strings and only look at the ones within the bounds of your search term (or even store them in a binary tree and only look at the branches that could possibly match). Likewise, if your potential search terms are limited, you could run all the possible searches against a string when it's initially input and then just store a table of which terms match and which don't.
Aside from that kind of thing, just iterating through is basically it.
That depends on if the substring is at the beginning of the string or can be anywhere in the string.
If it's anywhere then you pretty much need to iterate over the entire list unless your list is so large and the query happens sufficiently often that it's worth building a more sophisticated indexing solution.
If the substring is at the beginning of the string then it's easy: sort the list, find the start/end by bisection search, and take that subset.

How do you Index Files for Fast Searches?

Nowadays, Microsoft and Google will index the files on your hard drive so that you can search their contents quickly.
What I want to know is how do they do this? Can you describe the algorithm?
The simple case is an inverted index.
The most basic algorithm is simply:
scan the file for words, creating a list of unique words
normalize and filter the words
place an entry tying that word to the file in your index
The details are where things get tricky, but the fundamentals are the same.
By "normalize and filter" the words, I mean things like converting everything to lowercase, removing common "stop words" (the, if, in, a etc.), possibly "stemming" (removing common suffixes for verbs and plurals and such).
After that, you've got a unique list of words for the file and you can build your index off of that.
There are optimizations for reducing storage, techniques for checking locality of words (is "this" near "that" in the document, for example).
But, that's the fundamental way it's done.
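A bare-bones Python sketch of those steps (scan for words, normalize to lowercase, drop stop words, then tie each word to its file); the stop list and file contents below are made up for illustration:

import re
from collections import defaultdict

STOP_WORDS = {"the", "if", "in", "a", "of", "to"}   # a tiny illustrative stop list

# stand-ins for files read from disk
files = {
    "notes.txt": "The quick brown fox jumps over the lazy dog",
    "todo.txt": "Feed the dog, then walk the dog in the park",
}

index = defaultdict(set)    # word -> set of file names containing it
for name, text in files.items():
    words = re.findall(r"[a-z]+", text.lower())      # scan for words, lowercase them
    for word in set(words) - STOP_WORDS:             # unique words, minus stop words
        index[word].add(name)

print(sorted(index["dog"]))   # ['notes.txt', 'todo.txt']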
Here's a really basic description; for more details, you can read this textbook (free online): http://informationretrieval.org/¹
1) For all files, create an index. The index consists of all unique words that occur in your dataset (called a "corpus"). With each word, a list of document ids is associated; each document id refers to a document that contains the word.
Variations: sometimes when you generate the index you want to ignore stop words ("a", "the", etc). You have to be careful, though ("to be or not to be" is a real query composed of stopwords).
Sometimes you also stem the words. This has more impact on search quality in non-English languages that use suffixes and prefixes to a greater extent.
2) When a user enters a query, look up the corresponding lists, and merge them. If it's a strict boolean query, the process is pretty straightforward -- for AND, a docid has to occur in all the word lists, for OR, in at least one wordlist, etc.
3) If you want to rank your results, there are a number of ways to do that, but the basic idea is to use the frequency with which a word occurs in a document, as compared to the frequency you expect it to occur in any document in the corpus, as a signal that the document is more or less relevant. See textbook.
4) You can also store word positions to infer phrases, etc.
Most of that is irrelevant for desktop search, as you are more interested in recall (all documents that include the term) than ranking.
¹ Previously at http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html, accessible via the Wayback Machine.
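To make steps 2 and 3 concrete, here is a rough, self-contained Python sketch of the boolean AND merge and a simplified tf-idf-style score (the documents, tokenization, and scoring formula are illustrative, not from the textbook):

import math
import re
from collections import Counter, defaultdict

docs = {
    1: "to be or not to be",
    2: "the bacon of tomorrow is the bacon of today",
    3: "bacon and eggs for breakfast",
}

tokenized = {doc_id: re.findall(r"[a-z]+", text.lower()) for doc_id, text in docs.items()}
postings = defaultdict(set)          # word -> set of doc ids
for doc_id, words in tokenized.items():
    for word in words:
        postings[word].add(doc_id)

def and_query(*terms):
    # boolean AND: a doc id has to occur in every term's posting list
    return set.intersection(*(postings[t] for t in terms))

def score(doc_id, term):
    tf = Counter(tokenized[doc_id])[term] / len(tokenized[doc_id])
    idf = math.log(len(docs) / len(postings[term]))       # rarer terms weigh more
    return tf * idf

hits = and_query("bacon", "breakfast")
print(sorted(hits, key=lambda d: score(d, "bacon"), reverse=True))   # [3]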
You could always look into something like Apache Lucene.
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
