Index of sentences - algorithm

I have a few tens of thousands of short documents, each consisting of 10 to 20 English sentences (plus some non-sentence material such as HTML formatting or other "junk"). These documents are chopped out of longer documents: the shorter document "A1" might be sentences 10 through 20 of original document "A", and another shorter document "A2" might be sentences 11 through 25 of the same original document "A". Some of the original source documents are summaries or copies of other source documents, so original source document "B" might also contain sentences 10 through 20 of "A", although not necessarily in the same location, and that same group of sentences might have been extracted from "B" into another short document "B3".
For each sentence, or at least each sentence over a certain length (say, > 3 words long), I'd like to produce a list of every short document that sentence occurs in. I'd like to scan the existing shorter documents and produce that index, and also update that index as I break up further longer original source documents into shorter documents.
I'm thinking what I need is some code to make an efficient hash code for a sentence which has a very low likelihood of producing the same hash code for two different sentences. Is the hash algorithm used in Java String.hashCode() a good choice for that? MD5 or other cryptographic hash seems like it would be too expensive and overkill for this purpose.

I recently evaluated hash algorithms with the requirement that in a few million inputs there should be virtually no possibility of hash collision, and the hashing must be very fast. CityHash was the winner, hands-down.
If you're interested in calculating the probability of a hash collision, that subject is sometimes referred to as the Birthday Problem. The math behind it is here:
https://sites.google.com/site/craigandera/craigs-stuff/odds-ends/the-birthday-problem-calculator
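As a rough illustration of that calculation, the usual birthday-problem approximation p ≈ 1 - exp(-n(n-1) / (2 * 2^b)) can be evaluated directly. This is only a sketch: the function name and the sample sizes are made up for the example, and it assumes the hash behaves like a uniform random function.

```python
import math

def collision_probability(n_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_items share a hash value,
    assuming the hash output is uniformly distributed over 2**hash_bits values."""
    space = 2.0 ** hash_bits
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * space)).
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * space))

for bits in (32, 64, 128):
    print(bits, collision_probability(1_000_000, bits))
```

Under that assumption, a 32-bit hash such as the one returned by Java's String.hashCode() is almost certain to collide somewhere among a million sentences, while a 64-bit hash keeps the probability very small.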

More broadly, you would probably benefit from reading this book. The structure you are describing is a classic inverted index: the book describes efficient algorithms for creating, updating and performing interesting queries on them.
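A minimal sketch of such an inverted index, keyed by a 64-bit sentence hash, might look like the following. CityHash is not in the Python standard library, so hashlib.blake2b truncated to 8 bytes stands in for it here; the class name, the crude sentence splitter and the normalization (lowercasing, collapsing whitespace) are all assumptions made for the example.

```python
import hashlib
import re
from collections import defaultdict

def sentence_key(sentence: str) -> int:
    """64-bit key for a normalized sentence (BLAKE2b stands in for CityHash)."""
    normalized = " ".join(sentence.lower().split())
    digest = hashlib.blake2b(normalized.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def split_sentences(text: str) -> list[str]:
    """Very crude splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

class SentenceIndex:
    """Maps sentence hash -> set of ids of the short documents containing it."""

    def __init__(self, min_words: int = 4):
        self.min_words = min_words          # skip sentences of 3 words or fewer
        self.postings = defaultdict(set)

    def add_document(self, doc_id: str, text: str) -> None:
        # Can be called at any time, so the index is updated incrementally.
        for sentence in split_sentences(text):
            if len(sentence.split()) >= self.min_words:
                self.postings[sentence_key(sentence)].add(doc_id)

    def documents_containing(self, sentence: str) -> set[str]:
        return set(self.postings.get(sentence_key(sentence), set()))

index = SentenceIndex()
index.add_document("A1", "The quick brown fox jumps. It was a dark and stormy night.")
index.add_document("B3", "It was a dark and   stormy night. Nothing else happened here today.")
print(sorted(index.documents_containing("It was a dark and stormy night.")))
```

Storing only the hash keeps the index small; if a collision would be costly, the normalized sentence could be stored alongside the hash and compared on lookup.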

Related

Minhashing on Strings with K-length

I have an application where I should implement Bloom filters and MinHashing to find similar items.
I have the Bloom filter implemented, but I need to make sure I understand the MinHashing part before I do it:
The application generates a number of k-length Strings and stores them in a document; then all of those are inserted into the Bloom filter.
Where I want to implement MinHash is in giving the user the option to enter a String, which is then compared against the document to find the most similar Strings.
Do I have to shingle all the Strings in the document? The problem is that I can't really find anything to help me with this; all I find deals with comparing two documents, never one String to a set of Strings.
So: the user enters a string and the application finds the most similar strings within a single document. By "similarity", do you mean something like Levenshtein distance (whereby "cat" is deemed similar to "rat" and "cart"), or some other measure? And are you (roughly speaking) looking for similar paragraphs, similar sentences, similar phrases or similar words? These are important considerations.
Also, you say you are comparing one string to a set of strings. What are these strings? Sentences? Paragraphs? If you are sure you don't want to find any similarities spanning multiple paragraphs (or multiple sentences, or what-have-you) then it makes sense to think of the document as multiple separate strings; otherwise, you should think of it as a single long string.
The MinHash algorithm is for comparing many documents to each other, when it's impossible to store all documents in memory simultaneously and individually comparing every document to every other would be an n-squared problem. MinHash overcomes these problems by storing hashes for only some shingles, and it sacrifices some accuracy as a result. You don't need MinHash, as you could simply store every shingle in memory, using, say, 4-character-grams for your shingles. But if you don't expect word orderings to be switched around, you may find the Smith-Waterman algorithm more suitable (see also here).
If you're expecting the user to enter long strings of words, you may get better results basing your shingles on words; so 3-word-grams, for instance, ignoring differences in whitespacing, case and punctuation.
Generating 4-character-grams is simple: "The cat sat on the mat" would yield "The ", "he c", "e ca", " cat", etc. Each of these would be stored in memory, along with the paragraph number it appeared in. When the user enters a search string, that would be shingled in identical manner, and the paragraphs containing the greatest number of shared shingles can be retrieved. For efficiency of comparison, rather than storing the shingles as strings, you can store them as hashes using FNV1a or a similar cheap hash.
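A minimal sketch of that approach follows. The 32-bit FNV-1a constants are the standard ones; the function names, the sample paragraphs and the "count shared shingles" scoring are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a hash."""
    h = 0x811C9DC5                       # FNV offset basis
    for byte in data:
        h ^= byte
        h = (h * 0x01000193) & 0xFFFFFFFF  # FNV prime, truncated to 32 bits
    return h

def char_shingles(text: str, k: int = 4) -> set[int]:
    """Hashes of every k-character window of the text."""
    return {fnv1a_32(text[i:i + k].encode("utf-8")) for i in range(len(text) - k + 1)}

def build_shingle_index(paragraphs: list[str]) -> dict[int, set[int]]:
    """Map each shingle hash to the numbers of the paragraphs it appears in."""
    index = defaultdict(set)
    for number, paragraph in enumerate(paragraphs):
        for shingle in char_shingles(paragraph):
            index[shingle].add(number)
    return index

def most_similar(query: str, index, top: int = 3) -> list[tuple[int, int]]:
    """Return (paragraph number, shared shingle count), best matches first."""
    shared = Counter()
    for shingle in char_shingles(query):
        for number in index.get(shingle, ()):
            shared[number] += 1
    return shared.most_common(top)

paragraphs = ["The cat sat on the mat", "A dog barked at the moon", "The cat sat on the sofa"]
index = build_shingle_index(paragraphs)
print(most_similar("cat sat on a mat", index))
```

Raw shared-shingle counts favour long paragraphs; dividing by the paragraph's shingle count is one possible normalization if that matters.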
Shingles can also be built up from words rather than characters (e.g. "the cat sat", "cat sat on", "sat on the"). This tends to be better with larger pieces of text: say, 30 words or more. I would typically ignore all differences in whitespace, case and punctuation if taking this approach.
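Word-based shingling can be sketched in a few lines. The tokenizing regex and the Jaccard helper used to compare two shingle sets are assumptions for the example, not something the approach prescribes.

```python
import re

def word_shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    # Lowercasing and keeping only runs of letters, digits and apostrophes
    # discards differences in whitespace, case and punctuation.
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Shared shingles divided by all distinct shingles (0.0 if both are empty)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

print(sorted(word_shingles("The cat sat on the mat.")))
print(jaccard(word_shingles("The cat sat on the mat."),
              word_shingles("the CAT sat   on the sofa!")))
```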
If you want to find matches that can span paragraphs as well, it becomes quite a bit more complex, as you have to store the character positions for every shingle and consider many different configurations of possible matches, penalizing them according to how widely scattered their shingles are. That could end up as quite complex code, and I would seriously consider just sticking with a Levenshtein-based solution such as Smith-Waterman, even if it doesn't deal well with inversions of word order.
I don't think a bloom filter is likely to help you much, though I'm not sure how you're using it. Bloom filters might be useful if your document is highly structured: a limited set of possible strings and you're searching for the existence of one of them. For natural language, though, I doubt it will be very useful.

Efficient way to look up dictionary

I have a dictionary that maps words to ids, like:
at: 0
hello: 1
school: 2
fortune: 3
high: 4
we: 5
eat: 6
....
high_school: 17
fortune_cookie: 18
....
Then I have a document. What is the fastest, most efficient way to convert the content of the document to ids?
E.g.:
"At high school, we eat fortune cookie."
=> "0 17, 5 6 18"
I hope to see your suggestions.
Thanks for reading.
It really depends on how large your document is, whether your keyword list is static, and whether you need to find multi-word phrases. The naive way to do it is to look up every word from the document in the dictionary. Because dictionary lookups are O(1), looking up every word will take O(n) time, where n is the number of words in the document. If you need to find multi-word phrases, you can post-process the output to find those.
That's not the most efficient way to do things, but it's really easy to implement, reasonably fast, and will work very well if your documents aren't huge.
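A minimal sketch of the naive approach, using the dictionary from the question, could look like this. As a simplification it handles multi-word phrases inline, by trying the longest phrase first at each position, rather than in a separate post-processing pass; the function name, the max_phrase_len parameter and the choice to skip unknown words are assumptions.

```python
import re

word_ids = {"at": 0, "hello": 1, "school": 2, "fortune": 3, "high": 4,
            "we": 5, "eat": 6, "high_school": 17, "fortune_cookie": 18}

def encode(text: str, ids: dict[str, int], max_phrase_len: int = 2) -> list[int]:
    words = re.findall(r"[a-z']+", text.lower())
    out = []
    i = 0
    while i < len(words):
        # Try the longest phrase starting at position i, then shorter ones.
        for n in range(min(max_phrase_len, len(words) - i), 0, -1):
            key = "_".join(words[i:i + n])
            if key in ids:
                out.append(ids[key])
                i += n
                break
        else:
            i += 1  # word is not in the dictionary: skip it
    return out

print(encode("At high school, we eat fortune cookie.", word_ids))  # [0, 17, 5, 6, 18]
```

Because punctuation is dropped before the lookup, this sketch would also join a phrase across a comma; keeping punctuation as separate tokens would avoid that.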
If you have very large documents, then you probably want something like the Aho-Corasick string matching algorithm. That algorithm works in two stages. First it builds a trie from the words in your dictionary, and then it makes a single pass through the document and outputs all of the matches. It's more complicated to implement than the naive method, but it works very well once the trie is built. And, truth to tell, it's not that hard to implement. The original paper, which is linked from the Wikipedia article, explains the algorithm well and it's not difficult to convert their pseudocode into a working program.
Note, however, that you might get some unexpected results. For example, if your dictionary contains the words "high" and "school" as well as the two-word phrase "high school", the Aho-Corasick will give you matches for all three when it sees the phrase "high school".
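The two stages, and the overlapping matches just mentioned, can be seen in a compact sketch of the algorithm. This is an illustrative character-level version written for this example, not the pseudocode from the original paper; the function names are assumptions.

```python
from collections import deque

def build_automaton(patterns):
    """Stage 1: build the trie, then add failure links breadth-first."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node for the longest proper suffix in the trie
    out = [[]]    # out[node] -> patterns that end at this node
    for pattern in patterns:
        node = 0
        for ch in pattern:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pattern)
    queue = deque(goto[0].values())   # depth-1 nodes fail to the root
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target as well.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out

def find_all(text, automaton):
    """Stage 2: one pass over the text, yielding (start index, pattern)."""
    goto, fail, out = automaton
    node = 0
    matches = []
    for pos, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pattern in out[node]:
            matches.append((pos - len(pattern) + 1, pattern))
    return matches

automaton = build_automaton(["high", "school", "high school"])
print(find_all("at high school we eat", automaton))
```

Matching here is on raw characters, so "high" would also be found inside "highway"; restricting results to whole words needs an extra check on the characters around each match.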
You can try a trie data structure, or a red-black tree if the document doesn't contain many duplicates. A trie is much less expensive. You can also combine a trie with a wildcard: http://phpir.com/tries-and-wildcards

Algorithm to search for a list of words in a text

I have a list of words, fairly small: about 1000 or so. I want to check if any of the words in that list occur in an input text. If so, I would like to know which ones occur. The input texts are a few hundred words each and are text paragraphs from the web, meaning there are a lot of them from different sites. I am trying to find the best algorithm for it.
I can see two obvious ways to do this --
1) A brute force way of searching for each word from the list in the text.
2) Create a hash table of words from the input text and then search for each word from the list in the hash table. This is fast.
Is there a better solution?
I am using Python, though I am not sure if that changes the algorithm in any way.
Also, as an optimization to solution 2 above, I would like to store the generated hash table in persistent storage (a DB) so that if the list of words changes I can re-use the hash table without having to create it again. Of course, if the input text changes I have to regenerate the hash table. Is it possible to save a hash table to a DB? Any recommendations? I am currently using MongoDB for my project and I can only store JSON documents in it. I am new to MongoDB, have only just started working with it, and still do not understand its full potential.
I have searched SO and see two questions along similar lines and one of them suggests a hash table but I would like to get any pointers towards the optimization I have in mind.
Here are the previously asked questions on SO -
Is there an efficient algorithm to perform inverted full text search?
Searching a large list of words in another large list
EDIT: I just found another question on SO which is about the same problem.
Algorithm for multiple word matching in text
I guess there is no better solution than a hash table. But I would really like to optimize it so that changes to the word list can let me run the algorithm on all the text I have stored up quickly. Should I change the tags added to the question to also include some database technologies?
There is a better solution than a hash table. If you have a fixed set of words that you want to search for over a large body of text, the way you do it is with the Aho-Corasick string matching algorithm.
The algorithm builds a state machine from the words you want to search, and then runs the input text through that state machine, outputting matches as they're found. Because it takes some amount of time to build the state machine, the algorithm is best suited for searching very large bodies of text.
You can do something similar with regular expressions. For example, you might want to find the words "dog", "cat", "horse", and "skunk" in some text. You can build a regular expression:
"dog|cat|horse|skunk"
And then run a regular expression match on the text. How you get all matches will depend on your particular regular expression library, but it does work. For very large lists of words, you'll want to write code that reads the words and generates the regex, but it's not terribly difficult to do and it works quite well.
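Generating the regex from a word list might look like the sketch below, here using Python's re module. The helper name and the whole_words flag are assumptions; the flag wraps the alternation in \b word-boundary anchors so that only complete words match.

```python
import re

def compile_word_regex(words, whole_words=False):
    # Escape each word in case it contains regex metacharacters.
    alternation = "|".join(re.escape(w) for w in words)
    if whole_words:
        alternation = r"\b(?:" + alternation + r")\b"
    return re.compile(alternation)

words = ["dog", "cat", "horse", "skunk"]
text = "My dogma scared the cat and the horse."

print(compile_word_regex(words).findall(text))                    # ['dog', 'cat', 'horse']
print(compile_word_regex(words, whole_words=True).findall(text))  # ['cat', 'horse']
```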
There is a difference, though, between the results from a regex and the results from the Aho-Corasick algorithm. For example, suppose you're searching for the words "dog" and "dogma" in the string "My karma ate your dogma." A regex search reports a single match at that position: "dogma" with a leftmost-longest engine, or whichever alternative the pattern lists first with a backtracking engine. The Aho-Corasick implementation will report finding both "dog" and "dogma" at the same position.
If you want the Aho-Corasick algorithm to report whole words only, you have to modify the algorithm slightly.
Regex, too, will report matches on partial words. That is, if you're searching for "dog", it will find it in "dogma". But you can modify the regex to only give whole words. Typically, that's done with the \b, as in:
"\b(cat|dog|horse|skunk)\b"
The algorithm you choose depends a lot on how large the input text is. If the input text isn't too large, you can create a hash table of the words you're looking for. Then go through the input text, breaking it into words, and checking the hash table to see if the word is in the table. In pseudo code:
hashTable = Build hash table from target words
for each word in input text
    if word in hashTable then
        output word
Or, if you want a list of matching words that are in the input text:
hashTable = Build hash table from target words
foundWords = empty hash table
for each word in input text
    if word in hashTable then
        add word to foundWords
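In Python, the second version of that pseudocode translates almost line for line, with a set playing the role of the hash table. The tokenizing regex and the lowercasing are assumptions about what counts as a "word".

```python
import re

def find_target_words(text: str, targets) -> set[str]:
    target_set = {t.lower() for t in targets}   # the "hash table" of target words
    found = set()                               # foundWords
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        if word in target_set:
            found.add(word)
    return found

print(sorted(find_target_words("The dog chased the cat; the DOG won.",
                               ["dog", "cat", "horse"])))  # ['cat', 'dog']
```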

Efficient methods for finding most common phrases in a body of text AKA trending topics

I previously asked a similar question on this topic and ended up deriving several solutions that worked: one based on bloom filters + ngrams, the other based on hash tables + ngrams. Both solutions perform fine with small data sets (<1000 texts, usually tweets), but the computation time grew exponentially, meaning that processing 10,000 texts could take hours.
I am currently working in Ruby, and perhaps that is the problem, but are there any other solutions or approaches I could try to solve this problem?
If you are looking to do text searching in large sets of data, you might have to look into something like Solr. There is a Solr gem called sunspot that is really easy to set up: http://outoftime.github.com/sunspot/
Your problem can be solved by following the steps below:
1) (Optional, for performance) Run through all the documents and create a mapping between each unique word and an integer. It is also better to create a special mapping for sentence terminators (.!? etc.). This makes it easier to check that phrases do not cross a sentence boundary.
2) Concatenate all the documents into one huge array of the integers mapped in the previous step. This can be done online (to save space) as we go through the next steps.
3) Construct a suffix array of the array from the previous step, augmented with the longest common prefix (LCP) array. The fastest implementation known is SA-IS, which runs in O(n) worst-case time. See here. Some special handling is required to be sure that no common prefix crosses a sentence boundary.
4) The LCP array is basically the result you need. You can do whatever you want with it, such as sorting it to find the longest repeated phrases among the documents, or finding all 5-word, 4-word and 3-word phrases. The most common phrases (I assume phrases of at least 2 words here) can be found by looking at both the LCP array and the suffix array.
A quick Google search shows that this library contains a Ruby suffix array implementation. You can generate the LCP array from there in O(n) time. Reference.
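The steps above can be sketched in Python as follows. To stay short, this sketch builds the suffix array by plainly sorting the suffixes, which is far slower than SA-IS on large inputs, and uses Kasai's method for the LCP array. Giving every sentence boundary its own unique negative sentinel is one assumed way of doing the "special handling"; all names are made up for the example.

```python
import re
from collections import Counter

def tokenize(documents):
    """Steps 1-2: map words to integers and concatenate. Each sentence end gets
    a unique negative sentinel, so no common prefix can extend across it."""
    ids, tokens, sentinel = {}, [], -1
    for doc in documents:
        for sentence in re.split(r"[.!?]+", doc.lower()):
            words = re.findall(r"[a-z0-9']+", sentence)
            if not words:
                continue
            tokens.extend(ids.setdefault(w, len(ids)) for w in words)
            tokens.append(sentinel)
            sentinel -= 1
    return tokens, {i: w for w, i in ids.items()}

def lcp_array(tokens, sa):
    """Kasai's algorithm: lcp[r] is the length of the common prefix of the
    suffixes at sa[r - 1] and sa[r]."""
    n = len(tokens)
    rank = [0] * n
    for pos, start in enumerate(sa):
        rank[start] = pos
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and tokens[i + h] == tokens[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp

def repeated_phrases(documents, n):
    """Steps 3-4: phrases of n words occurring more than once, with counts."""
    tokens, words = tokenize(documents)
    sa = sorted(range(len(tokens)), key=lambda i: tokens[i:])  # naive, not SA-IS
    lcp = lcp_array(tokens, sa)
    counts = Counter()
    for r in range(1, len(sa)):
        if lcp[r] >= n:
            phrase = " ".join(words[t] for t in tokens[sa[r]:sa[r] + n])
            # k adjacent suffixes sharing n tokens means k + 1 occurrences.
            counts[phrase] = counts.get(phrase, 1) + 1
    return counts.most_common()

docs = ["I love green tea. You love green tea too.", "We all love green tea!"]
print(repeated_phrases(docs, 3))  # [('love green tea', 3)]
```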

How do you Index Files for Fast Searches?

Nowadays, Microsoft and Google will index the files on your hard drive so that you can search their contents quickly.
What I want to know is how do they do this? Can you describe the algorithm?
The simple case is an inverted index.
The most basic algorithm is simply:
scan the file for words, creating a list of unique words
normalize and filter the words
place an entry tying that word to the file in your index
The details are where things get tricky, but the fundamentals are the same.
By "normalize and filter" the words, I mean things like converting everything to lowercase, removing common "stop words" (the, if, in, a etc.), possibly "stemming" (removing common suffixes for verbs and plurals and such).
After that, you've got a unique list of words for the file and you can build your index off of that.
There are optimizations for reducing storage, and techniques for checking locality of words (is "this" near "that" in the document, for example).
But, that's the fundamental way it's done.
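As a minimal sketch of that basic algorithm: the stop-word list, the toy suffix-stripping stemmer and the sample file contents below are all assumptions for illustration, and a real indexer would use a proper stemmer and read the files from disk.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "if", "in", "a", "an", "and", "of", "to", "is"}

def stem(word: str) -> str:
    """Toy stemmer: strips a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def terms(text: str) -> set[str]:
    """Scan for words, then normalize (lowercase, stem) and filter (stop words)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {stem(w) for w in words if w not in STOP_WORDS}

def build_index(files: dict[str, str]) -> dict[str, set[str]]:
    """Inverted index: term -> set of file names containing it."""
    index = defaultdict(set)
    for name, content in files.items():
        for term in terms(content):
            index[term].add(name)
    return index

files = {
    "notes.txt": "Indexing the files in a folder",
    "todo.txt": "Index every file; search the index",
}
index = build_index(files)
print(sorted(index["index"]))   # ['notes.txt', 'todo.txt']
print(sorted(index["folder"]))  # ['notes.txt']
```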
Here's a really basic description; for more details, you can read this textbook (free online): http://informationretrieval.org/¹
1) For all files, create an index. The index consists of all unique words that occur in your dataset (called a "corpus"). With each word, a list of document ids is associated; each document id refers to a document that contains the word.
Variations: sometimes when you generate the index you want to ignore stop words ("a", "the", etc). You have to be careful, though ("to be or not to be" is a real query composed of stopwords).
Sometimes you also stem the words. This has more impact on search quality in non-English languages that use suffixes and prefixes to a greater extent.
2) When a user enters a query, look up the corresponding lists, and merge them. If it's a strict boolean query, the process is pretty straightforward -- for AND, a docid has to occur in all the word lists, for OR, in at least one wordlist, etc.
3) If you want to rank your results, there are a number of ways to do that, but the basic idea is to use the frequency with which a word occurs in a document, as compared to the frequency you expect it to occur in any document in the corpus, as a signal that the document is more or less relevant. See textbook.
4) You can also store word positions to infer phrases, etc.
Most of that is irrelevant for desktop search, as you are more interested in recall (all documents that include the term) than ranking.
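Steps 2) and 3) can be sketched as follows. The class name, the tokenizer and the particular tf * log(N / df) weighting are assumptions; the textbook describes several variants, and this is just one simple instance.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

class TinySearch:
    def __init__(self, docs):
        self.n_docs = len(docs)
        self.postings = defaultdict(dict)   # word -> {doc_id: term frequency}
        for doc_id, text in docs.items():
            for word, tf in Counter(tokenize(text)).items():
                self.postings[word][doc_id] = tf

    def search_and(self, query):
        """AND: the doc id has to occur in every word's list."""
        sets = [set(self.postings.get(w, {})) for w in tokenize(query)]
        return set.intersection(*sets) if sets else set()

    def search_or(self, query):
        """OR: the doc id has to occur in at least one word's list."""
        return set().union(*(self.postings.get(w, {}) for w in tokenize(query)))

    def ranked(self, query):
        """Score documents by term frequency times inverse document frequency."""
        scores = Counter()
        for word in tokenize(query):
            docs = self.postings.get(word, {})
            if not docs:
                continue
            idf = math.log(self.n_docs / len(docs))
            for doc_id, tf in docs.items():
                scores[doc_id] += tf * idf
        return [doc_id for doc_id, _ in scores.most_common()]

search = TinySearch({
    "a": "to be or not to be",
    "b": "the cat sat on the mat",
    "c": "the cat and the dog",
})
print(sorted(search.search_and("the cat")))  # ['b', 'c']
print(search.ranked("cat dog"))              # ['c', 'b']
```

Note that no stop words are removed here, so the query "to be or not to be" still finds document "a".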
¹ previously on http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html, accessible via wayback machine
You could always look into something like Apache Lucene.
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Resources