Related
I have a application where I should implement Bloom Filters and Minhashing to find similar items.
I have the Bloom Filter implemented but I need to make sure i understand the Minhashing part to do it:
The aplication generates a number of k-length Strings and stores it in a document, then all of those are inserted in the Bloom.
Where I want to implement the MinHash is by giving the option for the user to insert a String and then compare it and try to find the most similar ones on the document.
Do i have to Shingle all the Strings on the document? The problem is that I can't really find something to help me in theis, all I find is regarding two documents and never one String to a set of Strings.
So: the user enters a string and the application finds the most similar strings within a single document. By "similarity", do you mean something like Levenstein distance (whereby "cat" is deemed similar to "rat" and "cart"), or some other measure? And are you (roughly speaking) looking for similar paragraphs, similar sentences, similar phrases or similar words? These are important considerations.
Also, you say you are comparing one string to a set of strings. What are these strings? Sentences? Paragraphs? If you are sure you don't want to find any similarities spanning multiple paragraphs (or multiple sentences, or what-have-you) then it makes sense to think of the document as multiple separate strings; otherwise, you should think of it as a single long string.
The MinHash algorithm is for comparing many documents to each other, when it's impossible to store all document in memory simultaneously, and individually comparing every document to every other would be an n-squared problem. MinHash overcomes these problems by storing hashes for only some shingles, and it sacrifices some accuracy as a result. You don't need MinHash, as you could simply store every shingle in memory, using, say, 4-character-grams for your shingles. But if you don't expect word orderings to be switched around, you may find the Smith-Waterman algorithm more suitable (see also here).
If you're expecting the user to enter long strings of words, you may get better results basing your shingles on words; so 3-word-grams, for instance, ignoring differences in whitespacing, case and punctuation.
Generating 4-character-grams is simple: "The cat sat on the mat" would yield "The ", "he c", "e ca", " cat", etc. Each of these would be stored in memory, along with the paragraph number it appeared in. When the user enters a search string, that would be shingled in identical manner, and the paragraphs containing the greatest number of shared shingles can be retrieved. For efficiency of comparison, rather than storing the shingles as strings, you can store them as hashes using FNV1a or a similar cheap hash.
Shingles can also be built up from words rather than characters (e.g. "the cat sat", "cat sat on", "sat on the"). This tends to be better with larger pieces of text: say, 30 words or more. I would typically ignore all differences in whitespace, case and punctuation if taking this approach.
If you want to find matches that can span paragraphs as well, it becomes quite a bit more complex, as you have to store the character positions for every shingle and consider many different configurations of possible matches, penalizing them according to how widely scattered their shingles are. That could end up quite complex code, and I would seriously consider just sticking with a Levenstein-based solution such as Smith-Waterman, even if it doesn't deal well with inversions of word order.
I don't think a bloom filter is likely to help you much, though I'm not sure how you're using it. Bloom filters might be useful if your document is highly structured: a limited set of possible strings and you're searching for the existence of one of them. For natural language, though, I doubt it will be very useful.
I'm using tesseract for OCR and have noticed, that sometimes segmentation errors occur and characters, that "obviously" belong together are split into separted strings.
Based on a list of characters and their bounding boxes found in one text line and the prilimanary OCR result suggesting, which of these characters belong to one word, which algorithms can I apply to correct segmentation errors or verify the result?
So this this the data available:
List<Word> words;
for(Word word : words){
for(Char c : word.getChars()){
char ch = c.getValue();
Rectangle rect = c.getRect();
}
}
For OCR post-correction that takes into account the characters and words, but admittedly not the bounding boxes, one common practice is
to use a dictionary of valid words, as comprehensive as possible
to check the words that results from the OCR algorithm against that dictionary
if a word cannot be found as an exact match in the dictionary, to try and find a similar one
To make this possible, you need to prepare the dictionary implementation so it enables a search for similar strings, also known as approximate string matching or fuzzy string matching.
The two main approaches for this that I am aware of are
Levenshtein atomata, as described by Schulz et al (DOI: 10.1007/s10032-002-0082-8)
Metric trees, such as the BK tree described by Baeza-Yates and Navarro (DOI: 10.1109/SPIRE.1998.712978)
These approaches, as well as general approximate string matching approaches (such as search tries, q-grams matching and n-gram matching) all inherently use some kind of edit distance measure, more or less similar to Levenshtein distance. After analysing the specific OCR errors you are dealing with, you might want to adjust the edit distance algorithm and the other resources you are using to your specific needs. This may involve things like:
Assume a lower substitution distance between characters that your OCR program confuses frequently, or characters that look particularly similar when rendered with the font or style you are dealing with
Take possible segmentation errors into account by putting frequently occurring pairs of words in the dictionary (in addition to single words)
Ensure that your dictionary contains as many named entities and other domain specific (or corpus specific) elements
Further more, you can try to use a grammar, and/or a statistical language model, such as a Hidden Markov Model or Conditional Random Field model -- similar to the models used by POS taggers -- to make word corrections in context.
I have 2 sources of information for the same data (companies), which I can join together via a unique ID (contract number). The presence of the second, different source, is due to the fact that the 2 sources are updated manually, independently. So what I have is an ID and a company Name in 2 tables.
I need to come up with an algorithm that would compare the Name in the 2 tables for the same ID, and order all the companies by a variable which indicates how different the strings are (to highlight the most different ones, to be placed at the top of the list).
I looked at the simple Levenshtein distance calculation algorithm, but it's at the letter level, so I am still looking for something better.
The reason why Levenshtein doesn't really do the job is this: companies have a name, prefixed or postfixed by the organizational form (LTD, JSC, co. etc). So we may have a lot of JSC "Foo" which will differ a lot from Foo JSC., but what I am really looking for in the database is pairs of different strings like SomeLongCompanyName JSC and JSC OtherName.
Are there any Good ways to do this? (I don't really like the idea of using regex to separate words in each string, then find matches for every word in the other string by using the Levenshtein distance, so I am searching for other ideas)
How about:
1. Replace all punctuation by whitespace.
2. Break the string up into whitespace-delimited words.
3. Move all words of <= 4 characters to the end, sorted alphabetically.
4. Levenshtein.
Could you filter out (remove) those "common words" (similar to removing stop words for fulltext indexing) and then search on that? If not, could you sort the words alphabetically before comparing?
As an alternative or in addition to the Levenshtein distance, you could use Soundex. It's not terribly good, but it can be used to index the data (which is not possible when using Levenshtein).
Thank you both for ideas.
I used 4 indices which are levenshtein distances divided by the sum of the length of both words (relative distances) of the following:
Just the 2 strings
The string composed of the result after separating the word sequences, eliminating the non-word chars, ordering ascending and joining with space as separator.
The string which is contained between quotes (if no such string is present, the original string is taken)
The string composed of alphabetically ordered first characters of each word.
each of these in return is an integer value between 1 and 1000. The resulting value is the product of:
X1^E1 * X2^E2 * X3^E3 * X4^E4
Where X1..X4 are the indices, and E1..E4 are user-provided preferences of valuable (significant) is each index. To keep the result inside reasonable values of 1..1000, the vector (E1..E4) is normalized.
The results are impressive. The whole thing works much faster than I've expected (built it as a CLR assembly in C# for Microsoft SQL Server 2008). After picking E1..E4 correctly, the largest index (biggest difference) on non-null values in the whole database is 765. Right untill about 300 there is virtually no matching company name. Around 200 there are companies that have kind of similar names, and some are the same names but written in very different ways, with abbreviations, additional words, etc. When it comes down to 100 and less - practically all the records contain names that are the same but written with slight differences, and by 30, only the order or the punctuation may differ.
Totally works, result is better than I've expected.
I wrote a post on my blog, to share this library in case someone else needs it.
We have a list of about 150,000 words, and when the user enters a free text, the system should present a list of words from the dictionary, that are very close to words in the free text.
For instance, the user enters: "I would like to buy legoe toys in Walmart". If the dictionary contains "Lego", "Car" and "Walmart", the system should present "Lego" and "Walmart" in the list. "Walmart" is obvious because it is identical to a word in the sentence, but "Lego" is similar enough to "Legoe" to be mentioned, too. However, nothing is similar to "Car", so that word is not shown.
Showing the list should be realtime, meaning that when the user has entered the sentence, the list of words must be present on the screen. Does anybody know a good algorithm for this?
The dictionary actually contains concepts which may include a space. For instance, "Lego spaceship". The perfect solution recognizes these multi-word concepts, too.
Any suggestions are appreciated.
Take a look at http://norvig.com/spell-correct.html for a simple algorithm. The article uses Python, but there are links to implementations in other languages at the end.
You will be doing quite a few lookups of words against a fixed dictionary. Therefore you need to prepare your dictionary. Logically, you can quickly eliminate candidates that are "just too different".
For instance, the words car and dissimilar may share a suffix, but they're obviously not misspellings of each other. Now why is that so obvious to us humans? For starters, the length is entirely different. That's an immediate disqualification (but with one exception - below). So, your dictionary should be sorted by word length. Match your input word with words of similar length. For short words that means +/- 1 character; longer words should have a higher margin (exactly how well can your demographic spell?)
Once you've restricted yourself to candidate words of similar length, you'd want to strip out words that are entirely dissimilar. With this I mean that they use entirely different letters. This is easiest to compare if you sort the letters in a word alphabetically. E.g. car becomes "acr"; rack becomes "ackr". You'll do this in preprocessing for your dictionary, and for each input word. The reason is that it's cheap to determine the (size of an) difference of two sorted sets. (Add a comment if you need explanation). car and rack have an difference of size 1, car and hat have a difference of size 2. This narrows down your set of candidates even further. Note that for longer words, you can bail out early when you've found too many differences. E.g. dissimilar and biography have a total difference of 13, but considering the length (8/9) you can probably bail out once you've found 5 differences.
This leaves you with a set of candidate words that use almost the same letters, and also are almost the same length. At this point you can start using more refined algorithms; you don't need to run 150.000 comparisons per input word anymore.
Now, for the length exception mentioned before: The problem is in "words" like greencar. It doesn't really match a word of length 8, and yet for humans it's quite obvious what was meant. In this case, you can't really break the input word at any random boundary and run an additional N-1 inexact matches against both halves. However, it is feasible to check for just a missing space. Just do a lookup for all possible prefixes. This is efficient because you'll be using the same part of the dictionary over and over, e.g. g gr, gre, gree, etc. For every prefix that you've found, check if the remaining suffixis also in the dictionery, e.g. reencar, eencar. If both halves of the input word are in the dictionary, but the word itself isn't, you can assume a missing space.
You would likely want to use an algorithm which calculates the Levenshtein distance.
However, since your data set is quite large, and you'll be comparing lots of words against it, a direct implementation of typical algorithms that do this won't be practical.
In order to find words in a reasonable amount of time, you will have to index your set of words in some way that facilitates fuzzy string matching.
One of these indexing methods would be to use a suffix tree. Another approach would be to use n-grams.
I would lean towards using a suffix tree since I find it easier to wrap my head around it and I find it more suited to the problem.
It might be of interest to look at a some algorithms such as the Levenshtein distance, which can calculate the amount of difference between 2 strings.
I'm not sure what language you are thinking of using but PHP has a function called levenshtein that performs this calculation and returns the distance. There's also a function called similar_text that does a similar thing. There's a code example here for the levenshtein function that checks a word against a dictionary of possible words and returns the closest words.
I hope this gives you a bit of insight into how a solution could work!
Here at work, we often need to find a string from the list of strings that is the closest match to some other input string. Currently, we are using Needleman-Wunsch algorithm. The algorithm often returns a lot of false-positives (if we set the minimum-score too low), sometimes it doesn't find a match when it should (when the minimum-score is too high) and, most of the times, we need to check the results by hand. We thought we should try other alternatives.
Do you have any experiences with the algorithms?
Do you know how the algorithms compare to one another?
I'd really appreciate some advice.
PS: We're coding in C#, but you shouldn't care about it - I'm asking about the algorithms in general.
Oh, I'm sorry I forgot to mention that.
No, we're not using it to match duplicate data. We have a list of strings that we are looking for - we call it search-list. And then we need to process texts from various sources (like RSS feeds, web-sites, forums, etc.) - we extract parts of those texts (there are entire sets of rules for that, but that's irrelevant) and we need to match those against the search-list. If the string matches one of the strings in search-list - we need to do some further processing of the thing (which is also irrelevant).
We can not perform the normal comparison, because the strings extracted from the outside sources, most of the times, include some extra words etc.
Anyway, it's not for duplicate detection.
OK, Needleman-Wunsch(NW) is a classic end-to-end ("global") aligner from the bioinformatics literature. It was long ago available as "align" and "align0" in the FASTA package. The difference was that the "0" version wasn't as biased about avoiding end-gapping, which often allowed favoring high-quality internal matches easier. Smith-Waterman, I suspect you're aware, is a local aligner and is the original basis of BLAST. FASTA had it's own local aligner as well that was slightly different. All of these are essentially heuristic methods for estimating Levenshtein distance relevant to a scoring metric for individual character pairs (in bioinformatics, often given by Dayhoff/"PAM", Henikoff&Henikoff, or other matrices and usually replaced with something simpler and more reasonably reflective of replacements in linguistic word morphology when applied to natural language).
Let's not be precious about labels: Levenshtein distance, as referenced in practice at least, is basically edit distance and you have to estimate it because it's not feasible to compute it generally, and it's expensive to compute exactly even in interesting special cases: the water gets deep quick there, and thus we have heuristic methods of long and good repute.
Now as to your own problem: several years ago, I had to check the accuracy of short DNA reads against reference sequence known to be correct and I came up with something I called "anchored alignments".
The idea is to take your reference string set and "digest" it by finding all locations where a given N-character substring occurs. Choose N so that the table you build is not too big but also so that substrings of length N are not too common. For small alphabets like DNA bases, it's possible to come up with a perfect hash on strings of N characters and make a table and chain the matches in a linked list from each bin. The list entries must identify the sequence and start position of the substring that maps to the bin in whose list they occur. These are "anchors" in the list of strings to be searched at which an NW alignment is likely to be useful.
When processing a query string, you take the N characters starting at some offset K in the query string, hash them, look up their bin, and if the list for that bin is nonempty then you go through all the list records and perform alignments between the query string and the search string referenced in the record. When doing these alignments, you line up the query string and the search string at the anchor and extract a substring of the search string that is the same length as the query string and which contains that anchor at the same offset, K.
If you choose a long enough anchor length N, and a reasonable set of values of offset K (they can be spread across the query string or be restricted to low offsets) you should get a subset of possible alignments and often will get clearer winners. Typically you will want to use the less end-biased align0-like NW aligner.
This method tries to boost NW a bit by restricting it's input and this has a performance gain because you do less alignments and they are more often between similar sequences. Another good thing to do with your NW aligner is to allow it to give up after some amount or length of gapping occurs to cut costs, especially if you know you're not going to see or be interested in middling-quality matches.
Finally, this method was used on a system with small alphabets, with K restricted to the first 100 or so positions in the query string and with search strings much larger than the queries (the DNA reads were around 1000 bases and the search strings were on the order of 10000, so I was looking for approximate substring matches justified by an estimate of edit distance specifically). Adapting this methodology to natural language will require some careful thought: you lose on alphabet size but you gain if your query strings and search strings are of similar length.
Either way, allowing more than one anchor from different ends of the query string to be used simultaneously might be helpful in further filtering data fed to NW. If you do this, be prepared to possibly send overlapping strings each containing one of the two anchors to the aligner and then reconcile the alignments... or possibly further modify NW to emphasize keeping your anchors mostly intact during an alignment using penalty modification during the algorithm's execution.
Hope this is helpful or at least interesting.
Related to the Levenstein distance: you might wish to normalize it by dividing the result with the length of the longer string, so that you always get a number between 0 and 1 and so that you can compare the distance of pair of strings in a meaningful way (the expression L(A, B) > L(A, C) - for example - is meaningless unless you normalize the distance).
We are using the Levenshtein distance method to check for duplicate customers in our database. It works quite well.
Alternative algorithms to look at are agrep (Wikipedia entry on agrep),
FASTA and BLAST biological sequence matching algorithms. These are special cases of approximate string matching, also in the Stony Brook algorithm repositry. If you can specify the ways the strings differ from each other, you could probably focus on a tailored algorithm. For example, aspell uses some variant of "soundslike" (soundex-metaphone) distance in combination with a "keyboard" distance to accomodate bad spellers and bad typers alike.
Use FM Index with Backtracking, similar to the one in Bowtie fuzzy aligner
In order to minimize mismatches due to slight variations or errors in spelling, I've used the Metaphone algorithm, then Levenshtein distance (scaled to 0-100 as a percentage match) on the Metaphone encodings for a measure of closeness. That seems to have worked fairly well.
To expand on Cd-MaN's answer, it sounds like you're facing a normalization problem. It isn't obvious how to handle scores between alignments with varying lengths.
Given what you are interested in, you may want to obtain p-values for your alignment. If you are using Needleman-Wunsch, you can obtain these p-values using Karlin-Altschul statistics http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html
BLAST will can local alignment and evaluate them using these statistics. If you are concerned about speed, this would be a good tool to use.
Another option is to use HMMER. HMMER uses Profile Hidden Markov Models to align sequences. Personally, I think this is a more powerful approach since it also provides positional information. http://hmmer.janelia.org/