What data structure/algorithm to use to compute similarity between input sequence and a database of stored sequences?

By this question, I mean if I have an input sequence abchytreq and a database / data structure containing jbohytbbq, I would compare the two elements pairwise to get a match of 5/9, or 55%, because of the pairs (b-b, hyt-hyt, q-q). Each sequence additionally needs to be linked to another object (but I don't think this will be hard to do). The sequence does not necessarily need to be a string.
The maximum number of elements in the sequence is about 100. This is easy to do when the database/data structure has only one or a few sequences to compare to, but I need to compare the input sequence to over 100000 (mostly) unique sequences, and then return a certain number of the most similar previously stored data matches. Additionally, each element of the sequence could have a different weighting. Back to the first example: if the first input element was weighted double, abchytreq would only be a 50% match to jbohytbbq.
I was thinking of using BLAST and creating a little hack as needed to account for any weighting, but I figured that might be a little bit overkill. What do you think?
One more thing. Like I said, comparison needs to be pairwise, e.g. abcdefg would be a zero percent match to bcdefgh.

A modified Edit Distance algorithm with weightings for character positions could help.
https://www.biostars.org/p/11863/
Multiply the resulting distance matrix with a matrix of weights for character positions.
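If insertions and deletions never need to be tolerated (the question's comparison is strictly position-by-position, so shifted strings score zero), the weighted score can be computed directly. A minimal Python sketch, with the weights list assumed to be supplied by the caller:

    def weighted_positional_similarity(a, b, weights=None):
        # a, b: sequences of equal length; weights: one weight per position
        if weights is None:
            weights = [1.0] * len(a)
        total = sum(weights)
        matched = sum(w for x, y, w in zip(a, b, weights) if x == y)
        return matched / total

    # weighted_positional_similarity("abchytreq", "jbohytbbq") -> 5/9, about 55%
    # with weights = [2, 1, 1, 1, 1, 1, 1, 1, 1] the same pair scores 5/10 = 50%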

I'm not entirely clear on the question; for instance, would you return all matches of 90% or better, regardless of how many or few there are, or would you return the best 10% of the input, even if some of them match only 50%? Here are a couple of suggestions:
First: Do you know the story of the wise bachelor? The foolish bachelor makes a list of requirements for his mate --- slender, not blonde (Mom was blonde, and he hates her), high IQ, rich, good cook, loves horses, etc --- then spends his life considering one mate after another, rejecting each for failing one of his requirements, and dies unfulfilled. The wise bachelor considers that he will meet 100 marriageable women in his life, examines the first sqrt(100) = 10 of them, then marries the next mate with a better score than the best of the first ten; she might not be perfect, but she's good enough. This is essentially the secretary problem from optimal stopping theory; the classical analysis uses a cutoff of about n/e rather than the square root, but the idea is the same: calibrate on a sample, then settle for "good enough".
Second: I suppose that you have a scoring function that tells you exactly which of two dictionary words is the better match to the target, but is expensive to compute. Perhaps you can find a partial scoring function that is easy to compute and would allow you to quickly scan the dictionary, discarding those inputs that are unlikely to be winners, and then apply your total scoring function only to that subset of the dictionary that passed the partial scoring function. You'll have to define the partial scoring function based on your needs. For instance, you might want to apply your total scoring function to only the first five characters of the target and the dictionary word; if that doesn't eliminate enough of the dictionary, increase to ten characters on each side.
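A sketch of that two-stage idea in Python, assuming you already have an expensive full_score(target, word) function; the prefix length of 5 and the number of survivors kept are illustrative placeholders:

    def best_matches(target, dictionary, full_score, top_n=10, prefix_len=5, keep=1000):
        # cheap pass: apply the scorer only to the first few characters of each word
        def prefix_score(word):
            return full_score(target[:prefix_len], word[:prefix_len])
        candidates = sorted(dictionary, key=prefix_score, reverse=True)[:keep]
        # expensive pass: full scoring only on the surviving candidates
        return sorted(candidates, key=lambda w: full_score(target, w), reverse=True)[:top_n]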

Related

Generate or find a shortest text given list of words

Let's say I have a list of 1000+ words and I would like to generate a text that includes these words from the list. I would like to use as few extra words outside of the list as possible. How would one tackle such a problem? Or alternatively, is there a way to efficiently search for a smaller portion of text containing these words the most, given some larger text (millions of words)? Basically, the resulting text from the search should be optimized to be shortest but to contain all the words from the list.
I am not sure how you'd like the text to be generated, so I'll attempt to answer the second question:
Is there a way to efficiently search for a smaller portion of text containing these words the most, given some larger text (millions of words)? Basically, the resulting text from the search should be optimized to be shortest but to contain all the words from the list.
This is obviously a computationally demanding endeavour so I'll assume you are alright with spending like a gig of RAM on this and some time (but maybe not too long). Since you are looking for the shortest continuous text which satisfies some condition, one can conclude the following:
If the text satisfies the condition, you want to shorten it.
If it doesn't, you want to make it longer so that hopefully it will start satisfying the condition.
Now, when it comes to the condition, it is whatever predicate says whether a continuous section of the text is "good enough" or not, based on some relatively simple statistics. For instance, the predicate could check whether some cumulative score, based on the ratio of words from your list that appear in the section and penalised by the number of words from outside the list, exceeds some threshold.
What my mind races to when I see something like this is the sliding window technique, described in this article. I don't know if it's a good article; I didn't read it closely, but from a quick scan it seems decent. The technique is also known as the caterpillar method, which is a particularly common name for it in Poland.
Basically, you have two pointers, a left one and a right one. When looking for the shortest continuous fragment of a larger text that satisfies a condition, and given that whenever the condition holds for a fragment it also holds for any larger fragment containing it, you advance the right pointer as long as the condition is unmet; once it is met, you advance the left pointer until the condition stops holding. This repeats until both pointers reach the end of the text.
This is a neat technique, which allows you to iterate over the whole text exactly once, linearly. It is clearly desirable in your case to have an algorithm linear with respect to the length of the text.
Now, we have to consider the statistics you will be collecting. You will probably want to know how many words from the list, and how many words from outside of the list are present in a continuous fragment. An extra condition for these statistics is that they will need to be relatively easily modifiable (preferably in constant time, but that will be hard to achieve) every time one of the pointers advances.
In order to keep track of the words, we will use a hashmap of ordered sets of indices. In Java these data structures are called HashMap and TreeSet, in C++ they're unordered_map and set. The keys of the hashmap will be strings representing words. The values will be sets of indices of where the words appear in the text. Note that lookup in a hashmap is linear in the length of the key, which we can treat as constant since most words are fewer than about 10 characters long, and checking how many values in a set fall between two given values is logarithmic in the size of the set (strictly, counting within a range needs an order-statistic tree; the stock TreeSet/std::set only give logarithmic membership lookups). So getting the number of times a word appears in a fragment of the text is easy and fast. Keeping track of whether a word exists in the given list or not can also be achieved with a hashmap (or a hashset).
So let's get back to the statistics. Say you want to keep track of the number of words from inside and from outside your list in a given fragment. This can be achieved very simply:
Every time you add a word to the fragment by advancing its right end, you check if it appears in the list in constant time and if so, you add one to the "good words" number, and otherwise, you add one to the "bad words" number.
Every time you remove a word from the fragment by advancing the left end, you do the same but you decrement the counters instead.
Now if you want to track how many unique words from inside and from outside the list there are in the fragment, every time you will need to check the number of times a given word exists in the fragment. We established earlier that this can be done logarithmically relative to the length of the fragment, so now the trick is simple. You only modify the counters if the number of appearances of a word in the fragment either
rose from 0 to 1 when advancing the right pointer, or
fell from 1 to 0 when advancing the left pointer.
Otherwise, you ignore the word, not changing the counters.
Additional memory optimisations include removing indices from the sets of indices when they are out of scope of the fragment and removing hashmap entries from the hashmap if a set of indices becomes empty.
It is now up to you to perhaps find a better heuristic: other statistical values that you can track easily and that support whatever it is you intend to check in your predicate. It is important, though, that whenever a fragment meets your condition, any bigger fragment containing it must meet it too.
In the case described above you could keep track of all the fragments which had at least... I don't know... 90% of the words from your list and from those choose the shortest one or the one with the fewest foreign words.
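A minimal Python sketch of the caterpillar method for the simplest version of the predicate, namely that the window must contain every word from the list at least once; the ratio-based predicates described above slot into the same loop structure. It assumes the text has already been tokenized into a list of words:

    from collections import Counter

    def shortest_window(words, wanted):
        # words: the large text, tokenized into a list of words
        # wanted: the set of words that must all appear in the window
        need = set(wanted)
        have = Counter()     # occurrence counts of wanted words inside the window
        covered = 0          # how many distinct wanted words are currently covered
        best = None
        left = 0
        for right, w in enumerate(words):
            if w in need:
                have[w] += 1
                if have[w] == 1:
                    covered += 1
            # condition met: shrink from the left while it still holds
            while covered == len(need):
                if best is None or right - left < best[1] - best[0]:
                    best = (left, right)
                lw = words[left]
                if lw in need:
                    have[lw] -= 1
                    if have[lw] == 0:
                        covered -= 1
                left += 1
        return best          # (start, end) token indices of the shortest window, or None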

How to find the closest pairs (Hamming Distance) of a string of binary bins in Ruby without O^2 issues?

I've got a MongoDB with about 1 million documents in it. These documents all have a string that represents a 256 bit bin of 1s and 0s, like:
0110101010101010110101010101
Ideally, I'd like to query for near binary matches, i.e. documents whose bit strings differ from a given string in only a few positions. Yes, this is Hamming distance.
This is NOT currently supported in Mongo. So, I'm forced to do it in the application layer.
So, given this, I am trying to find a way to avoid having to do individual Hamming distance comparisons between every pair of documents; that would make the run time basically impossible.
I have a LOT of RAM. And, in Ruby, there seems to be a great gem (algorithms) that can create a number of trees that would reduce the number of queries I'd need to make, though I can't seem to make any of them work (yet).
Ideally, I'd like to make 1 million queries, find the near duplicate strings, and be able to update them to reflect that.
Anyone's thoughts would be appreciated.
I ended up retrieving all the documents into memory (a subset with just the id and the string).
Then, I used a BK Tree to compare the strings.
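The question is about Ruby, but the BK-tree idea is small enough to sketch; here is a hedged Python version for bit strings, just to show the structure. The triangle-inequality pruning is what saves you from comparing against every stored string:

    def hamming(a, b):
        # a, b: equal-length strings of '0'/'1'
        return sum(x != y for x, y in zip(a, b))

    class BKTree:
        def __init__(self, distance=hamming):
            self.distance = distance
            self.root = None                 # node = (value, {distance: child_node})

        def add(self, value):
            if self.root is None:
                self.root = (value, {})
                return
            node = self.root
            while True:
                d = self.distance(value, node[0])
                child = node[1].get(d)
                if child is None:
                    node[1][d] = (value, {})
                    return
                node = child

        def query(self, value, radius):
            # return all stored values within `radius` of `value`
            results = []
            stack = [] if self.root is None else [self.root]
            while stack:
                node = stack.pop()
                d = self.distance(value, node[0])
                if d <= radius:
                    results.append((d, node[0]))
                for dist, child in node[1].items():
                    if d - radius <= dist <= d + radius:   # triangle-inequality pruning
                        stack.append(child)
            return results

Usage: add every stored string once, then tree.query(x, 10) returns everything within 10 bit flips of x.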
The Hamming distance defines a metric space, so you could use the O(n log n) algorithm to find the closest pair of points, which is of the typical divide-and-conquer nature.
You can then apply this repeatedly until you have "enough" pairs.
Edit: I see now that Wikipedia doesn't actually give the algorithm, so here is one description.
Edit 2: The algorithm can be modified to give up if there are no pairs at distance less than n. For the case of the Hamming distance: simply count the level of recursion you are in. If you haven't found something at level n in any branch, then give up (in other words, never enter n + 1). If you are using a metric where splitting on one dimension doesn't always yield a distance of 1, you need to adjust the level of recursion where you give up.
As far as I could understand, you have an input string X and you want to query the database for a document containing string field b such that Hamming distance between X and document.b is less than some small number d.
You can do this in linear time, just by scanning all of your N=1M documents and calculating the distance (which takes small fixed time per document). Since you only want documents with distance smaller than d, you can give up comparison after d unmatched characters; you only need to compare all 256 characters if most of them match.
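A sketch of that early-exit comparison in Python (the question is about Ruby, but the idea carries over directly):

    def within_distance(a, b, d):
        # compare two equal-length bit strings, giving up after d mismatches
        mismatches = 0
        for x, y in zip(a, b):
            if x != y:
                mismatches += 1
                if mismatches > d:
                    return False
        return True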
You can try to scan fewer than N documents, that is, to get better than linear time.
Let ones(s) be the number of 1s in string s. For each document, store ones(document.b) as a new indexed field ones_count. Then you need only query documents whose number of ones is close enough to ones(X): since each mismatched position changes the count of ones by at most one, ones(X) - d <= document.ones_count <= ones(X) + d is a necessary condition. The Mongo index should kick in here.
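Assuming a pymongo collection whose documents already carry the precomputed ones_count field (the field name is illustrative), the prefilter query could look roughly like this:

    def candidate_documents(collection, x, d):
        # x: the query bit string, d: the maximum Hamming distance of interest
        ones_x = x.count("1")
        # the necessary condition derived above: |ones(X) - ones_count| <= d
        return collection.find({"ones_count": {"$gte": ones_x - d,
                                                "$lte": ones_x + d}})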
If you want to find all close-enough pairs in the set, see @Philippe's answer.
This sounds like an algorithmic problem of some sort. You could try comparing those with a similar number of 1 or 0 bits first, then work down through the list from there. Those that are identical will, of course, come out on top. I don't think having tons of RAM will help here.
You could also try and work with smaller chunks. Instead of dealing with 256 bit sequences, could you treat that as 32 8-bit sequences? 16 16-bit sequences? At that point you can compute differences in a lookup table and use that as a sort of index.
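A small Python sketch of that chunked idea, treating a 256-bit value as 32 bytes and looking each XORed byte up in a precomputed table:

    # number of set bits for every possible byte value, computed once
    POPCOUNT = [bin(i).count("1") for i in range(256)]

    def hamming_bytes(a, b):
        # a, b: equal-length bytes objects, e.g. 32 bytes for a 256-bit value
        return sum(POPCOUNT[x ^ y] for x, y in zip(a, b))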
Depending on how "different" you care to match on, you could just permute changes on the source binary value and do a keyed search to find the others that match.

Fuzzy matching deduplication in less than exponential time?

I have a large database (potentially in the millions of records) with relatively short strings of text (on the order of street address, names, etc).
I am looking for a strategy to remove inexact duplicates, and fuzzy matching seems to be the method of choice. My issue: many articles and SO questions deal with matching a single string against all records in a database. I am looking to deduplicate the entire database at once.
The former would be a linear time problem (comparing a value against a million other values, calculating some similarity measure each time). The latter is an exponential time problem (compare every record's values against every other record's value; for a million records, that's approx 5 x 10^11 calculations vs the 1,000,000 calculations for the former option).
I'm wondering if there is another approach than the "brute-force" method I mentioned. I was thinking of possibly generating a string to compare each record's value against, and then group strings that had roughly equal similarity measures, and then run the brute-force method through these groups. I wouldn't achieve linear time, but it might help. Also, if I'm thinking through this properly, this could miss a potential fuzzy match between strings A and B because their similarity to string C (the generated check-string) is very different even though A and B are very similar to each other.
Any ideas?
P.S. I realize I may have used the wrong terms for time complexity - it is a concept that I have a basic grasp of, but not well enough so I could drop an algorithm into the proper category on the spot. If I used the terms wrong, I welcome corrections, but hopefully I got my point across at least.
Edit
Some commenters have asked, given fuzzy matches between records, what my strategy was to choose which ones to delete (i.e. given "foo", "boo", and "coo", which would be marked the duplicate and deleted). I should note that I am not looking for an automatic delete here. The idea is to flag potential duplicates in a 60+ million record database for human review and assessment purposes. It is okay if there are some false positives, as long as it is a roughly predictable / consistent amount. I just need to get a handle on how pervasive the duplicates are. But if the fuzzy matching pass-through takes a month to run, this isn't even an option in the first place.
Have a look at http://en.wikipedia.org/wiki/Locality-sensitive_hashing. One very simple approach would be to divide up each address (or whatever) into a set of overlapping n-grams. Thus STACKOVERFLOW becomes the set {STACKO, TACKO, ACKOV, CKOVE..., RFLOW}. Then use a large hash-table or sort-merge to find colliding n-grams and check collisions with a fuzzy matcher. Thus STACKOVERFLOW and SXACKOVRVLOX will collide because both are associated with the colliding n-gram ACKOV.
A next level up in sophistication is to pick a random hash function - e.g. HMAC with an arbitrary key, and of the n-grams you find, keep only the one with the smallest hashed value. Then you have to keep track of fewer n-grams, but will only see a match if the smallest hashed value in both cases is ACKOV. There is obviously a trade-off here between the length of the n-gram and the probability of false hits. In fact, what people seem to do is to make n quite small and get higher precision by concatenating the results from more than one hash function in the same record, so you need to get a match in multiple different hash functions at the same time - I presume the probabilities work out better this way. Try googling for "duplicate detection minhash"
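A rough Python sketch of the n-gram and smallest-hash ideas above; sha1 with a per-function key prefix stands in for "HMAC with an arbitrary key", and the parameters are only illustrative:

    import hashlib

    def ngrams(s, n=5):
        s = s.upper()
        return {s[i:i + n] for i in range(len(s) - n + 1)} or {s}

    def min_hash_signature(s, n=5, num_hashes=3):
        # per keyed hash function, keep the n-gram whose hash is smallest; records
        # whose signatures agree (on one or on all positions, depending on how
        # strict you want to be) become candidate pairs for the fuzzy matcher
        sig = []
        for k in range(num_hashes):
            key = str(k).encode()
            smallest = min(ngrams(s, n),
                           key=lambda g: hashlib.sha1(key + g.encode()).hexdigest())
            sig.append(smallest)
        return tuple(sig)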
I think you may have mis-calculated the complexity for all the combinations. If comparing one string with all other strings is linear, this means due to the small lengths, each comparison is O(1). The process of comparing each string with every other string is not exponential but quadratic, which is not all bad. In simpler terms, you are comparing nC2 or n(n-1)/2 pairs of strings, so it's just O(n^2).
I couldn't think of a way you can sort them in order, as you can't write an objective comparator, but even if you do so, sorting would take O(n log n) for merge sort, and since you have so many records and probably would prefer using no extra memory, you would use quicksort, which takes O(n^2) in the worst case, so there is no improvement over the worst-case time of brute force.
You could use a Levenshtein transducer, which "accept[s] a query term and return[s] all terms in a dictionary that are within n spelling errors away from it". Here's a demo.
Pairwise comparison of all the records is O(N^2), not exponential. There are basically two ways to cut down on that complexity.
The first is blocking, where you only compare records that already have something in common that's easy to compute, like the first three letters or a common n-gram. This is basically the same idea as Locality-Sensitive Hashing. The dedupe Python library implements a number of blocking techniques and the documentation gives a good overview of the general approach.
In the worst case, pairwise comparison with blocking is still O(N^2). In the best case it is O(N). Neither the best nor the worst case is really met in practice. Typically, blocking reduces the number of pairs to compare by over 99.9%.
There are some interesting alternative paradigms for record linkage that are not based on pairwise comparisons. These have better worst-case complexity guarantees. See the work of Beka Steorts and Michael Wick.
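A bare-bones illustration of blocking in Python, using the first three letters as a (hypothetical) blocking key; real blocking keys would be chosen per field, as the dedupe documentation describes:

    from collections import defaultdict

    def block(records, key=lambda r: r[:3].lower()):
        # group records by a cheap key; only records sharing a block get compared
        blocks = defaultdict(list)
        for r in records:
            blocks[key(r)].append(r)
        return blocks

    def candidate_pairs(blocks):
        for group in blocks.values():
            for i, a in enumerate(group):
                for b in group[i + 1:]:
                    yield a, b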
I assume this is a one-time cleanup. I think the problem won't be having to do so many comparisons, it'll be having to decide what comparisons are worth making. You mention names and addresses, so see this link for some of the comparison problems you'll have.
It's true you have to do almost 500 billion brute-force compares for comparing a million records against themselves, but that's assuming you never skip any records previously declared a match (ie, never doing the "break" out of the j-loop in the pseudo-code below).
My pokey E-machines T6532 2.2gHz manages to do 1.4m seeks and reads per second of 100-byte text file records, so 500 billion compares would take about 4 days. Instead of spending 4 days researching and coding up some fancy solution (only to find I still need another x days to actually do the run), and assuming my comparison routine can't compute and save the keys I'd be comparing, I'd just let it brute-force all those compares while I find something else to do:
for i = 1 to LASTREC-1
    seektorec(i)
    getrec(i) into a
    for j = i+1 to LASTREC
        getrec(j) into b
        if similarrecs(a, b) then [gotahit(); break]
Even if a given run only locates easy-to-define matches, hopefully it reduces the remaining unmatched records to a more reasonable smaller set for which further brute-force runs aren't so time-consuming.
But it seems unlikely similarrecs() can't independently compute and save the portions of a + b being compared, in which case the much more efficient approach is:
for i = 1 to LASTREC
    getrec(i) into a
    write fuzzykey(a) into scratchfile
sort scratchfile
for i = 1 to LASTREC-1
    if scratchfile(i) = scratchfile(i+1) then gotahit()
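The second pseudocode fragment translates almost directly into Python; fuzzykey() is whatever normalisation you decide makes two records "the same" (the hard part, as noted below):

    def flag_duplicates(records, fuzzykey):
        # compute each record's key once, sort by key, then compare neighbours
        keyed = sorted((fuzzykey(r), i) for i, r in enumerate(records))
        hits = []
        for (k1, i1), (k2, i2) in zip(keyed, keyed[1:]):
            if k1 == k2:
                hits.append((records[i1], records[i2]))
        return hits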
Most databases can do the above in one command line, if you're allowed to invoke your own custom code for computing each record's fuzzykey().
In any case, the hard part is going to be figuring out what makes two records a duplicate, per the link above.
Equivalence relations are particularly nice kinds of matching; they satisfy three properties:
reflexivity: for any value A, A ~ A
symmetry: if A ~ B, then necessarily B ~ A
transitivity: if A ~ B and B ~ C, then necessarily A ~ C
What makes these nice is that they allow you to partition your data into disjoint sets such that each pair of elements in any given set are related by ~. So, what you can do is apply the union-find algorithm to first partition all your data, then pick out a single representative element from each set in the partition; this completely de-duplicates the data (where "duplicate" means "related by ~"). Moreover, this solution is canonical in the sense that no matter which representatives you happen to pick from each partition, you get the same number of final values, and each of the final values are pairwise non-duplicate.
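A compact union-find sketch in Python, illustrating the partition-then-pick-representatives step described above. It assumes the values are hashable and that `related` really is an equivalence relation; the pairwise loop here is the naive quadratic version, so in practice you would feed it only candidate pairs produced by blocking or LSH:

    class UnionFind:
        def __init__(self):
            self.parent = {}

        def find(self, x):
            self.parent.setdefault(x, x)
            while self.parent[x] != x:
                self.parent[x] = self.parent[self.parent[x]]   # path halving
                x = self.parent[x]
            return x

        def union(self, a, b):
            ra, rb = self.find(a), self.find(b)
            if ra != rb:
                self.parent[ra] = rb

    def deduplicate(values, related):
        uf = UnionFind()
        vals = list(values)
        for i, a in enumerate(vals):
            for b in vals[i + 1:]:
                if related(a, b):
                    uf.union(a, b)
        seen, representatives = set(), []
        for v in vals:                       # keep one representative per partition
            root = uf.find(v)
            if root not in seen:
                seen.add(root)
                representatives.append(v)
        return representatives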
Unfortunately, fuzzy matching is not an equivalence relation, since it is presumably not transitive (though it's probably reflexive and symmetric). The result of this is that there isn't a canonical way to partition the data; you might find that any way you try to partition the data, some values in one set are equivalent to values from another set, or that some values from within a single set are not equivalent.
So, what behavior do you want, exactly, in these situations?

Is there an edit distance algorithm that takes "chunk transposition" into account?

I put "chunk transposition" in quotes because I don't know whether or what the technical term should be. Just knowing if there is a technical term for the process would be very helpful.
The Wikipedia article on edit distance gives some good background on the concept.
By taking "chunk transposition" into account, I mean that
Turing, Alan.
should match
Alan Turing
more closely than it matches
Turing Machine
I.e. the distance calculation should detect when substrings of the text have simply been moved within the text. This is not the case with the common Levenshtein distance formula.
The strings will be a few hundred characters long at most -- they are author names or lists of author names which could be in a variety of formats. I'm not doing DNA sequencing (though I suspect people that do will know a bit about this subject).
In the case of your application you should probably think about adapting some algorithms from bioinformatics.
For example, you could first unify your strings by making sure that all separators are spaces (or anything else you like), so that you would compare "Alan Turing" with "Turing Alan". Then split one of the strings and run an exact string matching algorithm (like the Horspool algorithm) with the pieces against the other string, counting the number of matching substrings.
If you would like to find matches that are merely similar but not equal, something along the lines of a local alignment might be more suitable, since it provides a score that describes the similarity, but the referenced Smith-Waterman algorithm is probably a bit overkill for your application and not even the best local alignment algorithm available.
Depending on your programming environment there is a possibility that an implementation is already available. I personally have worked with SeqAn lately, which is a bioinformatics library for C++ and definitely provides the desired functionality.
Well, that was a rather abstract answer, but I hope it points you in the right direction; sadly it doesn't provide you with a simple formula to solve your problem.
Have a look at the Jaccard distance metric (JDM). It's an oldie-but-goodie that's pretty adept at token-level discrepancies such as last name first, first name last. For two string comparands, the JDM calculation is simply the number of unique characters the two strings have in common divided by the total number of unique characters between them (in other words the intersection over the union). For example, given the two arguments "JEFFKTYZZER" and "TYZZERJEFF," the numerator is 7 and the denominator is 8, yielding a value of 0.875. My choice of characters as tokens is not the only one available, BTW--n-grams are often used as well.
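In Python the character-level calculation is a one-liner; note that, as described, this is really the Jaccard similarity (intersection over union), with the distance being one minus it:

    def jaccard_similarity(a, b):
        # unique characters in common divided by total unique characters
        sa, sb = set(a), set(b)
        return len(sa & sb) / len(sa | sb)

    # jaccard_similarity("JEFFKTYZZER", "TYZZERJEFF") == 7 / 8 == 0.875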
One of the easiest and most effective modern alternatives to edit distance is called the Normalized Compression Distance, or NCD. The basic idea is easy to explain. Choose a popular compressor that is implemented in your language, such as zlib. Then, given string A and string B, let C(A) be the compressed size of A and C(B) be the compressed size of B. Let AB mean "A concatenated with B", so that C(AB) means "the compressed size of A concatenated with B". Next, compute the fraction

NCD(A,B) = (C(AB) - min(C(A), C(B))) / max(C(A), C(B))

This value measures similarity much as edit distance does, but supports more forms of similarity depending on which data compressor you choose. Certainly, zlib supports the "chunk" style similarity that you are describing. If two strings are similar, the compressed size of the concatenation will be near the size of each alone, so the numerator will be near 0 and the result will be near 0. If two strings are very dissimilar, the compressed size of the concatenation will be roughly the sum of the compressed sizes, so the result will be near 1.
This formula is much easier to implement than edit distance or almost any other explicit string similarity measure if you already have access to a data compression program like zlib. That is because most of the "hard" work, such as heuristics and optimization, has already been done in the data compression part, and this formula simply extracts the amount of similar patterns it found using generic information theory that is agnostic to language. Moreover, this technique will be much faster than most explicit similarity measures (such as edit distance) for the few-hundred-byte size range you describe. For more information on this and a sample implementation, just search for Normalized Compression Distance (NCD) or have a look at the following paper and GitHub project:
http://arxiv.org/abs/cs/0312044 "Clustering by Compression"
https://github.com/rudi-cilibrasi/libcomplearn C language implementation
There are many other implementations and papers on this subject in the last decade that you may use as well in other languages and with modifications.
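A minimal NCD sketch using Python's zlib, following the formula above; keep in mind that general-purpose compressors are not very sensitive at the very short lengths typical of author names, so results on tiny inputs should be treated with caution:

    import zlib

    def ncd(a: bytes, b: bytes) -> float:
        # Normalized Compression Distance with zlib as the compressor C
        c_a = len(zlib.compress(a))
        c_b = len(zlib.compress(b))
        c_ab = len(zlib.compress(a + b))
        return (c_ab - min(c_a, c_b)) / max(c_a, c_b)

    # compare e.g. ncd(b"Turing, Alan.", b"Alan Turing") with ncd(b"Turing, Alan.", b"Turing Machine")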
I think you're looking for Jaro-Winkler distance which is precisely for name matching.
You might find compression distance useful for this. See an answer I gave for a very similar question.
Or you could use a k-tuple based counting system:
Choose a small value of k, e.g. k=4.
Extract all length-k substrings of your string into a list.
Sort the list. (O(kn log n) time.)
Do the same for the other string you're comparing to. You now have two sorted lists.
Count the number of k-tuples shared by the two strings. If the strings are of length n and m, this can be done in O(n+m) time using a list merge, since the lists are in sorted order.
The number of k-tuples in common is your similarity score.
With small alphabets (e.g. DNA) you would usually maintain a vector storing the count for every possible k-tuple instead of a sorted list, although that's not practical when the alphabet is any character at all -- for k=4, you'd need a 256^4 array.
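A short Python sketch of the k-tuple counting scheme for general alphabets (sorted lists plus a merge), as laid out in the steps above:

    def ktuple_similarity(a, b, k=4):
        # shared k-tuples counted with multiplicity via a merge of two sorted lists
        ta = sorted(a[i:i + k] for i in range(len(a) - k + 1))
        tb = sorted(b[i:i + k] for i in range(len(b) - k + 1))
        i = j = shared = 0
        while i < len(ta) and j < len(tb):
            if ta[i] == tb[j]:
                shared += 1
                i += 1
                j += 1
            elif ta[i] < tb[j]:
                i += 1
            else:
                j += 1
        return shared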
I'm not sure that what you really want is edit distance -- which works simply on strings of characters -- or semantic distance -- choosing the most appropriate or similar meaning. You might want to look at topics in information retrieval for ideas on how to distinguish which is the most appropriate matching term/phrase given a specific term or phrase. In a sense what you're doing is comparing very short documents rather than strings of characters.

Efficient word scramble algorithm

I'm looking for an efficient algorithm for scrambling a set of letters into a permutation containing the maximum number of words.
For example, say I am given the list of letters: {e, e, h, r, s, t}. I need to order them in such a way as to contain the maximum number of words. If I order those letters into "theres", it contains the words "the", "there", "her", "here", and "ere". So that example could have a score of 5, since it contains 5 words. I want to order the letters in such a way as to have the highest score (contain the most words).
A naive algorithm would be to try and score every permutation. I believe this is O(n!), so 720 different permutations would be tried for just the 6 letters above (including some duplicates, since the example has e twice). For more letters, the naive solution quickly becomes impossible, of course.
The algorithm doesn't have to actually produce the very best solution, but it should find a good solution in a reasonable amount of time. For my application, simply guessing (Monte Carlo) at a few million permutations works quite poorly, so that's currently the mark to beat.
I am currently using the Aho-Corasick algorithm to score permutations. It searches for each word in the dictionary in just one pass through the text, so I believe it's quite efficient. This also means I have all the words stored in a trie, but if another algorithm requires different storage that's fine too. I am not worried about setting up the dictionary, just the run time of the actual ordering and searching. Even a fuzzy dictionary could be used if needed, like a Bloom Filter.
For my application, the list of letters given is about 100, and the dictionary contains over 100,000 entries. The dictionary never changes, but several different lists of letters need to be ordered.
I am considering trying a path finding algorithm. I believe I could start with a random letter from the list as a starting point. Then each remaining letter would be used to create a "path." I think this would work well with the Aho-Corasick scoring algorithm, since scores could be built up one letter at a time. I haven't tried path finding yet though; maybe it's not even a good idea? I don't know which path finding algorithm might be best.
Another algorithm I thought of also starts with a random letter. Then the dictionary trie would be searched for "rich" branches containing the remaining letters. Dictionary branches containing unavailable letters would be pruned. I'm a bit foggy on the details of how this would work exactly, but it could completely eliminate scoring permutations.
Here's an idea, inspired by Markov Chains:
Precompute the letter transition probabilities in your dictionary. Create a table with the probability that some letter X is followed by another letter Y, for all letter pairs, based on the words in the dictionary.
Generate permutations by randomly choosing each next letter from the remaining pool of letters, based on the previous letter and the probability table, until all letters are used up. Run this many times.
You can experiment by increasing the "memory" of your transition table - don't look only one letter back, but say 2 or 3. This increases the size of the probability table but gives you a better chance of creating valid words.
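A rough Python sketch of the one-letter-of-memory version; the +1 smoothing is an assumption added here so that letters never seen after the previous one in the dictionary remain selectable:

    import random
    from collections import defaultdict

    def transition_table(dictionary_words):
        # table[x][y] = how often letter y follows letter x in the dictionary
        table = defaultdict(lambda: defaultdict(int))
        for word in dictionary_words:
            for x, y in zip(word, word[1:]):
                table[x][y] += 1
        return table

    def weighted_permutation(letters, table):
        pool = list(letters)
        random.shuffle(pool)
        out = [pool.pop()]                   # start from a random letter
        while pool:
            prev = out[-1]
            weights = [table[prev][c] + 1 for c in pool]
            idx = random.choices(range(len(pool)), weights=weights)[0]
            out.append(pool.pop(idx))
        return "".join(out)

Generate many permutations this way and keep whichever score best under the existing Aho-Corasick scorer.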
You might try simulated annealing, which has been used successfully for complex optimization problems in a number of domains. Basically you do randomized hill-climbing while gradually reducing the randomness. Since you already have the Aho-Corasick scoring you've done most of the work already. All you need is a way to generate neighbor permutations; for that something simple like swapping a pair of letters should work fine.
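A hedged sketch of that annealing loop in Python; `score` stands for the existing Aho-Corasick scorer, and the step count and cooling schedule are placeholders to tune:

    import math
    import random

    def anneal(letters, score, steps=200000, t_start=2.0, t_end=0.01):
        current = list(letters)
        random.shuffle(current)
        current_score = score("".join(current))
        best, best_score = current[:], current_score
        for step in range(steps):
            t = t_start * (t_end / t_start) ** (step / steps)    # geometric cooling
            i, j = random.sample(range(len(current)), 2)
            current[i], current[j] = current[j], current[i]      # neighbour: swap two letters
            new_score = score("".join(current))
            # accept improvements always; accept worse moves with a temperature-dependent probability
            if new_score >= current_score or random.random() < math.exp((new_score - current_score) / t):
                current_score = new_score
                if new_score > best_score:
                    best, best_score = current[:], new_score
            else:
                current[i], current[j] = current[j], current[i]  # undo the swap
        return "".join(best), best_score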
Have you thought about using a genetic algorithm? You have the beginnings of your fitness function already. You could experiment with the mutation and crossover (thanks Nathan) algorithms to see which do the best job.
Another option would be for your algorithm to build the smallest possible word from the input set, and then add one letter at a time so that the new string is, or contains, a new word. Start with a few different starting words for each input set and see where it leads.
Just a few idle thoughts.
It might be useful to check how others solved this:
http://sourceforge.net/search/?type_of_search=soft&words=anagram
On this page you can generate anagrams online. I've played around with it for a while and it's great fun. It doesn't explain in detail how it does its job, but the parameters give some insight.
http://wordsmith.org/anagram/advanced.html
With JavaScript and Node.js I implemented a jumble solver that uses a dictionary to build a tree; traversing the tree then gives you all the possible words. I explained the algorithm in detail in this article and put the source code on GitHub:
Scramble or Jumble Word Solver with Express and Node.js
