OCR'ed and real string similarity - algorithm

The problem:
There is a set of word S = {W1,W2.. Wn} where n < 10. This set just exists, we do not know its content.
These words are drawn on some image and then recognized. The OCR algorytm is poor as well as dpi and as a result there are mistakes. So we have a second set of errorneous words S' = {W1',W2'..Wn'}
Now we have a word W that is a member of original set S. And now I need and algorythm which, given W and S', return index of the word in S'. most similar to W.
Example. S is {"alpha", "bravo", "charlie"}, S' is for example {"alPha","hravc","onarlio"} (these are real possible ocr erros).
So the target function should return F("alpha") => 0, F("bravo") => 1, F("charlie") => 2
I tried Levenshtein distance, but it does not work well, because it returns small numbers on small strings and OCRed string can be longer than original.
Example if W' is {'hornist','cornrnunist'} and the given word is 'communist' the Levenshtein distance is 4 for the both words, but the right one is second.
Any suggestions?

As a zero approach, I'd suggest you to use the modification of Levenshtein distance algorithm with conditional cost of replacing/deleting/adding characters:
Distance(i, j) = min(Distance(i-1, j-1) + replace_cost(a.charAt(i), b.charAt(j)),
Distance(i-1, j ) + insert_cost(b.charAt(j)),
Distance(i , j-1) + delete_cost(a.charAt(i)))
You can implement function replace_cost in such way, that it will returns small values for visually similar characters (and high values for visually different characters), e.g.:
// visually similar characters
replace_cost('o', '0') = 0.1
replace_cost('o', 'O') = 0.1
replace_cost('O', '0') = 0.1
...
// visually different characters
replace_cost('O', 'K') = 0.9
...
And the similar approach can be used for insert_cost and delete_cost (e.g. you may notice, that during the OCR - some characters are more likely to disappear than others).
Also, in case when approach from above is not enough for you, I'd suggest you to look at Noisy channel model - which is widely used for spelling correction (this subject described very well in Natural Language Processing course by Dan Jurafsky, Christopher Manning - "Week 2 - Spelling Correction").

This appears to be quite difficult to do because the misread strings are not necessarily textually similar to the input, which is why Levinshtein distance won't work for you. The words are visually corrupted, not simply mistyped. You could try creating a dataset of common errors (o => 0, l -> 1, e => o) and then do some sort of comparison based on that.
If you have access to the OCR algorithm, you could run that algorithm again on a much broader set of inputs (with known outputs) and train a neural network to recognize common errors. Then you could use that model to predict mistakes in your original dataset (maybe overkill for an array of only ten items).

Related

Splitting a sentence to minimize sentence lengths

I have come across the following problem statement:
You have a sentence written entirely in a single row. You would like to split it into several rows by replacing some of the spaces
with "new row" indicators. Your goal is to minimize the width of the
longest row in the resulting text ("new row" indicators do not count
towards the width of a row). You may replace at most K spaces.
You will be given a sentence and a K. Split the sentence using the
procedure described above and return the width of the longest row.
I am a little lost with where to start. To me, it seems I need to try to figure out every possible sentence length that satisfies the criteria of splitting the single sentence up into K lines.
I can see a couple of edge cases:
There are <= K words in the sentence, therefore return the longest word.
The sentence length is 0, return 0
If neither of those criteria are true, then we have to determine all possible combinations of splitting the sentence and the return the minimum of all those options. This is the part I don't know how to do (and is obviously the heart of the problem).
You can solve it by inverting the problem. Let's say I fix the length of the longest split to L. Can you compute the minimum number of breaks you need to satisfy it?
Yes, you just break before the first word that would go over L and count them up (O(N)).
So now that we have that we just have to find a minimum L that would require less or equal K breaks. You can do a binary search in the length of the input. Final complexity O(NlogN).
First Answer
What you want to achieve is Minimum Raggedness. If you just want the algorithm, it is here as a PDF. If the research paper's link goes bad, please search for the famous paper named Breaking Paragraphs into Lines by Knuth.
However if you want to get your hands over some implementations of the same, in the question Balanced word wrap (Minimum raggedness) in PHP on SO, people have actually given implementation not only in PHP but in C, C++ and bash as well.
Second Answer
Though this is not exactly a correct approach, it is quick and dirty if you are looking for something like that. This method will not return correct answer for every case. It is for those people for whom time to ship their product is more important.
Idea
You already know the length of your input string. Let's call it L;
When putting in K breaks, the best scenario would be to be able to break the string to parts of exactly L / (K + 1) size;
So break your string at that word which makes the resulting sentence part's length least far from L / (K + 1);
My recursive solution, which can be improved through memoization or dynamic programming.
def split(self,sentence, K):
if not sentence: return 0
if ' ' not in sentence or K == 0: return len(sentence)
spaces = [i for i, s in enumerate(sentence) if s == ' ']
res = 100000
for space in spaces:
res = min(res, max(space, self.split(sentence[space+1:], K-1)))
return res

Reducing the time complexity of this algorithm

I'm playing a game that has a weapon-forging component, where you combine two weapons to get a new one. The sheer number of weapon combinations (see "6.1. Blade Combination Tables" at http://www.gamefaqs.com/ps/914326-vagrant-story/faqs/8485) makes it difficult to figure out what you can ultimately create out of your current weapons through repeated forging, so I tried writing a program that would do this for me. I give it a list of weapons that I currently have, such as:
francisca
tabarzin
kris
and it gives me the list of all weapons that I can forge:
ball mace
chamkaq
dirk
francisca
large crescent
throwing knife
The problem is that I'm using a brute-force algorithm that scales extremely poorly; it takes about 15 seconds to calculate all possible weapons for seven starting weapons, and a few minutes to calculate for eight starting weapons. I'd like it to be able to calculate up to 64 weapons (the maximum that you can hold at once), but I don't think I'd live long enough to see the results.
function find_possible_weapons(source_weapons)
{
for (i in source_weapons)
{
for (j in source_weapons)
{
if (i != j)
{
result_weapon = combine_weapons(source_weapons[i], source_weapons[j]);
new_weapons = array();
new_weapons.add(result_weapon);
for (k in source_weapons)
{
if (k != i && k != j)
new_weapons.add(source_weapons[k]);
}
find_possible_weapons(new_weapons);
}
}
}
}
In English: I attempt every combination of two weapons from my list of source weapons. For each of those combinations, I create a new list of all weapons that I'd have following that combination (that is, the newly-combined weapon plus all of the source weapons except the two that I combined), and then I repeat these steps for the new list.
Is there a better way to do this?
Note that combining weapons in the reverse order can change the result (Rapier + Firangi = Short Sword, but Firangi + Rapier = Spatha), so I can't skip those reversals in the j loop.
Edit: Here's a breakdown of the test example that I gave above, to show what the algorithm is doing. A line in brackets shows the result of a combination, and the following line is the new list of weapons that's created as a result:
francisca,tabarzin,kris
[francisca + tabarzin = chamkaq]
chamkaq,kris
[chamkaq + kris = large crescent]
large crescent
[kris + chamkaq = large crescent]
large crescent
[francisca + kris = dirk]
dirk,tabarzin
[dirk + tabarzin = francisca]
francisca
[tabarzin + dirk = francisca]
francisca
[tabarzin + francisca = chamkaq]
chamkaq,kris
[chamkaq + kris = large crescent]
large crescent
[kris + chamkaq = large crescent]
large crescent
[tabarzin + kris = throwing knife]
throwing knife,francisca
[throwing knife + francisca = ball mace]
ball mace
[francisca + throwing knife = ball mace]
ball mace
[kris + francisca = dirk]
dirk,tabarzin
[dirk + tabarzin = francisca]
francisca
[tabarzin + dirk = francisca]
francisca
[kris + tabarzin = throwing knife]
throwing knife,francisca
[throwing knife + francisca = ball mace]
ball mace
[francisca + throwing knife = ball mace]
ball mace
Also, note that duplicate items in a list of weapons are significant and can't be removed. For example, if I add a second kris to my list of starting weapons so that I have the following list:
francisca
tabarzin
kris
kris
then I'm able to forge the following items:
ball mace
battle axe
battle knife
chamkaq
dirk
francisca
kris
kudi
large crescent
scramasax
throwing knife
The addition of a duplicate kris allowed me to forge four new items that I couldn't before. It also increased the total number of forge tests to 252 for a four-item list, up from 27 for the three-item list.
Edit: I'm getting the feeling that solving this would require more math and computer science knowledge than I have, so I'm going to give up on it. It seemed like a simple enough problem at first, but then, so does the Travelling Salesman. I'm accepting David Eisenstat's answer since the suggestion of remembering and skipping duplicate item lists made such a huge difference in execution time and seems like it would be applicable to a lot of similar problems.
Start by memoizing the brute force solution, i.e., sort source_weapons, make it hashable (e.g., convert to a string by joining with commas), and look it up in a map of input/output pairs. If it isn't there, do the computation as normal and add the result to the map. This often results in big wins for little effort.
Alternatively, you could do a backward search. Given a multiset of weapons, form predecessors by replacing one of the weapon with two weapons that forge it, in all possible ways. Starting with the singleton list consisting of the singleton multiset consisting of the goal weapon, repeatedly expand the list by predecessors of list elements and then cull multisets that are supersets of others. Stop when you reach a fixed point.
If linear programming is an option, then there are systematic ways to prune search trees. In particular, let's make the problem easier by (i) allowing an infinite supply of "catalysts" (maybe not needed here?) (ii) allowing "fractional" forging, e.g., if X + Y => Z, then 0.5 X + 0.5 Y => 0.5 Z. Then there's an LP formulation as follows. For all i + j => k (i and j forge k), the variable x_{ijk} is the number of times this forge is performed.
minimize sum_{i, j => k} x_{ijk} (to prevent wasteful cycles)
for all i: sum_{j, k: j + k => i} x_{jki}
- sum_{j, k: j + i => k} x_{jik}
- sum_{j, k: i + j => k} x_{ijk} >= q_i,
for all i + j => k: x_{ijk} >= 0,
where q_i is 1 if i is the goal item, else minus the number of i initially available. There are efficient solvers for this easy version. Since the reactions are always 2 => 1, you can always recover a feasible forging schedule for an integer solution. Accordingly, I would recommend integer programming for this problem. The paragraph below may still be of interest.
I know shipping an LP solver may be inconvenient, so here's an insight that will let you do without. This LP is feasible if and only if its dual is bounded. Intuitively, the dual problem is to assign a "value" to each item such that, however you forge, the total value of your inventory does not increase. If the goal item is valued at more than the available inventory, then you can't forge it. You can use any method that you can think of to assign these values.
I think you are unlikely to get a good general answer to this question because
if there was an efficient algorithm to solve your problem, then it would also be able to solve NP-complete problems.
For example, consider the problem of finding the maximum number of independent rows in a binary matrix.
This is a known NP-complete problem (e.g. by showing equivalence to the maximum independent set problem).
We can reduce this problem to your question in the following manner:
We can start holding one weapon for each column in the binary matrix, and then we imagine each row describes an alternative way of making a new weapon (say a battle axe).
We construct the weapon translation table such that to make the battle axe using method i, we need all weapons j such that M[i,j] is equal to 1 (this may involve inventing some additional weapons).
Then we construct a series of super weapons which can be made by combining different numbers of our battle axes.
For example, the mega ultimate battle axe may require 4 battle axes to be combined.
If we are able to work out the best weapon that can be constructed from your starting weapons, then we have solved the problem of finding the maximum number of independent rows in the original binary matrix.
It's not a huge saving, however looking at the source document, there are times when combining weapons produces the same weapon as one that was combined. I assume that you won't want to do this as you'll end up with less weapons.
So if you added a check for if the result_weapon was the same type as one of the inputs, and didn't go ahead and recursively call find_possible_weapons(new_weapons), you'd trim the search down a little.
The other thing I could think of, is you are not keeping a track of work done, so if the return from find_possible_weapons(new_weapons) returns the same weapon that you already have got by combining other weapons, you might well be performing the same search branch multiple times.
e.g. if you have a, b, c, d, e, f, g, and if a + b = x, and c + d = x, then you algorithm will be performing two lots of comparing x against e, f, and g. So if you keep a track of what you've already computed, you'll be onto a winner...
Basically, you have to trim the search tree. There are loads of different techniques to do this: it's called search. If you want more advice, I'd recommend going to the computer science stack exchange.
If you are still struggling, then you could always start weighting items/resulting items, and only focus on doing the calculation on 'high gain' objects...
You might want to start by creating a Weapon[][] matrix, to show the results of forging each pair. You could map the name of the weapon to the index of the matrix axis, and lookup of the results of a weapon combination would occur in constant time.

Algorithm to give more weight to the first word

Right now, I'm trying to create an algorithm that gives a score to a user, depending on his input in a text field.
This score is supposed to encourage the user to add more text to his personal profile.
The way the algorithm should work, is that it should account a certain weight to the first word, and a little less weight to the second word. The third word will receive a little less weight than the second word, and so on.
The goal is to encourage users to expand their texts, but to avoid spam in general as well. For instance, the added value of the 500th word shouldn't be much at all.
The difference between a text of 100 words and a text of 500 words should be substantial.
Am I making any sense so far?
Right now, I wouldn't know where to begin with this question. I've tried multiple Google queries, but didn't seem to find anything of the sort. Can anyone point me in the right direction?
I suppose such an algorithm must already exist somewhere (or at least the general idea probably exists) but I can't seem to be able to find some help on the subject.
Can anyone point me in the right direction?
I'd really appreciate any help you can give me.
Thanks a lot.
// word count in user description
double word_count = ...;
// word limit over which words do not improve score
double word_limit = ...;
// use it to change score progression curve
// if factor = 1, progression is linear
// if factor < 1, progression is steeper at the beginning
// if factor > 1, progression is steeper at the end
double factor = ...;
double score = pow(min(word_count, word_limit) / word_limit, factor);
It depends how complex you want/need it to be, and whether or not you want a constant reduction in the weight applied to a particular word.
The simplest would possibly be to apply a relatively high weight (say 1000) to the first word, and then each subsequent word has a weight one less than the weight of the previous word; so the second word has a weight of 999, the third word has a weight of 998, etc. That has the "drawback" that the sum of the weights doesn't increase past the 1000 word mark - you'll have to decide for yourself whether or not that's bad for your particular situation. That may not do exactly what you need to do, though.
If you don't want a linear reduction, it could be something simple such as the first word has a weight of X, the second word has a weight equal to Y% of X, the third word has a weight equal to Y% of Y% of X, etc. The difference between the first and second word is going to be larger than the difference between the second and third word, and by the time you reach the 500th word, the difference is going to be far smaller. It's also not difficult to implement, since it's not a complex formula.
Or, if you really need to, you could use a more complex mathematical function to calculate the weight - try googling 'exponential decay' and see if that's of any use to you.
It is not very difficult to implement a custom scoring function. Here is one in pseudo code:
function GetScore( word_count )
// no points for the lazy user
if word_count == 0
return 0
// 20 points for the first word and then up to 90 points linearly:
else if word_count >= 1 and word_count <= 100
return 20 + 70 * (word_count - 1) / (100)
// 90 points for the first 100 words and then up to 100 points linearly:
else if word_count >= 101 and word_count <= 1000
return 90 + 10 * (word_count - 100) / (900)
// 100 points is the maximum for 1000 words or more:
else
return 100
end function
I would go with something like result = 2*sqrt(words_count), anyway you can use any function that has derivative less then 1 e.g. log

String similarity score/hash

Is there a method to calculate something like general "similarity score" of a string? In a way that I am not comparing two strings together but rather I get some number (hash) for each string that can later tell me that two strings are or are not similar. Two similar strings should have similar (close) hashes.
Let's consider these strings and scores as an example:
Hello world 1000
Hello world! 1010
Hello earth 1125
Foo bar 3250
FooBarbar 3750
Foo Bar! 3300
Foo world! 2350
You can see that Hello world! and Hello world are similar and their scores are close to each other.
This way, finding the most similar strings to a given string would be done by subtracting given strings score from other scores and then sorting their absolute value.
I believe what you're looking for is called a Locality Sensitive Hash. Whereas most hash algorithms are designed such that small variations in input cause large changes in output, these hashes attempt the opposite: small changes in input generate proportionally small changes in output.
As others have mentioned, there are inherent issues with forcing a multi-dimensional mapping into a 2-dimensional mapping. It's analogous to creating a flat map of the Earth... you can never accurately represent a sphere on a flat surface. Best you can do is find a LSH that is optimized for whatever feature it is you're using to determine whether strings are "alike".
Levenstein distance or its derivatives is the algorithm you want.
Match given string to each of strings from dictionary.
(Here, if you need only fixed number of most similar strings, you may want to use min-heap.)
If running Levenstein distance for all strings in dictionary is too expensive, then use some rough
algorithm first that will exclude too distant words from list of candidates.
After that, run levenstein distance on left candidates.
One way to remove distant words is to index n-grams.
Preprocess dictionary by splitting each of words into list of n-grams.
For example, consider n=3:
(0) "Hello world" -> ["Hel", "ell", "llo", "lo ", "o w", " wo", "wor", "orl", "rld"]
(1) "FooBarbar" -> ["Foo", "ooB", "oBa", "Bar", "arb", "rba", "bar"]
(2) "Foo world!" -> ["Foo", "oo ", "o w", " wo", "wor", "orl", "rld", "ld!"]
Next, create index of n-gramms:
" wo" -> [0, 2]
"Bar" -> [1]
"Foo" -> [1, 2]
"Hel" -> [0]
"arb" -> [1]
"bar" -> [1]
"ell" -> [0]
"ld!" -> [2]
"llo" -> [0]
"lo " -> [0]
"o w" -> [0, 2]
"oBa" -> [1]
"oo " -> [2]
"ooB" -> [1]
"orl" -> [0, 2]
"rba" -> [1]
"rld" -> [0, 2]
"wor" -> [0, 2]
When you need to find most similar strings for given string, you split given string into n-grams and select only those
words from dictionary which have at least one matching n-gram.
This reduces number of candidates to reasonable amount and you may proceed with levenstein-matching given string to each of left candidates.
If your strings are long enough, you may reduce index size by using min-hashing technnique:
you calculate ordinary hash for each of n-grams and use only K smallest hashes, others are thrown away.
P.S. this presentation seems like a good introduction to your problem.
This isn't possible, in general, because the set of edit distances between strings forms a metric space, but not one with a fixed dimension. That means that you can't provide a mapping between strings and integers that preserves a distance measure between them.
For example, you cannot assign numbers to these three phrases:
one two
one six
two six
Such that the numbers reflect the difference between all three phrases.
While the idea seems extremely sweet... I've never heard of this.
I've read many, many, technics, thesis, and scientific papers on the subject of spell correction / typo correction and the fastest proposals revolve around an index and the levenshtein distance.
There are fairly elaborated technics, the one I am currently working on combines:
A Bursted Trie, with level compactness
A Levenshtein Automaton
Even though this doesn't mean it is "impossible" to get a score, I somehow think there would not be so much recent researches on string comparisons if such a "scoring" method had proved efficient.
If you ever find such a method, I am extremely interested :)
Would Levenshtein distance work for you?
In an unbounded problem, there is no solution which can convert any possible sequence of words, or any possible sequence of characters to a single number which describes locality.
Imagine similarity at the character level
stops
spots
hello world
world hello
In both examples the messages are different, but the characters in the message are identical, so the measure would need to hold a position value , as well as a character value. (char 0 == 'h', char 1 == 'e' ...)
Then compare the following similar messages
hello world
ello world
Although the two strings are similar, they could differ at the beginning, or at the end, which makes scaling by position problematic.
In the case of
spots
stops
The words only differ by position of the characters, so some form of position is important.
If the following strings are similar
yesssssssssssssss
yessssssssssssss
Then you have a form of paradox. If you add 2 s characters to the second string, it should share the distance it was from the first string, but it should be distinct. This can be repeated getting progressively longer strings, all of which need to be close to the strings just shorter and longer than them. I can't see how to achieve this.
In general this is treated as a multi-dimensional problem - breaking the string into a vector
[ 'h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd' ]
But the values of the vector can not be
represented by a fixed size number, or
give good quality difference measure.
If the number of words, or length of strings were bounded, then a solution of coding may be possible.
Bounded values
Using something like arithmetic compression, then a sequence of words can be converted into a floating point number which represents the sequence. However this would treat items earlier in the sequence as more significant than the last item in the sequence.
data mining solution
If you accept that the problem is high dimensional, then you can store your strings in a metric-tree wikipedia : metric tree. This would limit your search space, whilst not solving your "single number" solution.
I have code for such at github : clustering
Items which are close together, should be stored together in a part of the tree, but there is really no guarantee. The radius of subtrees is used to prune the search space.
Edit Distance or Levenshtein distance
This is used in a sqlite extension to perform similarity searching, but with no single number solution, it works out how many edits change one string into another. This then results in a score, which shows similarity.
I think of something like this:
remove all non-word characters
apply soundex
Your idea sounds like ontology but applied to whole phrases. The more similar two phrases are, the closer in the graph they are (assuming you're using weighted edges). And vice-versa: non similar phrases are very far from each other.
Another approach, is to use Fourier transform to get sort of the 'index' for a given string (it won't be a single number, but always). You may find little bit more in this paper.
And another idea, that bases on the Levenshtein distance: you may compare n-grams that will give you some similarity index for two given phrases - the more they are similar the value is closer to 1. This may be used to calculate distance in the graph. wrote a paper on this a few years ago, if you'd like I can share it.
Anyways: despite I don't know the exact solution, I'm also interested in what you'll came up with.
Maybe use PCA, where the matrix is a list of the differences between the string and a fixed alphabet (à la ABCDEFGHI...). The answer could be simply the length of the principal component.
Just an idea.
ready-to-run PCA in C#
It is unlikely one can get a rather small number from two phrases that, being compared, provide a relevant indication of the similarity of their initial phrases.
A reason is that the number gives an indication in one dimension, while phrases are evolving in two dimensions, length and intensity.
The number could evolve as well in length as in intensity but I'm not sure it'll help a lot.
In two dimensions, you better look at a matrix, which some properties like the determinant (a kind of derivative of the matrix) could give a rough idea of the phrase trend.
In Natural Language Processing we have a thing call Minimum Edit Distance (also known as Levenshtein Distance)
Its basically defined as the smallest amount of operation needed in order to transform string1 to string2
Operations included Insertion, Deletion, Subsitution, each operation is given a score to which you add to the distance
The idea to solve your problem is to calculate the MED from your chosen string, to all the other string, sort that collection and pick out the n-th first smallest distance string
For example:
{"Hello World", "Hello World!", "Hello Earth"}
Choosing base-string="Hello World"
Med(base-string, "Hello World!") = 1
Med(base-string, "Hello Earth") = 8
1st closest string is "Hello World!"
This have somewhat given a score to each string of your string-collection
C# Implementation (Add-1, Deletion-1, Subsitution-2)
public static int Distance(string s1, string s2)
{
int[,] matrix = new int[s1.Length + 1, s2.Length + 1];
for (int i = 0; i <= s1.Length; i++)
matrix[i, 0] = i;
for (int i = 0; i <= s2.Length; i++)
matrix[0, i] = i;
for (int i = 1; i <= s1.Length; i++)
{
for (int j = 1; j <= s2.Length; j++)
{
int value1 = matrix[i - 1, j] + 1;
int value2 = matrix[i, j - 1] + 1;
int value3 = matrix[i - 1, j - 1] + ((s1[i - 1] == s2[j - 1]) ? 0 : 2);
matrix[i, j] = Math.Min(value1, Math.Min(value2, value3));
}
}
return matrix[s1.Length, s2.Length];
}
Complexity O(n x m) where n, m is length of each string
More info on Minimum Edit Distance can be found here
Well, you could add up the ascii value of each character and then compare the scores, having a maximum value on which they can differ. This does not guarantee however that they will be similar, for the same reason two different strings can have the same hash value.
You could of course make a more complex function, starting by checking the size of the strings, and then comparing each caracter one by one, again with a maximum difference set up.

Fastest way to find minimal Hamming distance to any substring?

Given a long string L and a shorter string S (the constraint is that L.length must be >= S.length), I want to find the minimum Hamming distance between S and any substring of L with length equal to S.length. Let's call the function for this minHamming(). For example,
minHamming(ABCDEFGHIJ, CDEFGG) == 1.
minHamming(ABCDEFGHIJ, BCDGHI) == 3.
Doing this the obvious way (enumerating every substring of L) requires O(S.length * L.length) time. Is there any clever way to do this in sublinear time? I search the same L with several different S strings, so doing some complicated preprocessing to L once is acceptable.
Edit: The modified Boyer-Moore would be a good idea, except that my alphabet is only 4 letters (DNA).
Perhaps surprisingly, this exact problem can be solved in just O(|A|nlog n) time using Fast Fourier Transforms (FFTs), where n is the length of the larger sequence L and |A| is the size of the alphabet.
Here is a freely available PDF of a paper by Donald Benson describing how it works:
Fourier methods for biosequence analysis (Donald Benson, Nucleic Acids Research 1990 vol. 18, pp. 3001-3006)
Summary: Convert each of your strings S and L into several indicator vectors (one per character, so 4 in the case of DNA), and then convolve corresponding vectors to determine match counts for each possible alignment. The trick is that convolution in the "time" domain, which ordinarily requires O(n^2) time, can be implemented using multiplication in the "frequency" domain, which requires just O(n) time, plus the time required to convert between domains and back again. Using the FFT each conversion takes just O(nlog n) time, so the overall time complexity is O(|A|nlog n). For greatest speed, finite field FFTs are used, which require only integer arithmetic.
Note: For arbitrary S and L this algorithm is clearly a huge performance win over the straightforward O(mn) algorithm as |S| and |L| become large, but OTOH if S is typically shorter than log|L| (e.g. when querying a large DB with a small sequence), then obviously this approach provides no speedup.
UPDATE 21/7/2009: Updated to mention that the time complexity also depends linearly on the size of the alphabet, since a separate pair of indicator vectors must be used for each character in the alphabet.
Modified Boyer-Moore
I've just dug up some old Python implementation of Boyer-Moore I had lying around and modified the matching loop (where the text is compared to the pattern). Instead of breaking out as soon as the first mismatch is found between the two strings, simply count up the number of mismatches, but remember the first mismatch:
current_dist = 0
while pattern_pos >= 0:
if pattern[pattern_pos] != text[text_pos]:
if first_mismatch == -1:
first_mismatch = pattern_pos
tp = text_pos
current_dist += 1
if current_dist == smallest_dist:
break
pattern_pos -= 1
text_pos -= 1
smallest_dist = min(current_dist, smallest_dist)
# if the distance is 0, we've had a match and can quit
if current_dist == 0:
return 0
else: # shift
pattern_pos = first_mismatch
text_pos = tp
...
If the string did not match completely at this point, go back to the point of the first mismatch by restoring the values. This makes sure that the smallest distance is actually found.
The whole implementation is rather long (~150LOC), but I can post it on request. The core idea is outlined above, everything else is standard Boyer-Moore.
Preprocessing on the Text
Another way to speed things up is preprocessing the text to have an index on character positions. You only want to start comparing at positions where at least a single match between the two strings occurs, otherwise the Hamming distance is |S| trivially.
import sys
from collections import defaultdict
import bisect
def char_positions(t):
pos = defaultdict(list)
for idx, c in enumerate(t):
pos[c].append(idx)
return dict(pos)
This method simply creates a dictionary which maps each character in the text to the sorted list of its occurrences.
The comparison loop is more or less unchanged to naive O(mn) approach, apart from the fact that we do not increase the position at which comparison is started by 1 each time, but based on the character positions:
def min_hamming(text, pattern):
best = len(pattern)
pos = char_positions(text)
i = find_next_pos(pattern, pos, 0)
while i < len(text) - len(pattern):
dist = 0
for c in range(len(pattern)):
if text[i+c] != pattern[c]:
dist += 1
if dist == best:
break
c += 1
else:
if dist == 0:
return 0
best = min(dist, best)
i = find_next_pos(pattern, pos, i + 1)
return best
The actual improvement is in find_next_pos:
def find_next_pos(pattern, pos, i):
smallest = sys.maxint
for idx, c in enumerate(pattern):
if c in pos:
x = bisect.bisect_left(pos[c], i + idx)
if x < len(pos[c]):
smallest = min(smallest, pos[c][x] - idx)
return smallest
For each new position, we find the lowest index at which a character from S occurs in L. If there is no such index any more, the algorithm will terminate.
find_next_pos is certainly complex, and one could try to improve it by only using the first several characters of the pattern S, or use a set to make sure characters from the pattern are not checked twice.
Discussion
Which method is faster largely depends on your dataset. The more diverse your alphabet is, the larger will be the jumps. If you have a very long L, the second method with preprocessing might be faster. For very, very short strings (like in your question), the naive approach will certainly be the fastest.
DNA
If you have a very small alphabet, you could try to get the character positions for character bigrams (or larger) rather than unigrams.
You're stuck as far as big-O is concerned.. At a fundamental level, you're going to need to test if every letter in the target matches each eligible letter in the substring.
Luckily, this is easily parallelized.
One optimization you can apply is to keep a running count of mismatches for the current position. If it's greater than the lowest hamming distance so far, then obviously you can skip to the next possibility.

Resources