How to remove OCR artifacts from text? - algorithm

OCR generated texts sometimes come with artifacts, such as this one:
Diese grundsätzliche V e r b o r g e n h e i t Gottes, die sich n u r dem N a c h f o l g e r ö f f n e t , ist m i t d e m Messiasgeheimnis gemeint
While it is not unusual for the spacing between letters to be used for emphasis (probably a limitation of early printing presses), it is unfavorable for retrieval tasks.
How can one turn the above text into a more, say, canonical form, like:
Diese grundsätzliche Verborgenheit Gottes, die sich nur dem Nachfolger öffnet, ist mit dem Messiasgeheimnis gemeint
Can this be done efficiently for large amounts of text?
One idea would be to concatenate the whole string (to skip guessing where the word boundaries are) and then run a text segmentation algorithm on it, maybe something similar to this: http://norvig.com/ngrams/

If you have a dictionary for the target language, and all spaced-out words consist of just a single word, then it's easy: Just scan through the text, looking for maximal-length runs of spaced-out single letters, and replace them with the single corresponding dictionary word if it exists (and otherwise leave them unchanged).
The only real difficulty is with strings like m i t d e m that correspond to two or more separate words. A simple way would be to greedily "nibble off" prefixes that appear in the dictionary, but this can lead to suboptimal results, and in particular to a leftover suffix that doesn't correspond to any dictionary word even though a different choice of breakpoints would have worked (e.g. b e i m A r z t won't work if you greedily grab bei instead of beim from the front). Fortunately there's a simple DP approach -- linear-time if the maximum word length is bounded by a constant -- that will do a better job, and can even incorporate weights on words, which helps to pick the most likely decomposition whenever there is more than one.
Given a string S[1 .. n] (with spaces removed), we compute f(i), the score of the best decomposition of the length-i prefix of S, for all 1 <= i <= n:
f(0) = 0
f(i) = max over all 0 <= j < i of f(j) + dictScore(S[j+1 .. i])
f(n) will then be the score of the best possible decomposition of the entire string. If you set dictScore(T) to 1 for words that exist in the dictionary and 0 for words that don't, you will get a decomposition into as many words as possible; if you set dictScore(T) to, e.g., -1 for words that exist in the dictionary and -2 for words that don't, you'll get a decomposition into as few words as possible. You can also choose to award higher scores for more "likely" words.
After computing these scores, you can walk back through the DP matrix to reconstruct a decomposition that corresponds to the maximal score.
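For illustration, here is a minimal Python sketch of this DP (the dict_score function and max_word_len cutoff are my own illustrative choices; unknown words get a large negative score so they are used only as a last resort):

def segment(s, dict_score, max_word_len=30):
    # f[i] = best score of any decomposition of s[:i]; back[i] = best split point
    n = len(s)
    f = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    f[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            score = f[j] + dict_score(s[j:i])
            if score > f[i]:
                f[i], back[i] = score, j
    # walk back through the table to recover the best decomposition
    words, i = [], n
    while i > 0:
        words.append(s[back[i]:i])
        i = back[i]
    return list(reversed(words))

# e.g. segment("mitdem", lambda w: 1.0 if w in {"mit", "dem"} else -10.0)
# returns ['mit', 'dem']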

Related

Generating a perfect hash function given known list of strings?

Suppose I have a list of N strings, known at compile-time.
I want to generate (at compile-time) a function that will map each string to a distinct integer between 1 and N inclusive. The function should take very little time or space to execute.
For example, suppose my strings are:
{"apple", "orange", "banana"}
Such a function may return:
f("apple") -> 2
f("orange") -> 1
f("banana") -> 3
What's a strategy to generate this function?
I was thinking of analyzing the strings at compile time and looking for a couple of constants I could mod or add by, or something along those lines?
The compile-time generation time/space can be quite expensive (but obviously not ridiculously so).
Say you have m distinct strings, and let a_{i,j} be the j-th character of the i-th string. In the following, I'll assume that they all have the same length; this can easily be arranged in any reasonable programming language by treating a_{i,j} as the null character whenever j ≥ |a_i|.
The idea I suggest is composed of two parts:
Find (at most) m - 1 positions differentiating the strings, and store these positions.
Create a perfect hash function by considering the strings as length-(m - 1) vectors over those positions, and storing the parameters of the perfect hash function.
Obviously, in the worst case, the hash function must check at least m - 1 positions. It's easy to see this by induction. For 2 strings, at least 1 character must be checked. Assume it's true for i strings: i - 1 positions must be checked. Create a new set of strings by appending 0 to the end of each of the i strings, and add a new string that is identical to one of them, except it has a 1 at the end; distinguishing these two now forces a check of the final position as well, for a total of i positions.
Conversely, it's possible to find at most m - 1 positions sufficient for differentiating the strings (for some sets the number might of course be lower, as low as log base alphabet-size of m). Again, it's easy to see by induction. Two distinct strings must differ at some position. Placing the strings in a matrix with m rows, there must be some column where not all characters are the same. Partitioning the matrix by that column into two or more parts, and applying the argument recursively to each part with more than one row, shows this.
Say the m - 1 positions are p_1, ..., p_{m-1}. In the following, recall the convention above for a_{i,p_j} when p_j ≥ |a_i|: it is the null character.
Let us define h(a_i) = (∑_{j=1..m-1} q_j · a_{i,p_j}) mod n, for random q_j and some n. Then h is known to be a universal hash function: for any pair x ≠ y, the collision probability P(h(x) = h(y)) is at most 1/n.
Given a universal hash function, there are known constructions for creating a perfect hash function from it. Perhaps the simplest is to create a vector of size m^2 and successively try the above h with n = m^2 and fresh random coefficients, until there are no collisions. The expected number of attempts until this succeeds is about 2, and the probability that more attempts are needed decreases exponentially.
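To make that last step concrete, here is a rough Python sketch of the retry loop (the positions p_j are assumed to have been found already; the names and signatures are mine, not from any library):

import random

def build_perfect_hash(strings, positions, max_tries=1000):
    # h(s) = sum(q_j * s[p_j]) mod n with n = m^2 slots; retry the random
    # coefficients q_j until h is collision-free on `strings`
    m = len(strings)
    n = m * m
    def char_at(s, p):          # null character for out-of-range positions
        return ord(s[p]) if p < len(s) else 0
    for _ in range(max_tries):  # about 2 tries are expected to suffice
        q = [random.randrange(n) for _ in positions]
        def h(s, q=q):
            return sum(qj * char_at(s, pj) for qj, pj in zip(q, positions)) % n
        if len({h(s) for s in strings}) == m:  # no collisions: h is perfect
            return h
    raise RuntimeError("no collision-free coefficients found")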
It is simple. Make a dictionary and assign 1 to the first word, 2 to the second, ... No need to make things complicated, just number your words.
To make the lookup efficient, use a trie, binary search, or whatever tool your language provides.
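In Python terms, the whole suggestion is a couple of lines (just to illustrate the numbering idea; a compile-time version would bake the same mapping into generated code):

words = ["apple", "orange", "banana"]
f = {w: i + 1 for i, w in enumerate(words)}  # each word -> a distinct int in 1..N
assert f["apple"] == 1 and f["banana"] == 3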

Need efficient algorithm in combinatorics

I am trying to find the best (realistic) algorithm for solving a cryptography challenge, in which:
the given cipher text C is made of about 6000 characters taken in the set S={A,B,C,...,Y,a,b,c,...y}. So |S| = 50.
the encryption scheme does not allow two identical adjacent characters in C
25 letters in S are called Nulls, and are unknown
these Nulls must be removed from C to obtain the actual cipher text C' which can then be attacked.
the list of Nulls in C is named N and |N| is close to |C|/2 = 3000
so: |N| + |C'| = |C|
My aim is to identify the 25 Nulls, satisfying these two conditions:
there may not be two identical adjacent characters in C'
there may not be two identical adjacent Nulls in N
Obviously by brute force there are 50!/(25! 25!) = 126410606437752 combinations of 25 Nulls in S, so this is not a realistic approach.
I have tried to recursively explore the tree of sets of Nulls and 'cut branches' as much and as soon as possible.
For example, when adding a letter of S to the subset of Nulls: if the sequence "x n1 n2 x" appears in C, where x is not yet a Null and n1, n2 are Nulls, then x must be a Null too (otherwise removing the Nulls would leave two adjacent x's in C').
However this is not enough for a run-time lower than a few centuries...
Can you think of a more clever algorithm for identifying these 25 Nulls ?
Note: there might be more than one set of Nulls satisfying the two conditions
Let's try something like this:
Create a list of sets, each containing one char from S (each set is a candidate group of Nulls).
while you have more than two sets:
for each set
search the cipher text for X[<set-chars>]+X
if found, union this set with the set containing X.
if no sets were united during a full pass, guess: unite two sets and recurse.
You can speed things up if you keep a separate cipher text for each set, removing from it the chars in the set. If you do so, the search is easier: you are searching for XX, which has constant length. Every time you union two sets, you need to remove all the chars of both sets from the corresponding cipher text.
The time this will take depends on the string C you are given.
An explanation of the sets: each set is a candidate subset of C' or N. If you find that A and X are in the same group, then {A, X} is a subset of either N or C'. If you later find the same about Y and B, then {Y, B} is such a subset too. Later, finding a substring YAXAXY means that Y is in the same group as A and X, and so is B, because it is grouped with Y. At the end you will be left with two groups, one for C' and one for N, which you cannot distinguish between.
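A rough Python sketch of the merging pass described above (my own illustration, using a simple union-find; the backtracking "guess" step is omitted):

import re

def merge_groups(cipher, alphabet):
    parent = {c: c for c in alphabet}
    def find(c):                       # union-find with path compression
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c
    changed = True
    while changed:
        changed = False
        groups = {}
        for c in alphabet:
            groups.setdefault(find(c), set()).add(c)
        for root, members in groups.items():
            # X <members>+ X with X outside the group forces X into the group
            cls = "".join(re.escape(c) for c in members)
            for m in re.finditer(r"(.)[" + cls + r"]+\1", cipher):
                x = m.group(1)
                if x in alphabet and find(x) != root:
                    parent[find(x)] = root
                    changed = True
    groups = {}
    for c in alphabet:                 # ideally two groups remain: N and C'
        groups.setdefault(find(c), set()).add(c)
    return list(groups.values())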
elyashiv's method is the right one.
It is very fast.
I have produced the two sets C' and N, which are interchangeable.
The subsets S1 and S2 of S which produce C' and N are indeed such that S = S1 ∪ S2.
Thank you.

Implementing Parallel Algorithm for Longest Common Subsequence

I am trying to implement the Parallel Algorithm for Longest Common Subsequence Problem described in http://www.iaeng.org/publication/WCE2010/WCE2010_pp499-504.pdf
But I am having a problem with the variable C in Equation 6 on page 4.
The paper refers to C at the end of page 3 as:
"Let C[1 : l] be the finite alphabet."
I am not sure what is meant by this. I guess that for the two strings ABCDEF and ABQXYEF it would be ABCDEFQXY. But what if my two strings are lists of objects (where my match test is, for example, obj1.Name = obj2.Name)? What would my C be here, just a union of the two arrays?
Having read and studied the paper, I can say that C is supposed to be an array holding the alphabet of your strings, where the alphabet size (and, thus, the size of C) is l.
By the looks of your question, however, I feel the need to go deeper on this, because it looks like you didn't get the whole picture yet. What is P[i,j], and why do you need it? The answer is that you don't really need it, but it's an elegant optimization. In page 3, a little bit before Theorem 1, it is said that:
[...] This process ends when j-k = 0 at the k-th step, or a(i) = b(j-k) at the k-th step. Assume that the process stops at the k-th step, and k must be the minimum number that makes a(i) = b(j-k) or j-k = 0. [...]
The recurrence relation in (3) is equivalent to (2), but the fundamental difference is that (2) expands recursively, whereas with (3) you never have recursive calls, provided that you know k. In other words, the magic behind (3) not expanding recursively is that you somehow know the spot where the recursion on (2) would stop, so you look at that cell immediately, rather than recursively approaching it.
Ok then, but how do you find out the value for k? Since k is the spot where (2) reaches a base case, it can be seen that k is the number of columns that you have to "go back" on B until you are either off the limits (i.e., the first column, which is filled with 0's) OR you find a match between a character in B and a character in A (which corresponds to the base case conditions in (2)). Remember that you will be matching the character a(i-1), where i is the current row.
So, what you really want is to find the last position in B before j where the character a(i-1) appears. If no such character ever appears in B before j, then that would be equivalent to reaching the case i = 0 or j-1 = 0 in (2); otherwise, it's the same as reaching a(i) = b(j-1) in (2).
Let's look at an example:
Consider that the algorithm is working on computing the values for i = 2 and j = 3 (the row and column are highlighted in gray). Imagine that the algorithm is working on the cell highlighted in black and is applying (2) to determine the value of S[2,2] (the position to the left of the black one). By applying (2), it would then start by looking at a(2) and b(2). a(2) is C, b(2) is G, so there's no match (this is the same procedure as the original, well-known algorithm).
The algorithm now wants to find the value of S[2,2], because it is needed to compute S[2,3] (where we are). S[2,2] is not known yet, but the paper shows that it is possible to determine that value without referring to the row with i = 2. In (2), the 3rd case is chosen: S[2,2] = max(S[1, 2], S[2, 1]). Notice, if you will, that all this formula is doing is looking at the positions that would have been used to calculate S[2,2]. So, to rephrase that: we're computing S[2,3], we need S[2,2] for that, we don't know it yet, so we're going back on the table to see what the value of S[2,2] is, in pretty much the same way we did in the original, non-parallel algorithm.
When will this stop? In this example, it will stop when we find the letter C (this is our a(i)) in TGTTCGACA before the second T (the letter on the current column) OR when we reach column 0. Because there is no C before T, we reach column 0. Another example:
Here, (2) would stop with j-1 = 5, because that is the last position in TGTTCGACA where C shows up. Thus, the recursion reaches the base case a(i) = b(j-1) when j-1 = 5.
With this in mind, we can see a shortcut here: if you could somehow know the amount k such that j-1-k is a base case in (2), then you wouldn't have to go through the score table to find the base case.
That's the whole idea behind P[i,j]. P is a table where you lay down the whole alphabet vertically (on the left side); the string B is, once again, placed horizontally in the upper side. This table is computed as part of a preprocessing step, and it will tell you exactly what you will need to know ahead of time: for each position j in B, it says, for each character C[i] in C (the alphabet), what is the last position in B before j where C[i] is found (note that i is used to index C, the alphabet, and not the string A. Maybe the authors should have used another index variable to avoid confusion).
So, you can think of the semantics of an entry P[i,j] as something along the lines of: "the last position in B where I saw C[i] before position j". For example, if your alphabet is sigma = {A, E, I, O, U}, and B = "AOOIUEI", then P is:
Take the time to understand this table. Note the row for O. Remember: this row lists, for every position in B, where the last known "O" is. Only when j = 3 will we have a value that is not zero (it's 2), because that's the position after the first O in AOOIUEI. This entry says that the last position in B where O was seen before is position 2 (and, indeed, B[2] is an O, the one that follows A). Notice, in that same row, that for j = 4, we have the value 3, because now the last position for O is the one that corresponds to the second O in B (and since no more O's exist, the rest of the row will be 3).
Recall that building P is a preprocessing step necessary if you want to easily find the value of k that makes the recursion from equation (2) stop. It should make sense by now that P[i,j] is the k you're looking for in (3). With P, you can determine that value in O(1) time.
Thus, the C[i] in (6) is a letter of the alphabet: the letter that we are currently considering. In the example above, C = [A,E,I,O,U], and C[1] = A, C[2] = E, etc. In equation (7), c is the position in C where a(i) (the current letter of string A being considered) lives. It makes sense: after all, when building the score table position S[i,j], we want to use P to find the value of k; that is, we want to know where the last time we saw an a(i) in B before j was. We do that by reading P[index_of(a(i)), j].
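Here is how P could be computed in Python, as a sanity check of the description above (0-based list indices internally, with a dummy column for j = 0 so positions stay 1-based as in the paper):

def build_P(alphabet, B):
    # P[i][j] = last position (1-based) in B strictly before j where
    # alphabet[i] occurs, or 0 if it has not occurred yet
    P = [[0] * (len(B) + 1) for _ in alphabet]
    for i, ch in enumerate(alphabet):
        last = 0
        for j in range(1, len(B) + 1):
            P[i][j] = last
            if B[j - 1] == ch:
                last = j
    return P

# For alphabet "AEIOU" and B = "AOOIUEI", the row for O comes out as
# [0, 0, 0, 2, 3, 3, 3, 3], matching the walkthrough above.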
Ok, now that you understand the use of P, let's see what's happening with your implementation.
About your specific case
In the paper, P is shown as a table that lists the whole alphabet. It is a good idea to iterate through the alphabet because the typical uses of this algorithm are in bioinformatics, where the alphabet is much, much smaller than the string A, making the iteration through the alphabet cheaper.
Because your strings are sequences of objects, your C would be the set of all possible objects, so you'd have to build a table P over the set of all possible object instances (nonsense, of course). This is definitely a case where the alphabet size is huge when compared to your string size. However, note that you will only be indexing P in those rows that correspond to letters from A: any row in P for a letter C[i] that does not occur in A is useless and will never be used. This makes your life easier, because it means you can build P from the string A instead of using the alphabet of every possible object.
Again, an example: if your alphabet is AEIOU, A is EEI and B is AOOIUEI, you will only be indexing P in the rows for E and I, so that's all you need in P:
This works and suffices, because in (7), P[c,j] is the entry in P for the character c, and c is the index of a(i). In other words: C[c] always belongs to A, so it makes perfect sense to build P for the characters of A instead of using the whole alphabet for the cases where the size of A is considerably smaller than the size of C.
All you have to do now is to apply the same principle to whatever your objects are.
I really don't know how to explain it any better. This may be a little dense at first. Make sure to re-read it until you really get it - and I mean every little detail. You have to master this before thinking about implementing it.
NOTE: You said you were looking for a credible and / or official source. I'm just another CS student, so I'm not an official source, but I think I can be considered "credible". I've studied this before and I know the subject. Happy coding!

string of integers puzzle

I apologize for not have the math background to put this question in a more formal way.
I'm looking to create a string of 796 letters (or integers) with certain properties.
Basically, the string is a variation on a De Bruijn sequence B(12,4), except order and repetition within each n-length subsequence are disregarded.
i.e. ABBB BABA BBBA are each equivalent to {AB}.
In other words, the main property of the string involves looking at consecutive groups of 4 letters within the larger string
(i.e. the 1st through 4th letters, the 2nd through 5th letters, the 3rd through 6th letters, etc)
And then producing the set of letters that comprise each group (repetitions and order disregarded)
For example, in the string of 9 letters:
A B B A C E B C D
the first 4-letter groups is: ABBA, which is comprised of the set {AB}
the second group is: BBAC, which is comprised of the set {ABC}
the third group is: BACE, which is comprised of the set {ABCE}
etc.
The goal is for every combination of 1-4 letters from a set of N letters to be represented by the 1-4-letter resultant sets of the 4-element groups once and only once in the original string.
For example, if there is a set of 5 letters {A, B, C, D, E} being used
Then the possible 1-4 letter combinations are:
A, B, C, D, E,
AB, AC, AD, AE, BC, BD, BE, CD, CE, DE,
ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE,
ABCD, ABCE, ABDE, ACDE, BCDE
Here is a working example that uses a set of 5 letters {A, B, C, D, E}.
D D D D E C B B B B A E C C C C D A E E E E B D A A A A C B D D B
The 1st through 4th elements form the set: D
The 2nd through 5th elements form the set: DE
The 3rd through 6th elements form the set: CDE
The 4th through 7th elements form the set: BCDE
The 5th through 8th elements form the set: BCE
The 6th through 9th elements form the set: BC
The 7th through 10th elements form the set: B
etc.
* I am hoping to find a working example of a string that uses 12 different letters (a total of 793 4-letter groups within a 796-letter string) starting (and if possible ending) with 4 of the same letter. *
Here is a working solution for 7 letters:
AAAABCDBEAAACDECFAAADBFBACEAGAADEFBAGACDFBGCCCCDGEAFAGCBEEECGFFBFEGGGGFDEEEEFCBBBBGDCFFFFDAGBEGDDDDBE
Beware that in order to attempt an exhaustive search (the VB answer is trying a naive version of that) you'll first have to solve the problem of generating all possible expansions while maintaining lexicographical order. Just ABC expands to all perms of AABC, plus all perms of ABBC, plus all perms of ABCC, which is up to 3*4! strings instead of just AABC. If you just concatenate AABC and AABD, that covers only 4 out of the 4! perms of AABC, and even that by accident. This expansion alone brings exponential complexity - end of game. Plus, you'll need to maintain the association between all expansions and the set (the set becomes a label).
Your best bet is to use one of the known efficient De Bruijn constructors and try to see whether you can fit your set-equivalence into it. Check out
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.674&rep=rep1&type=pdf
and
http://www.dim.uchile.cl/~emoreno/publicaciones/FINALES/copyrighted/IPL05-De_Bruijn_sequences_and_De_Bruijn_graphs_for_a_general_language.pdf
for a start.
If you know graphs, another viable option is to start with De Bruijn graph and formulate your set-equivalence as a graph rewriting. 2nd paper does De Bruijn graph partitioning.
BTW, try the VB answer just for A, B, AB (at least the expansion is small): it will make AABBAB and construct ABBA or ABBAB (or throw, in a decent language), both of which are wrong. You can even prove that it will always miss the 1st lexical expansions (that's what AAB, AAAB etc. are) just by examining the first 2 passes: it will always miss the 2nd A of NxA, because (N-1)xA+B is in the string (the 1st expansion of {AB}).
Oh, and if we could establish how many of each letter an optimal solution should have (don't look at B(5,2), it's too easy and regular :-), a random search would be feasible: you generate candidates with provable traits (like AAAA, BBBB, ... being present and not touching, and having n1 A's, n2 B's, ...) in a random arrangement, and then test whether they are solutions (checking is much faster than exhaustive search in this case).
Cool problem. Just a draft/pseudo algo:
dim STR_A as string = getall(ABCDEFGHIJKL)
//custom function to generate a concatenated list of all 793 4-char combos,
//listed side by side to form one 3172-character-long string.
//different orderings may ultimately produce different results;
//brute-forcing all orders of combos is too much work (793! is a big number).
//need to determine how to find an optimal ordering for this particular
//approach below.
dim STR_B as string = "" //to hold the reduced 4-char group you're searching for
dim STR_C as string = "" //to hold the sub-string you are searching in
dim STR_A_NEW as string = "" //variable to hold your new string
dim MATCH as boolean = false //variable to hold matching status
while len(STR_A) > 0
    //check each character in STR_A, which will be shortened by 1 char on
    //each pass
    MATCH = false
    STR_B = left(STR_A, 4)
    STR_B = reduce(STR_B)
    //reduce(str) is a custom re-usable function to sort & remove duplicates
    for i as integer = 1 to len(STR_A) - 1
        STR_C = substr(STR_A, i, 4)
        //gives you the 4-character sequence beginning at position i
        STR_C = reduce(STR_C)
        IF STR_B = STR_C THEN
            MATCH = true
            exit for
            //as long as there is even one match, you can throw away the
            //first letter
        END IF
    next
    IF MATCH = false THEN
        //if you didn't find a match, the first letter should be saved
        STR_A_NEW += LEFT(STR_B, 1)
    END IF
    MATCH = false //re-init MATCH
    STR_A = RIGHT(STR_A, LEN(STR_A) - 1) //advance STR_A by one character
wend
Anyway -- there could be problems with this approach, and you'd need to write another function to parse your result string (STR_A_NEW) to prove that it's a viable answer...
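Such a verification function is straightforward; here is a sketch in Python (my own, directly encoding the property from the problem statement):

from itertools import combinations

def verify(s, letters, k=4):
    # every nonempty set of at most k letters must appear exactly once
    # as the symbol set of some length-k window of s
    wanted = set()
    for r in range(1, k + 1):
        wanted.update(frozenset(c) for c in combinations(letters, r))
    seen = [frozenset(s[i:i + k]) for i in range(len(s) - k + 1)]
    return len(seen) == len(set(seen)) and set(seen) == wanted

If the worked examples in the question are valid, verify("DDDDECBBBBAECCCCDAEEEEBDAAAACBDDB", "ABCDE") should return True.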
I've been thinking about this one and I'm sketching out a solution.
Let's call a string of four symbols a word and we'll write S(w) to denote the set of symbols in word w.
Each word abcd has "follow-on" words bcde where a,...,e are all symbols.
Let succ(w) be the set of follow-on words v for w such that S(w) != S(v). succ(w) is the set of successor words that can follow on from the first symbol in w if w is in a solution.
For each non-empty set of symbols s of cardinality at most four, let words(s) be the set of words w such that S(w) = s. Any solution must contain exactly one word in words(s) for each such set s.
Now we can do a reasonable search. The basic idea is this: say we are exploring a search path ending with word w. The follow-on word must be a non-excluded word in succ(w). A word v is excluded if the search path contains some word u such that v is in words(S(u)).
You can be slightly more cunning: if we track the possible "predecessor" words to a set s (i.e., words w with a successor v such that v in words(s)) and reach a point where every predecessor of s is excluded, then we know we have reached a dead end, since we'll never be able to obtain s from any extension of the current search path.
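For concreteness, the helper definitions above might look like this in Python (just a sketch of the definitions, not of the search itself; SYMBOLS is an illustrative symbol set):

from itertools import product

SYMBOLS = "ABCDE"

def S(w):
    # the set of symbols in word w
    return frozenset(w)

def succ(w):
    # follow-on words v = w[1:] + x whose symbol set differs from S(w)
    return [w[1:] + x for x in SYMBOLS if S(w[1:] + x) != S(w)]

def words(s):
    # all length-4 words whose symbol set is exactly s
    return [''.join(p) for p in product(s, repeat=4) if frozenset(p) == frozenset(s)]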
Code to follow after the weekend, with a bit of luck...
Here is my proposal. I'll admit upfront this is a performance and memory hog.
This may be overkill, but bear with me: have a class, call it UniqueCombination, which contains a unique reduced 1-4 char combination of the input set (i.e. A, AB, ABC, ...) together with the list of its possible expansions (e.g. AB: {AABB, ABAB, BBAA, ...}). It will need a method that determines whether any of its possible expansions overlaps any possible expansion of another UniqueCombination by three characters, plus an overload that takes a string as well.
Then we start with the string "AAAA" and find all of the UniqueCombinations that overlap this string. Then we find how many UniqueCombinations those possible matches overlap with (we could be smart at this point and store this number). Then we pick the one with the least number of overlaps greater than 0: use up the ones with the fewest possible matches first.
Then we find a specific expansion for the chosen UniqueCombination and add it to the final string. We remove this UniqueCombination from the list, then look for overlaps for the current string. Rinse and repeat. (We could be smart and, on subsequent runs while searching for overlaps, remove any of the unreduced combinations that are contained in the final string.)
Well, that's my plan; I will work on the code this weekend. Granted, this does not guarantee that the final 4 characters will be 4 of the same letter (it might actually be trying to avoid that, but I will look into that as well).
If there is a non-exponential solution at all, it may need to be formulated in terms of a recursive "growth" from a problem of smaller size, i.e. to construct B(N,k) from B(N-1,k-1), from B(N-1,k), or from B(N,k-1).
Systematic construction for B(5,2), one step at a time :-) It's bound to get more complex later. [card stands for cardinality: {AB} has card=2; I'll also call these sets 2-s, 3-s etc.] Note: 2-s and 3-s will be k-1 and k later (I hope).
Initial. Start with k-1 result and inject symbols for singletons
(unique expansion empty intersection):
ABCDE -> AABBCCDDEE
mark used card=2 sets: AB,BC,CD,DE
Rewriting. Form card=3 sets to inject symbols into marked card=2.
1st feasible lexicographic expansion fires (may have to backtrack for k>2)
it's OK to use already marked 2-s since they'll all get replaced
but may have to do a verification pass for higher k
AB->ACB, BC->BCD, CD->CED, DE->DAE ==> AACBBDCCEDDAEEB
mark/verify used 2-s
normally keep marking/unmarking during the construction, but also keep the old
mark list
marking/unmarking can get expensive if there's backtracking in #3
Unused: AB, BE
For higher k may need several recursive rewriting passes
possibly partitioning new sets into classes
Finalize: unused 2-s should overlap around the edge (that's why it's cyclic)
ABE - B can go to the beginning or the end: AACBBDCCEDDAEEB
Note: a step from B(N-1,k) to B(N,k) may need injection of pseudo-singletons, like doubling or tripling A
B(5,2) -> B(5,3) -> B(5,4)
Initial. Same: ABCDE -> AAACBBBDCCCEDDDAEEEB
no use marking 3-sets, since they are all going to be changed
Rewriting.
choose systematic insertion positions
AAA_CBBB_DCCC_EDDD_AEEE_B
mark all 2-s released by this: AC,AD,BD,BE,CE
use marked 2-s to decide inserted symbols - notice the total regularity:
AxCB D -> ADCB
BxDC E -> BEDC
CxED A -> CAED
DxAE B -> DBAE
ExBA C -> ECBA
Verify that 3-s are all used (marked inserted symbols just for fun)
AAA[D]CBBB[E]DCCC[A]EDDD[B]AEEE[C]B
Note: The systematic choice of insertion point deterministically dictated the insertions (only AD can fit 1st; AC would create a duplicate 2-set (AAC, ACC))
Note: It's not going to be so nice for B(6,2) and B(6,3), since the number of 2-s will exceed 2x the number of 1-s. This is important, since 2-s sit naturally on the sides of 1-s, like CBBBE, and the issue is how to place them when you run out of 1-s.
B(5,3) is so symmetrical that just repeating #1 produces B(5,4):
AAAADCBBBBEDCCCCAEDDDDBAEEEECB

Fastest way to find minimal Hamming distance to any substring?

Given a long string L and a shorter string S (the constraint is that L.length must be >= S.length), I want to find the minimum Hamming distance between S and any substring of L with length equal to S.length. Let's call the function for this minHamming(). For example,
minHamming(ABCDEFGHIJ, CDEFGG) == 1.
minHamming(ABCDEFGHIJ, BCDGHI) == 3.
Doing this the obvious way (enumerating every substring of L) requires O(S.length * L.length) time. Is there any clever way to do this in sublinear time? I search the same L with several different S strings, so doing some complicated preprocessing to L once is acceptable.
Edit: The modified Boyer-Moore would be a good idea, except that my alphabet is only 4 letters (DNA).
Perhaps surprisingly, this exact problem can be solved in just O(|A| n log n) time using Fast Fourier Transforms (FFTs), where n is the length of the larger sequence L and |A| is the size of the alphabet.
Here is a freely available PDF of a paper by Donald Benson describing how it works:
Fourier methods for biosequence analysis (Donald Benson, Nucleic Acids Research 1990 vol. 18, pp. 3001-3006)
Summary: Convert each of your strings S and L into several indicator vectors (one per character, so 4 in the case of DNA), and then convolve corresponding vectors to determine match counts for each possible alignment. The trick is that convolution in the "time" domain, which ordinarily requires O(n^2) time, can be implemented using multiplication in the "frequency" domain, which requires just O(n) time, plus the time required to convert between domains and back again. Using the FFT, each conversion takes just O(n log n) time, so the overall time complexity is O(|A| n log n). For greatest speed, finite field FFTs are used, which require only integer arithmetic.
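A compact NumPy sketch of that idea (using numpy's floating-point FFT rather than a finite-field FFT, which is fine for moderate lengths; the function name is mine):

import numpy as np

def min_hamming_fft(L, S, alphabet="ACGT"):
    n, m = len(L), len(S)
    size = 1 << (n + m).bit_length()   # FFT length: a power of two >= n+m
    matches = np.zeros(n - m + 1)
    for ch in alphabet:
        a = np.fft.rfft(np.array([c == ch for c in L], dtype=float), size)
        # reverse S so that convolution computes correlation (per-offset match counts)
        b = np.fft.rfft(np.array([c == ch for c in reversed(S)], dtype=float), size)
        conv = np.fft.irfft(a * b, size)
        # conv[m-1+i] = number of positions where S matches L at offset i
        matches += conv[m - 1 : m - 1 + (n - m + 1)]
    return int(m - np.rint(matches).max())

For the first example above, min_hamming_fft("ABCDEFGHIJ", "CDEFGG", alphabet="ABCDEFGHIJ") should return 1.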
Note: For arbitrary S and L this algorithm is clearly a huge performance win over the straightforward O(mn) algorithm as |S| and |L| become large, but OTOH if S is typically shorter than log|L| (e.g. when querying a large DB with a small sequence), then obviously this approach provides no speedup.
UPDATE 21/7/2009: Updated to mention that the time complexity also depends linearly on the size of the alphabet, since a separate pair of indicator vectors must be used for each character in the alphabet.
Modified Boyer-Moore
I've just dug up some old Python implementation of Boyer-Moore I had lying around and modified the matching loop (where the text is compared to the pattern). Instead of breaking out as soon as the first mismatch is found between the two strings, simply count up the number of mismatches, but remember the first mismatch:
current_dist = 0
while pattern_pos >= 0:
    if pattern[pattern_pos] != text[text_pos]:
        if first_mismatch == -1:
            first_mismatch = pattern_pos
            tp = text_pos
        current_dist += 1
        if current_dist == smallest_dist:
            break
    pattern_pos -= 1
    text_pos -= 1
smallest_dist = min(current_dist, smallest_dist)
# if the distance is 0, we've had a match and can quit
if current_dist == 0:
    return 0
else:  # shift
    pattern_pos = first_mismatch
    text_pos = tp
...
If the string did not match completely at this point, go back to the point of the first mismatch by restoring the values. This makes sure that the smallest distance is actually found.
The whole implementation is rather long (~150LOC), but I can post it on request. The core idea is outlined above, everything else is standard Boyer-Moore.
Preprocessing on the Text
Another way to speed things up is preprocessing the text to have an index on character positions. You only want to start comparing at positions where at least a single match between the two strings occurs, otherwise the Hamming distance is |S| trivially.
import sys
from collections import defaultdict
import bisect

def char_positions(t):
    pos = defaultdict(list)
    for idx, c in enumerate(t):
        pos[c].append(idx)
    return dict(pos)
This method simply creates a dictionary which maps each character in the text to the sorted list of its occurrences.
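For example, on a short DNA-like string:

char_positions("GATTACA")
# -> {'G': [0], 'A': [1, 4, 6], 'T': [2, 3], 'C': [5]}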
The comparison loop is more or less unchanged from the naive O(mn) approach, apart from the fact that we do not increase the position at which comparison is started by 1 each time, but based on the character positions:
def min_hamming(text, pattern):
    best = len(pattern)
    pos = char_positions(text)
    i = find_next_pos(pattern, pos, 0)
    while i <= len(text) - len(pattern):  # <= so the last alignment is checked too
        dist = 0
        for c in range(len(pattern)):
            if text[i+c] != pattern[c]:
                dist += 1
                if dist == best:
                    break
        else:
            if dist == 0:
                return 0
        best = min(dist, best)
        i = find_next_pos(pattern, pos, i + 1)
    return best
The actual improvement is in find_next_pos:
def find_next_pos(pattern, pos, i):
    smallest = sys.maxint
    for idx, c in enumerate(pattern):
        if c in pos:
            x = bisect.bisect_left(pos[c], i + idx)
            if x < len(pos[c]):
                smallest = min(smallest, pos[c][x] - idx)
    return smallest
For each new position, we find the lowest alignment index at which at least one character of S matches in L. If no such index exists anymore, the algorithm terminates.
find_next_pos is certainly expensive, and one could try to improve it by only using the first several characters of the pattern S, or by using a set to make sure characters from the pattern are not checked twice.
Discussion
Which method is faster largely depends on your dataset. The more diverse your alphabet is, the larger will be the jumps. If you have a very long L, the second method with preprocessing might be faster. For very, very short strings (like in your question), the naive approach will certainly be the fastest.
DNA
If you have a very small alphabet, you could try to get the character positions for character bigrams (or larger) rather than unigrams.
You're stuck as far as big-O is concerned: at a fundamental level, you're going to need to test whether every letter in the target matches each eligible letter in the substring.
Luckily, this is easily parallelized.
One optimization you can apply is to keep a running count of mismatches for the current position. If it's greater than the lowest hamming distance so far, then obviously you can skip to the next possibility.
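That optimization looks like this in Python (a sketch of the naive scan with the early cutoff; trivially parallelizable by splitting the range of i across workers):

def min_hamming_naive(L, S):
    best = len(S)
    for i in range(len(L) - len(S) + 1):
        dist = 0
        for a, b in zip(L[i:i + len(S)], S):
            if a != b:
                dist += 1
                if dist >= best:   # can't beat the best so far: skip ahead
                    break
        best = min(best, dist)
        if best == 0:
            return 0
    return best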
