Expected Behaviour of the algorithm
I have two strings a and b, with a being the shorter string. I would like to find the substring of b that is most similar to a. The substring must either have length len(a) or be a (possibly shorter) suffix of b.
e.g. for the following two strings:
a = "aa"
b = "bbaba"
the possible substrings of b would be
"bb"
"ba"
"ab"
"ba"
"a"
""
The edit distance is defined as the number of insertions and deletions. Substitutions are not possible (an insertion plus a deletion has to be used instead). The similarity between the two strings is calculated according to the following equation: norm = 1 - distance / (len(a) + len(substring)).
So the substrings above would provide the following results:
"bb" -> 2 DEL + 2 INS -> 1 - 4 / 4 = 0
"ba" -> 1 DEL + 1 INS -> 1 - 2 / 4 = 0.5
"ab" -> 1 DEL + 1 INS -> 1 - 2 / 4 = 0.5
"ba" -> 1 DEL + 1 INS -> 1 - 2 / 4 = 0.5
"a" -> 1 INS -> 1 - 1 / 3 = 0.67 (rounded)
"" -> 2 INS -> 1 - 2 / 2 = 0
So the algorithm should return 0.67 (i.e. 2/3).
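The expected behaviour above can be pinned down with a brute-force reference. This is only a sketch of the definition, not a fast algorithm: the names indel_distance and partial_similarity are made up for illustration, and the insertion/deletion distance is computed as len(a) + len(b) - 2 * LCS(a, b).

```python
def indel_distance(a, b):
    """Edit distance with insertions and deletions only (no substitutions)."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, 1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return len(a) + len(b) - 2 * prev[-1]


def partial_similarity(a, b):
    """Best normalized similarity of a against every window of b."""
    best = 0.0
    for start in range(len(b) + 1):
        # Windows have length len(a), or are shorter suffixes at the end of b.
        sub = b[start:start + len(a)]
        total = len(a) + len(sub)
        if total == 0:
            continue  # both strings empty, nothing to compare
        best = max(best, 1 - indel_distance(a, sub) / total)
    return best


print(partial_similarity("aa", "bbaba"))
```

For the example strings this evaluates the six windows listed above and keeps the best one, the suffix "a".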
Different implementations
A similar ratio is implemented by the Python library FuzzyWuzzy in the form of fuzz.partial_ratio. It calculates the ratio in two steps:
searches for matching subsequences in the longer sequence using difflib.SequenceMatcher.get_matching_blocks
calculates the ratio for substrings of len(shorter_string) starting at the matching subsequences and returns the maximum ratio
This is really slow, so it uses python-Levenshtein for the similarity calculation when it is available. That performs the same calculation based on the Levenshtein distance, which is faster. However, in edge cases the matching_blocks it calculates for the ratio are completely wrong (see issue 16), which makes it an unsuitable replacement when correctness matters.
Current implementation
I currently use a C++ port of difflib in combination with a fast bitparallel implementation of the Levenshtein distance with the weights insertion=1, deletion=1 and substitution=2. The current implementation can be found here:
extracting matching_blocks: matching_blocks
calculating weighted Levenshtein: weighted Levenshtein
combining them to calculate the end ratio: partial_ratio
Question
Is there a faster algorithm to calculate this kind of similarity? The requirements are:
only uses Insertion/Deletion (or gives substitutions the weight 2, which has a similar effect)
allows a gap at the beginning of the longer string
allows a gap at the end of the longer string, as long as the remaining substring does not become shorter than the shorter string
ideally it enforces that the substring has the same length as the shorter string (when it is not at the end), so that it matches the behaviour of FuzzyWuzzy, but it would be fine if longer substrings could be matched as well: e.g. for aaba:aaa this would mean that aaba may be used as the optimal substring instead of aab.
I am trying to solve this problem. It can be summarized as:
Given a sequence of integers, find the number of safe partitions, where a safe partition is defined as follows:
A safe partition is a partition into subsequences S1,S2,…,SK such that for each valid i, min(Si) ≤ |Si| ≤ max(Si). That is, for each subsequence in this partition, its length is greater than or equal to its smallest element and smaller than or equal to its largest element.
Ex:
Input => 1 6 2 3 4 3 4
Output => 6 partitions
[1],[6,2,3,4,3,4]
[1,6,2],[3,4,3,4]
[1,6,2,3],[4,3,4]
[1],[6,2],[3,4,3,4]
[1],[6,2,3],[4,3,4]
[1,6],[2,3],[4,3,4]
I can probably find a solution, including code, somewhere on the internet, but I am more interested in the approach to solving this problem, so I am asking here which points I am missing in my observations.
These are the things that come to mind when I read this problem:
If an element at index i extends a sequence safely, it is quite possible that it could also be the start of a new sequence. So at every element I am left with two choices: either it extends the current sequence or it does not.
So I think it can be represented mathematically as:
P(0..N) = 1 + P(i..N) + P(i+1..N), if A[i] is safe to extend the current partition
P(0..N) = 1 + P(i..N), if A[i] can't be used to extend it
where P is the partition function.
Is this reasoning valid? Am I missing something?
[I'm having trouble giving a direction without actually giving the solution, because once a person thinks in the right direction then the solution becomes evident. I'll try to highlight some facts which may put a person on the right track.]
Explicitly enumerating safe partitions is problematic, since there can be exponentially many of them (O(2^n)). For example, consider:
1,N,1,N,1,N ... [N elements]
In this sequence, every subsequence of length > 1 meets the criteria, and so does the subsequence [1]. The number of safe partitions of such a sequence of length n = 2k is 3^(k-1). To prove that by induction, let f(n) be the number of safe partitions of the first n elements and look at the following:
Base k = 1: f(1) = f(2) = 1
Step assumption: f(2k) = 3^(k-1) and f(2k-1) = f(2k).
f(2k+1) = f(2k) + f(2k-1) + ... + f(1) + 1 = f(2k+2)
f(2k+2) = (f(2k) + f(2k-1)) + (f(2k-2) + f(2k-3)) + ... + (f(2) + f(1)) + 1
= 2 * (f(2k) + f(2k-2) + ... + f(2)) + 1
= 2 * (3^(k-1) + 3^(k-2) + ... + 1) + 1
= 2 * (3^k - 1) / 2 + 1
= 3^k
Since enumeration is out of the question for any reasonable performance, the solution must somehow count without iterating over the partitions. Since the proof that 1,N,...,1,N has 3^(k-1) safe partitions did not have to enumerate them explicitly, its principles can be generalized to any sequence.
NOTES:
I have solved similar problems before, so the direction was clear to me. For this question I tried to break my thoughts into something manageable and came up with the thought about complexity. I had a very strong feeling that this is exponential even before writing it down and trying to prove it. This comes from experience and from seeing other problems. The complexity function felt worse than Fibonacci, because adding an element to a sequence seemed to add at least two terms of smaller sizes (similar to the Fibonacci sequence). Since Fibonacci is exponential, the 1,N,...,1,N partitioning must be exponential too. From there I went on and analyzed it with a recurrence relation.
The exact way I reached the solution matches my way of thought. Everybody has a different way of thought that works for them, and they need to develop and find it.
This is how I came to suspect that the number of safe partitions in the example was 3^(k-1):
I recursively calculated f(2k), with base condition f(1)=f(2)=1. Then for 3:
[1,N,1]
[1],[N,1]
[1,N],[1]
And for 4:
[1,N,1,N]
[1],[N,1,N]
[1,N],[1,N]
Meaning f(3)=f(4)=3. Then I recursively applied
f(2k+2)=2*(f(2k) + f(2k-2) + .. + f(2)) + 1
resulting in f(2)=1, f(4)=3, f(6)=9, f(8)=27. This looks suspiciously like 3^(k-1). Then I simply had to prove that by induction.
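For anyone who wants to check these counts by machine, here is a small quadratic counting sketch. It is only a brute-force check and probably not the efficient solution the problem expects; the name count_safe_partitions is made up. It counts by summing over the possible last blocks, exactly as the recurrence above does, without ever listing a partition.

```python
def count_safe_partitions(seq):
    """Count partitions into contiguous blocks with min <= length <= max."""
    n = len(seq)
    ways = [0] * (n + 1)
    ways[0] = 1  # the empty prefix has exactly one (empty) partition
    for end in range(1, n + 1):
        lo, hi = float("inf"), float("-inf")
        # Grow the last block seq[start:end] leftwards one element at a time.
        for start in range(end - 1, -1, -1):
            lo = min(lo, seq[start])
            hi = max(hi, seq[start])
            if lo <= end - start <= hi:
                ways[end] += ways[start]
    return ways[n]


print(count_safe_partitions([1, 6, 2, 3, 4, 3, 4]))
for k in range(1, 5):
    print(k, count_safe_partitions([1, 2 * k] * k))
```

The first call uses the example from the question, and the loop uses the 1,N,1,N,... sequences of length 2k with N = 2k.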
According to wikipedia, the definition of the recursive formula which calculates the Levenshtein distance between two strings a and b is the following:
I don't understand why we don't take into consideration the cases in which we delete a[j], or we insert b[i]. Also, correct me if I am wrong, isn't the case of insertion the same as the case of the deletion? I mean, instead of deleting a character from one string, we could insert the same character in the second string, and the opposite. So why not merge the insert/delete operations into one operation with cost equal to min{cost_insert, cost_delete}?
This is not done, because you are not allowed to edit both strings. The definition of the edit distance (from wikipedia) is:
the minimum-weight series of edit operations that transforms a into b.
So you are specifically looking for (the weight of) a sequence of operations to execute on the string a to transform it into string b.
Also, the edit distance is not necessarily symmetric. If your costs for inserts and deletions are identical the distance is symmetric: d(a,b) = d(b,a)
Consider the wikipedia example but with different costs:
costs for insertions: w_ins = 1
costs for deletions: w_del = 2
costs for substitutions: w_sub = 1
The distance of kitten and sitting still is 3,
kitten -> sitten (substitution k->s, cost 1)
sitten -> sittin (substitution e->i, cost 1)
sittin -> sitting (insertion of g, cost 1)
=> d(kitten, sitting) = 3
but the distance of sitting and kitten is not:
sitting -> kitting (substitution s->k, cost 1)
kitting -> kitteng (substitution i->e, cost 1)
kitteng -> kitten (deletion of g, cost 2)
=> d(sitting, kitten) = 4
You see that d(kitten, sitting) != d(sitting, kitten).
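The asymmetry can be reproduced with a short dynamic-programming sketch. The function name edit_distance and its keyword arguments are my own choice for illustration; the weights are the ones from the example above.

```python
def edit_distance(a, b, w_ins=1, w_del=1, w_sub=1):
    """Minimum total weight of operations applied to a that turn it into b."""
    # prev[j] = cost of turning the already processed prefix of a into b[:j].
    prev = [j * w_ins for j in range(len(b) + 1)]
    for i, ch_a in enumerate(a, 1):
        cur = [i * w_del]  # delete everything to reach the empty string
        for j, ch_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + w_del,                            # delete ch_a
                cur[j - 1] + w_ins,                         # insert ch_b
                prev[j - 1] + (0 if ch_a == ch_b else w_sub),  # match / substitute
            ))
        prev = cur
    return prev[-1]


print(edit_distance("kitten", "sitting", w_ins=1, w_del=2, w_sub=1))
print(edit_distance("sitting", "kitten", w_ins=1, w_del=2, w_sub=1))
```

With unit weights both calls give the same value, which is the symmetric case.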
On the other hand, if you do use symmetric costs, as the Levenshtein distance (which is an edit distance with unit costs) does, you can assume that d(a,b) = d(b,a) holds. Then you do not gain anything by also considering the inverse cases. What you lose is the information about which character has been replaced in which string, which makes it harder to extract the sequence of operations afterwards.
The Wagner-Fischer algorithm which you are showing in your question can extract this from the DP matrix by backtracking the path with minimal costs. Consider these two edit matrices between to and foo with unit costs:
    t  o          f  o  o
 f  1  2       t  1  2  3
 o  2  1       o  2  1  2
 o  3  2
Note that if you transpose the matrix for d(to, foo) you get the matrix for d(foo, to). Note that by this, an insertion in the first matrix becomes a deletion in the second matrix and vice versa. So this is where this symmetry you are looking for is coming up again.
I hope this helps :)
If the costs of insertions and deletions differ, inserting in one string isn't the same as deleting from the other. Even the cost of a substitution could differ from the cost of insertion+deletion and it is wise to keep them separate.
The problem is usually asymmetric: you have a list of valid strings, and you want to match to another that has errors in it.
Say S = 5 and N = 3; the solutions would then look like <0,0,5> <0,1,4> <0,2,3> <0,3,2> <5,0,0> <2,3,0> <3,2,0> <1,2,2> etc.
In the general case, N nested loops can be used to solve the problem. Run N nested loops, and inside them check whether the loop variables add up to S.
If we do not know N ahead of time, we can use a recursive solution. At each level, run a loop from 0 to S, and then call the function itself again. When we reach a depth of N, check whether the numbers obtained add up to S.
Any other dynamic programming solution?
Try this recursive function:
f(s, n) = 1                                       if s = 0
        = 0                                       if s != 0 and n = 0
        = sum f(s - i, n - 1) over i in [0, s]    otherwise
To use dynamic programming you can cache the value of f after evaluating it, and check if the value already exists in the cache before evaluating it.
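A minimal sketch of that cached recursion, assuming Python and using functools.lru_cache as the cache (the name count_ways is made up):

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_ways(s, n):
    """Number of ordered n-tuples of non-negative integers that sum to s."""
    if s == 0:
        return 1  # only the all-zero tuple remains
    if n == 0:
        return 0  # a positive sum is left but no numbers to place
    # Choose the value i of the current number, then solve the rest.
    return sum(count_ways(s - i, n - 1) for i in range(s + 1))


print(count_ways(5, 3))
```

Each (s, n) pair is evaluated once, so there are at most (S+1)*(N+1) cached states.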
There is a closed-form formula: binomial(s + n - 1, s), or equivalently binomial(s + n - 1, n - 1).
Those numbers are the simplex numbers.
If you want to compute them, use the log gamma function or arbitrary precision arithmetic.
See https://math.stackexchange.com/questions/2455/geometric-proof-of-the-formula-for-simplex-numbers
I have my own formula for this. My friend Gio and I wrote an investigative report about it. The formula we arrived at is 2^(n-1) - 1, where n is the number whose ways of being written as a sum we are counting.
Let's try.
If n is 1, the count is 0: there are no two or more numbers that we can add to get a sum of 1 (excluding 0). Let's try a higher number.
Let's try 4. 4 can be written as: 1+1+1+1, 1+2+1, 1+1+2, 2+1+1, 1+3, 2+2, 3+1. That is 7 in total.
Let's check with the formula: 2^(4-1) - 1 = 2^3 - 1 = 8 - 1 = 7.
Let's try 15: 2^(15-1) - 1 = 2^14 - 1 = 16384 - 1 = 16383. Therefore, there are 16383 ways to add numbers that will equal 15.
(Note: Addends are positive numbers only.)
(You can try other numbers, to check whether our formula is correct or not.)
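Here is one way to try other numbers by brute force. Note that this counts ordered sums of any number (two or more) of positive addends, which is a different count from the fixed-N question above. The helper names are made up for this sketch.

```python
def compositions(n):
    """Yield every ordered tuple of positive integers that sums to n."""
    if n == 0:
        yield ()
        return
    for first in range(1, n + 1):
        for rest in compositions(n - first):
            yield (first,) + rest


def count_sums_with_two_or_more_addends(n):
    return sum(1 for c in compositions(n) if len(c) >= 2)


for n in range(1, 11):
    print(n, count_sums_with_two_or_more_addends(n), 2 ** (n - 1) - 1)
```

The enumeration is exponential, so it is only meant for small n.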
This can be calculated in O(s+n) (or O(1) if you don't mind an approximation) in the following way:
Imagine we have a string with n-1 X's in it and s o's. So for your example of s=5, n=3, one example string would be
oXooXoo
Notice that the X's divide the o's into three distinct groupings: one of length 1, length 2, and length 2. This corresponds to your solution of <1,2,2>. Every possible string gives us a different solution, by counting the number of o's in a row (a 0 is possible: for example, XoooooX would correspond to <0,5,0>). So by counting the number of possible strings of this form, we get the answer to your question.
There are s+(n-1) positions to choose for s o's, so the answer is Choose(s+n-1, s).
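A quick sketch that compares this formula with a plain enumeration, assuming n >= 1 and Python 3.8+ for math.comb (the function names are made up):

```python
from itertools import product
from math import comb


def count_closed_form(s, n):
    """Choose(s + n - 1, s): place s o's among s + n - 1 positions."""
    return comb(s + n - 1, s)


def count_brute_force(s, n):
    """Enumerate all n-tuples over 0..s and keep those summing to s."""
    return sum(1 for t in product(range(s + 1), repeat=n) if sum(t) == s)


print(count_closed_form(5, 3), count_brute_force(5, 3))
```

The brute force is only there as a cross-check; it visits (s+1)^n tuples.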
There is a fixed formula for the answer. If you want to find the number of ways to get N as the sum of R elements, the answer is always:
(N+R-1)!/((R-1)!*(N)!)
or in other words:
(N+R-1) C (R-1)
This actually looks a lot like a Towers of Hanoi problem, without the constraint of stacking disks only on larger disks. You have S disks that can be in any combination on N towers. So that's what got me thinking about it.
What I suspect is that there is a formula we can deduce that doesn't require the recursive programming. I'll need a bit more time though.
How can I generate the shortest sequence which contains all possible permutations?
Example:
For length 2 the answer is 121, because this list contains 12 and 21, which are all possible permutations.
For length 3 the answer is 123121321, because this list contains all possible permutations:
123, 231, 312, 121 (invalid), 213, 132, 321.
Each number (within a given permutation) may only occur once.
This greedy algorithm produces fairly short minimal sequences.
UPDATE: Note that for n ≥ 6, this algorithm does not produce the shortest possible string!
Make a collection of all permutations.
Remove the first permutation from the collection.
Let a = the first permutation.
Find the sequence in the collection that has the greatest overlap with the end of a. If there is a tie, choose the sequence that is first in lexicographic order. Remove the chosen sequence from the collection and add the non-overlapping part to the end of a. Repeat this step until the collection is empty.
The curious tie-breaking step is necessary for correctness; breaking the tie at random instead seems to result in longer strings.
I verified (by writing a much longer, slower program) that the answer this algorithm gives for length 4, 123412314231243121342132413214321, is indeed the shortest answer. However, for length 6 it produces an answer of length 873, which is longer than the shortest known solution.
The algorithm is O(n!^2).
An implementation in Python:
import itertools

def costToAdd(a, b):
    """Number of characters to append to a so that it ends with b."""
    for i in range(1, len(b)):
        if a.endswith(b[:-i]):
            return i
    return len(b)

def stringContainingAllPermutationsOf(s):
    perms = set(''.join(tpl) for tpl in itertools.permutations(s))
    perms.remove(s)
    a = s
    while perms:
        # Cheapest extension first; the tuple comparison breaks ties lexicographically.
        cost, best = min((costToAdd(a, x), x) for x in perms)
        perms.remove(best)
        a += best[-cost:]
    return a
The lengths of the strings generated by this function are 1, 3, 9, 33, 153, 873, 5913, ... which appears to be OEIS sequence A007489 (the sum of k! for k = 1..n).
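To sanity-check any candidate string, a small checker can help. This is just a sketch; the name contains_all_permutations is made up and it tests plain substring containment.

```python
from itertools import permutations


def contains_all_permutations(text, symbols):
    """True if every permutation of symbols occurs as a substring of text."""
    return all(''.join(p) in text for p in permutations(symbols))


print(contains_all_permutations("123121321", "123"))
print(contains_all_permutations("123412314231243121342132413214321", "1234"))
```

It runs over all n! permutations, so it is only practical for small n.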
I have a hunch you can do better than O(n!^2).
Create all permutations.
Let each permutation represent a node in a graph.
Now, for any two nodes add an edge with weight 1 if they share n-1 digits (for the source counted from the end, and for the target from the start), weight 2 if they share n-2 digits, and so on.
Now you are left to find the shortest path that visits all n! vertices.
Here is a fast algorithm that produces a short string containing all permutations. I am pretty sure it produces the shortest possible answer, but I don't have a complete proof in hand.
Explanation. Below is a tree of All Permutations. The picture is incomplete; imagine that the tree goes on forever to the right.
1 --+-- 12 --+-- 123 ...
    |        |
    |        +-- 231 ...
    |        |
    |        +-- 312 ...
    |
    +-- 21 --+-- 213 ...
             |
             +-- 132 ...
             |
             +-- 321 ...
The nodes at level k of this tree are all the permutations of length k. Furthermore, the permutations are in a particular order with a lot of overlap between each permutation and its neighbors above and below.
To be precise, each node's first child is found by simply adding the next symbol to the end. For example, the first child of 213 would be 2134. The rest of the children are found by rotating the first child to the left one symbol at a time. Rotating 2134 would produce 1342, 3421, 4213.
Taking all the nodes at a given level and stringing them together, overlapping as much as possible, produces the strings 1, 121, 123121321, etc.
The length of the nth string in that sequence is the sum for x=1 to n of x!. (You can prove this by observing how much non-overlap there is between neighboring permutations. Siblings overlap in all but 1 symbol; first-cousins overlap in all but 2 symbols; and so on.)
Sketch of proof. I haven't completely proved that this is the best solution, but here's a sketch of how the proof would proceed. First show that any string containing n distinct permutations has length ≥ 2n - 1. Then show that any string containing n+1 distinct permutations has length ≥ 2n + 1. That is, adding one more permutation will cost you two digits. Proceed by calculating the minimum length of strings containing nPr and nPr + 1 distinct permutations, up to n!. In short, this sequence is optimal because you can't make it worse somewhere in the hope of making it better someplace else. It's already locally optimal everywhere. All the moves are forced.
Algorithm. Given all this background, the algorithm is very simple. Walk this tree to the desired depth and string together all the nodes at that depth.
Fortunately we do not actually have to build the tree in memory.
def build(node, s):
    """String together all descendants of the given node at the target depth."""
    d = len(node)  # depth of this node. depth of "213" is 3.
    n = len(s)     # target depth
    if d == n - 1:
        return node + s[n - 1] + node  # children of 213 join to make "2134213"
    else:
        c0 = node + s[d]  # first child node
        children = [c0[i:] + c0[:i] for i in range(d + 1)]  # all child nodes
        strings = [build(c, s) for c in children]  # recurse to the desired depth
        for j in range(1, d + 1):
            strings[j] = strings[j][d:]  # cut off overlap with previous sibling
        return ''.join(strings)  # join what's left

def stringContainingAllPermutationsOf(s):
    if len(s) <= 1:
        return s  # build() needs at least two symbols
    return build(s[:1], s)
Performance. The above code is already much faster than my other solution, and it does a lot of cutting and pasting of large strings that you can optimize away. The algorithm can be made to run in time and memory proportional to the size of the output.
For n = 3 the length can be 8:
12312132
It seems to me that we are working with a cyclic system, in other words a ring, but we are treating it as a chain. As a chain the answer really is 123121321, of length 9.
But as a ring it is 12312132, of length 8.
The last 1 needed for 321 is taken from the beginning of the sequence: 12312132[1].
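The ring reading can be checked with a few lines. This is a sketch with a made-up function name; it treats the string as cyclic by appending its first n-1 symbols before searching.

```python
from itertools import permutations


def ring_contains_all_permutations(ring, symbols):
    """Check containment when the string is read cyclically."""
    n = len(symbols)
    # Unroll the ring far enough for a window that wraps around the end.
    unrolled = ring + ring[:n - 1]
    return all(''.join(p) in unrolled for p in permutations(symbols))


print(ring_contains_all_permutations("12312132", "123"))
```

Read as a plain chain, the same 8 symbols do not contain 321, which is why the chain needs the ninth symbol.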
These are called (minimal length) superpermutations (cf. Wikipedia).
Interest in this was rekindled when an anonymous user posted a new lower bound on 4chan. (See Wikipedia and many other web pages for the history.)
AFAIK, as of today we just know:
Their length is A180632(n) ≤ A007489(n) = Sum_{k=1..n} k! but this bound is only sharp for n ≤ 5, i.e., we have equality for n ≤ 5 but strictly less for n > 5.
There's a very simple recursive algorithm, given below, producing a superpermutation of length A007489(n), which is always palindromic (but as said above this is not the minimal length for n > 5).
For n ≥ 7 we have the better upper bound n! + (n−1)! + (n−2)! + (n−3)! + n − 3.
For n ≤ 5 all minimal SP's are known; and for all n > 5 we don't know which is the minimal SP.
For n = 1, 2, 3, 4 the minimal SP's are unique (up to changing the symbols), given by (1, 121, 123121321, 123412314231243121342132413214321) of length A007489(1..4) = (1, 3, 9, 33).
For n = 5 there are 8 inequivalent ones of minimal length 153 = A007489(5); the palindromic one produced by the algorithm below is the 3rd in lexicographic order.
For n = 6 Houston produced thousands of the smallest known length 872 = A007489(6) - 1, but AFAIK we still don't know whether this is minimal.
For n = 7 Egan produced one of length 5906 (two less than the better upper bound given above, which evaluates to 5908) but again we don't know whether that's minimal.
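The numbers quoted in this list are easy to recompute. The sketch below just evaluates the two formulas stated above; the function names are made up.

```python
from math import factorial


def sum_of_factorials(n):
    """A007489(n) = 1! + 2! + ... + n!."""
    return sum(factorial(k) for k in range(1, n + 1))


def improved_upper_bound(n):
    """n! + (n-1)! + (n-2)! + (n-3)! + n - 3, stated above for n >= 7."""
    return (factorial(n) + factorial(n - 1) + factorial(n - 2)
            + factorial(n - 3) + n - 3)


print([sum_of_factorials(n) for n in range(1, 8)])
print(improved_upper_bound(7))
```

For n = 7 the second formula gives 5908, compared with 5913 for the sum of factorials.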
I've written a very short PARI/GP program (you can paste to run it on the PARI/GP web site) which implements the standard algorithm producing a palindromic superpermutation of length A007489(n):
extend(S,n=vecmax(s))={ my(t); concat([
  if(#Set(s)<n, [], /* discard if not a permutation */
    s=concat([s, n+1, s]); /* Now merge with preceding segment: */
    forstep(i=min(#s, #t)-1, 0, -1,
      if(s[1..1+i]==t[#t-i..#t], s=s[2+i..-1]; break));
    t=s /* store as previous for next */
  )/*endif*/
  | s <- [ S[i+1..i+n] | i <- [0..#S-n] ]])
}
SSP=vector(6, n, s=if(n>1, extend(s), [1])); \\ gives the first 6, the 6th being non-minimal
I think that easily translates to any other language. (For non-PARI speaking persons: "| x <-" means "for x in".)
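For instance, a rough Python translation might look like the sketch below. It is my reading of the PARI/GP code above rather than a line-by-line port, and the names extend and palindromic_superpermutation are chosen for illustration. Superpermutations are represented as lists of integers.

```python
def extend(sp, n):
    """Turn a superpermutation sp on 1..n into one on 1..n+1."""
    result = []
    prev = []  # previous segment, already trimmed (plays the role of t)
    for i in range(len(sp) - n + 1):
        window = sp[i:i + n]
        if len(set(window)) < n:
            continue  # discard windows that are not permutations
        seg = window + [n + 1] + window
        # Merge with the preceding segment: drop the longest prefix of seg
        # that is also a suffix of prev.
        for k in range(min(len(seg), len(prev)), 0, -1):
            if seg[:k] == prev[-k:]:
                seg = seg[k:]
                break
        prev = seg
        result.extend(seg)
    return result


def palindromic_superpermutation(n):
    sp = [1]
    for m in range(1, n):
        sp = extend(sp, m)
    return sp


for n in range(1, 5):
    print(''.join(map(str, palindromic_superpermutation(n))))
```

For n up to 4 this prints the unique minimal strings listed earlier in this answer.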