Expected Behaviour of the algorithm
I have two strings a and b, with a being the shorter string. I would like to find the substring of b that is most similar to a. The substring must have length len(a), or it must sit at the end of b (in which case it may be shorter).
e.g. for the following two strings:
a = "aa"
b = "bbaba"
the possible substrings of b would be
"bb"
"ba"
"ab"
"ba"
"a"
""
The edit distance is defined as the number of insertions and deletions. Substitutions are not allowed (an insertion plus a deletion has to be used instead). The similarity between the two strings is calculated according to the following equation: norm = 1 - distance / (len(a) + len(substring)).
So the substrings above would provide the following results:
"bb" -> 2 DEL + 2 INS -> 1 - 4 / 4 = 0
"ba" -> 1 DEL + 1 INS -> 1 - 2 / 4 = 0.5
"ab" -> 1 DEL + 1 INS -> 1 - 2 / 4 = 0.5
"ba" -> 1 DEL + 1 INS -> 1 - 2 / 4 = 0.5
"a" -> 1 INS -> 1 - 1 / 3 = 0.66
"" -> 2 INS -> 1 - 2 / 2 = 0
So the algorithm should return 2/3 ≈ 0.67.
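For reference, here is a small brute-force implementation in Python of the behaviour described above (a sketch, assuming len(a) <= len(b); it uses the identity indel_distance = len(a) + len(s) - 2 * LCS(a, s), which holds when substitutions are forbidden):

def lcs_len(a, b):
    # classic O(len(a) * len(b)) DP for the longest common subsequence length
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def best_norm(a, b):
    n, m = len(a), len(b)
    candidates = [b[i:i + n] for i in range(m - n + 1)]     # every window of len(a)
    candidates += [b[i:] for i in range(m - n + 1, m + 1)]  # the shorter suffixes
    best = 0.0
    for s in candidates:
        dist = n + len(s) - 2 * lcs_len(a, s)   # indel distance, no substitutions
        best = max(best, 1 - dist / (n + len(s)))
    return best

print(best_norm("aa", "bbaba"))  # -> 0.666...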
Different implementations
A similar ratio is implemented by the Python library FuzzyWuzzy in the form of fuzz.partial_ratio. It calculates the ratio in two steps:
searches for matching subsequences in the longer sequence using difflib.SequenceMatcher.get_matching_blocks
calculates the ratio for substrings of len(shorter_string) starting at the matching subsequences and returns the maximum ratio
This is really slow, so FuzzyWuzzy uses python-Levenshtein for this similarity calculation when it is available. python-Levenshtein performs the same calculation based on the Levenshtein distance, which is faster. However, in edge cases the matching_blocks it computes for the ratio calculation are completely wrong (see issue 16), which makes it an unsuitable replacement when correctness is relevant.
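Condensed, the two-step procedure looks roughly like this in Python (a simplified sketch of the idea, not FuzzyWuzzy's exact code; it returns the 0..1 ratio instead of the scaled 0..100 score):

from difflib import SequenceMatcher

def partial_ratio(shorter, longer):
    # step 1: matching blocks between the two strings
    blocks = SequenceMatcher(None, shorter, longer).get_matching_blocks()
    best = 0.0
    for i, j, _ in blocks:
        # step 2: score the len(shorter)-wide window aligned with each block
        start = max(j - i, 0)
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return best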
Current implementation
I currently use a C++ port of difflib in combination with a fast bit-parallel implementation of the Levenshtein distance with the weights insertion = 1, deletion = 1 and substitution = 2. The current implementation can be found here:
extracting matching_blocks: matching_blocks
calculating weighted Levenshtein: weighted Levenshtein
combining them to calculate the end ratio: partial_ratio
Question
Is there a faster algorithm to calculate this kind of similarity? Requirements:
only uses insertions/deletions (or gives substitutions the weight 2, which has a similar effect)
allows a gap at the beginning of the longer string
allows a gap at the end of the longer string, as long as the remaining substring does not become shorter than the shorter string
optimally it enforces that the substring has a similar length (when it is not at the end), so it matches the behaviour of FuzzyWuzzy, but it would be fine if longer subsequences could be matched as well: e.g. for aaba:aaa this would mean that it is allowed to use aaba as the optimal subsequence instead of aab.
Related
Problem
The task is to find the substring of a given binary string with the highest score. The substring must be at least the given minimum length.
score = number of 1s / substring length, where the score ranges from 0 to 1.
Inputs:
1. min length of substring
2. binary sequence
Outputs:
1. index of first char of substring
2. index of last char of substring
Example 1:
input
-----
5
01010101111100
output
------
7
11
explanation
-----------
1. start with minimum window = 5
2. start_ind = 0, end_index = 4, score = 2/5 (0.4)
3. start_ind = 1, end_index = 5, score = 3/5 (0.6)
4. and so on...
5. start_ind = 7, end_index = 11, score = 5/5 (1) [max possible]
Example 2:
input
-----
5
10110011100
output
------
2
8
explanation
-----------
1. while calculating all scores for windows 5 to len(sequence)
2. max score occurs in the case: start_ind=2, end_ind=8, score=5/7 (0.7143) [max possible]
Example 3:
input
-----
4
00110011100
output
------
5
8
What I attempted
The only technique I could come up with was a brute-force approach with nested for loops:
for window_size in (min to max):
    for ind in (0 to end):
        calculate score
        save max score
Can someone suggest a better algorithm to solve this problem?
There are a few observations to make before we start talking about an algorithm; some of them have already been pointed out in the comments.
Maths
Take the minimum length to be M, the length of the entire string to be L, and a substring from the ith char to the jth char (inclusive-exclusive) to be S[i:j].
All optimal substrings will satisfy at least one of two conditions:
It is exactly M characters in length
It starts and ends with a 1 character
The reason for the latter: if a substring longer than M characters started or ended with a 0, we could just drop that 0, resulting in a higher ratio.
In the same spirit (again, for the 2nd case), there exists an optimal substring which is not preceded by a 1. Otherwise, if it were, we could include that 1, resulting in an equal or higher ratio. The same logic applies to the end of S and a following 1.
Building on the above- such a substring being preceded or followed by another 1 will NOT be optimal, unless the substring contains no 0s. In the case where it doesn't contain 0s, there will exist an optimal substring of length M as well anyways.
Again, that all only applies to the length greater than M case substrings.
Finally, there exists an optimal substring that has length at least M (by definition) and at most 2 * M - 1. Suppose an optimal substring had length K >= 2 * M. We could split it into two substrings of length floor(K/2) and ceil(K/2), namely S[i:i+floor(K/2)] and S[i+floor(K/2):i+K], both of length at least M. If the substring has the score (ratio) R, and its halves R0 and R1, we would have one of two scenarios:
R = R0 = R1, meaning we could pick either half and get the same score as the combined substring, giving us a shorter substring.
If this substring has length less than 2 * M, we are done- we have an optimal substring of length [M, 2*M).
Otherwise, recurse on the new substring.
R0 != R1, so (without loss of generality) R0 < R < R1, meaning the combined substring would not be optimal in the first place: the better half already beats it.
Note that I say "there exists an optimal" as opposed to "the optimal". This is because there may be multiple optimal solutions, and the observations above may refer to different instances.
Algorithm
You could search every window size [M, 2*M) at every offset, which would already be better than a full search for small M. You can also try a two-phase approach:
search every M sized window, find the max score
search from the beginning of every run of 1s forward through a special list of ends of runs of 1s, implicitly skipping over 0s and irrelevant 1s, breaking when out of the [M, 2 * M) bound.
For random data, I only expect this to save a small factor, e.g. skipping 15/16 of the windows (ignoring the added overhead). For less random data you could potentially see huge benefits, particularly if there are LOTS of LARGE runs of 1s and 0s.
The biggest speedup you'll be able to do (besides limiting the window max to 2 * M) is computing a cumulative sum of the bit array. This lets you query "how many 1s were seen up to this point". You can then take the difference of two elements in this array to query "how many 1s occurred between these offsets" in constant time. This allows for very quick calculation of the score.
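A minimal sketch in Python combining the prefix sum with the [M, 2*M) window bound (the brute force within that bound, not the run-skipping phase):

def best_substring(bits, m):
    # prefix[i] = number of 1s in bits[:i]
    prefix = [0]
    for c in bits:
        prefix.append(prefix[-1] + (c == '1'))
    n = len(bits)
    best = (-1.0, 0, 0)
    # by the argument above, some optimal window has length in [m, 2*m)
    for w in range(m, min(2 * m, n + 1)):
        for i in range(n - w + 1):
            score = (prefix[i + w] - prefix[i]) / w
            if score > best[0]:
                best = (score, i, i + w - 1)
    return best  # (score, first index, last index)

print(best_substring("01010101111100", 5))  # -> (1.0, 7, 11)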
You can use a two-pointer method, starting from the left-most and right-most ends, then shrink the window while searching for the highest score.
Adding a cache (memoization) avoids recomputing overlapping subproblems.
Example (Python):
binary = "01010101111100"
length = 5

def get_score(binary, left, right):
    # fraction of 1s in binary[left..right] (inclusive)
    ones = 0
    for i in range(left, right + 1):
        if binary[i] == "1":
            ones += 1
    return ones / (right - left + 1)

cache = {}

def get_sub(binary, length, left, right):
    # returns [best score, set of (left, right) windows achieving it]
    # over all windows of size >= length inside [left, right]
    if (left, right) in cache:
        return cache[(left, right)]
    table = [0, set()]
    if right - left + 1 >= length:
        scores = [[get_score(binary, left, right), {(left, right)}],
                  get_sub(binary, length, left + 1, right),
                  get_sub(binary, length, left, right - 1),
                  get_sub(binary, length, left + 1, right - 1)]
        for s in scores:
            if s[0] > table[0]:
                table[0] = s[0]
                table[1] = set(s[1])   # copy, so cached child sets stay intact
            elif s[0] == table[0]:
                table[1] |= s[1]
    cache[(left, right)] = table
    return table

result = get_sub(binary, length, 0, len(binary) - 1)
print("Score: %f" % result[0])
print("Index: %s" % result[1])
Output
Score: 1.000000
Index: {(7, 11)}
I have a question about matching. It is stated as follows.
I have a sequence which consists of two characters, {l,h}.
The character l can be mapped to a number from {1,2,3}. (l stands for low)
The character h can be mapped to a number from {4,5,6}. (h stands for high)
For example, I have a sequence (I call it the original sequence) of length 6: [h,l,h,l,h,l].
This sequence can be transformed into a detailed sequence by the above mapping rules. One possible detailed sequence is [6,1,5,2,4,3]. For a sequence of length 6, there are 3^6 detailed sequences.
I obtain a difference sequence from a detailed sequence by computing the pairwise differences. For example, if my detailed sequence is [6,1,5,2,4,3], then the corresponding difference sequence is [6-1, 1-5, 5-2, 2-4, 4-3] = [5,-4,3,-2,1]. Hence, the largest possible entry of a difference sequence is 5 (resulting from 6 minus 1) and the smallest is -5 (resulting from 1 minus 6).
Now, I have a database consisting of m difference sequences of length 5.
My query sequence is an original sequence of length 6. I want to find:
Among the m difference sequences, which ones have a corresponding original sequence that can be my query sequence. If none exist, the program returns the empty set. If some exist, the program returns the set consisting of them.
For example, the difference sequence [5,-4,3,-2,1] has the corresponding original sequence [h,l,h,l,h,l]. Hence, if my query sequence is [h,l,h,l,h,l], then [5,-4,3,-2,1] will be in the returned set if it is in the database.
For my real problem, the query sequence has length = n. And the database consists of m difference sequences with length = n-1.
The brute force method can be as follows:
For the input original sequence, enumerate its 3^n detailed sequences and compute the 3^n difference sequences. For each of the difference sequences, check whether it exists in the database.
The brute force takes O(3^n) time. I know that this exponential running time is not good.
I want to have a faster algorithm. An approximation algorithm also looks good to me.
Thank you so much.
Each difference sequence can be created from at most four h-l sequences.
So pre-process your database to build an index from h-l sequences to the difference sequences you're interested in.
For example:
[1, -1, 0, 2, -1]
This must come from one of these sequences:
[3 2 3 3 1 2] -> l l l l l l
[4 3 4 4 2 3] -> h l h h l l
[5 4 5 5 3 4] -> h h h h l h
[6 5 6 6 4 5] -> h h h h h h
You can generate them all by trying each starting value from 1 to 6 and applying the differences. Sometimes the sequence will underflow or overflow the legal range 1..6, and then you can discard that possibility.
Building this index is O(mn) time, and uses an additional O(mn) space. After you've built it, you can efficiently query a specific h-l sequence using the index.
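A sketch of the pre-processing in Python (the function names are made up for illustration; the database is assumed to be a list of difference tuples):

from collections import defaultdict

def hl_patterns(diff):
    # try each starting value 1..6, apply the differences
    # (x[i+1] = x[i] - diff[i]), and discard out-of-range sequences
    patterns = set()
    for start in range(1, 7):
        seq = [start]
        for d in diff:
            seq.append(seq[-1] - d)
        if all(1 <= x <= 6 for x in seq):
            patterns.add(tuple('h' if x >= 4 else 'l' for x in seq))
    return patterns

def build_index(database):
    index = defaultdict(list)
    for diff in database:
        for pattern in hl_patterns(diff):
            index[pattern].append(diff)
    return index

index = build_index([(5, -4, 3, -2, 1), (1, -1, 0, 2, -1)])
print(index[('h', 'l', 'h', 'l', 'h', 'l')])   # -> [(5, -4, 3, -2, 1)]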
I have a string that needs to be compressed by a dictionary compression algorithm. If a substring is found in the dictionary, it is encoded with cost 2. If no match is found, the cost is the size of the substring. Given a fixed dictionary and a string, how can I choose the best substrings from the dictionary, resulting in the minimum cost?
For example, consider the string ABBBBBCD and the following dictionary:
entry 1 - ABBB
entry 2 - BBCD
entry 3 - BBBBB
entry 4 - ABBBB
entry 5 - CD
The best solution is to choose ABBB and BBCD, resulting in cost 2 + 2 = 4.
If I choose A, BBBBB, C and D, the cost would be 1 + 2 + 1 + 1 = 5, which is worse than the first.
Likewise, if I choose ABBBB, B, CD, the cost will be 2 + 1 + 2 = 5.
After the explanations, my question is: is there a known algorithm that solves this problem? Or is there some known algorithm that could be modified so that I can solve the problem without brute force?
Please, ask me if something is not clear.
You can formulate and solve it as a shortest path problem.
Create a graph with each index as a vertex (including one vertex past the last character). Now add a directed edge from i to j (i < j) with weight 2 whenever the substring between positions i and j is in the dictionary, and an edge from i to i+1 with weight 1 for encoding a single character without the dictionary.
Now find the shortest path from the first vertex to the last. Since the graph is a DAG, this can be done in time linear in the number of edges. (See: http://www.geeksforgeeks.org/shortest-path-for-directed-acyclic-graphs/ )
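Concretely, because every edge goes forward, the shortest path collapses to a one-pass DP; a sketch in Python (assuming the dictionary fits into a set):

def min_cost(s, dictionary):
    # dp[i] = minimum cost to encode the prefix s[:i]
    n = len(s)
    INF = float('inf')
    dp = [0] + [INF] * n
    words = set(dictionary)
    longest = max(map(len, words)) if words else 0
    for i in range(n):
        if dp[i] == INF:
            continue
        # raw character: edge i -> i+1 with weight 1
        dp[i + 1] = min(dp[i + 1], dp[i] + 1)
        # dictionary entry: edge i -> j with weight 2
        for j in range(i + 1, min(i + longest, n) + 1):
            if s[i:j] in words:
                dp[j] = min(dp[j], dp[i] + 2)
    return dp[n]

print(min_cost("ABBBBBCD", ["ABBB", "BBCD", "BBBBB", "ABBBB", "CD"]))  # -> 4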
Is there a fast algorithm solving the following problem?
And is there also one for the extended version of this problem,
where the natural numbers are replaced by Z/(2^n Z)? (This problem was too complex to ask both in one place, IMO.)
Problem:
For a given set of natural numbers like {7, 20, 17, 100}, the required algorithm
returns the shortest sequence of additions, multiplications and powers that computes
all of the given numbers.
Each item of the sequence is a (correct) equation that matches the following pattern:
<number> = <number> <op> <number>
where <number> is a natural number and <op> is one of {+, *, ^}.
In the sequence, each operand of <op> must be one of:
1
numbers which have already appeared on the left-hand side of an equation.
Example:
Input: {7, 20, 17, 100}
Output:
2 = 1 + 1
3 = 1 + 2
6 = 2 * 3
7 = 1 + 6
10 = 3 + 7
17 = 7 + 10
20 = 2 * 10
100 = 10 ^ 2
I wrote a backtracking algorithm in Haskell.
It works for small inputs like the above, but my real query is
~30 randomly distributed numbers in [0,255].
For the real query, the following code takes 2-10 minutes on my PC.
(Actual code, very simple test)
My current (Pseudo)code:
-- generate set of sets required to compute n.
-- operater (+) on set is set union.
requiredNumbers 0 = { {} }
requiredNumbers 1 = { {} }
requiredNumbers n =
{ {j, k} | j^k == n, j >= 2, k >= 2 }
+ { {j, k} | j*k == n, j >= 2, k >= 2 }
+ { {j, k} | j+k == n, j >= 1, k >= 1 }
-- remember the smallest set of "computed" number
bestSet := {i | 1 <= i <= largeNumber}
-- backtracking algorithm
-- from: input
-- to: accumulator of "already computed" number
closure from to =
if (from is empty)
if (|bestSet| > |to|)
bestSet := to
return
else if (|from| + |to| >= |bestSet|)
-- cut branch
return
else
m := min(from)
from' := deleteMin(from)
foreach (req in (requiredNumbers m))
closure (from' + (req - to)) (to + {m})
-- recoverEquation is a function converts set of number to set of equation.
-- it can be done easily.
output = recoverEquation (closure input {})
Additional Note:
Answers like
"There isn't a fast algorithm, because ..."
"There is a heuristic algorithm, it is ..."
are also welcome. By now I'm feeling that there is no fast and exact algorithm... Answer #1 can be used as a heuristic, I think.
What if you worked backwards from the highest number in a sorted input, checking if and how the smaller numbers (and the numbers introduced along the way) can be used in its construction?
For example (although this may not guarantee the shortest sequence):
input: {7, 20, 17, 100}
(100) = (20) * 5 =>
(7) = 5 + 2 =>
(17) = 10 + (7) =>
(20) = 10 * 2 =>
10 = 5 * 2 =>
5 = 3 + 2 =>
3 = 2 + 1 =>
2 = 1 + 1
What I recommend is to transform this into a kind of graph shortest path algorithm.
For each number, you compute (and store) the shortest path of operations. Technically one step is enough: for each number you store the operation and the two operands (left and right, because the power operation is not commutative), and also the weight ("nodes").
Initially you register 1 with the weight of zero
Every time you register a new number, you have to generate all calculations with that number (all additions, multiplications, powers) with all already-registered numbers. ("edges")
Filter the calculations: if the result of a calculation is already registered, you shouldn't store it, because there is an easier way to get to that number.
Store only 1 operation for the commutative ones (1+2=2+1)
Prefilter the power operation because that may even cause overflow
You have to keep this list ordered by the weight of the edge (shortest total path first): weight = (weight of operand 1) + (weight of operand 2) + 1 (the weight of the operation itself).
You can exclude all resulting numbers which are greater than the maximum number that we have to find (e.g. if we have found 100 already, anything greater than 20 can be excluded) - this can be refined so that you check the operands of the operations as well.
If you hit one of your target numbers, then you have found the shortest way of calculating that target, and you have to restart the generation:
Recalculate the maximum of the target numbers
Go back on the paths of the currently found number, set their weight to 0 (they will be given from now on, because their cost is already paid)
Recalculate the weight for the operations in the generation list, because the source operand weights may have changed (this results in reordering at the end) - here you can exclude those where either operand is greater than the new maximum
If all the numbers are hit, then the search is over
You can build your expression using the "backlinks" (operation, left and right operands) for each of your target numbers.
The main point is that we always keep our eye on the target function, which is that the total number of operations must be the minimum possible. To achieve this, we always calculate the shortest path to a certain number, then treat that number (and all the other numbers on the way) as given, and then extend the search to the remaining targets.
Theoretically, this algorithm processes (registers) each number only once. Applying the proper filters cuts the unnecessary branches, so nothing is calculated twice (except the weights of the in-queue elements).
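Here is a rough Python sketch of this scheme. It is a heuristic, not a guaranteed-minimal search: the weight rule double-counts shared sub-derivations within one pass, the restart logic is simplified to "replay the winning path for free and search again", and the power operation is crudely capped to avoid overflow. All names are made up:

import heapq

def derive(targets):
    # 0 and 1 are taken as given, as in the question's pseudocode
    remaining = set(targets) - {0, 1}
    known, equations = {1}, []
    while remaining:
        limit = max(remaining)
        weight = {n: 0 for n in known}          # already-derived numbers are free
        back = {}                               # result -> (op, left, right) backlinks
        heap = [(0, n) for n in known]
        settled, hit = set(), None
        while heap and hit is None:
            w, x = heapq.heappop(heap)
            if x in settled:
                continue
            settled.add(x)
            if x in remaining:
                hit = x
                break
            for y in settled:
                cands = [(x + y, '+', x, y), (x * y, '*', x, y)]
                if x >= 2 and 2 <= y <= 8:      # crude overflow cap on powers
                    cands.append((x ** y, '^', x, y))
                if y >= 2 and 2 <= x <= 8:
                    cands.append((y ** x, '^', y, x))
                for val, op, a, b in cands:
                    nw = w + weight[y] + 1      # weight(a) + weight(b) + 1
                    if val > limit or weight.get(val, float('inf')) <= nw:
                        continue
                    weight[val] = nw
                    back[val] = (op, a, b)
                    heapq.heappush(heap, (nw, val))
        if hit is None:
            break                               # unreachable target; give up

        def emit(v):                            # replay backlinks, operands first
            if v in known:
                return
            op, a, b = back[v]
            emit(a)
            emit(b)
            equations.append("%d = %d %s %d" % (v, a, op, b))
            known.add(v)

        emit(hit)
        remaining.discard(hit)
    return equations

for eq in derive({7, 20, 17, 100}):
    print(eq)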
I have a list of n floating point streams, each of a different size.
The streams can be composed together using the following rules:
You can place a stream starting at any point in time (it is zero before it starts). You can use the same stream several times (it can overlap itself and even be placed at the same position several times), and you are allowed to not use a certain stream at all.
e.g.:
input streams:
1 2 3 4
2 4 5 6 7
1 5 6
Can be composed like:
1 2 3 4
1 5 6
1 5 6
After the placements, an output stream is composed by the rule that each output float equals the square root of the sum of the squares of the terms at that position.
e.g.:
If the streams at a position are:
1
2
3
The output is:
sqrt(1*1 + 2*2 + 3*3) = sqrt(14) = 3.74...
So for the example composition:
1 2 3 4
1 5 6
1 5 6
The output is:
1 5.09 6.32 3 4.12 5 6
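To pin down the composition rule, here is a small Python helper (hypothetical names) that applies a list of (stream index, start offset) placements:

import math

streams = [[1, 2, 3, 4],
           [2, 4, 5, 6, 7],
           [1, 5, 6]]

def compose(length, placements, streams):
    # accumulate the squares, then take the square root per position
    sumsq = [0.0] * length
    for k, start in placements:
        for t, v in enumerate(streams[k]):
            sumsq[start + t] += v * v
    return [math.sqrt(x) for x in sumsq]

print([round(x, 2) for x in compose(7, [(0, 1), (2, 0), (2, 4)], streams)])
# -> [1.0, 5.1, 6.32, 3.0, 4.12, 5.0, 6.0]  (matches the example up to rounding)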
What I have is the output stream and the input streams. I need to compute a composition that leads to that output. An exact composition doesn't have to exist; I need a composition as close as possible to the output (smallest accumulated difference).
e.g.:
Input:
Stream to mimic:
1 5.09 6.32 3 4.12 5 6
and a list:
1 2 3 4
2 4 5 6 7
1 5 6
Expected output:
Stream 0 starting at 1,
Stream 2 starting at 0,
Stream 2 starting at 4.
This looks like an NP-hard problem; is there any fast way to solve it? It can be somewhat brute force (but not totally, this is not a theoretical problem), and it may return a non-optimal answer as long as it is close enough.
The algorithm will usually be used with a very long stream to mimic (it can be a few megabytes), with around 20 streams to compose from, each around a kilobyte long.
I think you can speed up a greedy search a bit over the obvious. First of all, square each element in all of the streams involved. Then you are looking for a sum of squared streams that looks a lot like the squared target stream. Suppose that "looks like" is the Euclidean distance between the squared streams, considered as vectors.
Then we have (a-b)^2 = a^2 + b^2 - 2a.b. So if we can find the dot product of two vectors quickly, and we know their absolute sizes, we can find the distance quickly. Using the FFT and the convolution theorem (http://en.wikipedia.org/wiki/Convolution_theorem), we can work out a.b_i, where a is the target stream and b_i is stream b at offset i, by convolving a with a reversed version of b: for the cost of an FFT on a, an FFT on reversed b, and an inverse FFT on their pointwise product, we get a.b_i for every offset i.
If we do a greedy search, the first step will be to find the b_i that makes (a-b_i)^2 smallest and subtract it from a. Then we are looking for a stream c_j that makes (a-b_i-c_j)^2 as small as possible. But this is a^2 + b_i^2 + c_j^2 - 2a.b_i - 2a.c_j + 2b_i.c_j and we have already calculated everything except b_i.c_j in the step above. If b and c are shorter streams it will be cheap to calculate b_i.c_j, and we can use the FFT as before.
So we have a not too horrible way to do a greedy search - at each stage subtract off the stream from the adjusted target stream so far that makes the residual smallest (considered as vectors in euclidean space), and carry on from there. At some stage we will find that none of the streams we have available make the residual any smaller. We can stop there, because our calculation above shows us that using two streams at once won't help either then - this follows because b_i.c_j >= 0, since each element of b_i is >= 0, because it is a square.
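A sketch of the FFT cross-correlation step with NumPy (helper names are made up; all inputs are the squared streams, as described above):

import numpy as np

def dots_at_all_offsets(target_sq, stream_sq):
    # d[i] = sum_t target_sq[i + t] * stream_sq[t] for every offset i,
    # computed in one shot by convolving with the reversed stream
    n = len(target_sq) + len(stream_sq) - 1
    fa = np.fft.rfft(target_sq, n)
    fb = np.fft.rfft(stream_sq[::-1], n)
    corr = np.fft.irfft(fa * fb, n)
    return corr[len(stream_sq) - 1:]

def best_placement(residual_sq, streams_sq):
    # greedy step: ||r - s_i||^2 = ||r||^2 + ||s||^2 - 2 r.s_i,
    # so maximize the gain 2 r.s_i - ||s||^2 over streams and offsets
    best = None
    for k, s in enumerate(streams_sq):
        gain = 2 * dots_at_all_offsets(residual_sq, s) - np.dot(s, s)
        i = int(np.argmax(gain))
        if best is None or gain[i] > best[0]:
            best = (gain[i], k, i)
    return best  # (gain, stream index, offset); apply only while the gain is positive

Each accepted placement is subtracted from the squared residual, and the loop repeats until no placement has positive gain, exactly the stopping condition argued above.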
If you do a greedy search and are not satisfied, but have more cpu to burn, try Limited Discrepancy Search.
If I can use C#, LINQ & the Rx framework's System.Interactive extensions then this works:
First up - define a jagged array for the allowable arrays.
int[][] streams =
new []
{
new [] { 1, 2, 3, 4, },
new [] { 2, 4, 5, 6, 7, },
new [] { 1, 5, 6, },
};
Need an infinite iterator on integers to represent each step.
IEnumerable<int> steps =
EnumerableEx.Generate(0, x => true, x => x + 1, x => x);
Need a random number generator to randomly select which streams to add to each step.
var rnd = new Random();
In my LINQ query I've used these operators:
Scan^ - runs an accumulator function over a sequence, producing an output value for every input value
Where - filters the sequence based on the predicate
Empty - returns an empty sequence
Concat - concatenates two sequences
Skip - skips over the specified number of elements in a sequence
Any - returns true if the sequence contains any elements
Select - projects the sequence using a selector function
Sum - sums the values in the sequence
^ - from the Rx System.Interactive library
Now for the LINQ query that does all of the hard work.
IEnumerable<double> results =
steps
// Randomly select which streams to add to this step
.Scan(Enumerable.Empty<IEnumerable<int>>(), (xs, _) =>
streams.Where(st => rnd.NextDouble() > 0.8).ToArray())
// Create a list of "Heads" & "Tails" for each step
// Heads are the first elements of the current streams in the step
// Tails are the remaining elements to push forward to the next step
.Scan(new
{
Heads = Enumerable.Empty<int>(),
Tails = Enumerable.Empty<IEnumerable<int>>()
}, (acc, ss) => new
{
Heads = acc.Tails.Concat(ss)
.Select(s => s.First()),
Tails = acc.Tails.Concat(ss)
.Select(s => s.Skip(1)).Where(s => s.Any()),
})
// Keep the Heads only
.Select(x => x.Heads)
// Filter out any steps that didn't produce any values
.Where(x => x.Any())
// Calculate the square root of the sum of the squares
.Select(x => System.Math.Sqrt((double)x.Select(y => y * y).Sum()));
Nice lazy evaluation per step - scary though...