How to calculate the number of longest common subsequences - algorithm

I'm trying to calculate the amount of longest possible subsequences that exist between two strings.
e.g.
String X = "efgefg";
String Y = "efegf";
output: The Number of longest common sequences is: 3
(i.e.: efeg, efef, efgf - this doesn't need to be calculated by the algorithm, just shown here for demonstration)
I've managed to do this in O(|X|*|Y|) using dynamic programming based on the general idea here: Cheapest path algorithm.
Can anyone think of a way to do this calculation with better runtime efficiently?
--Edited in response to Jason's comment.

Longest common subsequence problem is a well studied CS problem.
You may want to read up on it here: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem

I don't know but here are some attempts at thinking aloud:
The worst case I was able to construct has an exponential - 2**(0.5 |X|) - number of longest common subsequences:
X = "aAbBcCdD..."
Y = "AaBbCcDd..."
where the longest common subsequences include exactly one of {A, a}, exactly one of {B, b} and so forth... (nitpicking: if you alphabet is limited to 256 chars, this breaks down eventually - but 2**128 is already huge.)
However, you don't necessarily have to generate all subsequences to count them.
If you've got O(|X| * |Y|), you are already better than that! What we learn from this is that any algorithm better than yours must not attempt to generate the actual subsequences.

First of all, we do know that finding any longest common subsequence of two sequences with length n cannot be done in O(n2-ε) time unless the Strong Exponential Time Hypothesis fails, see:
https://arxiv.org/abs/1412.0348
This pretty much implies that you cannot count the number of ways how to align common subsequences to the input sequences in O(n2-ε) time.
On the other hand, it is possible to count the number of ways of such alignments in O(n2) time. It is also possible to count them in O(n2/log(n)) time with the so-called four-Russians speed-up.
Now the real question if you really intended to calculate this or you want to find the number of different subsequences? I am afraid that this latter is a #P-complete counting problem. At least, we do know that counting the number of sequences with a given length that a regular grammar can generate is #P-complete:
S. Kannan, Z. Sweedyk, and S. R. Mahaney. Counting
and random generation of strings in regular languages.
In ACM-SIAM Symposium on Discrete Algorithms
(SODA), pages 551–557, 1995
This is a similar problem in that sense that counting the number of ways a regular grammar can generate sequences of a given length is a trivial dynamic programming algorithm. However, if you do not want to distinguish generations resulting the same sequence, then the problem turns from easy to extremely hard. My natural conjecture is that this should be the case for sequence alignment problems, too (longest common subsequence, edit distance, shortest common superstring, etc.).
So if you would like to calculate the number of different subsequences of two sequences, then very likely your current algorithm is wrong and any algorithm cannot calculate it in polynomial time unless P = NP (and more...).

Best Explanation(with Code) I found :
Count all LCS

Related

Longest Common Sub-sequence of N sequences (for diff purposes)

I want to find the longest common sub-sequence of N strings. I got the algorithm that uses Dynamic Programming for 2 strings, but if I extend it to N, it will consume exponential amount of memory, as I need an array of N dimensions. It is not an option.
In the common case (90%), almost all strings will be the same.
If I try to break my N sequences in N/2 pairs of 2 strings each, run the LCS of 2 strings separately for each pair, I'll have N/2 sub-sequences. I can remove the duplicates and repeat this process until I have only one sub-sequence, that is common to all strings in the input.
Is there something that I am missing? It doesn't look like a solution to a N-hard problem...
I know that each call to LCS with each pair of strings may have more than one sub-sequence as solution, but if I get only one of these sub-sequences to use as input in the next call, maybe my final sub-sequence isn't the longest possible, but I have something that may fit my needs.
If I try to use all possible solutions for one pair and combine then with all possible solutions from another pairs (that each of them may have more than one too), I may end up with exponential time. Am I right?
Yes, you're missing the correctness: there is no guarantee that the LCS of a pair of strings will have any overlap whatsoever with the LCS of the set overall. Consider this example:
aaabb1xyz
aaabb2xyz
cccdd1xyz
cccdd2xyz
If you pair these in the given order, you'll get LCSs of aaabb and cccdd, missing the xyz for the set.
If, as you say, the strings are almost all identical, perhaps the differences aren't a problem for you. If the not-identical strings are very similar to the "median" string, then your incremental solution will work well enough for your purposes.
Another possibility is to do LCS on random pairs of strings until that median string emerges; then you start from that common point, and you should have a "good enough" solution.

Cycle detection in non-iterated sequence

My understanding is that tortoise-hare like algorithms works on iterated sequences
That is, for any x, succ(x) = x0.
I would like to implement an algortihm that can detect cycles in both deterministic and non-deterministic infinite repeating sequences.
The sequences may have a non-repeating prefix subsequence, for example in the sequence 1666666..., has the prefix of 1 and the repeating pattern 6.
This algorithm would return the longest repeating pattern in a sequence.
The repeating pattern of 001100110011... would be 0011, the repeating pattern of 22583575837583758... would be 58357.
My idea was to generate a guess of the longest possible pattern length somehow go from there, but I can't get things in order.
The tortoise-hare algorithm uses same address to identify cycles. This problem requires a different sort of algorithm. Some form of trie or structure such as LZW compression, would be where I would look for a solution.

Select some from many binary sequences so that the result of "or" them together is 1111111111....111

I have N binary sequences of length L, where N and L maybe very large, and those sequences maybe very sparse, say have much more 0s then 1s.
I want to select M sequences from them, namely b_1, b_2, b_3..., such that
b_1 | b_2 | b_3 ... | b_M = 1111...11 (L 1s)
Is there an algorithm to achieve it?
My idea is:
STEP1: for position from 1 to L, count the total number of sequences which has 1 at that position. Name it 'owning number'
STEP2: consider the position having minimum owning number, and choose the sequence having the maximum number of 1s from the owning sequence of that position.
STEP3: ignore the chosen sequence, update owning number and go back to STEP2.
I believe that my method cannot generate the best answer.
Does anyone has a better idea?
This is the well known set cover problem. It is NP-hard — in fact, its decision version is one of the canonical NP-complete problems and was among the 21 problems included in Karp's 1972 paper — and so no efficient algorithm is known for solving it.
The algorithm you describe in your question is known as the "greedy algorithm" and (unless your problem has some special features that you are not telling us) it's essentially the best known approach. It finds a collection of sets that is no more than O(log |N|) times the size of the smallest such collection.
Sounds like a typical backtrack task.
Yes, your algoryth sounds reasonable if you want to have a good answer quickly. If you want to have the combination of the least possible samples you can't do better than try all combinations.
Depending on the exact structure of the problem, there is an other technique that often works well (and actually gives an optimal result):
Let x[j] be a boolean variable representing the choice whether to include the j'th binary sequence in the result. A zero-suppressed binary decision diagram can now represent (maybe succinctly - depending on the characteristics of the problem) the family of sets such that the OR of the binary sequences corresponding to a variable x[j] included in the set is all ones. Finding the smallest such set (thus minimizing the number of sequences included) is relatively easy if the ZDD was succinct. Details can be found in The Art of Computer Programming chapter 7.1.4 (volume 4A).
It's also easy to adapt to an exact cover, by taking the family of sets such that there is exactly one 1 for every position.

String analysis

Given a sequence of operations:
a*b*a*b*a*a*b*a*b
is there a way to get the optimal subdivision to enable reusage of substring.
making
a*b*a*b*a*a*b*a*b => c*a*c, where c = a*b*a*b
and then seeing that
a*b*a*b => d*d, where d = a*b
all in all reducing the 8 initial operations into the 4 described here?
(c = (d = a*b)*d)*a*c
The goal of course is to minimize the number of operations
I'm considering a suffixtree of sorts.
I'm especially interested in linear time heuristics or solutions.
The '*' operations are actually matrix multiplications.
This whole problem is known as "Common Subexpression Elimination" or CSE. It is a slightly smaller version of the problem called "Graph Reduction" faced by the implementer of compilers for functional programming languages. Googling "Common Subexpression elimination algorithm" gives lots of solutions, though none that I can see especially for the constraints given by matrix multiplication.
The pages linked to give a lot of references.
My old answer is below. However, having researched a bit more, the solution is simply building a suffix tree. This can be done in O(N) time (lots of references on the wikipedia page). Having done this, the sub-expressions (c, d etc. in your question) are just nodes in the suffix tree - just pull them out.
However, I think MarcoS is on to something with the suggestion of Longest repeating Substring, as graph reduction precedence might not allow optimisations that can be allowed here.
sketch of algorithm:
optimise(s):
let sub = longestRepeatingSubstring(s).
optimisedSub = optimise(sub)
return s with sub replaced by optimisedSub
Each run of longest repeating substring takes time N. You can probably re-use the suffix tree you build to solve the whole thing in time N.
edit: The orders-of-growth in this answer are needed in addition to the accepted answer in order to run CSE or matrix-chain multiplication
Interestingly, a compression algorithm may be what you want: a compression algorithm seeks to reduce the size of what it's compressing, and if the only way it can do that is substitution, you can trace it and obtain the necessary subcomponents for your algorithm. This may not give nice results though for small inputs.
What subsets of your operations are commutative will be an important consideration in choosing such an algorithm. [edit: OP says no operations are commutative in his/her situation]
We can also define an optimal solution, if we ignore effects such as caching:
input: [some product of matrices to compute]
given that multiplying two NxN matrices is O(N^2.376)
given we can visualize the product as follows:
[[AxB][BxC][CxD][DxE]...]
we must for example perform O(max(A,B,C)^2.376) or so operations in order to combine
[AxB][BxC] -> [AxC]
The max(...) is an estimate based on how fast it is to multiply two square matrices;
a better estimate of cost(A,B,C) for multiplying an AxB * BxC matrix can be gotten
from actually looking at the algorithm, or running benchmarks if you don't know the
algorithm used.
However note that multiplying the same matrix with itself, i.e. calculating
a power, can be much more efficient, and we also need to take that into account.
At worst, it takes log_2(power) multiplies each of O(N^2.376), but this could be
made more efficient by diagonalizing the matrix first.
There is the question about whether a greedy approach is feasible for not: whether one SHOULD compress repeating substrings at each step. This may not be the case, e.g.
aaaaabaab
compressing 'aa' results in ccabcb and compressing 'aab' is now impossible
However I have a hunch that, if we try all orders of compressing substrings, we will probably not run into this issue too often.
Thus having written down what we want (the costs) and considered possibly issues, we already have a brute-force algorithm which can do this, and it will run for very small numbers of matrices:
# pseudocode
def compress(problem, substring)
x = new Problem(problem)
x.string.replaceall(substring, newsymbol)
x.subcomputations += Subcomputation(newsymbol=substring)
def bestCompression(problem)
candidateCompressions = [compress(problem,substring) for each substring in problem.string]
# etc., recursively return problem with minimum cost
# dynamic programming may help make this more efficient, but one must watch
# out for the note above, how it may be hard to be greedy
Note: according to another answer by Asgeir, this is known as the Matrix Chain Multiplication optimization problem. Nick Fortescue notes this is also known more generally as http://en.wikipedia.org/wiki/Common_subexpression_elimination -- thus one could find any generic CSE or Matrix-Chain-Multiplication algorithm/library from the literature, and plug in the cost orders-of-magnitude I mentioned earlier (you will need those nomatter which solution you use). Note that the cost of the above calculations (multiplication, exponentiation, etc.) assume that they are being done efficiently with state-of-the-art algorithms; if this is not the case, replace the exponents with appropriate values which correspond to the way the operations will be carried out.
If you want to use the fewest arithmetic operations then you should have a look at matrix chain multiplication which can be reduced to O(n log n)
From the top of the head the problem seems in NP for me. Depending on the substitutions you are doing other substitions will be possible or impossible for example for the string
d*e*a*b*c*d*e*a*b*c*d*e*a there are several possibilities.
If you take the longest common string it will be:
f = d*e*a*b*c and you could substitute f*f*e*a leaving you with three multiplications in the end and four intermediate ones (total seven).
If you instead substitute the following way:
f = d*e*a you get f*b*c*f*b*c*f which you can further substitute using g = f*b*c to
g*g*f for a total of six multiplication.
There are other possible substitutions in this problem, but I do not have the time to count them all right now.
I am guessing for a complete minimal substitution it is not only necessary to figure out the longest common substring but also the number of times each substring repeats, which probably means you have to track all substitutions so far and do backtracking. Still it might be faster than the actual multiplications.
Isn't this the Longest repeated substring problem?

What would be a good general algorithm for approaching integer sequence problems?

Say the input will always be the same number N of numbers (e.g., 5) and assume the integers actually have a mathematical relation (no lengths of the numbers 'one', 'two', days in the nth month, etc.). The output would be either the next integer and the rule discovered or a message that no rule could be detected.
I was thinking to have in one-two-three order, a module that tries to find arithmetic sequence rules by doing sums and/or differences between numbers adjacent, one away, two away, etc. looking for patterns, then having a module focused on geometric sequences by multiplying and/or dividing in the same way, and then, if there is a general approach, a module for detecting recursive sequences.
Thanks!
The On-Line Encyclopedia of Integer Sequences solves precisely this problem :-)
Given any sequence of numbers, we can come up with a formula which 'fits'!
Given a1, a2, ..., an
All you need to do is find an n-1 degree polynomial (using Polynomial interpolation) so that
P(i) = ai
and that's it, you have a formula. Polynomial interpolation can be as easy as solving a matrix equation Ax = b (with A being a Vandermonde matrix).
Check out: http://en.wikipedia.org/wiki/Polynomial_interpolation
That is one of the reasons I find these 'guess the next number' problems a bit silly (read: pathetic IQ tests). Not everyone thinks the same way.

Resources