Longest Common Sub-sequence of N sequences (for diff purposes) - algorithm

I want to find the longest common sub-sequence of N strings. I have the dynamic programming algorithm for 2 strings, but if I extend it to N it will consume an exponential amount of memory, since I need an N-dimensional array. That is not an option.
In the common case (90%), almost all strings will be the same.
If I split my N sequences into N/2 pairs and run the two-string LCS separately on each pair, I'll have N/2 sub-sequences. I can remove the duplicates and repeat this process until only one sub-sequence is left, which is common to all strings in the input.
Is there something that I am missing? It doesn't look like a solution to an NP-hard problem...
I know that each call to LCS with each pair of strings may have more than one sub-sequence as solution, but if I get only one of these sub-sequences to use as input in the next call, maybe my final sub-sequence isn't the longest possible, but I have something that may fit my needs.
If I try to use all possible solutions for one pair and combine them with all possible solutions from the other pairs (each of which may have more than one too), I may end up with exponential time. Am I right?

Yes, you're missing the correctness: there is no guarantee that the LCS of a pair of strings will have any overlap whatsoever with the LCS of the set overall. Consider this example:
aaabb1xyz
aaabb2xyz
cccdd1xyz
cccdd2xyz
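If you pair these in the given order and keep only the longest contiguous match, you'll get aaabb and cccdd, missing the xyz for the set. With true subsequences this particular set still reduces to xyz, but the same failure shows up for inputs such as aaab, baaa, b, b: the first pair gives aaa, which shares nothing with the b that is common to all four.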
If, as you say, the strings are almost all identical, perhaps the differences aren't a problem for you. If the not-identical strings are very similar to the "median" string, then your incremental solution will work well enough for your purposes.
Another possibility is to do LCS on random pairs of strings until that median string emerges; then you start from that common point, and you should have a "good enough" solution.
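The pairwise reduction discussed above can be tried out with a small sketch. It uses the textbook two-string DP and folds adjacent pairs round by round; the function names and the four test strings are illustrative assumptions, not taken from the question, and duplicate removal is left out for brevity.

```python
def lcs(a, b):
    """Return one longest common subsequence of a and b (O(len(a)*len(b)) DP)."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Walk back through the table to recover one optimal subsequence.
    out = []
    i, j = n, m
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


def pairwise_reduce(strings):
    """Replace adjacent pairs by their LCS until a single string is left."""
    strings = list(strings)
    while len(strings) > 1:
        nxt = [lcs(strings[k], strings[k + 1])
               for k in range(0, len(strings) - 1, 2)]
        if len(strings) % 2:  # odd one out is carried to the next round
            nxt.append(strings[-1])
        strings = nxt
    return strings[0]


# Hypothetical input: "b" is common to all four strings, but the first
# pair reduces to "aaa", so the pairwise result is empty.
print(repr(pairwise_reduce(["aaab", "baaa", "b", "b"])))
```

The result is always a common subsequence of every input, so it is safe to use for a diff; it is just not guaranteed to be the longest one.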

Related

Cycle detection in non-iterated sequence

My understanding is that tortoise-hare like algorithms work on iterated sequences.
That is, for any x, succ(x) is always the same: each element is fully determined by the one before it.
I would like to implement an algorithm that can detect cycles in both deterministic and non-deterministic infinite repeating sequences.
The sequences may have a non-repeating prefix subsequence, for example in the sequence 1666666..., has the prefix of 1 and the repeating pattern 6.
This algorithm would return the longest repeating pattern in a sequence.
The repeating pattern of 001100110011... would be 0011, the repeating pattern of 22583575837583758... would be 58357.
My idea was to generate a guess of the longest possible pattern length and somehow go from there, but I can't get things in order.
The tortoise-hare algorithm relies on reaching the same address (the same state) again to identify a cycle. This problem requires a different sort of algorithm. Some form of trie, or a structure like the dictionary used in LZW compression, is where I would look for a solution.
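As a baseline before reaching for a trie, here is a brute-force sketch that works on a finite sample of the sequence. It is not the trie or LZW approach suggested above: it simply looks for the shortest prefix after which the rest of the sample repeats with some period. The function name and the min_repeats parameter are assumptions made for illustration, and a finite sample can of course only suggest, not prove, a cycle in an infinite sequence.

```python
def find_cycle(seq, min_repeats=2):
    """Return (prefix_length, pattern) for a finite sample, or None.

    Tries the shortest prefix first, then the shortest period, and requires
    the pattern to fit at least min_repeats times in the remaining sample.
    """
    n = len(seq)
    for start in range(n):
        tail_len = n - start
        for period in range(1, tail_len // min_repeats + 1):
            # The tail is periodic if every element equals the one
            # `period` positions before it.
            if all(seq[i] == seq[i - period]
                   for i in range(start + period, n)):
                return start, seq[start:start + period]
    return None


print(find_cycle("1666666"))       # (1, '6')
print(find_cycle("001100110011"))  # (0, '0011')
```

This runs in roughly cubic time in the sample length, so it is only meant to pin down the expected behaviour on small inputs.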

Levenshtein Transpose Distance

How can I implement the transpose/swap/twiddle/exchange distance alone using dynamic programming? I must stress that I do not want to check for the other operations (i.e. copy, delete, insert, kill etc.), just transpose/swap.
I wish to apply Levenshtein's algorithm just for swap distance. What would the code look like?
I'm not sure that Levenshtein's algorithm can be used in this case. Without insert or delete operations, the distance is well defined only between strings with the same length and the same characters. Examples of string pairs that cannot be transformed into each other using only transpositions:
AB, ABC
AAB, ABB
With that, the algorithm can be to find all possible permutations of the positions of the characters that are not in the same place in both strings, and look for the one that can be represented with the minimum number of transpositions or swaps.
An efficient application of dynamic programming usually requires that the task decompose into several instances of the same task for a shorter input. In the case of the Levenshtein distance, this boils down to prefixes of the two strings and the number of edits required to get from one to the other. I don't see how such a decomposition can be achieved in your case. At least I don't see one that would result in a polynomial time algorithm.
Also, it is not quite clear what operations you are talking about. Depending on the context, a swap or exchange can mean either the same thing as a transposition or a replacement of a letter with an arbitrary other letter, e.g. test->text. If by "transpose/swap/twiddle/exchange" you mean just "transpose", then you should have a look at Counting the adjacent swaps required to convert one permutation into another. If not, please clarify the question.
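If "transpose" means swapping two adjacent characters, one way to sketch the distance is to map each character of the first string to a position in the second (matching the k-th occurrence of a letter to its k-th occurrence) and count the inversions of the resulting permutation. This is not a Levenshtein-style DP; the function name is an assumption, and it returns None for pairs like the ones listed above, where no sequence of swaps exists.

```python
from collections import Counter, defaultdict, deque


def adjacent_swap_distance(a, b):
    """Minimum number of adjacent swaps turning a into b, or None if impossible."""
    if Counter(a) != Counter(b):
        return None  # different length or different characters
    # Queue up the target positions of every character in b.
    positions = defaultdict(deque)
    for idx, ch in enumerate(b):
        positions[ch].append(idx)
    # k-th occurrence in a goes to the k-th occurrence in b.
    perm = [positions[ch].popleft() for ch in a]
    # Each inversion needs exactly one adjacent swap (O(n^2) count for clarity).
    inversions = 0
    for i in range(len(perm)):
        for j in range(i + 1, len(perm)):
            if perm[i] > perm[j]:
                inversions += 1
    return inversions


print(adjacent_swap_distance("ABC", "CBA"))  # 3
print(adjacent_swap_distance("AAB", "ABB"))  # None
```

For long strings the inversion count can be done in O(n log n) with a merge sort or a Fenwick tree; the quadratic loop is kept here only for readability.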

Finding partial substrings within a string

I have two strings which must be compared for similarity. The algorithm must be designed to find the maximal similarity. In this instance, the ordering matters, but intervening (or missing) characters do not. Edit distance cannot be used in this case for various reasons.
The situation is basically as follows:
string 1: ABCDEFG
string 2: AFENBCDGRDLFG
the resulting algorithm would find the substrings A, BCD, FG
I currently have a recursive solution, but because this must be run on massive amounts of data, any improvements would be greatly appreciated.
Looking at your sole example, it looks like you want to find the longest common subsequence.
Take a look at LCS
Is it just me, or is this NP-hard? – David Titarenco (from comment)
If you want the LCS of an arbitrary number of strings, it's NP-hard. But if the number of input strings is constant (as in this case, 2), it can be done in polynomial time.
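For two strings, a recursive solution can be replaced by an iterative DP that keeps only two rows of the table, which matters when it runs over large amounts of data. The sketch below returns only the LCS length; the function name is an assumption, and recovering the matched pieces themselves would need the full table or a Hirschberg-style divide and conquer.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, using O(min(len)) memory."""
    if len(b) > len(a):
        a, b = b, a  # keep the shorter string along the row
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, other in enumerate(b, start=1):
            if ch == other:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# The example from the question: A, BCD and FG add up to 6 characters.
print(lcs_length("ABCDEFG", "AFENBCDGRDLFG"))  # 6
```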

Algorithm to find common substring across N strings

I'm familiar with LCS algorithms for 2 strings. Looking for suggestions for finding common substrings in 2..N strings. There may be multiple common substrings in each pair. There can be different common substrings in subsets of the strings.
strings: (ABCDEFGHIJKL) (DEF) (ABCDEF) (BIJKL) (FGH)
common strings:
1/2 (DEF)
1/3 (ABCDEF)
1/4 (IJKL)
1/5 (FGH)
2/3 (DEF)
longest common strings:
1/3 (ABCDEF)
most common strings:
1/2/3 (DEF)
This sort of thing is done all the time in DNA sequence analysis. You can find a variety of algorithms for it. One reasonable collection is listed here.
There's also the brute-force approach of making tables of every substring (if you're interested only in short ones): form an N-ary tree (N=26 for letters, 256 for ASCII) at each level, and store histograms of the count at every node. If you prune off little-used nodes (to keep the memory requirements reasonable), you end up with an algorithm that finds all subsequences of length up to M in something like N*M^2*log(M) time for input of length N. If you instead split this up into K separate strings, you can build the tree structure and just read off the answer(s) in a single pass through the tree.
Suffix trees are the answer unless you have really large strings where memory becomes a problem. Expect 10~30 bytes of memory usage per character in the string for a good implementation. There are a couple of open-source implementations too, which make your job easier.
There are other, more succinct algorithms too, but they are harder to implement (look for "compressed suffix trees").
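For short inputs like the ones in the question, a plain quadratic DP per pair is enough to reproduce the "longest common strings" table before investing in a suffix tree. This is a sketch of that simpler technique, not a suffix tree implementation; the function name is an assumption, and it reports only one longest match per pair.

```python
from itertools import combinations


def longest_common_substring(a, b):
    """Return one longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


strings = ["ABCDEFGHIJKL", "DEF", "ABCDEF", "BIJKL", "FGH"]
for (i, s), (j, t) in combinations(enumerate(strings, start=1), 2):
    common = longest_common_substring(s, t)
    if common:
        print(f"{i}/{j} ({common})")
```

The loop costs O(len(s) * len(t)) per pair, so for many or long strings the suffix tree route is the one that scales.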

How to calculate the number of longest common subsequences

I'm trying to calculate the number of longest common subsequences that exist between two strings.
e.g.
String X = "efgefg";
String Y = "efegf";
output: The Number of longest common sequences is: 3
(i.e.: efeg, efef, efgf - this doesn't need to be calculated by the algorithm, just shown here for demonstration)
I've managed to do this in O(|X|*|Y|) using dynamic programming based on the general idea here: Cheapest path algorithm.
Can anyone think of a way to do this calculation with a better runtime?
--Edited in response to Jason's comment.
Longest common subsequence problem is a well studied CS problem.
You may want to read up on it here: http://en.wikipedia.org/wiki/Longest_common_subsequence_problem
I don't know but here are some attempts at thinking aloud:
The worst case I was able to construct has an exponential - 2**(0.5 |X|) - number of longest common subsequences:
X = "aAbBcCdD..."
Y = "AaBbCcDd..."
where the longest common subsequences include exactly one of {A, a}, exactly one of {B, b} and so forth... (nitpicking: if your alphabet is limited to 256 chars, this breaks down eventually - but 2**128 is already huge.)
However, you don't necessarily have to generate all subsequences to count them.
If you've got O(|X| * |Y|), you are already better than that! What we learn from this is that any algorithm better than yours must not attempt to generate the actual subsequences.
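To illustrate counting without generating, here is a sketch of an O(|X| * |Y|) table fill that counts distinct longest common subsequences. It assumes the usual inclusion-exclusion recurrence on top of the LCS length table; it is not necessarily the asker's algorithm, and the function name is made up for the example.

```python
def count_distinct_lcs(x, y):
    """Return (lcs_length, number_of_distinct_lcs_strings) for x and y."""
    n, m = len(x), len(y)
    length = [[0] * (m + 1) for _ in range(n + 1)]
    # With an empty prefix the only common subsequence is the empty string.
    count = [[1] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                # Every LCS ends with this character; strip it off.
                length[i][j] = length[i - 1][j - 1] + 1
                count[i][j] = count[i - 1][j - 1]
            else:
                best = max(length[i - 1][j], length[i][j - 1])
                length[i][j] = best
                total = 0
                if length[i - 1][j] == best:
                    total += count[i - 1][j]
                if length[i][j - 1] == best:
                    total += count[i][j - 1]
                if length[i - 1][j - 1] == best:
                    # Strings counted on both sides are counted once.
                    total -= count[i - 1][j - 1]
                count[i][j] = total
    return length[n][m], count[n][m]


print(count_distinct_lcs("efgefg", "efegf"))  # (4, 3)
print(count_distinct_lcs("aAbB", "AaBb"))     # (2, 4)
```

The second call is the worst-case construction from above cut down to two letter pairs: the count doubles with every pair while the table stays quadratic.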
First of all, we do know that finding any longest common subsequence of two sequences with length n cannot be done in O(n^(2-ε)) time unless the Strong Exponential Time Hypothesis fails, see:
https://arxiv.org/abs/1412.0348
This pretty much implies that you cannot count the number of ways to align common subsequences to the input sequences in O(n^(2-ε)) time.
On the other hand, it is possible to count the number of such alignments in O(n^2) time. It is also possible to count them in O(n^2/log(n)) time with the so-called four-Russians speed-up.
Now the real question is whether you really intended to calculate this, or whether you want to find the number of different subsequences. I am afraid that the latter is a #P-complete counting problem. At least, we do know that counting the number of sequences of a given length that a regular grammar can generate is #P-complete:
S. Kannan, Z. Sweedyk, and S. R. Mahaney. Counting and random generation of strings in regular languages. In ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 551–557, 1995.
This is a similar problem in the sense that counting the number of ways a regular grammar can generate sequences of a given length is a trivial dynamic programming algorithm. However, if you do not want to distinguish generations that result in the same sequence, then the problem turns from easy to extremely hard. My natural conjecture is that this should be the case for sequence alignment problems, too (longest common subsequence, edit distance, shortest common superstring, etc.).
So if you would like to calculate the number of different subsequences of two sequences, then very likely your current algorithm is wrong, and no algorithm can calculate it in polynomial time unless P = NP (and more...).
Best explanation (with code) I found:
Count all LCS
