Algorithm to find common substring across N strings

I'm familiar with LCS algorithms for 2 strings. Looking for suggestions for finding common substrings in 2..N strings. There may be multiple common substrings in each pair. There can be different common substrings in subsets of the strings.
strings: (ABCDEFGHIJKL) (DEF) (ABCDEF) (BIJKL) (FGH)
common strings:
1/2 (DEF)
1/3 (ABCDEF)
1/4 (IJKL)
1/5 (FGH)
2/3 (DEF)
longest common strings:
1/3 (ABCDEF)
most common strings:
1/2/3 (DEF)

This sort of thing is done all the time in DNA sequence analysis. You can find a variety of algorithms for it. One reasonable collection is listed here.
There's also the brute-force approach of making tables of every substring (if you're interested only in short ones): form a tree with one branch per symbol (26 for letters, 256 for ASCII) at each level, and store a histogram of counts at every node. If you prune off little-used nodes (to keep the memory requirements reasonable), you end up with an algorithm that finds all substrings of length up to M in something like N*M^2*log(M) time for input of length N. If you instead split this up into K separate strings, you can build the tree structure and just read off the answer(s) in a single pass through the tree.
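In Python, a minimal sketch of that table idea might look like the following, assuming you only care about substrings up to some small length (the function name common_substrings and the max_len cut-off are illustrative choices, not part of the answer above):

    # Brute-force substring table: record, for every substring up to max_len,
    # which of the input strings contain it. Memory grows quickly, so in
    # practice you would prune rarely-shared entries as suggested above.
    from collections import defaultdict

    def common_substrings(strings, max_len=6):
        seen = defaultdict(set)   # substring -> indices of strings containing it
        for idx, s in enumerate(strings):
            for i in range(len(s)):
                for j in range(i + 1, min(i + max_len, len(s)) + 1):
                    seen[s[i:j]].add(idx)
        return {sub: owners for sub, owners in seen.items() if len(owners) > 1}

    strings = ["ABCDEFGHIJKL", "DEF", "ABCDEF", "BIJKL", "FGH"]
    shared = common_substrings(strings)
    longest = max(shared, key=len)
    print(longest, sorted(shared[longest]))  # ABCDEF [0, 2]
    print(sorted(shared["DEF"]))             # [0, 1, 2]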

Suffix trees are the answer unless you have really large strings where memory becomes a problem. Expect 10-30 bytes of memory usage per character in the string for a good implementation. There are a couple of open-source implementations too, which make your job easier.
There are other, more succinct algorithms too, but they are harder to implement (look for "compressed suffix trees").


Longest Common Sub-sequence of N sequences (for diff purposes)

I want to find the longest common sub-sequence of N strings. I have the algorithm that uses dynamic programming for 2 strings, but if I extend it to N, it will consume an exponential amount of memory, as I need an N-dimensional array. That is not an option.
In the common case (90%), almost all strings will be the same.
If I try to break my N sequences into N/2 pairs of 2 strings each and run the 2-string LCS separately on each pair, I'll have N/2 sub-sequences. I can remove the duplicates and repeat this process until I have only one sub-sequence, which is common to all strings in the input.
Is there something that I am missing? It doesn't look like a solution to an NP-hard problem...
I know that each call to LCS with each pair of strings may have more than one sub-sequence as solution, but if I get only one of these sub-sequences to use as input in the next call, maybe my final sub-sequence isn't the longest possible, but I have something that may fit my needs.
If I try to use all possible solutions for one pair and combine them with all possible solutions from the other pairs (each of which may have more than one too), I may end up with exponential time. Am I right?
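For reference, a rough Python sketch of the 2-string DP and the pairwise folding described above; this is the heuristic being asked about, not a guaranteed LCS of the whole set, and the function names are illustrative:

    # Classic O(len(a)*len(b)) LCS of two strings, storing the subsequences
    # themselves for simplicity (a length table plus traceback is leaner).
    def lcs(a, b):
        dp = [[""] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, ca in enumerate(a):
            for j, cb in enumerate(b):
                if ca == cb:
                    dp[i + 1][j + 1] = dp[i][j] + ca
                else:
                    dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j], key=len)
        return dp[len(a)][len(b)]

    # Fold the list pairwise, keeping one LCS per pair, until one string is left.
    def pairwise_lcs(strings):
        while len(strings) > 1:
            strings = [lcs(strings[i], strings[i + 1]) if i + 1 < len(strings)
                       else strings[i]
                       for i in range(0, len(strings), 2)]
        return strings[0]

    print(pairwise_lcs(["XABCY", "ABCZ", "WABC"]))  # ABC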
Yes, what you're missing is correctness: there is no guarantee that the LCS of a pair of strings will have any overlap whatsoever with the LCS of the set overall. Consider this example:
aaabb1xyz
aaabb2xyz
cccdd1xyz
cccdd2xyz
If you pair these in the given order, you'll get LCSs of aaabb and cccdd, missing the xyz for the set.
If, as you say, the strings are almost all identical, perhaps the differences aren't a problem for you. If the not-identical strings are very similar to the "median" string, then your incremental solution will work well enough for your purposes.
Another possibility is to do LCS on random pairs of strings until that median string emerges; then you start from that common point, and you should have a "good enough" solution.

Cycle detection in non-iterated sequence

My understanding is that tortoise-hare style algorithms work on iterated sequences.
That is, sequences of the form x0, f(x0), f(f(x0)), ..., where each element is obtained by applying a function to the previous one.
I would like to implement an algorithm that can detect cycles in both deterministic and non-deterministic infinite repeating sequences.
The sequences may have a non-repeating prefix; for example, the sequence 1666666... has the prefix 1 and the repeating pattern 6.
This algorithm would return the longest repeating pattern in a sequence.
The repeating pattern of 001100110011... would be 0011, the repeating pattern of 22583575837583758... would be 58357.
My idea was to somehow generate a guess of the longest possible pattern length and go from there, but I can't get things in order.
The tortoise-hare algorithm relies on seeing the same value (the same address) again to identify a cycle. This problem requires a different sort of algorithm. Some form of trie, or a structure such as the dictionary used in LZW compression, is where I would look for a solution.
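As a starting point, here is a brute-force Python sketch along the lines of the question's idea, assuming a finite buffer of the sequence has been captured and its tail is already repeating (the name and the choice of smallest period are mine):

    # Try every split into (non-repeating prefix, repeating tail); for each
    # tail, look for the smallest period that reproduces it. Requires at
    # least two repetitions of the pattern to be present in the buffer.
    def find_repeating_pattern(buf):
        for prefix_len in range(len(buf)):
            tail = buf[prefix_len:]
            for period in range(1, len(tail) // 2 + 1):
                pattern = tail[:period]
                repeats = -(-len(tail) // period)        # ceiling division
                if (pattern * repeats)[:len(tail)] == tail:
                    return buf[:prefix_len], pattern
        return buf, None                                 # no repetition seen yet

    print(find_repeating_pattern("1666666"))       # ('1', '6')
    print(find_repeating_pattern("001100110011"))  # ('', '0011')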

Compressing words into one word consisting of them as subwords [duplicate]

I bet somebody has solved this before, but my searches have come up empty.
I want to pack a list of words into a buffer, keeping track of the starting position and length of each word. The trick is that I'd like to pack the buffer efficiently by eliminating the redundancy.
Example: doll dollhouse house
These can be packed into the buffer simply as dollhouse, remembering that doll is four letters starting at position 0, dollhouse is nine letters at 0, and house is five letters at 4.
What I've come up with so far is:
Sort the words longest to shortest: (dollhouse, house, doll)
Scan the buffer to see if the string already exists as a substring, if so note the location.
If it doesn't already exist, add it to the end of the buffer.
Since long words often contain shorter words, this works pretty well, but it should be possible to do significantly better. For example, if I extend the word list to include ragdoll, then my algorithm comes up with dollhouseragdoll which is less efficient than ragdollhouse.
This is a preprocessing step, so I'm not terribly worried about speed. O(n^2) is fine. On the other hand, my actual list has tens of thousands of words, so O(n!) is probably out of the question.
As a side note, this storage scheme is used for the data in the `name' table of a TrueType font, cf. http://www.microsoft.com/typography/otspec/name.htm
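For concreteness, a small Python sketch of the packing scheme described above (sort longest-first, reuse the substring if the buffer already contains it); the function name pack is mine:

    # Pack words into one buffer, recording (offset, length) for each word.
    def pack(words):
        buffer = ""
        offsets = {}
        for w in sorted(words, key=len, reverse=True):
            pos = buffer.find(w)
            if pos == -1:                 # not already a substring: append it
                pos = len(buffer)
                buffer += w
            offsets[w] = (pos, len(w))
        return buffer, offsets

    buf, offsets = pack(["doll", "dollhouse", "house"])
    print(buf)       # dollhouse
    print(offsets)   # {'dollhouse': (0, 9), 'house': (4, 5), 'doll': (0, 4)}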
This is the shortest superstring problem: find the shortest string that contains a set of given strings as substrings. According to this IEEE paper (which you may not have access to unfortunately), solving this problem exactly is NP-complete. However, heuristic solutions are available.
As a first step, you should find all strings that are substrings of other strings and delete them (of course you still need to record their positions relative to the containing strings somehow). These fully-contained strings can be found efficiently using a generalised suffix tree.
Then, by repeatedly merging the two strings having longest overlap, you are guaranteed to produce a solution whose length is not worse than 4 times the minimum possible length. It should be possible to find overlap sizes quickly by using two radix trees as suggested by a comment by Zifre on Konrad Rudolph's answer. Or, you might be able to use the generalised suffix tree somehow.
I'm sorry I can't dig up a decent link for you -- there doesn't seem to be a Wikipedia page, or any publicly accessible information on this particular problem. It is briefly mentioned here, though no suggested solutions are provided.
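A hedged sketch of that greedy heuristic in Python (substring removal followed by repeatedly merging the pair with the largest overlap); the quadratic overlap search here stands in for the radix-tree or suffix-tree speed-ups mentioned above, and the names are illustrative:

    def overlap(a, b):
        """Length of the longest suffix of a that is also a prefix of b."""
        for k in range(min(len(a), len(b)), 0, -1):
            if a.endswith(b[:k]):
                return k
        return 0

    def greedy_superstring(words):
        # drop exact duplicates and strings fully contained in another string
        words = list(dict.fromkeys(words))
        words = [w for w in words if not any(w != o and w in o for o in words)]
        while len(words) > 1:
            k, i, j = max((overlap(a, b), i, j)
                          for i, a in enumerate(words)
                          for j, b in enumerate(words) if i != j)
            merged = words[i] + words[j][k:]
            words = [w for idx, w in enumerate(words) if idx not in (i, j)]
            words.append(merged)
        return words[0]

    print(greedy_superstring(["doll", "dollhouse", "house", "ragdoll"]))  # ragdollhouse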
I think you can use a Radix Tree. It costs some memory because of pointers to leaves and parents, but it is easy to match up strings (O(k), where k is the length of the longest string).
My first thought here is: use a data structure to determine common prefixes and suffixes of your strings. Then sort the words taking these prefixes and suffixes into account. This would result in your desired ragdollhouse.
Looks similar to the Knapsack problem, which is NP-complete, so there is not a "definitive" algorithm.
I did a lab back in college where we were tasked with implementing a simple compression program.
What we did was sequentially apply these techniques to text:
BWT (Burrows-Wheeler transform): helps reorder letters into sequences of identical letters (hint: there are mathematical substitutions for getting the letters instead of actually doing the rotations)
MTF (Move to front transform): Rewrites the sequence of letters as a sequence of indices of a dynamic list.
Huffman encoding: A form of entropy encoding that constructs a variable-length code table in which shorter codes are given to frequently encountered symbols and longer codes are given to infrequently encountered symbols
Here, I found the assignment page.
To get back your original text, you do (1) Huffman decoding, (2) inverse MTF, and then (3) inverse BWT. There are several good resources on all of this on the Interwebs.
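To give a flavour of the middle step, here is a tiny move-to-front sketch in Python (BWT and Huffman coding are left out, and the names are mine):

    # Move-to-front: each symbol is replaced by its current position in a
    # table, and the symbol is then moved to the front of the table. Runs of
    # identical symbols (as produced by BWT) turn into runs of zeros.
    def mtf_encode(text, alphabet):
        table, out = list(alphabet), []
        for ch in text:
            idx = table.index(ch)
            out.append(idx)
            table.insert(0, table.pop(idx))
        return out

    def mtf_decode(indices, alphabet):
        table, out = list(alphabet), []
        for idx in indices:
            ch = table[idx]
            out.append(ch)
            table.insert(0, table.pop(idx))
        return "".join(out)

    codes = mtf_encode("bananaaa", "abn")
    print(codes)                     # [1, 1, 2, 1, 1, 1, 0, 0]
    print(mtf_decode(codes, "abn"))  # bananaaa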
Refine step 3.
Look through current list and see whether any word in the list starts with a suffix of the current word. (You might want to keep the suffix longer than some length - longer than 1, for example).
If yes, then add the distinct prefix to this word as a prefix to the existing word, and adjust all existing references appropriately (slow!)
If no, add word to end of list as in current step 3.
This would give you 'ragdollhouse' as the stored data in your example. It is not clear whether it would always work optimally (if you also had 'barbiedoll' and 'dollar' in the word list, for example).
I would not reinvent this wheel yet again. An enormous amount of effort has already gone into compression algorithms; why not use one of the already available ones?
Here are a few good choices:
gzip for fast compression / decompression speed
bzip2 for a bit better compression but much slower decompression
LZMA for very high compression ratio and fast decompression (faster than bzip2 but slower than gzip)
lzop for very fast compression / decompression
If you use Java, gzip is already integrated.
It's not clear what you want to do.
Do you want a data structure that lets you store the strings in a memory-conscious manner while still making operations like search possible in a reasonable amount of time?
Do you just want an array of words, compressed?
In the first case, you can go for a patricia trie or a String B-Tree.
For the second case, you can just adopt some index compression technique, like this:
If you have something like:
aaa
aaab
aasd
abaco
abad
You can compress it like this:
0aaa
3b
2sd
1baco
3d
The number is the length of the largest common prefix with the preceding string.
You can tweak this scheme, e.g. by planning a "restart" of the common prefix after every K words, for fast reconstruction.
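A small Python sketch of that front-coding scheme (encode and decode); the function names are mine:

    # Each word (after sorting) is stored as the length of the prefix it
    # shares with the previous word, followed by the remaining characters.
    def front_encode(words):
        out, prev = [], ""
        for w in sorted(words):
            k = 0
            while k < min(len(w), len(prev)) and w[k] == prev[k]:
                k += 1
            out.append((k, w[k:]))
            prev = w
        return out

    def front_decode(entries):
        words, prev = [], ""
        for k, rest in entries:
            prev = prev[:k] + rest
            words.append(prev)
        return words

    enc = front_encode(["aaa", "aaab", "aasd", "abaco", "abad"])
    print(enc)                # [(0, 'aaa'), (3, 'b'), (2, 'sd'), (1, 'baco'), (3, 'd')]
    print(front_decode(enc))  # ['aaa', 'aaab', 'aasd', 'abaco', 'abad']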

Finding partial substrings within a string

I have two strings which must be compared for similarity. The algorithm must be designed to find the maximal similarity. In this instance, the ordering matters, but intervening (or missing) characters do not. Edit distance cannot be used in this case for various reasons.
The situation is basically as follows:
string 1: ABCDEFG
string 2: AFENBCDGRDLFG
the resulting algorithm would find the substrings A, BCD, FG
I currently have a recursive solution, but because this must be run on massive amounts of data, any improvements would be greatly appreciated
Looking at your sole example, it looks like you want to find the longest common subsequence.
Take a look at LCS
Is it just me, or is this NP-hard? – David Titarenco (from comment)
If you want the LCS of an arbitrary number of strings, it's NP-hard. But if the number of input strings is constant (as in this case, 2), it can be done in polynomial time.
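For the two-string case in the question, here is a sketch that computes an LCS by dynamic programming and then splits the matched characters into runs that are contiguous in both strings (giving A, BCD, FG for the example above); the name lcs_runs and the traceback details are my own choices:

    def lcs_runs(s1, s2):
        n, m = len(s1), len(s2)
        # dp[i][j] = LCS length of s1[i:] and s2[j:]
        dp = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n - 1, -1, -1):
            for j in range(m - 1, -1, -1):
                if s1[i] == s2[j]:
                    dp[i][j] = dp[i + 1][j + 1] + 1
                else:
                    dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
        # walk the table to recover one LCS as a list of (i, j) match positions
        i = j = 0
        pairs = []
        while i < n and j < m:
            if s1[i] == s2[j]:
                pairs.append((i, j))
                i, j = i + 1, j + 1
            elif dp[i + 1][j] >= dp[i][j + 1]:
                i += 1
            else:
                j += 1
        # split the matches into runs that are contiguous in BOTH strings
        runs, start = [], 0
        for k in range(1, len(pairs) + 1):
            if (k == len(pairs)
                    or pairs[k][0] != pairs[k - 1][0] + 1
                    or pairs[k][1] != pairs[k - 1][1] + 1):
                runs.append(s1[pairs[start][0]:pairs[k - 1][0] + 1])
                start = k
        return runs

    print(lcs_runs("ABCDEFG", "AFENBCDGRDLFG"))  # ['A', 'BCD', 'FG']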

Efficient most common suffix algorithm?

I have a few GBs worth of strings, and for every prefix I want to find the 10 most common suffixes. Is there an efficient algorithm for that?
An obvious solution would be:
Store sorted list of <string, count> pairs.
Identify by binary search the extent (contiguous range) covering the prefix we're searching for.
Find 10 highest counts in this extent.
Possibly precompute it for all short prefixes, so a query never needs to look at a large portion of the data.
I'm not sure if that would actually be efficient at all. Is there a better way I overlooked?
Answers must be real time, but it can take as much preprocessing as necessary.
Place the words in a tree, e.g. a trie or radix tree, with a "number of occurrences" counter at each full word, so you know which nodes are word endings and how common they are.
Find the prefix/suffix combos by iterating over the tree.
Both these operations are O(n*k) where k is the length of the longest word; this is the same complexity as a hash-table.
The HAT-trie is a cache-conscious version that promises high performance.
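A minimal in-memory Python sketch of the counting-trie idea; for a few GBs of data you would precompute per-node top-k lists instead of the full scan done here, and all the names are illustrative:

    import heapq

    class TrieNode:
        __slots__ = ("children", "count")
        def __init__(self):
            self.children = {}
            self.count = 0          # how many stored strings end exactly here

    def insert(root, word, count=1):
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.count += count

    def top_suffixes(root, prefix, k=10):
        node = root
        for ch in prefix:           # walk down to the prefix node
            node = node.children.get(ch)
            if node is None:
                return []
        results, stack = [], [(node, "")]
        while stack:                # collect every completion below the prefix
            cur, suffix = stack.pop()
            if cur.count:
                results.append((cur.count, suffix))
            for ch, child in cur.children.items():
                stack.append((child, suffix + ch))
        return heapq.nlargest(k, results)

    root = TrieNode()
    for w in ["carpet", "carpool", "carpool", "car", "cat"]:
        insert(root, w)
    print(top_suffixes(root, "car"))  # [(2, 'pool'), (1, 'pet'), (1, '')]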
