Find longest common substring of multiple strings using factor oracle enhanced with LRS array - algorithm

Can we use a factor oracle with suffix links (paper here) to compute the longest common substring of multiple strings? Here, substring means any contiguous part of the original string. For example, "abc" is a substring of "ffabcgg", while "abg" is not.
I've found a way to compute the maximum-length common substring of two strings s1 and s2. It works by concatenating the two strings with a character that appears in neither, '$' for example. Then, for each prefix of the concatenated string s with length i >= |s1| + 2, we calculate its LRS (longest repeated suffix) length lrs[i] and sp[i] (the end position of the first occurrence of its LRS). Finally, the answer is
max{lrs[i]| i >= |s1| + 2 and sp[i] <= |s1|}
I've written a C++ program that uses this method, which can solve the problem within 200ms on my laptop when |s1|+|s2| <= 200000, using the factor oracle.
s1 = 'ffabcgg'
s2 = 'gfbcge'
s = s1+'$'+s2
= 'ffabcgg$gfbcge'
p: 0 1 2 3 4 5 6 7 8 9 10 11 12 13
s: f f a b c g g $ g f b c g e
sp: 0 1 0 0 0 0 6 0 6 1 4 5 6 0
lrs:0 1 0 0 0 0 1 0 1 1 1 2 3 0
ans = 3 (attained at the prefix of length 13, which ends at p = 12: lrs = 3 and sp = 6 <= |s1| = 7)
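For concreteness, here is a naive Python sketch of the formula above. It recomputes lrs and sp by brute force rather than with a factor oracle, so it is only suitable for checking small examples, but it mirrors the max{lrs[i]} rule exactly:

```python
def lcs_via_lrs(s1, s2):
    """Brute-force illustration of the lrs/sp formula (not a factor oracle)."""
    s = s1 + '$' + s2
    n1 = len(s1)
    ans = 0
    for i in range(n1 + 2, len(s) + 1):   # prefix lengths, i >= |s1| + 2
        prefix = s[:i]
        for L in range(i - 1, 0, -1):     # longest candidate suffix first
            suf = prefix[-L:]
            sp = prefix.find(suf) + L     # 1-based end of first occurrence
            if sp < i:                    # the suffix really repeats: L == lrs[i]
                if sp <= n1:              # first occurrence ends inside s1
                    ans = max(ans, L)
                break
    return ans
```

Running it on the worked example returns 3, matching the table above.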
I know both problems can be solved efficiently using a suffix array or a suffix tree, but I wonder whether there is a method using a factor oracle. I am interested in this because the factor oracle is easy to construct (about 30 lines of C++; a suffix array needs about 60, and a suffix tree about 150), and it runs faster than both.
You can test a method for the first problem on this OnlineJudge, and for the second problem here.

Can we use a factor-oracle with suffix link (paper here) to compute
the longest common substring of multiple strings?
I don't think the algorithm is a very good fit (it is designed to factor a single string), but you can use it by concatenating the original strings with unique separators.
Given abcdefg and hijcdekl and mncdop, find the longest common substring cd:
# combine with unique joiners
>>> s = "abcdefg" + "1" + "hijcdekl" + "2" + "mncdop"
>>> factor_oracle(s)
"cd"
As part of its linear time-and-space construction, the factor oracle quickly rediscovers the break points between the input strings during its search for common factors (the unique joiners provide an immediate cue to stop extending the best factor found so far).
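For checking small cases, a brute-force reference (plain Python, no oracle and no joiners) can stand in for `factor_oracle`; it simply tries every substring of the shortest input, longest first:

```python
def lcs_multi(strings):
    """Reference brute force for the longest common substring of
    several strings -- a slow stand-in for the factor-oracle search."""
    bi = min(range(len(strings)), key=lambda i: len(strings[i]))
    base = strings[bi]                    # candidates come from the shortest
    others = [t for i, t in enumerate(strings) if i != bi]
    for length in range(len(base), 0, -1):
        for start in range(len(base) - length + 1):
            cand = base[start:start + length]
            if all(cand in t for t in others):
                return cand
    return ""
```

On the example above, `lcs_multi(["abcdefg", "hijcdekl", "mncdop"])` yields "cd".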


Checking if two strings are equal after removing a subset of characters from both

I recently came across this problem:
You are given two strings, s1 and s2, comprised entirely of lowercase letters 'a' through 'r', and need to process a series of queries. Each query provides a subset of lowercase English letters from 'a' through 'r'. For each query, determine whether s1 and s2, when restricted only to the letters in the query, are equal.
s1 and s2 can contain up to 10^5 characters, and there are up to 10^5 queries.
For instance, if s1 is "aabcd" and s2 is "caabd", and you are asked to process a query with the subset "ac", then s1 becomes "aac" while s2 becomes "caa". These don't match, so the query would return false.
I was able to solve this in O(N^2) time as follows: for each query, I checked whether s1 and s2 would be equal by iterating through both strings one character at a time, skipping the characters that do not lie within the subset of allowed characters, and checking whether the "allowed" characters from both s1 and s2 match. If at some point the characters don't match, the strings are not equal. Otherwise, s1 and s2 are equal when restricted to the letters in the query. Each query takes O(N) time to process, and there are N queries, for a total of O(N^2) time.
However, I was told that there was a way to solve this faster in O(N). Does anyone know how this might be done?
The first obvious speedup is to ensure your set membership test is O(1). To do that, there are a couple of options:
Represent every letter as a single bit -- now every character maps to an 18-bit value with only one bit set. The set of allowed characters is then a mask with these bits ORed together, and you can test membership of a character with a bitwise-AND;
Alternatively, you can have an 18-value array and index it by character (c - 'a' gives a value between 0 and 17). The test for membership is then basically the cost of an array lookup (and you can save operations by skipping the subtraction -- just make the array larger and index directly by character).
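A minimal Python sketch of the bitmask option (the function names are illustrative):

```python
def make_mask(query):
    """Build an 18-bit mask with one bit per letter 'a'..'r'."""
    m = 0
    for ch in query:
        m |= 1 << (ord(ch) - ord('a'))
    return m

def in_set(mask, ch):
    """O(1) membership test via bitwise-AND."""
    return bool(mask & (1 << (ord(ch) - ord('a'))))
```

For example, `in_set(make_mask("ac"), 'a')` is true while `in_set(make_mask("ac"), 'b')` is false.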
Thought experiment
The next potential speedup is to recognize that any character which does not appear exactly the same number of times in both strings instantly fails the match. You can count all character frequencies in both strings with a histogram, which takes O(N) time. This way, you can prune the search whenever such a character appears in a query, and you can test for this in constant time.
Of course, that won't help for a real stress-test which will guarantee that all possible letters have a frequency matched in both strings. So, what do you do then?
Well, you extend the above premise by recognizing that if some position of character x in string 1 and some position of that character in string 2 would be a valid match (i.e. the same number of occurrences of x appears in both strings up to their respective positions), then the total count of any other character up to those positions must also be equal. Any character for which that is not true cannot possibly be compatible with character x.
Concept
Let's start by thinking about this in terms of a technique known as memoization where you can leverage precomputed or partially-computed information and get a whole lot out of it. So consider two strings like this:
a b a a c d e | e a b a c a d
What useful thing can you do here? Well, why not store the partial sums of counts for each letter:
a b a a c d e | e a b a c a d
----------------------|----------------------
freq(a) = 1 1 2 3 3 3 3 | 0 1 1 2 2 3 3
freq(b) = 0 1 1 1 1 1 1 | 0 0 1 1 1 1 1
freq(c) = 0 0 0 0 1 1 1 | 0 0 0 0 1 1 1
freq(d) = 0 0 0 0 0 1 1 | 0 0 0 0 0 0 1
freq(e) = 0 0 0 0 0 0 1 | 1 1 1 1 1 1 1
This uses a whole lot of memory, but don't worry -- we'll deal with that later. Instead, take the time to absorb what we're actually doing.
Looking at the table above, we have the running character count totals for both strings at every position in those strings.
Now let's see how our matching rules work by showing an example of a matching query "ab" and a non-matching query "acd":
For "ab":
a b a a c d e | e a b a c a d
----------------------|----------------------
freq(a) = 1 1 2 3 3 3 3 | 0 1 1 2 2 3 3
freq(b) = 0 1 1 1 1 1 1 | 0 0 1 1 1 1 1
^ ^ ^ ^ ^ ^ ^ ^
We scan the frequency arrays until we locate one of our letters in the query. The locations I have marked with ^ above. If you remove all the unmarked columns, you'll see the remaining columns match on both sides. So this is a match.
For "acd":
a b a a c d e | e a b a c a d
----------------------|----------------------
freq(a) = 1 1 2 3 3 3 3 | 0 1 1 2 2 3 3
freq(c) = 0 0 0 0 1 1 1 | 0 0 0 0 1 1 1
freq(d) = 0 0 0 0 0 1 1 | 0 0 0 0 0 0 1
^ ^ # # ^ ^ ^ # # ^
Here, all columns are matching except those marked with #.
Putting it together
All right, so you can see how this works, but you may be wondering about the runtime, because the examples above seem to involve even more scanning than you were doing before!
Here's where things get interesting, and where our character frequency counts really kick in.
First, observe what we're actually doing on those marked columns. For any one character that we care about (for example, the 'a'), we're looking for only the positions in both strings where its count matches, and then we're comparing these two columns to see what other values match. This gives us a set of all other characters that are valid when used with 'a'. And of course, 'a' itself is also valid.
And what was our very first optimization? A bitset -- an 18-bit value that represents valid characters. You can now use this. For the two columns in each string, you set a 1 for characters with matching counts and a 0 for characters with non-matching counts. If you process every single pair of matching 'a' values in this manner, what you get is a bunch of sets of characters that work with 'a'. And you can keep a running "master" set that represents the intersection of these -- you just intersect it with each intermediate set you calculate, which is a single bitwise-AND.
By the time you reach the end of both strings, you have performed an O(N) search in which you examined 18 rows each time you encountered an 'a'. And the result is the set of all characters that work with 'a'.
Now repeat for the other characters, one at a time. Each pass is an O(N) scan as above, and you wind up with the set of all characters that work with the one you're processing.
After processing all the rows you have an array of 18 values, each representing the set of all characters that work with one particular character. The operation took O(18N) time and O(18N) storage.
Query
Since you now have, for each character, the set of all characters that work with it, you simply look up each character in the query and intersect their sets (again, with bitwise-AND). One further intersection, with the set of all characters present in the query, prunes off the characters that are not relevant.
This leaves you with a value which, for all values in the query, represents the set of all values that can result in a matching string. So if this value is then equal to the query then you have a match. Otherwise, you don't.
This is the part that is now fast. It has essentially reduced your query tests to constant time. However, the original indexing did cost us a lot of memory. What can be done about that?
Dynamic Programming
Is it really necessary to allocate all that storage for our frequency arrays?
Well, actually, no. It was useful as a visual tool, laying out the counts in tabular form to explain the method conceptually, but many of those values are rarely needed, and storing them makes the method seem more complicated than it is.
The good news is we can compute our master sets at the same time as computing character counts, without needing to store any frequency arrays. Remember that when we're computing the counts we use a histogram, which is as simple as having one small 18-value array where you say count[c] += 1 (if c is a character or an index derived from that character). So we can save a ton of memory if we just do the following:
Processing the set (mask) of all compatible characters for character c:
Initialize the mask for character c to all 1s (mask[c] = (1 << 18) - 1) -- this represents all characters are currently compatible. Initialize a character histogram (count) to all zero.
Walk through string 1 until you reach character c. For every character x you encounter along the way, increase its count in the histogram (count[x]++).
Walk through string 2 until you reach character c. For every character x you encounter along the way, decrease its count in the histogram (count[x]--).
Construct a 'good' set where any character that currently has a zero-count has a 1-bit, otherwise 0-bit. Intersect this with the current mask for c (using bitwise-AND): mask[c] &= good
Continue from step 2 until you have reached the end of both strings. If you reach the end of one of the strings prematurely, then the character count does not match and so you set the mask for this character to zero: mask[c] = 0
Repeat from 1 for every character, until all characters are done.
Above, we have basically the same time complexity of O(18N), except with absolutely minimal extra storage -- just one histogram of counts and one array of masks.
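The steps above can be sketched in Python (the function names are my own; the author's C++ version is linked further down):

```python
ALPHA = "abcdefghijklmnopqr"

def build_masks(s1, s2, alphabet=ALPHA):
    K = len(alphabet)
    idx = {ch: k for k, ch in enumerate(alphabet)}
    full = (1 << K) - 1
    masks = [full] * K
    for c in alphabet:
        count = [0] * K
        mask = full                         # step 1: all compatible
        i = j = 0
        while True:
            while i < len(s1) and s1[i] != c:   # step 2: walk string 1
                count[idx[s1[i]]] += 1
                i += 1
            while j < len(s2) and s2[j] != c:   # step 3: walk string 2
                count[idx[s2[j]]] -= 1
                j += 1
            found1, found2 = i < len(s1), j < len(s2)
            if found1 != found2:            # step 6: unequal number of c's
                mask = 0
                break
            good = 0                        # step 4: balanced counts -> 1-bit
            for x in range(K):
                if count[x] == 0:
                    good |= 1 << x
            mask &= good
            if not found1:                  # both strings exhausted
                break
            i += 1                          # step past this pair of c's
            j += 1
        masks[idx[c]] = mask
    return masks

def answer_query(masks, q, alphabet=ALPHA):
    idx = {ch: k for k, ch in enumerate(alphabet)}
    qmask = 0
    for ch in q:
        qmask |= 1 << idx[ch]
    m = qmask
    for ch in set(q):                       # intersect the per-letter sets
        m &= masks[idx[ch]]
    return m == qmask
```

On the question's example, the query "ac" fails and "ab" succeeds, as expected.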
Combining techniques like the above to solve seemingly complex combinatorial problems really fast is commonly referred to as Dynamic Programming. We reduced the problem down to a truth table representing all possible characters that work with any single character. The time complexity remains linear with respect to the length of the strings, and only scales by the number of possible characters.
Here is the algorithm above hacked together in C++: https://godbolt.org/z/PxzYvGs8q
Let RESTRICT(s,q) be the restriction of string s to the letters in the set q.
If q contains more than two letters, then the full string RESTRICT(s,q) can be reconstructed from all the strings RESTRICT(s,qij) where qij is a pair of letters in q.
Therefore, RESTRICT(s1,q) = RESTRICT(s2,q) if and only if RESTRICT(s1,qij) = RESTRICT(s2,qij) for all pairs qij in q.
Since you are restricted to 18 letters, there are only 153 letter pairs, or only 171 fundamental single- or double-letter queries.
If you precalculate the results of these 171 queries, then you can answer any other more complicated query just by combining their results, without inspecting the string at all.
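A sketch of this pair-precomputation idea in Python (`restrict`, `preprocess`, and the dictionary layout are illustrative):

```python
from itertools import combinations

def restrict(s, q):
    """RESTRICT(s, q): keep only the letters of s that are in set q."""
    return "".join(ch for ch in s if ch in q)

def preprocess(s1, s2, alphabet="abcdefghijklmnopqr"):
    """Answer all 171 single- and double-letter queries up front."""
    ok = {}
    for ch in alphabet:
        ok[frozenset(ch)] = restrict(s1, {ch}) == restrict(s2, {ch})
    for a, b in combinations(alphabet, 2):
        ok[frozenset((a, b))] = restrict(s1, {a, b}) == restrict(s2, {a, b})
    return ok

def answer(ok, q):
    """Combine the precomputed pair results; no string inspection."""
    letters = sorted(set(q))
    if len(letters) == 0:
        return True
    if len(letters) == 1:
        return ok[frozenset(letters)]
    return all(ok[frozenset(p)] for p in combinations(letters, 2))
```

With s1 = "aabcd" and s2 = "caabd", `answer(ok, "ac")` is false and `answer(ok, "ab")` is true, matching the example in the question.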

Palindrome partitioning with interval scheduling

So I was looking at the various algorithms of solving Palindrome partitioning problem.
Like for a string "banana" minimum no of cuts so that each sub-string is a palindrome is 1 i.e. "b|anana"
Now I tried solving this problem using interval scheduling like:
Input: banana
Transformed string: # b # a # n # a # n # a #
P[] = lengths of palindromes considering each character as center of palindrome.
I[] = intervals
String: # b # a # n # a # n # a #
P[i]: 0 1 0 1 0 3 0 5 0 3 0 1 0
I[i]: 0 1 2 3 4 5 6 7 8 9 10 11 12
Example: Palindrome considering 'a' (index 7) as center is 5 "anana"
Now constructing intervals for each character based on P[i]:
b = (0,2)
a = (2,4)
n = (2,8)
a = (2,12)
n = (6,12)
a = (10,12)
So, now if I have to schedule these many intervals on time 0 to 12 such that minimum no of intervals should be scheduled and no time slot remain empty, I would choose (0,2) and (2,12) intervals and hence the answer for the solution would be 1 as I have broken down the given string in two palindromes.
Another test case:
String: # E # A # B # A # E # A # B #
P[i]: 0 1 0 1 0 5 0 1 0 5 0 1 0 1 0
I[i]: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Plotting these intervals on a graph (figure omitted):
Now, the minimum no of intervals that can be scheduled are either:
1(0,2), 2(2,4), 5(4,14) OR
3(0,10), 6(10,12), 7(12,14)
Hence, we have 3 partitions so the no of cuts would be 2 either
E|A|BAEAB
EABAE|A|B
These are just examples. I would like to know if this algorithm will work for all cases or there are some cases where it would definitely fail.
Please help me achieve a proof that it will work in every scenario.
Note: Please don't discourage me if this post makes no sense as i have put enough time and effort on this problem, just state a reason or provide some link from where I can move forward with this solution. Thank you.
As long as you can get a partition of the string, your algorithm will work.
Recall that a partition P of a set S is a set of non-empty subsets A1, ..., An such that:
The union of the sets A1, ..., An gives the set S
The intersection of any Ai, Aj (with i != j) is empty
Even though palindrome partitioning deals with strings (which are a bit different from sets), the properties of a partition still hold.
Hence, if you have a partition, you consequently have a set of time intervals without "holes" to schedule.
Choosing the partition with the minimum number of subsets, makes you have the minimum number of time intervals and therefore the minimum number of cuts.
Furthermore, you always have at least one palindrome partition of a string: in the worst case, you get a palindrome partition made of single characters.
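For testing the interval-scheduling idea against known answers, the standard O(n^2) dynamic program for minimum palindrome cuts makes a handy reference (this is a sketch of the classical DP, not the interval algorithm itself):

```python
def min_cuts(s):
    """Classical DP: minimum cuts so every piece is a palindrome."""
    n = len(s)
    # pal[i][j]: s[i..j] is a palindrome
    pal = [[False] * n for _ in range(n)]
    for i in range(n - 1, -1, -1):
        for j in range(i, n):
            pal[i][j] = s[i] == s[j] and (j - i < 2 or pal[i + 1][j - 1])
    # cuts[j]: minimum cuts for the prefix s[0..j]
    cuts = [0] * n
    for j in range(n):
        if pal[0][j]:
            cuts[j] = 0
        else:
            cuts[j] = min(cuts[i - 1] + 1 for i in range(1, j + 1) if pal[i][j])
    return cuts[n - 1]
```

Both worked examples check out: "banana" needs 1 cut, and "eabaeab" needs 2.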

Converting a number into a special base system

I want to convert a number in base 10 into a special base form like this:
A*2^2 + B*3^1 + C*2^0
A can take on values of [0,1]
B can take on values of [0,1,2]
C can take on values of [0,1]
For example, the number 8 would be
1*2^2 + 1*3 + 1.
It is guaranteed that the given number can be converted to this specialized base system.
I know how to convert from this base system back to base-10, but I do not know how to convert from base-10 to this specialized base system.
In short, treat every base term (2^2, 3^1, 2^0 in your example) as the weight of an item, and the whole number as the capacity of a bag. The problem asks for a combination of these items that fills the bag exactly.
In general this problem is NP-complete. It is essentially the subset-sum problem, which can also be seen as a special case of the knapsack problem.
Despite this fact, the problem can be solved by a pseudo-polynomial-time dynamic-programming algorithm in O(nW) time, where n is the number of base terms and W is the number to decompose. The details can be found on this Wikipedia page: http://en.wikipedia.org/wiki/Knapsack_problem#Dynamic_programming and this SO page: What's it called when I want to choose items to fill container as full as possible - and what algorithm should I use?.
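Since the digit ranges here are tiny, even a simple backtracking search over the digits works; a sketch (`decompose` and its parameter names are illustrative):

```python
def decompose(n, bases, limits):
    """Find digits d[i] with 0 <= d[i] <= limits[i] such that
    sum(d[i] * bases[i]) == n, or None if no assignment exists."""
    def dfs(i, rem):
        if i == len(bases):
            return [] if rem == 0 else None
        for d in range(limits[i], -1, -1):   # try large digits first
            if d * bases[i] <= rem:
                rest = dfs(i + 1, rem - d * bases[i])
                if rest is not None:
                    return [d] + rest
        return None
    return dfs(0, n)
```

For the question's system, `decompose(8, [4, 3, 1], [1, 2, 1])` returns [1, 1, 1], i.e. 1*2^2 + 1*3 + 1, while 2 has no representation.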
Simplifying your "special base":
X = A * 4 + B * 3 + C
A E {0,1}
B E {0,1,2}
C E {0,1}
Obviously the largest number that can be represented is 4 + 2 * 3 + 1 = 11
To figure out how to get the values of A, B, C you can do one of two things:
There are only 12 possible inputs: create a lookup table. Ugly, but quick.
Use some algorithm. A bit trickier.
Let's look at (1) first:
A B C X
0 0 0 0
0 0 1 1
0 1 0 3
0 1 1 4
0 2 0 6
0 2 1 7
1 0 0 4
1 0 1 5
1 1 0 7
1 1 1 8
1 2 0 10
1 2 1 11
Notice that 2 and 9 cannot be expressed in this system, while 4 and 7 occur twice. The fact that you have multiple possible solutions for a given input is a hint that there isn't a really robust algorithm (other than a look up table) to achieve what you want. So your table might look like this:
int A[] = {0,0,-1,0,0,1,0,1,1,-1,1,1};
int B[] = {0,0,-1,1,1,0,2,1,1,-1,2,2};
int C[] = {0,1,-1,0,1,1,0,0,1,-1,0,1};
Then look up A, B, C. If A < 0, there is no solution.

Consolidate 10 bit Value into a Unique Byte

As part of an algorithm I'm writing, I need to find a way to convert a 10-bit word into a unique 8-bit word. The 10-bit word is made up of 5 pairs, where each pair can only ever equal 0, 1 or 2 (never 3). For example:
|00|10|00|01|10|
This value needs to somehow be consolidated into a single, unique byte.
As each pair can never equal 3, there are a wide range of values that this 10-bit word will never represent, which makes me think that it is possible to create an algorithm to perform this conversion. The simplest way to do this would be to use a lookup table, but it seems like a waste of resources to store ~680 values which will only be used once in my program. I've already tried to incorporate one of the pairs into the others somehow, but every attempt I've made has resulted in a non-unique value, and I'm now very quickly running out of ideas!
Any help?
The number you have is essentially base 3. You just need to convert this to base 2.
There are 5 pairs, so 3^5 = 243 numbers. And 8 bits is 2^8 = 256 numbers, so it's possible.
The simplest way to convert between bases is to go to base 10 first.
So, for your example:
00|10|00|01|10
Base 3: 02012
Base 10: 2*3^3 + 1*3^1 + 2*3^0
= 54 + 3 + 2
= 59
Base 2:
59 % 2 = 1
/2 29 % 2 = 1
/2 14 % 2 = 0
/2 7 % 2 = 1
/2 3 % 2 = 1
/2 1 % 2 = 1
So 111011 is your number in binary
This explains the above process in a bit more detail.
Note that once you have 59 above stored in a 1-byte integer, you'll probably already have what you want, thus explicitly converting to base 2 might not be necessary.
What you basically have is a base-3 number, and you want to convert it to a single number 0 - 255; luckily, 5 ternary (base-3) digits give 243 combinations.
What you'll need to do is:
Digit Action
( 1st x 3^4)
+ (2nd x 3^3)
+ (3rd x 3^2)
+ (4th x 3)
+ (5th)
This will give you a number 0 to 242.
You are considering storing some information in a byte. A byte can represent at most 2^8 = 256 states.
Your value has a total of 3^5 = 243 < 256 states. That makes the conversion possible.
Consider your pairs as ABCDE (each character can be 0, 1 or 2).
You can just calculate A*3^4 + B*3^3 + C*3^2 + D*3 + E as your result. I guarantee the result will be in the range 0 -- 242.
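All three answers describe the same base-3 packing; a small Python sketch (the `pack`/`unpack` names are mine):

```python
def pack(pairs):
    """Pack five base-3 digits (most significant first) into one byte."""
    assert len(pairs) == 5 and all(0 <= p <= 2 for p in pairs)
    v = 0
    for p in pairs:
        v = v * 3 + p           # Horner's rule: v = sum(p_i * 3^(4-i))
    return v

def unpack(v):
    """Recover the five base-3 digits from the packed byte."""
    pairs = []
    for _ in range(5):
        v, d = divmod(v, 3)
        pairs.append(d)
    return pairs[::-1]
```

The worked example |00|10|00|01|10| is the digit list [0, 2, 0, 1, 2], which packs to 59 and unpacks back again.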

String similarity: how exactly does Bitap work?

I'm trying to wrap my head around the Bitap algorithm, but am having trouble understanding the reasons behind the steps of the algorithm.
I understand the basic premise of the algorithm, which is (correct me if i'm wrong):
Two strings: PATTERN (the desired string)
TEXT (the String to be perused for the presence of PATTERN)
Two indices: i (currently processing index in PATTERN), 1 <= i < PATTERN.SIZE
j (arbitrary index in TEXT)
Match state S(x): S(PATTERN(i)) = S(PATTERN(i-1)) && PATTERN[i] == TEXT[j], S(0) = 1
In english terms, PATTERN.substring(0,i) matches a substring of TEXT if the previous substring PATTERN.substring(0, i-1) was successfully matched and the character at PATTERN[i] is the same as the character at TEXT[j].
What I don't understand is the bit-shifting implementation of this. The official paper detailing this algorithm basically lays it out, but I can't seem to visualize what's supposed to go on. The algorithm specification is only the first 2 pages of the paper, but I'll highlight the important parts:
(The figures from the paper are omitted here: the bit-shifting formulation of the concept, the T table for a sample search string, and a trace of the algorithm.)
Specifically, I don't understand what the T table signifies, and the reason behind ORing an entry in it with the current state.
I'd be grateful if anyone can help me understand what exactly is going on
T is slightly confusing because you would normally number positions in the
pattern from left to right:
0 1 2 3 4
a b a b c
...whereas bits are normally numbered from right to left.
But writing the
pattern backwards above the bits makes it clear:
bit: 4 3 2 1 0
c b a b a
T[a] = 1 1 0 1 0
c b a b a
T[b] = 1 0 1 0 1
c b a b a
T[c] = 0 1 1 1 1
c b a b a
T[d] = 1 1 1 1 1
Bit n of T[x] is 0 if x appears in position n, or 1 if it does not.
Equivalently, you can think of this as saying that if the current character
in the input string is x, and you see a 0 in position n of T[x], then you
can only possibly be matching the pattern if the match started n characters
previously.
Now to the matching procedure. A 0 in bit n of the state means that we started matching the pattern n characters ago (where 0 is the current character). Initially, nothing matches.
[start]
1 1 1 1 1
As we consume characters trying to match, the state is shifted left (which shifts a zero in
to the bottom bit, bit 0) and OR-ed with the table entry for the current character. The first character is a; shifting left and OR-ing in T[a] gives:
a
1 1 1 1 0
The 0 bit that was shifted in is preserved, because a current character of a can begin a match of the pattern. For any other character, the bit would have been set to 1.
The fact that bit 0 of the state is now 0 means that we started matching the pattern on
the current character; continuing, we get:
a b
1 1 1 0 1
...because the 0 bit has been shifted left - think of it as saying that we started matching the pattern 1 character ago - and T[b] has a 0 in the same position, telling us that seeing a b in the current position is good if we started matching 1 character ago.
a b d
1 1 1 1 1
d can't match anywhere; all the bits get set back to 1.
a b d a
1 1 1 1 0
As before.
a b d a b
1 1 1 0 1
As before.
b d a b a
1 1 0 1 0
a is good if the match started either 2 characters ago or on the current character.
d a b a b
1 0 1 0 1
b is good if the match started either 1 or 3 characters ago. The 0 in bit 3 means
that we've almost matched the whole pattern...
a b a b a
1 1 0 1 0
...but the next character is a, which is no good if the match started 4 characters
ago. However, shorter matches might still be good.
b a b a b
1 0 1 0 1
Still looking good.
a b a b c
0 1 1 1 1
Finally, c is good if the match started 4 characters before. The fact that
a 0 has made it all the way to the top bit means that we have a match.
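Putting the whole trace together, here is a compact Python sketch of the matching procedure (`bitap_search` is an illustrative name; exact-match variant only):

```python
def bitap_search(text, pattern):
    """Exact-match Bitap. T[x] has a 0 in bit n iff pattern[n] == x."""
    m = len(pattern)
    full = (1 << m) - 1
    T = {}
    for ch in set(text) | set(pattern):
        mask = full
        for n, p in enumerate(pattern):
            if p == ch:
                mask &= ~(1 << n)
        T[ch] = mask
    state = full                          # all 1s: nothing matches yet
    for j, ch in enumerate(text):
        # shift a 0 into bit 0, then OR in the table entry
        state = ((state << 1) | T[ch]) & full
        if not (state & (1 << (m - 1))):  # a 0 reached the top bit
            return j - m + 1              # match starts here
    return -1
```

Searching for "ababc" in "abdabababc" reproduces the trace above and reports a match starting at index 5.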
Sorry for not allowing anyone else to answer, but I'm pretty sure I've figured it out now.
The concept essential for grokking the algorithm is the representation of match states (defined in the original post) in binary. The article in the original post explains it formally; I'll try my hand at doing so colloquially:
Let's have STR, which is a String created with characters from a given alphabet.
Let's represent STR with a set of binary digits: STR_BINARY. The algorithm requires for this representation to be backwards (so, the first letter corresponds to the last digit, second letter with the second-to-last digit, etc.).
Let's assume RANDOM refers to a String with random characters from the same alphabet STR is created from.
In STR_BINARY, a 0 at a given index indicates that RANDOM matches STR from STR[0] through the letter of STR that this digit corresponds to. Empty spaces count as matches. A 1 indicates that RANDOM does not match STR inside those same boundaries.
The algorithm becomes simpler to learn once this is understood.
