Checking if two strings are equal after removing a subset of characters from both - algorithm

I recently came across this problem:
You are given two strings, s1 and s2, composed entirely of lowercase letters 'a' through 'r', and need to process a series of queries. Each query provides a subset of lowercase English letters from 'a' through 'r'. For each query, determine whether s1 and s2, when restricted only to the letters in the query, are equal.
s1 and s2 can contain up to 10^5 characters, and there are up to 10^5 queries.
For instance, if s1 is "aabcd" and s2 is "caabd", and you are asked to process a query with the subset "ac", then s1 becomes "aac" while s2 becomes "caa". These don't match, so the query would return false.
I was able to solve this in O(N^2) time as follows: for each query, I checked whether s1 and s2 would be equal by iterating through both strings one character at a time, skipping the characters that do not lie within the subset of allowed characters, and checking that the "allowed" characters from s1 and s2 match. If at some point the characters don't match, then the strings are not equal; otherwise, s1 and s2 are equal when restricted only to the letters in the query. Each query takes O(N) time to process, and there are N queries, for a total of O(N^2) time.
However, I was told that there was a way to solve this faster in O(N). Does anyone know how this might be done?

The first obvious speedup is to ensure your set membership test is O(1). To do that, there are a couple of options:
Represent every letter as a single bit -- now every character maps to an 18-bit value with only one bit set. The set of allowed characters is then a mask with these bits ORed together, and you can test membership of a character with a bitwise AND;
Alternatively, you can have an 18-value array and index it by character (c - 'a' gives a value between 0 and 17). The test for membership is then basically the cost of an array lookup (and you can save an operation by not doing the subtraction -- instead just make the array larger and index directly by character).
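As a minimal sketch (my own illustrative C++, with made-up helper names, not code from the question or the linked implementation), the two membership tests might look like this:

#include <cstdint>
#include <string>

// Option 1: the query as an 18-bit mask, one bit per letter 'a'..'r'.
uint32_t make_mask(const std::string &query) {
    uint32_t mask = 0;
    for (char c : query) mask |= 1u << (c - 'a');
    return mask;
}

bool allowed_by_mask(uint32_t mask, char c) {
    return (mask >> (c - 'a')) & 1u;               // one shift and one AND
}

// Option 2: a lookup table indexed directly by the character value.
bool allowed_by_table(const bool table[256], char c) {
    return table[static_cast<unsigned char>(c)];   // one array read
}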
Thought experiment
The next potential speedup is to recognize that any character which does not appear exactly the same number of times in both strings instantly rules out a match. You can count all character frequencies in both strings with a histogram, which can be done in O(N) time. This lets you prune the search space: if such a character appears in the query, you can reject the query in constant time.
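As a sketch of that pre-check (illustrative C++, names my own): build both histograms once, record the letters whose totals differ as a bitmask, and reject any query that touches one of them.

#include <array>
#include <cstdint>
#include <string>

// Returns a mask of letters whose total counts differ between s1 and s2.
// Any query containing one of these letters can be rejected immediately.
uint32_t mismatched_letters(const std::string &s1, const std::string &s2) {
    std::array<int, 18> c1{}, c2{};
    for (char c : s1) ++c1[c - 'a'];
    for (char c : s2) ++c2[c - 'a'];
    uint32_t bad = 0;
    for (int i = 0; i < 18; ++i)
        if (c1[i] != c2[i]) bad |= 1u << i;
    return bad;            // reject a query mask q whenever (q & bad) != 0
}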
Of course, that won't help for a real stress-test which will guarantee that all possible letters have a frequency matched in both strings. So, what do you do then?
Well, you extend the above premise by recognizing that for any position of character x in string 1 and the corresponding position of that character in string 2 that would be a valid match (i.e. the same number of occurrences of x appear in both strings up to those respective positions), the total count of any other character up to those positions must also be equal. Any character for which that is not true cannot possibly be compatible with character x.
Concept
Let's start by thinking about this in terms of a technique known as memoization where you can leverage precomputed or partially-computed information and get a whole lot out of it. So consider two strings like this:
a b a a c d e | e a b a c a d
What useful thing can you do here? Well, why not store the partial sums of counts for each letter:
          a b a a c d e | e a b a c a d
          --------------|--------------
freq(a) = 1 1 2 3 3 3 3 | 0 1 1 2 2 3 3
freq(b) = 0 1 1 1 1 1 1 | 0 0 1 1 1 1 1
freq(c) = 0 0 0 0 1 1 1 | 0 0 0 0 1 1 1
freq(d) = 0 0 0 0 0 1 1 | 0 0 0 0 0 0 1
freq(e) = 0 0 0 0 0 0 1 | 1 1 1 1 1 1 1
This uses a whole lot of memory, but don't worry -- we'll deal with that later. Instead, take the time to absorb what we're actually doing.
Looking at the table above, we have the running character count totals for both strings at every position in those strings.
Now let's see how our matching rules work by showing an example of a matching query "ab" and a non-matching query "acd":
For "ab":
          a b a a c d e | e a b a c a d
          --------------|--------------
freq(a) = 1 1 2 3 3 3 3 | 0 1 1 2 2 3 3
freq(b) = 0 1 1 1 1 1 1 | 0 0 1 1 1 1 1
          ^ ^ ^ ^           ^ ^ ^   ^
We scan the frequency arrays, stopping at every position that holds one of the letters in the query; those locations are marked with ^ above. If you remove all the unmarked columns, you'll see the remaining columns match on both sides. So this is a match.
For "acd":
          a b a a c d e | e a b a c a d
          --------------|--------------
freq(a) = 1 1 2 3 3 3 3 | 0 1 1 2 2 3 3
freq(c) = 0 0 0 0 1 1 1 | 0 0 0 0 1 1 1
freq(d) = 0 0 0 0 0 1 1 | 0 0 0 0 0 0 1
          ^   ^ # # ^       ^   ^ # # ^
Here, all columns are matching except those marked with #.
Putting it together
All right, so you can see how this works, but you may be wondering about the runtime, because those examples above seem to be doing even more scanning than you were doing before!
Here's where things get interesting, and where our character frequency counts really kick in.
First, observe what we're actually doing on those marked columns. For any one character that we care about (for example, the 'a'), we're looking for only the positions in both strings where its count matches, and then we're comparing these two columns to see what other values match. This gives us a set of all other characters that are valid when used with 'a'. And of course, 'a' itself is also valid.
And what was our very first optimization? A bitset -- an 18-bit value that represents a set of valid characters. You can now use this. For the two columns (one in each string), set a 1 for characters with matching counts and a 0 for characters with non-matching counts. If you process every pair of matching 'a' positions in this manner, you get a collection of such sets, and you can keep a running "master" set that is the intersection of all of them -- you just intersect it with each intermediate set you calculate, which is a single bitwise AND.
By the time you reach the end of both strings, you have performed an O(N) scan, examining 18 rows each time you encountered an 'a'. The result is the set of all characters that work with 'a'.
Now repeat for the other characters, one at a time. Each time it's an O(N) scan as above, and you wind up with the set of all other characters that work with the one you're processing.
After processing every character, you have an array of 18 values, where each value represents the set of all characters that work with that one character. The operation took O(18N) time and O(18N) storage.
Query
Since you now have an array where, for each character, you have the set of all characters that work with it, you simply look up each character in the query and intersect their sets (again, with bitwise AND). One further intersection, with the set of characters present in the query itself, prunes off all characters that are not relevant.
This leaves you with a value which, for the characters in the query, represents the set of all characters that can result in a matching string. If this value is equal to the query's own set, then you have a match. Otherwise, you don't.
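A minimal sketch of that query step (illustrative C++; it assumes mask[c] already holds the set of characters compatible with c, built as described in the next section):

#include <cstdint>

// query is an 18-bit set of letters; mask[c] is the set of letters compatible with c.
bool answer_query(const uint32_t mask[18], uint32_t query) {
    uint32_t result = (1u << 18) - 1;              // start with "every letter works"
    for (int c = 0; c < 18; ++c)
        if ((query >> c) & 1u) result &= mask[c];  // intersect per queried letter
    result &= query;                               // drop letters not in the query
    return result == query;                        // match iff no queried letter was lost
}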
This is the part that is now fast. It has essentially reduced your query tests to constant time. However, the original indexing did cost us a lot of memory. What can be done about that?
Dynamic Programming
Is it really necessary to allocate all that storage for our frequency arrays?
Well, actually, no. The table was useful as a visual tool for laying out our counts and explaining the method conceptually, but most of those values are never actually needed, and storing them all makes the method seem more complicated than it is.
The good news is we can compute our master sets at the same time as computing character counts, without needing to store any frequency arrays. Remember that when we're computing the counts we use a histogram, which is as simple as having one small 18-value array where you say count[c] += 1 (if c is a character or an index derived from that character). So we can save a ton of memory if we just do the following:
Processing the set (mask) of all compatible characters for character c:
1. Initialize the mask for character c to all 1s (mask[c] = (1 << 18) - 1) -- this means every character is currently considered compatible. Initialize a character histogram (count) to all zeros.
2. Walk through string 1 until you reach character c. For every character x you encounter along the way, increase its count in the histogram (count[x]++).
3. Walk through string 2 until you reach character c. For every character x you encounter along the way, decrease its count in the histogram (count[x]--).
4. Construct a 'good' set in which any character that currently has a zero count gets a 1-bit, and any other character gets a 0-bit. Intersect this with the current mask for c (using bitwise AND): mask[c] &= good.
5. Continue from step 2 until you have reached the end of both strings. If you reach the end of one of the strings prematurely, then the counts of c do not match, so set the mask for this character to zero: mask[c] = 0.
6. Repeat from step 1 for every character, until all characters are done.
Above, we basically have the same time complexity of O(18N), except we have absolutely minimal extra storage -- only one small array of counts (reused for each character) and one array of 18 masks.
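Here is a minimal sketch of that per-character pass (my own C++ illustration of the numbered steps above, not the linked implementation); the resulting masks are exactly what the query step in the previous section consumes:

#include <array>
#include <cstdint>
#include <string>

std::array<uint32_t, 18> build_masks(const std::string &s1, const std::string &s2) {
    constexpr uint32_t ALL = (1u << 18) - 1;
    std::array<uint32_t, 18> mask;
    for (int c = 0; c < 18; ++c) {
        mask[c] = ALL;                            // step 1: everything compatible so far
        std::array<int, 18> count{};              // s1 increments, s2 decrements
        std::size_t i = 0, j = 0;
        for (;;) {
            // step 2: walk s1 up to the next occurrence of c
            while (i < s1.size() && s1[i] - 'a' != c) ++count[s1[i++] - 'a'];
            // step 3: walk s2 up to the next occurrence of c
            while (j < s2.size() && s2[j] - 'a' != c) --count[s2[j++] - 'a'];
            bool end1 = (i == s1.size()), end2 = (j == s2.size());
            if (end1 && end2) break;              // no more c in either string: done
            if (end1 != end2) { mask[c] = 0; break; }   // unequal number of c's
            ++i; ++j;                             // step over this occurrence of c
            uint32_t good = 0;                    // step 4: letters whose running counts agree
            for (int x = 0; x < 18; ++x)
                if (count[x] == 0) good |= 1u << x;
            mask[c] &= good;
        }
    }
    return mask;
}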
Combining techniques like the above to solve seemingly complex combinatorial problems really fast is commonly referred to as Dynamic Programming. We reduced the problem down to a truth table representing all possible characters that work with any single character. The time complexity remains linear with respect to the length of the strings, and only scales by the number of possible characters.
Here is the algorithm above hacked together in C++: https://godbolt.org/z/PxzYvGs8q

Let RESTRICT(s,q) be the restriction of string s to the letters in the set q.
If q contains more than two letters, then the full string RESTRICT(s,q) can be reconstructed from all the strings RESTRICT(s,qij) where qij is a pair of letters in q.
Therefore, RESTRICT(s1,q) = RESTRICT(s2,q) if and only if RESTRICT(s1,qij) = RESTRICT(s2,qij) for all pairs qij in q.
Since you are restricted to 18 letters, there are only 153 letter pairs, or only 171 fundamental single- or double-letter queries.
If you precalculate the results of these 171 queries, then you can answer any other more complicated query just by combining their results, without inspecting the string at all.
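A sketch of this approach (illustrative C++, my own names): precompute the 18x18 table by running a naive O(N) restricted comparison once per single letter and per pair (171 runs in total), then answer each query by checking all of its pairs, which costs at most 171 table lookups per query.

#include <cstddef>
#include <string>

struct PairTable {
    bool ok[18][18];   // ok[x][y]: s1 and s2 match when restricted to {x, y}

    PairTable(const std::string &s1, const std::string &s2) {
        for (int x = 0; x < 18; ++x)
            for (int y = x; y < 18; ++y)
                ok[x][y] = ok[y][x] =
                    restricted_equal(s1, s2, (1u << x) | (1u << y));
    }

    // naive scan: compare the two strings restricted to the letters in 'set'
    static bool restricted_equal(const std::string &a, const std::string &b, unsigned set) {
        std::size_t i = 0, j = 0;
        for (;;) {
            while (i < a.size() && !((set >> (a[i] - 'a')) & 1u)) ++i;
            while (j < b.size() && !((set >> (b[j] - 'a')) & 1u)) ++j;
            if (i == a.size() || j == b.size()) return i == a.size() && j == b.size();
            if (a[i++] != b[j++]) return false;
        }
    }

    bool query(const std::string &q) const {      // q: distinct letters 'a'..'r'
        for (std::size_t u = 0; u < q.size(); ++u)
            for (std::size_t v = u; v < q.size(); ++v)
                if (!ok[q[u] - 'a'][q[v] - 'a']) return false;
        return true;
    }
};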

Related

Longest substring where every character appear even number of times (possibly zero)

Suppose we have a string s. We want to find the length of the longest substring of s such that every character in the substring appears an even number of times (possibly zero).
Required worst-case time: O(n log n). Worst-case space: O(n).
First, it's obvious that the substring must be of an even length. Second, I'm familiar with the sliding window method where we anchor some right index and look for the left-most index to match your criterion. I tried to apply this idea here but couldn't really formulate it.
Also, it seems to me like a priority queue could come in handy (since the O(n log n) requirement sort of hints at it).
I'd be glad for help!
Let's define the following bitsets:
B[c,i] = the parity (0 or 1) of the number of times character c appears in s[0,...,i].
Calculating B[c,i] takes linear time (for all values):
for each c in alphabet:
    B[c, -1] = 0
for i from 0 to len(s) - 1:
    for each c in alphabet except s[i]:
        B[c, i] = B[c, i-1]
    B[s[i], i] = B[s[i], i-1] XOR 1
Since the alphabet is of constant size, so are the bitsets (for each i).
Note that the condition:
every character in the substring appears an even number of times
is true for the substring between indices i and j (exclusive of i, i.e. s[i+1..j]) if and only if the bitset at index i is identical to the bitset at index j (otherwise, there is a character that appears an odd number of times in that substring; in the other direction, if some character appears an odd number of times, its bit cannot be identical at i and j).
So, if we store all the bitsets in a map (hash map / tree map), keeping only the latest index for each bitset, this preprocessing takes O(n) or O(n log n) time (depending on hash vs. tree).
In a second pass, for each index, find the farthest-away index with an identical bitset (O(1)/O(log n), depending on hash vs. tree), compute the substring length, and mark it as a candidate. At the end, take the longest candidate.
This solution is O(n) space for the bitsets, and O(n)/O(n log n) time, depending on whether you use the hash or tree variant.
Pseudo code:
def NextBitset(B, c): # O(1) time: flip the parity bit of character c
    B[c] = B[c] XOR 1
    return B

for each c in alphabet: # O(1) time -- the empty prefix has all-even parities
    B[c] = 0
map = new hash/tree map (bitset -> int)
# first pass: O(n)/O(nlogn) time
for i from 0 to len(s) - 1:
    B = NextBitset(B, s[i])
    # note: we override, so map keeps the latest index for each bitset
    map[B] = i
# reset B for the second pass
for each c in alphabet: # O(1) time
    B[c] = 0
max_distance = 0
# second pass: O(n)/O(nlogn) time
for i from 0 to len(s) - 1:
    B = NextBitset(B, s[i])
    j = map.find(B) # O(1)/O(logn); always finds at least i itself
    max_distance = max(max_distance, j - i)
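For concreteness, here is a compact C++ version of the same idea (my own sketch, assuming a lowercase 'a'..'z' alphabet so the parity bitset fits in one int). It keeps the earliest index seen for each bitset and does a single pass, which yields the same maximum as the two-pass, latest-index formulation above; seeding the map with the empty prefix also covers substrings that start at index 0.

#include <algorithm>
#include <string>
#include <unordered_map>

int longest_even_substring(const std::string &s) {
    std::unordered_map<int, int> first;   // parity bitset -> earliest prefix index
    first[0] = -1;                        // empty prefix: every count is even
    int state = 0, best = 0;
    for (int i = 0; i < (int)s.size(); ++i) {
        state ^= 1 << (s[i] - 'a');       // flip the parity of the current character
        auto it = first.find(state);
        if (it == first.end())
            first[state] = i;             // remember the earliest occurrence
        else                              // equal bitsets => all counts even in between
            best = std::max(best, i - it->second);
    }
    return best;                          // longest_even_substring("aabccab") == 6
}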
I'm not sure exactly what amit proposes so if this is it, please consider it another explanation. This can be accomplished in a single traversal.
Produce a bitset of length equal to the alphabet's for each index of the string. Store the first index for each unique bitset encountered while traversing the string. Update the largest interval between a current and previously seen bitset.
For example, the string, "aabccab":
    a a b c c a b
    0 1 2 3 4 5 6    (index)

  0 1 0 0 0 0 1 1   \
  0 0 0 1 1 1 1 0    |  (vertical) bitset for each index
  0 0 0 0 1 0 0 0   /
    ^           ^
    |___________|
    largest interval between current
    and previously seen bitset
The update for each iteration can be accomplished in O(1) by preprocessing a bit mask for each character to XOR with the previous bitset:
  bitset         mask
    0              1            1
    1     XOR      0      =     1
    0              0            0
Here the mask means "update (flip) the character associated with the first bit in the alphabet-bitset".

Palindrome partitioning with interval scheduling

So I was looking at the various algorithms of solving Palindrome partitioning problem.
Like for a string "banana" minimum no of cuts so that each sub-string is a palindrome is 1 i.e. "b|anana"
Now I tried solving this problem using interval scheduling like:
Input: banana
Transformed string: # b # a # n # a # n # a #
P[] = lengths of palindromes considering each character as center of palindrome.
I[] = intervals
String:      #  b  #  a  #  n  #  a  #  n  #  a  #
P[i]:        0  1  0  1  0  3  0  5  0  3  0  1  0
I[i]:        0  1  2  3  4  5  6  7  8  9 10 11 12
Example: the palindrome with the 'a' at index 7 as its center has length 5: "anana".
Now constructing intervals for each character based on P[i]:
b = (0,2)
a = (2,4)
n = (2,8)
a = (2,12)
n = (6,12)
a = (10,12)
So now, if I have to schedule these intervals on time 0 to 12 such that the minimum number of intervals is scheduled and no time slot remains empty, I would choose the intervals (0,2) and (2,12); hence the answer would be 1, as I have broken the given string into two palindromes.
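For concreteness, here is a small C++ sketch (my own illustration, not part of the original question) that builds the '#'-interleaved string, computes P[i] for each character center by plain expansion, and emits the intervals (i - P[i], i + P[i]); the examples here only use character centers (odd-length palindromes), and Manacher's algorithm would produce the same radii in O(n):

#include <iostream>
#include <string>
#include <utility>
#include <vector>

std::vector<std::pair<int, int>> palindrome_intervals(const std::string &in) {
    std::string t = "#";
    for (char c : in) { t += c; t += '#'; }
    const int n = static_cast<int>(t.size());
    std::vector<std::pair<int, int>> intervals;
    for (int i = 1; i < n; i += 2) {                 // real characters sit at odd indices
        int r = 0;                                   // expand while the palindrome grows
        while (i - r - 1 >= 0 && i + r + 1 < n && t[i - r - 1] == t[i + r + 1]) ++r;
        intervals.emplace_back(i - r, i + r);        // P[i] == r for character centers
    }
    return intervals;
}

int main() {
    for (auto [l, r] : palindrome_intervals("banana"))
        std::cout << '(' << l << ',' << r << ") ";
    // prints: (0,2) (2,4) (2,8) (2,12) (6,12) (10,12)
}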
Another test case:
String:      #  E  #  A  #  B  #  A  #  E  #  A  #  B  #
P[i]:        0  1  0  1  0  5  0  1  0  5  0  1  0  1  0
I[i]:        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
Plotting these intervals on a graph (figure not reproduced here):
Now, the minimum number of intervals that can be scheduled is achieved with either:
1(0,2), 2(2,4), 5(4,14) OR
3(0,10), 6(10,12), 7(12,14)
Hence, we have 3 parts, so the number of cuts would be 2, with either:
E|A|BAEAB
EABAE|A|B
These are just examples. I would like to know whether this algorithm will work for all cases, or whether there are some cases where it would definitely fail.
Please help me achieve a proof that it will work in every scenario.
Note: Please don't discourage me if this post makes no sense, as I have put a lot of time and effort into this problem; just state a reason or provide some link from which I can move forward with this solution. Thank you.
As long as you can get a partition of the string, your algorithm will work.
Recall that a partition P of a set S is a collection of non-empty subsets A1, ..., An such that:
The union of A1, ..., An is the set S
The intersection of any Ai, Aj (with i != j) is empty
Even though palindrome partitioning deals with strings (which are a bit different from sets), the properties of a partition still hold.
Hence, if you have a partition, you consequently have a set of time intervals without "holes" to schedule.
Choosing the partition with the minimum number of subsets, makes you have the minimum number of time intervals and therefore the minimum number of cuts.
Furthermore, you always have at least one palindrome partition of a string: in the worst case, you get a palindrome partition made of single characters.

Confused about prefix free and uniquely decodable with respect to binary code

A code is the assignment of a unique string of characters (a
codeword) to each character in an alphabet.
A code in which the codewords contain only zeroes and ones is
called a binary code.
All ASCII codewords have the same length. This ensures that an
important property called the prefix property holds true for the
ASCII code.
The encoding of a string of characters from an alphabet (the
cleartext) is the concatenation of the codewords corresponding to
the characters of the cleartext, in order, from left to right. A code
is uniquely decodable if the encoding of every possible cleartext
using that code is unique.
Based on the above information I was trying to do some exercises:
Considering the following matrix:
     Code1   Code2   Code3   Code4
A    0       0       1       1
B    100     1       01      01
C    10      00      001     001
D    11      11      0001    000
The confusions:
Are all of the above assignments considered codes, since each assigns a unique string of characters?
I understand that Code 1 and Code 2 are prefix-free since their codewords do not have equal length. Having said that, if you look at Code 4, the codewords for letters D and C consist of 3 digits. Would Code 4 be considered prefix-free too?
Is code 3 the only uniquely decodable code?
I think you have misunderstood the prefix property - it isn't mainly about length (but enforcing the same length n on each code point will make the code prefix-free - you cannot have unique codes otherwise).
Rather, it is about uniquely being able to identify each code point so that a decoder can greedily take the first translation that matches. In the case of fixed length, the decoder knows that it has to read n digits.
In the case of a variable-length code like Code1, you don't know upon reading 10 whether that should be translated to C or whether it is the first two digits of the three-digit B -- 10 is a prefix of 100. The same holds true for Code2: 0 is a prefix of 00 and 1 is a prefix of 11.
Consider reading the sequence 100 one digit at a time:
Code1:
Read 1
; "1" does not match any code - Remember the 1 and continue.
Read 0
; "10" matches reduction "C" - or is this the beginning of a "B"? Darn!
Read 0
; Ok, this was either "CA" or "B" - but there is no way of knowing which one.
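To see that ambiguity mechanically, here is a small C++ sketch (my own illustration) that enumerates every way a bit string can be split into codewords; for Code1 and the input 100 it finds both "B" and "CA":

#include <cstddef>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

void decode_all(const std::string &bits, std::size_t pos,
                const std::vector<std::pair<char, std::string>> &code,
                std::string &partial, std::vector<std::string> &out) {
    if (pos == bits.size()) { out.push_back(partial); return; }
    for (const auto &[letter, word] : code)
        if (bits.compare(pos, word.size(), word) == 0) {   // codeword matches here
            partial.push_back(letter);
            decode_all(bits, pos + word.size(), code, partial, out);
            partial.pop_back();                            // backtrack
        }
}

int main() {
    std::vector<std::pair<char, std::string>> code1 =
        {{'A', "0"}, {'B', "100"}, {'C', "10"}, {'D', "11"}};
    std::string partial;
    std::vector<std::string> decodings;
    decode_all("100", 0, code1, partial, decodings);
    for (const auto &d : decodings) std::cout << d << '\n';   // prints B and CA
}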
Hope this helps you forward!

Find longest common substring of multiple strings using factor oracle enhanced with LRS array

Can we use a factor oracle with suffix links (paper here) to compute the longest common substring of multiple strings? Here, "substring" means any contiguous part of the original string. For example, "abc" is a substring of "ffabcgg", while "abg" is not.
I've found a way to compute the maximum-length common substring of two strings s1 and s2. It works by concatenating the two strings using a character that appears in neither of them, '$' for example. Then for each prefix of the concatenated string s with length i >= |s1| + 2, we calculate its LRS (longest repeated suffix) length lrs[i] and sp[i] (the end position of the first occurrence of its LRS). Finally, the answer is
max{lrs[i]| i >= |s1| + 2 and sp[i] <= |s1|}
I've written a C++ program that uses this method, which can solve the problem within 200ms on my laptop when |s1|+|s2| <= 200000, using the factor oracle.
s1 = 'ffabcgg'
s2 = 'gfbcge'
s = s1+'$'+s2
= 'ffabcgg$gfbcge'
p:      0  1  2  3  4  5  6  7  8  9 10 11 12 13
s:      f  f  a  b  c  g  g  $  g  f  b  c  g  e
sp:     0  1  0  0  0  0  6  0  6  1  4  5  6  0
lrs:    0  1  0  0  0  0  1  0  1  1  1  2  3  0
ans = lrs[13] = 3
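As a cross-check of the definitions (not of the factor-oracle construction itself), here is a brute-force C++ sketch that computes lrs and sp directly, using 0-based positions throughout, and evaluates the max formula above; it is far too slow for the stated limits, but it returns 3 ("bcg") for this example:

#include <algorithm>
#include <string>

int lcs_by_definition(const std::string &s1, const std::string &s2) {
    std::string s = s1 + '$' + s2;
    const int n = (int)s.size();
    int best = 0;
    for (int i = (int)s1.size() + 1; i < n; ++i) {        // prefix ends inside s2
        for (int len = i; len >= 1; --len) {              // longest repeated suffix first
            std::string suf = s.substr(i - len + 1, len);
            std::size_t p = s.substr(0, i).find(suf);     // first occurrence ending before i
            if (p == std::string::npos) continue;
            int sp = (int)p + len - 1;                    // 0-based end of first occurrence
            if (sp < (int)s1.size()) best = std::max(best, len);
            break;                                        // this len is lrs for position i
        }
    }
    return best;   // lcs_by_definition("ffabcgg", "gfbcge") == 3
}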
I know that both problems can be solved with high efficiency using a suffix array or a suffix tree, but I wonder if there is a method using a factor oracle. I am interested in this because the factor oracle is easy to construct (about 30 lines of C++, while a suffix array needs about 60 and a suffix tree about 150), and it runs faster than the suffix array and suffix tree.
You can test a method for the first problem on this OnlineJudge, and for the second problem here.
Can we use a factor-oracle with suffix link (paper here) to compute
the longest common substring of multiple strings?
I don't think the algorithm is a very good fit (it is designed to factor a single string) but you can use it by concatenating the original strings with a unique separator.
Given abcdefg and hijcdekl and mncdop, find the longest common substring cd:
# combine with unique joiners
>>> s = "abcdefg" + "1" + "hijcdekl" + "2" + "mncdop"
>>> factor_oracle(s)
"cd"
As part of its linear-time-and-space algorithm, the factor oracle quickly rediscovers the break points between the input strings during its search for common factors (the unique joiners provide an immediate cue to stop extending the best factor found so far).

String similarity: how exactly does Bitap work?

I'm trying to wrap my head around the Bitap algorithm, but am having trouble understanding the reasons behind the steps of the algorithm.
I understand the basic premise of the algorithm, which is (correct me if I'm wrong):
Two strings: PATTERN (the desired string)
TEXT (the String to be perused for the presence of PATTERN)
Two indices: i (currently processing index in PATTERN), 1 <= i < PATTERN.SIZE
j (arbitrary index in TEXT)
Match state S(x): S(PATTERN(i)) = S(PATTERN(i-1)) && PATTERN[i] == TEXT[j], S(0) = 1
In english terms, PATTERN.substring(0,i) matches a substring of TEXT if the previous substring PATTERN.substring(0, i-1) was successfully matched and the character at PATTERN[i] is the same as the character at TEXT[j].
What I don't understand is the bit-shifting implementation of this. The official paper detailing this algorithm basically lays it out, but I can't seem to visualize what's supposed to go on. The algorithm specification is only the first 2 pages of the paper, but I'll highlight the important parts:
Here is the bit-shifting version of the concept, here is T[text] for a sample search string, and here is a trace of the algorithm (figures from the paper, not reproduced here).
Specifically, I don't understand what the T table signifies, and the reason behind ORing an entry in it with the current state.
I'd be grateful if anyone can help me understand what exactly is going on
T is slightly confusing because you would normally number positions in the pattern from left to right:

0 1 2 3 4
a b a b c

...whereas bits are normally numbered from right to left. But writing the pattern backwards above the bits makes it clear:

bit:    4 3 2 1 0
        c b a b a
T[a] =  1 1 0 1 0
T[b] =  1 0 1 0 1
T[c] =  0 1 1 1 1
T[d] =  1 1 1 1 1
Bit n of T[x] is 0 if x appears in position n, or 1 if it does not.
Equivalently, you can think of this as saying that if the current character
in the input string is x, and you see a 0 in position n of T[x], then you
can only possibly be matching the pattern if the match started n characters
previously.
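A small sketch of building T this way (illustrative C++, not code from the paper):

#include <cstdint>
#include <string>
#include <vector>

// Bit n of T[x] is 0 exactly when x occurs at position n of the pattern
// (positions counted from the left, bits numbered from the right).
std::vector<uint32_t> build_T(const std::string &pattern) {
    std::vector<uint32_t> T(256, (1u << pattern.size()) - 1);   // default: all 1s
    for (std::size_t n = 0; n < pattern.size(); ++n)
        T[static_cast<unsigned char>(pattern[n])] &= ~(1u << n);
    return T;
}
// build_T("ababc") gives T['a'] = 11010, T['b'] = 10101, T['c'] = 01111 (in binary).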
Now to the matching procedure. A 0 in bit n of the state means that we started matching the pattern n characters ago (where 0 is the current character). Initially, nothing matches.
[start]
1 1 1 1 1
As we consume characters trying to match, the state is shifted left (which shifts a zero in
to the bottom bit, bit 0) and OR-ed with the table entry for the current character. The first character is a; shifting left and OR-ing in T[a] gives:
a
1 1 1 1 0
The 0 bit that was shifted in is preserved, because a current character of a can begin a match of the pattern. For any other character, the bit would have been set to 1.
The fact that bit 0 of the state is now 0 means that we started matching the pattern on
the current character; continuing, we get:
a b
1 1 1 0 1
...because the 0 bit has been shifted left - think of it as saying that we started matching the pattern 1 character ago - and T[b] has a 0 in the same position, telling us that seeing a b in the current position is good if we started matching 1 character ago.
a b d
1 1 1 1 1
d can't match anywhere; all the bits get set back to 1.
a b d a
1 1 1 1 0
As before.
a b d a b
1 1 1 0 1
As before.
b d a b a
1 1 0 1 0
a is good if the match started either 2 characters ago or on the current character.
d a b a b
1 0 1 0 1
b is good if the match started either 1 or 3 characters ago. The 0 in bit 3 means
that we've almost matched the whole pattern...
a b a b a
1 1 0 1 0
...but the next character is a, which is no good if the match started 4 characters
ago. However, shorter matches might still be good.
b a b a b
1 0 1 0 1
Still looking good.
a b a b c
0 1 1 1 1
Finally, c is good if the match started 4 characters before. The fact that
a 0 has made it all the way to the top bit means that we have a match.
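And a sketch of the matching loop just traced (illustrative C++; the T construction is repeated inline so the snippet stands on its own):

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

void bitap_search(const std::string &text, const std::string &pattern) {
    // T as above: bit n of T[x] is 0 iff pattern[n] == x
    std::vector<uint32_t> T(256, (1u << pattern.size()) - 1);
    for (std::size_t n = 0; n < pattern.size(); ++n)
        T[static_cast<unsigned char>(pattern[n])] &= ~(1u << n);

    const uint32_t all = (1u << pattern.size()) - 1;
    const uint32_t top = 1u << (pattern.size() - 1);
    uint32_t state = all;                                 // nothing matches yet
    for (std::size_t j = 0; j < text.size(); ++j) {
        // shift a 0 into bit 0, then OR with the current character's table entry
        state = ((state << 1) | T[static_cast<unsigned char>(text[j])]) & all;
        if ((state & top) == 0)                           // a 0 reached the top bit
            std::printf("match ending at index %zu\n", j);
    }
}
// bitap_search("abdabababc", "ababc") reports a match ending at index 9.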
Sorry for not allowing anyone else to answer, but I'm pretty sure I've figured it out now.
The concept essential for grokking the algorithm is the representation of match states (defined in the original post) in binary. The article in the original post explains it formally; I'll try my hand at doing so colloquially:
Let's have STR, which is a String created with characters from a given alphabet.
Let's represent STR with a set of binary digits: STR_BINARY. The algorithm requires this representation to be backwards (so the first letter corresponds to the last digit, the second letter to the second-to-last digit, etc.).
Let's assume RANDOM refers to a String with random characters from the same alphabet STR is created from.
In STR_BINARY, a 0 at a given index indicates that RANDOM matches STR from STR[0] to STR[the index of the letter in STR that the 0 in STR_BINARY corresponds to]. Empty spaces count as matches. A 1 indicates that RANDOM does not match STR inside those same boundaries.
The algorithm becomes simpler to learn once this is understood.

Resources