How to use KMP failure function to determine minimum length repeated substring? - algorithm

I want to solve UVA 10298 -"Power Strings" problem using KMP algorithm. In this blog a technique is shown how failure function can be used to calculate minimum length repeated substring. The technique is as follows:
Compute prefix-suffix table pi[ ] for the given string.
Let len be the string length, and last_in_pi be the value stored at the last index of pi table.
Check whether len % (len - last_in_pi) == 0 is true or not. If it is true then the length of the minimum length repeated substring is (len - last_in_pi), otherwise it is the length of the given string.
I understand what is failure function and how it is used to find pattern in a text but I am struggling to understand proof of correctness of this technique.

Remember that Pi[i] is defined as the (length of the) longest prefix of your_string that is a proper suffix (so not the whole string) of the substring your_string[0 ... i].
There is an example on the blog post you linked to:
0 1 2 3 4 5
S : a b a b a b
Pi: 0 0 1 2 3 4
Where we have:
a b a
a b a b
Etc. I hope this makes it clear what Pi (the prefix function / table) does.
Now, the blog says:
The last value of prefix table = 4..
Now If it is a repeated string than , It’s minimal length would be 2. (6(string length) – 4) , Now
So you have to check if len % (len - last_in_pi) == 0. If yes, then len - last_in_pi is the length of the shortest repeated string (the period string).
This works because, if you rotate a string with len(period) positions either way, it will match itself. len - last_in_pi tells you how much you'd need to rotate.

Problem
S (of length Ls) is the given string. M (of length Lm) is the largest proper suffix of S, which is also a prefix of S. We have to prove Ls - Lm is the length of the shortest period of S.
Proof by Contradiction
Let's say there were a period Y whose length Ly < Ls - Lm (i.e, it's shorter than the one the above technique gives).
An important property to note is that M is a proper prefix of Y or vice-versa depending on their lengths. We can denote this as M = n*Y + Z, where n >= 0 and Z is the additional part and Lz < Ly. Z forms a prefix to Y, since Y repeats itself. Let Y = Z + W.
Consider M the suffix. Append the previous Ly number of characters from the original string S to it. This won't exceed the string length because (Ly < Ls - Lm). The new suffix is (n + 1)*Y + Z.
Consider M the prefix. Now append the next Ly number of characters from the original string S to it. The new prefix here is
M + (next Ly characters from S)
- > n*Y + Z + (Ly characters)
- > n*Y + Z + (Ly - Lz characters) + (Lz characters)
- > n*Y + (Z + W) + (Z)
{The `Ly - Lz` characters should be `W` because `Z` and these together form `Y`; The last Lz characters are actually the the first Lz characters of Y which is nothing but Z}
- > (n + 1)*Y + Z
Now we have a proper suffix of S which is also a prefix and is greater than M. But we started off saying M is the longest proper suffix which is also a prefix. So it's a contradiction, implying such a Y can not exist.

Assume you have a string s of size n, which looks like s = x1x2x3...x[n-2]x[n-1]x[n]
Assume s has a maximum common prefix/suffix of length len
Then it's period is p = (n - len), iff n % p == 0
Induction:
Denote prefix = s[1...len], postfix = s[p+1...n]
Then we have prefix[1...p] == postfix[1...p] == s[p+1...2p]
Since s[p+1...2p] == prefix[p+1...2p] so postfix[1...p] == postfix[p+1...2p]
Recursively postfix[p+1...2p] == s[2p+1...3p] == prefix[2p+1...3p]
...

Related

Verify that a number can be decomposed into powers of 2

Is it possible to verify that a number can be decomposed into a sum of powers of 2 where the exponents are sequential?
Is there an algorithm to check this?
Example: where and
The binary representation would have a single, consecutive group of 1 bits.
To check this, you could first identify the value of the least significant bit, add that bit to the original value, and then check whether the result is a power of 2.
This leads to the following formula for a given x:
(x & (x + (x & -x))) == 0
This expression is also true when x is zero. If that case needs to be rejected as a solution, you need an extra condition for that.
In Python:
def f(x):
return x > 0 and (x & (x + (x & -x))) == 0
This can be done in an elegant way using bitwise operations to check whether the binary representation of the number is a single block of consecutive 1 bits, followed by perhaps some 0s.
The expression x & (x - 1) replaces the lowest 1 in the binary representation of x with a 0. If we call that number y, then y | (y >> 1) sets each bit to be a 1 if it had a 1 to its immediate left. If the original number x was a single block of consecutive 1 bits, then the result is the same as the number x that we started with, because the 1 which was removed will be replaced by the shift. On the other hand, if x is not a single block of consecutive 1 bits, then the shift will add at least one other 1 bit that wasn't there in the original x, and they won't be equal.
That works if x has more than one 1 bit, so the shift can put back the one that was removed. If x has only a single 1 bit, then removing it will result in y being zero. So we can check for that, too.
In Python:
def is_sum_of_consecutive_powers_of_two(x):
y = x & (x - 1)
z = y | (y >> 1)
return x == z or y == 0
Note that this returns True when x is zero, and that's the correct result if "a sum of consecutive powers of two" is allowed to be the empty sum. Otherwise, you will have to write a special case to reject zero.
A number can be represented as the sum of powers of 2 with sequential exponents iff its binary representation has all 1s adjacent.
E.g. the set of numbers that can be represented as 2^n + 2^n-1, n >= 1, is exactly those with two adjacent ones in the binary representation.
just like this:
bool check(int x) {/*the number you want to check*/
int flag = 0;
while (x >>= 1) {
if (x & 1) {
if (!flag) flag = 1;
if (flag == 2) return false;
}
if (flag == 1) flag = 2;
}
return true;
}
O(log n).

Number of Ways To arrange Sequence

I am having a M character, from these character i need to make a sequence of length N such that no two consecutive character are same and also first and last character of the sequence is fix. So i need to find the total number of ways.
My Approach:
Dynamic programming.
If first and last character are '0' and '1'
dp[1][0]=1 , dp[1][1]=1
for(int i=2;i<N;i++)
for(int j=0;j<M;j++)
for(int k=0;k<M;k++)
if(j!=k) dp[i][j]+=dp[i-1][k]
So final answer would summation dp[n-1][i] , i!=1
Problem:
Here length N is too large around 10^15 and M is around 128, how find the number of permutation without using arrays ?
Assume M is fixed. Let D(n) be the number of sequences of length n with no repeated characters where the first and last character differ (but are fixed). Let S(n) be the number of sequences of length n where the first and last characters are the same (but are fixed).
For example, D(6) is the number of strings of the form a????b (for some a and b -- noting that for counting it doesn't matter which two characters we chose, and where the ? represent other characters). Similarly, S(6) is the number of strings of the form a????a.
Consider a sequence of length n>3 of the form a....?b. The ? can be any of m-1 characters (anything except b). One of these is a. So D(n) = S(n-1) + (m-2)D(n-1). Using a similar argument, one can figure out that S(n) = (M-1)D(n-1).
For example, how many strings are there of the form a??b? Well, the character just before the b could be a or something else. How many strings are there when it's a? Well, it's the same as the number of strings of the form a?a. How many strings are there when it's something else? Well it's the same as the number of strings of the form a?c multiplied by the number of choices we had for c (namely: m-2 -- everything except for a which we've already counted, and b which is excluded by the rules).
If n is odd, we can consider the middle character. Consider a sequence of length n of the form a...?...b. The ? (which is in the center of the string) can be a, b, or one of the other M-2 characters. Thus D(2n+1) = S(n+1)D(n+1) + D(n+1)S(n+1) + (M-2)D(n+1)D(n+1). Similarly, S(2n+1) = S(n+1)S(n+1) + (M-1)D(n+1)D(n+1).
For small n, S(2)=0, S(3)=M-1, D(2)=1, D(3)=M-2.
We can use the above equations (the first set for even n>3, the second set for odd n>3, and the base cases for n=2 or 3 to compute the result you need in O(log N) arithmetic operations. Presumably the question asks you to compute the result modulo something (since the result grows like O(M^(N-2)), but that's easy to incorporate into the results.
Working code that uses this approach:
def C(n, m, p):
if n == 2:
return 0, 1
if n == 3:
return (m-1)%p, (m-2)%p
if n % 2 == 0:
S, D = C(n-1, m, p)
return ((m-1) * D)%p, (S + (m-2) * D)%p
else:
S, D = C((n-1)//2+1, m, p)
return (S*S + (m-1)*D*D)%p, (2*S*D + (m-2)*D*D)%p
Note that in this code, C(n, m, p) returns two numbers -- S(n)%p and D(n)%p.
For example:
>>> p = 2**64 - 59 # Some large prime
>>> print(C(4, 128, p))
>>> print(C(5, 128, p))
>>> print(C(10**15, 128, p))
(16002, 16003)
(2032381, 2032380)
(12557489471374801501, 12557489471374801502)
Looking at these examples, it seems like D(n) = S(n) + (-1)^n. If that's true, the code can be simplified a bit I guess.
Another, perhaps easier, way to do it efficiently is to use a matrix and the first set of equations. (Sorry for the ascii art -- this diagram is a vector = matrix * vector):
(D(n)) = (M-2 1) * (D(n-1))
(S(n)) = (M-1 0) (S(n-1))
Telescoping this, and using that D(2)=1, S(2)=0:
(D(n)) = (M-2 1)^(n-2) (1)
(S(n)) = (M-1 0) (0)
You can perform the matrix power using exponentiation by squaring in O(log n) time.
Here's working code, including the examples (which you can check produce the same values as the code above). Most of the code is actually matrix multiply and matrix power -- you can probably replace a lot of it with numpy code if you use that package.
def mat_mul(M, N, p):
R = [[0, 0], [0, 0]]
for i in range(2):
for j in range(2):
for k in range(2):
R[i][j] += M[i][k] * N[k][j]
R[i][j] %= p
return R
def mat_pow(M, n, p):
if n == 0:
return [[1, 0], [0, 1]]
if n == 1:
return M
if n % 2 == 0:
R = mat_pow(M, n//2, p)
return mat_mul(R, R, p)
return mat_mul(M, mat_pow(M, n-1, p), p)
def Cmat(n, m, p):
M = [((m-2), 1), (m-1, 0)]
M = mat_pow(M, n-2, p)
return M[1][0], M[0][0]
p = 2**64 - 59
print(Cmat(4, 128, p))
print(Cmat(5, 128, p))
print(Cmat(10**15, 128, p))
You only need to count the number of acceptable sequences, not find them explicitly. It turns out that it doesn't matter what the majority of the characters are. There are only 4 kinds of characters that matter:
The first character
The last character
The last-used character, so you don't repeat characters consecutively
All other characters
In other words, you don't need to iterate over all 10^15 characters. You only need to consider the four cases above, since most characters can be lumped together into the last case.

Boyer-Moore Galil Rule

I was implementing the Boyer-Moore Algorithm for substring search in Python when I learned about the Galil Rule. I've looked around online for the Galil Rule but I haven't found anything more than a couple of sentences, and I cannot get access to the original paper. How can I implement this into my current algorithm?
i = 0
while i < (N - M + 1):
skip = 0
for j in reversed(range(0, M)):
if pattern[j] != text[i + j]:
skip = max(1, j - offsets[text[i+j]])
break
if skip == 0:
return i
i += skip
return -1
Notes:
offsets[c] = -1 if c is not in the pattern
offsets[c] = last index of c in the pattern
Example:
aaabcb
offsets[a] = 2
offsets[b] = 5
offsets[c] = 4
offsets[d] = -1
The few sentences I have found have said to keep track of when the first mismatch occurs in my inner loop (j, if the if-statement inside the inner loop is True) and the position in which I started the comparisons (i + j, in my case). I understand the intuition that I've already checked all the indices in between those, so I shouldn't have to do those comparisons again. I just don't understand how to connect the dots and arrive at an implementation.
The Galil rule is about exploiting periodicity in the pattern to reduce comparisons. Say you have a pattern abcabcab. It's periodic with smallest period abc. In general, a pattern P is periodic if there's a string U such that P is a prefix of UUUUU.... (In the above example, abcabcab is clearly a prefix of the repeating string abc = U.) We call the shortest such string the period of P. Let the length of that period be k (in the example above k = 3 since U = abc).
First of all, keep in mind that the Galil rule applies only after you've found an occurrence of P in the text. When you do that, the Galil rule says that you could shift by k (the periodicity of the pattern) and you only have to compare the last k characters of the now shifted pattern to determine if there was a match.
Here's an example:
P = ababa
T = bababababab
U = ab
k = 2
First occurrence: b[ababa]babab. Now you can shift by k = 2 and you only have to check the last two characters of the pattern:
T = bababa[ba]bab
P = aba[ba] // Only need to compare chars inside brackets for next match.
The rest of P must match since P is periodic and you shifted it by its period k from an existing match (this is crucial) so the repeating parts will nicely line up.
If you've found another match, just repeat. If you find a mismatch, however, you revert to the standard Boyer-Moore algorithm until you find another match. Remember, you can only use the Galil rule when you find a match and you shift by k (otherwise the pattern is not guaranteed to line up with the previous occurrence).
Now, you might wonder, how to determine k for a given pattern P. You'll need to calculate the suffixes array N first, where N[i] will be the length of the longest common suffix of the prefix P[0, i] and P. (You can calculate the suffixes array by calculating the prefixes array Z on the reverse of P using the Z algorithm, as described here, for example.) Once you have the suffixes array, you can easily find k since it'll be the smallest k > 0 such that N[m - k - 1] == m - k (where m = |P|).
For example:
P = ababa
m = 5
N = [1, 0, 3, 0, 5]
k = 2 because N[m - k - 1] == N[5 - 2 - 1] == N[2] == 3 == 5 - k
The answer by #Lajos Nagy has explained the idea of Galil rule perfectly, however we have a more straightforward way to calculate k:
Just use the prefix function of KMP algorithm.
The prefix[i] means the longest proper prefix of P[0..i] which is also a suffix.
And, k = m-prefix[m-1] .
This article has explained the details.

Remove the inferior digits of a number

Given a number n of x digits. How to remove y digits in a way the remaining digits results in the greater possible number?
Examples:
1)x=7 y=3
n=7816295
-8-6-95
=8695
2)x=4 y=2
n=4213
4--3
=43
3)x=3 y=1
n=888
=88
Just to state: x > y > 0.
For each digit to remove: iterate through the digits left to right; if you find a digit that's less than the one to its right, remove it and stop, otherwise remove the last digit.
If the number of digits x is greater than the actual length of the number, it means there are leading zeros. Since those will be the first to go, you can simply reduce the count y by a corresponding amount.
Here's a working version in Python:
def remove_digits(n, x, y):
s = str(n)
if len(s) > x:
raise ValueError
elif len(s) < x:
y -= x - len(s)
if y <= 0:
return n
for r in range(y):
for i in range(len(s)):
if s[i] < s[i+1:i+2]:
break
s = s[:i] + s[i+1:]
return int(s)
>>> remove_digits(7816295, 7, 3)
8695
>>> remove_digits(4213, 4, 2)
43
>>> remove_digits(888, 3, 1)
88
I hesitated to submit this, because it seems too simple. But I wasn't able to think of a case where it wouldn't work.
if x = y we have to remove all the digits.
Otherwise, you need to find maximum digit in first y + 1 digits. Then remove all the y0 elements before this maximum digit. Then you need to add that maximum to the answer and then repeat that task again, but you need now to remove y - y0 elements now.
Straight forward implementation will work in O(x^2) time in the worst case.
But finding maximum in the given range can be done effectively using Segment Tree data structure. Time complexity will be O(x * log(x)) in the worst case.
P. S. I just realized, that it possible to solve in O(x) also, using the fact, that exists only 10 digits (but the algorithm maybe a little bit complicated). We need to find the minimum in the given range [L, R], but the ranges in this task will "change" from left to the right (L and R always increase). And we just need to store 10 pointers to the digits (1 per digit) to the first position in the number such that position >= L. Then to find the minimum, we need to check only 10 pointers. To update the pointers, we will try to move them right.
So the time complexity will be O(10 * x) = O(x)
Here's an O(x) solution. It builds an index that maps (i, d) to j, the smallest number > i such that the j'th digit of n is d. With this index, one can easily find the largest possible next digit in the solution in O(1) time.
def index(digits):
next = [len(digits)+1] * 10
for i in xrange(len(digits), 0, -1):
next[ord(digits[i-1])-ord('0')] = i-1
yield next[::-1]
def minseq(n, y):
n = str(n)
idx = list(index(n))[::-1]
i, r = 0, []
for ry in xrange(len(n)-y):
i = next(j for j in idx[i] if j <= y+ry) + 1
r.append(n[i - 1])
return ''.join(r)
print minseq(7816295, 3)
print minseq(4213, 2)
Pseudocode:
Number.toDigits().filter (sortedSet (Number.toDigits()). take (y))
Imho you don't need to know x.
For efficiency, Number.toDigits () could be precalculated
digits = Number.toDigits()
digits.filter (sortedSet (digits).take (y))
Depending on language and context, you either output the digits and are done or have to convert the result into a number again.
Working Scala-Code for example:
def toDigits (l: Long) : List [Long] = if (l < 10) l :: Nil else (toDigits (l /10)) :+ (l % 10)
val num = 734529L
val dig = toDigits (num)
dig.filter (_ > ((dig.sorted).take(2).last))
A sorted set is a set which is sorted, which means, every element is only contained once and then the resulting collection is sorted by some criteria, for example numerical ascending. => 234579.
We take two of them (23) and from that subset the last (3) and filter the number by the criteria, that the digits have to be greater than that value (3).
Your question does not explicitly say, that each digit is only contained once in the original number, but since you didn't give a criterion, which one to remove in doubt, I took it as an implicit assumption.
Other languages may of course have other expressions (x.sorted, x.toSortedSet, new SortedSet (num), ...) or lack certain classes, functions, which you would have to build on your own.
You might need to write your own filter method, which takes a pedicate P, and a collection C, and returns a new collection of all elements which satisfy P, P being a Method which takes one T and returns a Boolean. Very useful stuff.

Google codejam APAC Test practice round: Parentheses Order

I spent one day solving this problem and couldn't find a solution to pass the large dataset.
Problem
An n parentheses sequence consists of n "("s and n ")"s.
Now, we have all valid n parentheses sequences. Find the k-th smallest sequence in lexicographical order.
For example, here are all valid 3 parentheses sequences in lexicographical order:
((()))
(()())
(())()
()(())
()()()
Given n and k, write an algorithm to give the k-th smallest sequence in lexicographical order.
For large data set: 1 ≤ n ≤ 100 and 1 ≤ k ≤ 10^18
This problem can be solved by using dynamic programming
Let dp[n][m] = number of valid parentheses that can be created if we have n open brackets and m close brackets.
Base case:
dp[0][a] = 1 (a >=0)
Fill in the matrix using the base case:
dp[n][m] = dp[n - 1][m] + (n < m ? dp[n][m - 1]:0 );
Then, we can slowly build the kth parentheses.
Start with a = n open brackets and b = n close brackets and the current result is empty
while(k is not 0):
If number dp[a][b] >= k:
If (dp[a - 1][b] >= k) is true:
* Append an open bracket '(' to the current result
* Decrease a
Else:
//k is the number of previous smaller lexicographical parentheses
* Adjust value of k: `k -= dp[a -1][b]`,
* Append a close bracket ')'
* Decrease b
Else k is invalid
Notice that open bracket is less than close bracket in lexicographical order, so we always try to add open bracket first.
Let S= any valid sequence of parentheses from n( and n) .
Now any valid sequence S can be written as S=X+Y where
X=valid prefix i.e. if traversing X from left to right , at any point of time, numberof'(' >= numberof')'
Y=valid suffix i.e. if traversing Y from right to left, at any point of time, numberof'(' <= numberof')'
For any S many pairs of X and Y are possible.
For our example: ()(())
`()(())` =`empty_string + ()(())`
= `( + )(())`
= `() + (())`
= `()( + ())`
= `()(( + ))`
= `()(() + )`
= `()(()) + empty_string`
Note that when X=empty_string, then number of valid S from n( and n)= number of valid suffix Y from n( and n)
Now, Algorithm goes like this:
We will start with X= empty_string and recursively grow X until X=S. At any point of time we have two options to grow X, either append '(' or append ')'
Let dp[a][b]= number of valid suffixes using a '(' and b ')' given X
nop=num_open_parenthesis_left ncp=num_closed_parenthesis_left
`calculate(nop,ncp)
{
if dp[nop][ncp] is not known
{
i1=calculate(nop-1,ncp); // Case 1: X= X + "("
i2=((nop<ncp)?calculate(nop,ncp-1):0);
/*Case 2: X=X+ ")" if nop>=ncp, then after exhausting 1 ')' nop>ncp, therefore there can be no valid suffix*/
dp[nop][ncp]=i1+i2;
}
return dp[nop][ncp];
}`
Lets take example,n=3 i.e. 3 ( and 3 )
Now at the very start X=empty_string, therefore
dp[3][3]= number of valid sequence S using 3( and 3 )
= number of valid suffixes Y from 3 ( and 3 )

Resources