Longest substring where every character appears an even number of times (possibly zero) - algorithm

Suppose we have a string s. We want to find the length of the longest substring of s such that every character in the substring appears an even number of times (possibly zero).
Target worst-case time: O(n log n); worst-case space: O(n).
First, it's obvious that the substring must have even length. Second, I'm familiar with the sliding-window method, where we anchor some right index and look for the left-most index matching the criterion. I tried to apply that idea here but couldn't really formulate it.
It also seems to me that a priority queue could come in handy (the O(n log n) requirement sort of hints at it).
I'd be glad for help!

Let's define the following bitsets:
B[c,i] = 1 if character c appears an odd number of times in s[0..i] (a parity bit; zero occurrences is even, so B[c,-1] = 0).
Calculating B[c,i] takes linear time (for all values):
for each c in alphabet:
    B[c, -1] = 0
for i from 0 to len(s) - 1:
    for each c in alphabet \ {s[i]}:
        B[c, i] = B[c, i-1]
    B[s[i], i] = B[s[i], i-1] XOR 1
Since the alphabet is of constant size, so are the bitsets (for each i).
Note that the condition:
every character in the substring appears an even number of times
holds for the substring s[i+1..j] if and only if the bitset at index i is identical to the bitset at index j. (If some character occurs an odd number of times in the substring, its parity bit differs between indices i and j, so the bitsets cannot be identical; conversely, if the bitsets are identical, every character occurs an even number of times in between.)
So, if we store all bitsets in a map (hash map/tree map), keeping only the latest index per bitset, this preprocessing takes O(n) or O(n log n) time (depending on hash vs. tree).
In a second pass, for each index, find the farthest-away index with an identical bitset (O(1)/O(log n) per lookup, depending on hash vs. tree), compute the substring length, and record it as a candidate. At the end, take the longest candidate.
This solution uses O(n) space for the bitsets, and O(n)/O(n log n) time, depending on whether a hash or tree structure is used.
Pseudo code:
def NextBitset(B, c): # O(1) time: flip c's parity bit
    B[c] = B[c] XOR 1
    return B

for each c in alphabet: # O(1) time
    B[c] = 0
map = new hash/tree map (bitset -> int)
map[B] = -1 # the empty prefix
# first pass: O(n)/O(nlogn) time
for i from 0 to len(s) - 1:
    B = NextBitset(B, s[i])
    # note: we overwrite, keeping only the latest index per bitset
    map[B] = i
for each c in alphabet: # reset for the second pass
    B[c] = 0
max_distance = map.find(B) + 1 # all-zero bitset: best substring starting at 0
# second pass: O(n)/O(nlogn) time
for i from 0 to len(s) - 1:
    B = NextBitset(B, s[i])
    j = map.find(B) # O(1)/O(logn); always found (at least i itself)
    max_distance = max(max_distance, j - i)

I'm not sure exactly what amit proposes, so if this is the same idea, please consider it another explanation. This can be accomplished in a single traversal.
Produce a bitset of length equal to the alphabet's for each index of the string. Store the first index at which each unique bitset is encountered while traversing the string. Update the largest interval between the current bitset and a previously seen occurrence of it.
For example, the string "aabccab":

index:   -1   0   1   2   3   4   5   6
char:         a   a   b   c   c   a   b
a-bit:    0   1   0   0   0   0   1   1
b-bit:    0   0   0   1   1   1   1   0
c-bit:    0   0   0   0   1   0   0   0
              ^                       ^
              |_______________________|
               largest interval between
               current and previously
               seen bitset

The bitsets at indices 0 and 6 match, so s[1..6] = "abccab" (length 6) is the longest substring in which every character appears an even number of times.
The update for each iteration can be accomplished in O(1) by preprocessing, for each character, a bit mask to XOR with the previous bitset:

bitset          mask        result
  0              1            1
  1     XOR      0     =      1
  0              0            0

The mask here updates the character associated with the first bit of the alphabet-bitset.
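Both answers boil down to a few lines of Python. Here is a sketch of the single-pass variant, assuming a lowercase a-z alphabet and using an integer as the parity bitset (the names are mine):

```python
def longest_even_substring(s):
    # bit k of mask is 1 iff the k-th letter has occurred an odd
    # number of times in the prefix consumed so far
    first_seen = {0: -1}      # parity mask -> earliest prefix index
    mask = best = 0
    for i, ch in enumerate(s):
        mask ^= 1 << (ord(ch) - ord('a'))
        if mask in first_seen:
            # same parity mask as an earlier prefix: every character
            # count strictly in between is even
            best = max(best, i - first_seen[mask])
        else:
            first_seen[mask] = i
    return best
```

For example, longest_even_substring("aabccab") returns 6, corresponding to "abccab".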


Palindrome partitioning with interval scheduling

So I was looking at various algorithms for solving the palindrome partitioning problem.
For example, for the string "banana" the minimum number of cuts so that each substring is a palindrome is 1, i.e. "b|anana".
Now I tried solving this problem using interval scheduling like:
Input: banana
Transformed string: # b # a # n # a # n # a #
P[] = lengths of palindromes considering each character as center of palindrome.
I[] = intervals
String: #  b  #  a  #  n  #  a  #  n  #  a  #
P[i]:   0  1  0  1  0  3  0  5  0  3  0  1  0
I[i]:   0  1  2  3  4  5  6  7  8  9 10 11 12
Example: the palindrome with 'a' (index 7) as its center is "anana", with P[7] = 5.
Now constructing intervals for each character based on P[i]:
b = (0,2)
a = (2,4)
n = (2,8)
a = (2,12)
n = (6,12)
a = (10,12)
So, now if I have to schedule these intervals on the time line 0 to 12 such that the minimum number of intervals is scheduled and no time slot remains empty, I would choose intervals (0,2) and (2,12). Hence the answer is 1, as I have broken the given string into two palindromes.
Another test case:
String: #  E  #  A  #  B  #  A  #  E  #  A  #  B  #
P[i]:   0  1  0  1  0  5  0  1  0  5  0  1  0  1  0
I[i]:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
Constructing the intervals (numbered by center, left to right): 1: E = (0,2), 2: A = (2,4), 3: B = (0,10), 4: A = (6,8), 5: E = (4,14), 6: A = (10,12), 7: B = (12,14). (These were plotted on a graph; figure not shown.)
Now, the minimum sets of intervals that can be scheduled are either:
1(0,2), 2(2,4), 5(4,14) OR
3(0,10), 6(10,12), 7(12,14)
Hence we have 3 parts, so the number of cuts is 2, giving either
E|A|BAEAB
EABAE|A|B
These are just examples. I would like to know whether this algorithm works for all cases, or whether there are cases where it would definitely fail.
Please help me find a proof that it works in every scenario.
Note: please don't be discouraging if this post makes no sense; I have put a lot of time and effort into this problem. Just state a reason or provide a link from which I can move forward with this solution. Thank you.
As long as you can get a partition of the string, your algorithm will work.
Recall that a partition P of a set S is a set of non-empty subsets A1, ..., An such that:
the union of A1, ..., An is S;
the intersection of any Ai, Aj (with i != j) is empty.
Even though palindrome partitioning deals with strings (which are a bit different from sets), the properties of a partition still hold.
Hence, if you have a partition, you consequently have a set of time intervals without "holes" to schedule.
Choosing the partition with the minimum number of subsets, makes you have the minimum number of time intervals and therefore the minimum number of cuts.
Furthermore, you always have at least one palindrome partition of a string: in the worst case, you get a palindrome partition made of single characters.
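For experimenting with the interval idea on small inputs, here is a Python sketch of my own: it builds the transformed string, computes the center radii P[] by plain O(n^2) expansion, and then finds the minimum number of palindromic intervals that exactly tile the time line with a simple DP. It is a reference answer to compare a scheduler against, not the asker's algorithm itself:

```python
def min_palindrome_cuts(s):
    t = "#" + "#".join(s) + "#"   # transformed string of length 2n+1
    n = len(t)
    # P[c] = max radius r such that t[c-r..c+r] is a palindrome
    P = [0] * n
    for c in range(n):
        r = 0
        while c - r - 1 >= 0 and c + r + 1 < n and t[c - r - 1] == t[c + r + 1]:
            r += 1
        P[c] = r
    # dp[j] = minimum number of palindromic intervals exactly covering [0, j]
    INF = float("inf")
    dp = [INF] * n
    dp[0] = 0
    for j in range(2, n, 2):          # interval endpoints sit on '#' positions
        for i in range(0, j, 2):
            c, r = (i + j) // 2, (j - i) // 2
            if P[c] >= r and dp[i] + 1 < dp[j]:
                dp[j] = dp[i] + 1     # a palindrome spans (i, j)
    return dp[n - 1] - 1              # k intervals means k-1 cuts
```

min_palindrome_cuts("banana") returns 1 and min_palindrome_cuts("eabaeab") returns 2, matching the two worked examples.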

Enumerating all possible sequences when we are given a set of N numbers and a range for each of the N numbers

Problem Statement:
We are given
A set of T numbers S1, S2,....ST
An integer called Range
This means S1 can take on (2*Range+1) values (S1-Range,S1-Range+1,...S1, S1+1,....S1+Range)
Similarly S2, ...ST can take on 2*Range+1 values
Problem 1: How do I enumerate all the possible sequences? I.e., all (2*Range+1)^T sequences: (S1-Range, S2, ..., ST), (S1-Range+1, S2, ..., ST), ..., (S1, S2-Range, S3, ..., ST), etc.
Problem 2: How do I list only the sequences whose sum is S1+S2+...+ST?
For problem 1: The approach I am considering is to do a
for (i=0; i<pow(Range,T);i++)
{
Sequences that can be derived from i are
1. {Si + i mod pow(Range,i)}
2. {Si - i mod pow(Range,i)}
}
Any other more elegant solution?
Also, any ideas for problem 2?
For #1, one way is to think of it like incrementing a number: you increment the last digit, and when it overflows you reset it to its initial value (0) and increment the next digit.
So, create an array of size T, then initialize its elements to (S1-Range, S2-Range, ..., ST-Range). Print it.
Now increment the last value to ST-Range+1. Print it. Keep incrementing and printing until you reach ST+Range. When an increment would exceed that bound, reset the value back to ST-Range, move one position left, and increment there; repeat if that overflows too. If you move past the leftmost position, you're done; otherwise print.
// Input: T, S[T], Range
create V[T]
for i in 1..T:
    V[i] = S[i] - Range
loop forever {
    print V
    i = T
    V[i]++
    while V[i] > S[i] + Range {
        V[i] = S[i] - Range
        i--
        if i < 1:
            return // we're done
        V[i]++
    }
}
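The pseudo code translates almost line for line into Python (0-based indices; the names are mine):

```python
def enumerate_sequences(S, R):
    # odometer-style enumeration of all (2R+1)^T value tuples
    T = len(S)
    V = [s - R for s in S]            # start every digit at its minimum
    out = []
    while True:
        out.append(tuple(V))
        i = T - 1
        V[i] += 1
        while V[i] > S[i] + R:        # overflow: reset and carry left
            V[i] = S[i] - R
            i -= 1
            if i < 0:
                return out            # carried past the first digit: done
            V[i] += 1
```

For example, enumerate_sequences([5, 10], 1) yields 9 tuples, from (4, 9) up to (6, 11).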
For #2, it's a bit different. For the description I'm going to ignore the values of S and focus on the delta (-Range, ..., 0, ..., +Range) for each position, calling it D.
Since sum(D)=0, the initial set of values are (-Range, -Range, ..., +Range, +Range). If T is even, first half are -Range, and second half are +Range. If T is odd, then middle value is 0.
Now look at how you want the iterations to go:
-2 -2 0 2 2
-2 -2 1 1 2
-2 -2 1 2 1
-2 -2 2 0 2 (A)
-2 -2 2 1 1
-2 -2 2 2 0
-2 -1 -1 2 2
-2 -1 0 1 2
-2 -1 0 2 1
-2 -1 1 0 2
The logic here is that you skip the last digit, since it's always a function of the other digits. You increment the right-most digit that can be incremented, and reset the digits to the right of it to the lowest possible values that still allow sum(D) = 0.
The logic is a bit more involved, so I'll let you have some fun writing it. ;-)
A good helper method, though, would be one that resets the digits after a certain position to their lowest possible values, given a starting delta. You can then use it to initialize the array in the first place by calling reset(1, 0), i.e. reset positions 1..T using a starting delta of 0.
Then, when you increment D[3] to 2 in the step marked A above, you call reset(4, -2), i.e. reset positions 4..5 using a starting delta of -2. With a maximum value of 2 (Range) for the last digit, that means D[4] cannot be lower than 0.
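For #2, here is a recursive Python sketch of my own formulation: the "reset to lowest feasible values" idea is folded into a bound check that prunes any digit choice the remaining positions could no longer cancel out:

```python
def zero_sum_deltas(T, R):
    # all tuples (d1, ..., dT) with each di in [-R, R] and sum zero;
    # added positionwise to S they give sequences summing to sum(S)
    out = []
    def rec(pos, remaining, cur):
        if pos == T:
            out.append(tuple(cur))    # remaining == 0 by the pruning below
            return
        left = T - pos - 1            # digits still to place after this one
        for d in range(-R, R + 1):
            # descend only if the rest can still cancel the running sum
            if -left * R <= remaining - d <= left * R:
                cur.append(d)
                rec(pos + 1, remaining - d, cur)
                cur.pop()
    rec(0, 0, [])
    return out
```

zero_sum_deltas(2, 1) returns [(-1, 1), (0, 0), (1, -1)], in the same left-to-right order the answer describes.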

Why does this maximum product subarray algorithm work?

The problem is to find the contiguous subarray within an array (containing at least one number) which has the largest product.
For example, given the array [2,3,-2,4],
the contiguous subarray [2,3] has the largest product 6.
Why does the following work? Can anyone provide any insight on how to prove its correctness?
public static int MaxProduct(int[] nums)
{
    if (nums == null || nums.Length == 0)
    {
        throw new ArgumentException("Invalid input");
    }
    int max = nums[0];
    int min = nums[0];
    int result = nums[0];
    for (int i = 1; i < nums.Length; i++)
    {
        int prev_max = max;
        int prev_min = min;
        max = Math.Max(nums[i], Math.Max(prev_max * nums[i], prev_min * nums[i]));
        min = Math.Min(nums[i], Math.Min(prev_max * nums[i], prev_min * nums[i]));
        result = Math.Max(result, max);
    }
    return result;
}
Start from the logic side to understand how to solve the problem. There are two relevant traits of each subarray to consider:
If it contains a 0, the product of the subarray is 0 as well.
If the subarray contains an odd number of negative values, its total product is negative as well; otherwise it is positive (or 0, counting 0 as positive).
Now we can start off with the algorithm itself:
Rule 1: zeros
Since a 0 zeroes out the product of the subarray, the subarray of the solution mustn't contain a 0, unless the input contains only negative values and 0s. This is achieved quite simply, since max and min are both reset to 0 as soon as a 0 is encountered in the array:
max = Math.Max(0 , Math.Max(prev_max * 0 , prev_min * 0));
min = Math.Min(0 , Math.Min(prev_max * 0 , prev_min * 0));
will logically evaluate to 0, no matter what the input so far was.
arr:    1 1 1 1 0 1 1 1 0 1 1 1 0 1 1 0
result: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
min:    1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
max:    1 1 1 1 0 1 1 1 0 1 1 1 0 1 1 0
// non-zero values don't matter for Rule 1, so I just used 1
Rule 2: negative numbers
With Rule 1 we've already implicitly split the array into subarrays, such that each subarray consists of either a single 0 or multiple non-zero values. Now the task is to find the largest possible product inside each such subarray (which I'll just call the array from here on).
If the number of negative values in the array is even, the problem becomes trivial: multiply all values in the array, and the result is its maximum product. For an odd number of negative values there are two possible cases:
The array contains only a single negative value: then either the subarray of all values with smaller index than the negative value, or the subarray of all values with larger index, is the subarray with the maximum product.
The array contains at least 3 negative values: then we have to eliminate either the first negative number together with all of its predecessors, or the last negative number together with all of its successors.
Now let's have a look at the code:
max = Math.Max(nums[i] , Math.Max(prev_max * nums[i] , prev_min * nums[i]));
min = Math.Min(nums[i] , Math.Min(prev_max * nums[i] , prev_min * nums[i]));
Case 1: the evaluation of min is actually irrelevant, since the sign of the running product flips only once, at the negative value. As soon as the negative number nums[i] is encountered, max becomes nums[i], since max and min are both at least 1 (positive integers), so multiplying either by nums[i] yields a number <= nums[i]. For the first number after the negative one, nums[i + 1], max becomes nums[i + 1] again. Since the maximum found so far is persisted in result (result = Math.Max(result, max);) after each step, this automatically yields the correct result for that array.
arr:    2  3   2  -4   4   5
result: 2  6  12  12  12  20
max:    2  6  12  -4   4  20
// min omitted, since it's irrelevant here
Case 2: here min becomes relevant too. Before we encounter the first negative value, min is the smallest product ending at the current position. When the first negative value is encountered, min turns negative. We continue to build both products (min and max), which effectively swap roles each time a negative value is encountered, and we keep updating result. When the last negative value of the array is encountered, result holds the value of the subarray that eliminates the last negative value and its successors. After the last negative value, max is the product of the subarray that eliminates the first negative value and its predecessors, and min becomes irrelevant. From there we simply keep multiplying max with the remaining values and updating result until the end of the array.
arr:    2   3   -4    3   -2     5     -6      3
result: 2   6    6    6  144   720    720    720
min:    2   3  -24  -72   -6   -30  -4320  -12960
max:    2   6   -4    3  144   720    180    540
// min becomes irrelevant after the last negative value
Putting the pieces together
Since min and max are reset every time we encounter a 0, we can simply reuse them for each subarray that doesn't contain a 0. Thus Rule 1 is applied implicitly, without interfering with Rule 2. And since result isn't reset when a new subarray is inspected, the value persists across all subarrays. Thus the algorithm works.
Hope this is understandable. (To be honest, I doubt it, and I'll try to improve the answer if any questions appear.) Sorry for the monstrous answer.
Let's assume the contiguous subarray that produces the maximal product is a[i], a[i+1], ..., a[j]. Since it is the subarray with the largest product, it is also the suffix of a[0], a[1], ..., a[j] that produces the largest product.
The idea of the given algorithm is the following: for every prefix array a[0], ..., a[j], find the suffix with the largest product. Out of these suffixes, take the maximal one.
At the beginning, the smallest and biggest suffix products are simply nums[0]. Then the loop iterates over all other numbers in the array. The largest suffix product is always built in one of three ways: it is just the last number nums[i]; or it is the largest suffix product of the shortened list multiplied by the last number (if nums[i] > 0); or it is the smallest (< 0) suffix product multiplied by the last number (if nums[i] < 0). (*)
Using the helper variable result, you store the maximal such suffix-product you found so far.
(*) This fact is quite easy to prove. If there were a different suffix product producing a bigger number, then together with the last number nums[i] it would create an even bigger suffix product, which is a contradiction.
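For checking the traces above by hand, the same recurrence is compact in Python (a sketch, not the original C#):

```python
def max_product(nums):
    if not nums:
        raise ValueError("Invalid input")
    # best/worst: largest and smallest product of a subarray ending here
    result = best = worst = nums[0]
    for x in nums[1:]:
        candidates = (x, best * x, worst * x)
        best, worst = max(candidates), min(candidates)
        result = max(result, best)
    return result
```

For example, max_product([2, 3, -2, 4]) returns 6.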

Heap sort pseudo code algorithm

In heap sort algorithm
n := m
for k := m div 2 down to 0
    downheap(k);
repeat
    t := a[0]
    a[0] := a[n-1]
    a[n-1] := t
    n := n - 1
    downheap(0);
until n <= 0
Can someone please explain to me what is done in the lines
n=m
for k:= m div 2 down to 0
downheap(k);
I think that is the heap-building process, but what is meant by for k := m div 2 down to 0?
Also, is n the number of items? So in an array representation the last element is stored at a[n-1]?
But why run it until n >= 0? Can't we finish at n > 0, because the first element gets automatically sorted?
n=m
for k:= m div 2 down to 0
downheap(k);
In a binary heap, half of the nodes have no children. So you can build a heap by starting at the midpoint and sifting items down. What you're doing here is building the heap from the bottom up. Consider this array of five items:
[5, 3, 2, 4, 1]
Or, as a tree:
5
3 2
4 1
The length is 5, so we want to start at index 2 (assume a 1-based heap array). downheap, then, will look at the node labeled 3 and compare it with the smallest child. Since 1 is smaller than 3, we swap the items giving:
5
1 2
4 3
Since we reached the leaf level, we're done with that item. Move on to the first item, 5. It's larger than its smallest child, 1, so we swap items:
1
5 2
4 3
But the item 5 is still larger than its children, so we do another swap:
1
3 2
4 5
And we're done. You have a valid heap.
It's instructive to do that by hand (with pencil and paper) to build a larger heap, say 10 items. That will give you a very good understanding of how the algorithm works.
For purposes of building the heap in this way, it doesn't matter if the array indexes start at 0 or 1. If the array is 0-based, then you end up making one extra call to downheap, but that doesn't do anything because the node you're trying to move down is already a leaf node. So it's slightly inefficient (one extra call to downheap), but not harmful.
It is important, however, that if your root node is at index 1, that you stop your loop with n > 0 rather than n >= 0. In the latter case, you could very well end up adding a bogus value to your heap and removing an item that's supposed to be there.
for k:= m div 2 down to 0
This appears to be pseudocode for:
for(int k = m/2; k >= 0; k--)
Or possibly
for(int k = m/2; k > 0; k--)
Depending on whether "down to 0" is inclusive or not.
Also, is n the number of items?
Initially yes, but it is decremented on the line n := n - 1.
Can't we finish at n > 0, because the first element gets automatically sorted?
Yes, this is effectively what happens. Once n becomes zero at n := n - 1, the loop body is almost finished, so the only thing that gets executed before until n <= 0 terminates the loop is downheap(0);
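A 0-based Python rendering of the whole procedure may make the pseudo code concrete. This is the max-heap variant (sorting ascending) with the n > 1 stopping condition discussed above; the worked example used a min-heap, which would sort descending:

```python
def heapsort(a):
    def downheap(k, n):
        # sift a[k] down within a[0:n] until no child is larger
        while 2 * k + 1 < n:
            c = 2 * k + 1
            if c + 1 < n and a[c + 1] > a[c]:
                c += 1                      # pick the larger child
            if a[k] >= a[c]:
                break
            a[k], a[c] = a[c], a[k]
            k = c
    n = len(a)
    for k in range(n // 2 - 1, -1, -1):     # build the heap bottom-up
        downheap(k, n)
    while n > 1:                            # n > 1: last item sorts itself
        a[0], a[n - 1] = a[n - 1], a[0]     # move the current max to the end
        n -= 1
        downheap(0, n)
    return a
```

heapsort([5, 3, 2, 4, 1]) returns [1, 2, 3, 4, 5].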

Dynamic programming: can interval of even 1's and 0's be found in linear time?

Found the following interview question on the web:
You have an array of
0s and 1s and you want to output all the intervals (i, j) where the
number of 0s and numbers of 1s are equal. Example
pos = 0 1 2 3 4 5 6 7 8
0 1 0 0 1 1 1 1 0
One interval is (0, 1) because there the number
of 0 and 1 are equal. There are many other intervals, find all of them
in linear time.
I think there is no linear-time algorithm, as there may be on the order of n^2 such intervals.
Am I right? How can I prove that there can be n^2 of them?
This is the fastest way I can think of to do this, and it is linear in the number of intervals.
Let L be your original list of numbers and A be a hash of arrays, where initially A[0] = [-1] (the empty prefix):
sum = 0
for i in 0..n-1:
    if L[i] == 0:
        sum--
        A[sum].push(i)
    elif L[i] == 1:
        sum++
        A[sum].push(i)
Now A is essentially an x-y graph of the running sum of the sequence (x is the index in the list, y is the sum). Whenever two x values x1 and x2 map to the same y value, you have an interval (x1, x2] where the number of 0s and 1s is equal.
Each array M in A, with m = M.length, contributes m(m-1)/2 such intervals (the arithmetic sum from 1 to m-1).
Calculating A by hand for an example list, we use this chart:
L:       #   0   1   0   1   0   0   1   1   1   1   0
A keys:  0  -1   0  -1   0  -1  -2  -1   0   1   2   1
L index: -1  0   1   2   3   4   5   6   7   8   9  10
(I've added a # to represent the start of the list, with an index of -1, and removed all the numbers that are not 0 or 1, since they're just distractions.) A will look like this:
[-2]->[5]
[-1]->[0, 2, 4, 6]
[0]->[-1, 1, 3, 7]
[1]->[8, 10]
[2]->[9]
For any M = [a1, a2, a3, ...], (ai + 1, aj) where j > i will be an interval with the same number of 0s as 1s. For example, in [-1]->[0, 2, 4, 6], the intervals are (1, 2), (1, 4), (1, 6), (3, 4), (3, 6), (5, 6).
Building the array A is O(n), but printing the intervals from A must take time linear in the number of intervals. In fact, that could be your proof that this cannot be done in time linear in n: there can be more intervals than n, and you need at least one iteration per interval to print them all.
Unless, of course, you consider building A enough to "find" all the intervals (since it's obvious from A what the intervals are); in that case it is linear in n. :P
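The whole construction fits in a few lines of Python, using a running prefix sum in place of the chart (the names are mine):

```python
from collections import defaultdict

def equal_intervals(L):
    # positions[s] = prefix indices at which the running (+1/-1) sum is s
    positions = defaultdict(list)
    positions[0].append(-1)          # the empty prefix
    s = 0
    out = []
    for i, x in enumerate(L):
        s += 1 if x == 1 else -1
        for j in positions[s]:       # equal prefix sums => balanced interval
            out.append((j + 1, i))
        positions[s].append(i)
    return out
```

On the question's example [0, 1, 0, 0, 1, 1, 1, 1, 0] this reports the seven intervals (0,1), (1,2), (1,4), (3,4), (0,5), (2,5), (7,8).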
A linear solution is possible (sorry, earlier I argued that this had to be n^2) if you're careful not to actually print the results!
First, let's define a "score" for any set of zeros and ones as the number of ones minus the number of zeroes. So (0,1) has a score of 0, while (0) is -1 and (1,1) is 2.
Now, start from the right. If the right-most digit is a 0 then it can be combined with any group to the left that has a score of 1. So we need to know what groups are available to the left, indexed by score. This suggests a recursive procedure that accumulates groups with scores. The sweep process is O(n) and at each step the process has to check whether it has created a new group and extend the table of known groups. Checking for a new group is constant time (lookup in a hash table). Extending the table of known groups is also constant time (at first I thought it wasn't, but you can maintain a separate offset that avoids updating each entry in the table).
So we have a peculiar situation: each step of the process identifies a set of results of size O(n), but the calculation necessary to do this is constant time (within that step). So the process itself is still O(n) (proportional to the number of steps). Of course, actually printing the results (either during the step, or at the end) makes things O(n^2).
I'll write some Python code to test/demonstrate.
Here we go:
SCORE = [-1,1]

class Accumulator:

    def __init__(self):
        self.offset = 0
        self.groups_to_right = {} # map from score to start index
        self.even_groups = []
        self.index = 0

    def append(self, digit):
        score = SCORE[digit]
        # want existing groups at -score, to sum to zero
        # but there's an offset to correct for, so we really want
        # groups at -(score+offset)
        corrected = -(score + self.offset)
        if corrected in self.groups_to_right:
            # if this were a linked list we could save a reference
            # to the current value. it's not, so we need to filter
            # on printing (see below)
            self.even_groups.append(
                (self.index, self.groups_to_right[corrected]))
        # this updates all the known groups
        self.offset += score
        # this adds the new one, which should be at the index so that
        # index + offset = score (so index = score - offset)
        groups = self.groups_to_right.get(score - self.offset, [])
        groups.append(self.index)
        self.groups_to_right[score - self.offset] = groups
        # and move on
        self.index += 1
        #print self.offset
        #print self.groups_to_right
        #print self.even_groups
        #print self.index

    def dump(self):
        # printing the results does take longer, of course...
        for (end, starts) in self.even_groups:
            for start in starts:
                # this discards the extra points that were added
                # to the data after we added it to the results
                # (avoidable with linked lists)
                if start < end:
                    print (start, end)

    @staticmethod
    def run(input):
        accumulator = Accumulator()
        print input
        for digit in input:
            accumulator.append(digit)
        accumulator.dump()
        print

Accumulator.run([0,1,0,0,1,1,1,1,0])
And the output:
dynamic: python dynamic.py
[0, 1, 0, 0, 1, 1, 1, 1, 0]
(0, 1)
(1, 2)
(1, 4)
(3, 4)
(0, 5)
(2, 5)
(7, 8)
You might be worried that some additional processing (the filtering for start < end) is done in the dump routine that displays the results. But that's because I am working around Python's lack of linked lists (I want to both extend a list and save the previous value in constant time).
It may seem surprising that the result is of size O(n^2) while the process of finding the results is O(n), but it's easy to see how that is possible: at one "step" the process identifies a number of groups (of size O(n)) by associating the current point (self.index in append, or end in dump()) with a list of end points (self.groups_to_right[...] or ends).
Update: one further point. The table of "groups to the right" will have a "typical width" of sqrt(n) entries (this follows from the central limit theorem; the running score is basically a 1D random walk). Since an entry is added at each step, the average list length is also sqrt(n) (the n values shared out over sqrt(n) bins). That means the expected time for this algorithm (i.e. with random inputs), if you include printing the results, is O(n^(3/2)), even though the worst case is O(n^2).
Answering the question directly: you have to construct an example where there are more than O(N) matches.
Let N be of the form 2^k, with the following input:
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 (here, N=16)
number of matches (where 0 is the starting character):
length    #
  2       N/2
  4       N/2 - 1
  6       N/2 - 2
  8       N/2 - 3
 ...      ...
  N       1
The total number of matches (starting with 0) is: (1+N/2) * (N/2) / 2 = N^2/8 + N/4
The matches starting with 1 are almost the same, except that there is one fewer for each length.
Total: (N^2/8 + N/4) * 2 - N/2 = N^2/4
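Counting matches instead of printing them is O(n) with the same prefix-sum trick, which makes the N^2/4 figure easy to check (a small sketch of mine):

```python
from collections import Counter

def count_balanced(L):
    # each pair of equal prefix sums contributes exactly one balanced interval
    seen = Counter({0: 1})           # the empty prefix
    s = total = 0
    for x in L:
        s += 1 if x == 1 else -1
        total += seen[s]
        seen[s] += 1
    return total
```

For the alternating string with N = 16, count_balanced([0, 1] * 8) returns 64 = N^2/4.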
Every such interval contains at least one adjacent pair, either (0,1) or (1,0). Therefore it's simply a matter of finding every occurrence of (0,1) or (1,0), and then, for each, seeing whether it is adjacent to an existing solution or whether the two bookend elements form another solution.
With a bit of storage trickery you will be able to find all solutions in linear time. Enumerating them will be O(N^2), but you should be able to encode them in O(N) space.
