Find all substrings that don't contain the entire set of characters - algorithm

I was asked this in an interview.
I'm given a string whose characters come from the set {a,b,c} only. Find all substrings that don't contain all the characters from the set. For example, substrings that contain only a's, only b's, only c's, or only a's and b's, only b's and c's, or only c's and a's. I gave him the naive O(n^2) solution by generating all substrings and testing them.
The interviewer wanted an O(n) solution.
Edit: My attempt was to keep the last indexes of a, b and c and run a pointer from left to right; any time all 3 had been seen, I moved the start of the substring to exclude the earliest one and started counting again. It doesn't seem exhaustive.
For example, if the string is abbcabccaa,
let i be the pointer that traverses the string and let start be the start of the substring.
1) i = 0, start = 0
2) i = 1, start = 0, last_index(a) = 0 --> 1 substring - a
3) i = 2, start = 0, last_index(a) = 0, last_index(b) = 1 --> 1 substring ab
4) i = 3, start = 0, last_index(a) = 0, last_index(b) = 2 --> 1 substring abb
5) i = 4, start = 1, last_index(b) = 2, last_index(c) = 3 --> 1 substring bbc (removed a from the substring)
6) i = 5, start = 3, last_index(c) = 3, last_index(a) = 4 --> 1 substring ca (removed b from the substring)
But this isn't exhaustive.

Given that the problem in its original definition can't be solved in less than O(N^2) time, as some comments point out, I suggest a linear algorithm for counting the number of substrings (not necessarily unique in their values, but unique in their positions within the original string).
The algorithm
1. Set count = 0.
2. For every char C in {'a','b','c'}, scan the input S and break it into the longest sequences not including C. For each such section A, add |A|*(|A|+1)/2 to count. This addition stands for the number of legal substrings inside A.
3. Now we have the total number of legal substrings including only {'a','b'}, only {'a','c'} and only {'b','c'}. The problem is that we counted substrings with a single repeated character twice. To fix this we iterate over S again, this time subtracting |A|*(|A|+1)/2 for every longest sequence A of a single character that we encounter.
4. Return count.
Example
S='aacb'
Breaking it using C='a' gives us only 'cb', so count = 3. For C='b' we have 'aac', which makes count = 3 + 6 = 9. With C='c' we get 'aa' and 'b', so count = 9 + 3 + 1 = 13. Now we have to do the subtraction: 'aa': -3, 'c': -1, 'b': -1. So we have count = 13 - 5 = 8.
The 8 substrings are:
'a'
'a' (the second char this time)
'aa'
'ac'
'aac'
'cb'
'c'
'b'
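The counting steps above can be sketched in Python as follows. This is only a minimal sketch: the function names tri and count_legal_substrings are made up for illustration, and it assumes the input contains only characters from the given alphabet.

```python
from itertools import groupby


def tri(k):
    """Number of non-empty substrings of a run of length k."""
    return k * (k + 1) // 2


def count_legal_substrings(s, alphabet="abc"):
    """Count substrings (by position) that miss at least one alphabet char."""
    count = 0
    # For every char C, split S into maximal pieces that do not contain C.
    for c in alphabet:
        for piece in s.split(c):
            count += tri(len(piece))
    # Substrings made of one repeated character were counted twice,
    # so subtract them once per maximal single-character run.
    for _, run in groupby(s):
        count -= tri(len(list(run)))
    return count


print(count_legal_substrings("aacb"))  # 8, as in the worked example
```

Each pass over the string is linear, so the whole count stays O(n) for a fixed alphabet.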

To get something better than O(n^2) we need additional assumptions (for example, asking only for the longest substrings with this property).
Consider a string of the form aaaaaaaaaabbbbbbbbbb of length n. Every one of its n(n+1)/2 substrings qualifies, so if we want to list them all we need at least O(n^2) time.
I came up with a linear solution for the longest substrings.
Take a set S of all maximal substrings obtained by splitting the string at every a, then at every b, and finally at every c. Each of those steps can be done in O(n), so we have O(3n), thus O(n).
Example:
Take aaabcaaccbaa.
In this case set S contains:
substrings separated by a: bc, ccb
substrings separated by b: aaa, caacc, aa
substrings separated by c: aaab, aa, baa.
By a set I mean a data structure that supports adding and finding an element by a given key in O(1).
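A minimal sketch of this splitting idea, assuming Python's built-in set as the O(1) structure; the name maximal_pieces is hypothetical. Note that the set stores pieces by value, so equal pieces found at different positions collapse into one entry.

```python
def maximal_pieces(s, alphabet="abc"):
    """Collect the non-empty pieces obtained by splitting s on each char."""
    pieces = set()
    for c in alphabet:
        # str.split yields empty strings for adjacent separators; skip them.
        pieces.update(p for p in s.split(c) if p)
    return pieces


print(sorted(maximal_pieces("aaabcaaccbaa")))
# ['aa', 'aaa', 'aaab', 'baa', 'bc', 'caacc', 'ccb']
```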

Related

Interview Question: Remove repeating numbers at the end of an array

I got a surprising interview question today at a big Bay Area tech company that I was absolutely stumped by despite seeming so easy. Was wondering if anyone has seen it or can offer a simpler solution as the interviewer didn't want to show me the answer. The solution can be written in any language or pseudocode.
Question:
Given a list of numbers, remove any extraneous repeating suffix sequences of numbers that appear at the end of the list until it has no repeating suffix sequences. The repeating sequence can be cut-off.
For example:
[1,2,3,4,5,6,7,5,6,7,5,6] -> [1,2,3,4,5,6,7]
explanation: [5, 6, 7] were repeating
Also consider the situation
[1,2,3,4,5,4,5,1,4,5,4,5,1,4,5,4,5] -> [1,2,3,4,5,4,5,1] # not [1,2,3,4,5,4,5,1,4,5,4,5,1]
explanation: [4,5,4,5,1] is a repeating sequence
There are always two ways to approach a problem like this: finding any solution and finding an efficient one. It is usually better to start with any solution and then think about how to optimize it.
Now, as we can see in the second example, the problem is complicated by the fact that the repeating pattern is not known. So we could just try all the possible patterns at the end of the list. For each one we would need to check two things:
is it actually repeating
how long is the result
Then we could just take the shortest result. Here is the Python code:
def remove_repeating_tail(a: list) -> list:
    results = []
    for i in range(len(a)):
        tail = a[i:]
        results.append(remove_repeats(a, tail))
    if len(results) == 0:
        return a
    return sorted(results, key=len)[0]
Also we made sure we cover all the cases. Empty list, no repeating pattern. Next we need to write remove_repeats. Also we check the empty repeating pattern, so we need to be aware of that.
def remove_repeats(a: list, tail: list) -> list:
    assert len(tail) <= len(a)
    if len(tail) == 0:
        return a
    remainder = a
    count = 0
    while remainder[-len(tail):] == tail:
        remainder = remainder[:-len(tail)]
        count += 1
    if count <= 1:
        return a
    return remainder
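We strip every trailing copy of the repeating pattern. Note that, as written, the function does not add one copy of the pattern back at the end, which is why the outputs below are shorter than the expected results in the question. Now it's time to test whether the code actually works, if that is possible in the interview.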
remove_repeating_tail([1,2,3,4,5,6,7,5,6,7,5,6])
-> [1, 2, 3, 4, 5, 6]
remove_repeating_tail([1,2,3,4,5,4,5,1,4,5,4,5,1,4,5,4,5])
-> [1, 2, 3, 4, 5, 4, 5]
Also good to check some other cases:
remove_repeating_tail([1,2,3,4])
-> [1, 2, 3, 4]
remove_repeating_tail([])
-> []
After quite a bit of fixing we got the above, which I think is correct. In particular, I initially missed the following:
At first I had an infinite loop in remove_repeats for an empty tail.
remove_repeats always removed the tail, and sometimes everything, as I wasn't checking that there is at least one repeat. I then added the counting.
I made simple mistakes like writing results = res instead of results.append(res), leading to some exceptions.
Then came a lot of simplification. First I used a sentinel None to communicate back that the tail is not repeating, but we could just return the whole list. Then I checked for the repeat with an if before the while loop, but realized it's basically doing the same as the first iteration, so I used counting.
Similarly, I don't like the if len(results) == 0: check. I would probably add a to the results at the beginning and remove the check, as then there is always a result. Then we could start the counting from 1 instead of 0. Still, I kept it in.
If we want something fast, we first need to analyze the complexity.
A single call to remove_repeats for a list of size n and a tail of size k runs its loop O(n / k) times. We call this function n times, and then we sort the results. Wait, why do we sort? We could just take the minimum: return min(results, key=len). That's better.
The loop calls remove_repeats with every tail size from k = 1 to n. So we have:
sum(k = 1 .. n) O(n / k). This is n / 1 + n / 2 + n / 3 + .. + n / n = n * H_n, where H_n is the n-th harmonic number (I had to look this up on Wikipedia). We can also just make our life easy and say it's less than O(n^2) for now. Otherwise, using the approximation H_n = ln(n) + 0.58 (roughly), the sum is about n ln(n) + 0.58 n, so the number of loop iterations overall is O(n log n), not counting the cost of the slicing and list comparison inside each iteration. Not too bad, I would say. Is it optimal? Maybe. Here I would compare it to some other similar algorithms (like substring search, etc).
Before going there, at this point, I would check with the interviewer, where he would like to go next. As there are many directions.
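As a quick sanity check of the harmonic-sum estimate, the following sketch compares the exact sum n/1 + n/2 + .. + n/n with n ln(n) + gamma * n, where gamma is the Euler-Mascheroni constant (about 0.5772). The function name iteration_estimate is made up for illustration.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant, rounded


def iteration_estimate(n):
    """Return (exact sum of n/k for k = 1..n, n*ln(n) + gamma*n)."""
    exact = sum(n / k for k in range(1, n + 1))
    approx = n * math.log(n) + EULER_GAMMA * n
    return exact, approx


exact, approx = iteration_estimate(10_000)
print(exact, approx)  # the two values differ by well under 0.1%
```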
This seems a tricky question and there may not be a simple solution. The best solution I can think of would be O(n) time and O(n) space, and that is if I am not missing any edge case.
Let's take as example
[1,2,3,4,5,4,5,1,4,5,4,5,1,4,5,4,5] -> [1,2,3,4,5,4,5,1]
Steps would be as follows:
Iterate over the input array from last index to first and build a dictionary (hashtable) with every number in the array being a key and value: a list of positions where the specific number is found in the array.
Occurrences dictionary will become:
{
5: [16, 14, 11, 9, 6, 4],
4: [15, 13, 10, 8, 5, 3],
1: [12, 7, 0],
3: [2],
2: [1]
}
Find the possible suffix lengths by calculating deltas between every position and first position for every number. This way we take into consideration the case in which a specific number repeats in the suffix or in the prefix.
We then add each distinct possible suffix length to a set.
We sort the possible suffix lengths in descending order.
We get following suffix lengths:
[12, 10, 7, 5, 2]
For every possible length l, we test if arr[n-1] == arr[n-1-l]. If l is our suffix's length, it means that the number at last position is repeated at exactly l positions before. We then check the last l elements to respect the same condition. If they do, we found the maximum suffix length. If not, the max suffix length is even smaller, so we check the next possible length.
After finding the correct suffix length, we delete the remaining numbers that repeat at positions pos-l. We then return the slice of array with suffix removed.
def removeRepeatingSuffixes(arr):
    if not arr:
        return []
    n = len(arr)
    # map each value to the list of its positions, last position first
    occurrences = {}
    for i in range(n - 1, -1, -1):
        c = arr[i]
        if c not in occurrences:
            occurrences[c] = []
        occurrences[c].append(i)
    # treat edge case: no repeating suffix
    if len(occurrences[arr[n-1]]) == 1:
        return arr
    # create a set of possible suffix lengths,
    # based on the differences between the positions of each number.
    possible_suffixes_lengths_set = set()
    for c, olist in occurrences.items():
        if len(olist) >= 2:
            for i in range(len(olist)-1):
                delta = olist[i] - olist[len(olist)-1]
                possible_suffixes_lengths_set.add(delta)
    suff_lengths = sorted(possible_suffixes_lengths_set, reverse=True)
    for l in suff_lengths:
        if arr[n - 1] == arr[n - 1 - l]:
            # possible suffix length, check if last l characters repeat
            ok_length = True
            for j in range(n-2, n-1-l, -1):
                # a negative j - l would wrap around to the end of the list
                if j - l < 0 or arr[j] != arr[j-l]:
                    ok_length = False
                    break
            if ok_length:
                last_i = n-1-l
                while last_i - l >= 0 and arr[last_i] == arr[last_i - l]:
                    last_i -= 1
                # return non-repeating slice, from 0 to last_i
                return arr[0:last_i + 1]
    # no candidate length matched, so there is nothing to remove
    return arr
A quick way to remove repeats or dedupe is to change to a set() type instead of a list, although that discards the order and removes every duplicate, not just a repeating suffix.

Substring search with max 1's in a binary sequence

Problem
The task is to find the substring of the given binary string with the highest score. The substring must be at least the given minimum length.
score = number of 1s / substring length where score can range from 0 to 1.
Inputs:
1. min length of substring
2. binary sequence
Outputs:
1. index of first char of substring
2. index of last char of substring
Example 1:
input
-----
5
01010101111100
output
------
7
11
explanation
-----------
1. start with minimum window = 5
2. start_ind = 0, end_index = 4, score = 2/5 (0.4)
3. start_ind = 1, end_index = 5, score = 3/5 (0.6)
4. and so on...
5. start_ind = 7, end_index = 11, score = 5/5 (1) [max possible]
Example 2:
input
-----
5
10110011100
output
------
2
8
explanation
-----------
1. while calculating all scores for windows 5 to len(sequence)
2. max score occurs in the case: start_ind=2, end_ind=8, score=5/7 (0.7143) [max possible]
Example 3:
input
-----
4
00110011100
output
------
5
8
What I attempted
The only technique I could come up with was brute force, with nested for loops:
for window_size in (min to max)
    for ind 0 to end
        calculate score
        save max score
Can someone suggest a better algorithm to solve this problem?
There are a few observations to make before we start talking about an algorithm; some of these observations have already been pointed out in the comments.
Maths
Take the minimum length to be M, the length of the entire string to be L, and a substring from the ith char to the jth char (inclusive-exclusive) to be S[i:j].
All optimal substrings will satisfy at least one of two conditions:
It is exactly M characters in length
It starts and ends with a 1 character
The reason for the latter being if it were longer than M characters and started/ended with a 0, we could just drop that 0 resulting in a higher ratio.
In the same spirit (again, for the 2nd case), there exists an optimal substring which is not preceded by a 1. Otherwise, if it were, we could include that 1, resulting in an equal or higher ratio. The same logic applies to the end of S and a following 1.
Building on the above- such a substring being preceded or followed by another 1 will NOT be optimal, unless the substring contains no 0s. In the case where it doesn't contain 0s, there will exist an optimal substring of length M as well anyways.
Again, that all only applies to the length greater than M case substrings.
Finally, there exists an optimal substring that has length at least M (by definition), and at most 2 * M - 1. If an optimal substring had length K, we could split it into two substrings of length floor(K/2) and ceil(K/2) - S[i:i+floor(K/2)] and S[i+floor(K/2):i+K]. If the substring has the score (ratio) R, and its halves R0 and R1, we would have one of two scenarios:
R = R0 = R1, meaning we could pick either half and get the same score as the combined substring, giving us a shorter substring.
If this substring has length less than 2 * M, we are done- we have an optimal substring of length [M, 2*M).
Otherwise, recurse on the new substring.
R0 != R1, so (without loss of generality) R0 < R < R1, meaning the combined substring would not be optimal in the first place.
Note that I say "there exists an optimal" as opposed to "the optimal". This is because there may be multiple optimal solutions, and the observations above may refer to different instances.
Algorithm
You could search every window size [M, 2*M) at every offset, which would already be better than a full search for small M. You can also try a two-phase approach:
search every M sized window, find the max score
search from the beginning of every run of 1s forward through a special list of ends of runs of 1s, implicitly skipping over 0s and irrelevant 1s, breaking when out of the [M, 2 * M) bound.
For random data, I only expect this to save a small factor- skipping 15/16 of the windows (ignoring the added overhead). For less-random data, you could potentially see huge benefits, particularly if there's LOTS of LARGE runs of 1s and 0s.
The biggest speedup you'll be able to do (besides limiting the window max to 2 * M) is computing a cumulative sum of the bit array. This lets you query "how many 1s were seen up to this point". You can then take the difference of two elements in this array to query "how many 1s occurred between these offsets" in constant time. This allows for very quick calculation of the score.
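A minimal sketch of the first suggestion combined with the cumulative sum: it checks every window length in [M, 2*M) at every offset and scores each window in constant time. The function name best_window is hypothetical, and ties are resolved in favour of the first window found (smallest start, then shortest), which is an assumption rather than something the problem statement specifies.

```python
from itertools import accumulate


def best_window(bits, min_len):
    """Return (start, end), inclusive, of a highest-scoring window."""
    n = len(bits)
    if not 0 < min_len <= n:
        raise ValueError("min_len must be between 1 and len(bits)")
    # prefix[i] = number of 1s in bits[:i]
    prefix = [0] + list(accumulate(int(b) for b in bits))
    best_ones, best_len, best_start = -1, 1, 0
    for start in range(n - min_len + 1):
        longest = min(2 * min_len - 1, n - start)
        for length in range(min_len, longest + 1):
            ones = prefix[start + length] - prefix[start]
            # Compare ones/length with best_ones/best_len without division.
            if ones * best_len > best_ones * length:
                best_ones, best_len, best_start = ones, length, start
    return best_start, best_start + best_len - 1


print(best_window("01010101111100", 5))  # (7, 11)
print(best_window("10110011100", 5))     # (2, 8)
print(best_window("00110011100", 4))     # (5, 8)
```

This runs in O(L * M) time, compared with O(L^2) window evaluations for the full search.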
You can use the two-pointer method, starting from the left-most and right-most ends, then adjust the pointers while searching for the highest score.
We can add a cache to save time.
Example: (Python)
binary="01010101111100"
length=5

def get_score(binary,left,right):
    ones=0
    for i in range(left,right+1):
        if binary[i]=="1":
            ones+=1
    score= ones/(right-left+1)
    return score

cache={}

def get_sub(binary,length,left,right):
    if (left,right) in cache:
        return cache[(left,right)]
    table=[0,set()]
    if right-left+1<length:
        pass
    else:
        scores=[[get_score(binary,left,right),set([(left,right)])],
                get_sub(binary,length,left+1,right),
                get_sub(binary,length,left,right-1),
                get_sub(binary,length,left+1,right-1)]
        for s in scores:
            if s[0]>table[0]:
                table[0]=s[0]
                table[1]=s[1]
            elif s[0]==table[0]:
                for e in s[1]:
                    table[1].add(e)
    cache[(left,right)]=table
    return table

result=get_sub(binary,length,0,len(binary)-1)
print("Score: %f"%result[0])
print("Index: %s"%result[1])
Output
Score: 1.000000
Index: {(7, 11)}

Number of substrings of a given string containing a specific character

What is the most efficient algorithm to count the number of substrings of a given string that contain a given character?
e.g. for the string abb and the character b
substrings: a, b, b, ab, bb, abb.
Answer: substrings containing b at least once = 5.
PS. I solved this question by generating all the substrings and then checking each one, in O(n^2). I just want to know whether there can be a better solution to this.
Suppose you need to count the substrings that contain character X.
Scan the string left to right, keeping the position of the last X seen: lastX, with starting value -1.
When you meet X at position i, add i+1 to the result and update lastX
(this is the number of substrings ending at the current position, and they all contain X).
When you meet another character, add lastX + 1 to the result
(this is again the number of substrings ending at the current position and containing X),
because the rightmost possible start of such a substring is the position of the last X.
The algorithm is linear.
Example:
a X a a X a

idx  char  good substrings ending at idx   lastX  count  overall count
0    a     -                               -1     0      0
1    X     aX X                            1      2      2
2    a     aXa Xa                          1      2      4
3    a     aXaa Xaa                        1      2      6
4    X     aXaaX XaaX aaX aX X             4      5      11
5    a     aXaaXa XaaXa aaXa aXa Xa        4      5      16
Python code:
def subcnt(s, c):
    last = -1
    cnt = 0
    for i in range(len(s)):
        if s[i] == c:
            last = i
        cnt += last + 1
    return cnt

print(subcnt('abcdba', 'b'))
You could turn this around and scan your string for occurrences of your letter. Every time you find an occurrence in some position i, you know that it is contained by definition in all the substrings that contain it (i.e. all substrings which start before or at i and end at or after i), so you only need to store pairs of indices to define substrings instead of storing substrings explicitly.
That being said, you'll still need O(n²) with this approach because although you don't mind repeated substrings as your example shows, you don't want to count the same substring twice, so you still have to make sure that you don't select the same pair of indices twice.
Let's consider the string as abcdaefgabb and the given character as a.
Loop over the string char by char.
If a character matches a given character, let's say a at index 4, so number of substrings which will contain a is from abcda to aefgabb. So, we add (4-0 + 1) + (10 - 4) = 11. These represent substrings as abcda,bcda,cda,da,a,ae,aef,aefg,aefga,aefgab and aefgabb.
This applies to wherever you find a, like you find it at index 0 and also at index 8.
Final answer is the sum of above mentioned math operations.
Update: You will have to maintain 2 pointers between the last occurrence of a and the current a to avoid counting duplicate substrings which start and end at the same indices.
Think of a substring as selecting two elements from the gaps between the letters in your string and including everything between them (where there are gaps on the extreme ends of the string).
For a string of length n, there are choose(n+1,2) substrings.
Of those, for each run of k characters that doesn't include the target, there are choose(k+1,2) substrings that only include letters from that substring. All other substrings of the main string must include the target.
Answer: choose(n+1,2) - sum(choose(k_i+1,2)), where the k_i are the lengths of runs of letters that don't include the target.
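The formula above translates almost directly into Python. This is a small sketch assuming Python 3.8+ (for math.comb) and a single-character target; the name count_containing is made up for illustration.

```python
from math import comb


def count_containing(s, target):
    """Count substrings of s (by position) that contain target at least once."""
    total = comb(len(s) + 1, 2)
    # s.split(target) yields exactly the runs that avoid the target;
    # empty runs contribute comb(1, 2) == 0.
    without = sum(comb(len(run) + 1, 2) for run in s.split(target))
    return total - without


print(count_containing("abb", "b"))     # 5
print(count_containing("aXaaXa", "X"))  # 16
```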

Pair up strings to form palindromes

Given N strings, each of length at most 1000. We can concatenate a pair of strings end to end. For example, if one is "abc" and the other is "cba", we can get "abccba" as well as "cbaabc". Some strings may be left without being concatenated to any other string. Also, no string can be concatenated to itself.
We can only concatenate two strings if the result is a palindrome. I need to find the minimum number of strings left after making such pairs.
Example: suppose we have 9 strings:
aabbaabb
bbaabbaa
aa
bb
a
bbaa
bba
bab
ab
Then the answer here is 5.
Explanation: here are the 5 strings:
"aabbaabb" + "bbaabbaa" = "aabbaabbbbaabbaa"
"aa" + "a" = "aaa"
"bba" + "bb" = "bbabb"
"bab" + "ab" = "babab"
"bbaa"
Also there can be 1000 such strings in total.
1) Make a graph where we have one node for each word.
2) Go through all pairs of words and check whether they form a palindrome when concatenated. If they do, connect the corresponding nodes in the graph with an edge.
3) Now use a matching algorithm to find the maximum number of edges you can match: http://en.wikipedia.org/wiki/Blossom_algorithm
Time complexity: O(N) for step 1, O(N*N*1000) for step 2 and O(V^4) for step 3 (where V = N is the number of nodes), yielding a total complexity of O(N^4).
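Steps 1 and 2 can be sketched as below; the maximum matching of step 3 is deliberately left out, since it would normally come from an existing general-graph matching implementation. The names is_palindrome and palindrome_edges are hypothetical. As I read the example, the final answer would then be N minus the size of a maximum matching (9 strings, 4 pairs, answer 5).

```python
from itertools import combinations


def is_palindrome(s):
    return s == s[::-1]


def palindrome_edges(words):
    """Return index pairs (i, j), i < j, whose words can be concatenated
    in at least one order to form a palindrome."""
    edges = set()
    for i, j in combinations(range(len(words)), 2):
        if is_palindrome(words[i] + words[j]) or is_palindrome(words[j] + words[i]):
            edges.add((i, j))
    return edges


words = ["aabbaabb", "bbaabbaa", "aa", "bb", "a", "bbaa", "bba", "bab", "ab"]
edges = palindrome_edges(words)
# The four pairs used in the explanation are all edges of the graph.
print({(0, 1), (2, 4), (3, 6), (7, 8)} <= edges)  # True
```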

Disperse Duplicates in an Array

Source : Google Interview Question
Write a routine to ensure that identical elements in the input are maximally spread in the output.
Basically, we need to place the same elements in such a way that the TOTAL spreading is as large as possible.
Example:
Input: {1,1,2,3,2,3}
Possible Output: {1,2,3,1,2,3}
Total dispersion = difference between positions of 1's + 2's + 3's = 4-1 + 5-2 + 6-3 = 9.
I am NOT AT ALL sure if there's an optimal polynomial-time algorithm available for this. Also, no other detail is provided for the question other than this.
What I thought is: calculate the frequency of each element in the input, then arrange them in the output, one of each distinct element at a time, until all the frequencies are exhausted.
I am not sure of my approach.
Any approaches or ideas?
I believe this simple algorithm would work:
count the number of occurrences of each distinct element.
make a new list
add one instance of all elements that occur more than once to the list (order within each group does not matter)
add one instance of all unique elements to the list
add one instance of all elements that occur more than once to the list
add one instance of all elements that occur more than twice to the list
add one instance of all elements that occur more than thrice to the list
...
Now, this will intuitively not give a good spread:
for {1, 1, 1, 1, 2, 3, 4} ==> {1, 2, 3, 4, 1, 1, 1}
for {1, 1, 1, 2, 2, 2, 3, 4} ==> {1, 2, 3, 4, 1, 2, 1, 2}
However, I think this is the best spread you can get given the scoring function provided.
Since the dispersion score counts the sum of the distances instead of the squared sum of the distances, you can have several duplicates close together, as long as you have a large gap somewhere else to compensate.
For a sum-of-squared-distances score, the problem becomes harder.
Perhaps the interview question hinged on the candidate recognizing this weakness in the scoring function?
In Perl:
@a=(9,9,9,2,2,2,1,1,1);
Then make a hash table of the counts of the different numbers in the list, like a frequency table:
map { $x{$_}++ } @a;
Then repeatedly walk through all the keys found, with the keys in a known order, and add one instance of each remaining number to an output list until all the keys are exhausted:
@r=();
$g=1;
while( $g == 1 ) {
    $g=0;
    for my $n (sort keys %x)
    {
        if ($x{$n}>0) {
            push @r, $n;
            $x{$n}--;
            $g=1
        }
    }
}
I'm sure that this could be adapted to any programming language that supports hash tables.
Python code for the algorithm suggested by Vorsprung and HugoRune:
from collections import Counter, defaultdict

def max_spread(data):
    cnt = Counter(data)
    res, num = [], list(cnt)
    # emit one copy of every remaining value per round
    while len(cnt) > 0:
        for i in num:
            if cnt[i] > 0:
                res.append(i)
                cnt[i] -= 1
                if cnt[i] == 0: del cnt[i]
    return res

def calc_spread(data):
    # map each value to the list of positions where it occurs
    d = defaultdict(list)
    for i, v in enumerate(data):
        d[v].append(i)
    return sum([max(x) - min(x) for _, x in d.items()])
HugoRune's answer takes some advantage of the unusual scoring function but we can actually do even better: suppose there are d distinct non-unique values, then the only thing that is required for a solution to be optimal is that the first d values in the output must consist of these in any order, and likewise the last d values in the output must consist of these values in any (i.e. possibly a different) order. (This implies that all unique numbers appear between the first and last instance of every non-unique number.)
The relative order of the first copies of non-unique numbers doesn't matter, and likewise nor does the relative order of their last copies. Suppose the values 1 and 2 both appear multiple times in the input, and that we have built a candidate solution obeying the condition I gave in the first paragraph that has the first copy of 1 at position i and the first copy of 2 at position j > i. Now suppose we swap these two elements. Element 1 has been pushed j - i positions to the right, so its score contribution will drop by j - i. But element 2 has been pushed j - i positions to the left, so its score contribution will increase by j - i. These cancel out, leaving the total score unchanged.
Now, any permutation of elements can be achieved by swapping elements in the following way: swap the element in position 1 with the element that should be at position 1, then do the same for position 2, and so on. After the ith step, the first i elements of the permutation are correct. We know that every swap leaves the scoring function unchanged, and a permutation is just a sequence of swaps, so every permutation also leaves the scoring function unchanged! This is true for the d elements at both ends of the output array.
When 3 or more copies of a number exist, only the position of the first and last copy contribute to the distance for that number. It doesn't matter where the middle ones go. I'll call the elements between the 2 blocks of d elements at either end the "central" elements. They consist of the unique elements, as well as some number of copies of all those non-unique elements that appear at least 3 times. As before, it's easy to see that any permutation of these "central" elements corresponds to a sequence of swaps, and that any such swap will leave the overall score unchanged (in fact it's even simpler than before, since swapping two central elements does not even change the score contribution of either of these elements).
This leads to a simple O(n log n) algorithm (or O(n) if you use bucket sort for the first step) to generate a solution array Y from a length-n input array X:
Sort the input array X.
Use a single pass through X to count the number of distinct non-unique elements. Call this d.
Set i, j and k to 0.
While i < n:
    If i+1 < n and X[i+1] == X[i], we have a non-unique element:
        Set Y[j] = Y[n-j-1] = X[i].
        Increment i twice, and increment j once.
        While i < n and X[i] == X[i-1]:
            Set Y[d+k] = X[i].
            Increment i and k.
    Otherwise we have a unique element:
        Set Y[d+k] = X[i].
        Increment i and k.
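The steps above can be sketched in Python roughly as follows. The names disperse and total_dispersion are made up for illustration, the bounds checks on i are my addition, and the sketch assumes the input values can be sorted.

```python
from collections import Counter


def disperse(values):
    """Arrange values so that the summed first-to-last distance is maximal."""
    x = sorted(values)
    n = len(x)
    # d = number of distinct values that occur at least twice
    d = sum(1 for c in Counter(x).values() if c >= 2)
    y = [None] * n
    i = j = k = 0
    while i < n:
        if i + 1 < n and x[i + 1] == x[i]:
            # Non-unique value: one copy at each end of the output.
            y[j] = y[n - j - 1] = x[i]
            i += 2
            j += 1
            # Any further copies go into the central block.
            while i < n and x[i] == x[i - 1]:
                y[d + k] = x[i]
                i += 1
                k += 1
        else:
            # Unique value: central block.
            y[d + k] = x[i]
            i += 1
            k += 1
    return y


def total_dispersion(ys):
    """Sum over distinct values of (last position - first position)."""
    first, last = {}, {}
    for pos, v in enumerate(ys):
        first.setdefault(v, pos)
        last[v] = pos
    return sum(last[v] - first[v] for v in first)


out = disperse([1, 1, 2, 3, 2, 3])
print(out, total_dispersion(out))  # [1, 2, 3, 3, 2, 1] 9
```

For the question's input this gives the same total dispersion (9) as the sample output {1,2,3,1,2,3}, just with the last copies in a different order.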

Resources