Number of substrings of a given string containing a specific character - algorithm

What is the most efficient algorithm to count the substrings of a given string that contain a given character?
E.g. for the string abb and the character b:
Substrings: a, b, b, ab, bb, abb.
Answer: substrings containing b at least once = 5.
PS. I solved this by generating all the substrings and then checking them, in O(n^2). I just want to know whether there is a better solution.

Suppose you need to find the substrings containing character X.
Scan the string left to right, keeping the position of the last X seen: lastX, with starting value -1.
When you meet X at position i, add i+1 to the result and update lastX
(this is the number of substrings ending at the current position, and they all contain X).
When you meet another character, add lastX + 1 to the result
(this is again the number of substrings ending at the current position and containing X),
because the rightmost possible start of such a substring is the position of the last X.
The algorithm is linear.
Example:
a X a a X a

idx  char  good substrings ending at idx  lastX  count  overall count
0    a     -                              -1     0      0
1    X     aX X                           1      2      2
2    a     aXa Xa                         1      2      4
3    a     aXaa Xaa                       1      2      6
4    X     aXaaX XaaX aaX aX X            4      5      11
5    a     aXaaXa XaaXa aaXa aXa Xa       4      5      16
Python code:
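    def subcnt(s, c):
        last = -1  # position of the most recent c, -1 if none seen yet
        cnt = 0
        for i in range(len(s)):
            if s[i] == c:
                last = i
            # substrings ending at i that contain c start at 0..last
            cnt += last + 1
        return cnt

    print(subcnt('abcdba', 'b'))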

You could turn this around and scan your string for occurrences of your letter. Every time you find an occurrence in some position i, you know that it is contained by definition in all the substrings that contain it (i.e. all substrings which start before or at i and end at or after i), so you only need to store pairs of indices to define substrings instead of storing substrings explicitly.
That being said, you'll still need O(n²) with this approach because although you don't mind repeated substrings as your example shows, you don't want to count the same substring twice, so you still have to make sure that you don't select the same pair of indices twice.

Let's consider the string as abcdaefgabb and the given character as a.
Loop over the string char by char.
If a character matches a given character, let's say a at index 4, so number of substrings which will contain a is from abcda to aefgabb. So, we add (4-0 + 1) + (10 - 4) = 11. These represent substrings as abcda,bcda,cda,da,a,ae,aef,aefg,aefga,aefgab and aefgabb.
This applies to wherever you find a, like you find it at index 0 and also at index 8.
Final answer is the sum of above mentioned math operations.
Update: You will have to maintain 2 pointers, between the last a that occurred and the current a, to avoid counting duplicate substrings which start and end at the same indices.

Think of a substring as selecting two elements from the gaps between the letters in your string and including everything between them (where there are gaps on the extreme ends of the string).
For a string of length n, there are choose(n+1,2) substrings.
Of those, for each run of k characters that doesn't include the target, there are choose(k+1,2) substrings that only include letters from that substring. All other substrings of the main string must include the target.
Answer: choose(n+1,2) - sum(choose(k_i+1,2)), where the k_i are the lengths of runs of letters that don't include the target.
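The formula above can be sketched in a few lines of Python. This is only an illustration: the function name count_with_char is made up here, and it assumes the target c is a single character.

```python
from math import comb


def count_with_char(s, c):
    """Count substrings of s (by position) that contain the character c.

    Hypothetical helper: all substrings, minus those that lie entirely
    inside a maximal run of characters different from c.
    """
    total = comb(len(s) + 1, 2)  # every substring of s
    run = 0                      # length of the current run without c
    for ch in s + c:             # trailing c acts as a sentinel that flushes the last run
        if ch == c:
            total -= comb(run + 1, 2)
            run = 0
        else:
            run += 1
    return total


print(count_with_char("abb", "b"))
```

Like the scanning answer, this makes a single pass over the string, so it is linear in the length of s.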

Related

Longest odd Palindromic Substring with middle index i

Longest Odd Palindromes
Problem Description
Given a string S (consisting of only lower case characters) and Q queries.
In each query you will be given an integer i, and your task is to find the length of the longest odd palindromic substring whose middle index is i. Note:
1.) Assume 1 based indexing.
2.) Longest odd palindrome: A palindrome substring whose length is odd.
Problem Constraints
1<=|s|,Q<=1e5
1<=i<=|s|
Input Format
First argument A is string S.
Second argument B is an array of integers where B[i] denotes the query index of ith query.
Output Format
Return an array of integers where ith integer denotes the answer of ith query.
Is there any better way to solve this question other than brute force, that is, generating all the palindromic substrings and checking them?
There is Manacher's algorithm, which calculates the number of palindromes centered at the i-th index in linear time.
After the precalculation stage you can answer each query in O(1). I changed the result array to contain the lengths of the longest palindromes centered at every position.
Python code (link contains C++ one)
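    def manacher_odd(s):
        n = len(s)
        odds = []     # odds[i]: number of odd palindromes centered at i
        l, r = 0, -1  # bounds of the rightmost palindrome found so far
        for i in range(n):
            # reuse the mirror position's answer when i lies inside [l, r]
            k = min(odds[l+r-i], r-i+1) if i <= r else 1
            while (i+k < n) and (i-k >= 0) and (s[i+k] == s[i-k]):
                k += 1
            odds.append(k)
            if (i+k-1 > r):
                l, r = i-k+1, i+k-1
        # convert counts to lengths of the longest odd palindromes
        for i in range(n):
            odds[i] = 2 * odds[i] - 1
        return odds

    print(manacher_odd("abaaaba"))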
[1, 3, 1, 7, 1, 3, 1]
There are 2 possible optimizations they might be looking for.
First, you can do an initial run over S first, cleverly building a lookup table, and then your query will just use that, which I think would be faster if B is long.
Alternatively, if not doing a look up, then while you're searching at index i, you'll potentially search neighboring indexes at the same time. As you check i, you can also be checking i+1, i-1, i+2, i-2, etc... as you go, and save that answer for later. This seems the less promising route to me, so I want to dive into the first idea more.
Finally, if B is quite short, then the best answer might be brute force, actually. It's good to know when to keep it simple.
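For the short-B case, a brute-force sketch could look like the following. The function names are invented for this illustration, and queries are assumed to be 1-based as in the problem statement.

```python
def longest_odd_palindrome_at(s, i):
    """Length of the longest odd palindrome whose middle is the 1-based index i."""
    center = i - 1  # convert to a 0-based index
    radius = 0
    # grow outwards while both neighbours exist and match
    while (center - radius - 1 >= 0
           and center + radius + 1 < len(s)
           and s[center - radius - 1] == s[center + radius + 1]):
        radius += 1
    return 2 * radius + 1


def answer_queries(s, queries):
    """Answer each query independently, with no shared lookup table."""
    return [longest_odd_palindrome_at(s, i) for i in queries]


print(answer_queries("abaaaba", [1, 2, 4, 7]))
```

Each query costs time proportional to the length of the palindrome found, so this is only attractive when there are few queries.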
Initial run method
One optimization that comes to mind for a pre-process run is as follows:
1. Search the next unknown index: brute force it by looking forwards and back, while recording the frequency of each letter (including the middle one). If a palindrome of length 1 or 3 was found, move to the next index and repeat.
2. If a palindrome of length 5 or longer was found, calculate the mid points of any letters that showed up more than twice and which are to the right of the current index.
3. Any point between the current index and the index of the last palindrome letter that isn't in the mid-points list is a 1 for length.
4. This means you'll search all the midpoints found in (2). After that, you'll continue searching from the index of the last letter of the palindrome found in (2).
An example
Let's say S starts with: ```a, b, c, d, a, f, g, f, g, f, a, d, c, b, a, ...``` and you have checked from ```i = 2``` up to ```i = 7``` but found nothing except a run of 3 at ```i = 7```. Now, you check index ```i = 8```. You will find a palindrome extending out 7 letters in each direction, for a total of 15, but as you check, note any letters that show up more than twice. In this case, there are 3 ```f```s and 4 ```a```s. Find any mid points these pairs have that are right of the current index (8). In this case, 2 ```f```s have a mid point of i=9, the 2 right-most ```a```s have a midpoint of i=13. Once you're done looking at i=8, then you can skip any index not on your list, all the way up to the last letter you found in i=8. For example, we only have to check i=9 and i=13, and then start from i=15, checking every step. We've been able to skip checking i=10, 11, 12, and 14.

Substring search with max 1's in a binary sequence

Problem
The task is to find a substring from the given binary string with highest score. The substring should be at least of given min length.
score = number of 1s / substring length where score can range from 0 to 1.
Inputs:
1. min length of substring
2. binary sequence
Outputs:
1. index of first char of substring
2. index of last char of substring
Example 1:
input
-----
5
01010101111100
output
------
7
11
explanation
-----------
1. start with minimum window = 5
2. start_ind = 0, end_index = 4, score = 2/5 (0.4)
3. start_ind = 1, end_index = 5, score = 3/5 (0.6)
4. and so on...
5. start_ind = 7, end_index = 11, score = 5/5 (1) [max possible]
Example 2:
input
-----
5
10110011100
output
------
2
8
explanation
-----------
1. while calculating all scores for windows 5 to len(sequence)
2. max score occurs in the case: start_ind=2, end_ind=8, score=5/7 (0.7143) [max possible]
Example 3:
input
-----
4
00110011100
output
------
5
8
What I attempted
The only technique i could come up with was a brute force technique, with nested for loops
    for window_size in (min to max)
        for ind 0 to end
            calculate score
            save max score
Can someone suggest a better algorithm to solve this problem?
There are a few observations to make before we start talking about an algorithm; some of them have already been pointed out in the comments.
Maths
Take the minimum length to be M, the length of the entire string to be L, and a substring from the ith char to the jth char (inclusive-exclusive) to be S[i:j].
All optimal substrings will satisfy at least one of two conditions:
It is exactly M characters in length
It starts and ends with a 1 character
The reason for the latter being if it were longer than M characters and started/ended with a 0, we could just drop that 0 resulting in a higher ratio.
In the same spirit (again, for the 2nd case), there exists an optimal substring which is not preceded by a 1. Otherwise, if it were, we could include that 1, resulting in an equal or higher ratio. The same logic applies to the end of S and a following 1.
Building on the above- such a substring being preceded or followed by another 1 will NOT be optimal, unless the substring contains no 0s. In the case where it doesn't contain 0s, there will exist an optimal substring of length M as well anyways.
Again, that all only applies to the length greater than M case substrings.
Finally, there exists an optimal substring that has length at least M (by definition), and at most 2 * M - 1. If an optimal substring had length K >= 2 * M, we could split it into two substrings of lengths floor(K/2) and ceil(K/2), namely S[i:i+floor(K/2)] and S[i+floor(K/2):i+K], each of length at least M. If the substring has the score (ratio) R, and its halves have scores R0 and R1, we would have one of two scenarios:
R = R0 = R1, meaning we could pick either half and get the same score as the combined substring, giving us a shorter substring.
If this substring has length less than 2 * M, we are done: we have an optimal substring with length in [M, 2*M).
Otherwise, recurse on the new substring.
R0 != R1, so (without loss of generality) R0 < R < R1, meaning the combined substring would not be optimal in the first place.
Note that I say "there exists an optimal" as opposed to "the optimal". This is because there may be multiple optimal solutions, and the observations above may refer to different instances.
Algorithm
You could search every window size [M, 2*M) at every offset, which would already be better than a full search for small M. You can also try a two-phase approach:
search every M sized window, find the max score
search from the beginning of every run of 1s forward through a special list of ends of runs of 1s, implicitly skipping over 0s and irrelevant 1s, breaking when out of the [M, 2 * M) bound.
For random data, I only expect this to save a small factor- skipping 15/16 of the windows (ignoring the added overhead). For less-random data, you could potentially see huge benefits, particularly if there's LOTS of LARGE runs of 1s and 0s.
The biggest speedup you'll be able to do (besides limiting the window max to 2 * M) is computing a cumulative sum of the bit array. This lets you query "how many 1s were seen up to this point". You can then take the difference of two elements in this array to query "how many 1s occurred between these offsets" in constant time. This allows for very quick calculation of the score.
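A minimal sketch of the cumulative-sum idea combined with the [M, 2*M) bound on window lengths might look like this. The name best_window is made up for the illustration, ties are broken in favour of the first window found, and exact fractions are used so that equal scores compare equal.

```python
from fractions import Fraction
from itertools import accumulate


def best_window(bits, min_len):
    """Return ((first, last), score) for a highest-scoring window.

    Indices are 0-based and inclusive. Only window lengths in
    [min_len, 2 * min_len) are tried, relying on the argument above.
    """
    # prefix[k] = number of 1s in bits[:k]
    prefix = [0] + list(accumulate(int(ch) for ch in bits))
    n = len(bits)
    best_score, best = Fraction(-1), None
    for start in range(n - min_len + 1):
        max_end = min(n, start + 2 * min_len - 1)
        for end in range(start + min_len, max_end + 1):  # end is exclusive
            ones = prefix[end] - prefix[start]           # O(1) count of 1s
            score = Fraction(ones, end - start)
            if score > best_score:
                best_score, best = score, (start, end - 1)
    return best, best_score


print(best_window("01010101111100", 5))
```

This does O(L * M) work in total, since each of the L starting offsets tries at most M window lengths and each score costs constant time.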
You can use 2 pointer method, starting from both left-most and right-most ends. then adjust them searching for highest score.
We can add some cache to optimize time.
Example: (Python)
    binary = "01010101111100"
    length = 5

    def get_score(binary, left, right):
        # fraction of 1s in binary[left..right], both ends inclusive
        ones = 0
        for i in range(left, right + 1):
            if binary[i] == "1":
                ones += 1
        return ones / (right - left + 1)

    cache = {}

    def get_sub(binary, length, left, right):
        # best score, and the set of windows reaching it, among all windows
        # of at least `length` characters inside [left, right]
        if (left, right) in cache:
            return cache[(left, right)]
        table = [0, set()]
        if right - left + 1 >= length:
            scores = [[get_score(binary, left, right), {(left, right)}],
                      get_sub(binary, length, left + 1, right),
                      get_sub(binary, length, left, right - 1),
                      get_sub(binary, length, left + 1, right - 1)]
            for s in scores:
                if s[0] > table[0]:
                    table[0] = s[0]
                    table[1] = set(s[1])  # copy, so cached sets are not mutated
                elif s[0] == table[0]:
                    table[1].update(s[1])
        cache[(left, right)] = table
        return table

    result = get_sub(binary, length, 0, len(binary) - 1)
    print("Score: %f" % result[0])
    print("Index: %s" % result[1])
Output
Score: 1.000000
Index: {(7, 11)}

Counting in Wonderland

The text of Alice in Wonderland contains the word 'Wonderland' 8 times. (Let's be case-insensitive for this question).
However it contains the word many more times if you count non-contiguous subsequences as well as substrings, eg.
Either the well was very deep, or she fell very slowly, for she had
plenty of time as she went down to look about her and to WONDER what was
going to happen next. First, she tried to Look down AND make out what
she was coming to, but it was too dark to see anything;
(A subsequence is a sequence that can be derived from another sequence by deleting some elements without changing the order of the remaining elements. —Wikipedia)
How many times does the book contain the word Wonderland as a subsequence? I expect this will be a big number—it's a long book with many w's and o's and n's and d's.
I tried brute force counting (recursion to make a loop 10 deep) but it was too slow, even for that example paragraph.
Let's say you didn't want to search for wonderland, but just for w. Then you'd simply count how many times w occurred in the story.
Now let's say you want wo. For each first character of the current pattern you find, you add to your count:
How many times the current pattern without its first character occurs in the rest of the story, after this character you're at: so you have reduced the problem (story[1..n], pattern[1..n]) to (story[2..n], pattern[2..n])
How many times the entire current pattern occurs in the rest of the story. So you have reduced the problem to (story[2..n], pattern[1..n])
Now you can just add the two. There is no overcounting if we talk in terms of subproblems. Consider the example wawo. Obviously, wo occurs 2 times. You might think the counting will go like:
For the first w, add 1 because o occurs once after it and another 1 because wo occurs once after it.
For the second w, add 1 because o occurs once after it.
Answer is 3, which is wrong.
But this is what actually happens:
(wawo, wo) -> (awo, o) -> (wo, o) -> (o, o) -> (-, -) -> 1
                                            -> (-, o) -> 0
           -> (awo, wo) -> (wo, wo) -> (o, wo) -> (-, wo) -> 0
                                    -> (o, o) -> (-, -) -> 1
                                              -> (-, o) -> 0
So you can see that the answer is 2.
If you don't find a w, then the count for this position is just how many times wo occurs after this current character.
This allows for dynamic programming with memoization:
    count(story_index, pattern_index, dp):
        if dp[story_index, pattern_index] not computed:
            if pattern_index == len(pattern):
                return 1
            if story_index == len(story):
                return 0
            if story[story_index] == pattern[pattern_index]:
                dp[story_index, pattern_index] = count(story_index + 1, pattern_index + 1, dp) +
                                                 count(story_index + 1, pattern_index, dp)
            else:
                dp[story_index, pattern_index] = count(story_index + 1, pattern_index, dp)
        return dp[story_index, pattern_index]
Call with count(0, 0, dp). Note that you can make the code cleaner (remove the duplicate function call).
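As a rough sketch, the memoized recurrence could be written in Python with functools.lru_cache standing in for the dp table. The wrapper name count_subsequences is invented for this example, and since the recursion depth grows with the length of the story, it is only suitable for short inputs.

```python
from functools import lru_cache


def count_subsequences(story, pattern):
    """Count the occurrences of pattern as a subsequence of story."""

    @lru_cache(maxsize=None)
    def count(story_index, pattern_index):
        if pattern_index == len(pattern):
            return 1  # whole pattern matched
        if story_index == len(story):
            return 0  # story exhausted before the pattern
        # skip the current story character ...
        total = count(story_index + 1, pattern_index)
        # ... or, if it matches, also use it for the current pattern character
        if story[story_index] == pattern[pattern_index]:
            total += count(story_index + 1, pattern_index + 1)
        return total

    return count(0, 0)


print(count_subsequences("wawo", "wo"))
```

With the cache, each (story_index, pattern_index) pair is computed once, so the work is proportional to len(story) * len(pattern).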
Python code, with no memoization:
    def count(story, pattern):
        if len(pattern) == 0:
            return 1
        if len(story) == 0:
            return 0
        s = count(story[1:], pattern)
        if story[0] == pattern[0]:
            s += count(story[1:], pattern[1:])
        return s

    print(count('wonderlandwonderland', 'wonderland'))
Output:
17
This makes sense: for each i from 1 to 9, you can group the first i characters of the first wonderland in the story with the remaining final characters of the second wonderland, giving you 9 solutions. Another 2 are the words themselves. The other six are:
wonderlandwonderland
*********    *
********    **
********    *      *
**      *    *******
**      **    ******
***      *    ******
You're right that this will be a huge number. I suggest that you either use large integers or take the result modulo something.
The same program returns 9624 for your example paragraph.
The string "wonderland" occurs as a subsequence in Alice in Wonderland [1] 24100772180603281661684131458232 times.
The main idea is to scan the main text character by character, keeping a running count of how often each prefix of the target string (i.e.: in this case, "w", "wo", "won", ..., "wonderlan", and "wonderland") has occurred up to the current letter. These running counts are easy to compute and update. If the current letter does not occur in "wonderland", then the counts are left untouched. If the current letter is "a" then we increment the count of "wonderla"s seen by the number of "wonderl"s seen up to this point. If the current letter is "n" then we increment the count of "won"s by the count of "wo"s and the count of "wonderlan"s by the count of "wonderla"s. And so forth. When we reach end of the text, we will have the count of all prefixes of "wonderland" including the string "wonderland" itself, as desired.
The advantage of this approach is that it requires a single pass through the text and does not require O(n) recursive calls (which will likely exceed the maximum recursion depth unless you do something clever).
Code
    import fileinput

    target = 'wonderland'
    prefixes = dict()  # letter -> prefixes of target that end in that letter
    count = dict()     # prefix -> times seen so far as a subsequence
    for i in range(len(target)):
        letter = target[i]
        prefix = target[:i+1]
        if letter not in prefixes:
            prefixes[letter] = [prefix]
        else:
            prefixes[letter].append(prefix)
        count[prefix] = 0
    for line in fileinput.input():
        for letter in line.lower():
            if letter in prefixes:
                for prefix in prefixes[letter]:
                    if len(prefix) > 1:
                        # extend every shorter prefix seen so far by this letter
                        count[prefix] = count[prefix] + count[prefix[:len(prefix)-1]]
                    else:
                        count[prefix] = count[prefix] + 1
    print(count[target])
[1] Using this text from Project Gutenberg, starting with "CHAPTER I. Down the Rabbit-Hole" and ending with "THE END".
Following up on previous comments, if you are looking for an algorithm that would return 2 for the input wonderlandwonderland and 1 for wonderwonderland, then I think you could adapt the algorithm from this question:
How to find smallest substring which contains all characters from a given string?
Effectively, the change in your case would be that, once an instance of the word is found, you increment a counter and repeat the whole procedure with the remaining part of the text.
Such an algorithm would be O(n) in time, where n is the length of the text, and O(m) in space, where m is the length of the searched string.

Knuth-Morris-Pratt Fail table

I am studying for an exam I have and I am looking over the Knuth-Morris-Pratt algorithm. What is going to be on the exam is the Fail table and DFA construction. I understand DFA construction, but I don't really understand how to make the fail table.
If I have an example of a pattern "abababc" how do I build a fail table from this? The solution is:
Fail table:
0 1 2 3 4 5 6 7
0 0 0 1 2 3 4 0
but how do I get that? No code just an explanation of how to get that is necessary.
The value of cell i in the fail table for string s is defined as follows: take the substring of s consisting of its first i characters, and the value in the cell is the length of the longest proper (not the whole string) suffix of this substring that is equal to its prefix of the same length.
Let's take your example and consider the value for 6. The substring of s with length 6 is ababab. Its proper suffixes are babab, abab, bab, ab and b; on the other hand, its proper prefixes are ababa, abab, aba, ab and a. Now it is easy to see that the suffixes that are equal to prefixes of the same length are abab and ab. Of these the longer is abab, and thus the value in cell 6 is its length, 4.
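Although the question asks for an explanation only, the definition is easy to check mechanically. The sketch below applies it literally (function names are invented for this illustration), and next to it is the usual linear-time construction, which should produce the same table.

```python
def fail_table_by_definition(pattern):
    """table[i] = length of the longest proper suffix of pattern[:i]
    that is also a prefix of it. Quadratic or worse, for checking only."""
    table = [0]  # cell 0: the empty prefix
    for i in range(1, len(pattern) + 1):
        prefix = pattern[:i]
        best = 0
        for k in range(1, i):  # proper suffixes/prefixes only
            if prefix[:k] == prefix[-k:]:
                best = k
        table.append(best)
    return table


def fail_table(pattern):
    """Same table, built in linear time by reusing earlier cells."""
    table = [0] * (len(pattern) + 1)
    k = 0  # length of the current matching prefix
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k]  # fall back to the next shorter candidate
        if pattern[i] == pattern[k]:
            k += 1
        table[i + 1] = k
    return table


print(fail_table("abababc"))
```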
Pattern P = {abababc}
P[0] = 'a'. P[1] = 'b'. P[2] = 'a'. P[3] = 'b'. P[4] = 'a'. P[5] = 'b'. P[6] = 'c'.
The motive of the Fail Table is to identify the maximum possible shift (such that we would not miss out on any pattern matching, but would also not make unnecessary comparison), if first "i" character of the pattern string are matching and the break is found at the i+1 th character.
The number in the Fail Table indicates how many character still continues to match after the shift if the first i character of the pattern matches to the text.
Let FailTable be FT[].
FT[1] - 'a' matches with text. Break found at 'b'(P[1]). Do we have a proper suffix of 'a' which matches the proper prefix of 'a'? Ans is NO. So length of the String which still continues to match after the shift is 0. Hence FT[1] = 0.
FT[2] - 'ab' matches with text. Break found at 'a' (P[2]). Do we have a proper suffix of 'ab' which matches the proper prefix of 'ab'? Ans is NO. So length of the String which still continues to match after the shift is 0. Hence FT[2] = 0.
FT[3] - 'aba' matches with text. Break found at 'b' (P[3]). Do we have a proper suffix of 'aba' which matches the proper prefix of 'aba'? Ans is YES ('a'). So length of the String which still continues to match after the shift is 1. Hence FT[3] = 1.
FT[4] - 'abab' matches with text. Break found at 'a' (P[4]). Do we have a proper suffix of 'abab' which matches the proper prefix of 'abab'? Ans is YES('ab'). So length of the String which still continues to match after the shift is 2. Hence FT[4] = 2.
FT[5] - 'ababa' matches with text. Break found at 'b' (P[5]). Do we have a proper suffix of 'ababa' which matches the proper prefix of 'ababa'? Ans is YES('aba'). So length of the String which still continues to match after the shift is 3. Hence FT[5] = 3.
FT[6] - 'ababab' matches with text. Break found at 'c' (P[6]). Do we have a proper suffix of 'ababab' which matches the proper prefix of 'ababab'? Ans is YES('abab'). So length of the String which still continues to match after the shift is 4. Hence FT[6] = 4.
FT[7] - 'abababc' matches with text. No break found at all, Pattern matched with the text. Do we have a proper suffix of 'abababc' which matches the proper prefix of 'abababc'? Ans is NO. So length of the String which still continues to match after the shift is 0. Hence FT[7] = 0.
Hence the final array is FT[1..7] = [0,0,1,2,3,4,0]; together with FT[0] = 0 this gives the table in the question.
Hope it helps!

Find all substrings that don't contain the entire set of characters

This was asked to me in an interview.
I'm given a string whose characters come from the set {a,b,c} only. Find all substrings that don't contain all the characters from the set, e.g. substrings that contain only a's, only b's, only c's, only a's and b's, only b's and c's, or only c's and a's. I gave him the naive O(n^2) solution of generating all substrings and testing them.
The interviewer wanted an O(n) solution.
Edit: My attempt was to have the last indexes of a,b,c and run a pointer from left to right, and anytime all 3 were counted, change the start of the substring to exclude the earliest one and start counting again. It doesn't seem exhaustive
So for e.g, if the string is abbcabccaa,
let i be the pointer that traverses the string. Let start be start of the substring.
1) i = 0, start = 0
2) i = 1, start = 0, last_index(a) = 0 --> 1 substring - a
3) i = 2, start = 0, last_index(a) = 0, last_index(b) = 1 -- > 1 substring ab
4) i = 3, start = 0, last_index(a) = 0, last_index(b) = 2 --> 1 substring abb
5) i = 4, start = 1, last_index(b) = 2, last_index(c) = 3 --> 1 substring bbc(removed a from the substring)
6) i = 5, start = 3, last_index(c) = 3, last_index(a) = 4 --> 1 substring ca(removed b from the substring)
but this isn't exhaustive
Given that the problem in its original definition can't be solved in less than O(N^2) time, as some comments point out, I suggest a linear algorithm for counting the number of substrings (not necessarily unique in their values, but unique in their positions within the original string).
The algorithm
count = 0
For every char C in {'a','b','c'} scan the input S and break it into longest sequences not including C. For each such section A, add |A|*(|A|+1)/2 to count. This addition stands for the number of legal sub-strings inside A.
Now we have the total number of legal strings including only {'a','b'}, only {'a','c'} and only {'b','c'}. The problem is that we counted substrings with a single repeated character twice. To fix this we iterate over S again, this time subtracting |A|*(|A|+1)/2 for every largest sequence A of a single character that we encounter.
Return count
Example
S='aacb'
breaking it using 'a' gives us only 'cb', so count = 3. For C='b' we have 'aac', which makes count = 3 + 6 = 9. With C='c' we get 'aa' and 'b', so count = 9 + 3 + 1 = 13. Now we have to do the subtraction: 'aa': -3, 'c': -1, 'b': -1. So we have count=8.
The 8 substrings are:
'a'
'a' (the second char this time)
'aa'
'ac'
'aac'
'cb'
'c'
'b'
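The counting algorithm above can be sketched as follows. The function names are made up for this example, and the subtraction step is only correct as written for an alphabet of exactly three letters.

```python
def substrings_within_runs(s, allowed):
    """Count substrings of s that use only characters from `allowed`."""
    total = run = 0
    for ch in s:
        if ch in allowed:
            run += 1
            total += run  # `run` qualifying substrings end at this position
        else:
            run = 0
    return total


def count_substrings_missing_a_char(s, alphabet="abc"):
    """Count substrings (by position) that do not contain all three letters."""
    letters = set(alphabet)
    # substrings that avoid c, for each c in turn
    count = sum(substrings_within_runs(s, letters - {c}) for c in letters)
    # single-letter substrings were counted twice above, so remove one copy
    count -= sum(substrings_within_runs(s, {c}) for c in letters)
    return count


print(count_substrings_missing_a_char("aacb"))
```

Adding the current run length at every position gives the same total as adding |A|*(|A|+1)/2 once per run, and each pass is linear in the length of S.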
To get something better than O(n^2) we may need additional assumptions (maybe asking only for the longest substrings with this property).
Consider a string of the form aaaaaaaaaabbbbbbbbbb of length n. It has on the order of n^2 qualifying substrings, so if we want to list them all we need at least O(n^2) time.
I came up with a linear solution for the longest substrings.
Take a set S of all substrings separated by a, all substrings separated by b and finally all substrings separated by c. Each of those steps can be done in O(n), so we have O(3n), thus O(n).
Example:
Take aaabcaaccbaa.
In this case set S contains:
substrings separated by a: bc, ccb
substrings separated by b: aaa, caacc, aa
substrings separated by c: aaab, aa, baa.
By the set I mean a data structure with adding and finding element with a given key in O(1).
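A small sketch of this splitting step, assuming Python's built-in set as the O(1) structure and str.split for the scans (the function name is invented here):

```python
def maximal_pieces(s, alphabet="abc"):
    """Collect the non-empty pieces obtained by splitting s on each letter in turn."""
    pieces = set()
    for c in alphabet:
        # every piece between occurrences of c is free of c by construction
        pieces.update(piece for piece in s.split(c) if piece)
    return pieces


print(sorted(maximal_pieces("aaabcaaccbaa")))
```

Each split is one linear scan, so the three passes together stay linear in the length of the string.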

Resources