Building a suffix array in O(n log n) - data-structures

I am reading suffix array construction tutorials from CodeChef and Stack Overflow. One point I could understand is that they say:
It works by first sorting the 2-grams, then the 4-grams, then the 8-grams, and so forth, of the original string S, so in the i-th iteration we sort the 2^i-grams. Each iteration i has two steps:
Sorting by 2^i-grams, using the lexicographic names from the previous iteration to enable comparisons in 2 steps (i.e. O(1) time) each
Creating new lexicographic names
MY DOUBT IS:
How can I use the indexes computed for the 2-grams when sorting the 4-grams?
Suppose two of my substrings are 'ab' and 'ac'; how can you compare them in O(1) time and give them indexes?
I really tried but got stuck there. Please provide an example that helps. Thanks in advance.

Let's assume that all substrings with length 2^k are sorted and now we want to sort all substrings with length 2^(k + 1). The key observation here is that any substring with length 2^(k + 1) is a concatenation of two substrings with length 2^k.
For example, in the string abacaba, the substring caba is a concatenation of ca and ba.
But all substrings with length 2^k are sorted, so we may assume that each of them is assigned an integer from the range [0 ... n - 1] (I will call it a class) based on its position in the sorted array of all substrings of this length (equal substrings are assigned equal numbers, and this array is not maintained explicitly, of course). In this case, each substring with length 2^(k + 1) can be represented as a pair of two numbers (p1, p2) - the classes of its first and second half, respectively. So all we need to do is sort an array of pairs of integers from the range [0 ... n - 1]. One can use radix sort to do this in linear time. After sorting these pairs, we can find the classes for all substrings with length 2^(k + 1) using a single pass over the sorted array.
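Here is a compact Python sketch of that doubling step (the identifiers are my own). For clarity it sorts the (p1, p2) pairs with the built-in comparison sort, which gives O(n log^2 n) overall; replacing that sort with a radix sort of the pairs, as described above, yields the O(n log n) bound.

def build_suffix_array(s):
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i])       # suffix starts ordered by their first character
    classes = [0] * n
    for i in range(1, n):                           # lexicographic names ("classes") for the 1-grams
        classes[sa[i]] = classes[sa[i - 1]] + (s[sa[i]] != s[sa[i - 1]])
    length = 1
    while length < n:
        # a substring of length 2*length starting at i is the pair
        # (class of its first half, class of its second half); a missing
        # second half (i + length >= n) sorts before everything, encoded as -1
        key = lambda i: (classes[i], classes[i + length] if i + length < n else -1)
        sa.sort(key=key)                            # sort the pairs of classes
        new_classes = [0] * n
        for i in range(1, n):                       # single pass to assign classes for length 2*length
            new_classes[sa[i]] = new_classes[sa[i - 1]] + (key(sa[i]) != key(sa[i - 1]))
        classes = new_classes
        length *= 2
    return sa

print(build_suffix_array("abacaba"))                # [6, 4, 0, 2, 5, 1, 3]

The classes array plays the role of the lexicographic names from the quoted tutorial: two substrings of the current length compare equal exactly when their classes are equal, which is what makes each pair comparison O(1).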

Related

Sorting array of strings containing n characters in O(n) time

Question:
We have an array of m strings composed of only lower case
characters such that the total number of characters in all the strings
combined is n.
Show how to sort the strings (in lexicographic order) in O(n) time using only character comparisons. Justify your answer.
What I have:
This really seems like it should be radix sort. Radix sort has a time complexity of O(k*(m+d)), where k is the maximum number of letters in a string contained in the array, and d is the number of "buckets" (assuming you are using radix sort with bucket/counting sort). In this case we know we will have 26 "buckets" (one for each letter of the alphabet), so we can simplify the time complexity to O(k*m).
Assuming I am correct and the best way of doing this is radix sort, what I am struggling to prove is that O(k*m) = O(n).
Am I right that this is radix sort?
How can I prove that O(k*m) = O(n)?
O(k*(m+d)) ~ O(n+kd) in your case.
For example, let's say you have to sort ["ABCD", "ABDC", "AB"]. When you sort on the first and second characters, you go through all 3 elements. But when you sort on the third and fourth characters, you don't have to look at the string "AB", since it doesn't have a third or fourth letter. So the number of times you actually touch a letter is 2*3 + 2*2 = 10, which is the sum of the lengths of all the strings (plus the kd term for storing and retrieving letters).
You'll just have to tweak the radix sort by adding a few validation checks for strings that have already terminated, and it comes to O(n + kd).
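A possible sketch of that tweak in Python, assuming lowercase strings as in the question (the helper names are mine): strings only join the passes for positions they actually have, so the character work sums to n, plus the O(k*d) bucket overhead.

def sort_strings(strings):
    by_length = {}
    for s in strings:                               # group the strings by length
        by_length.setdefault(len(s), []).append(s)
    max_len = max(by_length, default=0)
    pool = []                                       # strings already taking part in the passes
    for p in range(max_len - 1, -1, -1):            # character positions, last to first
        # strings of length exactly p+1 enter now; being shorter than everything already
        # in the pool, they go in front, so ties on position p keep them first
        pool = by_length.get(p + 1, []) + pool
        buckets = [[] for _ in range(26)]           # d = 26 buckets for 'a'..'z'
        for s in pool:                              # stable bucket sort on position p
            buckets[ord(s[p]) - ord('a')].append(s)
        pool = [s for bucket in buckets for s in bucket]
    return by_length.get(0, []) + pool              # empty strings, if any, come first

print(sort_strings(["abcd", "abdc", "ab"]))         # ['ab', 'abcd', 'abdc']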

Finding the lengths of the longest palindromic substrings for all prefixes of a given string

I've gone through the problem of finding the longest palindromic substring, but this is different.
Given a String like "ababa", the lengths of the longest palindromic substrings for all the prefixes will be as per below -
"a" : "a" (Length 1)
"ab" : "a" or "b" (Length 1)
"aba" : "aba" (Length 3)
"abab" : "aba" or "bab" (Length 3)
"ababa" : "ababa" (Length 5)
Here's the sample input / output ->
Sample Input: "ababa"
Sample Output: 1 1 3 3 5
I thought about a couple of solutions -
Find out all the palindromic substrings of the given string (O(N^2) using expanding around centre approach) and then for each prefix, find out if it contains the substrings (sorted in desc order of lengths). This seems to be worse than O(N^3).
For each prefix, find the longest palindromic substring using Manacher's algorithm (O(N)). There will be N prefixes, so this is O(N^2).
We need only the lengths and not the actual palindromes. Is there any easier / better (in terms of runtime complexity) way that I'm missing out on?
Update : We need the lengths (of longest palindromic substrings) for all the prefixes of the string (like in the example above).
Go with the 2nd option. Think about how Manacher's algorithm works: each time it moves the right pointer, it basically considers a new prefix. Keep track of the maximum value in the currently calculated part of Manacher's table; in each iteration that maximum is the length of the longest palindromic substring for the current prefix (ending at the right pointer). That takes O(n) time.
I would suggest this solution: https://www.geeksforgeeks.org/longest-palindromic-substring-set-2/?ref=rp. Its time complexity is O(N^2) and its space complexity is O(1).
But in your case you would need an array (maxArr) to hold the length of the maximum-length palindromic substring for each prefix.
The idea remains the same: you choose a center and find the maximum-length palindromic substring with that center. Wherever that substring ends, update the maximum length in the array at that position.
At the end some positions in maxArr may still be empty; those take the same value as the position to their left.
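A minimal Python sketch of that expand-around-center idea with the maxArr bookkeeping (the names are mine); the final sweep makes every prefix inherit the best length seen so far, covering the "empty positions" case mentioned above.

def longest_palindrome_per_prefix(s):
    n = len(s)
    max_arr = [1] * n                              # every single character is a palindrome
    for center in range(n):
        # odd-length palindromes centered at `center`
        l, r = center, center
        while l >= 0 and r < n and s[l] == s[r]:
            max_arr[r] = max(max_arr[r], r - l + 1)
            l, r = l - 1, r + 1
        # even-length palindromes centered between `center` and `center + 1`
        l, r = center, center + 1
        while l >= 0 and r < n and s[l] == s[r]:
            max_arr[r] = max(max_arr[r], r - l + 1)
            l, r = l - 1, r + 1
    for i in range(1, n):                          # each prefix inherits the best value to its left
        max_arr[i] = max(max_arr[i], max_arr[i - 1])
    return max_arr

print(longest_palindrome_per_prefix("ababa"))      # [1, 1, 3, 3, 5]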

Preprocess-Query to find number of pairs containing a number X

Formally, we are given N pairs of rational numbers. We want to somehow preprocess this data so as to answer queries like "Find the number of pairs which contain a given rational number X".
By 'a pair contains X' I mean, for example, that [2,5] contains 3, and so on.
At worst, the expected time for each query should be O(log N) or O(sqrt(N)) (or anything similar that is better than O(N)), and preprocessing should be at worst O(N^2).
My approach:
I tried sorting the pairs, first by the first number and breaking ties by the second number [the first number in a pair < the second number in the pair]. Then applying a lower_bound form of binary search reduces the search space, but now I can't apply another binary search within this search space, since the pairs are sorted only by their first numbers. So after reducing the search space I have to check linearly, which again gives worst-case O(N) per query.
First you should try to make the ranges disjoint. For example, the ranges [1 5], [2 6], [3 7] will result in the disjoint ranges [1 2], [2 3], [3 5], [5 6], [6 7], and for each disjoint range you should save how many of the original ranges it was present in. Like this:
1-------5 // original ranges
2------6
3------7
1-2, 2-3, 3-5, 5-6, 6-7 // disjoint ranges
1 2 3 2 1 // number of presence of each range in original ranges
You can do this with a sweep line algorithm in O(N log N). After that you can use the method you described: sort the disjoint ranges by their start, and then for each query find the lower_bound of X and print the presence count of that range. For example, if the query is 4 you can find the range 3-5 by binary search, and the result is 3 because the presence count of range 3-5 is equal to 3.
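A small Python sketch of that preprocessing and query (the function names and the prefix-sum formulation are mine). Queries that fall exactly on an endpoint need extra care about open/closed boundaries, which the answer glosses over; this sketch counts the original ranges covering the disjoint piece the query lands in.

from bisect import bisect_right

def preprocess(ranges):
    points = sorted({p for a, b in ranges for p in (a, b)})   # endpoints of the disjoint pieces
    index = {p: i for i, p in enumerate(points)}
    delta = [0] * len(points)
    for a, b in ranges:                                       # +1 where a range starts, -1 where it ends
        delta[index[a]] += 1
        delta[index[b]] -= 1
    coverage = []                                             # coverage[i] = presence count of piece (points[i], points[i+1])
    running = 0
    for i in range(len(points) - 1):
        running += delta[i]
        coverage.append(running)
    return points, coverage

def query(points, coverage, x):
    i = bisect_right(points, x) - 1                           # disjoint piece that x falls into
    return coverage[i] if 0 <= i < len(coverage) else 0

points, coverage = preprocess([(1, 5), (2, 6), (3, 7)])
print(query(points, coverage, 4))                             # 3, as in the example above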

Finding the kth smallest element in a sequence where duplicates are compressed?

I've been asked to write a program to find the kth order statistic of a data set consisting of characters and their occurrences. For example, I have a data set consisting of
B,A,C,A,B,C,A,D
Here I have A with 3 occurrences, B with 2 occurrences, C with 2 occurrences and D with one occurrence. They can be grouped into pairs (character, number of occurrences), so, for example, we could represent the above sequence as
(A,3), (B,2), (C,2) and (D,1).
Assuming that k is the number of these pairs, I am asked to find the kth order statistic of the data set in O(n), where n is the number of pairs.
I thought I could sort the elements based on their number of occurrences and find the kth smallest element, but that won't work within the time bounds. Can I please have some help with an algorithm for this problem?
Assuming that you have access to a linear-time selection algorithm, here's a simple divide-and-conquer algorithm for solving the problem. I'm going to let k denote the total number of pairs and m be the index you're looking for.
If there's just one pair, return the key in that pair.
Otherwise:
Using a linear-time selection algorithm, find the median element. Let medFreq be its frequency.
Sum up the frequencies of the elements less than the median. Call this less. Note that the number of elements less than or equal to the median is less + medFreq.
If less < m ≤ less + medFreq, return the key in the median element.
Otherwise, if m ≤ less, recursively search for the mth element in the first half of the array.
Otherwise (m > less + medFreq), recursively search for the (m - less - medFreq)th element in the second half of the array.
The key insight here is that each iteration of this algorithm tosses out half of the pairs, so each recursive call is on an array half as large as the original array. This gives us the following recurrence relation:
T(k) = T(k / 2) + O(k)
Using the Master Theorem, this solves to O(k).
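A Python sketch of this recursion (the names are mine). For brevity it finds the median by sorting, which costs O(k log k) per level; swapping in a true linear-time selection algorithm, as the answer assumes, gives the O(k) bound.

def kth_with_counts(pairs, m):
    # pairs: list of (key, frequency); m is the 1-based rank in the expanded sequence
    if len(pairs) == 1:
        return pairs[0][0]
    by_key = sorted(pairs, key=lambda p: p[0])      # stand-in for a linear-time median selection
    mid = len(by_key) // 2
    med_key, med_freq = by_key[mid]
    left, right = by_key[:mid], by_key[mid + 1:]    # pairs with keys below / above the median
    less = sum(f for _, f in left)                  # elements strictly below the median
    if less < m <= less + med_freq:
        return med_key
    if m <= less:
        return kth_with_counts(left, m)
    return kth_with_counts(right, m - less - med_freq)

# Example from the question: (A,3), (B,2), (C,2), (D,1); the 4th smallest element is B.
print(kth_with_counts([('A', 3), ('B', 2), ('C', 2), ('D', 1)], 4))   # B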

Check if array B is a permutation of A

I tried to find a solution to this but couldn't get much out of my head.
We are given two unsorted integer arrays A and B. We have to check whether array B is a permutation of A. How can this be done? Even XORing the numbers won't work, as there are counterexamples that have the same XOR value but are not permutations of each other.
The solution needs to be O(n) time and O(1) space.
Any help is welcome!!
Thanks.
The question is theoretical, but you can do it in O(n) time and O(1) space. Allocate an array of 2^32 counters and set them all to zero. This is an O(1) step because the array has constant size. Then iterate through the two arrays. For array A, increment the counters corresponding to the integers read. For array B, decrement them. If you run into a negative counter value during the iteration of array B, stop --- the arrays are not permutations of each other. Otherwise at the end (assuming A and B have the same size, a prerequisite) the counter array is all zero and the two arrays are permutations of each other.
This is an O(1)-space, O(n)-time solution. It is not practical, but it would easily pass as a solution to the interview question. At least it should.
More obscure solutions
Using a nondeterministic model of computation, checking that the two arrays are not permutations of each other can be done in O(1) space and O(n) time by guessing an element that has a differing count in the two arrays, and then counting the instances of that element in both arrays.
In a randomized model of computation, construct a random commutative hash function and calculate the hash values for the two arrays. If the hash values differ, the arrays are not permutations of each other. Otherwise they might be. Repeat many times to bring the probability of error below the desired threshold. This is also an O(1)-space, O(n)-time approach, but randomized.
In a parallel computation model, let 'n' be the size of the input array. Allocate 'n' threads. Every thread i = 1 .. n reads the ith number from the first array; call it x. Then the same thread counts the number of occurrences of x in the first array and checks for the same count in the second array. Every single thread uses O(1) space and O(n) time.
Interpret an integer array [a1, ..., an] as the polynomial x^a1 + x^a2 + ... + x^an, where x is a free variable, and then check numerically for the equivalence of the two polynomials obtained. Use floating-point arithmetic for an O(1)-space, O(n)-time operation. This is not an exact method, because of rounding errors and because the numerical check for equivalence is probabilistic. Alternatively, interpret the polynomial over the integers modulo a prime number and perform the same probabilistic check.
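A rough Python sketch of the modular variant of that last idea (the prime, the number of trials, and the assumption that all array values are non-negative are my choices): evaluate the polynomial at random points modulo a prime and compare the fingerprints.

import random

P = (1 << 61) - 1                                   # a large Mersenne prime

def poly_fingerprint(arr, x):
    # value of x^a1 + ... + x^an modulo P (assumes every a is a non-negative integer)
    return sum(pow(x, a, P) for a in arr) % P

def probably_permutations(a, b, trials=5):
    if len(a) != len(b):
        return False
    return all(poly_fingerprint(a, x) == poly_fingerprint(b, x)
               for x in (random.randrange(2, P) for _ in range(trials)))

print(probably_permutations([1, 3, 2, 2], [2, 1, 2, 3]))   # True (equal multisets always agree)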
If we are allowed to freely access a large list of primes, you can solve this problem by leveraging properties of prime factorization.
For both arrays, calculate the product of Prime[i] over each integer i in the array, where Prime[i] is the ith prime number. The products of the two arrays are equal iff the arrays are permutations of one another.
Prime factorization helps here for two reasons.
Multiplication is commutative and associative, so the ordering of the operands when calculating the product is irrelevant. (Some alluded to the fact that if the arrays were sorted, this problem would be trivial. By multiplying, we are implicitly sorting.)
Prime numbers multiply losslessly. If we are given a number and told it is the product of only prime numbers, we can calculate exactly which prime numbers were fed into it and exactly how many.
Example:
a = 1,1,3,4
b = 4,1,3,1
Product of ith primes in a = 2 * 2 * 5 * 7 = 140
Product of ith primes in b = 7 * 2 * 5 * 2 = 140
That said, we probably aren't allowed access to a list of primes, but this seems a good solution otherwise, so I thought I'd post it.
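A quick Python sketch of the check on the example above (the hard-coded prime list is mine and only covers small positive values; a real solution needs primes up to the largest value that can appear).

from math import prod

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]       # PRIMES[i - 1] is the ith prime

def prime_product(arr):
    return prod(PRIMES[x - 1] for x in arr)

a = [1, 1, 3, 4]
b = [4, 1, 3, 1]
print(prime_product(a), prime_product(b))           # 140 140 -> permutations of each other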
I apologize for posting this as an answer as it should really be a comment on antti.huima's answer, but I don't have the reputation yet to comment.
The size of each counter seems to be O(log(n)) bits, as it depends on the number of instances of a given value in the input array.
For example, let the input array A be all 1's with a length of (2^32) + 1. This will require a counter of size 33 bits to encode (which, in practice, would double the size of the array, but let's stay with theory). Double the size of A (still all 1 values) and you need 65 bits for each counter, and so on.
This is a very nit-picky argument, but these interview questions tend to be very nit-picky.
If we need not sort this in-place, then the following approach might work:
Create a HashMap with the array element as the key and the number of occurrences as the value (to handle multiple occurrences of the same number).
Traverse array A.
Insert the array elements in the HashMap.
Next, traverse array B.
Search every element of B in the HashMap. If the corresponding value is 1, delete the entry. Else, decrement the value by 1.
If we are able to process the entire array B and the HashMap is empty at that point, success; else failure.
HashMap will use constant space and you will traverse each array only once.
Not sure if this is what you are looking for. Let me know if I have missed any constraint about space/time.
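A minimal Python sketch of those steps, with a plain dict playing the role of the HashMap (the function name is mine).

def is_permutation(a, b):
    if len(a) != len(b):
        return False
    counts = {}
    for x in a:                              # count occurrences while traversing A
        counts[x] = counts.get(x, 0) + 1
    for x in b:                              # consume them while traversing B
        if x not in counts:
            return False
        if counts[x] == 1:
            del counts[x]                    # value was 1: delete the entry
        else:
            counts[x] -= 1                   # else decrement
    return not counts                        # empty map means B is a permutation of A

print(is_permutation([3, 1, 2, 1], [1, 2, 1, 3]))   # True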
You're given two constraints: computation in O(n), where n means the total length of both A and B, and memory in O(1).
If two series A and B are permutations of each other, then there's also a series C that results from permuting either A or B. So the problem is to permute both A and B into series C_A and C_B and compare them.
One such permutation is sorting. There are several sorting algorithms that work in place, so you can sort A and B in place. In the best case smoothsort sorts with O(n) computational and O(1) memory complexity; in the worst case with O(n log n) / O(1).
The per-element comparison then happens in O(n), and since in O notation O(2*n) = O(n), using smoothsort plus a comparison gives you an O(n) / O(1) check of whether two series are permutations of each other. However, in the worst case it will be O(n log n) / O(1).
The solution needs to be O(n) time and O(1) space.
This rules out sorting, and the O(1) space requirement is a hint that you probably should compute a hash of the strings and compare them.
If you have access to a prime number list, use cheeken's solution.
Note: if the interviewer says you don't have access to a prime number list, then generate the prime numbers and store them. This is O(1) because the alphabet length is a constant.
Otherwise, here's my alternative idea. I will define the alphabet as {a,b,c,d,e} for simplicity.
The values for the letters are defined as:
a, b, c, d, e
1, 2, 4, 8, 16
Note: if the interviewer says this is not allowed, then make a lookup table for the alphabet; this takes O(1) space because the size of the alphabet is a constant.
Define a function which can find the distinct letters in a string.
// set bit value of char c in variable i and return result
distinct(char c, int i) : int
E.g. distinct('a', 0) returns 1
E.g. distinct('a', 1) returns 1
E.g. distinct('b', 1) returns 3
Thus if you iterate the string "aab" the distinct function should give 3 as the result
Define a function which can calculate the sum of the letters in a string.
// return sum of c and i
sum(char c, int i) : int
E.g. sum('a', 0) returns 1
E.g. sum('a', 1) returns 2
E.g. sum('b', 2) returns 4
Thus if you iterate the string "aab" the sum function should give 4 as the result
Define a function which can calculate the length of a string.
// return length of string s
length(string s) : int
E.g. length("aab") returns 3
Running the methods on two strings and comparing the results takes O(n) running time. Storing the hash values takes O(1) in space.
e.g.
distinct of "aab" => 3
distinct of "aba" => 3
sum of "aab => 4
sum of "aba => 4
length of "aab => 3
length of "aba => 3
Since all the values are equal for both strings, they must be a permutation of each other.
EDIT: This solution is not correct with the given alphabet values, as pointed out in the comments.
You can convert one of the two arrays into an in-place hashtable. This will not be exactly O(N), but it will come close, in non-pathological cases.
Just use [number % N] as its desired index, or put it in the chain that starts there. If any element has to be replaced, it can be placed at the index where the offending element started. Rinse, wash, repeat.
UPDATE:
This is a similar (N=M) hash table. It did use chaining, but it could be downgraded to open addressing.
I'd use a randomized algorithm that has a low chance of error.
The key is to use a universal hash function.
def hash(array, hash_fn):
    cur = 0
    for item in array:
        cur ^= hash_fn(item)      # combine per-item hashes; XOR makes the result order-independent
    return cur

def are_perm(a1, a2):
    hash_fn = pick_random_universal_hash_func()
    return hash(a1, hash_fn) == hash(a2, hash_fn)
If the arrays are permutations, it will always be right. If they are different, the algorithm might incorrectly say that they are the same, but it will do so with very low probability. Further, you can get an exponential decrease in the chance of error with a linear amount of work by asking many are_perm() questions on the same input; if it ever says no, then the arrays are definitely not permutations of each other.
I just found a counterexample, so the assumption below is incorrect.
I cannot prove it, but I think this may possibly be true.
Since all elements of the arrays are integers, suppose each array has 2 elements,
and we have
a1 + a2 = s
a1 * a2 = m
b1 + b2 = s
b1 * b2 = m
then {a1, a2} == {b1, b2}
If this is true, it's true for arrays with n elements.
So we compare the sum and product of each array; if they are equal, one is a permutation of the other.

Resources