Rank the suffixes of a list - algorithm

Ranking an element x in an array/list just means finding out how many elements in the array/list are strictly smaller than x.
So ranking a list just means getting the ranks of all elements in the list.
For example, rank [51, 38, 29, 51, 63, 38] = [3, 1, 0, 3, 5, 1], i.e., there are 3 elements smaller than 51, etc.
Ranking a list can be done in O(N log N). Basically, we can sort the list while remembering the original index of each element, and then see, for each element, how many elements come before it.
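A minimal Python sketch of ranking a list in O(N log N). It is a slight variation on the idea above: instead of carrying the original indices through the sort, it sorts a copy and uses binary search to count the strictly smaller elements. The name `rank_list` is made up for illustration.

```python
from bisect import bisect_left

def rank_list(values):
    """For each element, count how many elements are strictly smaller."""
    sorted_values = sorted(values)  # O(N log N)
    # bisect_left returns the number of entries strictly smaller than v
    return [bisect_left(sorted_values, v) for v in values]

print(rank_list([51, 38, 29, 51, 63, 38]))  # [3, 1, 0, 3, 5, 1]
```

Equal elements get equal ranks because `bisect_left` always lands on the first occurrence in the sorted copy.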
The question here is: how do we rank the suffixes of a list in O(N log N)?
Ranking the suffixes of a list means:
for list [3; 1; 2], rank [[3;1;2]; [1;2]; [2]]
note that elements may not be distinct.
edit
We don't need to print out all elements of all suffixes. You can imagine that we just need to print out a list/array where each element is the rank of a suffix.
For example, rank suffix_of_[3;1;2] = rank [[3;1;2]; [1;2]; [2]] = [2;0;1] and you just print out [2;0;1].
edit 2
Let me explain more clearly what all suffixes are and what sorting/ranking all suffixes means.
Suppose we have an array/list [e1;e2;e3;e4;e5].
Then all suffixes of [e1;e2;e3;e4;e5] are:
[e1;e2;e3;e4;e5]
[e2;e3;e4;e5]
[e3;e4;e5]
[e4;e5]
[e5]
for example, all suffixes of [4;2;3;1;0] are
[4;2;3;1;0]
[2;3;1;0]
[3;1;0]
[1;0]
[0]
Sorting the above 5 suffixes means sorting them lexicographically. Sorting all of the suffixes above, you get
[0]
[1;0]
[2;3;1;0]
[3;1;0]
[4;2;3;1;0]
By the way, if you can't imagine how 5 lists/arrays can be sorted among themselves, just think of sorting strings in lexicographic order.
"0" < "10" < "2310" < "310" < "42310"
It seems that sorting all suffixes is actually just sorting all elements of the original array.
However, please be careful: the elements may not be distinct. For example,
for [4;2;2;1;0], all suffixes are:
[4;2;2;1;0]
[2;2;1;0]
[2;1;0]
[1;0]
[0]
then the order is
[0]
[1;0]
[2;1;0]
[2;2;1;0]
[4;2;2;1;0]

As MBo noted correctly, your problem is that of constructing the suffix array of your input list. The fast and complicated algorithms to do this are actually linear time, but since you only aim for O(n log n), I will try to propose a simpler version that is much easier to implement.
Basic idea and an initial O(n log² n) implementation
Let's take the sequence [4, 2, 2, 1] as an example. Its suffixes are
0: 4 2 2 1
1: 2 2 1
2: 2 1
3: 1
I numbered the suffixes with their starting index in the original sequence. Ultimately we want to sort this set of suffixes lexicographically, and fast. We know we can represent each suffix using its starting index in constant space and we can sort in O(n log n) comparisons using merge sort, heap sort or a similar algorithm. So the question remains, how can we compare two suffixes fast?
Let's say we want to compare the suffixes [2, 2, 1] and [2, 1]. We can pad those with negative infinity values without changing the result of the comparison: [2, 2, 1, -∞] and [2, 1, -∞, -∞].
Now the key idea here is the following divide-and-conquer observation: Instead of comparing the sequences character by character until we find a position where the two differ, we can instead split both lists in half and compare the halves lexicographically:
[a, b, c, d] < [e, f, g, h]
<=> ([a, b], [c, d]) < ([e, f], [g, h])
<=> [a, b] < [e, f] or ([a, b] = [e, f] and [c, d] < [g, h])
Essentially we have decomposed the problem of comparing the sequences into two problems of comparing smaller sequences. This leads to the following algorithm:
Step 1: Sort the substrings (contiguous subsequences) of length 1. In our example, the substrings of length 1 are [4], [2], [2], [1]. Every substring can be represented by its starting position in the original list. We sort them by a simple comparison sort and get [1], [2], [2], [4]. We store the result by assigning to every position its rank in the sorted list of lists:
position  substring  rank
0         [4]        2
1         [2]        1
2         [2]        1
3         [1]        0
It is important that we assign the same rank to equal substrings!
Step 2: Now we want to sort the substrings of length 2. There are only really 3 such substrings, but we assign one to every position by padding with negative infinity if necessary. The trick here is that we can use our divide-and-conquer idea from above and the ranks assigned in step 1 to do a fast comparison (this isn't really necessary yet but will become important later).
position  substring  halves        ranks from step 1  final rank
0         [4, 2]     ([4], [2])    (2, 1)             3
1         [2, 2]     ([2], [2])    (1, 1)             2
2         [2, 1]     ([2], [1])    (1, 0)             1
3         [1, -∞]    ([1], [-∞])   (0, -∞)            0
Step 3: You guessed it, now we sort substrings of length 4 (!). These are exactly the suffixes of the list! We can use the divide-and-conquer trick and the results from step 2 this time:
position  substring         halves                ranks from step 2  final rank
0         [4, 2, 2, 1]      ([4, 2], [2, 1])      (3, 1)             3
1         [2, 2, 1, -∞]     ([2, 2], [1, -∞])     (2, 0)             2
2         [2, 1, -∞, -∞]    ([2, 1], [-∞, -∞])    (1, -∞)            1
3         [1, -∞, -∞, -∞]   ([1, -∞], [-∞, -∞])   (0, -∞)            0
We're done! If our initial sequence had size 2^k, we would have needed k + 1 steps (one for each substring length 1, 2, 4, ..., 2^k). Or, put the other way round, we need about log_2 n steps to process a sequence of size n. If its length is not a power of two, we just pad with negative infinity.
For an actual implementation we just need to remember the sequence "final rank" for every step of the algorithm.
An implementation in C++ could look like this (compile with -std=c++11):
#include <algorithm>
#include <iostream>
#include <tuple>
using namespace std;

int seq[] = {8, 3, 2, 4, 2, 2, 1};
const int n = 7;
const int log2n = 3;       // log2n = ceil(log_2(n))
int Rank[log2n + 2][n];    // Rank[i] will save the final ranks of step i
tuple<int, int, int> L[n]; // L is a list of tuples. In step i,
                           // this will hold pairs of ranks from step i - 1
                           // along with the substring index
const int neginf = -1;     // should be smaller than all the numbers in seq

int main() {
    for (int i = 0; i < n; ++i)
        Rank[1][i] = seq[i]; // step 1 is actually simple if you think about it
    // step s sorts substrings of length 2^(s - 1), so we run up to step
    // log2n + 1 to cover substrings of length >= n
    for (int step = 2; step <= log2n + 1; ++step) {
        int length = 1 << (step - 1); // length is 2^(step - 1)
        for (int i = 0; i < n; ++i)
            L[i] = make_tuple(
                Rank[step - 1][i],
                (i + length / 2 < n) ? Rank[step - 1][i + length / 2] : neginf,
                i); // we need to know where the tuple came from later
        sort(L, L + n); // lexicographical sort
        for (int i = 0; i < n; ++i) {
            // we save the rank of the index, but we need to be careful to
            // assign equal ranks to equal pairs
            Rank[step][get<2>(L[i])] = (i > 0 && get<0>(L[i]) == get<0>(L[i - 1])
                                              && get<1>(L[i]) == get<1>(L[i - 1]))
                                           ? Rank[step][get<2>(L[i - 1])]
                                           : i;
        }
    }
    // the suffix array is in L after the last step
    for (int i = 0; i < n; ++i) {
        int start = get<2>(L[i]);
        cout << start << ":";
        for (int j = start; j < n; ++j)
            cout << " " << seq[j];
        cout << endl;
    }
}
Output:
6: 1
5: 2 1
4: 2 2 1
2: 2 4 2 2 1
1: 3 2 4 2 2 1
3: 4 2 2 1
0: 8 3 2 4 2 2 1
The complexity is O(log n * (n + sort)), which is O(n log² n) in this implementation because we use a comparison sort of complexity O(n log n)
A simple O(n log n) algorithm
If we manage to do the sorting parts in O(n) per step, we get an O(n log n) bound. So basically we have to sort a sequence of pairs (x, y), where 0 <= x, y < n (the -∞ padding can be given its own smallest key, for example by shifting all ranks up by one). We know that we can sort a sequence of integers in the given range in O(n) time using counting sort. We can interpret our pairs (x, y) as numbers z = n * x + y in base n. We can now see how to use LSD radix sort to sort the pairs.
In practice, this means we sort the pairs by increasing y using counting sort, and then use counting sort again to sort by increasing x. Since counting sort is stable, this gives us the lexicographical order of our pairs in 2 * O(n) = O(n). The final complexity is thus O(n log n).
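The two counting-sort passes described above can be sketched in Python as follows. This is an illustrative sketch, not the implementation from the repo mentioned below. The names `counting_sort` and `rank_suffixes` are made up, the initial values are compressed to ranks 1..n with one ordinary comparison sort (which does not change the O(n log n) total), and rank 0 is reserved for the negative-infinity padding.

```python
def counting_sort(indices, keys, key_range):
    """Stable sort of `indices` by keys[i], where 0 <= keys[i] < key_range."""
    count = [0] * (key_range + 1)
    for i in indices:
        count[keys[i] + 1] += 1
    for k in range(key_range):
        count[k + 1] += count[k]      # count[k] = number of keys smaller than k
    out = [0] * len(indices)
    for i in indices:
        out[count[keys[i]]] = i
        count[keys[i]] += 1
    return out

def rank_suffixes(seq):
    """rank[i] = number of suffixes lexicographically smaller than seq[i:]."""
    n = len(seq)
    if n == 0:
        return []
    # Step 1: compress values to ranks 1..n; 0 plays the role of -infinity.
    order = {v: r + 1 for r, v in enumerate(sorted(set(seq)))}
    rank = [order[v] for v in seq]
    length = 1                        # current ranks describe substrings of this length
    while length < n:
        # Position i gets the pair (rank[i], rank[i + length]), padded with 0.
        second = [rank[i + length] if i + length < n else 0 for i in range(n)]
        idx = counting_sort(range(n), second, n + 1)  # LSD pass: second component
        idx = counting_sort(idx, rank, n + 1)         # stable pass: first component
        new_rank = [0] * n
        r = 1
        for pos, i in enumerate(idx):
            if pos > 0:
                j = idx[pos - 1]
                if (rank[i], second[i]) != (rank[j], second[j]):
                    r += 1            # equal pairs keep the same rank
            new_rank[i] = r
        rank = new_rank
        length *= 2
    return [r - 1 for r in rank]

print(rank_suffixes([3, 1, 2]))  # [2, 0, 1]
```

Each doubling step costs O(n) and there are O(log n) steps. After the last step all suffixes are distinct, so the ranks are exactly 0..n-1.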
In case you are interested, you can find an O(n log² n) implementation of the approach at my Github repo. The implementation has 27 lines of code. Neat, ain't it?

This is exactly the suffix array construction problem, and the wiki page contains links to linear-time algorithms (probably depending on the alphabet).

Related

Arrange n items in k nonempty groups such that the difference between the minimum element and the maximum element of each group is minimized

Given N items with values x[1], ..., x[N] and an integer K, find a linear-time algorithm to arrange these N items in K non-empty groups such that the range of each group (the difference between its minimum and maximum element values/keys) is minimized, and therefore the sum of the ranges is minimum.
For example, given N=4, K=2 and the elements 1 1 4 3, the minimum range is 1, for the groups (1,1) and (4,3).
You can binary search the answer.
Assume a candidate answer x. Now verify whether we can split the items into at most k groups where the maximum difference between the items of a group is at most x. This can be done in O(n) after sorting the array: traverse the sorted array and keep picking consecutive items as long as the difference between the minimum and the maximum number picked for the current group does not exceed x. Then start a new group and repeat. At the end, count how many groups you have made. If the number of groups is more than k, we cannot group the items into k groups with x as the answer, so we should increase x; otherwise x is feasible and we can try a smaller value. By binary searching on x we find the minimum x.
The overall complexity is O(N log N) for the sort plus O(N log V) for the binary search, where V is the largest value.
Here is a sample implementation in C++
#include <algorithm>
#include <iostream>
#include <vector>
using namespace std;

int main()
{
    int n = 4, k = 2;
    vector<int> v = {1, 1, 4, 3};
    sort(v.begin(), v.end());
    // binary search for the smallest x that needs at most k groups
    int low = 0, high = *max_element(v.begin(), v.end());
    while (low < high) {
        int x = (low + high) / 2;
        int groups = 0;
        int left = 0;
        while (left < n) {
            // greedily extend the group while its range stays within x
            int right = left;
            while (right < n && v[right] - v[left] <= x) {
                ++right;
            }
            ++groups;
            left = right;
        }
        // printf("x:%d groups:%d\n", x, groups );
        if (groups > k) {
            low = x + 1;
        } else {
            high = x;
        }
    }
    cout << "result is " << low << endl;
}
Alright, I'll assume that we want to minimize the sum of differences over all groups.
Let's sort the numbers. There's an optimal answer where each group is a consecutive segment in the sorted array (proof: let A1 < B1 < A2 < B2. We can exchange A2 and B1. The answer will not increase).
Let a[l], a[l + 1], ..., a[r] be a group. Its cost is a[r] - a[l] = (a[r] - a[r - 1]) + (a[r - 1] - a[r - 2]) + ... + (a[l + 1] - a[l]). This leads us to a key insight: k groups means k - 1 gaps, and the answer is a[n - 1] - a[0] - sum of gaps. Thus, we just need to maximize the gaps.
Here is a final solution:
sort the array
compute differences between adjacent numbers
take k - 1 largest differences. That's exactly where the groups split.
We can find the k-1th largest element in linear time (or if we are fine with O(N log N) time, we can just sort them). That's it.
Here is an example:
x = [1, 1, 4, 3], k = 2
sorted: [1, 1, 3, 4]
differences: [0, 2, 1]
taking k - 1 = 1 largest gaps: it's 2. Thus the groups are [1, 1] and [3, 4].
A slightly more contrived one:
x = [8, 2, 0, 3], k = 3
sorted: [0, 2, 3, 8]
differences: [2, 1, 5]
taking k - 1 = 2 largest gaps: they're 2 and 5. Thus, the groups are [0], [2, 3], [8] with the total cost of 1.
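The gap-based solution above can be sketched in a few lines of Python. The function name `min_total_range` is made up for illustration, and the sketch assumes 1 <= k <= len(values). It sorts the gaps for simplicity, so it runs in O(N log N) rather than using linear-time selection.

```python
def min_total_range(values, k):
    """Minimum sum of (max - min) over k non-empty groups."""
    a = sorted(values)
    # differences between adjacent numbers, largest first
    gaps = sorted((a[i + 1] - a[i] for i in range(len(a) - 1)), reverse=True)
    # cutting at the k - 1 largest gaps removes them from the total span
    return (a[-1] - a[0]) - sum(gaps[:k - 1])

print(min_total_range([1, 1, 4, 3], 2))  # 1
print(min_total_range([8, 2, 0, 3], 3))  # 1
```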

Algorithm to find combination of n numbers with largest sum

The problem is simple.
Suppose I have an array of the following numbers:
4,1,4,5,7,4,3,1,5
I have to find the number of sets of k elements each that can be created from the above numbers and have the largest sum. Two sets are considered different if they have at least one different element.
e.g.
if k = 2, then there can be two sets: {7,5} and {7,5}. Note: 5 appears twice in the above array.
I think I can start with something like:
1. Sort the array
2. Create two arrays: one for the distinct numbers and another, in parallel, for each number's occurrence count.
But I am stuck now. Any suggestions?
The algorithm is as follows:
1) Sort elements in descending order.
2) Look at this array. It may look something like this:
a ... a b ... b c ... c d ...
| <- k -> |
Now obviously all elements a and b will be in the sets with the largest sum. You can't replace any of them with a smaller element, because then the sum wouldn't be the largest possible. So you have no choice here, you have to choose all a and b for any of the sets.
On the other hand only some of the elements c will be in those sets. So the answer is just the number of possibilities, to choose c's to fill the positions left in the sets, after you have taken all larger elements. That is the binomial coefficient:
count of c's choose (k - (count of elements larger than c))
For example for an array (already sorted here)
[9, 8, 7, 7, 5, 5, 5, 5, 4, 4, 2, 2, 1, 1, 1]
and k = 6, you must choose 9, 8 and both 7's for every set with the largest sum (which is 41). And then you can choose any two out of the four 5's. So the result will be 4 choose 2 = 6.
With the same array and k = 4, the result would be x choose 0 = 1 (that unique set is {9, 8, 7, 7}), with k = 7 the result would be 4 choose 3 = 4, and with k = 9: 2 choose 1 = 2 (choosing any 4 for the set with the largest sum).
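A short Python sketch of this counting rule, assuming 1 <= k <= len(values) and Python 3.8+ for `math.comb`. The function name `count_max_sum_sets` is made up for illustration.

```python
from math import comb

def count_max_sum_sets(values, k):
    """Number of ways to pick k elements with the largest possible sum."""
    a = sorted(values, reverse=True)
    boundary = a[k - 1]                          # the value "c" cut by the k-th position
    larger = sum(1 for v in a if v > boundary)   # elements that must be taken
    available = a.count(boundary)                # count of c's
    return comb(available, k - larger)

arr = [9, 8, 7, 7, 5, 5, 5, 5, 4, 4, 2, 2, 1, 1, 1]
print(count_max_sum_sets(arr, 6))  # 6
print(count_max_sum_sets(arr, 9))  # 2
```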
EDIT: I edited the answer, because we figured it out that OP needs to count multisets.
First, find the largest k numbers in the array. This is of course easy: if k is very small, you can do it in O(n * k) by performing k linear scans. If k is not so small, you can use a binary heap (priority queue), which is O(n * log(k)), or just sort the array, which is O(n * log(n)).
Let's assume that you have computed the k largest numbers. Of course, every set of size k with the largest sum has to contain exactly these k largest numbers and no other numbers. On the other hand, any different set doesn't have the largest sum.
Let count[i] be the number of occurrences of number i in the input sequence.
Let occ[i] be the number of occurrences of number i in the largest k numbers.
We can compute both of these tables in several ways, for example using a hash table or, if the input numbers are small, an array indexed by these numbers.
Let B be the array of distinct numbers from the largest k numbers.
Let m be the size of B.
Now let's compute the answer. We will do it in m steps. After the i-th step we will have computed the number of different multisets consisting of the first i numbers from B. At the beginning the result is 1, since there is only one empty multiset. In the i-th step, we multiply the current result by the number of possible choices of occ[B[i]] elements from count[B[i]] elements, which is equal to binomial(count[B[i]], occ[B[i]]).
For example, let's consider your instance with added one more 7 at the end and k set to 3:
k = 3
A = [4, 1, 4, 5, 7, 4, 3, 1, 5, 7]
The largest three numbers in A are 7, 7, 5
At the beginning we have:
count[7] = 2
count[5] = 2
occ[7] = 2
occ[5] = 1
result = 1
B = [7, 5]
We start with the first element in B which is 7. Its count is 2 and its occ is also 2, so we do:
// binomial(2, 2) is 1
result = result * binomial(2, 2)
Next element in B is 5, its count is 2 and its occ is 1, so we do:
// binomial(2, 1) is 2
result = result * binomial(2, 1)
And the final result is 2, since there are two different multisets [7, 7, 5]
I'd create a sorted dictionary of the frequencies of occurrence of the numbers in the input. Then, for k = 2, take the two largest numbers and multiply the number of times they occur.
In C++, it could look something like this:
#include <iostream>
#include <map>
#include <vector>

int main() {
    std::vector<int> inputs { 4, 1, 4, 5, 7, 3, 1, 5 };
    std::map<int, int> counts;
    for (auto i : inputs)
        ++counts[i];
    // std::map is ordered by key, so rbegin() points at the largest number
    auto last = counts.rbegin();
    int largest_count = last->second;
    int second_count = (++last)->second;
    int set_count = largest_count * second_count;
    std::cout << set_count << std::endl;
}
You can do the following:
1) Sort the elements in descending order.
2) Define a variable answer = 1.
3) Start from the beginning of the array and, for each new value you see among the first K elements, count the number of its occurrences (let's call this variable count). Every time, do: answer = answer * count. The pseudo-code should look like this.
find_count(Array A, K)
{
    sort(A, 'descending');
    int answer = 1;
    int i = 0;
    while (i < K && i < A.length)
    {
        // count the run of elements equal to A[i]
        int count = 0;
        for (int j = i; j < A.length && A[j] == A[i]; j++)
            count++;
        answer = answer * count;
        i = i + count; // jump to the next distinct value
    }
    return answer;
}

Sum of continuous sequences

Given an array A with N elements, I want to find the sum of the minimum elements of all possible contiguous subsequences of A. I know that if N is small we can enumerate all possible subsequences, but since N can be up to 10^5, what is the best way to find this sum?
Example: Let N=3 and A = [1,2,3]. Then the answer is 10, as the possible contiguous subsequences are {(1),(2),(3),(1,2),(1,2,3),(2,3)}, so the sum of minimum elements = 1 + 2 + 3 + 1 + 1 + 2 = 10.
Let's fix one element a[i]. We want to know L, the position of the rightmost element smaller than a[i] located to the left of i. We also need to know R, the position of the leftmost element smaller than a[i] located to the right of i (use L = -1 and R = n if no such element exists).
If we know L and R, we should add (i - L) * (R - i) * a[i] to the answer, because a[i] is the minimum of exactly (i - L) * (R - i) contiguous subsequences.
It is possible to precompute L and R for all i in linear time using a stack. Pseudo code:
s = new Stack
L = new int[n]
fill(L, -1)
for i <- 0 ... n - 1:
    while !s.isEmpty() && s.top().first > a[i]:
        s.pop()
    if !s.isEmpty():
        L[i] = s.top().second
    s.push(pair(a[i], i))
We can reverse the array and run the same algorithm to find R.
How to deal with equal elements? Let's assume that a[i] is a pair <a[i], i>. All elements are distinct now.
The time complexity is O(n).
Here is the full pseudo code (I assume that int can hold any integer value here; you should choose a feasible type to avoid an overflow in real code. I also assume that all elements are distinct):
int[] getLeftSmallerElementPositions(int[] a):
    s = new Stack
    L = new int[n]
    fill(L, -1)
    for i <- 0 ... n - 1:
        while !s.isEmpty() && s.top().first > a[i]:
            s.pop()
        if !s.isEmpty():
            L[i] = s.top().second
        s.push(pair(a[i], i))
    return L

int[] getRightSmallerElementPositions(int[] a):
    R = getLeftSmallerElementPositions(reversed(a))
    for i <- 0 ... n - 1:
        R[i] = n - 1 - R[i]
    return reversed(R)

int findSum(int[] a):
    L = getLeftSmallerElementPositions(a)
    R = getRightSmallerElementPositions(a)
    int res = 0
    for i <- 0 ... n - 1:
        res += (i - L[i]) * (R[i] - i) * a[i]
    return res
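For reference, a runnable Python sketch of the same stack idea that also copes with equal elements. It computes both boundaries in a single pass, which is a small variation on the pseudo code above rather than a direct translation: ties are broken by using a strict comparison on the left and a non-strict one on the right, so each subsequence is attributed to exactly one of its minimum elements. The function name `sum_of_subarray_minimums` is made up for illustration.

```python
def sum_of_subarray_minimums(a):
    n = len(a)
    left = [-1] * n    # index of the previous strictly smaller element
    right = [n] * n    # index of the next smaller-or-equal element
    stack = []         # indices whose values are strictly increasing
    for i, x in enumerate(a):
        while stack and a[stack[-1]] >= x:
            right[stack.pop()] = i
        if stack:
            left[i] = stack[-1]
        stack.append(i)
    # a[i] is the minimum of (i - left[i]) * (right[i] - i) subarrays
    return sum((i - left[i]) * (right[i] - i) * a[i] for i in range(n))

print(sum_of_subarray_minimums([1, 2, 3]))  # 10
```

Python integers do not overflow, so no special result type is needed here.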
If the list is sorted, you can consider all contiguous subsequences of size 1, then 2, then 3, up to N. The algorithm is initially somewhat inefficient, but an optimized version is below. Here's some pseudocode.
let A = {1, 2, 3}
let total_sum = 0
for set_size <- 1 to N
    total_sum += sum(A[1:N-(set_size-1)])
First, sets with one element, {{1}, {2}, {3}}: sum each of the elements.
Then, sets of two elements, {{1, 2}, {2, 3}}: sum each element but the last.
Then, sets of three elements, {{1, 2, 3}}: sum each element but the last two.
But this algorithm is inefficient. To optimize to O(n), multiply the ith element by N-i and sum (indexing from zero here). The intuition is that, in a sorted list, the first element is the minimum of N sets, the second element is the minimum of N-1 sets, etc.
I know it's not a Python question, but sometimes code helps:
import numpy as np

A = [1, 2, 3]
# This is [3, 2, 1]
scale = list(range(len(A), 0, -1))
# Take the element-wise product of the vectors, and sum
sum(a*b for (a,b) in zip(A, scale))
# Or just use the dot product
np.dot(A, scale)

Find a single number in a list when other numbers occur more than twice

The problem is extended from Finding a single number in a list
If I extend the problem to this:
What would be the best algorithm for finding a number that occurs only once in a list in which all other numbers occur exactly k times?
Does anyone have a good answer?
For example, A = { 1, 2, 3, 4, 2, 3, 1, 2, 1, 3 }; in this case, k = 3. How can I get the single number "4" in O(n) time with O(1) space complexity?
This works if every element in the array is less than n and greater than 0.
Let the array be a. Traverse the array and, for each a[i], add n to a[a[i] % n].
Now traverse the array again; the position at which a[i] is greater than n and less than 2*n (assuming 1-based indexing) is the answer, because n was added there exactly once.
This method won't work if at least one element is greater than n. In that case you have to use the method suggested by Jayram.
EDIT:
To restore the original array, just apply mod n to every element in the array.
With your constraints, this can be solved if the numbers other than the lonely number all occur an even number of times (i.e. 2, 4, 6, 8, ...), by XORing all the numbers together.
Beyond that case, I can't see how to do it with O(1) space complexity.
If you can relax your constraints, you could use one of these approaches:
Sort the numbers and keep a counter for the current number. If its count is greater than 1, move on to the next number, and so on. Space O(1), time O(n log n).
Use O(n) extra memory to count the occurrences of each number. Time O(n), space O(n).
I just want to extend banarun's answer.
Take the input as a map. For example, if a[0]=1, store it in myMap with 0 as the key and 1 as the value.
While reading the input, find the maximum number M. Then find a prime P greater than M.
Now iterate through the map and, for every key i of myMap, add P to myMap(myMap(i)%P); if myMap(myMap(i)%P) is not initialized, set it to P. Now iterate through myMap again; the position at which myMap[i] is >= P and < 2*P is your answer. Basically, the idea is to remove the overflow and overwrite problems from the algorithm banarun suggested.
Here is a mechanism which may not be as good as the others but which is instructive and gets to the core of why the XOR answer is as good as it is when k = 2.
1. Represent each number in base k. Suppose there are at most r digits in the representation.
2. Add up the right-most ('r'th) digits of all the numbers mod k, then the 'r - 1'st digits (mod k), and so on.
3. The final representation of r digits that you have is the answer.
For example, if the array is
A = {1, 2, 3, 4, 2, 3, 1, 2, 1, 3, 5, 4, 4}
Representation in base 3 is
A = {01, 02, 10, 11, 02, 10, 01, 02, 01, 10, 12, 11, 11}
r = 2
Sum of 'r'th place = 2
Sum of the 'r-1'th place = 1
Hence answer = {12} in base 3 which is 5.
This is an answer which will be O(n * r). Note that r is proportional to the logarithm of the largest number.
Why is the XOR answer in O(n) ? Because the processor provides an XOR operation which is performed in O(1) time rather than the O(r) factor that we have above.
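The digit-wise procedure above can be sketched in Python as follows, assuming non-negative integers where every value except one occurs exactly k times. The function name `find_single` is made up for illustration.

```python
def find_single(nums, k):
    """Sum each base-k digit position mod k; the leftover digits spell the answer."""
    result = 0
    place = 1              # current digit weight: 1, k, k^2, ...
    remaining = max(nums)  # used only to decide how many digits to process
    while remaining > 0:
        digit_sum = sum((x // place) % k for x in nums) % k
        result += digit_sum * place
        place *= k
        remaining //= k
    return result

print(find_single([1, 2, 3, 4, 2, 3, 1, 2, 1, 3, 5, 4, 4], 3))  # 5
```

Digits contributed by numbers that occur k times cancel out mod k, which is the same reason XOR (digit-wise addition mod 2) works for k = 2.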
According to banarun's solution (with small fixes):
Algorithm conditions:
for each i, arr[i] < N (size of array)
for each i, arr[i] >= 0 (non-negative)
The algorithm:
int[] arr = { 1, 2, 3, 4, 2, 3, 1, 2, 1, 3 };
int n = arr.Length;
for (int i = 0; i < n; i++)
{
    // arr[i] % n is the original value, even if n was already added to arr[i]
    arr[arr[i] % n] += n;
}
for (int i = 0; i < n; i++)
{
    // arr[i] / n is the number of times the value i occurs in the input
    if (arr[i] / n == 1)
        Console.WriteLine("single number = " + i);
}
This solution has time complexity O(N) and space complexity O(1).
Note:
Again, this algorithm only works if all numbers are non-negative and less than N.

Finding largest from each subarray of length k

Interview question: Given an array and an integer k, find the maximum of each and every contiguous subarray of size k.
Sample Input :
1 2 3 1 4 5 2 3 6
3 [ value of k ]
Sample Output :
3
3
4
5
5
5
6
I can't think of anything better than brute force. The worst case is O(nk) when the array is sorted in decreasing order.
Just iterate over the array and keep the last k elements in a self-balancing binary tree.
Adding an element to such a tree, removing an element and finding the current maximum each cost O(log k).
Most languages provide standard implementations of such trees. In the STL, IIRC, it's std::multiset. In Java you'd use TreeMap (a map, because you need to keep count of how many times each element occurs, and Java doesn't provide Multi- collections).
Pseudocode
for (int i = 0; i < n; ++i) {
    tree.add(a[i]);
    if (tree.size() > k) {
        tree.remove(a[i - k]);
    }
    if (tree.size() == k) {
        print(tree.max());
    }
}
You can actually do this in O(n) time with O(n) space.
Split the array into blocks of k elements each.
[a1 a2 ... ak] [a(k+1) ... a2k] ...
For each block, maintain two more blocks, the left block and the right block.
The ith element of the left block will be the max of the first i elements of the block, counting from the left.
The ith element of the right block will be the max of the first i elements of the block, counting from the right.
You will have two such blocks for each block of k.
Now suppose you want to find the max in the range a[i ... i+k-1], and say the elements span two of the above blocks of k.
[j-k+1 ... i i+1 ... j] [j+1 ... i+k-1 ... j+k]
All you need to do is take the max of the right max covering i to j in the first block and the left max covering j+1 to i+k-1 in the second block.
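A Python sketch of this block decomposition, using two flat arrays instead of per-block storage. The names `sliding_window_max`, `left` and `right` are made up for illustration.

```python
def sliding_window_max(a, k):
    n = len(a)
    if k <= 0 or k > n:
        return []
    left = [0] * n    # left[i]  = max of a[block start .. i]
    right = [0] * n   # right[i] = max of a[i .. block end]
    for i in range(n):
        left[i] = a[i] if i % k == 0 else max(left[i - 1], a[i])
    for i in range(n - 1, -1, -1):
        at_block_end = (i % k == k - 1) or (i == n - 1)
        right[i] = a[i] if at_block_end else max(right[i + 1], a[i])
    # window [i, i + k - 1]: tail of one block plus head of the next
    return [max(right[i], left[i + k - 1]) for i in range(n - k + 1)]

print(sliding_window_max([1, 2, 3, 1, 4, 5, 2, 3, 6], 3))  # [3, 3, 4, 5, 5, 5, 6]
```

Both passes and the final combination are linear, so the total is O(n) time with O(n) extra space.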
Hope this is the solution you are looking for (it computes the maximum sum of a contiguous subarray):
def MaxContigousSum(lst, n):
    # m[i] is the best sum of a contiguous run ending at i (0 if negative)
    m = [0]
    if lst[0] > 0:
        m[0] = lst[0]
    maxsum = m[0]
    for i in range(1, n):
        if m[i - 1] + lst[i] > 0:
            m.append(m[i - 1] + lst[i])
        else:
            m.append(0)
        if m[i] > maxsum:
            maxsum = m[i]
    return maxsum

lst = [-2, 11, -4, 13, -5, 2, 1, -3, 4, -2, -1, -6, -9]
print(MaxContigousSum(lst, len(lst)))
Output:
20 for [11, -4, 13]

Resources