Sum of product of distinct numbers and their count - algorithm

Given an array of N elements, where N is up to 200000 and array elements are at most 100000, we are given Q queries of the form [a, b]. For each query we need to report the sum, over each distinct number in the range a to b, of:
(count of that distinct number in the range)^2 * (value of that distinct number)
Example: let N=8 and the array be [1 1 2 2 1 3 1 1], and let Q=1 (just one query). For a=2 and b=7 the answer is 20.
Explanation :
occurrence of 1-> 3
occurrence of 2-> 2
occurrence of 3-> 1
cost=3*3*1 + 2*2*2 + 1*1*3= 20
If there were fewer queries this would not be a difficult problem, but Q can be up to 200000. So what is the best-suited data structure (or algorithm) for this problem?

Here is an offline O((n + q) * sqrt(n)) solution:
Let's divide the given array into sqrt(n) consecutive blocks with sqrt(n) elements each.
Let's group the queries by the block that contains their left border.
Now we will answer queries from each group individually:
Inside one group, we sort the queries by their right border (in increasing order).
Let's iterate over all queries from this group in the sorted order and maintain the following invariant: all numbers that lie inside a block fully covered by the current query (except, possibly, the first and the last blocks) are already processed. We can maintain it by processing the next block when we need it.
Given this invariant, we can get the answer to this query by looking only at numbers in the first and the last block(that contain borders of this query). There are at most O(sqrt(n)) such numbers, so we can simply iterate over them.
Clarification: we maintain an array count of size MAX_VALUE, where count[i] is the number of occurrences of i among the processed numbers, and curSum, the sum of the target function over the processed numbers. We can add or remove one number in O(1): increment or decrement count[i] and adjust curSum. "Processed" means the number has been taken into account in the count array and the curSum variable.
Time complexity: for each group, we traverse the array from left to right once to process the numbers in the inner blocks; over all O(sqrt(n)) groups this takes O(n * sqrt(n)) time. Each query contributes an additional O(sqrt(n)) for processing the numbers in its first and last blocks. Thus, the total time complexity is O((n + q) * sqrt(n)).
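For concreteness, here is a Python sketch of the same idea in its more common two-pointer form (often called Mo's algorithm): queries are sorted by (block of left border, right border) and a window is moved while count and curSum are maintained exactly as described above. This is not a line-for-line translation of the traversal described in the answer, and the function and helper names are illustrative.

from collections import defaultdict
from math import isqrt

def solve_queries(arr, queries):
    # queries: list of (l, r) pairs, 0-based and inclusive; answers returned in input order
    n = len(arr)
    block = max(1, isqrt(n))
    order = sorted(range(len(queries)),
                   key=lambda q: (queries[q][0] // block, queries[q][1]))

    count = defaultdict(int)
    cur_sum = 0

    def add(v):
        nonlocal cur_sum
        cur_sum -= count[v] * count[v] * v
        count[v] += 1
        cur_sum += count[v] * count[v] * v

    def remove(v):
        nonlocal cur_sum
        cur_sum -= count[v] * count[v] * v
        count[v] -= 1
        cur_sum += count[v] * count[v] * v

    answers = [0] * len(queries)
    cur_l, cur_r = 0, -1                    # current window [cur_l, cur_r], initially empty
    for qi in order:
        l, r = queries[qi]
        while cur_r < r:
            cur_r += 1
            add(arr[cur_r])
        while cur_l > l:
            cur_l -= 1
            add(arr[cur_l])
        while cur_r > r:
            remove(arr[cur_r])
            cur_r -= 1
        while cur_l < l:
            remove(arr[cur_l])
            cur_l += 1
        answers[qi] = cur_sum
    return answers

# Example from the question (1-based range [2, 7] becomes 0-based (1, 6)):
# solve_queries([1, 1, 2, 2, 1, 3, 1, 1], [(1, 6)])   # -> [20]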

Related

How to find 2 special elements in the array in O(n)

Let a1,...,an be a sequence of real numbers. Let m be the minimum of the sequence, and let M be the maximum of the sequence.
I proved that there exist two elements in the sequence, x and y, such that |x-y| <= (M-m)/n.
Now, is there an algorithm that finds two such elements in O(n) time?
I thought about sorting the sequence, but since I don't know anything about M I cannot use radix/bucket sort or any other linear-time algorithm that I'm familiar with.
I'd appreciate any idea.
Thanks in advance.
1. First find out n, M, m. If not already given, they can be determined in O(n).
2. Create storage for n+1 elements; we will use it for n+1 buckets of width w = (M-m)/n. The buckets cover the range of values evenly: bucket 1 covers [m; m+w[, bucket 2 covers [m+w; m+2*w[, ..., bucket n covers [m+(n-1)*w; m+n*w[ = [M-w; M[, and bucket n+1 covers [M; M+w[.
3. Go once through all the values and place each into the bucket whose interval contains it. There should be at most one element per bucket. If a bucket is already filled, the two elements are closer together than the width of the half-open interval, i.e. we have found elements x, y with |x-y| < w = (M-m)/n.
4. If no two such elements are found, then afterwards n of the n+1 buckets are filled with exactly one element each, and those elements appear in sorted order across the buckets. We go through the buckets once more and compare only the contents of neighbouring buckets, checking whether some pair fulfils the condition. Due to the bucket width, the condition cannot hold for elements in non-adjoining buckets: for those, the distance is always |x-y| > w.
(The fulfilment of the last inequality in step 4 is also the reason why the intervals are half-open rather than closed, and why we need n+1 buckets instead of n. An alternative would be to use n buckets and make the last bucket a special case with [M; M+w]. But O(n+1) = O(n), and using n+1 buckets is preferable to special-casing the last one.)
The running time is O(n) for step 1, essentially 0 for step 2 (we do not actually do anything there beyond allocation), O(n) for step 3, and O(n) for step 4, as there is at most one element per bucket. Altogether O(n).
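As a concrete illustration, here is a minimal Python sketch of steps 1-4 (the function name and the assumption of at least two real-valued inputs are mine):

def find_close_pair(values):
    # Looks for x, y with |x - y| <= (M - m) / n; returns None if no such pair exists.
    n = len(values)
    m, M = min(values), max(values)          # step 1: O(n)
    if m == M:
        return values[0], values[1]          # all values equal; any pair works
    w = (M - m) / n
    buckets = [None] * (n + 1)               # step 2: n + 1 half-open buckets of width w
    for x in values:                         # step 3: distribute the values
        k = int((x - m) / w)                 # x == M lands in the (n+1)th bucket (index n)
        if buckets[k] is not None:
            return buckets[k], x             # two values share a bucket: |x - y| < w
        buckets[k] = x
    # step 4: every bucket holds at most one value and the filled buckets are sorted,
    # so only values in neighbouring buckets can satisfy the condition
    filled = [x for x in buckets if x is not None]
    for prev, cur in zip(filled, filled[1:]):
        if cur - prev <= w:
            return prev, cur
    return None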
This task shows that sorting elements which are not close together, or coarse sorting that ignores fine distances, can be done in O(n) instead of O(n*log(n)). It has useful applications: numbers on computers are discrete and have finite precision, and I have successfully used this sorting method for signal processing / fast sorting in real-time production code.
About @Damien's remark: the threshold (M-m)/(n-1) is the one that is provably attainable for every such sequence. So far the answer assumed that the sequence we are looking at is of a special kind for which the stronger condition holds, or at least that, for any sequence where the stronger condition does hold, we find such elements in O(n).
If this was instead a small mistake by the OP (who said they had proven the stronger condition) and we should find two elements x, y with |x-y| <= (M-m)/(n-1), we can simplify:
1.-3. We do steps 1 to 3 as above, but with n buckets and the bucket width set to w = (M-m)/(n-1). Bucket n now covers [M; M+w[.
For step 4 we use the following alternative:
4. (alternative) All n buckets are filled with one element each. The element in bucket n has to be M, and it sits at the left boundary of that bucket's interval. For any element x in the (n-1)th bucket, the distance to y = M is |M-x| <= w = (M-m)/(n-1), so we have found x and y which fulfil the condition, q.e.d.
First note that the real threshold should be (M-m)/(n-1).
The first step is to calculate the min m and max M elements, in O(N).
You calculate the value mid = (m + M)/2.
You move the values less than mid to the beginning, and those greater than mid to the end of the array.
You select the part with the larger number of elements and iterate until very few numbers are left.
If both parts have the same number of elements, you can select either of them. If the remaining part has many more than n/2 elements, then in order to maintain O(n) complexity you can keep only n/2 + 1 of them, since the goal is not to find the smallest difference, but merely one difference that is small enough.
As indicated in a comment by @btilly, this solution can fail in some cases, for example with the input [0, 2.1, 2.9, 5]. To handle that, one also needs to compute the maximum value of the left half and the minimum value of the right half, and test whether the answer is right_min - left_max. This doesn't change the O(n) complexity, even if the solution becomes less elegant.
Complexity of the search procedure: O(n) + O(n/2) + O(n/4) + ... + O(2) = O(2n) = O(n).
Damien is correct in his comment that the correct result is that there must be x, y such that |x-y| <= (M-m)/(n-1). If you have the sequence [0, 1, 2, 3, 4] you have 5 elements, but no two elements are closer than (M-m)/n = (4-0)/5 = 4/5.
With the right threshold, the solution is easy - find M and m by scanning through the input once, and then bucket the input into (n-1) buckets of size (M-m)/(n-1), putting values that are on the boundaries of a pair of buckets into both buckets. At least one bucket must have two values in it by the pigeon-hole principle.

Can this be properly modeled with segment trees?

The problem I'm working on requires processing several queries on an array (the size of the array is less than 10k, the largest element is certainly less than 10^9).
A query consists of two integers, and one must find the total count of subarrays that have an equal count of these integers. There may be up to 5 * 10^5 queries.
For instance, given the array [1, 2, 1], and the query 1 2 we find that there are two subarrays with equal counts of 1 and 2, namely [1, 2] and [2, 1].
My initial approach was dynamic programming to construct a map such that memo[i][j] = the number of times the number i appears in the array up to index j. I would use this much like prefix sums, except that frequencies accumulate instead.
Constructing this map took me O(n^2). For each query, I'd do O(1) processing per interval and increment the answer. This leads to a complexity of O((q + 1) * n * (n - 1) / 2) [q is the number of queries], i.e. on the order of q * n^2 operations, and I also wanted to emphasize that daunting factor.
After some rearranging, I'm trying to find out whether there's a way to determine, for every subarray, the frequency count of each element. I strongly feel this problem is about segment trees; I've struggled to come up with a proper model, and this was the only thing I could think of.
However, my approach doesn't seem very useful here, considering the complexity of combining nodes that hold such a large amount of information, not to mention the memory overhead.
How can this be solved efficiently?
Idea 1
You can reduce the time for each query from O(n^2) to O(n) by computing the frequency count of the cumulative count difference:
from collections import defaultdict

def query(A, a, b):
    # t is the running discrepancy: (count of a) - (count of b) in the prefix seen so far
    t = 0
    freq = defaultdict(int)
    freq[0] = 1              # the empty prefix has discrepancy 0
    for x in A:
        if x == a:
            t += 1
        elif x == b:
            t -= 1
        freq[t] += 1
    # any two prefixes with the same discrepancy bound a subarray with equal counts
    return sum(count * (count - 1) // 2 for count in freq.values())

print(query([1, 2, 1], 1, 2))   # -> 2
The idea is that t represents the total discrepancy between the count of the two elements.
If we find two positions in the array with the same total discrepancy, we can conclude that the subarray between these positions must contain an equal number of each.
The expression count*(count-1)/2 simply counts the number of ways of choosing two positions from the count which have the same discrepancy.
Example
For example, suppose we have the array [1,1,1,2,2,2]. The values for the cumulative discrepancy (number of 1's take away number of 2's) will be:
0,1,2,3,2,1,0
Each pair with the same number, corresponds to a subarray with equal count. e.g. looking at the pair of 2s we find that the range from position 2 to position 4 has equal count.
Idea 2
If this is still not fast enough, you could optimize the query function to quickly skip over all elements that are not equal to a or b. For example, you could prepare a list for each element value that contains all the locations of that element.
Once you have this list, you can then instantly jump to the next location of either a or b. For all intermediate values we know the discrepancy will not change, so you can update the frequency by the number of skipped elements (instead of always adding just 1 to the count).
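A possible Python sketch of this optimization (the precomputed pos dictionary, the query_fast name, and the assumption that a != b are mine, not from the answer): between consecutive occurrences of a or b the discrepancy t is constant, so all prefixes in that gap are credited to freq[t] in one step.

import heapq
from collections import defaultdict

def query_fast(pos, n, a, b):
    # pos: dict mapping each value to the sorted list of its positions in the array
    # n:   length of the array
    if a == b:
        return n * (n + 1) // 2              # every subarray trivially has equal counts
    events = heapq.merge(((i, +1) for i in pos.get(a, [])),
                         ((i, -1) for i in pos.get(b, [])))
    t = 0                                    # running discrepancy count(a) - count(b)
    freq = defaultdict(int)
    prev = -1                                # previous occurrence of a or b (-1 = before the array)
    for idx, delta in events:
        freq[t] += idx - prev                # prefixes of length prev+1 .. idx share discrepancy t
        t += delta
        prev = idx
    freq[t] += n - prev                      # prefixes after the last occurrence
    return sum(c * (c - 1) // 2 for c in freq.values())

# pos can be built once in O(n) and reused for all queries:
# pos = defaultdict(list)
# for i, x in enumerate(A):
#     pos[x].append(i)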

Given an array, and a series of queries for indices `i j` how do I find out which index has been queried the most number of times efficiently?

Given an array A, indexed from 0 to n-1 where n is the size of the array, and a series of queries of the form i j, where i and j indicate indices (both inclusive), how do I find out which index has been queried the most number of times, efficiently?
For example, consider an array [3,4,5,6,7,9]
And queries
0 3
3 5
1 2
2 4
Output
Index 0 has been queried 1 time.
Index 1 has been queried 2 times.
Index 2 has been queried 3 times.
Index 3 has been queried 3 times.
Index 4 has been queried 2 times.
Index 5 has been queried 1 time.
How do I make this as fast as possible?
You can do this in O(n+q) where n is size of array and q is the number of queries by:
Make an array A of n+1 zeros
For each query i,j increase A[i] by 1 and decrease A[j+1] by 1
Loop over the array computing the cumulative total and keep track of the index with the highest cumulative total
The cumulative total will contain +1 for each interval where we have seen the start, and -1 for each interval where we have seen the end. This means that the total will give the count of current open intervals, or in other words the number of times that entry has been queried.
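A short Python sketch of this difference-array approach (the function name and return format are illustrative):

def most_queried_index(n, queries):
    # queries: list of (i, j) pairs, 0-based and inclusive
    diff = [0] * (n + 1)
    for i, j in queries:
        diff[i] += 1                         # an interval opens at i
        diff[j + 1] -= 1                     # ... and closes after j
    best_idx, best_cnt, running = 0, -1, 0
    for idx in range(n):
        running += diff[idx]                 # number of intervals covering idx
        if running > best_cnt:
            best_idx, best_cnt = idx, running
    return best_idx, best_cnt

# Example from the question:
# most_queried_index(6, [(0, 3), (3, 5), (1, 2), (2, 4)])   # -> (2, 3)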
You can use an interval tree to store all your queries (construction of the tree takes O(n log n) time), then check for each array entry how many intervals contain it (O(log n) time per entry).
A more naive, but still effective, approach would be to use an auxiliary array A of size n (all entries initialized to 0) and, for each query, do:
for (int k = i; k <= j; k++)
    A[k]++;
And then just print the array.
There are several ways to measure this: either the accesses (and any processing done during them) are "free" and you only want to obtain the count efficiently after all queries are done, or you need to account for both the processing during the accesses and the final search.
In the first case, set up a priority queue and bump each one up as it is accessed. The final step of getting the most accessed is constant.
In the second case, you can't do better than to count each access for each index (linear in the number of accesses) and at the end go through the counts to pick the largest (linear in the number of indices, presumably less than the number of accesses).
1) Sort all the query endpoints (starts and ends). O(q log q)
2) num_times_queried[x] = num_queries_started_till[x] - num_queries_ended_till[x-1]
num_queries_started_till(x) and num_queries_ended_till(x) can each be found in O(log q) using binary search.
Total: O(q log q + n log q).

How to find number of expected swaps in bubble sort in better than O(n^2) time

I am stuck on problem http://www.codechef.com/JULY12/problems/LEBOBBLE
Here it is required to find number of expected swaps.
I tried an O(n^2) solution but it is timing out.
The code is like:
swaps = 0
for (i = 0; i < n-1; i++)
    for (j = i+1; j < n; j++)
    {
        swaps += expected swap of A[i] and A[j]
    }
Since the probabilities of the elements vary, every pair needs to be compared, so according to me the above snippet should already be as efficient as possible, yet it is timing out.
Can it be done in O(n log n), or in any complexity better than O(n^2)?
Give me any hint if possible.
Alright, let's think about this.
We realize that every number needs to be eventually swapped with every number after it that's less than it, sooner or later. Thus, the total number of swaps for a given number is the total number of numbers after it which are less than it. However, this is still O(n^2) time.
For every pass of the outer loop of bubble sort, one element gets put in the correct position. Without loss of generality, we'll say that for every pass, the largest element remaining gets sorted to the end of the list.
So, in the first pass of the outer loop, the largest number is put at the end. This takes q swaps, where q is the number of positions the number started away from the final position.
Thus, we can say that it will take q_1 + q_2 + ... + q_n swaps to complete this bubble sort. However, keep in mind that with every swap, one number is moved either one position closer to or one position farther from its final position. In our specific case, if a number is in front of a larger number, and at or in front of its correct position, one more swap will be required. However, if a number is behind a larger number and behind its correct position, one less swap will be required.
We can see that this is true with the following example:
5 3 1 2 4
=> 3 5 1 2 4
=> 3 1 5 2 4
=> 3 1 2 5 4
=> 3 1 2 4 5
=> 1 3 2 4 5
=> 1 2 3 4 5 (6 swaps total)
"5" moves 4 spaces. "3" moves 1 space. "1" moves 2 spaces. "2" moves 2 spaces. "4" moves 1 space. Total: 10 spaces.
Note that 3 is behind 5 and in front of its correct position. Thus one more swap will be needed. 1 and 2 are behind 3 and 5 -- four less swaps will be needed. 4 is behind 5 and behind its correct position, thus one less swap will be needed. We can see now that the expected value of 6 matches the actual value.
We can compute Σq by sorting the list first, keeping the original positions of each of the elements in memory while doing the sort. This is possible in O(nlogn + n) time.
We can also see what numbers are behind what other numbers, but this is impossible to do in faster than O(n^2) time. However, we can get a faster solution.
Every swap effectively moves two numbers toward their correct positions, but some swaps actually accomplish nothing, because one number gets closer while the other gets farther. The first swap in our previous example, between "3" and "5", is the only such case in our example.
We have to calculate how many of these swaps there are. This is left as an exercise to the reader, but here's one last hint: you only have to loop through the first half of the list. Though this is still, in the end, O(n^2), we only have to do the O(n^2) work on the first half of the list, making it much faster overall.
Use divide and conquer:
divide: split the sequence of size n into two lists of size n/2
conquer: recursively count the inversions in the two halves
combine: this is the tricky part (doing it in linear time)
For the combine step, use merge-and-count. Suppose the two lists are A and B. They are already sorted. Produce an output list L from A and B while also counting the number of inversions (a, b) where a is in A, b is in B and a > b.
The idea is similar to "merge" in merge sort: merge the two sorted lists into one output list, but also count the inversions as you go.
Every time a_i is appended to the output, no new inversions are encountered, since a_i is smaller than everything left in list B. If b_j is appended to the output, then it is smaller than all the remaining items in A, so we increase the inversion count by the number of elements remaining in A.
merge-and-count(A, B)
    ; A, B: the two input lists (sorted)
    ; C: the output list
    ; i, j: current pointers into each list, starting at the beginning
    ; a_i, b_j: elements pointed to by i, j
    ; count: number of inversions, initially 0
    while A, B != empty
        append min(a_i, b_j) to C
        if b_j < a_i
            count += number of elements remaining in A
            j++
        else
            i++
    ; now one list is empty
    append the remainder of the non-empty list to C
    return count, C
With merge-and-count, we can design the inversion-counting algorithm as follows:
sort-and-count(L)
    if L has one element
        return (0, L)
    else
        divide L into A, B
        (rA, A) = sort-and-count(A)
        (rB, B) = sort-and-count(B)
        (r, L) = merge-and-count(A, B)
        return (rA + rB + r, L)
T(n) = O(n lg n)
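For reference, here is a runnable Python version of the sort-and-count / merge-and-count routine above (a sketch: it counts plain inversions, the O(n log n) building block; weighting each pair by its probability of being out of order, as the original problem requires, is not shown):

def sort_and_count(L):
    # Returns (number of inversions in L, sorted copy of L); O(n log n).
    if len(L) <= 1:
        return 0, L
    mid = len(L) // 2
    inv_a, A = sort_and_count(L[:mid])
    inv_b, B = sort_and_count(L[mid:])
    merged, inv, i, j = [], 0, 0, 0
    while i < len(A) and j < len(B):
        if B[j] < A[i]:
            inv += len(A) - i                # B[j] is smaller than every remaining element of A
            merged.append(B[j]); j += 1
        else:
            merged.append(A[i]); i += 1
    merged.extend(A[i:]); merged.extend(B[j:])
    return inv_a + inv_b + inv, merged

# Example: the list [5, 3, 1, 2, 4] from the previous answer has 6 inversions.
# sort_and_count([5, 3, 1, 2, 4])   # -> (6, [1, 2, 3, 4, 5])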

random merge sort

I was given the following question in an algorithms book:
Suppose a merge sort is implemented to split a file at a random position, rather than exactly in the middle. How many comparisons would such a method use, on average, to sort n elements?
Thanks.
To guide you to the answer, consider these more specific questions:
Assume the split is always at 10%, or 25%, or 75%, or 90%. In each case: what's the impact on the recursion depth? How many comparisons need to be made per recursion level?
I partially agree with @Armen; they should be comparable.
But: consider the case when the lists are split in the middle. To merge two lists of length n we would need 2*n - 1 comparisons (sometimes fewer, but we'll treat it as fixed for simplicity), each of them producing the next output value. There would be log2(n) levels of merges, which gives us approximately n*log2(n) comparisons.
Now consider the random-split case: the maximum number of comparisons needed to merge a list of length n1 with one of length n2 is n1 + n2 - 1. However, the average number will be close to that, because even for the most unfortunate split, 1 and n-1, we'll need an average of n/2 comparisons. So we can consider the cost of merging per level to be the same as in the even-split case.
The difference is that in the random case the number of levels will be larger, and we can consider that the n for the next level would be max(n1, n2) instead of n/2. This max(n1, n2) will tend toward 3*n/4, which gives us the approximate formula
n * log_{4/3}(n)
that gives us
n * log2(n) / log2(4/3) ~= 2.4 * n * log2(n)
This result is still larger than the correct one because we ignored that the smaller list will have fewer levels, but it should be close enough. I suppose the correct answer is that the number of comparisons will, on average, roughly double.
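If you want to sanity-check this estimate empirically, here is a small Python sketch (the names and the experiment are mine) that counts comparisons made by a merge sort with a uniformly random split point:

import math
import random

def merge_count(A, B):
    # Merge two sorted lists; return (merged list, number of comparisons used).
    out, comps, i, j = [], 0, 0, 0
    while i < len(A) and j < len(B):
        comps += 1
        if A[i] <= B[j]:
            out.append(A[i]); i += 1
        else:
            out.append(B[j]); j += 1
    out.extend(A[i:]); out.extend(B[j:])
    return out, comps

def random_split_mergesort(xs):
    # Merge sort that splits at a uniformly random position; returns (sorted, comparisons).
    if len(xs) <= 1:
        return list(xs), 0
    k = random.randint(1, len(xs) - 1)       # random split point; both parts non-empty
    left, cl = random_split_mergesort(xs[:k])
    right, cr = random_split_mergesort(xs[k:])
    merged, cm = merge_count(left, right)
    return merged, cl + cr + cm

# Rough empirical check of the estimate above:
# n = 4096
# data = list(range(n)); random.shuffle(data)
# _, c = random_split_mergesort(data)
# print(c, "vs n*log2(n) =", int(n * math.log2(n)))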
You can get an upper bound of 2n * H_{n-1} (roughly 2n ln n) using the fact that merging two lists of total length n costs at most n comparisons. The analysis is similar to that of randomized quicksort (see http://www.cs.cmu.edu/afs/cs/academic/class/15451-s07/www/lecture_notes/lect0123.pdf).
First, suppose we split a list of length n into 2 lists L and R. We will charge the first element of R for a comparison against all of the elements of L, and the last element of L for a comparison against all elements of R. Even though these may not be the exact comparisons that are executed, the total number of comparisons we are charging for is n as required.
This handles one level of recursion, but what about the rest? We proceed by concentrating only on the "right-to-left" comparisons that occur between the first element of R and every element of L at all levels of recursion (by symmetry, this will be half the actual expected total). The probability that the jth element is compared to the ith element is 1/(j - i) where j > i. To see this, note that element j is compared with element i exactly when it is the first element chosen as a "splitting element" from among the set {i + 1,..., j}. That is, elements i and j are split into two lists as soon as the list they are in is split at some element from {i + 1,..., j}, and element j is charged for a comparison with i exactly when element j is the element that is chosen from this set.
Thus, the expected total number of comparisons involving j is at most H_{n-1} (i.e., 1 + 1/2 + 1/3 + ..., where the number of terms is at most n - 1). Summing across all possible j gives n * H_{n-1}. This only counted "right-to-left" comparisons, so the final upper bound is 2n * H_{n-1}.

Resources