Data structure that supports range based most frequently occurring element query - algorithm

I'm looking for a data structure with which I can find the most frequently occurring number (among an array of numbers) in a given, variable range.
Let's consider the following 1 based array:
1 2 3 1 1 3 3 3 3 1 1 1 1
If I query the range (1,4), the data structure must return 1, which occurs twice.
Several other examples:
(1,13) = 1
(4,9) = 3
(2,2) = 2
(1,3) = 1 (all of 1,2,3 occur once, so return the first/smallest one. not so important at the moment)
I have searched, but could not find anything similar. I'm looking (ideally) for a data structure with minimal space requirements and fast preprocessing and/or query times.
Thanks in advance!

Let N be the size of the array and M the number of different values in that array.
I'm considering two costs, pre-processing and querying an interval of size n, each measured in both space and time.
Solution 1 :
Space : O(1) and O(M)
Time : O(1) and O(n + M)
No pre-processing, we look at all values of the interval and find the most frequent one.
Solution 2 :
Space : O(M*N) and O(1)
Time : O(M*N) and O(min(n,M))
For each position of the array, we store a cumulative array that gives, for each value x, how many times x occurs in the array before that position.
Given an interval we just need for each x to subtract 2 values to find the number of x in that interval. We iterate over each x and find the maximum value. If n < M we iterate over each value of the interval, otherwise we iterate over all possible values for x.
Solution 3 :
Space : O(N) and O(1)
Time : O(N) and O(min(n,M)*log(n))
For each value x, build a binary heap of all the positions in the array where x is present. The key in the heap is the position, but you also store the total number of occurrences of x between this position and the beginning of the array.
Given an interval, for each x we just need to subtract 2 values to find the number of occurrences of x in that interval: in O(log(N)) we can ask x's heap for the two positions just before the start and end of the interval and subtract the stored counts. Basically it needs less space than a histogram, but the query is now in O(log(N)) per value.
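A minimal sketch of Solution 3 in Python. As an assumption, the per-value structure here is a sorted list of positions searched with bisect, which supports the "position just before" lookup directly. The names build_positions, count_in_range and range_mode are hypothetical. Indices are 0-based and inclusive, and ties go to the smallest value.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

def build_positions(arr):
    """Map each value to the increasing list of indices where it occurs."""
    pos = defaultdict(list)
    for i, x in enumerate(arr):
        pos[x].append(i)  # indices arrive in increasing order, so lists stay sorted
    return pos

def count_in_range(pos, x, lo, hi):
    """Number of occurrences of x in arr[lo..hi], via two binary searches."""
    p = pos.get(x, ())
    return bisect_right(p, hi) - bisect_left(p, lo)

def range_mode(arr, pos, lo, hi):
    """Return (value, count) of the most frequent value in arr[lo..hi]."""
    n = hi - lo + 1
    # Iterate over whichever candidate set is smaller: the values that
    # appear in the interval, or all M distinct values.
    candidates = set(arr[lo:hi + 1]) if n < len(pos) else pos.keys()
    best, best_count = None, 0
    for x in candidates:
        c = count_in_range(pos, x, lo, hi)
        if c > best_count or (c == best_count and c > 0 and x < best):
            best, best_count = x, c
    return best, best_count
```

With the array from the question, range_mode(arr, pos, 3, 8) corresponds to the 1-based query (4,9).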

You could create a binary partition tree where each node represents a histogram map of {value -> frequency} for a given range, and has two child nodes which represent the upper half and lower half of the range.
Querying is then just a case of recursively adding together a small number of these histograms to cover the range required, and scanning the resulting histogram once to find the highest occurrence count.
Useful optimizations include:
Using a histogram with mutable frequency counts as an "accumulator" while you add histograms together
Stop using precomputed histograms once you get down to a certain size (maybe a range smaller than the total number of possible values M) and just count the numbers directly. It's a time/space trade-off that I think will pay off a lot of the time.
If you have a fixed small number of possible values, use an array rather than a map to store the frequency counts at each node
UPDATE: my thinking on algorithmic complexity assuming a bounded small number of possible values M and a total of N values in the complete range:
Preprocessing is O(N log N) - basically you need to traverse the complete list and build a binary tree, building one node for every M elements in order to amortise the overhead of each node
Querying is O(M log N) - basically adding up O(log N) histograms each of size M, plus counting O(M) values on either side of the range
Space requirement is O(N) - approx. 2N/M histograms each of size M. The 2 factor is the sum from having N/M histograms at the bottom level, 0.5N/M histograms at the next level, 0.25N/M at the third level etc...
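A rough sketch of the histogram tree in Python, under stated assumptions: the class name HistogramTree, the leaf_size cut-off and the dict-based node layout are all hypothetical choices, and collections.Counter stands in for the histogram map. Ranges are 0-based and inclusive in most_frequent, and ties go to the smallest value.

```python
from collections import Counter

class HistogramTree:
    def __init__(self, arr, leaf_size=4):
        self.arr = arr
        self.leaf_size = leaf_size  # below this size, count elements directly
        self.root = self._build(0, len(arr))

    def _build(self, lo, hi):
        # Each node covers the half-open range [lo, hi) and stores its histogram.
        node = {"lo": lo, "hi": hi, "hist": Counter(self.arr[lo:hi]),
                "left": None, "right": None}
        if hi - lo > self.leaf_size:
            mid = (lo + hi) // 2
            node["left"] = self._build(lo, mid)
            node["right"] = self._build(mid, hi)
        return node

    def _collect(self, node, lo, hi, acc):
        if hi <= node["lo"] or node["hi"] <= lo:
            return  # no overlap with the query
        if lo <= node["lo"] and node["hi"] <= hi:
            acc.update(node["hist"])  # node fully covered: add its histogram
            return
        if node["left"] is None:
            # Partially covered leaf: count the overlapping elements directly.
            acc.update(self.arr[max(lo, node["lo"]):min(hi, node["hi"])])
            return
        self._collect(node["left"], lo, hi, acc)
        self._collect(node["right"], lo, hi, acc)

    def most_frequent(self, lo, hi):
        """Return (value, count) for the inclusive range arr[lo..hi]."""
        acc = Counter()  # mutable accumulator histogram
        self._collect(self.root, lo, hi + 1, acc)
        return min(acc.items(), key=lambda kv: (-kv[1], kv[0]))
```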

Related

How to find 2 special elements in the array in O(n)

Let a1,...,an be a sequence of real numbers. Let m be the minimum of the sequence, and let M be the maximum of the sequence.
I proved that there exist 2 elements in the sequence, x,y, such that |x-y|<=(M-m)/n.
Now, is there an algorithm that finds 2 such elements in O(n) time?
I thought about sorting the sequence, but since I don't know anything about M I cannot use radix/bucket sort or any other linear time algorithm that I'm familiar with.
I'd appreciate any idea.
Thanks in advance.
1. First find out n, M, m. If not already given, they can be determined in O(n).
2. Then create a memory storage of n+1 elements; we will use the storage for n+1 buckets with width w=(M-m)/n.
The buckets cover the range of values equally: Bucket 1 goes from [m; m+w[, Bucket 2 from [m+w; m+2*w[, Bucket n from [m+(n-1)*w; m+n*w[ = [M-w; M[, and the (n+1)th bucket from [M; M+w[.
3. Now we go once through all the values and sort them into the buckets according to the assigned intervals. There should be at most 1 element per bucket. If the bucket is already filled, it means that the elements are closer together than the boundaries of the half-open interval, i.e. we found elements x, y with |x-y| < w = (M-m)/n.
If no such two elements are found, then afterwards n of the n+1 total buckets are filled with one element each, and all those elements are sorted.
4. We go through all the buckets once more and compare only the contents of neighbouring buckets, checking whether there are two elements which fulfil the condition.
Due to the width of the buckets, the condition cannot be true for buckets which are not adjoining: for those the distance is always |x-y| > w.
(The fulfilment of the last inequality in 4. is also the reason, why the interval is half-open and cannot be closed, and why we need n+1 buckets instead of n. An alternative would be, to use n buckets and make the now last bucket a special case with [M; M+w]. But O(n+1)=O(n) and using n+1 steps is preferable to special casing the last bucket.)
The running time is O(n) for step 1, 0 for step 2 - we actually do not do anything there, O(n) for step 3 and O(n) for step 4, as there is only 1 element per bucket. Altogether O(n).
This task shows that either sorting elements which are not close together, or coarse sorting without considering fine distances, can be done in O(n) instead of O(n*log(n)). It has useful applications. Numbers on computers are discrete; they have a finite precision. I have successfully used this sorting method for signal-processing / fast sorting in real-time production code.
About @Damien's remark: the real threshold of (M-m)/(n-1) is provably true for every such sequence. In the answer so far I assumed that the sequence we are looking at is a special kind where the stronger condition is true, or at least that, for any sequence where the stronger condition is true, we would find such elements in O(n).
If this was a small mistake of the OP instead (who said to have proven the stronger condition) and we should find two elements x, y with |x-y| <= (M-m)/(n-1) instead, we can simplify:
1.-3. We would do steps 1 to 3 as above, but with n buckets and the bucket width set to w = (M-m)/(n-1). Bucket n now goes from [M; M+w[.
For step 4 we would do the following alternative:
4./alternative: n buckets are filled with one element each. The element in bucket n has to be M and sits at the left boundary of the bucket interval. The distance from this element y = M to any possible element x in the (n-1)th bucket is |M-x| <= w = (M-m)/(n-1), so we have found x and y which fulfil the condition, q.e.d.
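As a hedged illustration of the n-bucket variant with w = (M-m)/(n-1), here is a short Python sketch. The function name close_pair is hypothetical, the input is assumed to hold at least two numbers, and floating-point rounding near bucket boundaries is ignored.

```python
def close_pair(values):
    """Return two elements x, y with |x - y| <= (max - min) / (n - 1) in O(n)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    lo, hi = min(values), max(values)
    if lo == hi:
        return values[0], values[1]  # all values equal, any pair works
    w = (hi - lo) / (n - 1)          # bucket width
    buckets = [None] * n             # bucket i covers [lo + i*w, lo + (i+1)*w)
    for x in values:
        i = min(int((x - lo) / w), n - 1)  # the maximum lands in the last bucket
        if buckets[i] is not None:
            return buckets[i], x     # same bucket, so the distance is below w
        buckets[i] = x
    # No collision: n values fill n buckets, one each. The last bucket holds
    # the maximum, and its left neighbour is at distance at most w from it.
    return buckets[n - 2], buckets[n - 1]
```

For [0, 1, 2, 3, 4] no bucket collides, and the fallback returns the last two bucket contents, which are exactly w = 1 apart.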
First note that the real threshold should be (M-m)/(n-1).
The first step is to calculate the min m and max M elements, in O(N).
You calculate the value mid = (m + M)/2.
You concentrate the values less than mid at the beginning, and those greater than mid at the end of the array.
You select the part with the largest number of elements, and you iterate until very few numbers are kept.
If both parts have the same number of elements, you can select either of them. If the remaining part has many more elements than n/2, then in order to maintain O(n) complexity you can keep only n/2 + 1 of them, as the goal is not to find the smallest difference, but only a difference that is small enough.
As indicated in a comment by @btilly, this solution could fail in some cases, for example with an input [0, 2.1, 2.9, 5]. To handle that, you need to calculate the max value of the left part and the min value of the right part, and test whether right_min - left_max is itself a valid answer. This doesn't change the O(n) complexity, even if the solution becomes less elegant.
Complexity of the search procedure: O(n) + O(n/2) + O(n/4) + ... + O(2) = O(2n) = O(n).
Damien is correct in his comment that the correct result is that there must be x, y such that |x-y| <= (M-m)/(n-1). If you have the sequence [0, 1, 2, 3, 4] you have 5 elements, but no two elements are closer than (M-m)/n = (4-0)/5 = 4/5.
With the right threshold, the solution is easy - find M and m by scanning through the input once, and then bucket the input into (n-1) buckets of size (M-m)/(n-1), putting values that are on the boundaries of a pair of buckets into both buckets. At least one bucket must have two values in it by the pigeon-hole principle.

Can this be properly modeled with segment trees?

The problem I'm working on requires processing several queries on an array (the size of the array is less than 10k, the largest element is certainly less than 10^9).
A query consists of two integers, and one must find the total count of subarrays that have an equal count of these integers. There may be up to 5 * 10^5 queries.
For instance, given the array [1, 2, 1], and the query 1 2 we find that there are two subarrays with equal counts of 1 and 2, namely [1, 2] and [2, 1].
My initial approach was using dynamic programming in order to construct a map, such that memo[i][j] = the number of times the number i appears in the array, until index j. I would use this in a similar way one would use prefix sums, but instead frequencies would accumulate.
Constructing this map took me O(n^2). For each query, I'd do an O(1) processing for each interval and increment the answer. This leads to a complexity of O((q + 1) * n * (n - 1) / 2) [q is the number of queries], which is to say O(q * n^2), but I also wanted to emphasize that daunting constant factor.
After some rearrangement, I'm trying to find out if there's a way to determine for every subarray the frequency count of each element. I strongly feel this problem is about segment trees and I've struggled with coming up with a proper model and this was the only thing I could think of.
However my approach doesn't seem to be too useful in this case, considering the complexity of combining nodes holding such a great amount of information, not to mention the memory overhead.
How can this be solved efficiently?
Idea 1
You can reduce the time for each query from O(n^2) to O(n) by computing the frequency count of the cumulative count difference:
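from collections import defaultdict
def query(A, a, b):
    t = 0  # running discrepancy: (count of a) - (count of b)
    freq = defaultdict(int)
    freq[0] = 1  # the empty prefix has discrepancy 0
    for x in A:
        if x == a:
            t += 1
        elif x == b:
            t -= 1
        freq[t] += 1
    # number of ways to choose 2 prefix positions with the same discrepancy
    return sum(count * (count - 1) // 2 for count in freq.values())
print(query([1, 2, 1], 1, 2))  # prints 2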
The idea is that t represents the total discrepancy between the count of the two elements.
If we find two positions in the array with the same total discrepancy we can conclude that the subarray between these positions must have an equal number.
The expression count*(count-1)/2 simply counts the number of ways of choosing two positions from the count which have the same discrepancy.
Example
For example, suppose we have the array [1,1,1,2,2,2]. The values for the cumulative discrepancy (number of 1's take away number of 2's) will be:
0,1,2,3,2,1,0
Each pair with the same number, corresponds to a subarray with equal count. e.g. looking at the pair of 2s we find that the range from position 2 to position 4 has equal count.
Idea 2
If this is still not fast enough, you could optimize the query function to quickly skip over all elements that are not equal to a or b. For example, you could prepare a list for each element value that contains all the locations of that element.
Once you have this list, you can then instantly jump to the next location of either a or b. For all intermediate values we know the discrepancy will not change, so you can update the frequency by the number of skipped elements (instead of always adding just 1 to the count).
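One possible way to sketch Idea 2 in Python: the position lists are built once, and each query merges the two relevant lists and credits the skipped prefix positions in bulk. The names build_positions and query_skipping are hypothetical, and n is the length of the array.

```python
from collections import defaultdict
from heapq import merge

def build_positions(A):
    """Preprocessing: map each value to the sorted list of its indices."""
    pos = defaultdict(list)
    for i, x in enumerate(A):
        pos[x].append(i)
    return pos

def query_skipping(pos, n, a, b):
    """Count subarrays with equal numbers of a and b, touching only their positions."""
    events = merge(((p, +1) for p in pos.get(a, ())),
                   ((p, -1) for p in pos.get(b, ())))
    freq = defaultdict(int)
    t = 0      # current discrepancy
    start = 0  # first prefix position not yet counted
    for p, delta in events:
        # Prefix positions start..p all share discrepancy t.
        freq[t] += p - start + 1
        t += delta
        start = p + 1
    freq[t] += n - start + 1  # remaining prefix positions start..n
    return sum(c * (c - 1) // 2 for c in freq.values())
```

Each query then costs time proportional to the number of occurrences of a and b rather than to the whole array.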

Is it possible to query number of distinct integers in a range in O(lg N)?

I have read through some tutorials about two common data structures which can achieve range update and query in O(lg N): Segment tree and Binary Indexed Tree (BIT / Fenwick Tree).
Most of the examples I have found are about some associative and commutative operation like "Sum of integers in a range", "XOR integers in a range", etc.
I wonder if these two data structures (or any other data structure / algorithm, please propose) can achieve the query below in O(lg N)? (If not, how about O(sqrt N)?)
Given an array of integers A, query the number of distinct integers in a range [l,r]
PS: Assuming the number of available integer is ~ 10^5, so used[color] = true or bitmask is not possible
For example: A = [1,2,3,2,4,3,1], query([2,5]) = 3, where the range index is 0-based.
Yes, this is possible to do in O(log n), even if you should answer queries online. However, this requires some rather complex techniques.
First, let's solve the following problem: given an array, answer the queries of form "how many numbers <= x are there within indices [l, r]". This is done with a segment-tree-like structure which is sometimes called Merge Sort Tree. It is basically a segment tree where each node stores a sorted subarray. This structure requires O(n log n) memory (because there are log n layers and each of them requires storing n numbers). It is built in O(n log n) as well: you just go bottom-up and for each inner vertex merge sorted lists of its children.
Here is an example. Say 1 5 2 6 8 4 7 1 be an original array.
|1 1 2 4 5 6 7 8|
|1 2 5 6|1 4 7 8|
|1 5|2 6|4 8|1 7|
|1|5|2|6|8|4|7|1|
Now you can answer those queries in O(log^2 n) time: just make a regular query to the segment tree (traversing O(log n) nodes) and do a binary search in each visited node to find how many numbers <= x it contains (an additional O(log n) factor).
This can be sped up to O(log n) using the fractional cascading technique, which basically allows you to do the binary search only in the root rather than in each node. However, it is too complex to describe in this post.
Now we return to the original problem. Assume you have an array a_1, ..., a_n. Build another array b_1, ..., b_n, where b_i = index of the next occurrence of a_i in the array, or ∞ if it is the last occurrence.
Example (1-indexed):
a = 1 3 1 2 2 1 4 1
b = 3 ∞ 6 5 ∞ 8 ∞ ∞
Now let's count the distinct numbers in [l, r]. For each unique number we'll count its last occurrence in the segment. With the b_i notation you can see that an occurrence at index i in [l, r] is the last one in the segment if and only if b_i > r. So the problem boils down to "how many numbers > r are there among b_l, ..., b_r", which is trivially reduced to what I described above.
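A compact Python sketch of this reduction, without fractional cascading, so each query costs O(log^2 n). The names next_occurrence, MergeSortTree and count_distinct are hypothetical. Indices are 0-based, and len(a) plays the role of ∞.

```python
from bisect import bisect_right
from heapq import merge

def next_occurrence(a):
    """b[i] = index of the next occurrence of a[i], or len(a) if there is none."""
    inf = len(a)
    b = [inf] * len(a)
    seen = {}
    for i in range(len(a) - 1, -1, -1):
        b[i] = seen.get(a[i], inf)
        seen[a[i]] = i
    return b

class MergeSortTree:
    def __init__(self, data):
        self.n = len(data)
        self.tree = [[] for _ in range(2 * self.n)]
        for i, x in enumerate(data):
            self.tree[self.n + i] = [x]
        # Build bottom-up: every inner node merges its children's sorted lists.
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = list(merge(self.tree[2 * i], self.tree[2 * i + 1]))

    def count_greater(self, l, r, x):
        """Number of indices i in [l, r] with data[i] > x."""
        res = 0
        l += self.n
        r += self.n + 1
        while l < r:
            if l & 1:
                res += len(self.tree[l]) - bisect_right(self.tree[l], x)
                l += 1
            if r & 1:
                r -= 1
                res += len(self.tree[r]) - bisect_right(self.tree[r], x)
            l >>= 1
            r >>= 1
        return res

def count_distinct(tree, l, r):
    """tree must be built over next_occurrence(a); counts distinct values in a[l..r]."""
    return tree.count_greater(l, r, r)
```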
Hope it helps.
If you're willing to answer queries offline, then plain old Segment Trees/ BIT can still help.
1. Sort queries based on r values.
2. Make a Segment Tree for range sum queries [0, n]
3. For each value in input array from left to right:
   a. Increment by 1 at current index i in the segment tree.
   b. For current element, if it's been seen before, decrement by 1 in the segment tree at its previous position.
   c. Answer queries ending at current index i, by querying for sum in range [l, r == i].
The idea in short is to keep marking rightward indexes, the latest occurrence of each individual element, and setting previous occurrences back to 0. The sum of range would give the count of unique elements.
Overall time complexity is again O(n log n).
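The offline procedure can be sketched in Python as follows. As an assumption, a Fenwick tree (BIT) is used for the range sums, and the function name distinct_offline is hypothetical. Queries are 0-based inclusive (l, r) pairs, and answers come back in input order.

```python
def distinct_offline(arr, queries):
    n = len(arr)
    tree = [0] * (n + 1)  # Fenwick tree over indices 0..n-1

    def update(i, delta):
        i += 1
        while i <= n:
            tree[i] += delta
            i += i & -i

    def prefix_sum(i):
        """Sum of marks at indices 0..i (returns 0 for i == -1)."""
        i += 1
        s = 0
        while i > 0:
            s += tree[i]
            i -= i & -i
        return s

    # Process queries in order of their right endpoint.
    order = sorted(range(len(queries)), key=lambda k: queries[k][1])
    answers = [0] * len(queries)
    last = {}  # value -> index of its latest occurrence so far
    qi = 0
    for i, x in enumerate(arr):
        if x in last:
            update(last[x], -1)  # unmark the previous occurrence
        last[x] = i
        update(i, +1)            # mark the latest occurrence
        while qi < len(order) and queries[order[qi]][1] == i:
            l, r = queries[order[qi]]
            answers[order[qi]] = prefix_sum(r) - prefix_sum(l - 1)
            qi += 1
    return answers
```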
There is a well-known offline method to solve this problem. If you have an array of size n and q queries on it, and each query asks for the count of distinct numbers in a range, then you can solve the whole thing in O(n log n + q log n) time, which amounts to answering every query in O(log n) time.
Let's solve the problem using the RSQ( Range sum query) technique. For the RSQ technique, you can use a segment tree or BIT. Let's discuss the segment tree technique.
To solve this problem you need an offline technique and a segment tree. What is an offline technique? In problem solving, it means reading all the queries first, reordering them so that they can be answered correctly and easily, and finally outputting the answers in the given input order.
Solution Idea:
1. First, read the input for a test case and store the given n numbers in an array, array[]. Then read the q queries and store them in a vector v, where every element of v holds three fields: l, r and idx. Here l is the start point of the query, r is its endpoint, and idx is the query's position in the input order (for example, this one is the n-th query).
2. Sort the vector v by the endpoint of each query.
3. Keep a segment tree which can store the information of at least 10^5 elements, and an array last[100005] which stores the last position of each number in array[]. Initially, all elements of the tree are zero and all elements of last are -1.
4. Run a loop over array[]. Inside the loop, check the following for every index i of array[]:
   a. If last[array[i]] is -1, set last[array[i]]=i and call the segment tree's update() function to add +1 at position last[array[i]].
   b. If last[array[i]] is not -1, call update() to add -1 at position last[array[i]]. Then store the current position as the last position for future use: set last[array[i]]=i and call update() to add +1 at position last[array[i]].
5. Check whether a query ends at the current index, that is, if(v[current].r==i). If so, call the segment tree's query() function, which returns the sum of the range v[current].l to v[current].r, and store the result at index v[current].idx of the answer[] array. Then increment current by 1. Repeat this check while further queries end at the same index.
6. Now print the answer[] array, which contains your final answer in the given input order.
the complexity of the algorithm is O(n log n).
The given problem can also be solved using Mo's (offline) algorithm also called Square Root decomposition algorithm.
Overall time complexity is O(N*SQRT(N)).
Refer mos-algorithm for detailed explanation, it even has complexity analysis and a SPOJ problem that can be solved with this approach.
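For orientation, a small Python sketch of Mo's algorithm applied to this problem. The function name distinct_mo and the block size of roughly sqrt(n) are assumptions for illustration. Queries are 0-based inclusive (l, r) pairs, answered in input order.

```python
import math

def distinct_mo(arr, queries):
    n = len(arr)
    block = max(1, int(math.sqrt(n)))
    # Sort queries by block of the left endpoint, then by right endpoint.
    order = sorted(range(len(queries)),
                   key=lambda k: (queries[k][0] // block, queries[k][1]))
    counts = {}
    distinct = 0
    cur_l, cur_r = 0, -1  # current window arr[cur_l..cur_r], initially empty
    answers = [0] * len(queries)

    def add(i):
        nonlocal distinct
        counts[arr[i]] = counts.get(arr[i], 0) + 1
        if counts[arr[i]] == 1:
            distinct += 1

    def remove(i):
        nonlocal distinct
        counts[arr[i]] -= 1
        if counts[arr[i]] == 0:
            distinct -= 1

    for k in order:
        l, r = queries[k]
        # Grow the window first, then shrink it, so it never becomes invalid.
        while cur_l > l:
            cur_l -= 1
            add(cur_l)
        while cur_r < r:
            cur_r += 1
            add(cur_r)
        while cur_l < l:
            remove(cur_l)
            cur_l += 1
        while cur_r > r:
            remove(cur_r)
            cur_r -= 1
        answers[k] = distinct
    return answers
```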
kd-trees provide range queries in O(logn), where n is the number of points.
If you want faster query than a kd-tree, and you are willing to pay the memory cost, then Range trees are your friends, offering a query of:
O(log^d n + k)
where n is the number of points stored in the tree, d is the dimension of each point and k is the number of points reported by a given query.
Bentley is an important name when it comes to this field. :)

Preprocess-Query to find number of pairs containing a number X

Formally, we are given N pairs of rational numbers. We want to preprocess this data so as to answer queries like "Find the number of pairs which contain a given rational number X".
By 'a pair contains X' I mean that [2,5] contains 3, and so on.
At worst, the expected time for each query should be O(log N) or O(sqrt(N)) (or anything similar that is better than O(N)), and preprocessing should be at worst O(N^2).
My approach:
I tried sorting the pairs, first by the first number and breaking ties by the second number [the first number in a pair < the second number in a pair]. Applying a lower_bound form of binary search then reduces the search space, but I can't apply another binary search within this search space, since the pairs are sorted primarily by the first number. So after reducing the search space I have to check linearly, which is again O(N) per query in the worst case.
First you should try to make the ranges disjoint. For example ranges [1 5],[2 6],[3 7] will result in disjoint ranges of [1 2],[2 3],[3 5],[5 6],[6 7] and for each range you should save in how many original ranges it was present. Like this
1-------5                 // original ranges
  2-------6
    3-------7
1-2, 2-3, 3-5, 5-6, 6-7   // disjoint ranges
 1    2    3    2    1    // number of original ranges covering each disjoint range
You can do this with a sweep line algorithm in O(NlogN). After that you can use the method you described: sort the disjoint ranges by their start, and for each query find the lower_bound of Xi and print the presence count of that range. For example, in this case, if the query is 4 you can find the range 3-5 by a binary search, and the result is 3 because the presence count of range 3-5 is 3.
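A hedged Python sketch of the sweep plus binary search. The function names preprocess and count_containing are hypothetical, and the ranges are assumed to be closed, so a separate count is kept for query points that fall exactly on a breakpoint.

```python
from bisect import bisect_right
from collections import Counter

def preprocess(ranges):
    """Sweep over the sorted endpoints of closed ranges [a, b]."""
    starts = Counter(a for a, _ in ranges)
    ends = Counter(b for _, b in ranges)
    coords = sorted(set(starts) | set(ends))
    at_point = []  # number of ranges containing the breakpoint itself
    in_gap = []    # number of ranges covering the open gap after the breakpoint
    open_count = 0
    for c in coords:
        open_count += starts[c]
        at_point.append(open_count)
        open_count -= ends[c]
        in_gap.append(open_count)
    return coords, at_point, in_gap

def count_containing(pre, x):
    """Number of original ranges that contain x, in O(log N)."""
    coords, at_point, in_gap = pre
    i = bisect_right(coords, x) - 1
    if i < 0:
        return 0  # x lies to the left of every range
    return at_point[i] if coords[i] == x else in_gap[i]
```

Because only comparisons are used, the same code works for rational inputs such as fractions.Fraction values.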

Length of union of ranges

I need to find length of union of ranges in one dimension coordinate system. I have many ranges of form [a_i,b_i], and I need to find the length of union of these ranges. The ranges can be dynamically added or removed and can be queried at any state for length of union of ranges.
For example, if the ranges are:
[0-4]
[3-6]
[8-10]
The output should be 8.
Is there any suitable data structure for the purpose with following upper bounds on complexity:
Insertion - O(log N)
Deletion - O(log N)
Query - O(log N)
For a moment, assume you have a sorted array, containing both start and end points, with the convention that a start point precedes an end point with the same coordinate. With your example, the array will contain
0:start, 3:start, 4:end, 6:end, 8:start, 10:end
(if there was an interval ending at 3, then 3:start will precede 3:end)
To do a query, perform a sweep from left to right, incrementing a counter on "start" and decrementing it on "end". Record as S the place where the counter increments from 0, and record as E the place where the counter returns to zero. At this point, add the length E - S to the running total. This is also the point where you can replace the preceding intervals with the single interval [S, E].
Now, if you need O(log n) complexity for insertion/deletion, instead of in an array, you store the same elements (pairs of coordinate and start or end flag) in a balanced binary tree.
The sweep is then performed according to the inorder traversal.
The query itself stays O(n) complexity.
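The sweep can be sketched in Python like this. The function name union_length is hypothetical, and a plain sorted list stands in for the inorder traversal of the balanced tree.

```python
def union_length(intervals):
    """Total length covered by the union of closed intervals [a, b]."""
    events = []
    for a, b in intervals:
        events.append((a, 0))  # 0 = start; sorts before an end at the same coordinate
        events.append((b, 1))  # 1 = end
    events.sort()
    total = 0
    depth = 0     # number of currently open intervals
    start = None  # S: where the counter last rose from 0
    for coord, kind in events:
        if kind == 0:
            if depth == 0:
                start = coord
            depth += 1
        else:
            depth -= 1
            if depth == 0:
                total += coord - start  # E - S for the merged interval
    return total
```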
It's not quite O(lg n), but would an interval tree or segment tree suit your needs? You can keep the length of union in a variable, and when inserting or removing an interval, you can find in O(lg n + m) time what other m intervals intersect it, and then use that information to update the length variable in O(m) time.
Maintain a frequency array. Ex: If your ranges are (0,2) and (1,3), your frequency array should be [1, 2, 2, 1]. Also maintain a count of the non-zero elements in the frequency array.
For insertion, increment the frequencies corresponding to that range. Update the count when a frequency increases from 0 to 1 (but not from 1 to 2, etc.).
For deletion, decrement the frequencies and update the count similarly, when a frequency drops from 1 to 0.
For query, output count.
The complexity of insertion and deletion is the length of the range; the query is O(1).
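A minimal Python sketch of the frequency-array idea. The class name CoverageCounter is hypothetical, and coordinates are assumed to be small non-negative integers. As in the example above, a range (a, b) marks the integer points a..b inclusive, so the query returns the number of covered points.

```python
class CoverageCounter:
    def __init__(self, size):
        self.freq = [0] * size  # freq[i] = number of ranges covering point i
        self.covered = 0        # number of points with freq > 0

    def insert(self, a, b):
        for i in range(a, b + 1):
            if self.freq[i] == 0:
                self.covered += 1  # point becomes covered: 0 -> 1
            self.freq[i] += 1

    def delete(self, a, b):
        for i in range(a, b + 1):
            self.freq[i] -= 1
            if self.freq[i] == 0:
                self.covered -= 1  # point becomes uncovered: 1 -> 0

    def query(self):
        return self.covered
```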

Resources