How to effectively answer range queries in an array of integers? - algorithm

How to effectively answer range queries in an array of integers?
Queries are of one type only, which is, given a range [a,b], find the sum of elements that are less than x (here x is a part of each query, say of the form a b x).
Initially, I tried to literally go from a to b, check whether the current element is less than x, and add it up. But this is very inefficient, as each query costs O(n).
Now I am trying segment trees, sorting the numbers while merging. But my challenge is that if I sort, I lose the integers' relative order. So when a query comes, I cannot use the sorted array to get the values from a to b.

Here are two approaches to solving this problem with segment trees:
Approach 1
You can use a segment tree of sorted arrays.
As usual, the segment tree divides your array into a series of subranges of different sizes. For each subrange you store a sorted list of the entries plus a cumulative sum of the sorted list. You can then use binary search to find the sum of entries below your threshold value in any subrange.
When given a query, you first work out the O(log(n)) subranges that cover your [a,b] range. For each of these you use an O(log(n)) binary search. Overall this is O(q log^2(n)) complexity to answer q queries (plus the preprocessing time).
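A minimal sketch of Approach 1, assuming 0-based inclusive indices and an iterative (bottom-up) segment tree layout. The class and method names (SortedSumTree, sum_below) are made up for illustration and are not from the answer.

```python
from bisect import bisect_left
from itertools import accumulate


class SortedSumTree:
    """Segment tree whose nodes keep a sorted copy of their subrange plus prefix sums."""

    def __init__(self, values):
        self.n = len(values)
        self.sorted_vals = [[] for _ in range(2 * self.n)]
        for i, v in enumerate(values):
            self.sorted_vals[self.n + i] = [v]
        # Each internal node merges its two children; sorted() on two
        # already-sorted runs is effectively a linear merge in CPython.
        for node in range(self.n - 1, 0, -1):
            self.sorted_vals[node] = sorted(
                self.sorted_vals[2 * node] + self.sorted_vals[2 * node + 1]
            )
        # prefix[node][k] = sum of the k smallest entries stored in that node.
        self.prefix = [[0] + list(accumulate(s)) for s in self.sorted_vals]

    def _node_sum(self, node, x):
        # Binary search for how many entries of this node are < x.
        k = bisect_left(self.sorted_vals[node], x)
        return self.prefix[node][k]

    def sum_below(self, a, b, x):
        """Sum of values[i] for a <= i <= b with values[i] < x."""
        total = 0
        lo, hi = a + self.n, b + self.n + 1  # half-open range over the leaves
        while lo < hi:
            if lo & 1:
                total += self._node_sum(lo, x)
                lo += 1
            if hi & 1:
                hi -= 1
                total += self._node_sum(hi, x)
            lo //= 2
            hi //= 2
        return total
```

Each query touches O(log n) nodes and does one binary search in each, which is where the log^2 factor comes from.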
Approach 2
You can use a dynamic segment tree.
A segment tree allows you to answer queries of the form "Compute sum of elements from a to b" in O(logn) time, and also to modify a single entry in O(logn).
Therefore if you start with an empty segment tree, you can reinsert the entries in increasing order. Suppose we have added all entries from 1 to 5, so our array may look like:
[0,0,0,3,0,0,0,2,0,0,0,0,0,0,1,0,0,0,4,4,0,0,5,1]
(The 0s represent entries that are bigger than 5 so haven't been added yet.)
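At this point you can answer any queries that have a threshold of 5, by asking the segment tree for the sum from a to b. The queries are therefore answered offline: sort them by threshold and answer each one as soon as exactly the entries within its threshold have been inserted.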
Overall this will cost O(nlog(n)) to add all the entries into the segment tree, O(qlog(q)) to sort the queries, and O(qlog(n)) to use the segment tree to answer the queries.
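Here is a rough sketch of Approach 2 under a few assumptions: all queries are known in advance, indices are 0-based and inclusive, and "less than x" is strict. A Fenwick (binary indexed) tree stands in for the segment tree, since only point updates and range sums are needed. The function name offline_sums_below is hypothetical.

```python
def offline_sums_below(values, queries):
    """queries is a list of (a, b, x); returns, per query, the sum of
    values[a..b] (inclusive) that are strictly less than x."""
    n = len(values)
    tree = [0] * (n + 1)  # Fenwick tree, initially "empty" (all zeros)

    def add(i, delta):
        i += 1
        while i <= n:
            tree[i] += delta
            i += i & -i

    def prefix_sum(i):
        """Sum of the inserted entries at positions [0, i)."""
        s = 0
        while i > 0:
            s += tree[i]
            i -= i & -i
        return s

    order = sorted(range(n), key=lambda i: values[i])  # positions by value
    answers = [0] * len(queries)
    ptr = 0
    # Process queries in increasing order of threshold x.
    for qi in sorted(range(len(queries)), key=lambda q: queries[q][2]):
        a, b, x = queries[qi]
        # Insert every entry smaller than x that is not in the tree yet.
        while ptr < n and values[order[ptr]] < x:
            add(order[ptr], values[order[ptr]])
            ptr += 1
        answers[qi] = prefix_sum(b + 1) - prefix_sum(a)
    return answers
```

The answers come back in the original query order, even though the queries are processed in threshold order.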

Related

Finding the medians of multiple subarrays in an unsorted array

Suppose you are given an unsorted array of integers S and a list of ranges in T, return a list of medians from each of the ranges.
For example, S = [3,6,1,5,0,0,1,-2], T = [[1,3],[0,5],[4,4]]. Return [5, 2, 0].
Is there a better approach than running Median of Medians on each range? Can we somehow precompute/cache the results?
Let me introduce you to an interesting data structure called Wavelet Tree:
You build it by looking at the bit-string representation of your integers and recursively bisecting them:
You first separate your integers into those with most significant bit (MSB) 0 and those with MSB 1. However you store the MSBs in their original order in a bitvector. Then for each of these subsets of integers, you ignore the MSB and recursively repeat this construction for the next-most significant bit.
If you repeat this down to the least significant bit, you get a tree structure like this (note that the indices are just there for illustration, you should store only the bitvectors):
You can easily see that the construction of this data structure takes O(n log N) time where n is the number of integers and N is their maximum value.
Wavelet trees have the nice property that they represent the original sequence as well as their sorted counterpart at the same time:
If you read the topmost bitvector, you get the MSBs of the input sequence. To reconstruct the next bit of the entries, you can alternate between looking in the bitvector in the root's left child (if the MSB is 0) or in the right child (if the MSB is 1). For the following bits, you can continue recursively.
If you read the leaf nodes from left to right, you get the sorted sequence.
To use a Wavelet tree efficiently, you need two fundamental operations on the bitvectors:
rank1(k) tells you how many 1s come before the kth position in the bitvector, rank0 does the same for 0s
select1(k) tells you the index of the kth 1 in the bitvector, select0 does the same for 0s
Note that there are bitvector representations that require only o(n) (small o) bits of additional storage to implement these operations in O(1)
You can utilize them as follows:
If you are looking at the first 7 in the sequence above, it has index 3. If you now want to know which index it has in the right child node, you simply call rank1(3) on the root bitvector and get 2, which is exactly the index of the first 7 in the right child
If you are at the child containing 4544 and want to know the position of the second 4 (with index 2) in the parent node containing 46754476, you call select0(2) on the parent's bitvector and get the index 5.
Now how can you implement a range median query with this? The most important realization you need to make is that finding the median of a range of size k is equivalent to selecting the k/2 th element.
The basic idea of the algorithm is similar to Quickselect: Bisect the element range and recurse only into the range containing the element you are looking for.
Let's say we want to find the median of the range starting at the second 2 (inclusive) and ending at the 1 (exclusive).
These are 7 elements, thus the median has rank 4 (fourth-smallest element) in that range.
Now using a rank0/1 call in the root bitvector at the beginning and end of this range, we find the corresponding ranges in the children of the root:
As you can see, the left range (which contains only smaller elements) has only 3 elements, thus the element with rank 4 must be contained in the right child of the root. We can now recursively search for the element with rank 4 - 3 = 1 in that right child. By recursively descending the wavelet tree until you reach a leaf, you can thus identify the median with only two rank operations (each taking O(1) time) per level of the Wavelet tree, thus the whole range median query takes O(log N) time where N is the maximum number in your input sequence.
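To make the descent concrete, here is a simplified, non-succinct sketch. It is not how SDSL does it: instead of bitvectors with O(1) rank support it stores a plain prefix-count list per node (playing the role of rank0), and it bisects the value range [lo, hi] at its midpoint rather than looking at bits. The names WaveletTree, to_left and kth_smallest are my own, and the input is assumed to be non-empty.

```python
class WaveletTree:
    def __init__(self, values, lo=None, hi=None):
        if lo is None:
            lo, hi = min(values), max(values)
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not values:
            return  # leaf (single value) or empty node
        mid = (lo + hi) // 2
        # to_left[i] = how many of the first i elements go to the left child,
        # i.e. a precomputed rank0 over this node's (implicit) bitvector.
        self.to_left = [0]
        for v in values:
            self.to_left.append(self.to_left[-1] + (v <= mid))
        self.left = WaveletTree([v for v in values if v <= mid], lo, mid)
        self.right = WaveletTree([v for v in values if v > mid], mid + 1, hi)

    def kth_smallest(self, start, end, k):
        """k-th smallest (1-based) among positions [start, end) of the input."""
        if self.lo == self.hi:
            return self.lo
        l0, r0 = self.to_left[start], self.to_left[end]
        in_left = r0 - l0  # elements of the range that went left
        if k <= in_left:
            return self.left.kth_smallest(l0, r0, k)
        # Map the range into the right child with the complementary counts.
        return self.right.kth_smallest(start - l0, end - r0, k - in_left)
```

With S = [3,6,1,5,0,0,1,-2] from the question, WaveletTree(S).kth_smallest(1, 4, 2) selects the middle element of S[1..3], which is 5. For even-sized ranges you would select both middle ranks and average them if you want the median as used in the question's example.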
If you want to see a practical implementation of these Wavelet trees, have a look at the Succinct Data Structures Library (SDSL) which implements the aforementioned bitvectors and different WT variants.

Processing "update elements" & "get min value among all element" queries efficiently

Question
You are given an array a = [a0, a1, ..., an-1] and must process Q queries. The queries are of the following two types:
Given two integers i and x, update ai to x
Find the minimum value among all elements in array
I already know the algorithm with a segment tree (range minimum query), and the time complexity is O(n log n). But that approach can also calculate the minimum value over any section, so I think there is a simpler and better-performing way to process these two types of queries.
Is there any other way to solve this?
Use an array and a minimum heap with references to the heap in the array.
The array has the elements by index (it's basically the actual array you have) and the heap is ordered by value so that the minimum is always on top. You add a reference (a pointer) from each array element to its corresponding node in the heap so you can find it easily there.
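To perform the first query you access the array at index i and set the element value to x (after index validation and all that). Then you update the node in the heap that ai points to and restore the heap property by sifting that node up or down (keeping the array's references current whenever heap nodes swap places). This costs O(log n).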
To perform the second query just get the minimum from the heap. O(1).
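A small sketch of that idea, assuming the "pointers" are kept as two index lists: heap holds array indices ordered by value, and pos[i] records where index i currently sits in the heap. The class name ArrayWithMin is just for illustration.

```python
class ArrayWithMin:
    def __init__(self, values):
        self.values = list(values)
        n = len(self.values)
        self.heap = list(range(n))  # array indices, arranged as a binary min-heap
        self.pos = list(range(n))   # pos[i] = position of index i inside heap
        for k in range(n // 2 - 1, -1, -1):
            self._sift_down(k)

    def _swap(self, p, q):
        h = self.heap
        h[p], h[q] = h[q], h[p]
        self.pos[h[p]] = p  # keep the array -> heap references current
        self.pos[h[q]] = q

    def _sift_up(self, k):
        while k > 0:
            parent = (k - 1) // 2
            if self.values[self.heap[k]] >= self.values[self.heap[parent]]:
                break
            self._swap(k, parent)
            k = parent

    def _sift_down(self, k):
        n = len(self.heap)
        while True:
            smallest = k
            for child in (2 * k + 1, 2 * k + 2):
                if child < n and self.values[self.heap[child]] < self.values[self.heap[smallest]]:
                    smallest = child
            if smallest == k:
                return
            self._swap(k, smallest)
            k = smallest

    def update(self, i, x):
        """Query type 1: set a[i] = x in O(log n)."""
        self.values[i] = x
        self._sift_up(self.pos[i])    # moves up if x got smaller
        self._sift_down(self.pos[i])  # moves down if x got larger

    def minimum(self):
        """Query type 2: minimum of the whole array in O(1)."""
        return self.values[self.heap[0]]
```

Only one of the two sift calls in update actually moves the node; the other is a no-op.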

IOI Qualifier INOI task 2

I can't figure out how to solve question 2 in the following link in an efficient manner:
http://www.iarcs.org.in/inoi/2012/inoi2012/inoi2012-qpaper.pdf
You can do this in O(n log n) time. (Or linear if you really care to.) First, pad the input array out to the next power of two using some really big negative number. Now, build an interval tree-like data structure; recursively partition your array by dividing it in half. Each node in the tree represents a subarray whose length is a power of two and which begins at a position that is a multiple of its length, and each nonleaf node has a "left half" child and a "right half" child.
Compute, for each node in your tree, what happens when you add 0,1,2,3,... to that subarray and take the maximum element. Notice that this is trivial for the leaves, which represent subarrays of length 1. For internal nodes, this is simply the maximum of the left child with length/2 + right child. So you can build this tree in linear time.
Now we want to run a sequence of n queries on this tree and print out the answers. The queries are of the form "what happens if I add k,k+1,k+2,...n,1,...,k-1 to the array and report the maximum?"
Notice that, when we add that sequence to the whole array, the break between n and 1 either occurs at the beginning/end, or smack in the middle, or somewhere in the left half, or somewhere in the right half. So, partition the array into the k,k+1,k+2,...,n part and the 1,2,...,k-1 part. If you identify all of the nodes in the tree that represent subarrays lying completely inside one of the two sequences but whose parents either don't exist or straddle the break-point, you will have O(log n) nodes. You need to look at their values, add various constants, and take the maximum. So each query takes O(log n) time.
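The following sketch implements only the query as formulated in this answer (add k, k+1, ..., n, 1, ..., k-1 to the array and report the maximum); it does not attempt to parse the actual task input. It assumes padding with negative infinity instead of "some really big negative number", and the function name rotation_maxima is hypothetical.

```python
def rotation_maxima(a):
    """For k = 1..n, return max(a[i] + s[i]) where s = k, k+1, ..., n, 1, ..., k-1."""
    n = len(a)
    size = 1
    while size < n:
        size *= 2
    NEG = float("-inf")  # padding value for positions beyond the input
    # tree[node] = max over the node's subarray of a[i] + (offset of i within the node)
    tree = [NEG] * (2 * size)

    def build(node, lo, hi):  # node covers positions [lo, hi)
        if hi - lo == 1:
            tree[node] = a[lo] if lo < n else NEG
            return
        mid = (lo + hi) // 2
        build(2 * node, lo, mid)
        build(2 * node + 1, mid, hi)
        # The right half's offsets start at length/2 instead of 0.
        tree[node] = max(tree[2 * node], (mid - lo) + tree[2 * node + 1])

    def query(node, lo, hi, l, r):
        """max of a[i] + (i - l) for i in [l, r), or NEG if the range is empty."""
        if r <= lo or hi <= l:
            return NEG
        if l <= lo and hi <= r:
            return tree[node] + (lo - l)  # shift the node's offsets by a constant
        mid = (lo + hi) // 2
        return max(query(2 * node, lo, mid, l, r),
                   query(2 * node + 1, mid, hi, l, r))

    build(1, 0, size)
    result = []
    for k in range(1, n + 1):
        split = n - k + 1  # positions [0, split) receive k..n, the rest 1..k-1
        head = k + query(1, 0, size, 0, split)
        tail = 1 + query(1, 0, size, split, n)
        result.append(max(head, tail))
    return result
```

Each of the n queries decomposes into O(log n) fully covered nodes, matching the O(n log n) total described above.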

Finding closest number in a range

I thought of a problem which is as follows:
We have an array A of integers of size n and t test cases. In every test case we are given a number m and a range [s,e], i.e. we are given s and e, and we have to find the number closest to m in that range of the array (A[s]..A[e]).
You may assume array indexes are from 1 to n.
For example:
A = {5, 12, 9, 18, 19}
m = 13
s = 4 and e = 5
So the answer should be 18.
Constraints:
n<=10^5
t<=n
All I can think of is an O(n) solution for every test case, and I think a better solution exists.
This is a rough sketch:
Create a segment tree from the data. At each node, besides the usual data like left and right indices, you also store the numbers found in the sub-tree rooted at that node, stored in sorted order. You can achieve this when you construct the segment tree in bottom-up order. In the node just above the leaf, you store the two leaf values in sorted order. In an intermediate node, you keep the numbers in the left child, and right child, which you can merge together using standard merging. There are O(n) nodes in the tree, and keeping this data should take overall O(nlog(n)).
Once you have this tree, for every query, walk down the path till you reach the appropriate node(s) in the given range ([s, e]). As the tutorial shows, one or more different nodes would combine to form the given range. As the tree depth is O(log(n)), reaching these nodes takes O(log(n)) time per query, and there are O(log(n)) of them. For each node that lies completely inside the range, find the closest number using binary search in the sorted array stored in that node; this is again O(log(n)) per node. Find the closest among all these, and that is the answer. Thus, you can answer each query in O(log^2(n)) time.
The tutorial I link to contains other data structures, such as sparse table, which are easier to implement, and should give O(sqrt(n)) per query. But I haven't thought much about this.
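As a rough illustration of the segment tree sketch above, assuming 1-based inclusive [s, e] as in the question and an iterative tree layout. The helper names build_sorted_tree and closest_in_range are made up; when two values are equally close, whichever is found first is returned.

```python
from bisect import bisect_left


def build_sorted_tree(values):
    """Segment tree where every node stores its subrange in sorted order."""
    n = len(values)
    tree = [[] for _ in range(2 * n)]
    for i, v in enumerate(values):
        tree[n + i] = [v]
    for node in range(n - 1, 0, -1):
        # Merge the two (already sorted) children.
        tree[node] = sorted(tree[2 * node] + tree[2 * node + 1])
    return tree


def closest_in_range(tree, s, e, m):
    """Value closest to m among A[s..e] (1-based, inclusive)."""
    n = len(tree) // 2
    best = None
    lo, hi = s - 1 + n, e + n  # half-open range over the leaves
    while lo < hi:
        nodes = []
        if lo & 1:
            nodes.append(lo)
            lo += 1
        if hi & 1:
            hi -= 1
            nodes.append(hi)
        for node in nodes:
            arr = tree[node]
            k = bisect_left(arr, m)
            # Only the neighbours of the insertion point can be closest.
            for cand in arr[max(k - 1, 0):k + 1]:
                if best is None or abs(cand - m) < abs(best - m):
                    best = cand
        lo //= 2
        hi //= 2
    return best
```

On the example from the question, closest_in_range(build_sorted_tree([5, 12, 9, 18, 19]), 4, 5, 13) gives 18.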
Sort the array and do a binary search. Complexity: O(n log n + t log n).
I'm fairly sure no faster solution exists. A slight variation of your problem is:
There is no array A, but each test case contains an unsorted array of numbers to search. (The array slice of A from s to e).
In that case, there is clearly no better way than a linear search for each test case.
Now, in what way is your original problem more specific than the variation above? The only added information is that all the slices come from the same array. I don't think that this additional constraint can be used for an algorithmic speedup.
EDIT: I stand corrected. The segment tree data structure should work.

search for interval overlap in list of intervals?

Say [a,b] represents the interval on the real line from a to b, a < b, inclusive (ie, [a,b] = set of all x such that a<=x<=b). Also, say [a,b] and [c,d] are 'overlapping' if they share any x such that x is in both [a,b] and [c,d].
Given a list of intervals, ([x1,y1],[x2,y2],...), what is the most efficient way to find all such intervals that overlap with [x,y]?
Obviously, I can try each and get it in O(n). But I was wondering if I could sort the list of intervals in some clever way, I could find /one/ overlapping item in O(log N) via a binary search, and then 'look around' from that position in the list to find all overlapping intervals. However, how do I sort intervals such that such a strategy would work?
Note that there may be overlaps between elements in the list items itself, which is what makes this hard.
I've tried it by sorting intervals by their left end, right end, middle, but none seem to lead to an exhaustive search.
Help?
For completeness' sake, I'd like to add that there is a well-known data structure for just this sort of problem, known (surprise, surprise) as an interval tree. It's basically an augmented balanced tree (red-black, AVL, your pick) that stores intervals sorted by their left (low) endpoint. The augmentation is that each node stores the largest right (high) endpoint in its subtree. This tree allows you to find one overlapping interval in O(log n) time, and all k overlapping intervals in O(min(n, k log n)) time.
It's described in CLRS 14.3.
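A simplified sketch of the augmentation, assuming a static set of closed intervals: the tree is built once from a sorted list, so it is balanced without any red-black or AVL rotations, and it does not support insertion. The names IntervalNode and find_overlaps are my own, not CLRS's.

```python
class IntervalNode:
    """Node of a static, balanced interval tree keyed on the low endpoint."""

    def __init__(self, intervals):
        # `intervals` must be a non-empty list of (low, high) sorted by low.
        mid = len(intervals) // 2
        self.low, self.high = intervals[mid]
        self.left = IntervalNode(intervals[:mid]) if mid > 0 else None
        rest = intervals[mid + 1:]
        self.right = IntervalNode(rest) if rest else None
        # Augmentation: largest high endpoint anywhere in this subtree.
        self.max_high = self.high
        for child in (self.left, self.right):
            if child is not None:
                self.max_high = max(self.max_high, child.max_high)


def find_overlaps(node, x, y, out):
    """Append every stored interval that overlaps the closed interval [x, y]."""
    if node is None or node.max_high < x:
        return out  # nothing in this subtree reaches as far as x
    find_overlaps(node.left, x, y, out)
    if node.low <= y and x <= node.high:
        out.append((node.low, node.high))
    if node.low <= y:
        # Everything to the right starts at node.low or later.
        find_overlaps(node.right, x, y, out)
    return out
```

Usage would look like root = IntervalNode(sorted(intervals)) followed by find_overlaps(root, x, y, []).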
[a, b] overlaps with [x, y] iff a <= y and b >= x (the intervals are closed, so touching endpoints count). Sorting intervals by their first elements gives you the intervals matching the first condition in log time. Sorting intervals by their last elements gives you the intervals matching the second condition in log time. Take the intersection of the resulting sets.
A 'quadtree' is a data structure often used to improve the efficiency of collision detection in 2 dimensions.
I think you could come up with a similar 1-d structure. This would require some pre-computation but should result in O(log N) performance.
Basically you start with a root 'node' that covers all possible intervals, and when adding a node to the tree, you decide if it falls on the left or the right of the midpoint. If it crosses the mid point, you break it into two intervals (but record the original parent) and recursively proceed from there. You can set a limit on the depth of the tree, which can save memory and improve performance, but comes at the expense of complicating things a little (you need to store a list of intervals in your nodes).
Then when checking an interval, you basically find all leaf nodes that it would be inserted into were it inserted, check the partial intervals within those nodes for intersection, and then report the interval that is recorded against them as the 'original' parent.
Just a quick thought 'off the cuff' so to speak.
Could you organize them into 2 lists, one for start of intervals and the other for end of intervals.
This way, you can compare y to the items in the start of interval list (say by binary search) to cut down the candidates based on that.
You can then compare x to the items in the end of interval list.
EDIT
Case: Once Off
If you are comparing only a single interval to the list of intervals in a once-off situation, I don't believe sorting will help you out, since sorting itself costs at least O(n).
By doing a linear search through all x's to trim out any impossible intervals then doing another linear search through the remaining y's you can reduce your total work. While this is still O(n), without this you would be doing 2n comparisons, whereas on average, you would only do (3n-1)/2 comparisons this way.
I believe this is the best you can do for an unsorted list.
Case: Pre-sorting doesn't count
In the case where you will be repeatedly comparing single intervals to this list of intervals and you pre-sort your list, you can achieve better results. The process above still applies, but by doing a binary search on the first list and then the second you can get O(m log n) as opposed to O(mn), where m is the number of single intervals being compared. Note, this still gives you the advantage of reducing total comparisons. [2m log n compared to m(3(log n) - 1)/2]
You could sort by both left end and right end at the same time and use both lists to eliminate non-overlapping values. If the list is sorted by the left end then none of the intervals to the right of the right end of the test range can overlap. If the list is sorted by the right end then none of the intervals to the left of the left end of the test range can overlap.
For example if the intervals are
[1,4], [3,6], [4,5], [2,8], [5,7], [1,2], [2,2.5]
and you're finding overlap with [3,4] then sorting by left end and marking position of the right end of the test (with the right end as just greater than its value so that 4 is included in the range)
[1,4], [1,2], [2,2.5], [2,8], [3,6], [4,5], *, [5,7]
you know [5,7] can't overlap, then sorting by right end and marking position of the left end of the test
[1,2], [2,2.5], *, [1,4], [4,5], [3,6], [5,7], [2,8]
you know [1,2] and [2,2.5] can't overlap
Not sure how efficient this would be since you're having to do two sorts and searches.
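For what it's worth, here is how the two-sort idea might look, assuming closed intervals given as (left, right) tuples. The function name overlaps_by_two_sorts is made up, and for brevity the sorting happens inside the function; in practice you would sort once and reuse both lists across queries.

```python
from bisect import bisect_left, bisect_right


def overlaps_by_two_sorts(intervals, x, y):
    """Intervals overlapping the closed interval [x, y], in their original order."""
    ids = range(len(intervals))
    by_left = sorted(ids, key=lambda i: intervals[i][0])
    by_right = sorted(ids, key=lambda i: intervals[i][1])
    lefts = [intervals[i][0] for i in by_left]
    rights = [intervals[i][1] for i in by_right]
    # Everything before the marker in the left-sorted list has left end <= y.
    starts_early_enough = set(by_left[:bisect_right(lefts, y)])
    # Everything from the marker on in the right-sorted list has right end >= x.
    ends_late_enough = set(by_right[bisect_left(rights, x):])
    keep = starts_early_enough & ends_late_enough
    return [intervals[i] for i in sorted(keep)]
```

On the example intervals with the test range [3,4] this leaves [1,4], [3,6], [4,5] and [2,8]. The two binary searches are O(log n), but building and intersecting the candidate sets can still take O(n) in the worst case.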
As you can see in other answers, most algorithms come together with a special data structure. For example, for an unsorted list of intervals as input, O(n) is the best you'll get. (And usually it's easier to think in terms of the data structure that dictates the algorithm.)
In this case, your question is not complete:
Are you given the whole list, or is it you who actually creates it?
Do you have to perform just one such lookup or many of them?
Do you have any estimations for operations it should support and their frequencies?
For example, if you have to perform just one such lookup, then it's not worth sorting the list beforehand. If many, then the more expensive sorting or generation of a "1D quadtree" would be amortized.
However, it would be difficult to solve it that way, because a simple quadtree (as I understand it) is only able to detect the collision; it's not able to create the list of all the segments that overlap with your input.
One simple implementation would be a list, ordered by coordinate, into which you insert all the segment endpoints, each flagged as a start or an end and tagged with its segment number. You then scan it (still O(n), but I doubt you can make it faster if you also need the list of all the overlapping segments), keeping track of all the segments that have been opened but not yet closed at the "check points".
