Search for interval overlap in a list of intervals? - algorithm

Say [a,b] represents the interval on the real line from a to b, a < b, inclusive (i.e., [a,b] = the set of all x such that a <= x <= b). Also, say [a,b] and [c,d] are 'overlapping' if they share any x, i.e., there is some x in both [a,b] and [c,d].
Given a list of intervals, ([x1,y1],[x2,y2],...), what is the most efficient way to find all such intervals that overlap with [x,y]?
Obviously, I can try each one and get all overlaps in O(n). But I was wondering: if I sorted the list of intervals in some clever way, could I find one overlapping item in O(log n) via a binary search, and then 'look around' from that position in the list to find all the overlapping intervals? The trouble is how to sort the intervals so that such a strategy works.
Note that the intervals in the list may themselves overlap one another, which is what makes this hard.
I've tried sorting the intervals by their left end, right end, and middle, but none of these orderings seems to lead to an exhaustive search.
Help?

For completeness' sake, I'd like to add that there is a well-known data structure for just this sort of problem, known (surprise, surprise) as an interval tree. It's basically an augmented balanced tree (red-black, AVL, your pick) that stores intervals sorted by their left (low) endpoint. The augmentation is that each node stores the largest right (high) endpoint in its subtree. This tree lets you find one overlapping interval in O(log n) time, and all k overlapping intervals in O(k log n) time.
It's described in CLRS 14.3.
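To make this concrete, here is a minimal sketch of such an augmented tree in Python. For brevity it uses a plain unbalanced BST, so the logarithmic bounds only hold once you swap in a self-balancing tree; the class and function names are mine, not CLRS's.

    class Node:
        def __init__(self, lo, hi):
            self.lo, self.hi = lo, hi        # the stored interval [lo, hi]
            self.max_hi = hi                 # largest right endpoint in this subtree
            self.left = self.right = None

    def insert(root, lo, hi):
        # BST keyed on the left endpoint; maintain max_hi on the way down.
        if root is None:
            return Node(lo, hi)
        if lo < root.lo:
            root.left = insert(root.left, lo, hi)
        else:
            root.right = insert(root.right, lo, hi)
        root.max_hi = max(root.max_hi, hi)
        return root

    def find_overlapping(root, lo, hi, out):
        # Collect every stored interval overlapping [lo, hi] (closed intervals).
        if root is None or root.max_hi < lo:
            return out                       # nothing in this subtree reaches lo
        find_overlapping(root.left, lo, hi, out)
        if root.lo <= hi and lo <= root.hi:  # standard closed-interval overlap test
            out.append((root.lo, root.hi))
        if root.lo <= hi:                    # right subtree keys are all >= root.lo
            find_overlapping(root.right, lo, hi, out)
        return out

    root = None
    for lo, hi in [(1, 4), (3, 6), (5, 7)]:
        root = insert(root, lo, hi)
    print(find_overlapping(root, 3, 4, []))  # [(1, 4), (3, 6)]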

[a, b] overlaps with [x, y] iff b >= x and a <= y (the intervals are closed, so touching endpoints count). Sorting the intervals by their left endpoints lets you find all intervals satisfying a <= y with one binary search; sorting them by their right endpoints lets you find all intervals satisfying b >= x the same way. Take the intersection of the two resulting sets.
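A sketch of that recipe with Python's bisect module (the helper name is mine). One caveat: building and intersecting the two candidate sets is itself O(n) in the worst case, so only the two binary searches are log-time; also, storing the intervals in sets collapses duplicates.

    import bisect

    def overlapping(intervals, x, y):
        by_left = sorted(intervals, key=lambda iv: iv[0])    # sorted by a
        by_right = sorted(intervals, key=lambda iv: iv[1])   # sorted by b
        lefts = [iv[0] for iv in by_left]
        rights = [iv[1] for iv in by_right]
        cand1 = set(by_left[:bisect.bisect_right(lefts, y)])   # all with a <= y
        cand2 = set(by_right[bisect.bisect_left(rights, x):])  # all with b >= x
        return cand1 & cand2

    print(overlapping([(1, 4), (3, 6), (5, 7)], 3, 4))  # {(1, 4), (3, 6)} (set order may vary)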

A 'quadtree' is a data structure often used to improve the efficiency of collision detection in 2 dimensions.
I think you could come up with a similar 1-d structure. This would require some pre-computation but should result in O(log N) performance.
Basically, you start with a root 'node' that covers all possible intervals. When adding an interval to the tree, you decide whether it falls to the left or the right of the midpoint; if it crosses the midpoint, you break it into two intervals (but record the original parent) and recursively proceed from there. You can set a limit on the depth of the tree, which can save memory and improve performance, but comes at the expense of complicating things a little (you need to store a list of intervals in your nodes).
Then, when checking an interval, you find all the leaf nodes it would be inserted into were it inserted, check the partial intervals within those nodes for intersection, and report the intervals recorded against them as the 'original' parents.
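A rough sketch of that structure (all naming mine), built eagerly to a fixed depth. It assumes every interval lies within the root's range, and pieces that merely touch a midpoint boundary would need more careful handling in a real implementation.

    class Node:
        def __init__(self, lo, hi, depth, max_depth):
            self.lo, self.hi = lo, hi
            self.mid = (lo + hi) / 2
            self.leaf = depth >= max_depth
            self.pieces = []                 # (piece_lo, piece_hi, original) at leaves
            if not self.leaf:
                self.left = Node(lo, self.mid, depth + 1, max_depth)
                self.right = Node(self.mid, hi, depth + 1, max_depth)

        def insert(self, lo, hi, original):
            if self.leaf:
                self.pieces.append((lo, hi, original))
            elif hi <= self.mid:
                self.left.insert(lo, hi, original)
            elif lo >= self.mid:
                self.right.insert(lo, hi, original)
            else:                            # crosses the midpoint: break it in two
                self.left.insert(lo, self.mid, original)
                self.right.insert(self.mid, hi, original)

        def query(self, lo, hi, out):
            # Collect the 'original' parents of all pieces overlapping [lo, hi].
            if self.leaf:
                out.update(orig for (plo, phi, orig) in self.pieces
                           if plo <= hi and lo <= phi)
            else:
                if lo <= self.mid:
                    self.left.query(lo, hi, out)
                if hi >= self.mid:
                    self.right.query(lo, hi, out)
            return out

    root = Node(0, 16, 0, max_depth=4)
    for iv in [(1, 4), (3, 6), (5, 7)]:
        root.insert(iv[0], iv[1], iv)
    print(root.query(3, 4, set()))           # {(1, 4), (3, 6)} (set order may vary)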

Just a quick thought 'off the cuff' so to speak.
Could you organize them into two lists, one sorted by interval start and the other by interval end?
This way, you can compare y to the items in the start-of-interval list (say, by binary search) to cut down the candidates based on that.
You can then compare x to the items in the end-of-interval list.
EDIT
Case: Once Off
If you are comparing only a single interval to the list of intervals in a once-off situation, I don't believe sorting will help you out, since even an ideal sort costs O(n log n), more than the O(n) scan itself.
By doing a linear pass through all the x's to trim out any impossible intervals, then doing another linear pass through the remaining y's, you can reduce your total work. While this is still O(n), without it you would be doing 2n comparisons, whereas on average you would only do (3n-1)/2 comparisons this way.
I believe this is the best you can do for an unsorted list.
Case: Pre-sorting doesn't count
In the case where you will be repeatedly comparing single intervals to this list of intervals, and you pre-sort your list, you can achieve better results. The process above still applies, but by doing a binary search on the first list and then the second you can get O(m log n) as opposed to O(mn), where m is the number of single intervals being compared. Note, this still gives you the advantage of reduced total comparisons. [2m log n compared to m(3(log n) - 1)/2]
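For the once-off case, the two linear passes might look like this (a sketch; the function name is mine):

    def overlapping_once_off(intervals, x, y):
        survivors = [iv for iv in intervals if iv[0] <= y]  # pass over the starts
        return [iv for iv in survivors if iv[1] >= x]       # pass over the ends

    print(overlapping_once_off([(1, 4), (3, 6), (5, 7)], 3, 4))  # [(1, 4), (3, 6)]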

You could sort by both the left end and the right end at the same time and use both lists to eliminate non-overlapping intervals. If the list is sorted by left end, then no interval whose left end lies beyond the right end of the test range can overlap. If the list is sorted by right end, then no interval whose right end lies before the left end of the test range can overlap.
For example if the intervals are
[1,4], [3,6], [4,5], [2,8], [5,7], [1,2], [2,2.5]
and you're finding overlap with [3,4], then sorting by left end and marking the position of the test range's right end (treating the marker as just greater than its value, so that the interval starting exactly at 4 is included)
[1,4], [1,2], [2,2.5], [2,8], [3,6], [4,5], *, [5,7]
you know [5,7] can't overlap. Then, sorting by right end and marking the position of the test range's left end
[1,2], [2,2.5], *, [1,4], [4,5], [3,6], [5,7], [2,8]
you know [1,2] and [2,2.5] can't overlap
Not sure how efficient this would be since you're having to do two sorts and searches.

As you can see in other answers, most algorithms come together with a special data structure. For example, for an unsorted list of intervals as input, O(n) is the best you'll get. (And usually it's easier to think in terms of the data structure, which then dictates the algorithm.)
In this case, your question is not complete:
Are you given the whole list, or is it you who actually creates it?
Do you have to perform just one such lookup, or many of them?
Do you have any estimates of the operations it should support and their frequencies?
For example, if you have to perform just one such lookup, then it's not worth sorting the list beforehand. If many, then the more expensive sorting, or the construction of a '1-D quadtree', would be amortized.
However, it would be difficult to solve it that way, because a simple quadtree (as I understand it) can only detect a collision; it cannot produce the list of all the segments that overlap with your input.
One simple implementation would be an ordered (by coordinate) list into which you insert all the segment ends, each with a start/end flag and its segment number. You answer a query by parsing this list (still O(n), but I doubt you can make it faster if you also need the list of all the overlapping segments) while keeping track of all the segments that have opened but not yet closed at each 'check point'.
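A sketch of that ordered endpoint list (flags and names are mine). Each query walks the list once, so it stays O(n), as conceded above.

    def build_events(segments):
        events = []
        for i, (a, b) in enumerate(segments):
            events.append((a, 0, i))        # 0 = start flag
            events.append((b, 1, i))        # 1 = end flag
        events.sort()
        return events

    def query(events, x, y):
        open_ids, hits = set(), set()
        for coord, flag, i in events:
            if coord > y:
                break                       # everything further starts past y
            if flag == 0:
                open_ids.add(i)
                if coord >= x:              # starts inside [x, y]
                    hits.add(i)
            else:
                if coord >= x:              # ends inside [x, y] while open
                    hits.add(i)
                open_ids.discard(i)
        return hits | open_ids              # still open at y: they span past it

    segs = [(1, 4), (3, 6), (5, 7)]
    print(query(build_events(segs), 3, 4))  # {0, 1}, i.e. segments (1,4) and (3,6)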

Related

How to sort intervals into minimum number of bins without overlapping intervals

I'm looking for an algorithm to achieve the following:
Given an arbitrary set of "intervals", where an interval is defined simply by a start and an end (two floating-point numbers with end >= start), I would like to organise these intervals into 1 or more "bins"/"buckets"/groups such that:
No two intervals within a single bin overlap each other
The minimum number of bins is used
My solution has been to iterate the intervals and for each one, essentially perform a binary search on each bin until one is found which can accommodate the interval (new bin if necessary). This works but I'm left wondering if it can be optimised because depending on the order in which the intervals are processed, the result is different. I have a feeling that sorting the intervals from largest to smallest (end - start) gives better results but I'm not certain.
Sort the intervals by their end point, and then take each interval in end-point order, and put it into the bucket with the largest right-most end point, such that bucket.max_endpoint < interval.startpoint. If there is no such bucket, then you have to start a new one.
If you keep the buckets in a binary search tree sorted by max_endpoint, then you can find the best one in log(|buckets|) time, for O(N log N) altogether.
The proof that choosing the tightest fit interval for each bucket is optimal is simple:
Imagine that you already know an optimal assignment of intervals to buckets, and you put them into the buckets in endpoint order, and at some point you don't choose the tightest-fit bucket...
If you change your mind and choose the tightest-fit bucket instead, you'll be in the same situation, except that one of the buckets will have more room on the right. It never hurts to have more room, so the remaining intervals will still fit.
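A sketch of this greedy in Python (names are mine). The bucket endpoints live in a plain sorted list, which makes the removal O(n); substituting the binary search tree mentioned above is what gives the stated O(N log N).

    import bisect

    def pack(intervals):
        intervals = sorted(intervals, key=lambda iv: iv[1])  # end-point order
        endpoints, buckets = [], []                          # parallel, sorted lists
        for start, end in intervals:
            i = bisect.bisect_left(endpoints, start) - 1     # largest endpoint < start
            if i >= 0:
                bucket = buckets.pop(i)                      # reuse the tightest bucket
                endpoints.pop(i)
            else:
                bucket = []                                  # no fit: open a new bucket
            bucket.append((start, end))
            buckets.append(bucket)           # end >= every endpoint so far,
            endpoints.append(end)            # so both lists stay sorted
        return buckets

    print(pack([(1, 3), (2, 5), (4, 6), (5, 7)]))
    # [[(2, 5)], [(1, 3), (4, 6)], [(5, 7)]] -- three buckets, which is optimal
    # here because (2,5), (4,6) and (5,7) all contain the point 5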

How to effectively answer range queries in an array of integers?

How do you efficiently answer range queries on an array of integers?
Queries are of one type only, which is, given a range [a,b], find the sum of elements that are less than x (here x is a part of each query, say of the form a b x).
Initially, I tried to literally go from a to b, check whether the current element is less than x, and add it up. But this way is very inefficient, as the complexity is O(n) per query.
Now I am trying segment trees, sorting the numbers while merging. But my challenge now is that if I sort, I lose the integers' relative order, so when a query comes I cannot use the sorted array to get the values from a to b.
Here are two approaches to solving this problem with segment trees:
Approach 1
You can use a segment tree of sorted arrays.
As usual, the segment tree divides your array into a series of subranges of different sizes. For each subrange you store a sorted list of the entries plus a cumulative sum of the sorted list. You can then use binary search to find the sum of entries below your threshold value in any subrange.
When given a query, you first work out the O(log(n)) subranges that cover your [a,b] range. For each of these you use an O(log(n)) binary search. Overall this is O(q log^2 n) complexity to answer q queries (plus the preprocessing time).
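A sketch of Approach 1 (all names mine); for brevity each node's list is built with sorted() rather than the O(len) merge a tuned version would use.

    import bisect
    from itertools import accumulate

    def build(a):
        # tree[node] = (sorted values, prefix sums) for that node's subrange
        tree = {}
        def go(node, lo, hi):                # node covers a[lo:hi]
            if hi - lo == 1:
                vals = [a[lo]]
            else:
                mid = (lo + hi) // 2
                go(2 * node, lo, mid)
                go(2 * node + 1, mid, hi)
                vals = sorted(tree[2 * node][0] + tree[2 * node + 1][0])
            tree[node] = (vals, [0] + list(accumulate(vals)))
        go(1, 0, len(a))
        return tree

    def query(tree, n, a_lo, b_hi, x):
        # Sum of elements a[i] < x for a_lo <= i <= b_hi (0-indexed, inclusive).
        def go(node, lo, hi):
            if b_hi < lo or hi - 1 < a_lo:
                return 0
            if a_lo <= lo and hi - 1 <= b_hi:
                vals, pref = tree[node]
                return pref[bisect.bisect_left(vals, x)]  # sum of entries < x
            mid = (lo + hi) // 2
            return go(2 * node, lo, mid) + go(2 * node + 1, mid, hi)
        return go(1, 0, n)

    arr = [5, 1, 4, 2, 3]
    tree = build(arr)
    print(query(tree, len(arr), 1, 3, 4))    # arr[1..3] = [1, 4, 2]; sum of those < 4 is 3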
Approach 2
You can use a dynamic segment tree.
A segment tree allows you to answer queries of the form "compute the sum of elements from a to b" in O(log n) time, and also to modify a single entry in O(log n).
Therefore, if you start with an empty segment tree, you can insert the entries in increasing order of value. Suppose we have added all entries with values 1 to 5, so our array may look like:
[0,0,0,3,0,0,0,2,0,0,0,0,0,0,1,0,0,0,4,4,0,0,5,1]
(The 0s represent entries that are bigger than 5 so haven't been added yet.)
At this point you can answer any queries that have a threshold of 5.
Overall this will cost O(n log n) to add all the entries into the segment tree, O(q log q) to sort the queries (by threshold, so they can be answered in increasing-threshold order), and O(q log n) to use the segment tree to answer the queries.
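A sketch of Approach 2 run offline (names are mine): sort the queries by threshold, insert array entries in increasing value order, and answer range sums as you go. I've used a Fenwick (binary indexed) tree in place of the segment tree, since only point updates and range sums are needed.

    class Fenwick:
        def __init__(self, n):
            self.t = [0] * (n + 1)
        def add(self, i, v):                 # point update, 0-indexed
            i += 1
            while i < len(self.t):
                self.t[i] += v
                i += i & -i
        def prefix(self, i):                 # sum over positions 0..i
            s, i = 0, i + 1
            while i > 0:
                s += self.t[i]
                i -= i & -i
            return s
        def range_sum(self, a, b):
            return self.prefix(b) - (self.prefix(a - 1) if a > 0 else 0)

    def answer_offline(arr, queries):        # queries: list of (a, b, x)
        order = sorted(range(len(queries)), key=lambda q: queries[q][2])
        by_value = sorted(range(len(arr)), key=lambda i: arr[i])
        fen, ans, j = Fenwick(len(arr)), [0] * len(queries), 0
        for qi in order:
            a, b, x = queries[qi]
            while j < len(arr) and arr[by_value[j]] < x:   # add all entries < x
                fen.add(by_value[j], arr[by_value[j]])
                j += 1
            ans[qi] = fen.range_sum(a, b)
        return ans

    print(answer_offline([5, 1, 4, 2, 3], [(1, 3, 4), (0, 4, 5)]))  # [3, 10]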

Best sorting algorithm - Partially sorted linked list

Problem: Given a sorted doubly linked list and two numbers C and K, you need to decrease the value of the node with data K by C and insert the resulting node at its correct position, so that the list remains sorted.
I would think of insertion sort for such a problem because, at any instant, insertion sort looks like a hand of cards
that is partially sorted. For insertion sort, the number of swaps is equivalent to the number of inversions, and the number of compares is equivalent to the number of exchanges + (N-1).
So, in the given problem above, if the node with data K is decreased by C, the sorted linked list becomes partially sorted. Insertion sort is the best fit.
Another point: when selecting a sorting algorithm, if some sorting logic is the best fit for the array representation of the data, then the same sorting logic should be the best fit for the linked-list representation of the same data.
For this problem, Is my thought process correct in choosing insertion sort?
Maybe you mean something else, but insertion sort is not the best algorithm, because you actually don't need to sort anything. If there is only one element with value K then it doesn't make a big difference, but otherwise it does.
So I would suggest the following algorithm O(n), ignoring edge cases for simplicity:
Go forward in the list until the value of the current node is > K - C.
Save this node, all the reduced nodes will be inserted before this one.
Continue to go forward while the value of the current node is < K.
While the value of the current node is K, remove node, set value to K - C and insert it before the saved node. This could be optimized further, so that you only do one remove and insert operation of the whole sublist of nodes which had value K.
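A sketch of these steps, using a plain Python list to stand in for the doubly linked list (the slicing plays the role of the O(1) splice, and the three walks are the O(n) part):

    def decrease_and_reinsert(vals, K, C):
        pos = 0
        while pos < len(vals) and vals[pos] <= K - C:  # walk to the insertion point
            pos += 1
        lo = pos
        while lo < len(vals) and vals[lo] < K:         # keep walking to the K-run
            lo += 1
        hi = lo
        while hi < len(vals) and vals[hi] == K:        # the run of nodes to move
            hi += 1
        return vals[:pos] + [K - C] * (hi - lo) + vals[pos:lo] + vals[hi:]

    print(decrease_and_reinsert([1, 3, 5, 5, 8], K=5, C=3))  # [1, 2, 2, 3, 8]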
If these decrease operations can be batched up before the sorted list must be available, then you can simply remove all the decremented nodes from the list. Then, sort them, and perform a two-way merge into the list.
If the list must be maintained in order after each node decrement, then there is little choice but to remove the decremented node and re-insert in order.
Doing this with a linear search for a deck of cards is probably acceptable, unless you're running some monstrous Monte Carlo simulation involving cards that runs for hours or days, so that the optimization counts.
Otherwise the way we would deal with the need to maintain order would be to use an ordered sequence data structure: balanced binary tree (red-black, splay) or a skip list. Take the node out of the structure, adjust value, re-insert: O(log N).

IOI Qualifier INOI task 2

I can't figure out how to solve question 2 in the following link in an efficient manner:
http://www.iarcs.org.in/inoi/2012/inoi2012/inoi2012-qpaper.pdf
You can do this in O(n log n) time. (Or linear if you really care to.) First, pad the input array out to the next power of two using some really big negative number. Now, build an interval-tree-like data structure; recursively partition your array by dividing it in half. Each node in the tree represents a subarray whose length is a power of two and which begins at a position that is a multiple of its length, and each non-leaf node has a "left half" child and a "right half" child.
Compute, for each node in your tree, what happens when you add 0,1,2,3,... to that subarray and take the maximum element. Notice that this is trivial for the leaves, which represent subarrays of length 1. For internal nodes, it is simply the maximum of the left child's value and (length/2 + the right child's value). So you can build this tree in linear time.
Now we want to run a sequence of n queries on this tree and print out the answers. The queries are of the form "what happens if I add k,k+1,k+2,...n,1,...,k-1 to the array and report the maximum?"
Notice that, when we add that sequence to the whole array, the break between n and 1 either occurs at the beginning/end, or smack in the middle, or somewhere in the left half, or somewhere in the right half. So, partition the array into the k,k+1,k+2,...,n part and the 1,2,...,k-1 part. If you identify all of the nodes in the tree that represent subarrays lying completely inside one of the two sequences but whose parents either don't exist or straddle the break-point, you will have O(log n) nodes. You need to look at their values, add various constants, and take the maximum. So each query takes O(log n) time.
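Here is a sketch of that tree and query in Python, under my reading of the task (0-indexed positions, where position i receives added value k+i until the values wrap at n, then 1, 2, ...); all names are mine. The padding entries stay at -infinity, so they never win the maximum.

    NEG = float('-inf')

    def build(a):
        # tree[v] = max over node v's subarray of (a[i] + i's offset within the node)
        n = 1
        while n < len(a):
            n *= 2                               # pad to the next power of two
        tree = [NEG] * (2 * n)
        for i, val in enumerate(a):
            tree[n + i] = val
        for v in range(n - 1, 0, -1):
            half = n >> v.bit_length()           # size of each child's subarray
            tree[v] = max(tree[2 * v], tree[2 * v + 1] + half)
        return tree, n

    def query(tree, n, l, r, first_add):
        # max of a[i] + (first_add + i - l) over l <= i <= r (0-indexed)
        def go(v, lo, hi):
            if r < lo or hi < l:
                return NEG
            if l <= lo and hi <= r:              # canonical node: just add a constant
                return tree[v] + first_add + (lo - l)
            mid = (lo + hi) // 2
            return max(go(2 * v, lo, mid), go(2 * v + 1, mid + 1, hi))
        return go(1, 0, n - 1)

    def best(tree, n, m, k):
        # maximum after adding k, k+1, ..., m, 1, ..., k-1 to an array of length m
        res = query(tree, n, 0, m - k, k)        # positions receiving k .. m
        if k > 1:
            res = max(res, query(tree, n, m - k + 1, m - 1, 1))  # receiving 1 .. k-1
        return res

    a = [3, 1, 4, 1, 5]
    tree, n = build(a)
    print([best(tree, n, len(a), k) for k in range(1, len(a) + 1)])  # [10, 8, 9, 8, 9]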

Sub O(n^2) algorithm for counting nested intervals?

We have a list of intervals of the form [ai, bi]. For each interval, we want to count the number of other intervals that are nested within it.
For example, if we had two intervals, A = [1,4] and B = [2,3]. Then the count for B would be 0 as there are no nested intervals for B; and the count for A would be 1 as B fits within A.
My question is, does there exist a sub-O(n^2) algorithm for this problem, where n is the number of intervals?
EDIT: Here are the conditions the intervals meet. The end points of the intervals are floating point numbers. The lower limit for the ai's/bi's is 0 and the upper limit is whatever max float is. Also, there is the condition that ai < bi, so no intervals of length 0.
Yes, it is possible.
We will borrow the typical computational geometry "scan line" trick.
First, let's answer an easier (but closely related) question. Instead of reporting how many other intervals each interval contains, let's report how many intervals each is contained in. So for your example with only two intervals, interval I0 = [1,4] has value zero because it is contained in zero intervals, while I1 = [2,3] has value one because it is contained in one interval.
You will see in a minute (a) why this question is easier and (b) how it leads to the answer for the original question.
To solve this easier question: Take all starting and ending points -- all of the ai and bi -- and put them into a master list. Call each element of this list an "event". So an event would be something like "interval I37 started" or "interval I23 ended".
Sort this list of events and process it in order.
As you process the list of events, maintain a set S of "active intervals". An interval is "active" if we have encountered its start event but not its ending event; that is, if we are within that interval.
Now, whenever we see an ending event bj, we are ready to compute how many intervals contain Ij (= [aj, bj]). All we need to do is examine the set S of active intervals and determine how many of them started before aj. That is our answer for how many intervals contain interval Ij.
To do this efficiently, keep S itself sorted by starting point; e.g., by using a self-balancing binary tree.
Sorting the list of events is O(2n log 2n) = O(n log n). Adding or removing an element from a self-balancing binary tree is O(log n). Asking "how many elements of the self-balancing binary tree are less than x?" is also O(log n). Therefore this entire algorithm is O(n log n).
So, that solves the easy question. Call that the "easy algorithm". Now for what you actually asked.
Think of the number line as extending to infinity and wrapping around to -infinity, and define an interval with bi < ai to start at ai, stretch to infinity, wrap to minus infinity, and end at bi.
For any interval Ij = [aj, bj], define Complement(Ij) as the interval [bj, aj]. (For example, the interval [2, 3] starts at 2 and ends at 3; so Complement([2,3]) = [3,2] starts at 3, stretches to infinity, wraps to -infinity, and ends at 2.)
Observe that interval I contains interval J if and only if Complement(J) contains Complement(I). (Prove this.)
So, we can answer your original question simply by running the "easy algorithm" on the set of complements of all of the intervals. That is, start your scan at -infinity with the set S of "active intervals" containing all intervals (because all complements contain infinity/-infinity). Keep S sorted by end point (i.e. start point of complement).
Sort all start points and end points and process them in order. When you encounter a starting point for interval Ij (= [aj, bj]), you are actually hitting the end point of its complement... So remove Ij from S, query S to see how many of its endpoints (i.e. complement start points) come before bj, and report that as the answer for Ij. If you later encounter the end point of Ij, you are encountering the start point of its complement, so you need to add it back into the set S of active intervals.
This final algorithm is O(n log n) for the same reasons the "easy algorithm" was.
[Update]
One clarification, one correction, one comment...
Clarification: Of course, the "self-balancing binary tree" has to be augmented such that each sub-tree knows how many elements it contains. Otherwise, you cannot answer "how many elements are less than x?" This augmentation is straightforward to maintain, but it is not something that every implementation provides; e.g. the C++ std::set does not, to my knowledge.
Correction: You do not want to add any elements back in to the set S of active intervals; in fact, doing so can result in the wrong answer. For example, if the intervals are just [1,2] and [3,4], you would hit 1 (and remove [1,2] from the set), then 2 (and add it back in again), then 3... And since 2<4, you would conclude that [3,4] contains [1,2]. Which is wrong.
Conceptually, you already processed all of the "start events" for the complement intervals; that is why S begins with all intervals inside of it. So all you need to worry about are the ending points; you do not want to add any elements to S, ever.
Put another way, instead of having the intervals wrap around, you can think of [bi,ai] (where bi > ai) as meaning [bi - infinity, ai] with no wrap-around. The logic still works, but the processing is more clear: First you process all of the "whatever - infinity" terms (i.e. the end points), then you process the others (i.e. the start points).
With this correction, I am pretty sure my solution actually works. This formulation also extends -- I think -- to the case where you have both normal and "backward" intervals together in one input.
Comment: This problem is tricky because if you have to enumerate the set of all intervals contained within every interval, the output itself can be O(n^2). So any working approach has to somehow count the intervals without even being able to identify them :-).
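Here is a sketch of the corrected algorithm, with a Fenwick tree over coordinate-compressed end points standing in for the augmented self-balancing tree (names are mine). It assumes all endpoints are distinct; ties would need a convention.

    class Fenwick:
        def __init__(self, n):
            self.t = [0] * (n + 1)
        def add(self, i, v):                     # point update, 0-indexed rank
            i += 1
            while i < len(self.t):
                self.t[i] += v
                i += i & -i
        def count_below(self, i):                # how many live entries have rank < i
            s = 0
            while i > 0:
                s += self.t[i]
                i -= i & -i
            return s

    def nested_counts(intervals):
        n = len(intervals)
        rank = {b: r for r, b in enumerate(sorted(b for _, b in intervals))}
        fen = Fenwick(n)
        for _, b in intervals:
            fen.add(rank[b], 1)                  # S starts out holding every interval
        counts = [0] * n
        for j in sorted(range(n), key=lambda j: intervals[j][0]):  # scan by start point
            b = intervals[j][1]
            fen.add(rank[b], -1)                 # remove I_j itself first
            # every end point still in S belongs to an interval with a larger
            # start; those whose end is also smaller than b are nested in I_j
            counts[j] = fen.count_below(rank[b])
        return counts

    print(nested_counts([(1, 4), (2, 3)]))       # [1, 0]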
Here is an O(N log N) approach:
let Ii = interval i = (ai, bi)
let L = list of intervals Ii
sort L by ai
divide L in half into L1a and L2a
sort L1a and L2a by bi to get L1b and L2b
merge-sort L1b and L2b, keeping track of the count of nestings (because all intervals in L1b start before all intervals in L2b, whenever we find an endpoint in L1b that is higher than an endpoint in L2b, we know everything between them is nested inside; think about it).
Now you have updated the counts of how often an interval in L2 is nested inside an interval in L1.
After merging L1 and L2, we repeat the process (recursion) by dividing L1 into L11a and L12a, and likewise dividing L2 into L21a and L22a.
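A sketch of that recursion (names are mine), written like a merge sort over interval indices so the repeated sorting by bi is absorbed into the merges, giving O(N log N) overall; ties between equal endpoints would need a convention.

    def count_nested(intervals):
        # counts[i] = number of other intervals nested inside intervals[i]
        counts = [0] * len(intervals)
        by_start = sorted(range(len(intervals)), key=lambda i: intervals[i][0])

        def solve(idx):                          # idx sorted by a; returns idx sorted by b
            if len(idx) <= 1:
                return idx
            mid = len(idx) // 2
            left, right = solve(idx[:mid]), solve(idx[mid:])   # left starts earlier
            merged, i, j = [], 0, 0
            while i < len(left) and j < len(right):
                if intervals[right[j]][1] < intervals[left[i]][1]:
                    merged.append(right[j]); j += 1
                else:
                    counts[left[i]] += j         # j right intervals end before left[i]
                    merged.append(left[i]); i += 1
            while i < len(left):                 # leftovers contain all of right
                counts[left[i]] += j
                merged.append(left[i]); i += 1
            return merged + right[j:]

        solve(by_start)
        return counts

    print(count_nested([(1, 4), (2, 3)]))        # [1, 0]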
