Finding "maximum" overlapping interval pair in O(nlog(n)) - algorithm

Problem Statement
Input
A set of n intervals: {[s_1, t_1], [s_2, t_2], ..., [s_n, t_n]}.
Output
A pair of intervals {[s_i, t_i], [s_j, t_j]} with the maximum overlap among all interval pairs.
Example
input intervals: {[1, 10], [2, 6], [3, 15], [5, 9]}
-> There are 6 possible interval pairs. Among those pairs, [1, 10] & [3, 15] have the largest possible overlap of 7.
output: {[1, 10], [3, 15]}
A naive algorithm would be a brute-force method in which all n intervals are compared to each other while the current maximum overlap value is tracked. The time complexity would be O(n^2) in that case.
I was able to find many procedures regarding interval trees, maximum number of overlapping intervals and maximum set of non-overlapping intervals, but nothing on this problem. Maybe I would be able to use the ideas given in the above algorithms, but I wasn't able to come up with one.
I spent many hours trying to figure out a nice solution, but I think I need some help at this point.
Any suggestions will help!

First, sort the intervals: first by left endpoint in increasing order, then — as a secondary criterion — by right endpoint in decreasing order. For the rest of this answer, I'll assume that the intervals are already in sorted order.
Now, there are two possibilities for what the maximum possible overlap might be:
it may be between an interval and a later interval that it completely covers.
it may be between an interval and the very next interval that it doesn't completely cover.
We can cover both cases in O(n) time by iterating over the intervals, keeping track of the following:
the greatest overlap we've seen so far, and the relevant pair of intervals.
the latest interval we've seen, call it L, that wasn't completely covered by any of its predecessors. (For this, the key insight is that thanks to the ordering of the intervals, we can easily tell if an interval is completely covered by any of its predecessors — and therefore if we need to update L — by simply checking if it's completely covered by the current L. So we can keep L up-to-date in O(1) time.)
and computing each interval's overlap with L.
So:
result := []
max_overlap := 0
L := sorted_intervals[1]
for interval I in sorted_intervals[2..n]:
    overlap := MIN(L.right, I.right) - I.left
    if overlap >= max_overlap:
        result := [L, I]
        max_overlap := overlap
    if I.right > L.right:
        L := I
So the total cost is the cost of sorting the intervals, which is likely to be O(n log n) time but may be O(n) if you can use bucket-sort or radix-sort or similar.
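For concreteness, here's a minimal Python rendering of the above (the tuple representation and function name are mine):

def max_overlap_pair(intervals):
    # Sort by left endpoint ascending, breaking ties by right endpoint
    # descending, so a covering interval always precedes what it covers.
    ivs = sorted(intervals, key=lambda iv: (iv[0], -iv[1]))
    L = ivs[0]                        # latest interval not covered by a predecessor
    best, max_overlap = None, float('-inf')
    for I in ivs[1:]:
        overlap = min(L[1], I[1]) - I[0]
        if overlap > max_overlap:
            best, max_overlap = (L, I), overlap
        if I[1] > L[1]:               # I isn't covered by L: it becomes the new L
            L = I
    return best, max_overlap

print(max_overlap_pair([(1, 10), (2, 6), (3, 15), (5, 9)]))
# (((1, 10), (3, 15)), 7)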

Related

Positioning an ordered sequence of intervals for maximum alignment with another sequence of fixed intervals

I have two sequences of intervals.
The first is fixed and non-overlapping, so something like:
[1..10], [12..15], [23..56], [72..89], ...
The second is not fixed, so it's just an ordered list of interval lengths:
[7, 2, 5, 26, ...]
The task at hand is to:
Place every interval from the second list at a given starting point, so that the second list becomes a list of fixed, non-overlapping intervals much like the first, while preserving its order
Find the alignment that minimizes the amount of integers that are in some interval from one of the lists but not in any interval from the other list
Very simple example:
[25..26], [58..68], [74..76], [78..86]
[10, 12]
The optimal solution is to place the interval of length 10 at [58..68] and the interval of length 12 at [74..86] which results in only the numbers 25, 26, and 77 being in one list but not the other.
The only thing I've come up with that seems mildly helpful is that if I lay down the intervals in order, I know how many 'penalties' the intervals I've already placed have created, so I have an upper bound for the score. That means I have an admissible heuristic, and I can do A* search instead of looking at the entire tree. However, the total range of numbers spans from 0 to about 34M, so I'd like something better.
Any help would be appreciated!
OK, here's a half-thought-out answer. It should work in polynomial time, but I haven't bothered checking what the exponent is. It may well be possible to get a better exponent than the approach outlined here. The details are left as an exercise to the reader :-) I hope it's not too unclear.
I'll define the score of a solution as the number of integers which appear in both lists of intervals. Let f(i,m) be the highest score it's possible to get using just the first i interval lengths, subject to the condition that none of your intervals goes above m. The function f, for fixed i, is essentially a (non-strictly) increasing function from the integers to a bounded subset of the integers. Therefore:
all values of f(i,m), for m > 0, are equal, with finitely many exceptions;
all values of f(i,m), for m < 0, are equal, with finitely many exceptions.
This means it's possible to represent all values of f(i,m) using a finite data structure (still considering a fixed value of i).
Now let F(i) be the value of this data structure representing all values of f(i,m). I claim that, given F(i), it is possible to calculate F(i+1). To do this, we only need to answer the following question for all x: If I place the new interval at x, how good is the best solution I can get? But we know what this is - it's just f(i,x) + the score we've got from this interval.
So if n is the number of intervals in the second list, the score of the best solution will be F(n).
To actually find the solution, you could work backwards from this.
You know the best score you can get; say it's s_0. Then put the last interval as far left as possible, subject to the condition that it still allows you to score s_0. That is, find the smallest m such that f(n,m) = s_0, and place the interval so that it only just stays inside the bound at m.
Then, let s_1 be the score you need to get from all the other intervals in order to reach a total of s_0. Place the next-to-last interval as far left as possible, subject to the condition that you can still score s_1. That is, find the smallest m such that f(n-1,m) = s_1, and place the interval so that it only just stays inside the bound at m.
And so on...
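For concreteness, here's a hedged sketch of the same DP as a plain table over explicit positions, usable as a reference on small ranges; for the asker's ~34M range you would need the compressed step-function representation described above. The bound M, the function name, and the conventions that placed intervals are disjoint as integer sets and that a length-l interval placed at x covers the integers x..x+l are all my assumptions:

def best_alignment_score(fixed, lengths, M):
    # f(i, m): max integers covered by both lists, using the first i lengths,
    # with every placed interval inside [0, m]. Returns f(len(lengths), M).
    covered = [0] * (M + 1)
    for a, b in fixed:
        for x in range(a, min(b, M) + 1):
            covered[x] = 1
    prefix = [0] * (M + 2)                      # prefix[x] = covered integers below x
    for x in range(M + 1):
        prefix[x + 1] = prefix[x] + covered[x]

    NEG = float('-inf')
    f = [0] * (M + 2)                           # i = 0: nothing placed, score 0
    for l in lengths:
        g = [NEG] * (M + 2)                     # g[m + 1] = f(i, m)
        for m in range(M + 1):
            g[m + 1] = g[m]                     # no interval ends exactly at m
            if m >= l:                          # place this interval at [m - l, m];
                g[m + 1] = max(g[m + 1],        # earlier ones must end before m - l
                               f[m - l] + prefix[m + 1] - prefix[m - l])
        f = g
    return f[M + 1]

fixed = [(25, 26), (58, 68), (74, 76), (78, 86)]
lengths = [10, 12]
both = best_alignment_score(fixed, lengths, M=100)
in_fixed = sum(b - a + 1 for a, b in fixed)
in_placed = sum(l + 1 for l in lengths)
print(in_fixed + in_placed - 2 * both)          # 3 (the integers 25, 26 and 77)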

Algorithm to generate k element subsets in order of their sum

If I have an unsorted large set of n integers (say 2^20 of them) and would like to generate subsets with k elements each (where k is small, say 5) in increasing order of their sums, what is the most efficient way to do so?
Why I need to generate these subsets in this fashion is that I would like to find the k-element subset with the smallest sum satisfying a certain condition, and I thus would apply the condition on each of the k-element subsets generated.
Also, what would be the complexity of the algorithm?
There is a similar question here: Algorithm to get every possible subset of a list, in order of their product, without building and sorting the entire list (i.e. Generators), about generating subsets in order of their product, but it wouldn't fit my needs due to the extremely large size n of the set.
I intend to implement the algorithm in Mathematica, but could do it in C++ or Python too.
If your desired property of the small subsets (call it P) is fairly common, a probabilistic approach may work well:
Sort the n integers (for millions of integers, i.e. tens to hundreds of MB of RAM, this should not be a problem), and sum the k-1 smallest. Call this total offset.
Generate a random k-subset (say, by sampling k random numbers, mod n) and check it for P-ness.
On a match, note the sum-total of the subset. Subtract offset from this to find an upper bound on the largest element of any k-subset of equivalent sum-total.
Restrict your set of n integers to those less than or equal to this bound.
Repeat (goto 2) until no matches are found within some fixed number of iterations.
Note the initial sort is O(n log n). The binary search implicit in step 4 is O(log n).
Obviously, if P is so rare that random pot-shots are unlikely to get a match, this does you no good.
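Here's a hedged Python sketch of that sampling scheme; the predicate P is the caller's, and the miss budget and round cap are cutoffs I've assumed:

import bisect
import random

def random_restrict_search(nums, k, P, max_misses=10000, max_rounds=100):
    nums = sorted(nums)                        # step 1
    offset = sum(nums[:k - 1])                 # sum of the k-1 smallest elements
    best, best_sum = None, float('inf')
    for _ in range(max_rounds):
        if len(nums) < k:
            break
        for _ in range(max_misses):            # step 2: random k-subsets
            subset = random.sample(nums, k)
            if P(subset):
                break
        else:
            break                              # step 5: too many misses, stop
        total = sum(subset)                    # step 3
        if total < best_sum:
            best, best_sum = subset, total
        # any k-subset with an equal-or-better sum has max element <= bound
        bound = best_sum - offset
        nums = nums[:bisect.bisect_right(nums, bound)]   # step 4
    return best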
Even if only 1 in 1000 of the k-sized sets meets your condition, that's still far too many combinations to test. I believe the runtime scales with nCk (n choose k), where n is the size of your unsorted list. The answer by Andrew Mao has a link to this value. 10^28/1000 is still 10^25. Even at 1000 tests per second, that's still 10^22 seconds, about 10^14 years.
If you are allowed to, I think you need to eliminate duplicate numbers from your large set. Each duplicate you remove will drastically reduce the number of evaluations you need to perform. Sort the list, then kill the dupes.
Also, are you looking for the single best answer here? Who will verify the answer, and how long would that take? I suggest implementing a Genetic Algorithm and running a bunch of instances overnight (for as long as you have the time). This will yield a very good answer, in much less time than the duration of the universe.
Do you mean 20 integers, or 2^20? If it's really 2^20, then you may need to go through a significant fraction of the (2^20 choose 5) subsets before you find one that satisfies your condition. On a modern 100k-MIPS CPU, assuming just 1 instruction could compute a set and evaluate the condition, going through all of them would still take about 10^17 seconds, that is, billions of years. So if you even need to go through a fraction of that, it's not going to finish in your lifetime.
Even if the number of integers is smaller, this seems to be a rather brute force way to solve this problem. I conjecture that you may be able to express your condition as a constraint in a mixed integer program, in which case solving the following could be a much faster way to obtain the solution than brute force enumeration. Assuming your integers are w_i, i from 1 to N:
minimize    sum(i) w_i * x_i
subject to  sum(i) x_i = k
            (some constraints on the w_i * x_i)
            x_i binary
If it turns out that the linear programming relaxation of your MIP is tight, then you would be in luck and have a very efficient way to solve the problem, even for 2^20 integers (example: the max-flow/min-cut problem). Also, you can use the approach of column generation to find a solution, since you may have a very large number of variables that cannot all be solved for at the same time.
If you post a bit more about the constraint you are interested in, I or someone else may be able to propose a more concrete solution for you that doesn't involve brute force enumeration.
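As a sketch of that formulation (not of the asker's actual condition, which we don't know), here's how it might look with the PuLP modelling library, with a placeholder hook where the real constraints on the w_i*x_i would go:

from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

def min_sum_k_subset(w, k, add_constraints=None):
    prob = LpProblem("min_sum_k_subset", LpMinimize)
    x = [LpVariable(f"x_{i}", cat=LpBinary) for i in range(len(w))]
    prob += lpSum(w[i] * x[i] for i in range(len(w)))    # objective: minimize the sum
    prob += lpSum(x) == k                                # choose exactly k elements
    if add_constraints is not None:
        add_constraints(prob, x)                         # caller encodes the condition here
    prob.solve()
    return sorted(w[i] for i in range(len(w)) if x[i].value() > 0.5)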
Here's an approximate way to do what you're saying.
First, sort the list. Then, consider some length-5 index vector v, corresponding to positions in the sorted list, whose maximum index is some number m, and some other index vector v' with max index m' > m. The smallest sum over all such vectors v' is never smaller than the smallest sum over all vectors v.
So, here's how you can loop through the elements with approximately increasing sum:
sort arr
for i = 1 to N
    for v = 5-element subsets of (1, ..., i)
        set = arr{v}
        if condition(set) is satisfied
            break_loop = true
            compute sum(set), keep set if it is the best so far
    break if break_loop
Basically, this means that you no longer need to check for 5-element combinations of (1, ..., n+1) if you find a satisfying assignment in (1, ..., n), since any satisfying assignment with max index n+1 will have a greater sum, and you can stop after that set. However, there is no easy way to loop through the 5-combinations of (1, ..., n) while guaranteeing that the sum is always increasing, but at least you can stop checking after you find a satisfying set at some n.
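Here's a hedged Python rendering of that loop, with one small refinement: at step i, only the subsets that actually include index i are new, so earlier ones need not be re-enumerated:

from itertools import combinations

def smallest_satisfying_by_max_index(arr, k, condition):
    arr = sorted(arr)
    best = None
    for i in range(k - 1, len(arr)):
        # subsets whose largest index is exactly i: choose k-1 indices below i
        for rest in combinations(range(i), k - 1):
            subset = [arr[j] for j in rest] + [arr[i]]
            if condition(subset) and (best is None or sum(subset) < sum(best)):
                best = subset
        if best is not None:
            return best          # any later subset has a larger max index
    return None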
This looks to be a perfect candidate for map-reduce (http://en.wikipedia.org/wiki/MapReduce). If you know of any way of partitioning them smartly so that passing candidates are equally present in each node then you can probably get a great throughput.
A complete sort may not really be needed, as the map stage can take care of it. Each node can then verify the condition against the k-tuples and output results into a file that can be aggregated/reduced later.
If you know of the probability of occurrence and don't need all of the results try looking at probabilistic algorithms to converge to an answer.

Sub O(n^2) algorithm for counting nested intervals?

We have a list of intervals of the form [ai, bi]. For each interval, we want to count the number of other intervals that are nested within it.
For example, if we had two intervals, A = [1,4] and B = [2,3]. Then the count for B would be 0 as there are no nested intervals for B; and the count for A would be 1 as B fits within A.
My question is, does there exist a sub-O(n^2) algorithm for this problem, where n is the number of intervals?
EDIT: Here are the conditions the intervals meet. The end points of the intervals are floating point numbers. The lower limit for the ai's/bi's is 0 and the upper limit is whatever max float is. Also, there is the condition that ai < bi, so no intervals of length 0.
Yes, it is possible.
We will borrow the typical computational geometry "scan line" trick.
First, let's answer an easier (but closely related) question. Instead of reporting how many other intervals each interval contains, let's report how many intervals each is contained in. So for your example with only two intervals, interval I0 = [1,4] has value zero because it is contained in zero intervals, while I1 = [2,3] has value one because it is contained in one interval.
You will see in a minute (a) why this question is easier and (b) how it leads to the answer for the original question.
To solve this easier question: Take all starting and ending points -- all of the ai and bi -- and put them into a master list. Call each element of this list an "event". So an event would be something like "interval I37 started" or "interval I23 ended".
Sort this list of events and process it in order.
As you process the list of events, maintain a set S of "active intervals". An interval is "active" if we have encountered its start event but not its ending event; that is, if we are within that interval.
Now, whenever we see an ending event bj, we are ready to compute how many intervals contain Ij (= [aj, bj]). All we need to do is examine the set S of active intervals and determine how many of them started before aj. That is our answer for how many intervals contain interval Ij.
To do this efficiently, keep S itself sorted by starting point; e.g., by using a self-balancing binary tree.
Sorting the list of events is O(2n log 2n) = O(n log n). Adding or removing an element from a self-balancing binary tree is O(log n). Asking "how many elements of the self-balancing binary tree are less than x?" is also O(log n). Therefore this entire algorithm is O(n log n).
So, that solves the easy question. Call that the "easy algorithm". Now for what you actually asked.
Think of the number line as extending to infinity and wrapping around to -infinity, and define an interval with bi < ai to start at ai, stretch to infinity, wrap to minus infinity, and end at bi.
For any interval Ij = [aj, bj], define Complement(Ij) as the interval [bj, aj]. (For example, the interval [2, 3] starts at 2 and ends at 3; so Complement([2,3]) = [3,2] starts at 3, stretches to infinity, wraps to -infinity, and ends at 2.)
Observe that interval I contains interval J if and only if Complement(J) contains Complement(I). (Prove this.)
So, we can answer your original question simply by running the "easy algorithm" on the set of complements of all of the intervals. That is, start your scan at -infinity with the set S of "active intervals" containing all intervals (because all complements contain infinity/-infinity). Keep S sorted by end point (i.e. start point of complement).
Sort all start points and end points and process them in order. When you encounter a starting point for interval Ij (= [aj, bj]), you are actually hitting the end point of its complement... So remove Ij from S, query S to see how many of its endpoints (i.e. complement start points) come before bj, and report that as the answer for Ij. If you later encounter the end point of Ij, you are encountering the start point of its complement, so you need to add it back into the set S of active intervals.
This final algorithm is O(n log n) for the same reasons the "easy algorithm" was.
[Update]
One clarification, one correction, one comment...
Clarification: Of course, the "self-balancing binary tree" has to be augmented such that each sub-tree knows how many elements it contains. Otherwise, you cannot answer "how many elements are less than x?" This augmentation is straightforward to maintain, but it is not something that every implementation provides; e.g. the C++ std::set does not, to my knowledge.
Correction: You do not want to add any elements back in to the set S of active intervals; in fact, doing so can result in the wrong answer. For example, if the intervals are just [1,2] and [3,4], you would hit 1 (and remove [1,2] from the set), then 2 (and add it back in again), then 3... And since 2<4, you would conclude that [3,4] contains [1,2]. Which is wrong.
Conceptually, you have already processed all of the "start events" for the complement intervals; that is why S begins with all intervals inside of it. So all you need to worry about are the ending points; you do not want to add any elements to S, ever.
Put another way, instead of having the intervals wrap around, you can think of [bi,ai] (where bi > ai) as meaning [bi - infinity, ai] with no wrap-around. The logic still works, but the processing is more clear: First you process all of the "whatever - infinity" terms (i.e. the end points), then you process the others (i.e. the start points).
With this correction, I am pretty sure my solution actually works. This formulation also extends -- I think -- to the case where you have both normal and "backward" intervals together in one input.
Comment: This problem is tricky because if you have to enumerate the set of all intervals contained within every interval, the output itself can be O(n^2). So any working approach has to somehow count the intervals without even being able to identify them :-).
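Here's a hedged Python sketch equivalent to the corrected algorithm above: sweeping start points from right to left, every interval already seen starts later, so it suffices to count how many of their right endpoints fall below the current one. A Fenwick (binary indexed) tree over endpoint ranks stands in for the augmented balanced tree, answering "how many elements are less than x?" in O(log n). Distinct endpoints are assumed:

def nested_counts(intervals):
    # counts[j] = number of intervals strictly inside intervals[j]
    n = len(intervals)
    # rank-compress the right endpoints for the Fenwick tree (1-based ranks)
    rank = {b: r + 1 for r, b in enumerate(sorted(b for _, b in intervals))}
    tree = [0] * (n + 1)

    def add(i):
        while i <= n:
            tree[i] += 1
            i += i & -i

    def count_below(i):          # how many inserted ends have rank <= i
        s = 0
        while i > 0:
            s += tree[i]
            i -= i & -i
        return s

    counts = [0] * n
    # sweep starts from right to left: everything already inserted starts later
    order = sorted(range(n), key=lambda j: intervals[j][0], reverse=True)
    for j in order:
        b = intervals[j][1]
        counts[j] = count_below(rank[b] - 1)   # later-starting, earlier-ending
        add(rank[b])
    return counts

print(nested_counts([(1, 4), (2, 3)]))   # [1, 0]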
Here is an O(N log N) approach:
let Ii = interval i = (ai, bi)
let L = list of intervals Ii
sort L by ai
divide L in half into L1a and L2a
sort L1a and L2a by bi to get L1b and L2b
merge-sort L1b and L2b, keeping track of the count of nestings (because all intervals in L1b start before all intervals in L2b, when we find an endpoint in L1b that is higher than an endpoint in L2b, we know everything between them is nested inside - think about it).
Now you have updated the counts of how often an interval in L2 is nested inside an interval in L1.
After merging L1 and L2, we repeat the process (recursion) by dividing L1 into L11a and L12a, and likewise dividing L2 into L21a and L22a.

search for interval overlap in list of intervals?

Say [a,b] represents the interval on the real line from a to b, a < b, inclusive (ie, [a,b] = set of all x such that a<=x<=b). Also, say [a,b] and [c,d] are 'overlapping' if they share any x such that x is in both [a,b] and [c,d].
Given a list of intervals, ([x1,y1],[x2,y2],...), what is the most efficient way to find all such intervals that overlap with [x,y]?
Obviously, I can try each and get it in O(n). But I was wondering if I could sort the list of intervals in some clever way, I could find /one/ overlapping item in O(log N) via a binary search, and then 'look around' from that position in the list to find all overlapping intervals. However, how do I sort intervals such that such a strategy would work?
Note that there may be overlaps between the intervals in the list itself, which is what makes this hard.
I've tried it by sorting intervals by their left end, right end, middle, but none seem to lead to an exhaustive search.
Help?
For completeness' sake, I'd like to add that there is a well-known data structure for just this sort of problem, known (surprise, surprise) as an interval tree. It's basically an augmented balanced tree (red-black, AVL, your pick) that stores intervals sorted by their left (low) endpoint. The augmentation is that each node stores the largest right (high) endpoint in its subtree. This tree allows you to find all overlapping intervals in O(log n) time.
It's described in CLRS 14.3.
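Implementations vary; here's a hedged static sketch of the idea in Python (a balanced shape built once from the sorted list, rather than CLRS's red-black version). It prunes any subtree whose maximum right endpoint cannot reach the query:

def build(ivs):
    # ivs must be sorted by left endpoint. Each node is
    # (interval, left_child, right_child, max_right_endpoint_in_subtree).
    if not ivs:
        return None
    mid = len(ivs) // 2
    left, right = build(ivs[:mid]), build(ivs[mid + 1:])
    max_end = max(ivs[mid][1],
                  left[3] if left else float('-inf'),
                  right[3] if right else float('-inf'))
    return (ivs[mid], left, right, max_end)

def query(node, x, y, out):
    # collect every stored [a, b] overlapping [x, y], i.e. a <= y and b >= x
    if node is None or node[3] < x:
        return                          # nothing in this subtree reaches x
    query(node[1], x, y, out)
    a, b = node[0]
    if a <= y and b >= x:
        out.append(node[0])
    if a <= y:                          # right subtree holds only larger lefts
        query(node[2], x, y, out)

root = build(sorted([(1, 4), (3, 6), (4, 5), (2, 8), (5, 7), (1, 2), (2, 2.5)]))
hits = []
query(root, 3, 4, hits)
print(hits)                             # [(1, 4), (2, 8), (3, 6), (4, 5)]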
[a, b] overlaps with [x, y] iff b >= x and a <= y (for closed intervals). Sorting intervals by their first elements lets you find those satisfying a <= y in log time. Sorting intervals by their last elements lets you find those satisfying b >= x in log time. Take the intersection of the resulting sets.
A 'quadtree' is a data structure often used to improve the efficiency of collision detection in 2 dimensions.
I think you could come up with a similar 1-d structure. This would require some pre-computation but should result in O(log N) performance.
Basically you start with a root 'node' that covers all possible intervals, and when adding a node to the tree, you decide if it falls on the left or the right of the midpoint. If it crosses the mid point, you break it into two intervals (but record the original parent) and recursively proceed from there. You can set a limit on the depth of the tree, which can save memory and improve performance, but comes at the expense of complicating things a little (you need to store a list of intervals in your nodes).
Then when checking an interval, you basically find all leaf nodes that it would be inserted into were it inserted, check the partial intervals within those nodes for intersection, and then report the interval that is recorded against them as the 'original' parent.
Just a quick thought 'off the cuff' so to speak.
Could you organize them into 2 lists, one for the starts of intervals and the other for the ends of intervals?
This way, you can compare y to the items in the start of interval list (say by binary search) to cut down the candidates based on that.
You can then compare x to the items in the end of interval list.
EDIT
Case: Once Off
If you are comparing only a single interval to the list of intervals in a once-off situation, I don't believe sorting will help you out, since sorting alone costs at least the O(n) of the scan itself.
By doing a linear search through all the x's to trim out impossible intervals, then doing another linear search through the remaining y's, you can reduce your total work. While this is still O(n), without it you would be doing 2n comparisons, whereas on average you would only do (3n-1)/2 comparisons this way.
I believe this is the best you can do for an unsorted list.
Case: Pre-sorting doesn't count
In the case where you will be repeatedly comparing single intervals to this list of intervals and you pre-sort your list, you can achieve better results. The process above still applies, but by doing a binary search on the first list and then the second you can get O(m log n) as opposed to O(mn), where m is the number of single intervals being compared. Note that this still gives you the advantage of reducing total comparisons. [2m log n compared to m(3(log n) - 1)/2]
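Here's a hedged sketch of that pre-sorted, repeated-lookup case using Python's bisect; the set intersection at the end is what keeps the worst case at O(n), as discussed:

import bisect

def build_index(intervals):
    starts = sorted((a, i) for i, (a, b) in enumerate(intervals))
    ends = sorted((b, i) for i, (a, b) in enumerate(intervals))
    return intervals, starts, ends

def query(index, x, y):
    # all intervals [a, b] with a <= y and b >= x (closed intervals)
    intervals, starts, ends = index
    begun = {i for _, i in starts[:bisect.bisect_right(starts, (y, len(intervals)))]}
    alive = {i for _, i in ends[bisect.bisect_left(ends, (x, -1)):]}
    return [intervals[i] for i in sorted(begun & alive)]

index = build_index([(1, 4), (3, 6), (4, 5), (2, 8), (5, 7), (1, 2), (2, 2.5)])
print(query(index, 3, 4))    # [(1, 4), (3, 6), (4, 5), (2, 8)]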
You could sort by both left end and right end at the same time and use both lists to eliminate non-overlapping values. If the list is sorted by the left end, then none of the intervals to the right of the right end of the test range can overlap. If the list is sorted by the right end, then none of the intervals to the left of the left end of the test range can overlap.
For example if the intervals are
[1,4], [3,6], [4,5], [2,8], [5,7], [1,2], [2,2.5]
and you're finding overlaps with [3,4], then sorting by left end and marking the position of the right end of the test (with the right end treated as just greater than its value, so that 4 is included in the range) gives
[1,4], [1,2], [2,2.5], [2,8], [3,6], [4,5], *, [5,7]
you know [5,7] can't overlap, then sorting by right end and marking position of the left end of the test
[1,2], [2,2.5], *, [1,4], [4,5], [3,6], [5,7], [2,8]
you know [1,2] and [2,2.5] can't overlap
Not sure how efficient this would be since you're having to do two sorts and searches.
As you can see in other answers, most algorithms come together with a special data structure. For example, for an unsorted list of intervals as input, O(n) is the best that you'll get. (And usually it's easier to think in terms of the data structure, which then dictates the algorithm.)
In this case, your question is not complete:
Are you given the whole list or it is you who actually creates it?
Do you have to perform just one such lookup or many of them?
Do you have any estimations for operations it should support and their frequencies?
For example, if you have to perform just one such lookup, then it's not worth sorting the list beforehand. If many, then the more expensive sorting or the generation of a "1-D quadtree" would be amortized.
However, it would be difficult to solve this way, because a simple quadtree (as I understand it) is only able to detect a collision; it cannot produce the list of all the segments that overlap with your input.
One simple implementation would be a list, ordered by coordinate, into which you insert all the segment endpoints, each flagged as start or end and tagged with its segment number. You then answer a query by scanning this list (still O(n), but I doubt you can make it faster if you also need the list of all the segments that overlap), keeping track of all segments that have opened but not yet closed at each "check point".
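Here's a hedged sketch of that scan for closed intervals, with ties resolved by processing starts before ends at the same coordinate:

def overlapping_via_scan(intervals, x, y):
    events = []
    for i, (a, b) in enumerate(intervals):
        events.append((a, 0, i))          # 0 = start flag
        events.append((b, 1, i))          # 1 = end flag
    events.sort()
    open_ids, hits = set(), set()
    for coord, flag, i in events:
        if coord > y:
            break                          # nothing past y can matter any more
        if flag == 0:
            open_ids.add(i)                # segment opens
        else:
            if coord >= x:                 # closed at or after x: overlaps [x, y]
                hits.add(i)
            open_ids.discard(i)
    hits |= open_ids                       # still open at y: overlaps too
    return [intervals[i] for i in sorted(hits)]

print(overlapping_via_scan([(1, 4), (3, 6), (4, 5), (2, 8), (5, 7), (1, 2), (2, 2.5)], 3, 4))
# [(1, 4), (3, 6), (4, 5), (2, 8)]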

Finding the minimal coverage of an interval with subintervals

Suppose I have an interval (a,b), and a number of subintervals {(ai,bi)}i whose union is all of (a,b). Is there an efficient way to choose a minimal-cardinality subset of these subintervals which still covers (a,b)?
A greedy algorithm starting at a or b always gives the optimal solution.
Proof: consider the set Sa of all the subintervals covering a. Clearly, one of them has to belong to the optimal solution. If we replace it with a subinterval (amax,bmax) from Sa whose right endpoint bmax is maximal in Sa (reaches furthest to the right), the remaining uncovered interval (bmax,b) will be a subset of the remaining interval from the optimal solution, so it can be covered with no more subintervals than the analogous uncovered interval from the optimal solution. Therefore, a solution constructed from (amax,bmax) and the optimal solution for the remaining interval (bmax,b) will also be optimal.
So, just start at a and iteratively pick the interval reaching furthest right (and covering the end of previous interval), repeat until you hit b. I believe that picking the next interval can be done in log(n) if you store the intervals in an augmented interval tree.
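Here's a hedged Python sketch of that greedy sweep; sorting by left endpoint and advancing one pointer gives O(n log n) overall without needing the interval tree:

def minimal_cover(a, b, subintervals):
    # repeatedly take the subinterval that starts at or before the point
    # reached so far and extends furthest to the right
    ivs = sorted(subintervals)              # by left endpoint
    chosen, reached, i, n = [], a, 0, len(ivs)
    while reached < b:
        best_end, best = reached, None
        while i < n and ivs[i][0] <= reached:
            if ivs[i][1] > best_end:
                best_end, best = ivs[i][1], ivs[i]
            i += 1
        if best is None:
            raise ValueError("subintervals do not cover (a, b)")
        chosen.append(best)
        reached = best_end
    return chosen

print(minimal_cover(0, 10, [(0, 4), (3, 9), (1, 5), (8, 10)]))
# [(0, 4), (3, 9), (8, 10)]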
Sounds like dynamic programming.
Here's an illustration of the algorithm (assume intervals are in a list sorted by ending time):
//works backwards from the end
int minCard(int current, int must_end_after)
{
    if (current < 0)
        if (must_end_after == 0)
            return 0;        //no more intervals needed
        else
            return infinity; //doesn't cover (a,b)
    if (intervals[current].end < must_end_after)
        return infinity;     //doesn't cover (a,b)
    //include the current interval, or not?
    return min( 1 + minCard(current - 1, intervals[current].start),
                minCard(current - 1, must_end_after) );
}
But it should also involve caching (memoisation).
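Here's a hedged memoised rendering in Python, with functools.lru_cache supplying the caching; the base case is generalised slightly from the pseudocode's a = 0 assumption:

from functools import lru_cache

def min_cover_dp(a, b, intervals):
    ivs = sorted(intervals, key=lambda iv: iv[1])    # sort by ending time
    INF = float('inf')

    @lru_cache(maxsize=None)
    def min_card(current, must_end_after):
        if must_end_after <= a:
            return 0                      # (a, b) is fully covered
        if current < 0 or ivs[current][1] < must_end_after:
            return INF                    # remaining intervals cannot reach the gap
        return min(1 + min_card(current - 1, ivs[current][0]),  # take it, or
                   min_card(current - 1, must_end_after))       # skip it

    return min_card(len(ivs) - 1, b)

print(min_cover_dp(0, 10, [(0, 4), (3, 9), (1, 5), (8, 10)]))   # 3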
There are two cases to consider:
Case 1: There are no overlapping intervals after the finish time of an interval. In this case, pick the next interval with the smallest starting time and the longest finishing time. (amin, bmax).
Case 2: There are 1 or more intervals overlapping with the last interval you're looking at. In this case, the start time doesn't matter because you've already covered that. So optimize for the finishing time. (a, bmax).
Case 1 always picks the first interval as the first interval in the optimal set as well (the proof is the same as what #RafalDowgrid provided).
You mean so that the subintervals still overlap in such a way that (a,b) remains completely covered at all points?
Maybe splitting up the subintervals themselves into basic blocks associated with where they came from, so you can list options for each basic block interval accounting for other regions covered by the subinterval also. Then you can use a search based on each sub-subinterval and at least be sure no gaps are left.
Then you would need to search... efficiently... and that would be harder.
Could eliminate any collection of intervals that are entirely covered by another set of smaller number and work the problem after the preprocessing.
Wouldn't the minimal for the whole be minimal for at least one half? I'm not sure.
Found a link to a journal but couldn't read it. :(
That would be a hitting-set problem, which is NP-hard in general.
Couldn't read this either, but it looks like the opposite kind of problem.
Couldn't read it, but here is another link that mentions splitting intervals up.
Here is an available reference on Randomized Algorithms for Geometric Optimization Problems.
Page 35 of this PDF has a greedy algorithm.
Page 11 of Karp (1972) mentions hitting-set and is cited a lot.
Google result. Researching was fun, but I have to go now.
