Maximum non-overlapping intervals in a interval tree - algorithm

Given a list of intervals of time, I need to find the set of maximum non-overlapping intervals.
For example,
if we have the following intervals:
[0600, 0830], [0800, 0900], [0900, 1100], [0900, 1130],
[1030, 1400], [1230, 1400]
Also it is given that time have to be in the range [0000, 2400].
The maximum non-overlapping set of intervals is [0600, 0830], [0900, 1130], [1230, 1400].
I understand that maximum set packing is NP-Complete. I want to confirm if my problem (with intervals containing only start and end time) is also NP-Complete.
And if so, is there a way to find an optimal solution in exponential time, but with smarter preprocessing and pruning data. Or if there is a relatively easy to implement fixed parameter tractable algorithm. I don't want to go for an approximation algorithm.

This is not a NP-Complete problem. I can think of an O(n * log(n)) algorithm using dynamic programming to solve this problem.
Suppose we have n intervals. Suppose the given range is S (in your case, S = [0000, 2400]). Either suppose all intervals are within S, or eliminate all intervals not within S in linear time.
Pre-process:
Sort all intervals by their begin points. Suppose we get an array A[n] of n intervals.
This step takes O(n * log(n)) time
For all end points of intervals, find the index of the smallest begin point that follows after it. Suppose we get an array Next[n] of n integers.
If such begin point does not exist for the end point of interval i, we may assign n to Next[i].
We can do this in O(n * log(n)) time by enumerating n end points of all intervals, and use a binary search to find the answer. Maybe there exists linear approach to solve this, but it doesn't matter, because the previous step already take O(n * log(n)) time.
DP:
Suppose the maximum non-overlapping intervals in range [A[i].begin, S.end] is f[i]. Then f[0] is the answer we want.
Also suppose f[n] = 0;
State transition equation:
f[i] = max{f[i+1], 1 + f[Next[i]]}
It is quite obvious that the DP step take linear time.
The above solution is the one I come up with at the first glance of the problem. After that, I also think out a greedy approach which is simpler (but not faster in the sense of big O notation):
(With the same notation and assumptions as the DP approach above)
Pre-process: Sort all intervals by their end points. Suppose we get an array B[n] of n intervals.
Greedy:
int ans = 0, cursor = S.begin;
for(int i = 0; i < n; i++){
if(B[i].begin >= cursor){
ans++;
cursor = B[i].end;
}
}
The above two solutions come out from my mind, but your problem is also referred as the activity selection problem, which can be found on Wikipedia http://en.wikipedia.org/wiki/Activity_selection_problem.
Also, Introduction to Algorithms discusses this problem in depth in 16.1.

Related

Optimal algorithm for the Greedy Algorithm: Interval Scheduling - Scheduling All Intervals

I recently read about the Interval Scheduling algorithm in chapter 4 of Algorithm Design by Tardos and Kleinberg. The solution provided to the Interval Scheduling Problem was this:
Sort the n intervals based on their finishing time and store in array F
A = [] #list of intervals
S = array with property that S[i] contains the finishing time of the ith interval
i = 0
For j = 0 to n-1
if S[j] >= F[i] #this interval does not overlap with the ith interval
i = j
A.add(j)
This algorithm in C++/Java/Python can be found here: http://www.geeksforgeeks.org/greedy-algorithms-set-1-activity-selection-problem/
An extension of the problem, also known as the Interval Coloring Problem, can be explained as follows.
We are given a set of intervals, and we want to colour all intervals
so that intervals given the same colour do not intersect and the goal
is to try to minimize the number of colours used.
The algorithm proposed in the book was this:
Sort the intervals by their start times in a list I
n = len(I)
For j = 1 to n:
For each interval I[i] that precedes I[j] and overlaps it:
Exclude the label of I[i] from consideration for I[j]
The jth interval is assigned a nonexcluded label
However isn't this algorithm's running time O(n^2)? I thought of a better one, but I'm not sure if it is correct. Basically, we modify the algorithm for Interval Scheduling.
Sort the n intervals based on their finishing time and store in array F
A = []
x = 1
S = array with property that S[i] contains the finishing time of the ith interval
i = 0
while F is not empty:
new = []
For j = 0 to n-1
if S[j] >= F[i] #this interval does not overlap with the ith interval
i = j
A.add(F[j])
else:
new.add(F[j])
F = new
Recompute S
print x, A
A = []
x++
I might have a silly bug in the pseudocode but I hope you understand what my algorithm is trying to do. Basically I'm peeling off all possible colourings by repeatedly running the standard interval scheduling algorithm.
So, shouldn't my algorithm print all the various colourings of the intervals? Is this more efficient than the previous algorithm, if it is a correct algorithm? If it is not, could you please provide a counter-example?
Thanks!
(More about interval scheduling here: https://en.wikipedia.org/wiki/Interval_scheduling)
Your algorithm can produce a suboptimal solution. Here's an example of what your algorithm will produce:
--- -----
----
------------
Clearly the following 2-colour solution (which will be produced by the proven-optimal algorithm) is possible for the same input:
--- ------------
---- -----
Finally, although the pseudocode description of the book's proposed algorithm does indeed look like it needs O(n^2) time, it can be sped up using other data structures. Here's an O(n log n) algorithm: Instead of looping through all n intervals, loop through all 2n interval endpoints in increasing order. Maintain a heap (priority queue) of available colours ordered by colour, which initially contains n colours; every time we see an interval start point, extract the smallest colour from the heap and assign it to this interval; every time we see an interval end point, reinsert its assigned colour into the heap. Each endpoint (start or end) processed is O(log n) time, and there are 2n of them. This matches the time complexity for sorting the intervals, so the overall time complexity is also O(n log n).

Big O for exponential complexity specific case

Let's an algorithm to find all paths between two nodes in a directed, acyclic, non-weighted graph, that may contain more than one edge between the same two vertices. (this DAG is just an example, please I'm not discussing this case specifically, so disregard it's correctness though it's correct, I think).
We have two effecting factors which are:
mc: max number of outgoing edges from a vertex.
ml: length of the max length path measured by number of edges.
Using an iterative fashion to solve the problem, where complexity in the following stands for count of processing operations done.
for the first iteration the complexity = mc
for the second iteration the complexity = mc*mc
for the third iteration the complexity = mc*mc*mc
for the (max length path)th iteration the complexity = mc^ml
Total worst complexity is (mc + mc*mc + ... + mc^ml).
1- can we say it's O(mc^ml)?
2- Is this exponential complexity?, as I know, in exponential complexity, the variable only appear at the exponent, not at the base.
3- Are mc and ml both are variables in my algorithm comlexity?
There's a faster way to achieve the answer in O(V + E), but seems like your question is about calculating complexity, not about optimizing algorithm.
Yes, seems like it's O(mc^ml)
Yes, they bot can be variables in your algorithm complexity
As about the complexity of your algorithm: let's do some transformation, using the fact that a^b = e^(b*ln(a)):
mc^ml = (e^ln(mc))^ml = e^(ml*ln(mc)) < e^(ml*mc) if ml,mc -> infinity
So, basically, your algorithm complexity upperbound is O(e^(ml*mc)), but we can still shorten it to see, if it's really an exponential complexity. Assume that ml, mc <= N, where N is, let's say, max(ml, mc). So:
e^(ml*mc) <= e^N^2 = e^(e^2*ln(N)) = (e^e)^(2*ln(N)) < (e^e)^(2*N) = [C = e^e] = O(C^N).
So, your algorithm complexity will be O(C^N), where C is a constant, and N is something that growth not faster than linear. So, basically - yes, it is exponetinal complexity.

Greedy Attempt for covering all the numbers with the given intervals

Let S be a set of intervals (containing n number of intervals) of the natural numbers that might overlap and N be a list of numbers (containing n number of numbers).
I want to find the smallest subset (let's call P) of S such that for each number
in our list N, there exists at least one interval in P that contains it. The intervals in P are allowed to overlap.
Trivial example:
S = {[1..4], [2..7], [3..5], [8..15], [9..13]}
N = [1, 4, 5]
// so P = {[1..4], [2..7]}
I think a dynamic algorithm might not work always, so if anybody knows of a solution to this problem (or a similar one that can be converted into), that would be great. I am trying to make a O(n^2 solution)
Here is one greedy approach
P = {}
for each q in N: // O(n)
if q in P // O(n)
continue
for each i in S // O(n)
if q in I: // O(n)
P.add(i)
break
But that is O(n^4).. Any help with creating a greedy approach that is O(n^2) would be great!
Thanks!
* Update: * I've been slamming at this problem and I think I have an O(n^2) solution!!
Let me know if you think I'm right!!!
N = MergeSort (N)
upper, lower = infinity, -1
P = empty set
for each q in N do
if (q>=lower and q<=upper)=False
max_interval = [-infinity, infinity]
for each r in S do
if q in r then
if r.rightEndPoint > max_interval.rightEndPoint
max_interval = r
P.append(max_interval)
lower = max_interval.leftEndPoint
upper = max_interval.rightEndPoint
S.remove(max_interval)
I think this should work!! I'm trying to find a counter solution; but yeah!!
This problem is similar to set cover problem, which is NP-complete (i.e., arguably has no solution faster than exponential). What makes it different is that intervals always cover adjacent elements (not arbitrary subset of N), which opens ways for faster solutions.
http://en.wikipedia.org/wiki/Set_cover_problem
I think that the solution proposed by Mike is good enough. But I think I have quite straightforward O(N^2) greedy algo. It starts like the Mike's one (moreover, I believe Mike's solution can also be improved in similar way):
You sort your N numbers and place them sorted into array ELEM; COMPLEXITY O(N*lg N);
Using binary search, for each interval S[i] you identify starting and ending index of elements in ELEM that are covered by S[i]. Say, you place this pair of numbers into array COVER, the difference between the two indices tells you how many elements you cover, for simplicity, let us place it array COVER_COUNT; COMPLEXITY O(N*lg N);
You introduce index pointer p, that shows till which element in ELEM, your N is already covered. you set p = 0, meaning that all elements up to 0-th (excluded) are initially covered (i.e., no elements); Complexity O(1). Moreover you introduce boolean array IS_INCLUDED, that reflects if interval S[i] is already included in your coverage set. Complexity O(N)
Then you start from the 0-th element in ELEM and see what is the interval that contains ELEM[0] and has greater coverage COVER_COUNT[i]. Imagine that it is i-th interval. We then mark it as included by setting IS_INCLUDED[i] to true. Then you set p to end[i] + 1 where end[i] is the ending index in COVER[i] pair (indeed now all elements til end[i] are covered). Then, knowing p you update all elements in COVER_COUNT so that they reflect how many elements of not yet covered elements each interval covers (this can be easily done in O(N) time). Then you perform the same step for ELEM[p] and continues till p >= ELEM.length. It can be observed that the overall complexity is O(N^2).
You finish in O(n^2) and in IS_INCLUDED has true for intervals of S included in optimal cover set
Let me know if this solution seems reasonable to you and if I calculated everything well.
P.S. Just wanted to add that the optimality of ythe solution found by algo can be proved by induction and contradiction. By contradiction, it is easy to show that at least one optimal solution includes the longest interval of those covering element ELEM[0]. If so, by induction we can show that for each next element in algo, we can keep on following the strategy of selelcting the interval that is the longest with respect to the number of remaining elements covered and that covers the leftmost yet uncovered element.
I am not sure, but mb some think like this.
1) For each interval create a list with elements from N witch contain in interval, it will take O(n^2) lets call it Q[i] for S[i]
2) Then sort our S by length of Q[i], O(n*lg(n))
3) Go throw this array excluding Q[i] from N O(n) and from Q[i+1]...Q[n] = O(n^2)
4) Repeat 2 while N is not empty.
It's not O(n^2), it's O(n^3) but if you can use hashmap, i think you can improve this.

differential equation VS Algorithms complexity

I don't know if it's the right place to ask because my question is about how to calculate a computer science algorithm complexity using differential equation growth and decay method.
The algorithm that I would like to prove is Binary search for a sorted array, which has a complexity of log2(n)
The algorithm says: if the target value are searching for is equal to the mid element, then return its index. If if it's less, then search on the left sub-array, if greater search on the right sub-array.
As you can see each time N(t): [number of nodes at time t] is being divided by half. Therefore, we can say that it takes O(log2(n)) to find an element.
Now using differential equation growth and decay method.
dN(t)/dt = N(t)/2
dN(t): How fast the number of elements is increasing or decreasing
dt: With respect to time
N(t): Number of elements at time t
The above equation says that the number of cells is being divided by 2 with time.
Solving the above equations gives us:
dN(t)/N(t) = dt/2
ln(N(t)) = t/2 + c
t = ln(N(t))*2 + d
Even though we got t = ln(N(t)) and not log2(N(t)), we can still say that it's logarithmic.
Unfortunately, the above method, even if it makes sense while approaching it to finding binary search complexity, turns out it does not work for all algorithms. Here's a counter example:
Searching an array linearly: O(n)
dN(t)/dt = N(t)
dN(t)/N(t) = dt
t = ln(N(t)) + d
So according to this method, the complexity of searching linearly takes O(ln(n)) which is NOT true of course.
This differential equation method is called growth and decay and it's very popluar. So I would like to know if this method could be applied in computer science algorithm like the one I picked, and if yes, what did I do wrong to get incorrect result for the linear search ? Thank you
The time an algorithm takes to execute is proportional to the number
of steps covered(reduced here).
In your linear searching of the array, you have assumed that dN(t)/dt = N(t).
Incorrect Assumption :-
dN(t)/dt = N(t)
dN(t)/N(t) = dt
t = ln(N(t)) + d
Going as per your previous assumption, the binary-search is decreasing the factor by 1/2 terms(half-terms are directly reduced for traversal in each of the pass of array-traversal,thereby reducing the number of search terms by half). So, your point of dN(t)/dt=N(t)/2 was fine. But, when you are talking of searching an array linearly, obviously, you are accessing the element in one single pass and hence, your searching terms are decreasing in the order of one item in each of the passes. So, how come your assumption be true???
Correct Assumption :-
dN(t)/dt = 1
dN(t)/1 = dt
t = N(t) + d
I hope you got my point. The array elements are being accessed sequentially one pass(iteration) each. So, the array accessing is not changing in order of N(t), but in order of a constant 1. So, this N(T) order result!

Subset-Sum in Linear Time

This was a question on our Algorithms final exam. It's verbatim because the prof let us take a copy of the exam home.
(20 points) Let I = {r1,r2,...,rn} be a set of n arbitrary positive integers and the values in I are distinct. I is not given in any sorted order. Suppose we want to find a subset I' of I such that the total sum of all elements in I' is exactly 100*ceil(n^.5) (each element of I can appear at most once in I'). Present an O(n) time algorithm for solving this problem.
As far as I can tell, it's basically a special case of the knapsack problem, otherwise known as the subset-sum problem ... both of which are in NP and in theory impossible to solve in linear time?
So ... was this a trick question?
This SO post basically explains that a pseudo-polynomial (linear) time approximation can be done if the weights are bounded, but in the exam problem the weights aren't bounded and either way given the overall difficulty of the exam I'd be shocked if the prof expected us to know/come up with an obscure dynamic optimization algorithm.
There are two things that make this problem possible:
The input can be truncated to size O(sqrt(n)). There are no negative inputs, so you can discard any numbers greater than 100*sqrt(n), and all inputs are distinct so we know there are at most 100*sqrt(n) inputs that matter.
The playing field has size O(sqrt(n)). Although there are O(2^sqrt(n)) ways to combine the O(sqrt(n)) inputs that matter, you don't have to care about combinations that either leave the 100*sqrt(n) range or redundantly hit a target you can already reach.
Basically, this problem screams dynamic programming with each input being checked against each part of the 'reached number' space somehow.
The solution ends up being a matter of ensuring numbers don't reach off of themselves (by scanning in the right direction), of only looking at each number once, and of giving ourselves enough information to reconstruct the solution afterwards.
Here's some C# code that should solve the problem in the given time:
int[] FindSubsetToImpliedTarget(int[] inputs) {
var target = 100*(int)Math.Ceiling(Math.Sqrt(inputs.Count));
// build up how-X-was-reached table
var reached = new int?[target+1];
reached[0] = 0; // the empty set reaches 0
foreach (var e in inputs) {
// we go backwards to avoid reaching off of ourselves
for (var i = target; i >= e; i--) {
if (reached[i-e].HasValue) {
reached[i] = e;
}
}
}
// was target even reached?
if (!reached[target].HasValue) return null;
// build result by back-tracking via the logged reached values
var result = new List<int>();
for (var i = target; reached[i] != 0; i -= reached[i].Value) {
result.Add(reached[i].Value);
}
return result.ToArray();
}
I haven't actually tested the above code, so beware typos and off-by-ones.
With the typical DP algorithm for subset-sum problem will obtain O(N) time consuming algorithm. We use dp[i][k] (boolean) to indicate whether the first i items have a subset with sum k,the transition equation is:
dp[i][k] = (dp[i-1][k-v[i] || dp[i-1][k]),
it is O(NM) where N is the size of the set and M is the targeted sum. Since the elements are distinct and the sum must equal to 100*ceil(n^.5), we just need consider at most the first 100*ceil(n^.5) items, then we get N<=100*ceil(n^.5) and M = 100*ceil(n^.5).
The DP algorithm is O(N*M) = O(100*ceil(n^.5) * 100*ceil(n^.5)) = O(n).
Ok following is a simple solution in O(n) time.
Since the required sum S is of the order of O(n^0.5), if we formulate an algorithm of complexity S^2, then we are good since our algorithm shall be of effective complexity O(n).
Iterate once over all the elements and check if the value is less than S or not. If it is then push it in a new array. This array shall contain a maximum of S elements (O(n^.5))
Sort this array in descending order in O(sqrt(n)*logn) time ( < O(n)). This is so because logn <= sqrt(n) for all natural numbers. https://math.stackexchange.com/questions/65793/how-to-prove-log-n-leq-sqrt-n-over-natural-numbers
Now this problem is a 1D knapsack problem with W = S and number of elements = S (upper bound).
Maximize the total weight of items and see if it equals S.
It can be solved using dynamic programming in linear time (linear wrt W ~ S).

Resources