dynamic programming to find the minimum weight cover of the points - algorithm

You are given n points p1, p2, . . . , pn on the real line. The location of pi
is given by its coordinate xi
. You
are also given m intervals I1, I2, . . . , Im where Ij = [aj , bj ] (aj is the left end point and bj is the right end
point). Each interval j has a non-negative weight wj . An interval Ij is said to cover pi
if xi ∈ [aj , bj ]. A
subset S ⊆ {I1, I2, . . . , Im} of intervals is a cover for the given points if for each pi
, 1 ≤ i ≤ n, there is some
interval in S that covers pi
. In the figure below, the intervals shown in bold form a cover of the points.
The goal is to find a minimum weight cover of the points. Note that a minimum weight cover may differ
from a cover with minimum number of intervals.
ps: I first tried to find the smallest interval, each time we find a valid smallest interval we remove its covered points from P[ ] until P[ ] is empty, but this algorithms can be easily prove to be wrong. Then I tried to pick the minimum ratio of weight to number of its covering points, but I do not know how to apply it into dp with memoization.

So what we can do here is first sort all the p[] array values and also sort the intervals according to a[] i.e the left end point.
Now at each state we can have two index, the first is the index of the interval and the second is the index of the point
So at each state what we have two cases:
The first case is simple, i.e we do not select the current interval.
In the second case what we can do is select the current interval if the point lies inside, also note that if we take the current interval than we can just move the index of the point in the new state forward by the no of points we find in the current interval since they are sorted.
Pseudo code for Recursive Dynamic Programming solution using memoisation:
#define INF 1e9
int n = 100;
int m = 100;
int p[100], a[100], b[100], w[100], memo[100][100];//initialize memo to -1
int solve(int intervalIndex, int pointIndex){
if(intervalIndex == m){
if(pointIndex == n) return 0;
else return INF;
}
if(pointIndex == n) return 0;
if(memo[intervalIndex][pointIndex] != -1) return memo[intervalIndex][pointIndex];
//we can make 2 choices either to select this interval ao not to select this interval
//if we select this interval, then we also take the points that it covers
//case #1 : do not take current interval
int ans = solve(intervalIndex + 1, pointIndex);
if(a[intervalIndex] <= p[pointIndex] && b[intervalIndex] >= p[pointIndex]){ //i.e this point is inside curr interval
//case # 2 : take current interval, and all the points it contains
int index = pointIndex;
for(int i = pointIndex + 1; i < n; i++){
if(a[intervalIndex] <= p[i] && b[i] >= p[i]){
index = i;
}else{
break;
}
}
ans = min(ans, solve(intervalIndex + 1, index + 1) + w[intervalIndex]);
}
return memo[intervalIndex][pointIndex] = ans;
}
The above can be modified for real numbers easily.
The complexity of above code is O(m*n*n), but it can be reduced to O(m * n), if we just precompute the value that what is the farthest point index corresponding to an interval, so that we do not have to use the for loop inside to search for it, instead just use the precomputed value.

Related

Algorithm Problem: Finding all cells that has distance of K from some specific cells in a 2D grid [duplicate]

I am attempting to solve a coding challenge however my solution is not very performant, I'm looking for advice or suggestions on how I can improve my algorithm.
The puzzle is as follows:
You are given a grid of cells that represents an orchard, each cell can be either an empty spot (0) or a fruit tree (1). A farmer wishes to know how many empty spots there are within the orchard that are within k distance from all fruit trees.
Distance is counted using taxicab geometry, for example:
k = 1
[1, 0]
[0, 0]
the answer is 2 as only the bottom right spot is >k distance from all trees.
My solution goes something like this:
loop over grid and store all tree positions
BFS from the first tree position and store all empty spots until we reach a neighbour that is beyond k distance
BFS from the next tree position and store the intersection of empty spots
Repeat step 3 until we have iterated over all tree positions
Return the number of empty spots remaining after all intersections
I have found that for large grids with large values of k, my algorithm becomes very slow as I end up checking every spot in the grid multiple times. After doing some research, I found some solutions for similar problems that suggest taking the two most extreme target nodes and then only comparing distance to them:
https://www.codingninjas.com/codestudio/problem-details/count-nodes-within-k-distance_992849
https://www.geeksforgeeks.org/count-nodes-within-k-distance-from-all-nodes-in-a-set/
However this does not work for my challenge given certain inputs like below:
k = 4
[0, 0, 0, 1]
[0, 1, 0, 0]
[0, 0, 0, 0]
[1, 0, 0, 0]
[0, 0, 0, 0]
Using the extreme nodes approach, the bottom right empty spot is counted even though it is 5 distance away from the middle tree.
Could anyone point me towards a more efficient approach? I am still very new to these types of problems so I am finding it hard to see the next step I should take.
There is a simple, linear time solution to this problem because of the grid and distance structure. Given a fruit tree with coordinates (a, b), consider the 4 diagonal lines bounding the box of distance k around it. The diagonals going down and to the right have a constant value of x + y, while the diagonals going down and to the left have a constant value of x - y.
A point (x, y) is inside the box (and therefore, within distance k of (a, b)) if and only if:
a + b - k <= x + y <= a + b + k, and
a - b - k <= x - y <= a - b + k
So we can iterate over our fruit trees (a, b) to find four numbers:
first_max = max(a + b - k); first_min = min(a + b + k);
second_max = max(a - b - k); second_min = min(a - b + k);
where min and max are taken over all fruit trees. Then, iterate over empty cells (or do some math and subtract fruit tree counts, if your grid is enormous), counting how many empty spots (x,y) satisfy
first_max <= x + y <= first_min, and
second_max <= x - y <= second_min.
This Python code (written in a procedural style) illustrates this idea. Each diagonal of each bounding box cuts off exactly half of the plane, so this is equivalent to intersection of parallel half planes:
fruit_trees = [(a, b) for a in range(len(grid))
for b in range(len(grid[0]))
if grid[a][b] == 1]
northwest_half_plane = -infinity
southeast_half_plane = infinity
southwest_half_plane = -infinity
northeast_half_plane = infinity
for a, b in fruit_trees:
northwest_half_plane = max(northwest_half_plane, a - b - k)
southeast_half_plane = min(southeast_half_plane, a - b + k)
southwest_half_plane = max(southwest_half_plane, a + b - k)
northeast_half_plane = min(northeast_half_plane, a + b + k)
count = 0
for x in range(len(grid)):
for y in range(len(grid[0])):
if grid[x][y] == 0:
if (northwest_half_plane <= x - y <= southeast_half_plane
and southwest_half_plane <= x + y <= northeast_half_plane):
count += 1
print(count)
Some notes on the code: Technically the array coordinates are a quarter-turn rotated from the Cartesian coordinates of the picture, but that is immaterial here. The code is left deliberately bereft of certain 'optimizations' which may seem obvious, for two reasons: 1. The best optimization depends on the input format of fruit trees and the grid, and 2. The solution, while being simple in concept and simple to read, is not simple to get right while writing, and it's important that the code be 'obviously correct'. Things like 'exit early and return 0 if a lower bound exceeds an upper bound' can be added later if the performance is necessary.
As Answered by #kcsquared ,Providing an implementation in JAVA
public int solutionGrid(int K, int [][]A){
int m=A.length;
int n=A[0].length;
int k=K;
//to store the house coordinates
Set<String> houses=new HashSet<>();
//Find the house and store the coordinates
for(int i=0;i<m;i++) {
for (int j = 0; j < n; j++) {
if (A[i][j] == 1) {
houses.add(i + "&" + j);
}
}
}
int northwest_half_plane = Integer.MIN_VALUE;
int southeast_half_plane = Integer.MAX_VALUE;
int southwest_half_plane = Integer.MIN_VALUE;
int northeast_half_plane = Integer.MAX_VALUE;
for(String ele:houses){
String arr[]=ele.split("&");
int a=Integer.valueOf(arr[0]);
int b=Integer.valueOf(arr[1]);
northwest_half_plane = Math.max(northwest_half_plane, a - b - k);
southeast_half_plane = Math.min(southeast_half_plane, a - b + k);
southwest_half_plane = Math.max(southwest_half_plane, a + b - k);
northeast_half_plane = Math.min(northeast_half_plane, a + b + k);
}
int count = 0;
for(int x=0;x<m;x++) {
for (int y = 0; y < n; y++) {
if (A[x][y] == 0){
if ((northwest_half_plane <= x - y && x - y <= southeast_half_plane)
&& southwest_half_plane <= x + y && x + y <= northeast_half_plane){
count += 1;
}
}
}
}
return count;
}
This wouldn't be easy to implement but could be sublinear for many cases, and at most linear. Consider representing the perimeter of each tree as four corners (they mark a square rotated 45 degrees). For each tree compute it's perimeter intersection with the current intersection. The difficulty comes with managing the corners of the intersection, which could include more than one point because of the diagonal alignments. Run inside the final intersection to count how many empty spots are within it.
Since you are using taxicab distance, BFS is unneccesary. You can compute the distance between an empty spot and a tree directly.
This algorithm is based on a suggestion by https://stackoverflow.com/users/3080723/stef
// select tree near top left corner
SET flag false
LOOP r over rows
LOOP c over columns
IF tree at c, r
SET t to tree at c,r
SET flag true
BREAK
IF flag
BREAK
LOOP s over empty spots
Calculate distance between s and t
IF distance <= k
ADD s to spotlist
LOOP s over spotlist
LOOP t over trees, starting at bottom right corner
Calculate distance between s and t
IF distance > k
REMOVE s from spotlist
BREAK
RETURN spotlist

Count nodes within k distance of marked nodes in grid

I am attempting to solve a coding challenge however my solution is not very performant, I'm looking for advice or suggestions on how I can improve my algorithm.
The puzzle is as follows:
You are given a grid of cells that represents an orchard, each cell can be either an empty spot (0) or a fruit tree (1). A farmer wishes to know how many empty spots there are within the orchard that are within k distance from all fruit trees.
Distance is counted using taxicab geometry, for example:
k = 1
[1, 0]
[0, 0]
the answer is 2 as only the bottom right spot is >k distance from all trees.
My solution goes something like this:
loop over grid and store all tree positions
BFS from the first tree position and store all empty spots until we reach a neighbour that is beyond k distance
BFS from the next tree position and store the intersection of empty spots
Repeat step 3 until we have iterated over all tree positions
Return the number of empty spots remaining after all intersections
I have found that for large grids with large values of k, my algorithm becomes very slow as I end up checking every spot in the grid multiple times. After doing some research, I found some solutions for similar problems that suggest taking the two most extreme target nodes and then only comparing distance to them:
https://www.codingninjas.com/codestudio/problem-details/count-nodes-within-k-distance_992849
https://www.geeksforgeeks.org/count-nodes-within-k-distance-from-all-nodes-in-a-set/
However this does not work for my challenge given certain inputs like below:
k = 4
[0, 0, 0, 1]
[0, 1, 0, 0]
[0, 0, 0, 0]
[1, 0, 0, 0]
[0, 0, 0, 0]
Using the extreme nodes approach, the bottom right empty spot is counted even though it is 5 distance away from the middle tree.
Could anyone point me towards a more efficient approach? I am still very new to these types of problems so I am finding it hard to see the next step I should take.
There is a simple, linear time solution to this problem because of the grid and distance structure. Given a fruit tree with coordinates (a, b), consider the 4 diagonal lines bounding the box of distance k around it. The diagonals going down and to the right have a constant value of x + y, while the diagonals going down and to the left have a constant value of x - y.
A point (x, y) is inside the box (and therefore, within distance k of (a, b)) if and only if:
a + b - k <= x + y <= a + b + k, and
a - b - k <= x - y <= a - b + k
So we can iterate over our fruit trees (a, b) to find four numbers:
first_max = max(a + b - k); first_min = min(a + b + k);
second_max = max(a - b - k); second_min = min(a - b + k);
where min and max are taken over all fruit trees. Then, iterate over empty cells (or do some math and subtract fruit tree counts, if your grid is enormous), counting how many empty spots (x,y) satisfy
first_max <= x + y <= first_min, and
second_max <= x - y <= second_min.
This Python code (written in a procedural style) illustrates this idea. Each diagonal of each bounding box cuts off exactly half of the plane, so this is equivalent to intersection of parallel half planes:
fruit_trees = [(a, b) for a in range(len(grid))
for b in range(len(grid[0]))
if grid[a][b] == 1]
northwest_half_plane = -infinity
southeast_half_plane = infinity
southwest_half_plane = -infinity
northeast_half_plane = infinity
for a, b in fruit_trees:
northwest_half_plane = max(northwest_half_plane, a - b - k)
southeast_half_plane = min(southeast_half_plane, a - b + k)
southwest_half_plane = max(southwest_half_plane, a + b - k)
northeast_half_plane = min(northeast_half_plane, a + b + k)
count = 0
for x in range(len(grid)):
for y in range(len(grid[0])):
if grid[x][y] == 0:
if (northwest_half_plane <= x - y <= southeast_half_plane
and southwest_half_plane <= x + y <= northeast_half_plane):
count += 1
print(count)
Some notes on the code: Technically the array coordinates are a quarter-turn rotated from the Cartesian coordinates of the picture, but that is immaterial here. The code is left deliberately bereft of certain 'optimizations' which may seem obvious, for two reasons: 1. The best optimization depends on the input format of fruit trees and the grid, and 2. The solution, while being simple in concept and simple to read, is not simple to get right while writing, and it's important that the code be 'obviously correct'. Things like 'exit early and return 0 if a lower bound exceeds an upper bound' can be added later if the performance is necessary.
As Answered by #kcsquared ,Providing an implementation in JAVA
public int solutionGrid(int K, int [][]A){
int m=A.length;
int n=A[0].length;
int k=K;
//to store the house coordinates
Set<String> houses=new HashSet<>();
//Find the house and store the coordinates
for(int i=0;i<m;i++) {
for (int j = 0; j < n; j++) {
if (A[i][j] == 1) {
houses.add(i + "&" + j);
}
}
}
int northwest_half_plane = Integer.MIN_VALUE;
int southeast_half_plane = Integer.MAX_VALUE;
int southwest_half_plane = Integer.MIN_VALUE;
int northeast_half_plane = Integer.MAX_VALUE;
for(String ele:houses){
String arr[]=ele.split("&");
int a=Integer.valueOf(arr[0]);
int b=Integer.valueOf(arr[1]);
northwest_half_plane = Math.max(northwest_half_plane, a - b - k);
southeast_half_plane = Math.min(southeast_half_plane, a - b + k);
southwest_half_plane = Math.max(southwest_half_plane, a + b - k);
northeast_half_plane = Math.min(northeast_half_plane, a + b + k);
}
int count = 0;
for(int x=0;x<m;x++) {
for (int y = 0; y < n; y++) {
if (A[x][y] == 0){
if ((northwest_half_plane <= x - y && x - y <= southeast_half_plane)
&& southwest_half_plane <= x + y && x + y <= northeast_half_plane){
count += 1;
}
}
}
}
return count;
}
This wouldn't be easy to implement but could be sublinear for many cases, and at most linear. Consider representing the perimeter of each tree as four corners (they mark a square rotated 45 degrees). For each tree compute it's perimeter intersection with the current intersection. The difficulty comes with managing the corners of the intersection, which could include more than one point because of the diagonal alignments. Run inside the final intersection to count how many empty spots are within it.
Since you are using taxicab distance, BFS is unneccesary. You can compute the distance between an empty spot and a tree directly.
This algorithm is based on a suggestion by https://stackoverflow.com/users/3080723/stef
// select tree near top left corner
SET flag false
LOOP r over rows
LOOP c over columns
IF tree at c, r
SET t to tree at c,r
SET flag true
BREAK
IF flag
BREAK
LOOP s over empty spots
Calculate distance between s and t
IF distance <= k
ADD s to spotlist
LOOP s over spotlist
LOOP t over trees, starting at bottom right corner
Calculate distance between s and t
IF distance > k
REMOVE s from spotlist
BREAK
RETURN spotlist

Path of Length N in graph with constraints

I want to find number of path of length N in a graph where the vertex can be any natural number. However two vertex are connected only if the product of the two vertices is less than some natural number P. If the product of two vertexes are greater than P than those are not connected and can't be reached from one other.
I can obviously run two nested loops (<= P) and create an adjacency matrix, but P can be extremely large and this approach would be extremely slow. Can anyone think of some optimal approach to solve the problem? Can we solve it using Dynamic Programming?
I agree with Ante's recurrence, although I used a slightly simplified version. Note that I'm using the letter P to name the maximum product, as it is used in the original problem statement:
f(1,x) = 1
f(i,x) = sum(f(i-1, y) for y in {1, ..., floor(P/x)})
f(i,x) is the number of sequences of length i that end with x. The answer to the question is then f(n+1, 1).
Of course since P can be up to 10^9 in this task, a straightforward implementation with a DP table is out of the question. However, there are only up to m < 70000 possible different values of floor(P/i). So let's find the maximal segments aj ... bj, where floor(P/aj) = floor(P/bj). We can find those segments in O(number of segments * log P) using binary search.
Imagine the full DP table for f. Since there are only m different values for floor(P/x), every row of f consists of m contiguous ranges that have the same value.
So let's compute the compressed DP table, where we represent the rows as list of (length, value) pairs. We start with f(1) = [(P, 1)] and we can compute f(i+1) from f(i) by processing the segments in increasing order and computing prefix sums of the lengths stored in f(i).
The total runtime of my implementation of this approach is O(m (log P + n)). This is the code I used:
using ll=long long;
const int mod = 1000000007;
void add(int& x, ll y) { x = (x+y)%mod; }
int main() {
int n, P;
cin >> n >> P;
int x = 1;
vector<pair<int,int>> segments;
while(x <= P) {
int y = x+1, hi = P+1;
while(y<hi) {
int mid = (y+hi)/2;
if (P/mid < P/x) hi=mid;
else y=mid+1;
}
segments.push_back(make_pair(P/x, y-x));
x = y;
}
reverse(begin(segments), end(segments));
vector<pair<int,int>> dp;
dp.push_back(make_pair(P,1));
for (int i = 1; i <= n; ++i) {
int j = 0;
int sum_smaller = 0, cnt_smaller = 0;
vector<pair<int,int>> dp2;
for (auto it : segments) {
int value = it.first, cnt = it.second;
while (cnt_smaller + dp[j].first <= value) {
cnt_smaller += dp[j].first;
add(sum_smaller,(ll)dp[j].first*dp[j].second);
j++;
}
int pref_sum = sum_smaller;
if (value > cnt_smaller)
add(pref_sum, (ll)(value - cnt_smaller)*dp[j].second);
dp2.push_back(make_pair(cnt, pref_sum));
}
dp = dp2;
reverse(begin(dp),end(dp));
}
cout << dp[0].second << endl;
}
I needed to do some micro-optimizations with the handling of the arrays to get AC, but those aren't really relevant, so I left them away.
If number of vertices is small than adjacency matrix (A) can help. Since sum of elements in A^N is number of distinct paths, if paths are oriented. If not than number of paths i sum of elements / 2. That is due an element (i,j) represents number of paths from vertex i to vertex j.
In this case, same approach can be done by DP, using reasoning that number of paths of length n from vertex v is sum of numbers of paths of length n-1 of all it's neighbours. Neigbours of vertex i are vertices from 1 to floor(Q/i). With that we can construct function N(vertex, length) which represent number of paths from given vertex with given length:
N(i, 1) = floor(Q/i),
N(i, n) = sum( N(j, n-1) for j in {1, ..., floor(Q/i)}.
Number of all oriented paths of length is sum( N(i,N) ).

Specific Max Sum of the elements of an Int array - C/C++

Let's say we have an array: 7 3 1 1 6 13 8 3 3
I have to find the maximum sum of this array such that:
if i add 13 to the sum: i cannot add the neighboring elements from each side: 6 1 and 8 3 cannot be added to the sum
i can skip as many elements as necessary to make the sum max
My algorithm was this:
I take the max element of the array and add that to the sum
I make that element and the neighbor elements -1
I keep doing this until it's not possible to find anymore max
The problem is that for some specific test cases this algorithm is wrong.
Lets see this one: 15 40 45 35
according to my algorithm:
I take 45 and make neighbors -1
The program ends
The correct way to do it is 15 + 35 = 50
This problem can be solved with dynamic programming.
Let A be the array, let DP[m] be the max sum in {A[1]~A[m]}
Every element in A only have two status, been added into the sum or not. First we suppose we have determine DP[1]~DP[m-1], now look at {A[1]~A[m]}, A[m] only have two status that we have said, if A[m] have been added into, A[m-1] and A[m-2] can't be added into the sum, so in add status, the max sum is A[m]+DP[m-3] (intention: DP[m-3] has been the max sum in {A[1]~A[m-3]}), if A[m] have not been added into the sum, the max sum is DP[m-1], so we just need to compare A[m]+DP[m-3] and DP[m-1], the bigger is DP[m]. The thought is the same as mathematical induction.
So the DP equation is DP[m] = max{ DP[m-3]+A[m], DP[m-1] },DP[size(A)] is the result
The complexity is O(n), pseudocode is follow:
DP[1] = A[1];
DP[2] = max(DP[1], DP[2]);
DP[3] = max(DP[1], DP[2], DP[3]);
for(i = 4; i <= size(A); i++) {
DP[i] = DP[i-3] + A[i];
if(DP[i] < DP[i-1])
DP[i] = DP[i-1];
}
It's solvable with a dynamic programming approach, taking O(N) time and O(N) space. Implementation following:
int max_sum(int pos){
if( pos >= N){ // N = array_size
return 0;
}
if( visited(pos) == true ){ // if this state is already checked
return ret[pos]; // ret[i] contains the result for i'th cell
}
ret[pos] = max_sum(pos+3) + A[pos] + ret[pos-2]; // taking this item
ret[pos] = max(ret[pos], ret[pos-1]+ max_sum(pos+1) ); // if skipping this item is better
visited[pos] = true;
return ret[pos];
}
int main(){
// clear the visited array
// and other initializations
cout << max_sum(2) << endl; //for i < 2, ret[i] = A[i]
}
The above problem is max independent set problem (with twist) in a path graph which has dynamic programming solution in O(N).
Recurrence relation for solving it : -
Max(N) = maximum(Max(N-3) + A[N] , Max(N-1))
Explanation:- IF we have to select maximum set from N elements than we can either select Nth element and the maximum set from first N-3 element or we can select maximum from first N-1 elements excluding Nth element.
Pseudo Code : -
Max(1) = A[1];
Max(2) = maximum(A[1],A[2]);
Max(3) = maximum(A[3],Max(2));
for(i=4;i<=N;i++) {
Max(N) = maximum(Max(N-3)+A[N],Max(N-1));
}
As suggested, this is a dynamic programming problem.
First, some notation, Let:
A be the array, of integers, of length N
A[a..b) be the subset of A containing the elements at index a up to
but not including b (the half open interval).
M be an array such that M[k] is the specific max sum of A[0..k)
such that M[N] is the answer to our original problem.
We can describe an element of M (M[n]) by its relation to one or more elements of M (M[k]) where k < n. And this lends itself to a nice linear time algorithm. So what is this relationship?
The base cases are as follows:
M[0] is the max specific sum of the empty list, which must be 0.
M[1] is the max specific sum for a single element, so must be
that element: A[0].
M[2] is the max specific sum of the first two elements. With only
two elements, we can either pick the first or the second, so we better
pick the larger of the two: max(A[0], A[1]).
Now, how do we calculate M[n] if we know M[0..n)? Well, we have a choice to make:
Either we add A[n-1] (the last element in A[0..n)) or we don't. We don't know for
certain whether adding A[n-1] in will make for a larger sum, so we try both and take
the max:
If we don't add A[n-1] what would the sum be? It would be the same as the
max specific sum immediately before it: M[n-1].
If we do add A[n-1] then we can't have the previous two elements in our
solution, but we can have any elements before those. We know that M[n-1] and
M[n-2] might have used those previous two elements, but M[n-3] definitely
didn't, because it is the max in the range A[0..n-3). So we get
M[n-3] + A[n-1].
We don't know which one is bigger though, (M[n-1] or M[n-3] + A[n-1]), so to find
the max specific sum at M[n] we must take the max of those two.
So the relation becomes:
M[0] = 0
M[1] = A[0]
M[2] = max {A[0], A[1]}
M[n] = max {M[n-1], M[n-3] + A[n-1]} where n > 2
Note a lot of answers seem to ignore the case for the empty list, but it is
definitely a valid input, so should be accounted for.
The simple translation of the solution in C++ is as follows:
(Take special note of the fact that the size of m is one bigger than the size of a)
int max_specific_sum(std::vector<int> a)
{
std::vector<int> m( a.size() + 1 );
m[0] = 0; m[1] = a[0]; m[2] = std::max(a[0], a[1]);
for(unsigned int i = 3; i <= a.size(); ++i)
m[i] = std::max(m[i-1], m[i-3] + a[i-1]);
return m.back();
}
BUT This implementation has a linear space requirement in the size of A. If you look at the definition of M[n], you will see that it only relies on M[n-1] and M[n-3] (and not the whole preceding list of elements), and this means you need only store the previous 3 elements in M, resulting in a constant space requirement. (The details of this implementation are left to the OP).

nth smallest element in a union of an array of intervals with repetition

I want to know if there is a more efficient solution than what I came up with(not coded it yet but described the gist of it at the bottom).
Write a function calcNthSmallest(n, intervals) which takes as input a non-negative int n, and a list of intervals [[a_1; b_1]; : : : ; [a_m; b_m]] and calculates the nth smallest number (0-indexed) when taking the union of all the intervals with repetition. For example, if the intervals were [1; 5]; [2; 4]; [7; 9], their union with repetition would be [1; 2; 2; 3; 3; 4; 4; 5; 7; 8; 9] (note 2; 3; 4 each appear twice since they're in both the intervals [1; 5] and [2; 4]). For this list of intervals, the 0th smallest number would be 1, and the 3rd and 4th smallest would both be 3. Your implementation should run quickly even when the a_i; b_i can be very large (like, one trillion), and there are several intervals
The way I thought to go about it is the straightforward solution which is to make the union array and traverse it.
This problem can be solved in O(N log N) where N is the number of intervals in the list, regardless of the actual values of the interval endpoints.
The key to solving this problem efficiently is to transform the list of possibly-overlapping intervals into a list of intervals which are either disjoint or identical. In the given example, only the first interval needs to be split:
{ [1,5], [2,4], [7,9]} =>
+-----------------+ +---+ +---+
{[1,1], [2,4], [5,5], [2,4], [7,9]}
(This doesn't have to be done explicitly, though: see below.) Now, we can sort the new intervals, replacing duplicates with a count. From that, we can compute the number of values each (possibly-duplicated) interval represents. Now, we simply need to accumulate the values to figure out which interval the solution lies in:
interval count size values cumulative
in interval values
[1,1] 1 1 1 [0, 1)
[2,4] 2 3 6 [1, 7) (eg. from n=1 to n=6 will be here)
[5,5] 1 1 1 [7, 8)
[7,9] 1 3 3 [8, 11)
I wrote the cumulative values as a list of half-open intervals, but obviously we only need the end-points. We can then find which interval holds value n by, for example, binary-searching the cumulative values list, and we can figure out which value in the interval we want by subtracting the start of the interval from n and then integer-dividing by the count.
It should be clear that the maximum size of the above table is twice the number of original intervals, because every row must start and end at either the start or end of some interval in the original list. If we'd written the intervals as half-open instead of closed, this would be even clearer; in that case, we can assert that the precise size of the table will be the number of unique values in the collection of end-points. And from that insight, we can see that we don't really need the table at all; we just need the sorted list of end-points (although we need to know which endpoint each value represents). We can simply iterate through that list, maintaining the count of the number of active intervals, until we reach the value we're looking for.
Here's a quick python implementation. It could be improved.
def combineIntervals(intervals):
# endpoints will map each endpoint to a count
endpoints = {}
# These two lists represent the start and (1+end) of each interval
# Each start adds 1 to the count, and each limit subtracts 1
for start in (i[0] for i in intervals):
endpoints[start] = endpoints.setdefault(start, 0) + 1
for limit in (i[1]+1 for i in intervals):
endpoints[limit] = endpoints.setdefault(limit, 0) - 1
# Filtering is a possibly premature optimization but it was easy
return sorted(filter(lambda kv: kv[1] != 0,
endpoints.iteritems()))
def nthSmallestInIntervalList(n, intervals):
limits = combineIntervals(intervals)
cumulative = 0
count = 0
index = 0
here = limits[0][0]
while index < len(limits):
size = limits[index][0] - here
if n < cumulative + count * size:
# [here, next) contains the value we're searching for
return here + (n - cumulative) / count
# advance
cumulative += count * size
count += limits[index][1]
here += size
index += 1
# We didn't find it. We could throw an error
So, as I said, the running time of this algorithm is independent of the actual values of the intervals; it only depends in the length of the interval list. This particular solution is O(N log N) because of the cost of the sort (in combineIntervals); if we used a priority queue instead of a full sort, we could construct the heap in O(N) but making the scan O(log N) for each scanned endpoint. Unless N is really big and the expected value of the argument n is relatively small, this would be counter-productive. There might be other ways to reduce complexity, though.
Edit2:
Here's yet another take on your question.
Let's consider the intervals graphically:
1 1 1 2 2 2 3
0-2-4--7--0--3---7-0--4--7--0
[-------]
[-----------------]
[---------]
[--------------]
[-----]
When sorted in increasing order on the lower bound, we could get something that looks like the above for the interval list ([2;10];[4;24];[7;17];[13;30];[20;27]). Each lower bound indicates the start of a new interval, and would also marks the beginning of one more "level" of duplication of the numbers. Conversely, upper bounds mark the end of that level, and decrease the duplication level of one.
We could therefore convert the above into the following list:
[2;+];[4;+];[7;+][10;-];[13;+];[17;-][20;+];[24;-];[27;-];[30;-]
Where the first value indicates the rank of the bound, and the second value whether the bound is lower (+) or upper (-). The computation of the nth element is done by following the list, raising or lowering the duplication level when encountering an lower or upper bound, and using the duplication level as a counting factor.
Let's consider again the list graphically, but as an histogram:
3333 44444 5555
2222222333333344444555
111111111222222222222444444
1 1 1 2 2 2 3
0-2-4--7--0--3---7-0--4--7--0
The view above is the same as the first one, with all the intervals packed vertically.
1 being the elements of the 1st one, 2 the second one, etc. In fact, what matters here
is the height at each index, corresponding of the number of time each index is duplicated in the union of all intervals.
3333 55555 7777
2223333445555567777888
112223333445555567777888999
1 1 1 2 2 2 3
0-2-4--7--0--3---7-0--4--7--0
| | | | | | || | |
We can see that histogram blocks start at lower bounds of intervals, and end either on upper bounds, or one unit before lower bounds, so the new notation must be modified accordingly.
With a list containing n intervals, as a first step, we convert the list into the notation above (O(n)), and sort it in increasing bound order (O(nlog(n))). The second step of computing the number is then in O(n), for a total average time in O(nlog(n)).
Here's a simple implementation in OCaml, using 1 and -1 instead of '+' and '-'.
(* transform the list in the correct notation *)
let rec convert = function
[] -> []
| (l,u)::xs -> (l,1)::(u+1,-1)::convert xs;;
(* the counting function *)
let rec count r f = function
[] -> raise Not_found
| [a,x] -> (match f + x with
0 -> if r = 0 then a else raise Not_found
| _ -> a + (r / f))
| (a,x)::(b,y)::l ->
if a = b
then count r f ((b,x+y)::l)
else
let f = f + x in
if f > 0 then
let range = (b - a) * f in
if range > r
then a + (r / f)
else count (r - range) f ((b,y)::l)
else count r f ((b,y)::l);;
(* the compute function *)
let compute l =
let compare (x,_) (y,_) = compare x y in
let l = List.sort compare (convert l) in
fun m -> count m 0 l;;
Notes:
- the function above will raise an exception if the sought number is above the intervals. This corner case isn't taken in account by the other methods below.
- the list sorting function used in OCaml is merge sort, which effectively performs in O(nlog(n)).
Edit:
Seeing that you might have very large intervals, the solution I gave initially (see down below) is far from optimal.
Instead, we could make things much faster by transforming the list:
we try to compress the interval list by searching for overlapping ones and replace them by prefixing intervals, several times the overlapping one, and suffixing intervals. We can then directly compute the number of entries covered by each element of the list.
Looking at the splitting above (prefix, infix, suffix), we see that the optimal structure to do the processing is a binary tree. A node of that tree may optionally have a prefix and a suffix. So the node must contain :
an interval i in the node
an integer giving the number of repetition of i in the list,
a left subtree of all the intervals below i
a right subtree of all the intervals above i
with this structure in place, the tree is automatically sorted.
Here's an example of an ocaml type embodying that tree.
type tree = Empty | Node of int * interval * tree * tree
Now the transformation algorithm boils down to building the tree.
This function create a tree out of its component:
let cons k r lt rt =
the tree made of count k, interval r, left tree lt and right tree rt
This function recursively insert an interval in a tree.
let rec insert i it =
let r = root of it
let lt = the left subtree of it
let rt = the right subtree of it
let k = the count of r
let prf, inf, suf = the prefix, infix and suffix of i according to r
return cons (k+1) inf (insert prf lt) (insert suf rt)
Once the tree is built, we do a pre-order traversal of the tree, using the count of the node to accelerate the computation of the nth element.
Below is my previous answer.
Here are the steps of my solution:
you need to sort the interval list in increasing order on the lower bound of each interval
you need a deque dq (or a list which will be reversed at some point) to store the intervals
here's the code:
let lower i = lower bound of interval i
let upper i = upper bound of i
let il = sort of interval list
i <- 0
j <- lower (head of il)
loop on il:
i <- i + 1
let h = the head of il
let il = the tail of il
if upper h > j then push h to dq
if lower h > j then
il <- concat dq and il
j <- j + 1
dq <- empty
loop
if i = k then return j
loop
This algorithm works by simply iterating through the intervals, only taking in account the relevant intervals, and counting both the rank i of the element in the union, and the value j of that element. When the targeted rank k has been reached, the value is returned.
The complexity is roughly in O(k) + O(sort(l)).
if i have understood your question correctly, you want to find the kth largest element in union of list of intervals.
If we assume that no of list = 2 the question is :
Find the kth smallest element in union of two sorted arrays (where an interval [2,5] is nothing but elements from 2 to 5 {2,3,4,5}) this sollution can be solved in (n+m)log(n+m) time where (n and m are sizes of list) . where i and j are list iterators .
Maintaining the invariant
i + j = k – 1,
If Bj-1 < Ai < Bj, then Ai must be the k-th smallest,
or else if Ai-1 < Bj < Ai, then Bj must be the k-th smallest.
For details click here
Now the problem is if you have no of lists=3 lists then
Maintaining the invariant
i + j+ x = k – 1,
i + j=k-x-1
The value k-x-1 can take y (size of third list, because x iterates from start point of list to end point) .
problem of 3 lists size can be reduced to y*(problem of size 2 list). So complexity is `y*((n+m)log(n+m))`
If Bj-1 < Ai < Bj, then Ai must be the k-th smallest,
or else if Ai-1 < Bj < Ai, then Bj must be the k-th smallest.
So for problem of size n list the complexity is NP .
But yes we can do minor improvement if we know that k< sizeof(some lists) we can chop the elements starting from k+1th element to end(from our search space ) in those list whose size is bigger than k (i think it doesnt help for large k).If there is any mistake please let me know.
Let me explain with an example:
Assume we are given these intervals [5,12],[3,9],[8,13].
The union of these intervals is:
number : 3 4 5 5 6 6 7 7 8 8 8 9 9 9 10 10 11 11 12 12 13.
indices: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
The lowest will return 11 when 9 is passed an input.
The highest will return 14 when 9 is passed an input.
Lowest and highest function just check whether the x is present in that interval, if it is present then adds x-a(lower index of interval) to return value for that one particular interval. If an interval is completely smaller than x, then adds total number of elements in that interval to the return value.
The find function will return 9 when 13 is passed.
The find function will use the concept of binary search to find the kth smallest element. In the given range [0,N] (if range is not given we can find high range in O(n)) find the mid and calculate the lowest and highest for mid. If given k falls in between lowest and highest return mid else if k is less than or equal to lowest search in the lower half(0,mid-1) else search in the upper half(mid+1,high).
If the number of intervals are n and the range is N, then the running time of this algorithm is n*log(N). we will find lowest and highest (which runs in O(n)) log(N) times.
//Function call will be `find(0,N,k,in)`
//Retrieves the no.of smaller elements than first x(excluding) in union
public static int lowest(List<List<Integer>> in, int x){
int sum = 0;
for(List<Integer> lst: in){
if(x > lst.get(1))
sum += lst.get(1) - lst.get(0)+1;
else if((x >= lst.get(0) && x<lst.get(1)) || (x > lst.get(0) && x<=lst.get(1))){
sum += x - lst.get(0);
}
}
return sum;
}
//Retrieve the no.of smaller elements than last x(including) in union.
public static int highest(List<List<Integer>> in, int x){
int sum = 0;
for(List<Integer> lst: in){
if(x > lst.get(1))
sum += lst.get(1) - lst.get(0)+1;
else if((x >= lst.get(0) && x<lst.get(1)) || (x > lst.get(0) && x<=lst.get(1))){
sum += x - lst.get(0)+1;
}
}
return sum;
}
//Do binary search on the range.
public static int find(int low, int high, int k,List<List<Integer>> in){
if(low > high)
return -1;
int mid = low + (high-low)/2;
int lowIdx = lowest(in,mid);
int highIdx = highest(in,mid);
//k lies between the current numbers high and low indices
if(k > lowIdx && k <= highIdx) return mid;
//k less than lower index. go on to left side
if(k <= lowIdx) return find(low,mid-1,k,in);
// k greater than higher index go to right
if(k > highIdx) return find(mid+1,high,k,in);
else
return -1; // catch statement
}
It's possible to count how many numbers in the list are less than some chosen number X (by iterating through all of the intervals). Now, if this number is greater than n, the solution is certainly smaller than X. Similarly, if this number is less than or equal to n, the solution is greater than or equal to X. Based on these observation we can use binary search.
Below is a Java implementation :
public int nthElement( int[] lowerBound, int[] upperBound, int n )
{
int lo = Integer.MIN_VALUE, hi = Integer.MAX_VALUE;
while ( lo < hi ) {
int X = (int)( ((long)lo+hi+1)/2 );
long count = 0;
for ( int i=0; i<lowerBound.length; ++i ) {
if ( X >= lowerBound[i] && X <= upperBound[i] ) {
// part of interval i is less than X
count += (long)X - lowerBound[i];
}
if ( X >= lowerBound[i] && X > upperBound[i] ) {
// all numbers in interval i are less than X
count += (long)upperBound[i] - lowerBound[i] + 1;
}
}
if ( count <= n ) lo = X;
else hi = X-1;
}
return lo;
}

Resources