top down ranges merge?

I want to merge some intervals like this:
>>> ranges = [(30, 45), (40, 50), (10, 50), (60, 90), (90, 100)]
>>> merge(ranges)
[(10, 50), (60, 100)]
I don't have a CS background. I know how to do it by iteration, but I wonder whether there's a more efficient "top-down" approach, maybe using some special data structure?
Thanks.

An interval tree definitely works, but it is more complex than what you need. An interval tree is an "online" solution: it allows you to add some intervals, look at the union, add more intervals, look again, and so on.
If you have all the intervals upfront, you can do something simpler:
Start with the input
ranges = [(30, 45), (40, 50), (10, 50)]
Convert the range list into a list of endpoints. If you have range (A, B), you'll convert it to two endpoints: (A, 0) will be the left endpoint and (B, 1) will be the right endpoint.
endpoints = [(30, 0), (45, 1), (40, 0), (50, 1), (10, 0), (50, 1)]
Sort the endpoints
endpoints = [(10, 0), (30, 0), (40, 0), (45, 1), (50, 1), (50, 1)]
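Scan forward through the sorted endpoints list. Increment a counter when you see a left endpoint and decrement it when you see a right endpoint. When the counter goes from 0 to 1, a merged interval opens; when it drops back to 0, you close the current merged interval. In this example the counter returns to 0 only at the last endpoint, so the result is the single interval (10, 50).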
This solution can be implemented in a few lines.
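The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the answerer's own code; the function name `merge` is taken from the question, and the variable names are made up for the example.

```python
def merge(ranges):
    """Merge overlapping or touching (start, end) ranges with an endpoint sweep."""
    # Tag each endpoint: 0 = left, 1 = right. Sorting the tuples puts a left
    # endpoint before a right endpoint at the same coordinate, so ranges that
    # merely touch, such as (60, 90) and (90, 100), are merged as well.
    endpoints = []
    for start, end in ranges:
        endpoints.append((start, 0))
        endpoints.append((end, 1))
    endpoints.sort()

    merged = []
    depth = 0             # number of ranges currently open
    current_start = None
    for value, kind in endpoints:
        if kind == 0:
            if depth == 0:
                current_start = value  # a new merged range begins here
            depth += 1
        else:
            depth -= 1
            if depth == 0:
                # The counter hit 0: close the current merged range.
                merged.append((current_start, value))
    return merged


print(merge([(30, 45), (40, 50), (10, 50), (60, 90), (90, 100)]))
# prints [(10, 50), (60, 100)]
```

The sort dominates the cost, so the whole thing runs in O(n log n) time.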

Yeah, the efficient way to do it is to use an interval tree.

The following algorithm in C# does what you want. It uses DateTime interval ranges, but you can adapt it however you like. Once the collection is sorted in ascending start order, if the start of the next interval is at or before the end of the previous one, they overlap, and you extend the end time outward if needed. Otherwise they don't overlap, and you save the prior one off to the results.
public static List<DateTimeRange> MergeTimeRanges(List<DateTimeRange> inputRanges)
{
    List<DateTimeRange> mergedRanges = new List<DateTimeRange>();

    // Nothing to merge; this also avoids indexing into an empty list below.
    if (inputRanges.Count == 0)
    {
        return mergedRanges;
    }

    // Sort in ascending start order (this relies on DateTimeRange
    // being comparable by Start).
    inputRanges.Sort();

    DateTime currentStart = inputRanges[0].Start;
    DateTime currentEnd = inputRanges[0].End;

    for (int i = 1; i < inputRanges.Count; i++)
    {
        if (inputRanges[i].Start <= currentEnd)
        {
            if (inputRanges[i].End > currentEnd)
            {
                currentEnd = inputRanges[i].End; // Extend range.
            }
        }
        else
        {
            // Save current range to output.
            mergedRanges.Add(new DateTimeRange(currentStart, currentEnd));
            currentStart = inputRanges[i].Start;
            currentEnd = inputRanges[i].End;
        }
    }

    // Save the last open range.
    mergedRanges.Add(new DateTimeRange(currentStart, currentEnd));
    return mergedRanges;
}

Related

Filtering Spatial Data in Apache Spark

I am currently solving a problem involving GPS data from buses. The issue I am facing is to reduce computation in my process.
There are about 2 billion GPS-coordinate points (Lat-Long degrees) in one table and about 12,000 bus-stops with their Lat-Long in another table. It is expected that only 5-10% of the 2-billion points are at bus-stops.
Problem: I need to tag and extract only those points (out of the 2-billion) that are at bus-stops (the 12,000 points). Since this is GPS data, I cannot do exact matching of the coordinates, but rather do a tolerance based geofencing.
Issue: The process of tagging bus-stops is taking an extremely long time with the current naive approach. Currently, we pick each of the 12,000 bus-stop points and query the 2 billion points with a tolerance of 100 m (by converting degree differences into distance).
Question: Is there an algorithmically efficient process to achieve this tagging of points?

Yes, you can use something like SpatialSpark. It only works with Spark 1.6.1, but you can use its BroadcastSpatialJoin to create an R-tree, which is extremely efficient.
Here's an example of me using SpatialSpark with PySpark to check if different polygons are within each other or are intersecting:
# Python 2 syntax (print statement, 0L long literals).
from ast import literal_eval as make_tuple
# Assumed import: the .wkt attribute used below suggests Shapely geometries.
from shapely.geometry import Point, Polygon

print "Java Spark context version:", sc._jsc.version()
spatialspark = sc._jvm.spatialspark

rectangleA = Polygon([(0, 0), (0, 10), (10, 10), (10, 0)])
rectangleB = Polygon([(-4, -4), (-4, 4), (4, 4), (4, -4)])
rectangleC = Polygon([(7, 7), (7, 8), (8, 8), (8, 7)])
pointD = Point((-1, -1))

def geomABWithId():
    return sc.parallelize([
        (0L, rectangleA.wkt),
        (1L, rectangleB.wkt)
    ])

def geomCWithId():
    return sc.parallelize([
        (0L, rectangleC.wkt)
    ])

def geomABCWithId():
    return sc.parallelize([
        (0L, rectangleA.wkt),
        (1L, rectangleB.wkt),
        (2L, rectangleC.wkt)])

def geomDWithId():
    return sc.parallelize([
        (0L, pointD.wkt)
    ])
dfAB = sqlContext.createDataFrame(geomABWithId(), ['id', 'wkt'])
dfABC = sqlContext.createDataFrame(geomABCWithId(), ['id', 'wkt'])
dfC = sqlContext.createDataFrame(geomCWithId(), ['id', 'wkt'])
dfD = sqlContext.createDataFrame(geomDWithId(), ['id', 'wkt'])
# Supported Operators: Within, WithinD, Contains, Intersects, Overlaps, NearestD
SpatialOperator = spatialspark.operator.SpatialOperator
BroadcastSpatialJoin = spatialspark.join.BroadcastSpatialJoin
joinRDD = BroadcastSpatialJoin.apply(sc._jsc, dfABC._jdf, dfAB._jdf, SpatialOperator.Within(), 0.0)
joinRDD.count()
results = joinRDD.collect()
map(lambda result: make_tuple(result.toString()), results)
# [(0, 0), (1, 1), (2, 0)] read as:
# ID 0 is within 0
# ID 1 is within 1
# ID 2 is within 0
Note the line
joinRDD = BroadcastSpatialJoin.apply(sc._jsc, dfABC._jdf, dfAB._jdf, SpatialOperator.Within(), 0.0)
The last argument is a buffer value; in your case it would be the tolerance you want to use. It will probably be a very small number if you are using lat/lon, because those coordinates are angles in degrees: you will need to convert your tolerance in meters into degrees for your area of interest.
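As a rough illustration of that conversion, the sketch below turns a tolerance in meters into degrees of latitude and longitude. The constant of about 111,320 meters per degree is a common approximation, and the function name and the sample latitude are made up for the example; whether the join expects the buffer in degrees is an assumption based on the answer above.

```python
import math

# Rough length of one degree of latitude in meters (an approximation,
# not a survey-grade constant).
METERS_PER_DEGREE_LAT = 111320.0


def meters_to_degrees(tolerance_m, latitude_deg):
    """Return (lat_tolerance, lon_tolerance) in degrees for a distance in meters."""
    lat_tol = tolerance_m / METERS_PER_DEGREE_LAT
    # A degree of longitude shrinks with the cosine of the latitude.
    lon_tol = tolerance_m / (METERS_PER_DEGREE_LAT * math.cos(math.radians(latitude_deg)))
    return lat_tol, lon_tol


# Hypothetical area of interest at about 13 degrees north, 100 m tolerance.
lat_tol, lon_tol = meters_to_degrees(100.0, 13.0)
buffer_deg = max(lat_tol, lon_tol)  # conservative: one buffer for both axes
print(lat_tol, lon_tol, buffer_deg)
```

Taking the larger of the two values over-selects slightly, so an exact distance check on the joined pairs may still be needed afterwards.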

Sorting vector of x/y coordinates

I have a vector of (u32, u32) tuples which represent coordinates on a 10 x 10 grid. The coordinates are unsorted. Because the standard sort function also didn't yield the result I wanted, I wrote a sort function like this for them:
vec.sort_by(|a, b| {
    if a.0 > b.0 { return Ordering::Greater; }
    if a.0 < b.0 { return Ordering::Less; }
    if a.1 > b.1 { return Ordering::Greater; }
    if a.1 < b.1 { return Ordering::Less; }
    return Ordering::Equal;
});
The resulting grid for my custom function looks like this:
(0/0) (0/1) (0/2) (0/3) (0/4) (0/5) (0/6) (0/7) (0/8) (0/9)
(1/0) (1/1) (1/2) (1/3) (1/4) (1/5) (1/6) (1/7) (1/8) (1/9)
(2/0) (2/1) (2/2) (2/3) (2/4) (2/5) (2/6) (2/7) (2/8) (2/9)
...
(9/0) (9/1) (9/2) (9/3) (9/4) (9/5) (9/6) (9/7) (9/8) (9/9)
This is not what I want, because the lower left should start with (0/0) as I would expect on a mathematical coordinates grid.
I probably can manage to add more cases to the sort algorithm, but is there an easier way to do what I want besides writing a big if .. return Ordering ...; block?

You didn't show how you are populating or printing your tuples, so this is a guess. Flip around and/or negate parts of your coordinates. I'd also recommend using sort_by_key as it's easier, as well as just reusing the existing comparison of tuples:
fn main() {
    let mut points = [(0, 0), (1, 1), (1, 0), (0, 1)];
    points.sort_by_key(|&(x, y)| (!y, x));
    println!("{:?}", points);
}
Adding an extra newline in the output:
[(0, 1), (1, 1),
(0, 0), (1, 0)]
Originally, this answer suggested negating the value ((-y, x)). However, as pointed out by Francis Gagné, this fails for unsigned integers or signed integers when the value is the minimum value. Negating the bits happens to work fine, but is a bit too "clever".
Nowadays, I would use Ordering::reverse and Ordering::then for the clarity:
fn main() {
    let mut points = [(0u8, 0u8), (1, 1), (1, 0), (0, 1)];
    points.sort_by(|&(x0, y0), &(x1, y1)| y0.cmp(&y1).reverse().then(x0.cmp(&x1)));
    println!("{:?}", points);
}
[(0, 1), (1, 1),
(0, 0), (1, 0)]

Given a set of ranges S, and an overlapping range R, find the smallest subset in S that encompasses R

The following is a practice interview question that was given to me by someone, and I'm not sure what the best solution to this is:
Given a set of ranges (e.g. S = {(1, 4), (30, 40), (20, 91), (8, 10), (6, 7), (3, 9), (9, 12), (11, 14)}) and a target range R (e.g. R = (3, 13), meaning the range going from 3 to 13), write an algorithm to find the smallest set of ranges that covers your target range. All of the ranges in the set must overlap in order to be considered as spanning the entire target range. (In this example, the answer would be {(3, 9), (9, 12), (11, 14)}.)
What is the best way to solve this? I was thinking this would be done using a greedy algorithm. In our example above, we would look for all of the numbers that intersect with 3, and pick from those the one with the highest max. Then we would do the same thing with the one we just picked. So, since we picked (3, 9) we now want to find all of the ranges that intersect 9, and among those, we pick the one with the highest max. In that iteration, we picked (9, 12). We do the same thing to that one, and we find that the next range that intersects 12, with the highest max is (11, 14).
After that iteration, we see that 14 is greater than 13 (the max of our range), so we can stop.
The problem I'm having with this algorithm is: how do I efficiently query the intersecting ranges? If we try a linear search, we end up with an algorithm that is O(n^2). My next thought was to cross off any of our intersecting ranges from our list each time we run through the loop. So in the first iteration, we cross off (1, 4) and (3, 9). In our next iteration we cross off (9, 12), (3, 9), and (8, 10). So by the last iteration, all we have to look through is {(30, 40), (20, 91), (6, 7)}. We could make this even more efficient by also crossing out everything that has a min > 13 or a max < 3. The problem is that this still might not be enough. There is still the potential problem of having lots of duplicate ranges within the bounds of our target range. If our list of ranges contained something like {(6, 7), (6, 7), (6, 7), (6, 7), (6, 7)}, we would have to look through those each time, even though they aren't useful to us. Even if we were to store only unique values (by putting them all in a set), we might have a really big target range with a bunch of small ranges inside it, plus one range that spans almost the entire target range.
What would be an efficient way to query our ranges? Or possibly, what would be a more efficient algorithm to solving this problem?

How about using an interval tree for queries? (https://en.m.wikipedia.org/wiki/Interval_tree) I'm not sure if greedy could work here or not. If we look at the last set of choices, overlapping with the high point in R, there's a possibility of overlap between the earlier choices for each one of those, for example:
R = (2,10) and we have (8,10) and (7,10) both overlapping with (6,8)
In that case, we only need to store one value for (6,8) as a second leg of the path; and visiting (6,8) again as we make longer paths towards the low point in R would be superfluous since we already know (6,8) was visited with a lower leg count. So your idea of eliminating intervals as we go makes sense. Could something like this work?
leg = 1
start with the possible end (or beginning) intervals
label these intervals with leg
until end of path is reached:
    remove the intervals labeled leg from the tree
    for each of those intervals labeled leg:
        list overlapping intervals in the chosen direction
    leg = leg + 1
    label the listed overlapping intervals with leg

I can suggest the following algorithm with complexity O(n log n) that does not use interval trees.
Let us introduce some notation. We want to cover a range (X, Y) with intervals (x_i, y_i).
First, sort the given intervals (x_i, y_i) by start point. This takes O(n log n).
Among the intervals with x_i <= X, select the interval (x_k, y_k) with the maximum y_i. Because the intervals are already sorted by start point, we can just increment the index while the interval satisfies the condition. If y_k is less than X, there is no solution for the given set and range. Otherwise, interval (x_k, y_k) contains X and has the maximal end point among the intervals containing X.
Now we need to cover the range (y_k, Y), and to satisfy the overlapping condition the next interval must start at or before y_k. Because every interval that starts at or before X ends at or before y_k, none of them can help any more, so we can continue from the last interval of the previous step.
Each interval is examined only once in this stage, so the time complexity of this part is O(n), and the total is O(n log n).
The following pseudocode sketches the solution:
intervals             // given intervals from set S
(X, Y)                // range to cover
sort intervals by x
i = 0                 // start index
start = X             // start point
result_set            // set to store result
while start < Y:
    next_start = start
    to_add = none
    // scan every interval that begins at or before the current start point
    while i < len(intervals) && intervals[i].x <= start:
        // keep the interval that reaches furthest to the right
        if intervals[i].y > next_start:
            next_start = intervals[i].y
            to_add = intervals[i]
        i++
    if to_add is none:
        // nothing extends the covered part, so there is a gap
        print 'No solution'
        exit
    start = next_start
    result_set add to_add
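For readers who want something runnable, here is a minimal Python sketch of the same sort-then-greedy idea. It is an illustration rather than the answerer's code: the name `smallest_cover` is hypothetical, intervals are treated as closed (so touching intervals such as (3, 9) and (9, 12) count as overlapping, as in the question's example), and `None` is returned when no cover exists.

```python
def smallest_cover(intervals, target):
    """Return a smallest list of intervals covering target, or None if impossible."""
    lo, hi = target
    ordered = sorted(intervals)  # sort by start point: O(n log n)
    result = []
    reach = lo                   # rightmost point covered so far
    i = 0
    while True:
        best = None
        # Among the intervals starting at or before `reach`, keep the one
        # that extends furthest to the right. Each interval is visited once.
        while i < len(ordered) and ordered[i][0] <= reach:
            if best is None or ordered[i][1] > best[1]:
                best = ordered[i]
            i += 1
        if best is None or best[1] < reach:
            return None          # nothing touches `reach`: there is a gap
        result.append(best)
        if best[1] >= hi:
            return result        # the target is fully covered
        if best[1] == reach:
            return None          # no progress is possible beyond `reach`
        reach = best[1]


S = [(1, 4), (30, 40), (20, 91), (8, 10), (6, 7), (3, 9), (9, 12), (11, 14)]
print(smallest_cover(S, (3, 13)))
# prints [(3, 9), (9, 12), (11, 14)]
```

After the sort, the index `i` only moves forward, which is what keeps the scan linear.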

Ok, after trying a bunch of different things, here is my solution. It runs in O(nlogn) time, and doesn't require the use of an Interval Tree (although I would probably use it if I could memorize how to implement one for an interview, but I think that would take too long without providing any real benefit).
The bottleneck of this algorithm is the sorting. Every item is only touched once, but it only works with a sorted array, so that is the first thing we do; hence the O(n log n) time complexity. Because it modifies the original array, it has an O(1) space complexity (not counting the result set), but if we were not allowed to modify the original array, we could just make a copy of it and keep the rest of the algorithm the same, making the space complexity O(n).
import java.util.*;

class SmallestRangingSet {
    static class Interval implements Comparable<Interval> {
        Integer min;
        Integer max;

        public Interval(int min, int max) {
            this.min = min;
            this.max = max;
        }

        boolean intersects(int num) {
            return (min <= num && max >= num);
        }

        //Overrides the compareTo method so it will be sorted
        //in order relative to the min value
        @Override
        public int compareTo(Interval obj) {
            if (min > obj.min) return 1;
            else if (min < obj.min) return -1;
            else return 0;
        }
    }

    public static Set<Interval> smallestIntervalSet(Interval[] set, Interval target) {
        //Bottleneck is here. The array is sorted, giving this algorithm O(nlogn) time
        Arrays.sort(set);

        //Create a set to store our ranges in. A LinkedHashSet keeps insertion
        //order, so the ranges print in the order they were chosen.
        Set<Interval> smallSet = new LinkedHashSet<Interval>();
        //Create a variable to keep track of the most optimal range, relative
        //to the range before it, at all times.
        Interval bestOfCurr = null;
        //Keep track of the specific number that any given range will need to
        //intersect with. Initialize it to the target-min-value.
        int currBestNum = target.min;

        //Go through each element in our sorted array.
        for (int i = 0; i < set.length; i++) {
            Interval currInterval = set[i];

            //If we have already passed our target max, break.
            if (currBestNum >= target.max)
                break;

            //An interval that ends before currBestNum can never help, so skip
            //it instead of letting it end the search for the best candidate.
            if (currInterval.max < currBestNum)
                continue;

            //Otherwise, if the current interval intersects with
            //our currBestNum
            if (currInterval.intersects(currBestNum)) {
                //If the current interval, which intersects currBestNum,
                //has a greater max than our current bestOfCurr,
                //update bestOfCurr to be equal to currInterval.
                if (bestOfCurr == null || currInterval.max >= bestOfCurr.max) {
                    bestOfCurr = currInterval;
                }
            }
            //If our range does not intersect, it starts after currBestNum, so the
            //most recently updated bestOfCurr is the best new range to add to
            //our set. However, if bestOfCurr is null, it means it was never updated,
            //because there is a gap somewhere when trying to fill our target range.
            //So we must check for null first.
            else if (bestOfCurr != null) {
                //If it's not null, add bestOfCurr to our set
                smallSet.add(bestOfCurr);
                //Update currBestNum to look for intervals that
                //intersect with bestOfCurr.max
                currBestNum = bestOfCurr.max;
                //This line is here because without it, it actually skips over
                //the next Interval, which is problematic if your sorted array
                //has two optimal Intervals next to each other.
                i--;
                //set bestOfCurr to null, so that it won't run
                //this section of code twice on the same Interval.
                bestOfCurr = null;
            }
        }

        //If the array ended while a candidate was still pending, add it now;
        //otherwise the last chosen range would be lost.
        if (bestOfCurr != null && currBestNum < target.max) {
            smallSet.add(bestOfCurr);
            currBestNum = bestOfCurr.max;
        }

        //Now we should just make sure that we have in fact covered the entire
        //target range. If we haven't, then we are going to return an empty set.
        if (currBestNum < target.max)
            smallSet.clear();
        return smallSet;
    }

    public static void main(String[] args) {
        //{(1, 4), (30, 40), (20, 91) ,(8, 10), (6, 7), (3, 9), (9, 12), (11, 14)}
        Interval[] interv = {
            new Interval(1, 4),
            new Interval(30, 40),
            new Interval(20, 91),
            new Interval(8, 10),
            new Interval(6, 7),
            new Interval(3, 9),
            new Interval(9, 12),
            new Interval(11, 14)
        };
        Set<Interval> newSet = smallestIntervalSet(interv, new Interval(3,14));
        for (Interval intrv : newSet) {
            System.out.print("(" + intrv.min + ", " + intrv.max + ") ");
        }
    }
}
Output
(3, 9) (9, 12) (11, 14)

Your assignment intrigued me, so I wrote a C++ program that solves the problem by iterating through the ranges that overlap the left side of the target range, and recursively searches for the smallest number of ranges that covers the remaining (right side) of the target range.
A significant optimization to this algorithm (not shown in this program) would be to, for each recursive level, use the range that overlaps the left side of the target range by the largest amount, and discarding from further consideration all ranges that overlap the left side by smaller amounts. By employing this rule, I believe there would be at most a single descent into the recursive call tree. Such an optimization would produce an algorithm having complexity O(n log(n)). (n to account for the depth of recursion, and log(n) to account for the binary search to find the range having the most overlap.)
This program produces the following as output:
{ (3, 9) (9, 12) (11, 14) }
Here is the program:
#include <utility>  // for std::pair
#include <vector>   // for std::vector
#include <iostream> // for std::cout & std::endl

typedef std::pair<int, int> range;
typedef std::vector<range> rangelist;

// function declarations
rangelist findRanges (range targetRange, rangelist candidateRanges);
void print (rangelist list);

int main()
{
    range target_range = { 3, 13 };
    rangelist candidate_ranges =
        { { 1, 4 }, { 30, 40 }, { 20, 91 }, { 8, 10 }, { 6, 7 }, { 3, 9 }, { 9, 12 }, { 11, 14 } };
    rangelist result = findRanges (target_range, candidate_ranges);
    print (result);
    return 0;
}

// Recursive function that returns the smallest subset of candidateRanges that
// covers the given targetRange.
// If there is no subset that covers the targetRange, then this function
// returns an empty rangelist.
//
rangelist findRanges (range targetRange, rangelist candidateRanges)
{
    rangelist::iterator it;
    rangelist smallest_list_so_far;

    for (it = candidateRanges.begin (); it != candidateRanges.end (); ++it) {
        // if this candidate range overlaps the beginning of the target range
        if (it->first <= targetRange.first && it->second >= targetRange.first) {
            // if this candidate range also overlaps the end of the target range
            if (it->second >= targetRange.second) {
                // done with this level - return a list of ranges consisting only of
                // this single candidate range
                return { *it };
            }
            else {
                // prepare new version of targetRange that excludes the subrange
                // overlapped by the present range
                range newTargetRange = { it->second + 1, targetRange.second };

                // prepare new version of candidateRanges that excludes the present range
                // from the list of ranges
                rangelist newCandidateRanges;
                rangelist::iterator it2;
                // copy all ranges up to but not including the present range
                for (it2 = candidateRanges.begin (); it2 != it; ++it2) {
                    newCandidateRanges.push_back (*it2);
                }
                // skip the present range
                it2++;
                // copy the remainder of ranges in the list
                for (; it2 != candidateRanges.end(); ++it2) {
                    newCandidateRanges.push_back (*it2);
                }

                // recursive call to find the smallest list of ranges that cover the remainder
                // of the target range not covered by the present range
                rangelist subList = findRanges (newTargetRange, newCandidateRanges);
                if (subList.size () == 0) {
                    // no solution includes the present range
                    continue;
                }
                else if (smallest_list_so_far.size () == 0 ||           // - first subList that covers the remainder of the target range
                         subList.size () < smallest_list_so_far.size ()) // - this subList is smaller than all previous ones checked
                {
                    // add the present range to the subList, which represents a solution
                    // (though possibly not optimal yet) at the present level of recursion
                    subList.push_back (*it);
                    smallest_list_so_far = subList;
                }
            }
        }
    }
    return smallest_list_so_far;
}

// print list of ranges
void print (rangelist list)
{
    rangelist::reverse_iterator rit;
    std::cout << "{ ";
    for (rit = list.rbegin (); rit != list.rend (); ++rit) {
        std::cout << "(" << rit->first << ", " << rit->second << ") ";
    }
    std::cout << "}" << std::endl;
}

Skyline of Buildings

I'm trying to understand the skyline problem. Given n rectangular buildings, we need to compute the skyline. I have trouble understanding the output for this problem.
Input: { (1,11,5), (2,6,7), (3,13,9), (12,7,16), (14,3,25), (19,18,22), (23,13,29), (24,4,28) }
Output Skylines: (1, 11), (3, 13), (9, 0), (12, 7), (16, 3), (19, 18), (22, 3), (25, 0)
The output is pair (xaxis, height). Why is the third pair (9,0)? If we see the skyline graph, the x-axis value 9 has height of 13, not 0. Why is it showing 0? In other words, if we take the first building (input (1,11,5)), the output is (1, 11), (5, 0). Can you guys explain why it is (5,0) instead of (5,11)?

Think of the rooftop intervals as closed on the left and open on the right.

Your output does not signify "at x the height is y", but rather "at x the height changes to y".
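To make the "height changes to y at x" reading concrete, here is a small Python sketch that looks up the height at any x from a list of key points. The helper name `height_at` is made up for the illustration, and the key points are the first four pairs of the output quoted in the question.

```python
from bisect import bisect_right


def height_at(key_points, x):
    """Height of the skyline at x, given (x, height) key points sorted by x."""
    xs = [px for px, _ in key_points]
    idx = bisect_right(xs, x) - 1  # last key point at or to the left of x
    return key_points[idx][1] if idx >= 0 else 0


skyline = [(1, 11), (3, 13), (9, 0), (12, 7)]
print(height_at(skyline, 8))  # prints 13: the change made at x=3 is still in effect
print(height_at(skyline, 9))  # prints 0: the height changes to 0 exactly at x=9
```

This matches the half-open view of rooftops: the height 13 holds on [3, 9), and from x = 9 onward the skyline is at 0 until the next key point.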

Using the sweep line algorithm, here is my solution in Python:
class Solution:
    # @param {integer[][]} buildings
    # @return {integer[][]}
    # Each building is [left, right, height] (a different order from the
    # question's input).
    def getSkyline(self, buildings):
        if len(buildings)==0: return []
        if len(buildings)==1: return [[buildings[0][0], buildings[0][2]], [buildings[0][1], 0]]
        # Turn each building into a start point (+height) and an end point (-height).
        points=[]
        for building in buildings:
            points+=[[building[0],building[2]]]
            points+=[[building[1],-building[2]]]
        points=sorted(points, key=lambda x: x[0])
        moving, active, res, current=0, [0], [],-1
        while moving<len(points):
            i=moving
            # Process every point that shares the current x coordinate.
            while i<=len(points):
                if i<len(points) and points[i][0]==points[moving][0]:
                    if points[i][1]>0:
                        active+=[points[i][1]]
                        if points[i][1]>current:
                            current=points[i][1]
                            if len(res)>0 and res[-1][0]==points[i][0]:
                                res[-1][1]=current
                            else:
                                res+=[[points[moving][0], current]]
                    else:
                        active.remove(-points[i][1])
                    i+=1
                else:
                    break
            # If the tallest active building is now lower, the skyline drops here.
            if max(active)<current:
                current=max(active)
                res+=[[points[moving][0], current]]
            moving=i
        return res

static long largestRectangle(int[] h) {
    int k=1;
    int n=h.length;
    long max=0;
    while(k<=n){
        long area=0;
        for(int i=0;i<n-k+1;i++){
            long min=Long.MAX_VALUE;
            for(int j=i;j<i+k;j++){
                //System.out.print(h[j]+" ");
                min=Math.min(h[j],min);
            }
            // System.out.println();
            area=k*min;
            //System.out.println(area);
            max=Math.max(area,max);
        }
        //System.out.println(k);
        k++;
    }
    return max;
}

How to make this sparse matrix and trie work in tandem

I have a sparse matrix that has been exported to this format:
(1, 3) = 4
(0, 5) = 88
(6, 0) = 100
...
Strings are stored in a Trie data structure. The numbers in the exported sparse matrix above correspond to the results of lookups on the Trie.
Let's say the word "stackoverflow" is mapped to number '0'. I need to iterate over the entries of the exported sparse matrix whose first element is equal to '0' and find the highest value.
For example:
(0, 1) = 4
(0, 3) = 8
(0, 9) = 100 <-- highest value
(0, 9) is going to win.
What would be the best implementation to store the exported sparse matrix?
In general, what would be the best approach (data structure, algorithm) to handle this functionality?

Absent memory or dynamism constraints, probably the best approach is to slurp the sparse matrix into a map from first number to the pairs ordered by value, e.g.,
from operator import itemgetter

matrix_map = {}  # empty map
for (first_number, second_number, value) in matrix_triples:
    if first_number not in matrix_map:
        matrix_map[first_number] = []  # empty list
    matrix_map[first_number].append((second_number, value))
for lst in matrix_map.values():
    lst.sort(key=itemgetter(1), reverse=True)  # sort by value descending
Given a matrix like
(0, 1) = 4
(0, 3) = 8
(0, 5) = 88
(0, 9) = 100
(1, 3) = 4
(6, 0) = 100,
the finished product looks like this:
{0: [(9, 100), (5, 88), (3, 8), (1, 4)],
1: [(3, 4)],
6: [(0, 100)]}.
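With the map built this way, finding the winner for a word is a lookup followed by taking the first pair. The sketch below illustrates this under some assumptions: a plain dict called `word_ids` stands in for the trie lookup, and the helper name `best_entry` is made up for the example.

```python
from collections import defaultdict

# Hypothetical stand-in for the trie: maps a word to its row number.
word_ids = {"stackoverflow": 0}

# The exported matrix as (first_number, second_number, value) triples.
triples = [(1, 3, 4), (0, 5, 88), (6, 0, 100), (0, 1, 4), (0, 3, 8), (0, 9, 100)]

matrix_map = defaultdict(list)
for first, second, value in triples:
    matrix_map[first].append((second, value))
for pairs in matrix_map.values():
    pairs.sort(key=lambda pair: pair[1], reverse=True)  # highest value first


def best_entry(word):
    """Return (row, column, value) with the highest value for word, or None."""
    row = word_ids.get(word)
    if row is None or row not in matrix_map:
        return None
    column, value = matrix_map[row][0]
    return (row, column, value)


print(best_entry("stackoverflow"))  # prints (0, 9, 100)
```

If only the single highest value per row is ever needed, storing just that pair per row instead of a sorted list would save memory.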
