Find a region with maximum sum of top-K points - algorithm

My problem is: we have N points in a 2D space, each point has a positive weight. Given a query consisting of two real numbers a,b and one integer k, find the position of a rectangle of size a x b, with edges are parallel to axes, so that the sum of weights of top-k points, i.e. k points with highest weights, covered by the rectangle is maximized?
Any suggestion is appreciated.
P.S.:
There are two related problems, which are already well-studied:
Maximum region sum: find the rectangle with the highest total weight sum. Complexity: NlogN.
top-K query for orthogonal ranges: find top-k points in a given rectangle. Complexity: O(log(N)^2+k).

You can reduce this problem into finding two points in the rectangle: rightmost and topmost. So effectively you can select every pair of points and calculate the top-k weight (which according to you is O(log(N)^2+k)). Complexity: O(N^2*(log(N)^2+k)).
Now, given two points, they might not form a valid pair: they might be too far or one point may be right and top of the other point. So, in reality, this will be much faster.
My guess is the optimal solution will be a variation of maximum region sum problem. Could you point to a link describing that algorithm?

An non-optimal answer is the following:
Generate all the possible k-plets of points (they are N × N-1 × … × N-k+1, so this is O(Nk) and can be done via recursion).
Filter this list down by eliminating all k-plets which are not enclosed in a a×b rectangle: this is a O(k Nk) at worst.
Find the k-plet which has the maximum weight: this is a O(k Nk-1) at worst.
Thus, this algorithm is O(k Nk).
Improving the algorithm
Step 2 can be integrated in step 1 by stopping the branch recursion when a set of points is already too large. This does not change the need to scan the element at least once, but it can reduce the number significantly: think of cases where there are no solutions because all points are separated more than the size of the rectangle, that can be found in O(N2).
Also, the permutation generator in step 1 can be made to return the points in order by x or y coordinate, by pre-sorting the point array correspondingly. This is useful because it lets us discard a bunch of more possibilities up front. Suppose the array is sorted by y coordinate, so the k-plets returned will be ordered by y coordinate. Now, supposing we are discarding a branch because it contains a point whose y coordinate is outside the max rectangle, we can also discard all the next sibling branches because their y coordinate will be more than of equal to the current one which is already out of bounds.
This adds O(n log n) for the sort, but the improvement can be quite significant in many cases -- again, when there are many outliers. The coordinate should be chosen corresponding to the minimum rectangle side, divided by the corresponding side of the 2D field -- by which I mean the maximum coordinate minus the minimum coordinate of all points.
Finally, if all the points lie within an a×b rectangle, then the algorithm performs as O(k Nk) anyways. If this is a concrete possibility, it should be checked, an easy O(N) loop, and if so then it's enough to return the points with the top N weights, which is also O(N).

Related

find best k-means from a list of candidates

I have an array of size n of points called A, and a candidate array of size O(k)>k called S. I want to find k points in S such that the sum of squared distances from the points of A to their closest point from the k points would be minimized. One way to do it would be to check the cost of any possible k points in S and take the minimum, but that would take O(k^k*n) time, is there any more efficient way to do it?
I need either an optimal solution or a constant approximation.
The reason I need this is that I'm trying to find a constant approximation for the k-means as fast as possible and later use this for a coreset construction (coreset=data minimization while still keeping the cost of any query approximately the same). I was able to show that if we assume that in the optimal clustering each cluster has omega(n/k) points we can create pretty fast a list of size O(k) canidates that contains inside of them a 3-approximation for the k-means, so I was wondering if we can find those k points or a constant approximation for their costs in time which is faster than exhaustive search.
Example for k=2
In this example S is the green dots and A is the red dots. The algorithm should return the 2 circled points from S since they minimize the sum of squared distances from the points of A to their closest point of the 2.
I have an array of size n of points called A, and a candidate array of size O(k)>k called S. I want to find k points in S such that the sum of squared distances from the points of A to their closest point from the k points would be minimized.
It sounds like this could be solved simply by checking the N points against the K points to find the k points in N with the smallest squared distance.
Therefore, I'm now fairly sure this is actually finding the k-nearest neighbors (K-NN as a computational geometry problem, not the pattern recognition definition) in the N points for each point in the K points and not actually k-means.
For higher dimensionality, it is often useful to also consider the dimensionality, D in the algorithm.
The algorithm mentioned is indeed O(NDk^2) then when considering K-NN instead. That can be improved to O(NDk) by using Quickselect algorithm on the distances. This allows for checking the list of N points against each of the K points in O(N) to find the nearest k points.
https://en.wikipedia.org/wiki/Quickselect
Edit:
Seems there is some confusion on quickselect and if it can be used. Here is a O(DkNlogN) solution that uses a standard sort O(NlogN) instead of quickselect O(N). Though this might be faster in practice and as you can see in most languages it's pretty easy to implement.
results = {}
for y in F:
def distanceSquared(x):
distance(x,y) # Custom distance for each y
# First k sorted by distanceSquared
results[y] = S.sort(key=distanceSquared)[:k]
return results
Update for new visual
# Build up distance sums O(A*N*D)
results = {}
for y in F:
def distanceSquared(x):
distance(x,y) # Custom distance for each y
# Sum of distance squared from y for all points in S
results[y] = sum(map(distanceSquared, S))
def results_key_value(key):
results[key]
# First k results sorted by key O(D*AlogA)
results.keys().sort(key=results_key_value)[:k]
You could approximate by only considering Z random points chosen from the S points. Alternatively, you could merge points in S if they are close enough together. This could reduce S to a much smaller size as long S remains about F^2 or larger in size, it shouldn't affect which points in F are chosen too much. Though you would also need to adjust the weight of the points to handle that better. IE: the square distance of a point that represents 10 points is multiplied by 10 to account for it acting as 10 points instead of just 1.

Trouble understanding Closest-Pair divide and conquer algorithm

I'm new into coding and today I completed the trivial solution for the Closest-Pair problem in a 2-D space. (2 for loops)
However I gave up finding any solution which could do it in O(n log n). Even after researching it, I still don't understand how this can be faster than the trivial method.
What I understand:
-> At first we split the array in 2 halfs and sort everything only considering the X coordinates. This can be done in n log n.
Next there are recursive calls which "find the two points with the lowest distance" in each half. But how is this done exactly below O(n^2)?
In my understanding it is impossible to find the lowest distance between N/2 points without checking every single one of them.
There is a solution in 1-D which absolutely makes sense to me. After sorting we know, that the distance between two non-adjacent points can't be lower than the distance of at least 2 adjacent ones. However this is not true for 2-D space, since we have an additional Y coordinate which could lead to the lowest distance between two points which are not adjacent on the X axis.
First of all, heed the advice of user #Evg - this answer cannot substitute the comprehensive description and mathematically rigorous analysis of the algorithm.
However, here are some ideas to get the intuition started:
(Recursion structure)
The question states:
Next there are recursive calls which "find the two points with the lowest distance" in each half. But how is this done exactly below O(n^2)? In my understanding it is impossible to find the lowest distance between N/2 points without checking every single one of them.
The recursion, however, does not stop at level 1 - assume for the sake of the argument that some O(n log n) algorithm works. Finding closest pairs among N/2 points applying that very algorithm takes O(N/2 log N/2) - not O((N/2)^2).
(Consequences of finding a closest pair in one half)
If you have found a closest pair (p, q) in the 'left' half of the point set, this pair's distance sets an upper bound to the width of a corridor around the halving line from which a closer pair (r, s) with r from the left, s from the right half can be drawn. If the closest distance found so far is 'small', it significantly reduces the size of the candidate set. As the points have been ordered by their x coordinate, the algorithm can exploit the information efficiently.
Said corridor may still cover up to the whole set of N points, but if it does, it provides information of the geometry of the point set: the points of each half will basically be aligned along a vertical line. This information can be exploited algorithmically - the most naive way would be to execute the algorithm once again but sorting along y coordinates and halving the point set by a horizontal line. Note that executing any algorithm a constant number of times does not change asymptotic run time expressed by the O(.) notation.
(Finding a close pair with one point from each half)
Consider checking a pair of points (r, s), one point from each half. It is known that the difference in their x and y coordinates, resp., mustn't exceed the minimal distance d found so far. It is known from the recursion that there can be no points r', s' (r' from the left, s' from the right half) closer to r, s, resp., than d. So given some r there cannot be 'many' candidates from the other half.
Imagine a circle of radius d drawn around r. Any point s from the other half being closer than d must be located within that circle. Let there be a few of them - however, the minimum distance among each pair still be at least d. The maximum number of points that can be distributed within a circle of radius d such that the distance between each pair of them is at least d is 7 - think of a regular hexagon with side length d and its center coinciding with the circle's center.
So after the recursion, at most every r from the left half needs to be checked against at max a constant number of points from the other half which makes the part of the algorithm after the recursion run in O(N).
Note that finding the pairing candidates for a given r is an efficient operation - the points from both halves have been sorted by the same criterion.

All points with minimum Manhattan distance from all other given points [Optimized]

The problem here is to find set of all integer points which gives minimum sum over all Manhattan distances from given set of points!
For example:
lets have a given set of points { P1, P2, P3...Pn }
Basic problem is to find a point say X which would have minimum sum over all distances from points { P1, P2, P3... Pn }.
i.e. |P1-X| + |P2-X| + .... + |Pn-X| = D, where D will be minimum over all X.
Moving a step further, there can be more than one value of X satisfying above condition. i.e. more than one X can be possible which would give the same value D. So, we need to find all such X.
One basic approach that anyone can think of will be to find the median of inputs and then brute force the co-ordinates which is mentioned in this post
But the problem with such approach is: if the median gives two values which are very apart, then we end up brute forcing all points which will never run in given time.
So, is there any other approach which would give the result even when the points are very far apart (where median gives a range which is of the order of 10^9).
You can consider X and Y separately, since they add to the distance independently of each other. This reduces the question to finding, given n points on a line, a point with the minimum sum-of-distances to the other points. This is simple: any point between the two medians (inclusive) will satisfy this.
Proof: If we have an even number of points, there will be two medians. A point between the two medians will have n/2 points to the left and n/2 points to the right, and a total sum-of-distances to those points of S.
If we move it one point to the left, S will go up by n/2 (since we're moving away from the right-most points) and down by n/2 (since we're moving towards the left-most points), so overall S remains the same. This holds true until we hit the left-most median point. When we move one left of the left-most median point, we now have (n/2 + 1) points to the right, and (n/2 - 1) points to the left, so S goes up by two. Continuing to the left will only increase S further.
By the same logic, all points to the right of the right-most median also have a higher S.
If we have an odd number of points, there is only one median. Using the same logic as above, we can show that it has the lowest value of S.
If the median gives you an interval of the order of 10^9 then each point in that interval is as good as any other.
So depending on what you want to do with those points later on you can either return the range or enumerate points in that range. No way around it..
Obviously in two dimensions you'll get a bouding rectangle, in 3 dimensions a bounding cuboid etc..
The result will always be a cartesian product of ranges obtained for each dimension, so you can return a list of those ranges as a result.
Since in manhattan distance each component contributes separately, you can consider them separately too. The optimal answer is ( median(x),median(y) ). You need to look around this point for integer solutions.
NOTE: I did not read your question properly while answering. My answer still holds, but probably you knew about this solution already.
Yes i also think that for odd number of N points on a grid , there will be only a Single point(i.e the MEDIAN) which will be at minimum sum of Manhattan distance from all other points.
For Even value of N, the scenario will be a little different.
According to me if two Sets X = {1,2} and Y= {3,4} their Cartesian product will be always 4.
i.e X × Y = {1,2} × {3,4} = {(1,3), (1,4), (2,3), (2,4)}. This is what i have understood so far.
As for EVEN number of values we always take "MIDDLE TWO" values as MEDIAN. Taking 2 from X and 2 from Y will always return a Cartesian product of 4 points.
Correct me if i am wrong.

Intersection of N rectangles

I'm looking for an algorithm to solve this problem:
Given N rectangles on the Cartesian coordinate, find out if the intersection of those rectangles is empty or not. Each rectangle can lie in any direction (not necessary to have its edges parallel to Ox and Oy)
Do you have any suggestion to solve this problem? :) I can think of testing the intersection of each rectangle pair. However, it's O(N*N) and quite slow :(
Abstract
Either use a sorting algorithm according to smallest X value of the rectangle, or store your rectangles in an R-tree and search it.
Straight-forward approach (with sorting)
Let us denote low_x() - the smallest (leftmost) X value of a rectangle, and high_x() - the highest (rightmost) X value of a rectangle.
Algorithm:
Sort the rectangles according to low_x(). # O(n log n)
For each rectangle in sorted array: # O(n)
Finds its highest X point. # O(1)
Compare it with all rectangles whose low_x() is smaller # O(log n)
than this.high(x)
Complexity analysis
This should work on O(n log n) on uniformly distributed rectangles.
The worst case would be O(n^2), for example when the rectangles don't overlap but are one above another. In this case, generalize the algorithm to have low_y() and high_y() too.
Data-structure approach: R-Trees
R-trees (a spatial generalization of B-trees) are one of the best ways to store geospatial data, and can be useful in this problem. Simply store your rectangles in an R-tree, and you can spot intersections with a straightforward O(n log n) complexity. (n searches, log n time for each).
Observation 1: given a polygon A and a rectangle B, the intersection A ∩ B can be computed by 4 intersection with half-planes corresponding to each edge of B.
Observation 2: cutting a half plane from a convex polygon gives you a convex polygon. The first rectangle is a convex polygon. This operation increases the number of vertices at most per 1.
Observation 3: the signed distance of the vertices of a convex polygon to a straight line is a unimodal function.
Here is a sketch of the algorithm:
Maintain the current partial intersection D in a balanced binary tree in a CCW order.
When cutting a half-plane defined by a line L, find the two edges in D that intersect L. This can be done in logarithmic time through some clever binary or ternary search exploiting the unimodality of the signed distance to L. (This is the part I don't exactly remember.) Remove all the vertices on the one side of L from D, and insert the intersection points to D.
Repeat for all edges L of all rectangles.
This seems like a good application of Klee's measure. Basically, if you read http://en.wikipedia.org/wiki/Klee%27s_measure_problem there are lower bounds on the runtime of the best algorithms that can be found for rectilinear intersections at O(n log n).
I think you should use something like the sweep line algorithm: finding intersections is one of its applications. Also, have a look at answers to this questions
Since the rectangles must not be parallel to the axis, it is easier to transform the problem to an already solved one: compute the intersections of the borders of the rectangles.
build a set S which contains all borders, together with the rectangle they're belonging to; you get a set of tuples of the form ((x_start,y_start), (x_end,y_end), r_n), where r_n is of course the ID of the corresponding rectangle
now use a sweep line algorithm to find the intersections of those lines
The sweep line stops at every x-coordinate in S, i.e. all start values and all end values. For every new start coordinate, put the corresponding line in a temporary set I. For each new end-coordinate, remove the corresponding line from I.
Additionally to adding new lines to I, you can check for each new line whether it intersects with one of the lines currently in I. If they do, the corresponding rectangles do, too.
You can find a detailed explanation of this algorithm here.
The runtime is O(n*log(n) + c*log(n)), where c is the number of intersection points of the lines in I.
Pick the smallest rectangle from the set (or any rectangle), and go over each point within it. If one of it's point also exists in all other rectangles, the intersection is not empty. If all points are free from ALL other rectangles, the intersection is empty.

Finding the farthest point in one set from another set

My goal is a more efficient implementation of the algorithm posed in this question.
Consider two sets of points (in N-space. 3-space for the example case of RGB colorspace, while a solution for 1-space 2-space differs only in the distance calculation). How do you find the point in the first set that is the farthest from its nearest neighbor in the second set?
In a 1-space example, given the sets A:{2,4,6,8} and B:{1,3,5}, the answer would be
8, as 8 is 3 units away from 5 (its nearest neighbor in B) while all other members of A are just 1 unit away from their nearest neighbor in B. edit: 1-space is overly simplified, as sorting is related to distance in a way that it is not in higher dimensions.
The solution in the source question involves a brute force comparison of every point in one set (all R,G,B where 512>=R+G+B>=256 and R%4=0 and G%4=0 and B%4=0) to every point in the other set (colorTable). Ignore, for the sake of this question, that the first set is elaborated programmatically instead of iterated over as a stored list like the second set.
First you need to find every element's nearest neighbor in the other set.
To do this efficiently you need a nearest neighbor algorithm. Personally I would implement a kd-tree just because I've done it in the past in my algorithm class and it was fairly straightforward. Another viable alternative is an R-tree.
Do this once for each element in the smallest set. (Add one element from the smallest to larger one and run the algorithm to find its nearest neighbor.)
From this you should be able to get a list of nearest neighbors for each element.
While finding the pairs of nearest neighbors, keep them in a sorted data structure which has a fast addition method and a fast getMax method, such as a heap, sorted by Euclidean distance.
Then, once you're done simply ask the heap for the max.
The run time for this breaks down as follows:
N = size of smaller set
M = size of the larger set
N * O(log M + 1) for all the kd-tree nearest neighbor checks.
N * O(1) for calculating the Euclidean distance before adding it to the heap.
N * O(log N) for adding the pairs into the heap.
O(1) to get the final answer :D
So in the end the whole algorithm is O(N*log M).
If you don't care about the order of each pair you can save a bit of time and space by only keeping the max found so far.
*Disclaimer: This all assumes you won't be using an enormously high number of dimensions and that your elements follow a mostly random distribution.
The most obvious approach seems to me to be to build a tree structure on one set to allow you to search it relatively quickly. A kd-tree or similar would probably be appropriate for that.
Having done that, you walk over all the points in the other set and use the tree to find their nearest neighbour in the first set, keeping track of the maximum as you go.
It's nlog(n) to build the tree, and log(n) for one search so the whole thing should run in nlog(n).
To make things more efficient, consider using a Pigeonhole algorithm - group the points in your reference set (your colorTable) by their location in n-space. This allows you to efficiently find the nearest neighbour without having to iterate all the points.
For example, if you were working in 2-space, divide your plane into a 5 x 5 grid, giving 25 squares, with 25 groups of points.
In 3 space, divide your cube into a 5 x 5 x 5 grid, giving 125 cubes, each with a set of points.
Then, to test point n, find the square/cube/group that contains n and test distance to those points. You only need to test points from neighbouring groups if point n is closer to the edge than to the nearest neighbour in the group.
For each point in set B, find the distance to its nearest neighbor in set A.
To find the distance to each nearest neighbor, you can use a kd-tree as long as the number of dimensions is reasonable, there aren't too many points, and you will be doing many queries - otherwise it will be too expensive to build the tree to be worthwhile.
Maybe I'm misunderstanding the question, but wouldn't it be easiest to just reverse the sign on all the coordinates in one data set (i.e. multiply one set of coordinates by -1), then find the first nearest neighbour (which would be the farthest neighbour)? You can use your favourite knn algorithm with k=1.
EDIT: I meant nlog(n) where n is the sum of the sizes of both sets.
In the 1-Space set I you could do something like this (pseudocode)
Use a structure like this
Struct Item {
int value
int setid
}
(1) Max Distance = 0
(2) Read all the sets into Item structures
(3) Create an Array of pointers to all the Items
(4) Sort the array of pointers by Item->value field of the structure
(5) Walk the array from beginning to end, checking if the Item->setid is different from the previous Item->setid
if (SetIDs are different)
check if this distance is greater than Max Distance if so set MaxDistance to this distance
Return the max distance.

Resources