Recursively compute closest pairs - algorithm

I am trying to perform the closest pairs algorithm - there is one area however where I am totally stuck.
The problem can be solved in O(n log n) time using the recursive divide and conquer approach, e.g., as follows:
1) Sort points according to their x-coordinates.
2) Split the set of points into two equal-sized subsets by a vertical line x=xmid.
3) Solve the problem recursively in the left and right subsets. This yields the left-side and right-side minimum distances dLmin and dRmin, respectively.
4) Find the minimal distance dLRmin among the set of pairs of points in which one point lies on the left of the dividing vertical and the second point lies to the right.
5) The final answer is the minimum among dLmin, dRmin, and dLRmin.
My problem is with Step 3: Let's say that you've split your 8 element array into two halves, and on the left half you have 4 points - 1,2,3 and 4. Let's also say that points 2 and 3 are the closest pair among those 4 points. Well, if you keep recursively dividing this subset into halves, you will eventually end up calculating the min between 1-2 (on the left), you will calculate the min between 3-4 (on the right), and you will return the minimum-distance pair from those two pairs..
HOWEVER - you've totally missed the distance between 2-3, as you've never calculated it since they were on opposite sides... so how is this issue solved? Notice how Step 4 comes AFTER this recursive process, it seems to be an independent step and only applies to the end result after Step 3, and not to every subsequent division of the subarrays that occurs within Step 3.. is it? Or is there another way of doing this that I'm missing?

The steps are a bit misleading, in that steps 2-5 are all part of the recursion. At every level of recursion, you need to calculate dLmin, dRmin, and dLRmin. The minimum of these is returned as the answer for that level of recursion.
To use your example, you would calculated dLmin as the distance between points 1 and 2, dRmin as the distance between points 3 and 4, and then dLRmin as the distance between points 2 and 3. Since dLRmin is the smallest in your hypothetical case, it would be returned.

Related

Find a region with maximum sum of top-K points

My problem is: we have N points in a 2D space, each point has a positive weight. Given a query consisting of two real numbers a,b and one integer k, find the position of a rectangle of size a x b, with edges are parallel to axes, so that the sum of weights of top-k points, i.e. k points with highest weights, covered by the rectangle is maximized?
Any suggestion is appreciated.
P.S.:
There are two related problems, which are already well-studied:
Maximum region sum: find the rectangle with the highest total weight sum. Complexity: NlogN.
top-K query for orthogonal ranges: find top-k points in a given rectangle. Complexity: O(log(N)^2+k).
You can reduce this problem into finding two points in the rectangle: rightmost and topmost. So effectively you can select every pair of points and calculate the top-k weight (which according to you is O(log(N)^2+k)). Complexity: O(N^2*(log(N)^2+k)).
Now, given two points, they might not form a valid pair: they might be too far or one point may be right and top of the other point. So, in reality, this will be much faster.
My guess is the optimal solution will be a variation of maximum region sum problem. Could you point to a link describing that algorithm?
An non-optimal answer is the following:
Generate all the possible k-plets of points (they are N × N-1 × … × N-k+1, so this is O(Nk) and can be done via recursion).
Filter this list down by eliminating all k-plets which are not enclosed in a a×b rectangle: this is a O(k Nk) at worst.
Find the k-plet which has the maximum weight: this is a O(k Nk-1) at worst.
Thus, this algorithm is O(k Nk).
Improving the algorithm
Step 2 can be integrated in step 1 by stopping the branch recursion when a set of points is already too large. This does not change the need to scan the element at least once, but it can reduce the number significantly: think of cases where there are no solutions because all points are separated more than the size of the rectangle, that can be found in O(N2).
Also, the permutation generator in step 1 can be made to return the points in order by x or y coordinate, by pre-sorting the point array correspondingly. This is useful because it lets us discard a bunch of more possibilities up front. Suppose the array is sorted by y coordinate, so the k-plets returned will be ordered by y coordinate. Now, supposing we are discarding a branch because it contains a point whose y coordinate is outside the max rectangle, we can also discard all the next sibling branches because their y coordinate will be more than of equal to the current one which is already out of bounds.
This adds O(n log n) for the sort, but the improvement can be quite significant in many cases -- again, when there are many outliers. The coordinate should be chosen corresponding to the minimum rectangle side, divided by the corresponding side of the 2D field -- by which I mean the maximum coordinate minus the minimum coordinate of all points.
Finally, if all the points lie within an a×b rectangle, then the algorithm performs as O(k Nk) anyways. If this is a concrete possibility, it should be checked, an easy O(N) loop, and if so then it's enough to return the points with the top N weights, which is also O(N).

graph, set of points with minimum distance from a set of points [duplicate]

I wish to find the point with the minimum sum of manhattan distance/rectilinear distance from a set of points (i.e the sum of rectilinear distance between this point and each point in the set should be minimum ). The resulting point can be one of the points from the given set (not necessarily). In case more than one points exist with the same minimum distance, i wish to retrieve all of them.
IN OTHER WORDS:
I have a grid with certain intersections marked. I'd like to find the intersection that is closest to all the marked intersections. That is, I need to find a point such that the sum of distances from all points is minimum.
The cool thing about the Manhatan distance is that the distance itself comprises of two independent components: the distance on the x and y coordinate. Thus you can solve two simpler tasks and merge the results from them to obtain the desired results.
The task I speak of is: given are points on a line. Find the point on the line that minimizes the sum of the absolute distances to all the points. If there are many find all of them (btw they always turn to be a single segment which is easy to prove). The segment is determined by the (potentially two) points medians of the set. By median I mean the point that has equal number of points to the left and to the right of it. In case the number of points is even there is no such point and you choose the points with difference 1 in both directions to form the segment.
Here I add examples of solutions of this simpler task:
In case the points on the line are like that:
-4 | | | 0 | 2 3 4
^
The solution is just a point and it is 2.
In case the points on the line are like that:
-4 | | | 0 | 2 3
^---^
The whole segment [0, 2] is the solution of the problem.
You solve this task separately for the x and y coordinate and then merge the results to obtain the rectangle of minimum distanced points.
EXAMPLE
And now comes an example of the solution for the initial task.
Imagine you want to find the points that are with minimum Manhatan distance to the set (0, 6), (1, 3), (3, 5), (3, 3), (4, 7), (2, 4)
You form the two simpler tasks:
For x:
0 1 2 3 3 4
^-^
And here the solution is the segment [2, 3] (note that here we have duplicated point 3, which I represented in probably not the most intuitive way).
For y:
3 3 4 5 6 7
^-^
Here the solution is the segment [4, 5].
Finally we get that the solution of the initial task is the rectangle with formula:
2 <= x <= 3; 4 <= y <= 5
COMPLEXITY
As many people show interest in this post I decided to improve it a bit more.
Let's speak about complexity.
The complexity of the task is actually the same as the complexity of solving the simpler task (because as already discussed the solution actually consists of solving two simpler tasks). Many people will go and solve it via sorting and then choosing out the medians. However, this will cause O(nlog n) complexity, where n is the number of points in the input set.
This can be improved if a better algorithm for finding the kth element is used (Example of implementation in the C++ STL). This algorithm basically follows the same approach as qsort. The running time is O(n). Even in the case of two median points this will still remain linear (needing two runs of the same algorithm), and thus the total complexity of the algorithm becomes O(n). It is obvious that the task can not be solved any faster, as long as the input itself is of the mentioned complexity.

All points with minimum Manhattan distance from all other given points [Optimized]

The problem here is to find set of all integer points which gives minimum sum over all Manhattan distances from given set of points!
For example:
lets have a given set of points { P1, P2, P3...Pn }
Basic problem is to find a point say X which would have minimum sum over all distances from points { P1, P2, P3... Pn }.
i.e. |P1-X| + |P2-X| + .... + |Pn-X| = D, where D will be minimum over all X.
Moving a step further, there can be more than one value of X satisfying above condition. i.e. more than one X can be possible which would give the same value D. So, we need to find all such X.
One basic approach that anyone can think of will be to find the median of inputs and then brute force the co-ordinates which is mentioned in this post
But the problem with such approach is: if the median gives two values which are very apart, then we end up brute forcing all points which will never run in given time.
So, is there any other approach which would give the result even when the points are very far apart (where median gives a range which is of the order of 10^9).
You can consider X and Y separately, since they add to the distance independently of each other. This reduces the question to finding, given n points on a line, a point with the minimum sum-of-distances to the other points. This is simple: any point between the two medians (inclusive) will satisfy this.
Proof: If we have an even number of points, there will be two medians. A point between the two medians will have n/2 points to the left and n/2 points to the right, and a total sum-of-distances to those points of S.
If we move it one point to the left, S will go up by n/2 (since we're moving away from the right-most points) and down by n/2 (since we're moving towards the left-most points), so overall S remains the same. This holds true until we hit the left-most median point. When we move one left of the left-most median point, we now have (n/2 + 1) points to the right, and (n/2 - 1) points to the left, so S goes up by two. Continuing to the left will only increase S further.
By the same logic, all points to the right of the right-most median also have a higher S.
If we have an odd number of points, there is only one median. Using the same logic as above, we can show that it has the lowest value of S.
If the median gives you an interval of the order of 10^9 then each point in that interval is as good as any other.
So depending on what you want to do with those points later on you can either return the range or enumerate points in that range. No way around it..
Obviously in two dimensions you'll get a bouding rectangle, in 3 dimensions a bounding cuboid etc..
The result will always be a cartesian product of ranges obtained for each dimension, so you can return a list of those ranges as a result.
Since in manhattan distance each component contributes separately, you can consider them separately too. The optimal answer is ( median(x),median(y) ). You need to look around this point for integer solutions.
NOTE: I did not read your question properly while answering. My answer still holds, but probably you knew about this solution already.
Yes i also think that for odd number of N points on a grid , there will be only a Single point(i.e the MEDIAN) which will be at minimum sum of Manhattan distance from all other points.
For Even value of N, the scenario will be a little different.
According to me if two Sets X = {1,2} and Y= {3,4} their Cartesian product will be always 4.
i.e X × Y = {1,2} × {3,4} = {(1,3), (1,4), (2,3), (2,4)}. This is what i have understood so far.
As for EVEN number of values we always take "MIDDLE TWO" values as MEDIAN. Taking 2 from X and 2 from Y will always return a Cartesian product of 4 points.
Correct me if i am wrong.

Best matching in a bipartite graph (e.g.associating labels with points on a plot)

I am trying to extract semantics from graphical xy plots where the points are plotted and some or all have a label. The label is plotted "near the point" so that a human can normally understand which label goes with which point. For example in this plot it is clear which label(number) belongs to which point(*) and an algorithm based on Euclidian distance would work. (The labels and points have no semantic ordering - e.g. a scatterplot)
*1
*2
*3
*4
In congested plots the authoring software/human may place the label in different directions to avoid overlap. For example in
1**2
**4
3
A human reader can normally work out which label is associated with which label.
One solution I'd accept would be to create a Euclidean distance matrix and shuffle the rows to get the minimum of a function (e.g. the summed squares of the distances on the diagonal or other heuristic). In the second example (with the points labelled a,b,c,d clockwise from the NW corner) we have a distance matrix (to 1 d.p.)
a b c d
1ab2 1 1.0 2.0 2.2 1.4
dc4 2 2.0 1.0 1.4 2.2
3 3 2.0 2.2 1.4 1.0
4 2.2 1.4 1.0 2.0
and we need to label a1 b2 c4 d3. Swapping rows 3 and 4 gives the minimum sum of the diagonal. Here's a more complex example where simply picking the nearest may fail
*1*2*5
**4
3 *6
If this is solved then I shall need to go to cases where the number of labels may be smaller or larger than the number of points.
If the algorithm is standard than I would appreciate a pointer to Open Source Java (e.g. JAMA or Apache maths)
NOTE: This SO answer Associating nearby points with a path doesn't quite work as an answer because the path through the points is given.
You have a complete bipartite graph that one part is numbers and other one is points. Weight's of edge in this graph is euclidean distance between numbers and points. And you're task is finding matching with minimal weight.
This is known problem and has a well known algorithm named as Hungarian Algorithm:
From Wiki:
We are given a nonnegative n×n matrix, where the element in the i-th
row and j-th column represents the cost of assigning the j-th point to
the i-th number. We have to find an assignment of the point to the
numbers that has minimum cost. If the goal is to find the assignment
that yields the maximum cost, the problem can be altered to fit the
setting by replacing each cost with the maximum cost subtracted by the
cost.
The algorithm is easier to describe if we formulate the problem using
a bipartite graph. We have a complete bipartite graph G=(S, T; E) with
n number vertices (S) and n point vertices (T), and each edge has a
nonnegative cost c(i,j). We want to find a perfect matching with
minimum cost. The Hungarian method is a combinatorial optimization
algorithm which solves the assignment problem in polynomial time and
which anticipated later primal-dual methods. f
For detailed algorithm and code you can take a look at topcoder article
and this pdf maybe to use
there is a media file to describe it.
(This video explains why the Hungarian algorithm works)
algorithm :
step 1:- prepare a cost matrix.if the cost matrix is not a square
matrix then add a dummy row(column) with zero cost element.
step 2:- subtract the minimum element in each row from all the
elements of the respective rows.
step 3:- further modify the resulting matrix by subtracting the
minimum elememnt of each column from all the elements of the
respective columns.thus obtain the modified matrix.
step 4:- then,draw minimum no of horizontal and vertical lines to
cover all zeros in the resulting matrix.let the minimum no of lines be
N.now there are 2 possible cases.
case 1 - if N=n,where n is the order of matrix,then an optimal
assignment can be made.so make the assignment to get the required
solution.
case 2 - if N less than n then proceed to step 5
step 5: determine the smallest uncovered element in the
matrix(element not covered by N lines).subtract this minimum element
from all uncovered elements and add the same elements at the
intersection of horizontal and vertical lines.thus the second modified
matrix is obtained.
step 6:- repeat step(3) and (4) untill we get the case (1) of step 4.
step 7:- (to make zero assignments) examine the rows successively
untill a row-wise exactly single zero is found.circle(o) this zero to
make the assignment.then mark a cross(x) over all zeros if lying in
the column of the circled zero,showing that they can't be considered
for future assignment.continue in this manner untill all the zeros
have been examined. repeat the same procedure for column also.
step 8:- repeat the step 6 succeccively until one of the following
situation arises- (i)if no unmarked zeros is left,then the process
ends or (ii) if there lies more than one of the unmarked zero in any
column or row then,circle one of the unmarked zeros arbitrarily and
mark a cross in the cell of remaining zeros in its row or
column.repeat the process untill no unmarked zero is left in the
matrix.
step 9:- thus exactly one marked circled zero in each row and each
column of the matrix is obtained. the assignment corresponding to
these marked circle zeros will give the optimal assignment.
For details see wiki and http://www.ams.jhu.edu/~castello/362/Handouts/hungarian.pdf
If I have understood your question, each of the examples you show has a unique best solution that minimizes the sum of the squares of the distances between points and labels. There is an exponential number of mappings between points and labels, but perhaps you can try the following:
In polynomial time, compute the distance from each label to each point. In a general graph you would have to solve the all-pairs shortest-path problem. Here, as Mikola points out, you can just do it with a doubly nested and use the coordinate geometry: pick either the Euclidean distance or the Manhattan distance.
In polynomial time, find the minimum-cost bipartite matching between points and labels. The solution to this problem will give you a matching between points and labels that minimizes the total distance.
All algorithms (shortest paths, Euclidean distance, min-cost bipartite matching) are standard and can be found on Wikipedia.
What is slightly nonstandard is if you find more than one bipartite matching with minimum cost. If that happens, you can try them all and see if one matching minimizes the sum of distances squared. If there are still ties, I recommend you treat horizontal distance as slightly shorter than vertical distance, and run the algorithm again. If you still have ties, there may not be a unique solution, or you may want to treat "label to the right" as slightly "closer" than label to the left.
But when there is a unique solution, all-pairs shortest paths followed by miniumum-cost bipartite matching should find it.
I haven't seen a common algorithm for this case. Therefore I would solve this problem pragmantically:
Assuming that a label belonging to a point is always the nearest (maybe with others), you could orient on a region growing algorithm (see animated gif). Iterate through every point (red) for each growing step (circle around number label). The growing step is determined by the minimal distance between a point and a label.
Use temporary lists for points and labels. Each time you find a definite pair, remove the corresponding point and label. Just skip the label if there is more than one nearest point (here: label 2). By solving other point-label combinations (here: label 3), they should be mapped in another iteration.
After iterations without any progress due to overall ambiguous situations, you can define a selection method to solve them (e.g. prefer top over bottom, left over right).

Closest pair of points Planar case

I am looking at the wikipedia entry for how to solve this. It lists five steps
1.Sort points along the x-coordinate
2.Split the set of points into two equal-sized subsets by a vertical line x = xmid
3.Solve the problem recursively in the left and right subsets. This will give the left-side and right-side minimal distances dLmin and dRmin respectively.
4.Find the minimal distance dLRmin among the pair of points in which one point lies on the left of the dividing vertical and the second point lies to the right.
5.The final answer is the minimum among dLmin, dRmin, and dLRmin.
The fourth step I am having trouble understanding. How do I choose what point to the left of the line to compare to a point right of the line. I know I am not supposed to compare all points, but I am unclear about how to choose points to compare. Please do not send me a link, I have searched, gone to numerous links, and have not found an explanation that helps me understand step 4.
Thanks
Aaron
The answer to your question was in the next paragraph of the wikipedia article:
It turns out that step 4 may be
accomplished in linear time. Again, a
naive approach would require the
calculation of distances for all
left-right pairs, i.e., in quadratic
time. The key observation is based on
the following sparsity property of the
point set. We already know that the
closest pair of points is no further
apart than dist = min(dLmin,dRmin).
Therefore for each point p of the left
of the dividing line we have to
compare the distances to the points
that lie in the rectangle of
dimensions (dist, 2 * dist) to the
right of the dividing line, as shown
in the figure. And what is more, this
rectangle can contain at most 6 points
with pairwise distances at least
dRmin. Therefore it is sufficient to
compute at most 6n left-right
distances in step 4. The recurrence
relation for the number of steps can
be written as T(n) = 2T(n / 2) + O(n),
which we can solve using the master
theorem to get O(n log n).
I don't think I can put it much clearer than they already have, but do you have any specific questions about this step of the algorithm?

Resources