I am looking at the wikipedia entry for how to solve this. It lists five steps
1. Sort points along the x-coordinate.
2. Split the set of points into two equal-sized subsets by a vertical line x = xmid.
3. Solve the problem recursively in the left and right subsets. This will give the left-side and right-side minimal distances dLmin and dRmin respectively.
4. Find the minimal distance dLRmin among the pairs of points in which one point lies to the left of the dividing vertical and the second point lies to the right.
5. The final answer is the minimum among dLmin, dRmin, and dLRmin.
I am having trouble understanding the fourth step. How do I choose which points to the left of the line to compare with points to the right of the line? I know I am not supposed to compare all points, but I am unclear about how to choose the points to compare. Please do not send me a link; I have searched, gone to numerous links, and have not found an explanation that helps me understand step 4.
Thanks
Aaron
The answer to your question was in the next paragraph of the wikipedia article:
It turns out that step 4 may be accomplished in linear time. Again, a naive approach would require the calculation of distances for all left-right pairs, i.e., in quadratic time. The key observation is based on the following sparsity property of the point set. We already know that the closest pair of points is no further apart than dist = min(dLmin, dRmin). Therefore for each point p to the left of the dividing line we have to compare the distances to the points that lie in the rectangle of dimensions (dist, 2 * dist) to the right of the dividing line, as shown in the figure. And what is more, this rectangle can contain at most 6 points with pairwise distances at least dRmin. Therefore it is sufficient to compute at most 6n left-right distances in step 4. The recurrence relation for the number of steps can be written as T(n) = 2T(n / 2) + O(n), which we can solve using the master theorem to get O(n log n).
I don't think I can put it much clearer than they already have, but do you have any specific questions about this step of the algorithm?
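For concreteness, here is a rough Python sketch of how step 4 is commonly implemented: collect the points within dist of the dividing line into a strip, sort the strip by y, and compare each strip point only against its next few neighbours in y order (the sparsity argument above bounds that to a constant). The names strip_min and x_mid are mine, and the recursion that produces dist = min(dLmin, dRmin) is assumed to exist already.

```python
import math

def strip_min(points, x_mid, dist):
    # keep only points within dist of the dividing line x = x_mid
    strip = [p for p in points if abs(p[0] - x_mid) < dist]
    strip.sort(key=lambda p: p[1])            # sort the strip by y
    best = dist
    for i, (x1, y1) in enumerate(strip):
        j = i + 1
        # only points whose y differs by less than the current best can improve
        # the answer; by the sparsity property this inner loop runs O(1) times
        while j < len(strip) and strip[j][1] - y1 < best:
            best = min(best, math.dist((x1, y1), strip[j]))
            j += 1
    return best
```

The returned value is already min(dist, dLRmin), so step 5 is folded in.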
I'm new to coding and today I completed the trivial solution for the Closest-Pair problem in a 2-D space (two for loops).
However, I gave up on finding any solution that could do it in O(n log n). Even after researching it, I still don't understand how this can be faster than the trivial method.
What I understand:
-> At first we split the array into two halves and sort everything considering only the X coordinates. This can be done in n log n.
Next there are recursive calls which "find the two points with the lowest distance" in each half. But how exactly is this done in less than O(n^2)?
In my understanding it is impossible to find the lowest distance between N/2 points without checking every single one of them.
There is a solution in 1-D which absolutely makes sense to me. After sorting we know that the distance between two non-adjacent points can't be lower than the distance between some pair of adjacent points. However, this is not true in 2-D space, since we have an additional Y coordinate which could make the lowest distance occur between two points that are not adjacent on the X axis.
First of all, heed the advice of user #Evg - this answer cannot substitute for the comprehensive description and mathematically rigorous analysis of the algorithm.
However, here are some ideas to get the intuition started:
(Recursion structure)
The question states:
Next there are recursive calls which "find the two points with the lowest distance" in each half. But how exactly is this done in less than O(n^2)? In my understanding it is impossible to find the lowest distance between N/2 points without checking every single one of them.
The recursion, however, does not stop at level 1 - assume for the sake of the argument that some O(n log n) algorithm works. Finding the closest pair among N/2 points by applying that very algorithm takes O(N/2 log(N/2)) - not O((N/2)^2).
(Consequences of finding a closest pair in one half)
If you have found a closest pair (p, q) in the 'left' half of the point set, this pair's distance sets an upper bound to the width of a corridor around the halving line from which a closer pair (r, s) with r from the left, s from the right half can be drawn. If the closest distance found so far is 'small', it significantly reduces the size of the candidate set. As the points have been ordered by their x coordinate, the algorithm can exploit the information efficiently.
Said corridor may still cover up to the whole set of N points, but if it does, it provides information about the geometry of the point set: the points of each half will basically be aligned along a vertical line. This information can be exploited algorithmically - the most naive way would be to execute the algorithm once again, but sorting along y coordinates and halving the point set by a horizontal line. Note that executing any algorithm a constant number of times does not change the asymptotic run time expressed by the O(.) notation.
(Finding a close pair with one point from each half)
Consider checking a pair of points (r, s), one point from each half. It is known that the difference in their x and y coordinates, resp., mustn't exceed the minimal distance d found so far. It is known from the recursion that there can be no points r', s' (r' from the left, s' from the right half) closer to r, s, resp., than d. So given some r there cannot be 'many' candidates from the other half.
Imagine a circle of radius d drawn around r. Any point s from the other half being closer than d must be located within that circle. Let there be a few of them - the minimum distance among each pair, however, must still be at least d. The maximum number of points that can be distributed within a circle of radius d such that the distance between each pair of them is at least d is 7 - think of a regular hexagon with side length d and its center coinciding with the circle's center.
So after the recursion, every r from the left half needs to be checked against at most a constant number of points from the other half, which makes the part of the algorithm after the recursion run in O(N).
Note that finding the pairing candidates for a given r is an efficient operation - the points from both halves have been sorted by the same criterion.
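To tie these pieces together, here is a hedged Python sketch of the whole recursion (the names closest_pair and rec are mine). For brevity it re-sorts the corridor by y inside every call, which gives O(n log^2 n) rather than the textbook O(n log n); the stricter version keeps the points y-sorted while merging, but the structure is the same.

```python
import math

def closest_pair(points):
    pts = sorted(points)                      # sort once by x (ties broken by y)

    def rec(p):
        n = len(p)
        if n <= 3:                            # brute force the small cases
            return min(math.dist(a, b)
                       for i, a in enumerate(p) for b in p[i + 1:])
        mid = n // 2
        x_mid = p[mid][0]
        d = min(rec(p[:mid]), rec(p[mid:]))   # d = min(dLmin, dRmin)
        # corridor of half-width d around the dividing line, sorted by y
        strip = sorted((q for q in p if abs(q[0] - x_mid) < d),
                       key=lambda q: q[1])
        for i, a in enumerate(strip):
            for b in strip[i + 1:]:
                if b[1] - a[1] >= d:          # only a constant number of candidates per a
                    break
                d = min(d, math.dist(a, b))
        return d

    return rec(pts)

print(closest_pair([(0, 0), (3, 4), (1, 1), (5, 2), (2, 3)]))   # -> 1.414...
```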
Suppose we have n points in a bounded region of the plane. The problem is to divide it into 4 regions (with a horizontal and a vertical line) such that the sum of a metric over the regions is minimized.
The metric can be, for example, the sum of the distances between the points in each region, or any other measure of the spread of the points. See the figure below.
I don't know if any clustering algorithm might help me tackle this problem, or if, for instance, it can be formulated as a simple optimization problem where the decision variables are the "axes".
I believe this can be formulated as a MIP (Mixed Integer Programming) problem.
Let's introduce 4 quadrants A, B, C, D, where A is right-upper, B is right-lower, etc. Then define a binary variable
delta(i,k) = 1 if point i is in quadrant k
0 otherwise
and continuous variables
Lx, Ly : coordinates of the lines
Obviously we have:
sum(k, delta(i,k)) = 1
xlo <= Lx <= xup
ylo <= Ly <= yup
where xlo, xup are the minimum and maximum x-coordinates (and ylo, yup the corresponding y-coordinates). Next we need to implement implications like:
delta(i,'A') = 1 ==> x(i)>=Lx and y(i)>=Ly
delta(i,'B') = 1 ==> x(i)>=Lx and y(i)<=Ly
delta(i,'C') = 1 ==> x(i)<=Lx and y(i)<=Ly
delta(i,'D') = 1 ==> x(i)<=Lx and y(i)>=Ly
These can be handled by so-called indicator constraints or written as linear inequalities, e.g.
x(i) <= Lx + (delta(i,'A')+delta(i,'B'))*(xup-xlo)
Similar for the others. Finally the objective is
min sum((i,j,k), delta(i,k)*delta(j,k)*d(i,j))
where d(i,j) is the distance between points i and j. This objective can be linearized as well.
After applying a few tricks, I could prove global optimality for 100 random points in about 40 seconds using Cplex. This approach is not really suited for large datasets (the computation time quickly increases when the number of points becomes large).
I suspect this cannot be shoe-horned into a convex problem. Also I am not sure this objective is really what you want. It will try to make all clusters about the same size (adding a point to a large cluster introduces lots of distances to be added to the objective; adding a point to a small cluster is cheap). Maybe an average distance for each cluster is a better measure (but that makes the linearization more difficult).
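In case it helps to see the formulation written out, here is a minimal sketch using PuLP and its bundled CBC solver (an assumption about tooling - the answer above used Cplex). The names delta, Lx, Ly mirror the answer; the product delta(i,k)*delta(j,k) is linearized with an auxiliary variable z, which is one standard trick, not necessarily the one used above.

```python
import math
import pulp

points = [(0.1, 0.9), (0.4, 0.2), (0.8, 0.7), (0.6, 0.1), (0.3, 0.5)]
n = len(points)
quadrants = ["A", "B", "C", "D"]   # A: right-upper, B: right-lower, C: left-lower, D: left-upper

xlo, xup = min(p[0] for p in points), max(p[0] for p in points)
ylo, yup = min(p[1] for p in points), max(p[1] for p in points)
Mx, My = xup - xlo, yup - ylo      # big-M constants

dist = {(i, j): math.dist(points[i], points[j])
        for i in range(n) for j in range(i + 1, n)}

prob = pulp.LpProblem("quadrant_partition", pulp.LpMinimize)
delta = pulp.LpVariable.dicts("delta", (range(n), quadrants), cat="Binary")
Lx = pulp.LpVariable("Lx", lowBound=xlo, upBound=xup)
Ly = pulp.LpVariable("Ly", lowBound=ylo, upBound=yup)
# z[i][j][k] is forced to 1 when points i and j are both in quadrant k
z = pulp.LpVariable.dicts("z", (range(n), range(n), quadrants), lowBound=0)

for i, (x, y) in enumerate(points):
    prob += pulp.lpSum(delta[i][k] for k in quadrants) == 1
    # big-M versions of the indicator constraints from the answer
    prob += x <= Lx + (delta[i]["A"] + delta[i]["B"]) * Mx   # in C or D => x <= Lx
    prob += x >= Lx - (delta[i]["C"] + delta[i]["D"]) * Mx   # in A or B => x >= Lx
    prob += y <= Ly + (delta[i]["A"] + delta[i]["D"]) * My   # in B or C => y <= Ly
    prob += y >= Ly - (delta[i]["B"] + delta[i]["C"]) * My   # in A or D => y >= Ly

for (i, j) in dist:
    for k in quadrants:
        prob += z[i][j][k] >= delta[i][k] + delta[j][k] - 1

prob += pulp.lpSum(dist[i, j] * z[i][j][k] for (i, j) in dist for k in quadrants)

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("Lx =", Lx.value(), "Ly =", Ly.value())
```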
Note - probably incorrect. I will try and add another answer
The one dimensional version of minimising sums of squares of differences is convex. If you start with the line at the far left and move it to the right, each point crossed by the line stops accumulating differences with the points to its right and starts accumulating differences to the points to its left. As you follow this the differences to the left increase and the differences to the right decrease, so you get a monotonic decrease, possibly a single point that can be on either side of the line, and then a monotonic increase.
I believe that the one dimensional problem of clustering points on a line is convex, but I no longer believe that the problem of drawing a single vertical line in the best position is convex. I worry about sets of points that vary in y co-ordinate so that the left hand points are mostly high up, the right hand points are mostly low down, and the intermediate points alternate between high up and low down. If this is not convex, the part of the answer that tries to extend to two dimensions fails.
So for the one dimensional version of the problem you can pick any point and work out in time O(n) whether that point should be to the left or right of the best dividing line. So by binary chop you can find the best line in time O(n log n).
I don't know whether the two dimensional version is convex or not but you can try all possible positions for the horizontal line and, for each position, solve for the position of the vertical line using a similar approach as for the one dimensional problem (now you have the sum of two convex functions to worry about, but this is still convex, so that's OK). Therefore you solve at most O(n) one-dimensional problems, giving cost O(n^2 log n).
If the points aren't very strangely distributed, I would expect that you could save a lot of time by using the solution of the one dimensional problem at the previous iteration as a first estimate of the position of solution for the next iteration. Given a starting point x, you find out if this is to the left or right of the solution. If it is to the left of the solution, go 1, 2, 4, 8... steps away to find a point to the right of the solution and then run binary chop. Hopefully this two-stage chop is faster than starting a binary chop of the whole array from scratch.
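For the one-dimensional sub-problem, a hedged sketch of the idea (the function name best_1d_split is mine): the cost of a split is evaluated in O(1) from prefix sums, a small shortcut over the O(n)-per-probe version described above, and the search simply assumes the cost is unimodal in the split index, which is exactly the convexity claim under discussion.

```python
def best_1d_split(values):
    xs = sorted(values)
    n = len(xs)
    # prefix sums of x and x^2
    ps, ps2 = [0.0] * (n + 1), [0.0] * (n + 1)
    for i, x in enumerate(xs):
        ps[i + 1] = ps[i] + x
        ps2[i + 1] = ps2[i] + x * x

    def group_cost(lo, hi):            # sum of squared differences within xs[lo:hi]
        m = hi - lo
        s, s2 = ps[hi] - ps[lo], ps2[hi] - ps2[lo]
        return m * s2 - s * s

    def cost(k):                       # k points on the left, n - k on the right
        return group_cost(0, k) + group_cost(k, n)

    lo, hi = 0, n
    while hi - lo > 2:                 # ternary search over the (assumed) unimodal cost
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if cost(m1) < cost(m2):
            hi = m2
        else:
            lo = m1
    k = min(range(lo, hi + 1), key=cost)
    return k, cost(k)
```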
Here's another attempt. Lay out a grid so that, except in the case of ties, each point is the only point in its column and the only point in its row. Assuming no ties in any direction, this grid has N rows, N columns, and N^2 cells. If there are ties the grid is smaller, which makes life easier.
Separating the cells with a horizontal and vertical line is pretty much picking out a cell of the grid and saying that cell is the cell just above and just to the right of where the lines cross, so there are roughly O(N^2) possible such divisions, and we can calculate the metric for each such division. I claim that when the metric is the sum of the squares of distances between points in a cluster the cost of this is pretty much a constant factor in an O(N^2) problem, so the whole cost of checking every possibility is O(N^2).
The metric within a rectangle formed by the dividing lines is SUM_i,j[ (X_i - X_j)^2 + (Y_i-Y_j)^2]. We can calculate the X contributions and the Y contributions separately. If you do some algebra (which is easier if you first subtract a constant so that everything sums to zero) you will find that the metric contribution from a co-ordinate is linear in the variance of that co-ordinate. So we want to calculate the variances of the X and Y co-ordinates within the rectangles formed by each division. https://en.wikipedia.org/wiki/Algebraic_formula_for_the_variance gives us an identity which tells us that we can work out the variance given SUM_i Xi and SUM_i Xi^2 for each rectangle (and the corresponding information for the y co-ordinate). This calculation can be inaccurate due to floating point rounding error, but I am going to ignore that here.
Given a value associated with each cell of a grid, we want to make it easy to work out the sum of those values within rectangles. We can create partial sums along each row, transforming 0 1 2 3 4 5 into 0 1 3 6 10 15, so that each cell in a row contains the sum of all the cells to its left and itself. If we take these values and do partial sums up each column, we have just worked out, for each cell, the sum of the rectangle whose top right corner lies in that cell and which extends to the bottom and left sides of the grid.

The calculated values in the far right column give us the sum for all the cells on the same level as that cell and below it. If we subtract off the rectangles we know how to calculate we can find the value of a rectangle which lies at the right hand side of the grid and the bottom of the grid. Similar subtractions allow us to work out first the value of the rectangles to the left and right of any vertical line we choose, and then to complete our set of four rectangles formed by two lines crossing at any cell in the grid.

The expensive part of this is working out the partial sums, but we only have to do that once, and it costs only O(N^2). The subtractions and lookups used to work out any particular metric have only a constant cost. We have to do one for each of O(N^2) cells, but that is still only O(N^2).
(So we can find the best clustering in O(N^2) time by working out the metrics associated with all possible clusterings in O(N^2) time and choosing the best).
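A hedged NumPy sketch of this scheme (the function name best_split and the use of coordinate ranks as grid rows/columns are my own choices; as above, it assumes no two points share an x or a y coordinate). It returns the best total metric together with the pair of rank cut positions for the two lines.

```python
import numpy as np

def best_split(points):
    pts = np.asarray(points, dtype=float)
    xs = np.argsort(np.argsort(pts[:, 0]))    # column index (x rank) of each point
    ys = np.argsort(np.argsort(pts[:, 1]))    # row index (y rank) of each point
    n = len(pts)
    # per-cell accumulators: count, sum x, sum x^2, sum y, sum y^2
    acc = np.zeros((5, n, n))
    acc[0, ys, xs] = 1
    acc[1, ys, xs] = pts[:, 0]
    acc[2, ys, xs] = pts[:, 0] ** 2
    acc[3, ys, xs] = pts[:, 1]
    acc[4, ys, xs] = pts[:, 1] ** 2
    # 2-D partial sums: P[:, r, c] = totals over rows < r and columns < c
    P = np.zeros((5, n + 1, n + 1))
    P[:, 1:, 1:] = acc.cumsum(axis=1).cumsum(axis=2)

    def metric(cnt, sx, sxx, sy, syy):
        # sum over unordered pairs of squared distance within one rectangle
        return cnt * sxx - sx ** 2 + cnt * syy - sy ** 2

    best = (np.inf, None)
    total = P[:, n, n]
    for r in range(n + 1):            # horizontal line below row r
        for c in range(n + 1):        # vertical line left of column c
            lower_left = P[:, r, c]
            lower = P[:, r, n]
            left = P[:, n, c]
            regions = [lower_left,
                       lower - lower_left,
                       left - lower_left,
                       total - lower - left + lower_left]
            cost = sum(metric(*reg) for reg in regions)
            if cost < best[0]:
                best = (cost, (r, c))
    return best
```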
I was asked this question at Yahoo for a machine learning profile. Given a set of points as (x,y) coordinates, I was asked to find the pair of points with the lowest distance in O(n) or O(log n) time.
Obviously I was able to come up with an O(n^2) solution, but I was nowhere near getting the better algorithm. Even though the problem statement was screaming for divide and conquer, I just could not come up with the reasoning for the merge step. I also googled this question and found that it is actually very popular, but I still could not get hold of the reasoning for the merge step.
Can anyone help me out with this?
Input: (x1,y1),(x2,y2),(x3,y3),(x4,y4),(x5,y5)
The problem can be solved in O(n log n) time using the recursive divide and conquer approach, e.g., as follows:
1. Sort points according to their x-coordinates.
2. Split the set of points into two equal-sized subsets by a vertical line x = xmid.
3. Solve the problem recursively in the left and right subsets. This yields the left-side and right-side minimum distances dLmin and dRmin, respectively.
4. Find the minimal distance dLRmin among the pairs of points in which one point lies to the left of the dividing vertical and the second point lies to the right.
5. The final answer is the minimum among dLmin, dRmin, and dLRmin.
http://en.wikipedia.org/wiki/Closest_pair_of_points
I'm creating a simple game and came up with this problem while designing the AI for my game:
Given a set of N points inside a rectangle in the Cartesian plane, I need to find the widest straight path through this rectangle. The path must be empty (i.e., not containing any point).
I wonder if there is any efficient algorithm to solve this problem. Can you suggest any keyword/paper/anything related to this problem?
EDIT: The rectangle is always defined by 4 points at its corners. I added an image for illustration; the path in the above picture is determined by the two red lines.
This is the widest empty corridor problem. Houle and Maciel gave an O(n^2)-time, O(n)-space algorithm in a 1988 tech report entitled "Finding the widest empty corridor through a set of points", which seems not to be available online. Fortunately, Janardan and Preparata describe this algorithm in Section 4 of their paper Widest-corridor problems, which is available.
Loop through all pairs of points. Construct a line l through the pair. (^1) On each side of l, either there are other points, or not. If not, then there is not a path on that side of l. If there are other points, loop through points calculating the perpendicular distance d from l to each such point. Record the minimum d. That is the widest path on that side of l. Continue looping through all pairs, comparing widest path for that pair with the previous widest path.
This algorithm can be considered naive and runs in O(n^3) time.
Edit: The above algorithm misses a case. At ^1 above, insert: "Construct two lines perpendicular to l through each point of the pair. If there is no third point between the lines, then record the distance d between the points. This constitutes a path." Continue the algorithm at ^1. With this additional case, the algorithm is still O(n^3).
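A hedged Python sketch of the naive procedure, including the extra case from the edit (the function name is mine; it assumes distinct points in general position and, like the description above, ignores the bounding rectangle):

```python
import itertools
import math

def widest_corridor_naive(points):
    best = 0.0
    for a, b in itertools.combinations(points, 2):
        dx, dy = b[0] - a[0], b[1] - a[1]
        length = math.hypot(dx, dy)
        others = [p for p in points if p != a and p != b]
        # signed perpendicular distance of each other point from the line through a and b
        side = [((p[0] - a[0]) * dy - (p[1] - a[1]) * dx) / length for p in others]
        for s in (+1, -1):               # widest empty strip on each side of the line
            dists = [abs(d) for d in side if s * d > 0]
            if dists:
                best = max(best, min(dists))
        # extra case from the edit: corridor perpendicular to ab, bounded by the
        # perpendiculars through a and b, valid if no third point projects between them
        proj = [((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length for p in others]
        if not any(0 < t < length for t in proj):
            best = max(best, length)
    return best
```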
Myself, I would start by looking at the Delaunay triangulation of the point set:
http://en.wikipedia.org/wiki/Delaunay_triangulation
There appear to be plenty of resources there on efficient algorithms to build this - Fortune's algorithm, at O(n log n), for starters.
My intuition tells me that your widest path will be defined by one of the edges in this graph (Namely, it would run perpendicular to the edge, and its width would be equal to the length of the edge). How to sort the edges, check the candidates and identify the widest path remains. I like this question, and I'm going to keep thinking about it. :)
EDIT 1: My intuition fails me! A simple equilateral triangle is a counter-example: the widest path is shorter than any of the edges in the triangulation. Still thinking...
EDIT 2: So, we need a black-box algorithm which, given two points in the set, finds the widest path through the point set which is bounded by those two points. (Visualize two parallel lines running through the two points; rotate them in harmony with each other until there are no points between them). Let's call the runtime of this algorithm 'R'.
Given such an algorithm, we can do the following:
Build the Delaunay triangulation of the point set : O(n log n)
Sort the edges by width : O(n log n)
Beginning with the largest edge and moving down, use the black-box algorithm to determine the widest path involving those two points; store it as X : O(nR)
Stop when the edge being examined is shorter than the width of X.
Steps 1 and 2 are nice, but the O(nR) is kind of scary. If R turns out to be O(n), that's already O(n^2) for the whole algorithm. The nice thing is that, for a general set of random points, we would expect that we wouldn't have to go through all the edges.
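For the concrete part of this idea, here is a hedged sketch of steps 1 and 2 only, using scipy.spatial.Delaunay (the black-box width computation R is deliberately left out, since it is not pinned down above):

```python
import numpy as np
from scipy.spatial import Delaunay

def sorted_delaunay_edges(points):
    """Build the Delaunay triangulation and return its edges sorted by length,
    longest first (steps 1 and 2 of the outline above)."""
    pts = np.asarray(points, dtype=float)
    tri = Delaunay(pts)
    edges = set()
    for a, b, c in tri.simplices:                 # each triangle contributes 3 edges
        for u, v in ((a, b), (b, c), (a, c)):
            edges.add((min(u, v), max(u, v)))
    return sorted(edges,
                  key=lambda e: -np.linalg.norm(pts[e[0]] - pts[e[1]]))
```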
Given a set of n points, can we find three points that describe a triangle with minimum area in O(n^2)? If yes, how, and if not, can we do better than O(n^3)?
I have found some papers that state that this problem is at least as hard as the problem that asks to find three collinear points (a triangle with area 0). These papers describe an O(n^2) solution to this problem by reducing it to an instance of the 3-sum problem. I couldn't find any solution for what I'm interested in however. See this (look for General Position) for such a paper and more information on 3-sum.
There are O(n^2) algorithms for finding the minimum area triangle.
For instance you can find one here: http://www.cs.tufts.edu/comp/163/fall09/CG-lecture9-LA.pdf
If I understood that pdf correctly, the basic idea is as follows:
1. For each pair of points AB you find the point that is closest to it.
2. You construct a dual of the points so that lines <-> points: the line y = mx + c is mapped to the point (m, c).
3. In the dual, for a given point (which corresponds to a segment in the original set of points), the nearest line vertically gives us the required point for step 1.
Apparently steps 2 and 3 can be done in O(n^2) time.
Also I doubt the papers showed 3SUM-hardness by reducing to 3SUM. It should be the other way round.
There's an algorithm that finds the required area with complexity O(n^2*log(n)).
For each point Pi in the set, do the following (without loss of generality we can assume that Pi is at the origin, or translate the points to make it so).
Then for each pair of points (x1,y1), (x2,y2) the triangle area is 0.5*|x1*y2 - x2*y1|, so we need to minimize that value. Instead of iterating through all pairs of remaining points (which gives O(N^3) complexity), we sort those points using the predicate x1*y2 < x2*y1. It is claimed that to find the triangle with minimal area we only need to check pairs of adjacent points in the sorted array.
So the complexity of this procedure for each point is O(n log n), and the whole algorithm works in O(n^2 log n).
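A hedged Python sketch of this procedure (the function name is mine; sorting by polar angle via atan2 is one way to realize the stated predicate, and since no correctness proof is given below, treat it as an illustration; it assumes at least three points):

```python
import math

def min_triangle_area(points):
    best = math.inf
    for i, (px, py) in enumerate(points):
        # translate so Pi sits at the origin
        rest = [(x - px, y - py) for j, (x, y) in enumerate(points) if j != i]
        rest.sort(key=lambda p: math.atan2(p[1], p[0]))
        # area of the triangle (Pi, p1, p2) is 0.5 * |x1*y2 - x2*y1|;
        # check angularly adjacent pairs only (including the wrap-around pair)
        for (x1, y1), (x2, y2) in zip(rest, rest[1:] + rest[:1]):
            best = min(best, 0.5 * abs(x1 * y2 - x2 * y1))
    return best
```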
P.S. Can't quickly find the proof that this algorithm is correct :( - hope I will find it later and post it then.
The problem
Given a set of n points, can we find three points that describe a triangle with minimum area in O(n^2)? If yes, how, and if not, can we do better than O(n^3)?
is better resolved in this paper: James King, A Survey of 3sum-Hard Problems, 2004