algorithm to find a point among n points in plane to minimize the sum of distances - algorithm

I have an algorithm problem here. It is different from the normal Fermat Point problem.
Given a set of n points in the plane, I need to find which one can minimize the sum of distances to the rest of n-1 points.
Is there any algorithm you know of run less than O(n^2)?
Thank you.

One solution is to assume median is close to the mean and for a subset of points close to the mean exhaustively calculate sum of distances. You can choose klog(n) points closest to the mean, where k is an arbitrarily chosen constant (complexity nlog(n)).
Another possible solution is Delaunay Triangulation. This triangulation is possible in O(nlogn) time. The triangulation results in a graph with one vertex for each point and edges to satisfy delauney triangulation.
Once you have the triangulation, you can start at any point and compare sum-of-distances of that point to its neighbors and keep moving iteratively. You can stop when the current point has the minimum sum-of-distance compared to its neighbors. Intuitively, this will halt at the global optimal point.

I think the underlying assumption here is that you have a dataset of points which you can easily bound, as many algorithms which would be "good enough" in practice may not be rigorous enough for theory and/or may not scale well for arbitrarily large solutions.
A very simple solution which is probably "good enough" is to sort the coordinates on the Y ordinate, then do a stable sort on the X ordinate.
Take the rectangle defined by the min(X,Y) and max(X,Y) values, complexity O(1) as the values will be at known locations in the sorted dataset.
Now, working from the center of your sorted dataset, find coordinate values as close as possible to {Xctr = Xmin + (Xmax - Xmin) / 2, Yctr = Ymin + (Ymax - Ymin) / 2} -- complexity O(N) bounded by your minimization criteria, distance being the familiar radius from {Xctr,Yctr}.
The worst case complexity would be comparing your centroid to every other point, but once you get away from the middle points you will not be improving the global optimal and should terminate the search.

Related

find best k-means from a list of candidates

I have an array of size n of points called A, and a candidate array of size O(k)>k called S. I want to find k points in S such that the sum of squared distances from the points of A to their closest point from the k points would be minimized. One way to do it would be to check the cost of any possible k points in S and take the minimum, but that would take O(k^k*n) time, is there any more efficient way to do it?
I need either an optimal solution or a constant approximation.
The reason I need this is that I'm trying to find a constant approximation for the k-means as fast as possible and later use this for a coreset construction (coreset=data minimization while still keeping the cost of any query approximately the same). I was able to show that if we assume that in the optimal clustering each cluster has omega(n/k) points we can create pretty fast a list of size O(k) canidates that contains inside of them a 3-approximation for the k-means, so I was wondering if we can find those k points or a constant approximation for their costs in time which is faster than exhaustive search.
Example for k=2
In this example S is the green dots and A is the red dots. The algorithm should return the 2 circled points from S since they minimize the sum of squared distances from the points of A to their closest point of the 2.
I have an array of size n of points called A, and a candidate array of size O(k)>k called S. I want to find k points in S such that the sum of squared distances from the points of A to their closest point from the k points would be minimized.
It sounds like this could be solved simply by checking the N points against the K points to find the k points in N with the smallest squared distance.
Therefore, I'm now fairly sure this is actually finding the k-nearest neighbors (K-NN as a computational geometry problem, not the pattern recognition definition) in the N points for each point in the K points and not actually k-means.
For higher dimensionality, it is often useful to also consider the dimensionality, D in the algorithm.
The algorithm mentioned is indeed O(NDk^2) then when considering K-NN instead. That can be improved to O(NDk) by using Quickselect algorithm on the distances. This allows for checking the list of N points against each of the K points in O(N) to find the nearest k points.
https://en.wikipedia.org/wiki/Quickselect
Edit:
Seems there is some confusion on quickselect and if it can be used. Here is a O(DkNlogN) solution that uses a standard sort O(NlogN) instead of quickselect O(N). Though this might be faster in practice and as you can see in most languages it's pretty easy to implement.
results = {}
for y in F:
def distanceSquared(x):
distance(x,y) # Custom distance for each y
# First k sorted by distanceSquared
results[y] = S.sort(key=distanceSquared)[:k]
return results
Update for new visual
# Build up distance sums O(A*N*D)
results = {}
for y in F:
def distanceSquared(x):
distance(x,y) # Custom distance for each y
# Sum of distance squared from y for all points in S
results[y] = sum(map(distanceSquared, S))
def results_key_value(key):
results[key]
# First k results sorted by key O(D*AlogA)
results.keys().sort(key=results_key_value)[:k]
You could approximate by only considering Z random points chosen from the S points. Alternatively, you could merge points in S if they are close enough together. This could reduce S to a much smaller size as long S remains about F^2 or larger in size, it shouldn't affect which points in F are chosen too much. Though you would also need to adjust the weight of the points to handle that better. IE: the square distance of a point that represents 10 points is multiplied by 10 to account for it acting as 10 points instead of just 1.

find a point closest to other points

Given N points(in 2D) with x and y coordinates. You have to find a point P (in N given points) such that the sum of distances from other(N-1) points to P is minimum.
for ex. N points given p1(x1,y1),p2(x2,y2) ...... pN(xN,yN).
we have find a point P among p1 , p2 .... PN whose sum of distances from all other points is minimum.
I used brute force approach , but I need a better approach. I also tried by finding median, mean etc. but it is not working for all cases.
then I came up with an idea that I would treat X as a vertices of a polygon and find centroid of this polygon, and then I will choose a point from Y nearest to the centroid. But I'm not sure whether centroid minimizes sum of its distances to the vertices of polygon, so I'm not sure whether this is a good way? Is there any algorithm for solving this problem?
If your points are nicely distributed and if there are so many of them that brute force (calculating the total distance from each point to every other point) is unappealing the following might give you a good enough answer. By 'nicely distributed' I mean (approximately) uniformly or (approximately) randomly and without marked clustering in multiple locations.
Create a uniform k*k grid, where k is an odd integer, across your space. If your points are nicely distributed the one which you are looking for is (probably) in the central cell of this grid. For all the other cells in the grid count the number of points in each cell and approximate the average position of the points in each cell (either use the cell centre or calculate the average (x,y) for points in the cell).
For each point in the central cell, compute the distance to every other point in the central cell, and the weighted average distance to the points in the other cells. This will, of course, be the distance from the point to the 'average' position of points in the other cells, weighted by the number of points in the other cells.
You'll have to juggle the increased accuracy of higher values for k against the increased computational load and figure out what works best for your points. If the distribution of points across cells is far from uniform then this approach may not be suitable.
This sort of approach is quite widely used in large-scale simulations where points have properties, such as gravity and charge, which operate over distances. Whether it suits your needs, I don't know.
The point in consideration is known as the Geometric Median
The centroid or center of mass, defined similarly to the geometric median as minimizing the sum of the squares of the distances to each sample, can be found by a simple formula — its coordinates are the averages of the coordinates of the samples but no such formula is known for the geometric median, and it has been shown that no explicit formula, nor an exact algorithm involving only arithmetic operations and kth roots can exist in general.
I'm not sure if I understand your question but when you calculate the minimum spanning tree the sum from any point to any other point from the tree is minimum.

Choose rectangles with maximal intersection area

In this problem r is a fixed positive integer. You are given N rectangles, all the same size, in the plane. The sides are either vertical or horizontal. We assume the area of the intersection of all N rectangles has non-zero area. The problem is how to find N-r of these rectangles, so as to maximize the area of the intersection. This problem arises in practical microscopy when one repeatedly images a given biological specimen, and alignment changes slightly during this process, due to physical reasons (e.g. differential expansion of parts of the microscope and camera). I have expressed the problem for dimension d=2. There is a similar problem for each d>0. For d=1, an O(N log(N)) solution is obtained by sorting the lefthand endpoints of the intervals. But let's stick with d=2. If r=1, one can again solve the problem in time O(N log(N)) by sorting coordinates of the corners.
So, is the original problem solved by solving first the case (N,1) obtaining N-1 rectangles, then solving the case (N-1,1), getting N-2 rectangles, and so on, until we reduce to N-r rectangles? I would be interested to see an explicit counter-example to this optimistic attempted procedure. It would be even more interesting if the procedure works (proof please!), but that seems over-optimistic.
If r is fixed at some value r>1, and N is large, is this problem in one of the NP classes?
Thanks for any thoughts about this.
David
Since the intersection of axis-aligned rectangles is an axis-aligned rectangle, there are O(N4) possible intersections (O(N) lefts, O(N) rights, O(N) tops, O(N) bottoms). The obvious O(N5) algorithm is to try all of these, checking for each whether it's contained in at least N - r rectangles.
An improvement to O(N3) is to try all O(N2) intervals in the X dimension and run the 1D algorithm in the Y dimension on those rectangles that contain the given X-interval. (The rectangles need to be sorted only once.)
How large is N? I expect that fancy data structures might lead to an O(N2 log N) algorithm, but it wouldn't be worth your time if a cubic algorithm suffices.
I think I have a counter-example. Let's say you have r := N-2. I.e. you want to find two rectangles with maximum overlapping. Let's say you have to rectangles covering the same area (=maximum overlapping). Those two will be the optimal result in the end.
Now we need to construct some more rectangles, such that at least one of those two get removed in a reduction step.
Let's say we have three rectangles which overlap a lot..but they are not optimal. They have a very small overlapping area with the other two rectangles.
Now if you want to optimize the area for four rectangles, you will remove one of the two optimal rectangles, right? Or maybe you don't HAVE to, but you're not sure which decision is optimal.
So, I think your reduction algorithm is not quite correct. Atm I'm not sure if there is a good algorithm for this or in which complexity class this belongs to, though. If I have time I think about it :)
Postscript. This is pretty defective, but may spark some ideas. It's especially defective where there are outliers in a quadrant that are near the X and Y axes - they will tend to reinforce each other, as if they were both at 45 degrees, pushing the solution away from that quadrant in a way that may not make sense.
-
If r is a lot smaller than N, and N is fairly large, consider this:
Find the average center.
Sort the rectangles into 2 sequences by (X - center.x) + (Y - center.y) and (X - center.x) - (Y - center.y), where X and Y are the center of each rectangle.
For any solution, all of the reject rectangles will be members of up to 4 subsequences, each of which is a head or tail of each of the 2 sequences. Assuming N is a lot bigger than r, most the time will be in sorting the sequences - O(n log n).
To find the solution, first find the intersection given by removing the r rectangles at the head and tail of each sequence. Use this base intersection to eliminate consideration of the "core" set of rectangles that you know will be in the solution. This will reduce the intersection computations to just working with up to 4*r + 1 rectangles.
Each of the 4 sequence heads and tails should be associated with an array of r rectangles, each entry representing the intersection given by intersecting the "core" with the i innermost rectangles from the head or tail. This precomputation reduces the complexity of finding the solution from O(r^4) to O(r^3).
This is not perfect, but it should be close.
Defects with a small r will come from should-be-rejects that are at off angles, with alternatives that are slightly better but on one of the 2 axes. The maximum error is probably computable. If this is a concern, use a real area-of-non-intersection computation instead of the simple "X+Y" difference formula I used.
Here is an explicit counter-example (with N=4 and r=2) to the greedy algorithm proposed by the asker.
The maximum intersection between three of these rectangles is between the black, blue, and green rectangles. But, it's clear that the maximum intersection between any two of these three is smaller than intersection between the black and the red rectangles.
I now have an algorithm, pretty similar to Ed Staub's above, with the same time estimates. It's a bit different from Ed's, since it is valid for all r
The counter-example by mhum to the greedy algorithm is neat. Take a look.
I'm still trying to get used to this site. Somehow an earlier answer by me was truncated to two sentences. Thanks to everyone for their contributions, particularly to mhum whose counter-example to the greedy algorithm is satisfying. I now have an answer to my own question. I believe it is as good as possible, but lower bounds on complexity are too difficult for me. My solution is similar to Ed Staub's above and gives the same complexity estimates, but works for any value of r>0.
One of my rectangles is determined by its lower left corner. Let S be the set of lower left corners. In time O(N log(N)) we sort S into Sx according to the sizes of the x-coordinates. We don't care about the order within Sx between two lower left corners with the same x-coord. Similarly the sorted sequence Sy is defined by using the sizes of the y-coords. Now let u1, u2, u3 and u4 be non-negative integers with u1+u2+u3+u4=r. We compute what happens to the area when we remove various rectangles that we now name explicitly. We first remove the u1-sized head of Sx and the u2-sized tail of Sx. Let Syx be the result of removing these u1+u2 entries from Sy. We remove the u3-sized head of Syx and the u4-sized tail of Syx. One can now prove that one of these possible choices of (u1,u2,u3,u4) gives the desired maximal area of intersection. (Email me if you want a pdf of the proof details.) The number of such choices is equal to the number of integer points in the regular tetrahedron in 4-d euclidean space with vertices at the 4 points whose coordinate sum is r and for which 3 of the 4 coordinates are equal to 0. This is bounded by the volume of the tetrahedron, giving a complexity estimate of O(r^3).
So my algorithm has time complexity O(N log(N)) + O(r^3).
I believe this produces a perfect solution.
David's solution is easier to implement, and should be faster in most cases.
This relies on the assumption that for any solution, at least one of the rejects must be a member of the complex hull. Applying this recursively leads to:
Compute a convex hull.
Gather the set of all candidate solutions produced by:
{Remove a hull member, repair the hull} r times
(The hull doesn't really need to be repaired the last time.)
If h is the number of initial hull members, then the complexity is less than
h^r, plus the cost of computing the initial hull. I am assuming that a hull algorithm is chosen such that the sorted data can be kept and reused in the hull repairs.
This is just a thought, but if N is very large, I would probably try a Monte-Carlo algorithm.
The idea would be to generate random points (say, uniformly in the convex hull of all rectangles), and score how each random point performs. If the random point is in N-r or more rectangles, then update the number of hits of each subset of N-r rectangles.
In the end, the N-r rectangle subset with the most random points in it is your answer.
This algorithm has many downsides, the most obvious one being that the result is random and thus not guaranteed to be correct. But as most Monte-Carlo algorithms it scales well, and you should be able to use it with higher dimensions as well.

Finding a cover of a set of points with circles

I have N points in a set V given by their coordinates and a number K (0 < K < N). I need to determine K circles (disks) with the same radius R, with their centers in points in the V set. These circles have to 'cover' all the N points and R is the smallest possible.
Can anyone help me with this?
This problem is known as the (discrete) $k$-center problem, and is a well known problem in clustering. While the problem is in general NP-complete, there is a very easy algorithm that generates a solution within factor 2 of the optimal solution in any metric (including the implied 2-D Euclidean distance of the question). It is due to Gonzalez, and is as follows
Pick any point
Find its farthest neighbor
Find the point furthest from these two
and so on, till you have k points.
The radius R you end up with is the distance from this last point to the next farthest point. By construction, you are guaranteed to cover all points with disks of radius centered at each of the k points, and by triangle inequality this R is within a factor of 2 of the optimal radius.
If you know that you're in the plane, you can do somewhat better in theory (including getting an exact algorithm in time polynomial in n and exponential in k), but in practice the above algorithm is likely to be the easiest
The problem you described is an instance of a more general optimization problem known as the covering problem, which can be solved with a linear programming relaxation. You might be able to define a cost function that is linear in the radius R of your circles (e.g. the sum of the radii for all the circles), and in indicator variables that select what points are chosen to draw the circles. This cost function would be defined subject to constraints that force the circles to cover all the points in your set (check the Wikipedia article on LP for examples)
Once you have defined the cost function and the constraints, there are several solvers (many of them free) that you can use to solve the optimization problem.

How to find two most distant points?

This is a question that I was asked on a job interview some time ago. And I still can't figure out sensible answer.
Question is:
you are given set of points (x,y). Find 2 most distant points. Distant from each other.
For example, for points: (0,0), (1,1), (-8, 5) - the most distant are: (1,1) and (-8,5) because the distance between them is larger from both (0,0)-(1,1) and (0,0)-(-8,5).
The obvious approach is to calculate all distances between all points, and find maximum. The problem is that it is O(n^2), which makes it prohibitively expensive for large datasets.
There is approach with first tracking points that are on the boundary, and then calculating distances for them, on the premise that there will be less points on boundary than "inside", but it's still expensive, and will fail in worst case scenario.
Tried to search the web, but didn't find any sensible answer - although this might be simply my lack of search skills.
For this specific problem, with just a list of Euclidean points, one way is to find the convex hull of the set of points. The two distant points can then be found by traversing the hull once with the rotating calipers method.
Here is an O(N log N) implementation:
http://mukeshiiitm.wordpress.com/2008/05/27/find-the-farthest-pair-of-points/
If the list of points is already sorted, you can remove the sort to get the optimal O(N) complexity.
For a more general problem of finding most distant points in a graph:
Algorithm to find two points furthest away from each other
The accepted answer works in O(N^2).
Boundary point algorithms abound (look for convex hull algorithms). From there, it should take O(N) time to find the most-distant opposite points.
From the author's comment: first find any pair of opposite points on the hull, and then walk around it in semi-lock-step fashion. Depending on the angles between edges, you will have to advance either one walker or the other, but it will always take O(N) to circumnavigate the hull.
You are looking for an algorithm to compute the diameter of a set of points, Diam(S). It can be shown that this is the same as the diameter of the convex hull of S, Diam(S) = Diam(CH(S)). So first compute the convex hull of the set.
Now you have to find all the antipodal points on the convex hull and pick the pair with maximum distance. There are O(n) antipodal points on a convex polygon. So this gives a O(n lg n) algorithm for finding the farthest points.
This technique is known as Rotating Calipers. This is what Marcelo Cantos describes in his answer.
If you write the algorithm carefully, you can do without computing angles. For details, check this URL.
A stochastic algorithm to find the most distant pair would be
Choose a random point
Get the point most distant to it
Repeat a few times
Remove all visited points
Choose another random point and repeat a few times.
You are in O(n) as long as you predetermine "a few times", but are not guaranteed to actually find the most distant pair. But depending on your set of points the result should be pretty good. =)
This question is introduced at Introduction to Algorithm. It mentioned 1) Calculate Convex Hull O(NlgN). 2) If there is M vectex on Convex Hull. Then we need O(M) to find the farthest pair.
I find this helpful links. It includes analysis of algorithm details and program.
http://www.seas.gwu.edu/~simhaweb/alg/lectures/module1/module1.html
Wish this will be helpful.
Find the mean of all the points, measure the difference between all points and the mean, take the point the largest distance from the mean and find the point farthest from it. Those points will be the absolute corners of the convex hull and the two most distant points.
I recently did this for a project that needed convex hulls confined to randomly directed infinite planes. It worked great.
See the comments: this solution isn't guaranteed to produce the correct answer.
Just a few thoughts:
You might look at only the points that define the convex hull of your set of points to reduce the number,... but it still looks a bit "not optimal".
Otherwise there might be a recursive quad/oct-tree approach to rapidly bound some distances between sets of points and eliminate large parts of your data.
This seems easy if the points are given in Cartesian coordinates. So easy that I'm pretty sure that I'm overlooking something. Feel free to point out what I'm missing!
Find the points with the max and min values of their x, y, and z coordinates (6 points total). These should be the most "remote" of all the boundary points.
Compute all the distances (30 unique distances)
Find the max distance
The two points that correspond to this max distance are the ones you're looking for.
Here's a good solution, which works in O(n log n). It's called Rotating Caliper’s Method.
https://www.geeksforgeeks.org/maximum-distance-between-two-points-in-coordinate-plane-using-rotating-calipers-method/
Firstly you find the convex hull, which you can make in O(n log n) with the Graham's scan. Only the point from the convex hull can provide you the maximal distance. This algorithm arranges points of the convex hull in the clockwise traversal. This property will be used later.
Secondly, for all the points on the convex hull, you'll need to find the most distant point on this hull (it's called the antipodal point here). You don't have to find all the antipodal points separately (which would give quadratic time). Let's say the points of the convex hall are called p_1, ..., p_n, and their order corresponds to the clockwise traversal. There is a property of convex polygons that when you iterate through points p_j on the hull in the clockwise order and calculate the distances d(p_i, p_j), these distances firstly don't decrease (and maybe increase) and then don't increase (and maybe decrease). So you can find the maximum distance easily in this case. But when you've found the correct antipodal point p_j* for the p_i, you can start this search for p_{i+1} with the candidates points starting from that p_j*. You don't need to check all previously seen points. in total p_i iterates through points p_1, ..., p_n once, and p_j iterates through these points at most twice, because p_j can never catch up p_i as it would give zero distance, and we stop when the distance starts decreasing.
A solution that has runtime complexity O(N) is a combination of the above
answers. In detail:
(1) One can compute the convex hull with runtime complexity O(N) if you
use counting sort as an internal polar angle sort and are willing to
use angles rounded to the nearest integer [0, 359], inclusive.
(2) Note that the number of points on the convex hull is then N_H which is usually less than N.
We can speculate about the size of the hull from information in Cormen et al. Introduction to Algorithms, Exercise 33-5.
For sparse-hulled distributions of a unit-radius disk, a convex polygon with k sides, and a 2-D normal distribution respectively as n^(1/3), log_2(n), sqrt(log_2(n)).
The furthest pair problem is then between comparison of points on the hull.
This is N_H^2, but each leading point's search for distance point can be
truncated when the distances start to decrease if the points are traversed
in the order of the convex hull (those points are ordered CCW from first point).
The runtime complexity for this part is then O(N_H^2).
Because N_H^2 is usually less than N, the total runtime complexity
for furthest pair is O(N) with a caveat of using integer degree angles to reduce the sort in the convex hull to linear.
Given a set of points {(x1,y1), (x2,y2) ... (xn,yn)} find 2 most distant points.
My approach:
1). You need a reference point (xa,ya), and it will be:
xa = ( x1 + x2 +...+ xn )/n
ya = ( y1 + y2 +...+ yn )/n
2). Calculate all distance from point (xa,ya) to (x1,y1), (x2,y2),...(xn,yn)
The first "most distant point" (xb,yb) is the one with the maximum distance.
3). Calculate all distance from point (xb,yb) to (x1,y1), (x2,y2),...(xn,yn)
The other "most distant point" (xc,yc) is the one with the maximum distance.
So you got your most distant points (xb,yb) (xc,yc) in O(n)
For example, for points: (0,0), (1,1), (-8, 5)
1). Reference point (xa,ya) = (-2.333, 2)
2). Calculate distances:
from (-2.333, 2) to (0,0) : 3.073
from (-2.333, 2) to (1,1) : 3.480
from (-2.333, 2) to (-8, 5) : 6.411
So the first most distant point is (-8, 5)
3). Calculate distances:
from (-8, 5) to (0,0) : 9.434
from (-8, 5) to (1,1) : 9.849
from (-8, 5) to (-8, 5) : 0
So the other most distant point is (1, 1)

Resources