I was browsing Stack Overflow for a question about an algorithm to find the closest point to a point A from a list of 2D points. I know the list must be sorted to get an optimal time, but I want something faster than the O(N**2) brute-force approach.
I found an answer that seems appealing here: Given list of 2d points, find the point closest to all other points, but I don't fully understand it. I am wondering about the part where the answer starts explaining the horizontal/vertical cost, and everything from that point onward. Could someone provide an example of what to do in the case of these (random) points?
point A: (20, 40.6)
List of Points[(-20,200),(12,47), (4,0), (-82,92), (40,15), (112, 97), (-203, 84)]
If you can provide an alternative method to the one in the linked post, that would also be fine. I know it has something to do with sorting the list, and probably taking out the extremes, but again I'm not sure what method(s) to use.
EDIT: I understand now that this is not the Euclidean distance that I am most interested in. Would a divide-and-conquer algorithm/method be the best bet here? I don't fully understand it yet, but it sounds like it solves what I want it to solve in O(N*log(N)). Would this approach be optimal, and if so, would someone mind breaking it down to the basics, as I have been unable to understand it from the way most other sites describe it?
What you are trying to do is not possible if there is no structure in the list of points and they really can be arbitrary. Assume you had an algorithm that runs faster than linear time; then there is at least one point B in your list that the algorithm never reads. If I change B to another value, the algorithm necessarily runs the same way and returns the same result. Now, if the algorithm does not return a point of the list that is equal to A, I can set B = A; the correct solution to the problem would then be B (you can't get any closer than actually being the same point), and the algorithm would necessarily return a wrong result.
What the question you are referring to is trying to do is find a point A in a list L such that the sum of the distances between A and every point in L is minimal. The algorithm described in the answer runs in O(n*log(n)) time, where n is the number of points. Note that n*log(n) grows faster than n, so it is actually slower than looking at every element once.
Also "distance" in the question does not refer to the euclidean distance. Where normally you would define the distance between a point (x_1,y_1) and a second point (x_2,y_2) to be sqrt((x_2-x_1)^2+(y_2-x_2)^2) the question refers to the "Taxicab distance" |x_2-x_1|+|y_2-x_2| where | | refers to the absolute value.
Re: edit
If you just want to find the one point of the list that is closest to a fixed point A, then you can search for it linearly. See the following Python code:
def distance(p, q):
    # Manhattan (taxicab) distance between two 2D points
    px, py = p
    qx, qy = q
    return abs(px - qx) + abs(py - qy)

def findClosest(a, listOfPoints):
    # linear scan: keep the smallest distance seen so far and its index
    minDist = float("inf")
    minIndex = None
    for index, b in enumerate(listOfPoints):
        d = distance(a, b)
        if d < minDist:
            minDist = d
            minIndex = index
    return minDist, minIndex

a = (20, 40.6)
listOfPoints = [(-20, 200), (12, 47), (4, 0), (-82, 92), (40, 15), (112, 97), (-203, 84)]
minDist, minIndex = findClosest(a, listOfPoints)
print("minDist:", minDist)
print("minIndex:", minIndex)
print("closest point:", listOfPoints[minIndex])
The challenge of the referenced question is that you don't want to minimize the distance to a fixed point, but that you want to find a point A out of the list L whose average distance to all other points in L is minimal.
What is the best algorithm to determine whether any three points are collinear in a set of, say, n points? Please also explain the complexity if it is not trivial.
Thanks
Bala
If you can come up with a better than O(N^2) algorithm, you can publish it!
This problem is 3SUM-hard, and whether there is a sub-quadratic algorithm (i.e. better than O(N^2)) for it is an open problem. Many common computational-geometry problems (including yours) have been shown to be 3SUM-hard, and this class of problems is growing. Like NP-hardness, the concept of 3SUM-hardness has proven useful in establishing the 'toughness' of some problems.
For a proof that your problem is 3SUM-hard, refer to the excellent survey paper here: http://www.cs.mcgill.ca/~jking/papers/3sumhard.pdf
Your problem appears on page 3 of the above-mentioned paper (conveniently called 3-POINTS-ON-LINE).
So, the currently best known algorithm is O(N^2) and you already have it :-)
A simple O(d*N^2) time and space algorithm, where d is the dimensionality and N is the number of points (probably not optimal):
Create a bounding box around the set of points (make it big enough so there are no points on the boundary)
For each pair of points, compute the line passing through them.
For each line, compute its two collision points with the bounding box.
The two collision points define the original line, so if there are any matching lines, they will produce the same two collision points.
Use a hash set to determine if there are any duplicate collision point pairs.
There are 3 collinear points if and only if there were duplicates.
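Here is a rough Python sketch of those steps, just to make the idea concrete (the function name, the rounding tolerance, and the assumption of distinct points are mine, not from the answer above):

from itertools import combinations

def collinear_by_box_hash(points, eps=1e-9):
    # Canonicalize each line through a pair of points by its two intersections
    # with a bounding box that strictly contains all points. Duplicate
    # intersection pairs mean two point pairs lie on the same line,
    # i.e. there are at least 3 collinear points.
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    xmin, xmax = min(xs) - 1, max(xs) + 1     # box strictly larger than the point set
    ymin, ymax = min(ys) - 1, max(ys) + 1

    def box_hits(p, q):
        (x1, y1), (x2, y2) = p, q
        dx, dy = x2 - x1, y2 - y1
        hits = []
        if abs(dx) > eps:                     # line is not vertical: try left/right sides
            for xb in (xmin, xmax):
                y = y1 + (xb - x1) / dx * dy
                if ymin - eps <= y <= ymax + eps:
                    hits.append((xb, y))
        if abs(dy) > eps:                     # line is not horizontal: try bottom/top sides
            for yb in (ymin, ymax):
                x = x1 + (yb - y1) / dy * dx
                if xmin - eps <= x <= xmax + eps:
                    hits.append((x, yb))
        # round and dedupe (a corner may be hit twice), keep the two collision points
        return tuple(sorted({(round(x, 6), round(y, 6)) for x, y in hits})[:2])

    seen = set()
    for p, q in combinations(points, 2):
        key = box_hits(p, q)
        if key in seen:
            return True                       # duplicate line => 3 collinear points
        seen.add(key)
    return False

print(collinear_by_box_hash([(0, 0), (1, 1), (2, 2), (5, -3)]))   # True
print(collinear_by_box_hash([(0, 0), (1, 2), (2, 5), (5, -3)]))   # False

The rounding is only there to make the floating-point keys hash reliably; with integer input you could keep exact rational collision points instead.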
Another simple (maybe even trivial) solution, which doesn't use a hash table, runs in O(n² log n) time and uses O(n) space:
Let S be a set of points. We describe an algorithm that finds out whether or not S contains three collinear points.
For each point o in S do:
Pass a line L parallel to the x-axis through o.
Replace every point in S below L with its reflection across L. (For example, if L is the x-axis, a point (a,-b) with b>0 becomes (a,b) after the reflection.) Let the new set of points be S'.
The angle of each point p in S' is the angle that the segment po makes with the line L. Sort the points of S' by these angles.
Walk through the sorted points in S'. If there are two consecutive points which are collinear with o - return true.
If no collinear points were found in the loop - return false.
The loop runs n times, and each iteration performs n log n steps. It is not hard to prove that if there are three points on a line they will be found, and that nothing will be found otherwise.
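A hedged Python sketch of this sweep, assuming distinct points (folding the angle modulo π plays the role of the reflection step; the small eps tolerance is my addition for floating-point input):

import math

def has_collinear_triple(points, eps=1e-12):
    for i, o in enumerate(points):
        # angle of the segment from the pivot o to every other point, folded into [0, pi)
        angles = sorted(
            math.atan2(p[1] - o[1], p[0] - o[0]) % math.pi
            for j, p in enumerate(points) if j != i
        )
        # two (near-)equal consecutive angles mean two points collinear with o
        for a, b in zip(angles, angles[1:]):
            if b - a < eps:
                return True
    return False

print(has_collinear_triple([(0, 0), (1, 1), (2, 2)]))   # True
print(has_collinear_triple([(0, 0), (1, 2), (3, 1)]))   # False

With integer coordinates, an exact cross-product test on consecutive pairs would avoid the eps tolerance entirely.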
I am currently assigned to create a c++ program to find the closest pair of points in a (x,y) coordinate system. However, I am having a lot of trouble trying to understand one thing.
Every tutorial/guide that I have read about the closest pair problem tells me to sort the set of points by their Y coordinates, but I don't see what the point of this is. Can someone explain why we sort by Y coordinates and what it is used for? I understand that we sort the points by X in order to get L and X*, but I just don't understand why we have to sort the points by Y coordinates as well.
You don't have to, but then your running time is not improved over O(n²). The whole point is to compute as little as possible -- by examining as few points as possible, ignoring those you know will not be part of the answer. You do that by sorting on Y.
Here's a pretty good explanation I just googled: http://www.cs.mcgill.ca/~cs251/ClosestPair/ClosestPairDQ.html
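The question is about C++, but here is a compact Python sketch of the whole divide-and-conquer (my own naming, assuming distinct points and at least two of them), mainly to show where the y-sorted order is used: inside the strip around the dividing line, each point only has to be checked against the next few points in y-order, which is what makes the merge step linear.

import math

def closest_pair(points):
    px = sorted(points)                           # points sorted by x
    py = sorted(points, key=lambda p: p[1])       # the same points sorted by y

    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    def rec(px, py):
        n = len(px)
        if n <= 3:                                # brute force the tiny cases
            best = (float("inf"), None, None)
            for i in range(n):
                for j in range(i + 1, n):
                    best = min(best, (dist(px[i], px[j]), px[i], px[j]))
            return best
        mid = n // 2
        midx = px[mid][0]
        left = set(px[:mid])
        ly = [p for p in py if p in left]         # y-order is preserved while splitting
        ry = [p for p in py if p not in left]
        best = min(rec(px[:mid], ly), rec(px[mid:], ry))
        d = best[0]
        # the strip around the dividing line, already in y-order thanks to py
        strip = [p for p in py if abs(p[0] - midx) < d]
        for i, p in enumerate(strip):
            for q in strip[i + 1:i + 8]:          # at most ~7 neighbours can be within d
                if q[1] - p[1] >= d:
                    break
                best = min(best, (dist(p, q), p, q))
        return best

    return rec(px, py)

pts = [(-20, 200), (12, 47), (4, 0), (-82, 92), (40, 15), (112, 97), (-203, 84)]
print(closest_pair(pts))                          # (distance, point1, point2)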
I have been thinking about a variation of the closest pair problem in which the only available information is the set of distances already calculated (we are not allowed to sort points according to their x-coordinates).
Consider 4 points (A, B, C, D), and the following distances:
dist(A,B) = 0.5
dist(A,C) = 5
dist(C,D) = 2
In this example, I don't need to evaluate dist(B,C) or dist(A,D): by the triangle inequality, dist(B,C) >= dist(A,C) - dist(A,B) = 4.5 and dist(A,D) >= dist(A,C) - dist(C,D) = 3, so both are guaranteed to be greater than the current known minimum distance of 0.5.
Is it possible to use this kind of information to reduce the O(n²) cost to something like O(n log n)?
Is it possible to reduce the cost to something close to O(n log n) if I accept a kind of approximate solution? In this case, I am thinking about some technique based on reinforcement learning that only converges to the real solution when the number of reinforcements goes to infinity, but provides a great approximation for small n.
Processing time (measured in big-O notation) is not the only issue. Keeping a very large number of previously calculated distances can also be an issue.
Imagine this problem for a set with 10⁸ points.
What kind of solution should I look for? Was this kind of problem solved before?
This is not a classroom problem or something related. I have been just thinking about this problem.
I suggest using ideas that are derived from quickly solving k-nearest-neighbor searches.
The M-Tree data structure (see http://en.wikipedia.org/wiki/M-tree and http://www.vldb.org/conf/1997/P426.PDF) is designed to reduce the number of distance comparisons that need to be performed to find "nearest neighbors".
Personally, I could not find an implementation of an M-Tree online that I was satisfied with (see my closed thread Looking for a mature M-Tree implementation) so I rolled my own.
My implementation is here: https://github.com/jon1van/MTreeMapRepo
Basically, this is a binary tree in which each leaf node contains a HashMap of Keys that are "close" in some metric space you define.
I suggest using my code (or the idea behind it) to implement a solution in which you:
Search each leaf node's HashMap and find the closest pair of Keys within that small subset.
Return the closest pair of Keys when considering only the "winner" of each HashMap.
This style of solution would be a "divide and conquer" approach that returns an approximate solution.
You should know this code has an adjustable parameter that governs the maximum number of Keys that can be placed in an individual HashMap. Reducing this parameter will increase the speed of your search, but it will increase the probability that the correct solution won't be found, because one Key may be in HashMap A while the second Key is in HashMap B.
Also, each HashMap is associated with a "radius". Depending on how accurate you want your result, you may be able to just search the HashMap with the largest hashMap.size()/radius (because that HashMap contains the highest density of points, so it is a good search candidate).
Good Luck
If you only have sample distances, not original point locations in a plane you can operate on, then I suspect you are bounded at O(E).
Specifically, it would seem from your description that any valid solution needs to inspect every edge in order to rule out its having something interesting to say; meanwhile, inspecting every edge and taking the smallest solves the problem.
Planar versions bypass O(V^2) by using planar distances to deduce limitations on sets of edges, allowing us to avoid looking at most of the edge weights.
Use the same idea as in space partitioning. Recursively split the given set of points by choosing two points and dividing the set into two parts: the points that are closer to the first point and the points that are closer to the second. That is the same as splitting the points by a line passing between the two chosen points.
That produces a (binary) space partitioning, on which standard nearest-neighbour search algorithms can be used.
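A minimal Python sketch of that recursive split (the names and the leaf_size cutoff are mine; dist is passed in as a function, so in the distances-only setting of the question it could simply look up precomputed distances):

import math
import random

def build_partition(points, dist, leaf_size=8):
    # leaf: a small bucket that a nearest-neighbour search would scan linearly
    if len(points) <= leaf_size:
        return points
    a, b = random.sample(points, 2)               # the two pivot points
    near_a = [p for p in points if dist(p, a) <= dist(p, b)]
    near_b = [p for p in points if dist(p, a) > dist(p, b)]
    if not near_a or not near_b:                  # degenerate split, stop here
        return points
    return (a, b, build_partition(near_a, dist, leaf_size),
                  build_partition(near_b, dist, leaf_size))

pts = [(random.random(), random.random()) for _ in range(100)]
tree = build_partition(pts, math.dist)            # Euclidean distance used for illustration
print(tree[0], tree[1])                           # pivots of the root split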
I need to generate n random points in general position in the plane, i.e. no three points may lie on the same line. The points should have integer coordinates and lie inside a fixed m x m square. What would be the best algorithm to solve such a problem?
Update: square is aligned with the axes.
Since they're integers within a square, treat them as points in a bitmap. When you add a point after the first, use Bresenham's algorithm to paint all pixels on each of the lines going through the new point and one of the old ones. When you need to add a new point, pick a random location and check whether it's clear; otherwise, try again. Since each pair of pixels defines a new line, and thus excludes up to m-2 other pixels, as the number of points grows you will have several random choices rejected before you find a good one. The advantage of the approach I'm suggesting is that you only pay the cost of going through all lines when you have a good choice, while rejecting a bad one is a very quick test.
(if you want to use a different definition of line, just replace Bresenham's with the appropriate algorithm)
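Here is a hedged Python sketch of the bitmap idea (names are mine). One deviation from the answer above: instead of Bresenham, it marks exactly the integer lattice points on each line by stepping with the gcd-reduced direction vector, since the "no three collinear" constraint is about exact integer collinearity rather than rasterized pixels:

import random
from math import gcd

def random_general_position(m, n, max_tries=100000):
    # blocked[x][y] becomes True once (x, y) lies on a line through two chosen points
    blocked = [[False] * m for _ in range(m)]
    chosen = []

    def mark_line(p, q):
        dx, dy = q[0] - p[0], q[1] - p[1]
        g = gcd(abs(dx), abs(dy))
        dx, dy = dx // g, dy // g                 # smallest integer step along the line
        for step in (1, -1):                      # walk both directions from p
            x, y = p
            while 0 <= x < m and 0 <= y < m:
                blocked[x][y] = True
                x, y = x + step * dx, y + step * dy

    for _ in range(max_tries):
        if len(chosen) == n:
            return chosen
        x, y = random.randrange(m), random.randrange(m)
        if blocked[x][y] or (x, y) in chosen:     # rejecting a bad choice is a cheap test
            continue
        for p in chosen:                          # pay the line-marking cost only on success
            mark_line(p, (x, y))
        chosen.append((x, y))
    raise RuntimeError("too many rejections; try a larger m or a smaller n")

print(random_general_position(50, 10))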
Can't see any way around checking each point as you add it, either by (a) running through all of the possible lines it could be on, or (b) eliminating conflicting points as you go along to reduce the possible locations for the next point. Of the two, (b) seems like it could give you better performance.
Similar to @LaC's answer. If memory is not a problem, you could do it like this:
Add all points on the plane to a list (L).
Shuffle the list.
For each point (P) in the list,
For each point (Q) previously picked,
Remove every point from L that is collinear with P and Q.
Add P to the picked list.
You could continue the outer loop until you have enough points, or run out of them.
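A small Python sketch of this procedure, under my own naming and with an exact integer cross-product collinearity test; it returns fewer than n points if the list runs out first:

import random
from itertools import product

def general_position_points(m, n):
    remaining = set(product(range(m), range(m)))  # all integer points of the m x m grid
    order = list(remaining)
    random.shuffle(order)
    picked = []

    def collinear(a, b, c):
        # zero cross product of (b - a) and (c - a) means a, b, c lie on one line
        return (b[0] - a[0]) * (c[1] - a[1]) == (b[1] - a[1]) * (c[0] - a[0])

    for p in order:
        if p not in remaining:                    # already crossed out by an earlier line
            continue
        for q in picked:                          # remove everything on the line through p and q
            remaining -= {r for r in remaining if r not in (p, q) and collinear(p, q, r)}
        remaining.discard(p)
        picked.append(p)
        if len(picked) == n:
            break
    return picked

print(general_position_points(20, 8))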
This might just work (though might be a little constrained on being random). Find the largest circle you can draw within the square (this seems very doable). Pick any n points on the circle, no three will ever be collinear :-).
This should be an easy enough task in code. Say the circle is centered at origin (so something of the form x^2 + y^2 = r^2). Assuming r is fixed and x randomly generated, you can solve to find y coordinates. This gives you two points on the circle for every x which are diametrically opposite. Hope this helps.
Edit: Oh, integer points, I just noticed that. That's a pity. I'm going to keep this solution up though, since I like the idea.
Both @LaC's and @MizardX's solutions are very interesting, but you can combine them to get an even better solution.
The problem with @LaC's solution is that random choices get rejected. The more points you have already generated, the harder it gets to generate new ones. If there is only one available position left, you have only a slight chance of choosing it at random (1/(n*m)).
In @MizardX's solution you never get rejected choices; however, if you directly implement the "Remove every point from L that is collinear with P and Q" step, you'll get worse complexity (O(n^5)).
Instead it would be better to use a bitmap to find which points from L are to be removed. For each point, the bitmap would hold either a value indicating that the point is free to use together with its location in the L list, or a value indicating that it is already crossed out. This way you get a worst-case complexity of O(n^4), which is probably optimal.
EDIT:
I've just found this question: Generate Non-Degenerate Point Set in 2D - C++
It's very similar to this one. It would be good to use the solution from this answer: Generate Non-Degenerate Point Set in 2D - C++. Modifying it a bit to use radix or bucket sort, and adding all the n^2 possible points to the P set initially and shuffling it, one can also get a worst-case complexity of O(n^4) with much simpler code. Moreover, if space is a problem and @LaC's solution is not feasible due to its space requirements, then this algorithm will fit without modifications and offer decent complexity.
Here is a paper that may solve your problem:
"Point-sets in general position with many similar copies of a pattern" by Bernardo M. Abrego and Silvia Fernandez-Merchant.
um, you don't specify which plane.. but just generate 3 random numbers and assign to x,y, and z
if 'the plane' is arbitrary, then set z=0 every time or something...
do a check on x and y to see if they are in your m boundary,
compare the third x,y pair to see if it is on the same line as the first two... if it is, then regenerate the random values.
I've got a list of ~5000 points (specified as longitude/latitude pairs), and I want to find the nearest 5 of these to another point, specified by the user.
Can anyone suggest an efficient algorithm for working this out? I'm implementing this in Ruby, so if there's a suitable library then that would be good to know, but I'm still interested in the algorithm!
UPDATE: A couple of people have asked for more specific details on the problem. So here goes:
The 5000 points are mostly within the same city. There might be a few outside it, but it's safe to assume that 99% of them lie within a 75km radius, and that all of them lie within a 200km radius.
The list of points changes rarely. For the sake of argument, let's say it gets updated once per day, and we have to deal with a few thousand requests in that time.
You could accelerate the search by partitioning the 2D space with a quad-tree or a k-d tree, and then once you've reached a leaf node you compare the remaining distances one by one until you find the closest match.
See also this blog post which refers to this other blog post which both discuss nearest neighbors searches with kd-trees in Ruby.
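The question is about Ruby, but just to make the k-d tree idea concrete, here is a minimal Python sketch using SciPy's cKDTree (the coordinates below are made-up stand-ins; for points spread over a single city, treating longitude/latitude as planar coordinates is a reasonable approximation, though strictly you would scale longitude by cos(latitude) or use a proper great-circle distance):

import numpy as np
from scipy.spatial import cKDTree

# made-up stand-in for the ~5000 (longitude, latitude) points
points = np.random.uniform(low=[-0.5, 51.3], high=[0.3, 51.7], size=(5000, 2))

tree = cKDTree(points)                  # build once, whenever the list changes (e.g. daily)
query = (0.1, 51.5)                     # the user-specified point
dists, idx = tree.query(query, k=5)     # the 5 nearest neighbours
print(points[idx])
print(dists)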
You can get a very fast upper-bound estimator on distance using the Manhattan distance (scaled for latitude); this should be good enough for rejecting 99.9% of candidates if they're not close. (EDIT: you have since told us they are all close. In that case, your metric should be distance-squared, as per Lars H's comment.)
Consider this equivalent to rejecting anything outside a spherical-rectangle bounding-box (as an approximation to a circle bounding-box).
I don't do Ruby, so here is the algorithm in pseudocode:
Let the latitude and longitude of your reference point P be (pa,po), and of the other point X be (xa,xo).
Precompute ka, the latitude scaling factor for longitudinal distances: ka (= cos(pa in°)). (Strictly, ka = constant is a linearized approximation in the vicinity of P.)
Then the distance estimator is: D(X,P) = ka*|xa-pa| + |xo-po| = ka*da + do
where |z| means abs(z). At worst this overestimates true distance by a factor of √2 (when da==do), hence we allow for that as follows:
Do a running search and keep Dmin, the fifth-smallest scaled-Manhattan-distance-estimate.
Hence you can reject upfront all points for which D(X,P) > √2 * Dmin, since their true distance √((ka*da)² + do²) is at least D(X,P)/√2, which is greater than Dmin - that should eliminate 99.9% of points.
Keep a list of all remaining candidate points with D(X,P) <= √2 * Dmin. Update Dmin if you found a new fifth-smallest D. Priority-queue, or else a list of (coord,D) are good data structures.
Note that we never computed Euclidean distance, we only used float multiplication and addition.
(Consider this similar to quadtree except filtering out everything except the region that interests us, hence no need to compute accurate distances upfront or build the data structure.)
It would help if you told us the expected spread in latitudes and longitudes (degrees, minutes, or what?). If all the points are close together, the √2 factor in this estimator will be too conservative and mark every point as a candidate; a lookup-table-based distance estimator would then be preferable.
Pseudocode:
initialize Dmin with the fifth-smallest D from the first five points in the list
for point X in list:
    if D(X,P) <= √2 * Dmin:
        insert the tuple (X,D) into the priority-queue of candidates
        if D < Dmin: update the running set of the five smallest D values and set Dmin to the fifth-smallest
# after the first pass, reject candidates with D > √2 * Dmin (use the final value of Dmin)
# ...
# then do a second pass over the candidates to find the lowest 5 exact distances
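For what it's worth, here is a small Python sketch of that two-pass filter (the thread is about Ruby, but the idea translates directly; the function names are mine and it assumes at least five (lat, lon) points):

import heapq
import math

def five_nearest(P, points):
    pa, po = P                                        # reference point: (latitude, longitude)
    ka = math.cos(math.radians(pa))                   # latitude scaling factor

    def D(X):                                         # scaled Manhattan estimate, no sqrt
        return ka * abs(X[0] - pa) + abs(X[1] - po)

    def exact(X):                                     # planar-approximation Euclidean distance
        return math.hypot(ka * (X[0] - pa), X[1] - po)

    # first pass: the fifth-smallest Manhattan estimate gives the rejection cutoff
    Dmin = heapq.nsmallest(5, (D(X) for X in points))[-1]
    cutoff = math.sqrt(2) * Dmin

    # second pass: exact distances only for the few surviving candidates
    candidates = [X for X in points if D(X) <= cutoff]
    return heapq.nsmallest(5, candidates, key=exact)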
Since your list is quite short, I'd highly recommend brute force. Just compare all 5000 to the user-specified point. It'll be O(n) and you'll get paid.
Other than that, a quad-tree or k-d tree are the usual approaches to spatial subdivision. But in your case, you'll end up doing a linear number of insertions into the tree, and then a constant number of logarithmic lookups... a bit of a waste, when you're probably better off just doing a linear number of distance comparisons and being done with it.
Now, if you want to find the N nearest points, you're looking at sorting on the computed distances and taking the first N, but that's still O(n log n)ish.
EDIT: It's worth noting that building the spatial tree becomes worthwhile if you're going to reuse the list of points for multiple queries.
Rather than pure brute force, for 5000 nodes I would first calculate the x+y (Manhattan) distance for every node instead of the straight-line distance.
Once you've sorted that list, if e.g. x+y for the 5th node is 38, you can rule out any node where either x or y distance is > 38. This way, you can rule out a lot of nodes without having to calculate the straight line distance. Then brute force calculate the straight line distance for the remaining nodes.
These algorithms are not easily explained, thus I will only give you some hints in the right direction. You should look for Voronoi Diagrams. With a Voronoi Diagram you can easily precompute a graph in O(n^2 log n) time and search the closest point in O(log n) time.
Precomputation is done with a cron job at night and searching is live. This corresponds to your specification.
Now you could save the k closest neighbours of each of your 5000 points; then, starting from the nearest point found via the Voronoi diagram, look up the remaining 4 points.
But be warned that these algorithms are not very easy to implement.
A good reference is:
de Berg et al.: Computational Geometry: Algorithms and Applications (2008), sections 7.1 and 7.2
Since you have that few points, I would recommend doing a brute-force search, to the effect of trying all points against each other, which is an O(n^2) operation; with n = 5000 that is roughly 12.5 million iterations of a suitable algorithm, and you just store the relevant results. This would have sub-100-ms execution time in C, so we are looking at a second or two at most in Ruby.
When the user picks a point, you can use your stored data to give the results in constant time.
EDIT: I re-read your question, and it seems as though the user provides their own point each time. In that case it's faster to just do an O(n) linear search through your set each time the user provides a point.
if you need to repeat this multiple times, with different user-entered locations, but don't want to implement a quad-tree (or can't find a library implementation) then you can use a locality-sensitive hashing (kind-of) approach that's fairly intuitive:
take your (x,y) pairs and create two lists, one of (x, i) and one of (y, i) where i is the index of the point
sort both lists
then, when given a point (X, Y),
binary-search for X in the x-sorted list and for Y in the y-sorted list
expand outwards on both lists, looking for common indices
for common indices, calculate exact distances
stop expanding when the differences in X and Y exceed the exact distance of the most-distant of the current 5 points.
all you're doing is saying that a nearby point must have a similar x and a similar y value...
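Here is one hedged Python sketch of that expansion, under my own naming. Exact distances are computed only for indices that have shown up in both the x-sorted and the y-sorted lists, and expansion stops once both one-dimensional gaps exceed the current fifth-best exact distance (at that point no unseen point can do better):

import math
from bisect import bisect_left

def five_nearest_two_lists(query, points):
    X, Y = query

    def walker(coord, value):
        # yields (gap, index) in order of increasing |coordinate - value|
        order = sorted((p[coord], i) for i, p in enumerate(points))
        lo = bisect_left(order, (value, -1)) - 1
        hi = lo + 1
        while lo >= 0 or hi < len(order):
            left = value - order[lo][0] if lo >= 0 else math.inf
            right = order[hi][0] - value if hi < len(order) else math.inf
            if left <= right:
                yield left, order[lo][1]
                lo -= 1
            else:
                yield right, order[hi][1]
                hi += 1

    gens = {'x': walker(0, X), 'y': walker(1, Y)}
    gap = {'x': 0.0, 'y': 0.0}
    seen = {'x': set(), 'y': set()}
    best = []                                          # exact (distance, index), at most 5, sorted

    while gens:
        for axis in list(gens):
            try:
                gap[axis], i = next(gens[axis])
            except StopIteration:
                gap[axis] = math.inf                   # this list is exhausted
                del gens[axis]
                continue
            other = 'y' if axis == 'x' else 'x'
            if i in seen[other]:                       # common index: compute the exact distance
                d = math.hypot(points[i][0] - X, points[i][1] - Y)
                best = sorted(best + [(d, i)])[:5]
            else:
                seen[axis].add(i)
        fifth = best[4][0] if len(best) == 5 else math.inf
        if gap['x'] > fifth and gap['y'] > fifth:      # nothing unseen can beat the current 5
            break
    return [points[i] for _, i in best]

pts = [(-20, 200), (12, 47), (4, 0), (-82, 92), (40, 15), (112, 97), (-203, 84)]
print(five_nearest_two_lists((20, 40.6), pts))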