Algorithm for nearest point - ruby

I've got a list of ~5000 points (specified as longitude/latitude pairs), and I want to find the nearest 5 of these to another point, specified by the user.
Can anyone suggest an efficient algorithm for working this out? I'm implementing this in Ruby, so if there's a suitable library then that would be good to know, but I'm still interested in the algorithm!
UPDATE: A couple of people have asked for more specific details on the problem. So here goes:
The 5000 points are mostly within the same city. There might be a few outside it, but it's safe to assume that 99% of them lie within a 75km radius, and that all of them lie within a 200km radius.
The list of points changes rarely. For the sake of argument, let's say it gets updated once per day, and we have to deal with a few thousand requests in that time.

You could accelerate the search by partitioning the 2D space with a quad-tree or a kd-tree and then once you've reach a leaf node you compare the remaining distances one by one until you find the closest match.
See also this blog post which refers to this other blog post which both discuss nearest neighbors searches with kd-trees in Ruby.

You can get a very fast upper-bound estimator on distance using Manhattan distance (scaled for latitude), this should be good enough for rejecting 99.9% of candidates if they're not close (EDIT: since then you tell us they are close. In that case, your metric should be distance-squared, as per Lars H comment).
Consider this equivalent to rejecting anything outside a spherical-rectangle bounding-box (as an approximation to a circle bounding-box).
I don't do Ruby so here is algorithm with pseudocode:
Let the latitude, longitude of your reference point P (pa,po) and the other point X (xa,xo).
Precompute ka, the latitude scaling factor for longitudinal distances: ka (= cos(pa in°)). (Strictly, ka = constant is a linearized approximation in the vicinity of P.)
Then the distance estimator is: D(X,P) = ka*|xa-pa| + |xo-po| = ka*da + do
where |z| means abs(z). At worst this overestimates true distance by a factor of √2 (when da==do), hence we allow for that as follows:
Do a running search and keep Dmin, the fifth-smallest scaled-Manhattan-distance-estimate.
Hence you can reject upfront all points for which D(X,P) > √2 * Dmin (since they must be at least farther away than √((ka*da)² + do²) - that should eliminate 99.9% of points).
Keep a list of all remaining candidate points with D(X,P) <= √2 * Dmin. Update Dmin if you found a new fifth-smallest D. Priority-queue, or else a list of (coord,D) are good data structures.
Note that we never computed Euclidean distance, we only used float multiplication and addition.
(Consider this similar to quadtree except filtering out everything except the region that interests us, hence no need to compute accurate distances upfront or build the data structure.)
It would help if you tell us the expected spread in latitudes, longitudes (degrees, minutes or what? If all the points are close, the √2 factor in this estimator will be too conservative and mark every point as a candidate; a lookup-table based distance estimator would be preferable.)
Pseudocode:
initialize Dmin with the fifth-smallest D from the first five points in list
for point X in list:
if D(X,P) <= √2 * Dmin:
insert the tuple (X,D) in the priority-queue of candidates
if (Dmin>D): Dmin = D
# after first pass, reject candidates with D > √2 * Dmin (use the final value of Dmin)
# ...
# then a second pass on candidates to find lowest 5 exact distances

Since your list is quite short, I'd highly recommend brute force. Just compare all 5000 to the user-specified point. It'll be O(n) and you'll get paid.
Other than that, a quad-tree or Kd-tree are the usual approaches to spacial subdivision. But in your case, you'll end up doing a linear number of insertions into the tree, and then a constant number of logarithmic lookups... a bit of a waste, when you're probably better off just doing a linear number of distance comparisons and being done with it.
Now, if you want to find the N nearest points, you're looking at sorting on the computed distances and taking the first N, but that's still O(n log n)ish.
EDIT: It's worth noting that building the spacial tree becomes worthwhile if you're going to reuse the list of points for multiple queries.

Rather than pure brute-force, for 5000 nodes, I would calculate the individual x+y distances for every node, rather than the straight line distance.
Once you've sorted that list, if e.g. x+y for the 5th node is 38, you can rule out any node where either x or y distance is > 38. This way, you can rule out a lot of nodes without having to calculate the straight line distance. Then brute force calculate the straight line distance for the remaining nodes.

These algorithms are not easily explained, thus I will only give you some hints in the right direction. You should look for Voronoi Diagrams. With a Voronoi Diagram you can easily precompute a graph in O(n^2 log n) time and search the closest point in O(log n) time.
Precomputation is done with a cron job at night and searching is live. This corresponds to your specification.
Now you could save the k closests pairs of each of your 5000 points and then starting from the nearest point from the Voronoi Diagram and search the remaining 4 points.
But be warned that these algorithms are not very easy to implement.
A good reference is:
de Berg: Computational Geometry Algorithms Applications (2008) chapters 7.1 and 7.2

Since you have that few points, I would recommend doing a brute-force search, to the effect of trying all points against each other with is an O(n^2) operation, with n = 5000, or roughly 25/2 million iterations of a suitable algorithm, and just storing the relevant results. This would have sub 100 ms execution time in C, so we are looking at a second or two at the most in Ruby.
When the user picks a point, you can use your stored data to give the results in constant time.
EDIT I re-read your question, and it seems as though the user provides his own last point. In that case it's faster to just do a O(n) linear search through your set each time user provides a point.

if you need to repeat this multiple times, with different user-entered locations, but don't want to implement a quad-tree (or can't find a library implementation) then you can use a locality-sensitive hashing (kind-of) approach that's fairly intuitive:
take your (x,y) pairs and create two lists, one of (x, i) and one of (y, i) where i is the index of the point
sort both lists
then, when given a point (X, Y),
bisection sort for X and Y
expand outwards on both lists, looking for common indices
for common indices, calculate exact distances
stop expanding when the differences in X and Y exceed the exact distance of the most-distant of the current 5 points.
all you're doing is saying that a nearby point must have a similar x and a similar y value...

Related

What is an efficient algorithm to find all local minimums(maximums) in a matrix?

I want to find ALL local maximums in a N*N matrix, with a constraint that every 2 peaks found must be at least M cells away (in both directions). In other words, for very peak P found, local maximums within (2M+1)*(2M+1) sub-matrix around P are ignored, if that peak is lower than P.
By local maximum I mean the largest element in the (2M+1)*(2M+1) submatrix centered at the element.
For the naive method, the complexity is O(N*N*M*M). Is there an efficient algorithm to achieve this?
This is a sample matrix for N=5 and M=1 (3*3 grid):
As your matrix appears to be something like an image, using image processing techniques appears to be the natural choice.
You could define peaks (local maxima or minima) as image regions with zero crossing of both local partial derivatives. If you want maxima look for negative curvature at these places, if you're looking for minima watch out for positive curvature (curvature -> second order derivative).
There are linear convolution operators available (and a whole lot of theory behind them), that produce the partial derivatives in x and y direction (e.g., Sobel, Prewitt) and second order derivatives.
There's even algorithms for blob detection already, which appears to be related to your task (e.g., Laplacian of Gaussian).
If you are looking for speed, you might want to see if you can benefit of linear separability, precomputation of filter kernels (associativity), or DFT. Also note that this kind of tasks usually benefit hugely of parallelization. See if you can leverage more than one core, a GPU or an FPGA for some performance boost.
I would use a floodfill approach (it's not actually floodfill, but floodfill was what I had in mind when I came up with it):
Find all minima. Put them in a sorted list/stack.
Pick (and remove) the first item from the list (lowest minimum).
If the element is marked as used, discard the item and go to 2.
Mark all elements inside the submatrix around the item as used.
Go to 2.
The algortihm ends when the list is empty.
Total cost: O(N*N + p * log p + p * M * M) where p is the number of minima.

Spatial sorting Million points in 3d space

I have a collection of Million points in 3d space.
Each point is an object
Struct Point
{
double x;
double y;
double z;
};
The million points are stored inside an c++ vector MyPoints in some random order.
I want to sort these million points according to spatial distribution of points in space such that points which are physically closer should also be closer inside my array after sorting.
My first guess on how to do this is as follows: first sort points w.r.t Z-axis, then sort points along Y-axis and then sort points along X-axis
MyPointsSortedAlongZ = Sort(MyPoints, AlongZAxis )
MyPointsSortedAlongY = Sort(MyPointsSortedAlongZ , AlongYAxis )
MyPointsSortedAlongX = Sort(MyPointsSortedAlongY , AlongYAxis )
Now firstly, I dont know if this method is correct. Will my final array of points MyPointsSortedAlongX be sorted perfectly spatially (or nearly sorted spatially) ?
Secondly, if this method is correct, is it the fastest way to do this. What is a better method to do this ?
The CGAL library provides an implementation of a space filling curve algorithm that can be useful for that task.
Well, it really depends on what the metric you are going to use to compare between two arrays, but look for example on the metric which is sum of differences between adjacent points:
metric(arr) = sum[ d(arr[i],arr[i-1]) | i from 1 to n ]
where d(x,y) is the distance between point x and point y
Note that an optimal (smallest) solution to this metric is basically an optimal (shortest) path that goes through all points. This is the Traveling Salesman Problem (TSP), which is NP-Hard, so there is no known polynomial solution to it.
I'd suggest - first define exactly what is the metric to compare two arrays.
Then, use heuristics or approximations to the metric, such as Genetic Algorithms or hill climbing, or reduce the problem to TSP, and use a known heuristic/approximation for it.
Regarding your method:
It is easy to see it is not optimal for the simple example:
[(1,100),(1,-100),(2,0)]
Let's assume main sort by x, secondary sort by y.
It will give us the 'sorted' vector:
[(1,-100),(1,100),(2,0)]
according to the above metric, we get metric(arr) ~= 300
However, the order [(1,-100),(2,0),(1,100)] will get us metric(arr) ~= 200
So, the suggested heuristic is not optimal (as expected).
Maybe this helps:
A Template for the Nearest Neighbor Problem (DDJ 2001)
Sorting three times on the three axis is a waste. The third sort will completely undo what the other sorts have done.

Single Pass Seed Selection Algorithm for k-Means

I've recently read the Single Pass Seed Selection Algorithm for k-Means article, but not really understand the algorithm, which is:
Calculate distance matrix Dist in which Dist (i,j) represents distance from i to j
Find Sumv in which Sumv (i) is the sum of the distances from ith point to all other points.
Find the point i which is min (Sumv) and set Index = i
Add First to C as the first centroid
For each point xi, set D (xi) to be the distance between xi and the nearest point in C
Find y as the sum of distances of first n/k nearest points from the Index
Find the unique integer i so that D(x1)^2+D(x2)^2+...+D(xi)^2 >= y > D(x1)^2+D(x2)^2+...+D(x(i-1))^2
Add xi to C
Repeat steps 5-8 until k centers
Especially step 6, do we still use the same Index (same point) over and over or we use the newly added point from C? And about step 8, does i have to be larger than 1?
Honestly, I wouldn't worry about understanding that paper - its not very good.
The algorithm is poorly described.
Its not actually a single pass, it needs do to n^2/2 pairwise computations + one additional pass through the data.
They don't report the runtime of their seed selection scheme, probably because it is very bad doing O(n^2) work.
They are evaluating on very simple data sets that don't have a lot of bad solutions for k-Means to fall into.
One of their metrics of "better"ness is how many iterations it takes k-means to run given the seed selection. While it is an interesting metric, the small differences they report are meaningless (k-means++ seeding could be more iterations, but less work done per iteration), and they don't report the run time or which k-means algorithm they use.
You will get a lot more benefit from learning and understanding the k-means++ algorithm they are comparing against, and reading some of the history from that.
If you really want to understand what they are doing, I would brush up on your matlab and read their provided matlab code. But its not really worth it. If you look up the quantile seed selection algorithm, they are essentially doing something very similar. Instead of using the distance to the first seed to sort the points, they appear to be using the sum of pairwise distances (which means they don't need an initial seed, hence the unique solution).
Single Pass Seed Selection algorithm is a novel algorithm. Single Pass mean that without any iterations first seed can be selected. k-means++ performance is depends on first seed. It is overcome in SPSS. Please gothrough the paper "Robust Seed Selestion Algorithm for k-means" from the same authors
John J. Louis

Find all points in sphere of radius r around arbitrary coordinate

I'm looking for an efficient algorithm that for a space with known height, width and length, given a fixed radius R, and a list of points N, with 3-dimensional coordinates in that space, will find all the points within a fixed radius R of an arbitrary point on the grid. This query will be done many times with different points, so an expensive pre-processing/sorting step, in exchange for quick queries may be worth it. This is a bit of a bottleneck step of an application I'm working on, so any time I can cut off of it is useful
Things I have tried so far:
-The naive algorithm, iterate over all points and calculate distance
-Divide the space into a grid with cubes of length R, and put the points into these. That way, for each point, I only have to ever query the immediate neighboring buckets. This has a significant speedup
-I've tried using the manhattan distance as a heuristic. That is, within the buckets, before calculating a distance to any point, use the manhattan distance to filter out those that can't possibly be within radius R (that is, those with a manhattan distance of <= sqrt(3)*R). I thought this would offer a speedup, as it only needs addition instead of multiplication, but it actually slowed the program down by a little bit
EDIT: To compare the distances, I use the squared distance to eliminate having to use a sqrt function.
Obviously, there will be some limit on how much I can speed this up, but I could use any suggestions on things to try now.
Not that it probably matters on the algorithmic level, but I'm working in C.
You may get a speed benefit from storing your points in a k-d tree with three dimensions. That will give you searchs in O(log n) amortized time.
Don't compare on the radius, compare on the square of the radius. The reason being is, if the distance between two points is less than R, then the square of the distance is less than R^2.
This way, when you're using the distance formula, you don't need to compute the square root, which is a very expensive operation.
I would recommend using either K-D tree or z-curve:
http://en.wikipedia.org/wiki/Z-order_%28curve%29
How about Binary Indexed Tree ? (Topcoder tutorials referred) It can be extended to n Dimensions,and is simpler to code.
Nicolas Brodu's NEIGHAND library do exactly what you want, improving on the bin-lattice algorithm.
More details can be found in his article: Query Sphere Indexing for Neighborhood Requests
[I might be misunderstanding the question. I'm finding the problem statement difficult to parse.]
In the old days, it was often good to design a this type of algorithm with "early outs" that do tests to try to avoid a more expensive calculation. In modern processors, a failure of a branch-prediction is often very expensive, and those early-out tests can actually be more expensive that the full calculation. (The only way to know for sure is to measure.)
In this case, the calculation is pretty simple, so it may be best to avoid building a data structure or doing any clever early-out checks and instead try to optimize, vectorize, and parallelize to get the throughput you need.
For a point P(x, y, z) and a sphere S(x_s, y_s, z_s, radius), the membership test is:
(x - x_s)^2 + (x - y_s)^2 + (z - z_s)^2 < radius^2
where radius^2 can be pre-calculated once for all the points in the query (avoiding any square root calculations). These calculations are all independent, you can compute it for several points in parallel. With something like SSE, you could probably do four at a time. And if you have many points to test, you could split the list and further parallelize the work across multiple cores.

Fast way to compute the minimal distance of two sets of k-dimensional vectors

I two sets of k-dimensional vectors, where k is around 500 and the number of vectors is usually smaller. I want to compute the (arbitrarily defined) minimal distance between the two sets.
A naive approach would be this:
(loop for a in set1
for b in set2
minimizing (distance a b))
However, this requires O(n² * distance) computations. Is there a faster way of doing this?
I don't think you can do better than O(n^2) when the distance is arbitrary (you have to examine each of the possible distances!). For a given distance function we might be able to exploit the properties of the function, but there won't be any general algorithm which works with any distance function in better than O(n^2) (i.e. o(n^2) : note smallOh).
If your data is dynamic and you have to keep obtaining the closest pair of points at different times, for arbitrary distance function the following papers by Eppstein will probably help (which have special update operations in order to make finding the closest pair of points quick):
http://www.ics.uci.edu/~eppstein/projects/pairs/Papers/Epp-SODA-98.pdf. [O(nlog^2(n)) update time]
http://academic.research.microsoft.com/Paper/1847461.aspx
You will be able to adapt the above one set algorithms to a two set algorithm (for instance, by defining distance between points of same set to be infinity).
For Euclidean type (L^p) distance, there are known O(nlogn) time algorithms, which work with a given set of points (i.e. you dont need to have any special update algorithms):
http://www.cse.iitd.ernet.in/~ssen/cs852/scribe/scribe2/lec.pdf
http://en.wikipedia.org/wiki/Closest_pair_of_points_problem
Of course, the L^p is for one set, but you might be able to adapt it for two sets.
If you give your distance function, it might be easier for us to help you.
Hope it helps. Good luck!
If the components of your vectors are scalars I would guess that for your case of a moderate k=500 the O(n²) approach is probably as fast as you can get. You can simplify your calculation by minimizing distance². Also, the distance(A_i, B_i) = distance(B_i, A_i), so make sure you only compare them once (you only have 500!/(500-2)! pairs, not 500²).
If the components are m-dimensional vectors A and B instead, you could store the components of vector A in a R-tree or a kd-tree and then find the closest pair by iterating over all components of vector B and finding its closest partner from A--- this would be O(n). Don't forget that big-O is for n->infinity, so the trees might come with some pretty expensive constant term (i.e. this approach might only make sense for large k or if vector A is always the same).
Put the two sets of coordinates into a Spatial Index, e.g. a KD-tree.
You then compute the intersection of these two indices.

Resources