Spatial sorting of a million points in 3D space - algorithm

I have a collection of a million points in 3D space.
Each point is an object:
struct Point
{
double x;
double y;
double z;
};
The million points are stored in a C++ vector MyPoints in some random order.
I want to sort these million points according to the spatial distribution of the points, so that points which are physically close to each other are also close to each other in the array after sorting.
My first guess at how to do this is as follows: first sort the points along the Z-axis, then sort along the Y-axis, and then sort along the X-axis:
MyPointsSortedAlongZ = Sort(MyPoints, AlongZAxis )
MyPointsSortedAlongY = Sort(MyPointsSortedAlongZ , AlongYAxis )
MyPointsSortedAlongX = Sort(MyPointsSortedAlongY , AlongXAxis )
Now firstly, I don't know if this method is correct. Will my final array of points MyPointsSortedAlongX be sorted perfectly spatially (or nearly sorted spatially)?
Secondly, if this method is correct, is it the fastest way to do this? What would be a better method?

The CGAL library provides an implementation of a space filling curve algorithm that can be useful for that task.
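CGAL's spatial sorting package provides this out of the box (functions such as CGAL::hilbert_sort / CGAL::spatial_sort). To illustrate the space-filling-curve idea without the dependency, here is a rough sketch of a Morton-order (Z-order curve) sort in C++; the 21-bit quantization, the bounding-box handling and the function names are my own choices, and in practice you would precompute the keys once rather than recomputing them inside the comparator:
#include <algorithm>
#include <cstdint>
#include <vector>

struct Point { double x; double y; double z; };

// Spread the low 21 bits of v so that two zero bits separate each original
// bit (the standard 3D Morton-code bit-interleaving trick).
static std::uint64_t spreadBits(std::uint64_t v) {
    v &= 0x1fffffULL;
    v = (v | v << 32) & 0x1f00000000ffffULL;
    v = (v | v << 16) & 0x1f0000ff0000ffULL;
    v = (v | v << 8)  & 0x100f00f00f00f00fULL;
    v = (v | v << 4)  & 0x10c30c30c30c30c3ULL;
    v = (v | v << 2)  & 0x1249249249249249ULL;
    return v;
}

// Quantize each coordinate to 21 bits inside the bounding box and interleave
// the bits; sorting by this key orders the points along a Z-order curve, so
// points that are close in space tend to end up close in the vector.
void mortonSort(std::vector<Point>& pts) {
    if (pts.empty()) return;
    double lo[3] = { pts[0].x, pts[0].y, pts[0].z };
    double hi[3] = { pts[0].x, pts[0].y, pts[0].z };
    for (const Point& p : pts) {
        const double c[3] = { p.x, p.y, p.z };
        for (int i = 0; i < 3; ++i) {
            if (c[i] < lo[i]) lo[i] = c[i];
            if (c[i] > hi[i]) hi[i] = c[i];
        }
    }
    auto key = [&](const Point& p) {
        const double c[3] = { p.x, p.y, p.z };
        std::uint64_t q[3];
        for (int i = 0; i < 3; ++i) {
            double t = (hi[i] > lo[i]) ? (c[i] - lo[i]) / (hi[i] - lo[i]) : 0.0;
            q[i] = static_cast<std::uint64_t>(t * ((1 << 21) - 1));
        }
        return spreadBits(q[0]) | (spreadBits(q[1]) << 1) | (spreadBits(q[2]) << 2);
    };
    std::sort(pts.begin(), pts.end(),
              [&](const Point& a, const Point& b) { return key(a) < key(b); });
}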

Well, it really depends on what metric you use to compare two orderings, but consider, for example, the metric that is the sum of distances between adjacent points:
metric(arr) = sum[ d(arr[i], arr[i-1]) | i from 1 to n-1 ]
where d(x,y) is the distance between point x and point y (indices run from 0 to n-1).
Note that an optimal (smallest) solution for this metric is basically an optimal (shortest) path that goes through all points. This is the Traveling Salesman Problem (TSP), which is NP-hard, so there is no known polynomial-time solution to it.
I'd suggest first defining exactly what metric you will use to compare two orderings.
Then use heuristics or approximations for that metric, such as genetic algorithms or hill climbing, or reduce the problem to TSP and use a known heuristic/approximation for it.
Regarding your method:
It is easy to see that it is not optimal, using this simple 2D example:
[(1,100),(1,-100),(2,0)]
Let's assume main sort by x, secondary sort by y.
It will give us the 'sorted' vector:
[(1,-100),(1,100),(2,0)]
according to the above metric, we get metric(arr) ~= 300
However, the order [(1,-100),(2,0),(1,100)] will get us metric(arr) ~= 200
So, the suggested heuristic is not optimal (as expected).
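To make the comparison concrete, here is a minimal sketch of that metric in C++, reusing the Point struct from the question (use z = 0 to check the ~300 vs ~200 figures for the 2D example above):
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x; double y; double z; };

// Sum of Euclidean distances between consecutive elements of the array:
// the smaller the value, the better the ordering keeps neighbours together.
double orderingMetric(const std::vector<Point>& arr) {
    double total = 0.0;
    for (std::size_t i = 1; i < arr.size(); ++i) {
        const double dx = arr[i].x - arr[i - 1].x;
        const double dy = arr[i].y - arr[i - 1].y;
        const double dz = arr[i].z - arr[i - 1].z;
        total += std::sqrt(dx * dx + dy * dy + dz * dz);
    }
    return total;
}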

Maybe this helps:
A Template for the Nearest Neighbor Problem (DDJ 2001)

Sorting three times on the three axes is a waste. The third sort will completely undo what the other two sorts have done.


Approximated closest pair algorithm

I have been thinking about a variation of the closest pair problem in which the only available information is the set of distances already calculated (we are not allowed to sort points according to their x-coordinates).
Consider 4 points (A, B, C, D), and the following distances:
dist(A,B) = 0.5
dist(A,C) = 5
dist(C,D) = 2
In this example, I don't need to evaluate dist(B,C) or dist(A,D), because the triangle inequality guarantees that these distances are greater than the current known minimum distance.
Is it possible to use this kind of information to reduce the O(n²) to something like O(nlogn)?
Is it possible to reduce the cost to something close to O(nlogn) if I accept a kind of approximate solution? In this case, I am thinking about some technique based on reinforcement learning that only converges to the real solution when the number of reinforcements goes to infinity, but provides a great approximation for small n.
Processing time (measured in big-O notation) is not the only issue. Keeping a very large number of previously calculated distances can also be an issue.
Imagine this problem for a set with 10⁸ points.
What kind of solution should I look for? Was this kind of problem solved before?
This is not a classroom problem or something related. I have been just thinking about this problem.
I suggest using ideas that are derived from quickly solving k-nearest-neighbor searches.
The M-Tree data structure (see http://en.wikipedia.org/wiki/M-tree and http://www.vldb.org/conf/1997/P426.PDF ) is designed to reduce the number of distance comparisons that need to be performed to find "nearest neighbors".
Personally, I could not find an implementation of an M-Tree online that I was satisfied with (see my closed thread Looking for a mature M-Tree implementation) so I rolled my own.
My implementation is here: https://github.com/jon1van/MTreeMapRepo
Basically, this is a binary tree in which each leaf node contains a HashMap of Keys that are "close" in some metric space you define.
I suggest using my code (or the idea behind it) to implement a solution in which you:
Search each leaf node's HashMap and find the closest pair of Keys within that small subset.
Return the closest pair of Keys when considering only the "winner" of each HashMap.
This style of solution would be a "divide and conquer" approach that returns an approximate solution.
You should know this code has an adjustable parameter that governs the maximum number of Keys that can be placed in an individual HashMap. Reducing this parameter will increase the speed of your search, but it will increase the probability that the correct solution won't be found because one Key is in HashMap A while the second Key is in HashMap B.
Also, each HashMap is associated with a "radius". Depending on how accurate you want your result, you may be able to just search the HashMap with the largest hashMap.size()/radius (because that HashMap contains the highest density of points and is thus a good search candidate).
Good Luck
If you only have sample distances, not original point locations in a plane you can operate on, then I suspect you are bounded at O(E).
Specifically, it would seem from your description that any valid solution would need to inspect every edge in order to rule it out as having something interesting to say; meanwhile, inspecting every edge and taking the smallest solves the problem.
Planar versions get below O(V^2) by using planar distances to deduce constraints on sets of edges, allowing us to avoid looking at most of the edge weights.
Use the same idea as in space partitioning. Recursively split the given set of points by choosing two points and dividing the set into two parts: points that are closer to the first point and points that are closer to the second point. That is the same as splitting the points by a line passing between the two chosen points.
That produces a (binary) space partitioning, on which standard nearest-neighbour search algorithms can be used.
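A rough sketch of that recursive split, assuming all we have is a pairwise distance callback over point indices (the index-array layout, the pivot choice and the leafSize cut-off are my own assumptions). The resulting leaves can then be scanned for an approximate closest pair, or used as the cells of a nearest-neighbour search:
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Recursively reorder idx[begin, end) so that, at each level, indices whose
// points are closer to pivot a come before those closer to pivot b.  Only a
// pairwise distance function is needed, so no coordinate sorting is required.
void bspSplit(std::vector<int>& idx,
              const std::function<double(int, int)>& dist,
              std::size_t begin, std::size_t end, std::size_t leafSize = 32) {
    if (end - begin <= leafSize) return;
    const int a = idx[begin];
    const int b = idx[end - 1];          // arbitrary pivots; random picks are more robust
    std::size_t mid = begin;
    for (std::size_t i = begin; i < end; ++i)
        if (dist(idx[i], a) < dist(idx[i], b))
            std::swap(idx[i], idx[mid++]);
    if (mid == begin || mid == end)      // guard against degenerate splits
        mid = begin + (end - begin) / 2;
    bspSplit(idx, dist, begin, mid, leafSize);
    bspSplit(idx, dist, mid, end, leafSize);
}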

Find all points in sphere of radius r around arbitrary coordinate

I'm looking for an efficient algorithm that, given a space with known height, width and length, a fixed radius R, and a list of points N with 3-dimensional coordinates in that space, will find all the points within radius R of an arbitrary point in that space. This query will be done many times with different query points, so an expensive pre-processing/sorting step in exchange for quick queries may be worth it. This is a bit of a bottleneck step in an application I'm working on, so any time I can shave off is useful.
Things I have tried so far:
-The naive algorithm: iterate over all points and calculate the distance to each.
-Divide the space into a grid of cubes with side length R, and put the points into these buckets. That way, for each query point, I only ever have to check the immediately neighboring buckets. This gives a significant speedup (a rough sketch of this bucketing appears just below this question).
-I've tried using the Manhattan distance as a heuristic: within the buckets, before calculating the exact distance to any point, use the Manhattan distance to filter out those that can't possibly be within radius R (that is, those with a Manhattan distance greater than sqrt(3)*R). I thought this would offer a speedup, as it only needs addition instead of multiplication, but it actually slowed the program down a little.
EDIT: To compare the distances, I use the squared distance to eliminate having to use a sqrt function.
Obviously, there will be some limit on how much I can speed this up, but I could use any suggestions on things to try now.
Not that it probably matters on the algorithmic level, but I'm working in C.
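For reference, here is a minimal sketch of the cube-bucket grid described in the second bullet above (the hash-key packing, the assumed cell-index bound, and the use of C++ rather than C are my own choices):
#include <cmath>
#include <unordered_map>
#include <vector>

struct Point { double x; double y; double z; };

// Uniform grid with cube side R: a query sphere of radius R can only
// intersect the query point's own cell and its 26 neighbors.
struct Grid {
    double R;
    std::unordered_map<long long, std::vector<Point>> cells;

    explicit Grid(double r) : R(r) {}

    // Pack three signed cell indices into one 64-bit key; assumes the
    // absolute index stays below 2^20, plenty for the space sizes described.
    long long key(int ix, int iy, int iz) const {
        return ((long long)(ix + (1 << 20)) << 42) |
               ((long long)(iy + (1 << 20)) << 21) |
                (long long)(iz + (1 << 20));
    }

    void insert(const Point& p) {
        cells[key((int)std::floor(p.x / R), (int)std::floor(p.y / R),
                  (int)std::floor(p.z / R))].push_back(p);
    }

    std::vector<Point> query(const Point& q) const {
        std::vector<Point> out;
        const int cx = (int)std::floor(q.x / R);
        const int cy = (int)std::floor(q.y / R);
        const int cz = (int)std::floor(q.z / R);
        const double r2 = R * R;                 // compare squared distances
        for (int dx = -1; dx <= 1; ++dx)
            for (int dy = -1; dy <= 1; ++dy)
                for (int dz = -1; dz <= 1; ++dz) {
                    auto it = cells.find(key(cx + dx, cy + dy, cz + dz));
                    if (it == cells.end()) continue;
                    for (const Point& p : it->second) {
                        const double ex = p.x - q.x, ey = p.y - q.y, ez = p.z - q.z;
                        if (ex * ex + ey * ey + ez * ez <= r2) out.push_back(p);
                    }
                }
        return out;
    }
};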
You may get a speed benefit from storing your points in a k-d tree with three dimensions. That will give you searches in O(log n) amortized time.
Don't compare against the radius; compare against the square of the radius. The reason is that if the distance between two points is less than R, then the square of the distance is less than R^2.
This way, when you're using the distance formula, you don't need to compute the square root, which is a very expensive operation.
I would recommend using either K-D tree or z-curve:
http://en.wikipedia.org/wiki/Z-order_%28curve%29
How about a Binary Indexed Tree? (See the TopCoder tutorials.) It can be extended to n dimensions, and is simpler to code.
Nicolas Brodu's NEIGHAND library does exactly what you want, improving on the bin-lattice algorithm.
More details can be found in his article: Query Sphere Indexing for Neighborhood Requests
[I might be misunderstanding the question. I'm finding the problem statement difficult to parse.]
In the old days, it was often good to design this type of algorithm with "early outs": tests that try to avoid a more expensive calculation. On modern processors, a failed branch prediction is often very expensive, and those early-out tests can actually be more expensive than the full calculation. (The only way to know for sure is to measure.)
In this case, the calculation is pretty simple, so it may be best to avoid building a data structure or doing any clever early-out checks and instead try to optimize, vectorize, and parallelize to get the throughput you need.
For a point P(x, y, z) and a sphere S(x_s, y_s, z_s, radius), the membership test is:
(x - x_s)^2 + (y - y_s)^2 + (z - z_s)^2 < radius^2
where radius^2 can be pre-calculated once for all the points in the query (avoiding any square root calculations). These calculations are all independent, so you can compute them for several points in parallel. With something like SSE, you could probably do four at a time. And if you have many points to test, you could split the list and further parallelize the work across multiple cores.
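A plain version of that loop for reference (shown as C++, and the function and parameter names are mine). There are no square roots and no data-dependent branches beyond the final test, so compilers can usually auto-vectorize it, and the index range can be split across threads:
#include <cstddef>
#include <vector>

struct Point { double x; double y; double z; };

// Return the indices of all points strictly inside the sphere centred at
// (cx, cy, cz) with the given radius; the radius is squared once per query.
std::vector<std::size_t> pointsInSphere(const std::vector<Point>& pts,
                                        double cx, double cy, double cz,
                                        double radius) {
    std::vector<std::size_t> hits;
    const double r2 = radius * radius;
    for (std::size_t i = 0; i < pts.size(); ++i) {
        const double dx = pts[i].x - cx;
        const double dy = pts[i].y - cy;
        const double dz = pts[i].z - cz;
        if (dx * dx + dy * dy + dz * dz < r2) hits.push_back(i);
    }
    return hits;
}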

Algorithm for nearest point

I've got a list of ~5000 points (specified as longitude/latitude pairs), and I want to find the nearest 5 of these to another point, specified by the user.
Can anyone suggest an efficient algorithm for working this out? I'm implementing this in Ruby, so if there's a suitable library then that would be good to know, but I'm still interested in the algorithm!
UPDATE: A couple of people have asked for more specific details on the problem. So here goes:
The 5000 points are mostly within the same city. There might be a few outside it, but it's safe to assume that 99% of them lie within a 75km radius, and that all of them lie within a 200km radius.
The list of points changes rarely. For the sake of argument, let's say it gets updated once per day, and we have to deal with a few thousand requests in that time.
You could accelerate the search by partitioning the 2D space with a quad-tree or a k-d tree and then, once you've reached a leaf node, comparing the remaining distances one by one until you find the closest match.
See also this blog post which refers to this other blog post which both discuss nearest neighbors searches with kd-trees in Ruby.
You can get a very fast upper-bound estimator on distance using Manhattan distance (scaled for latitude), this should be good enough for rejecting 99.9% of candidates if they're not close (EDIT: since then you tell us they are close. In that case, your metric should be distance-squared, as per Lars H comment).
Consider this equivalent to rejecting anything outside a spherical-rectangle bounding-box (as an approximation to a circle bounding-box).
I don't do Ruby, so here is the algorithm in pseudocode:
Let the latitude and longitude of your reference point P be (pa, po), and those of the other point X be (xa, xo).
Precompute ka, the latitude scaling factor for longitudinal distances: ka = cos(pa in degrees). (Strictly, a constant ka is a linearized approximation valid in the vicinity of P.)
Then the distance estimator is: D(X,P) = ka*|xa-pa| + |xo-po| = ka*da + do
where |z| means abs(z). At worst this overestimates true distance by a factor of √2 (when da==do), hence we allow for that as follows:
Do a running search and keep Dmin, the fifth-smallest scaled-Manhattan-distance-estimate.
Hence you can reject upfront all points for which D(X,P) > √2 * Dmin, since their true distance √((ka*da)² + do²) is at least D/√2 > Dmin. That should eliminate the vast majority of points.
Keep a list of all remaining candidate points with D(X,P) <= √2 * Dmin. Update Dmin whenever you find a new fifth-smallest D. A priority queue, or simply a list of (coord, D) pairs, are good data structures.
Note that we never computed Euclidean distance, we only used float multiplication and addition.
(Consider this similar to quadtree except filtering out everything except the region that interests us, hence no need to compute accurate distances upfront or build the data structure.)
It would help if you told us the expected spread in latitudes and longitudes (degrees, minutes, or what?). If all the points are close together, the √2 factor in this estimator will be too conservative and mark every point as a candidate; a lookup-table-based distance estimator would be preferable.
Pseudocode:
initialize Dmin with the fifth-smallest (i.e. the largest) D among the first five points in the list
for point X in list:
    D = ka*|xa-pa| + |xo-po|
    if D <= √2 * Dmin:
        insert the pair (X, D) into the priority-queue of candidates
        if D is now among the five smallest seen: update Dmin to the new fifth-smallest value
# after the first pass, reject candidates with D > √2 * Dmin (use the final value of Dmin)
# then make a second pass over the remaining candidates to find the lowest 5 exact distances
Since your list is quite short, I'd highly recommend brute force. Just compare all 5000 to the user-specified point. It'll be O(n) and you'll get paid.
Other than that, a quad-tree or k-d tree are the usual approaches to spatial subdivision. But in your case, you'll end up doing a linear number of insertions into the tree, and then a constant number of logarithmic lookups... a bit of a waste, when you're probably better off just doing a linear number of distance comparisons and being done with it.
Now, if you want to find the N nearest points, you're looking at sorting on the computed distances and taking the first N, but that's still O(n log n)-ish.
EDIT: It's worth noting that building the spatial tree becomes worthwhile if you're going to reuse the list of points for multiple queries.
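For the brute-force route, a sketch in C++ for concreteness (the equirectangular approximation, which scales longitude by cos(latitude), and the helper names are my own; for 5000 points per query this is on the order of microseconds in a compiled language):
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct LatLon { double lat; double lon; };

// Brute force: rank every point by a cheap squared planar distance and keep
// the 5 smallest.  Good enough at city scale; swap in haversine if needed.
std::vector<LatLon> nearest5(const std::vector<LatLon>& pts, LatLon q) {
    const double deg = 3.14159265358979323846 / 180.0;
    const double ka = std::cos(q.lat * deg);            // longitude scale factor
    auto d2 = [&](const LatLon& p) {
        const double dlat = p.lat - q.lat;
        const double dlon = (p.lon - q.lon) * ka;
        return dlat * dlat + dlon * dlon;               // no sqrt needed for ranking
    };
    std::vector<LatLon> result(pts);
    const std::size_t k = std::min<std::size_t>(5, result.size());
    std::partial_sort(result.begin(), result.begin() + k, result.end(),
                      [&](const LatLon& a, const LatLon& b) { return d2(a) < d2(b); });
    result.resize(k);
    return result;
}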
Rather than pure brute-force, for 5000 nodes, I would calculate the individual x+y distances for every node, rather than the straight line distance.
Once you've sorted that list, if e.g. x+y for the 5th node is 38, you can rule out any node where either x or y distance is > 38. This way, you can rule out a lot of nodes without having to calculate the straight line distance. Then brute force calculate the straight line distance for the remaining nodes.
These algorithms are not easily explained, so I will only give you some hints in the right direction. You should look into Voronoi diagrams. With a Voronoi diagram you can precompute a graph in O(n^2 log n) time and then search for the closest point in O(log n) time.
Precomputation is done with a cron job at night and searching is live. This corresponds to your specification.
Now you could save the k closest pairs for each of your 5000 points and then, starting from the nearest point found via the Voronoi diagram, search for the remaining 4 points.
But be warned that these algorithms are not very easy to implement.
A good reference is:
de Berg et al.: Computational Geometry: Algorithms and Applications (2008), chapters 7.1 and 7.2
Since you have that few points, I would recommend doing a brute-force search: try all points against each other, which is an O(n^2) operation (with n = 5000, roughly 25/2 million iterations of a suitable algorithm), and just store the relevant results. This would have sub-100 ms execution time in C, so we are looking at a second or two at most in Ruby.
When the user picks a point, you can use your stored data to give the results in constant time.
EDIT: I re-read your question, and it seems the user provides their own query point each time. In that case it's faster to just do an O(n) linear search through your set each time the user provides a point.
If you need to repeat this multiple times, with different user-entered locations, but don't want to implement a quad-tree (or can't find a library implementation), then you can use a locality-sensitive-hashing (kind of) approach that's fairly intuitive:
take your (x,y) pairs and create two lists, one of (x, i) and one of (y, i) where i is the index of the point
sort both lists
then, when given a point (X, Y),
binary search for X and Y in the respective sorted lists
expand outwards on both lists, looking for common indices
for common indices, calculate exact distances
stop expanding when the differences in X and Y both exceed the exact distance to the most distant of the current 5 points.
all you're doing is saying that a nearby point must have a similar x and a similar y value...
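Here is a sketch of that procedure in C++ (the names, the small max-heap of 5 candidates and the exact stopping test are my own reading of the steps above; treat it as an illustration rather than tuned code):
#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <utility>
#include <vector>

struct Pt { double x; double y; };

// Two index lists sorted by x and by y.  Expand outwards from the query's
// position in both lists; a point is scored only once its index has appeared
// in both expansions.  Stop when both frontiers are already farther away in
// one coordinate than the current 5th-best distance.
std::vector<int> nearest5Sorted(const std::vector<Pt>& pts, Pt q) {
    const int n = static_cast<int>(pts.size());
    const double INF = 1e300;
    std::vector<int> byX(n), byY(n);
    for (int i = 0; i < n; ++i) byX[i] = byY[i] = i;
    std::sort(byX.begin(), byX.end(), [&](int a, int b) { return pts[a].x < pts[b].x; });
    std::sort(byY.begin(), byY.end(), [&](int a, int b) { return pts[a].y < pts[b].y; });

    // Positions where q would be inserted in each list.
    int xr = static_cast<int>(std::lower_bound(byX.begin(), byX.end(), q.x,
                 [&](int i, double v) { return pts[i].x < v; }) - byX.begin());
    int yr = static_cast<int>(std::lower_bound(byY.begin(), byY.end(), q.y,
                 [&](int i, double v) { return pts[i].y < v; }) - byY.begin());
    int xl = xr - 1, yl = yr - 1;

    std::unordered_set<int> seenX, seenY;
    std::vector<std::pair<double, int>> best;            // max-heap of (dist^2, index)
    auto score = [&](int i) {
        const double dx = pts[i].x - q.x, dy = pts[i].y - q.y;
        best.emplace_back(dx * dx + dy * dy, i);
        std::push_heap(best.begin(), best.end());
        if (best.size() > 5) { std::pop_heap(best.begin(), best.end()); best.pop_back(); }
    };
    auto visitX = [&](int i) { seenX.insert(i); if (seenY.count(i)) score(i); };
    auto visitY = [&](int i) { seenY.insert(i); if (seenX.count(i)) score(i); };

    while (true) {
        // One-coordinate gaps to the next unvisited entry of each list.
        const double fx = std::min(xl >= 0 ? q.x - pts[byX[xl]].x : INF,
                                   xr < n ? pts[byX[xr]].x - q.x : INF);
        const double fy = std::min(yl >= 0 ? q.y - pts[byY[yl]].y : INF,
                                   yr < n ? pts[byY[yr]].y - q.y : INF);
        const double f = std::min(fx, fy);
        const double bound = (best.size() == 5) ? best.front().first : INF;
        if (f >= INF) break;                             // both lists exhausted
        if (f * f >= bound) break;                       // nothing closer can remain
        if (fx <= fy) {                                  // advance the x frontier
            if (xl >= 0 && (xr >= n || q.x - pts[byX[xl]].x <= pts[byX[xr]].x - q.x))
                visitX(byX[xl--]);
            else
                visitX(byX[xr++]);
        } else {                                         // advance the y frontier
            if (yl >= 0 && (yr >= n || q.y - pts[byY[yl]].y <= pts[byY[yr]].y - q.y))
                visitY(byY[yl--]);
            else
                visitY(byY[yr++]);
        }
    }
    std::sort(best.begin(), best.end());
    std::vector<int> out;
    for (const auto& c : best) out.push_back(c.second);
    return out;
}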

How to find nearest vector in {0,1,2}^12, over and over again

I'm searching a space of vectors of length 12, with entries 0, 1, 2. For example, one such vector is
001122001122. I have about a thousand good vectors, and about a thousand bad vectors. For each bad vector I need to locate the closest good vector. Distance between two vectors is just the number of coordinates which don't match. The good vectors aren't particularly nicely arranged, and the reason they're "good" doesn't seem to be helpful here. My main priority is that the algorithm be fast.
If I do a simple exhaustive search, I have to calculate about 1000*1000 distances. That seems pretty thick-headed.
If I apply Dijkstra's algorithm first using the good vectors, I can calculate the closest vector and minimal distance for every vector in the space, so that each bad vector requires a simple lookup. But the space has 3^12 = 531,441 vectors in it, so the precomputation is half a million distance computations. Not much savings.
Can you help me think of a better way?
Edit: Since people asked earnestly what makes them "good": Each vector represents a description of a hexagonal picture of six equilateral triangles, which is the 2D image of a 3D arrangement of cubes (think generalized Q-bert). The equilateral triangles are halves of faces of cubes (45-45-90), tilted into perspective. Six of the coordinates describe the nature of the triangle (perceived floor, left wall, right wall), and six coordinates describe the nature of the edges (perceived continuity, two kinds of perceived discontinuity). The 1000 good vectors are those that represent hexagons that can be witnessed when seeing cubes-in-perspective. The reason for the search is to apply local corrections to a hex map full of triangles...
Just to keep things in perspective, and to be sure you are not optimizing unnecessarily: the brute-force approach without any optimization takes 12 seconds on my machine.
Code in Mathematica:
bad = Table[RandomInteger[5, 12], {1000}];
good = Table[RandomInteger[2, 12], {1000}];
distance[a_, b_] := Total[Sign@Abs[a - b]];
bestMatch = #[[2]] & /@
  Position[
    Table[Ordering@
      Table[distance[good[[j]], bad[[i]]], {j, Length@good}], {i,
      Length@bad}], 1] // Timing
As you may expect, the time follows an O(n^2) law.
This sounds a lot like what spellcheckers have to do. The trick is generally to abuse tries.
The most basic thing you can do is build a trie over the good vectors, then do a flood-fill prioritizing branches with few mismatches. This will be very fast when there is a nearby vector, and degenerate to brute force when the closest vector is very far away. Not bad.
But I think you can do better. Bad vectors which share the same prefix will do the same initial branching work, so we can try to share that as well. So we also build a trie over the bad vectors and sortof do them all at once.
No guarantees this is correct, since both the algorithm and code are off the top of my head:
var goodTrie = new Trie(goodVectors)
var badTrie = new Trie(badVectors)
var result = new Map<Vector, Vector>()
var pq = new PriorityQueue(x => x.error)   // min-heap ordered by error
pq.add(new {good: goodTrie, bad: badTrie, error: 0})
while pq.Count > 0
    var g,b,e = pq.Dequeue()
    if b.Count == 0:
        //all leaves of this path have been removed
        continue
    if b.IsLeaf:
        //we have found a mapping with minimum error for this bad item
        result[b.Item] = g.Item
        badTrie.remove(b) //prevent redundant results
    else:
        //We are zipping down the tries. Branch to all possibilities.
        pq.EnqueueAll(from i in {0,1,2}
                      from j in {0,1,2}
                      select new {good: g[i], bad: b[j], error: e + (i==j ? 0 : 1)})
return result
A final optimization might be to re-order the vectors so positions with high agreement among the bad vectors come first and share more work.
3^12 isn't a very large search space. If speed is essential and generality of the algorithm is not, you could just map each vector to an int in the range 0..531440 and use it as an index into a precomputed table of "nearest good vectors".
If you gave each entry in that table a 32-bit word (which is more than enough), you'd be looking at about 2 MB for the table, in exchange for pretty much instantaneous "calculation".
edit: this is not much different from the precomputation the question suggests, but my point is just that depending on the application, there's not necessarily any problem with doing it that way, especially if you do all the precalculations before the application even runs.
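A sketch of that index mapping for reference (the names are mine; the table itself is filled once, offline, by a brute-force pass over all 3^12 cells):
#include <array>
#include <cstdint>
#include <vector>

// Read a vector in {0,1,2}^12 as a base-3 number to get an index in [0, 3^12).
int encode(const std::array<int, 12>& v) {
    int idx = 0;
    for (int i = 0; i < 12; ++i) idx = idx * 3 + v[i];
    return idx;                                  // 0 .. 531440
}

// nearestGood[encode(bad)] gives the index of the precomputed closest good
// vector; 531441 entries of 16 bits is about 1 MB (32 bits, about 2 MB).
std::vector<std::uint16_t> nearestGood(531441);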
My computational geometry is VERY rough, but it seems that you should be able to:
Calculate the Voronoi diagram for your set of good vectors.
Calculate the BSP tree for the cells of the diagram.
The Voronoi diagram will give you a 12-dimensional convex cell for each good vector, containing all the points closest to that vector.
The BSP tree will give you a fast way to determine which cell a vector lies within and, therefore, which good vector it is closest to.
EDIT: I just noticed that you are using Hamming distance instead of Euclidean distance. I'm not sure how this could be adapted to fit that constraint. Sorry.
Assuming a packed representation for the vectors, one distance computation (comparing one good vector and one bad vector to yield the distance) can be completed in roughly 20 clock cycles or less. Hence a million such distance calculations can be done in 20 million cycles or (assuming a 2GHz cpu) 0.01 sec. Do these numbers help?
PS:- 20 cycles is a conservative overestimate.
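For reference, one way to do such a packed comparison: two bits per coordinate and a popcount over the collapsed mismatch mask (C++20's std::popcount is assumed; __builtin_popcount would do on GCC/Clang):
#include <bit>        // std::popcount (C++20)
#include <cstdint>

// Pack a {0,1,2}^12 vector into 24 bits, two bits per coordinate.
std::uint32_t pack(const int v[12]) {
    std::uint32_t p = 0;
    for (int i = 0; i < 12; ++i) p |= static_cast<std::uint32_t>(v[i]) << (2 * i);
    return p;
}

// Number of coordinates where a and b differ: XOR, fold each 2-bit field
// onto its low bit, mask, then count the set bits.
int mismatchCount(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t diff = a ^ b;
    return std::popcount((diff | (diff >> 1)) & 0x555555u);
}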

Fast way to compute the minimal distance of two sets of k-dimensional vectors

I have two sets of k-dimensional vectors, where k is around 500 and the number of vectors is usually smaller. I want to compute the (arbitrarily defined) minimal distance between the two sets.
A naive approach would be this:
(loop for a in set1
      minimizing (loop for b in set2
                       minimizing (distance a b)))
However, this requires O(n² * distance) computations. Is there a faster way of doing this?
I don't think you can do better than O(n^2) when the distance is arbitrary (you have to examine each of the possible distances!). For a given distance function we might be able to exploit the properties of the function, but there won't be any general algorithm which works with any distance function in better than O(n^2) (i.e. o(n^2); note the little-o).
If your data is dynamic and you have to keep obtaining the closest pair of points at different times, for arbitrary distance function the following papers by Eppstein will probably help (which have special update operations in order to make finding the closest pair of points quick):
http://www.ics.uci.edu/~eppstein/projects/pairs/Papers/Epp-SODA-98.pdf. [O(nlog^2(n)) update time]
http://academic.research.microsoft.com/Paper/1847461.aspx
You will be able to adapt the above one set algorithms to a two set algorithm (for instance, by defining distance between points of same set to be infinity).
For Euclidean type (L^p) distance, there are known O(nlogn) time algorithms, which work with a given set of points (i.e. you dont need to have any special update algorithms):
http://www.cse.iitd.ernet.in/~ssen/cs852/scribe/scribe2/lec.pdf
http://en.wikipedia.org/wiki/Closest_pair_of_points_problem
Of course, the L^p is for one set, but you might be able to adapt it for two sets.
If you give your distance function, it might be easier for us to help you.
Hope it helps. Good luck!
If the components of your vectors are scalars, I would guess that for your case of a moderate k=500 the O(n²) approach is probably as fast as you can get. You can simplify your calculation by minimizing distance². Also, distance(A_i, B_j) = distance(B_j, A_i), so make sure you only compare each pair once (you only have 500·499/2 unordered pairs, not 500²).
If the components are m-dimensional vectors A and B instead, you could store the components of vector A in an R-tree or a k-d tree and then find the closest pair by iterating over all components of vector B and finding each one's closest partner in A; that is only a linear number of tree queries. Don't forget that big-O is for n -> infinity, so the trees might come with a pretty expensive constant term (i.e. this approach might only make sense for large k, or if vector A is always the same).
Put the two sets of coordinates into a Spatial Index, e.g. a KD-tree.
You then compute the intersection of these two indices.
