Faster way to compare two sets of points in N-dimensional space? - algorithm

List1 contains a high number (~7^10) of N-dimensional points (N <=10), List2 contains the same or fewer number of N-dimensional points (N <=10).
My task is this: I want to check which point in List2 is closest (euclidean distance) to a point in List1 for every point in List1 and subsequently perform some operation on it. I have been doing it the simple- the nested loop way when I didn't have more than 50 points in List1, but with 7^10 points, this obviously takes up a lot of time.
What is the fastest way to do this? Any concepts from Computational Geometry might help?
EDIT: I have the following in place, I have built a kd-tree out of List2 and then now I am doing a nearest-neighborhood search for each point in List1. Now as I originally pointed out, List1 has 7^10 points, and hence though I am saving on the brute force, Euclidean distance method for every pair, the sheer large number of points in List1 is causing a lot of time consumption. Is there any way I can improve this?

Well a good way would be to use something like a kd-tree and perform nearest neighbour searching. Fortunately you do not have to implement this data structure yourself, it has been done before. I recommend this one, but there are others:
http://www.cs.umd.edu/~mount/ANN/

It's not possible to tell you which is the most efficient algorithm without knowing anything about the distribution of points in the two solutions. However, for a first guess...
First algorithm doesn't work — for two reasons: (1) a wrong assumption - I assume the bounding hulls are disjoint, and (2) a misreading of the question - it doesn't find the shortest edge for every pair of points.
...compute the convex hull of the two sets: the closest points must be on the hyperface on the two hulls through which the line between the two centres of gravity passes.
You can compute the convex hull by computing the centre points, the centre of gravity assuming all points have equal mass, and ordering the lists from furthest from the centre to least far. Then take the furthest away point in the list, add this to the convex hull, and then remove all points that are within the so-far computed convex hull (you will need to compute lots of 10d hypertriangles to do this). Repeat unil there is nothing left in the list that is not on the convex hull.
Second algorithm: partial
Compute the convex hull for List2. For each point of List1, if the point is outside the convex hull, then find the hyperface as for first algorithm: the nearest point must be on this face. If it is on the face, likewise. If it is inside, you can still find the hyperface by extending the line past the point from List1: the nearest point must be inside the ball that includes the hyperface to List2's centre of gravity: here, though, you need a new algorithm to get the nearest point, perhaps the kd-tree approach.
Perfomance
When List2 is something like evenly distributed, or normally distributed, through some fairly oblique shape, this will do a good job of reducing the number of points under consideration, and it should be compatible with the kd-tree suggestion.
There are some horrible worts cases, though: if List2 contains only points on the surface of a torus whose geometric centre is the centre of gravity of the list, then the convex hull will be very expensive to calculate, and will not help much in reducing the number of points under consideration.
My evaluation
These kinds of geometric techniques may be a useful complement to the kd-trees approach of other posters, but you need to know a little about the distribution of points before you can determine whether they are worth applying.

kd-tree is pretty fast. I've used the algorithm in this paper and it works well Bentley - K-d trees for semidynamic point sets
I'm sure there are libraries around, but it's nice to know what's going on sometimes - Bentley explains it well.
Basically, there are a number of ways to search a tree: Nearest N neighbors, All neighbors within a given radius, nearest N neighbors within a radius. Sometimes you want to search for bounded objects.
The idea is that the kdTree partitions the space recursively. Each node is split in 2 down the axis in one of the dimensions of the space you are in. Ideally it splits perpendicular to the node's longest dimension. You should keep splitting the space until you have about 4 points in each bucket.
Then for every query point, as you recursively visit nodes, you check the distance from to the partition wall for the particular node you are in. You descend both nodes (the one you are in and its sibling) if the distance to the partition wall is closer than the search radius. If the wall is beyond the radius, just search children of the node you are in.
When you get to a bucket (leaf node), you test the points in there to see if they are within the radius.
If you want the closest point, you can start with a massive radius, and pass a pointer or reference to it as you recurse - and in that way you can shrink the search radius as you find close points - and home in on the closest point pretty fast.

(A year later) kd trees that quit early, after looking at say 1M of all 200M points,
can be much faster in high dimensions.
The results are only statistically close to the absolute nearest, depending on the data and metric;
there's no free lunch.
(Note that sampling 1M points, and kd tree only those 1M, is quite different, worse.)
FLANN does this for image data with dim=128,
and is I believe in opencv. A local mod of the fast and solid
SciPy cKDTree also has cutoff= .

Related

Efficient sorting of integer vertices of a convex polygon

I am given as input n pairs of integers which describe points in the 2D plane that are known ahead of time to be vertices of some convex polygon.
I'd like to efficiently sort these points in a clockwise or counter-clockwise fashion.
At first I thought of doing something like the initial step of Graham Scan, but I can't see a simple way to break ties for vertices that make the same angle with the anchor point.
Notice that, as you walk along the sides of the polygon, sometimes these vertices may be getting closer to the anchor point, and sometimes they may be getting farther.
Something that does seem to work is producing a point in the interior of the polygon (for instance, the average of the n points) and using it is an anchor point for radial sorting of the input.
Indeed, because the anchor point lies in the interior, any ray emanating from it contains at most one input point, so there will be no ties.
The overall complexity is not affected: computing the midpoint is an O(n) task, and the bottleneck is the actual sorting.
This does involve a few more operations than the hopeful version of Graham Scan (where we assume there are no ties to be broken), but the bigger loss is leaving integer arithmethic behind by introducing division into the mix.
This in turn can be remedied with scaling everything by a factor of n, but at this point it seems like grasping at straws.
Am I missing something?
Is there a simpler, efficient way to solve this sorting problem, preferrably one that can avoid floating point calculations?

Nearest vertex search

I'm looking for effective algorithm to find a vertex nearest to point P(x, y, z). The set of vertices is fixed, each request comes with new point P. I tried kd-tree and others known methods and I've got same problem everywhere: if P is closer then all is fine, search is performed for few tree nodes only. However if P is far enough, then more and more nodes should be scanned and finally speed becomes unacceptable slow. In my task I've no ability to specify a small search radius. What are solutions for such case?
Thanks
Igor
One possible way to speed up your search would be to discretize space into a large number of rectangular prisms spaced apart at regular intervals. For example, you could split space up into lots of 1 × 1 × 1 unit cubes. You then distribute the points in space into these volumes. This gives you a sort of "hash function" for points that distributes points into the volume that contains them.
Once you have done this, do a quick precomputation step and find, for each of these volumes, the closest nonempty volumes. You could do this by checking all volumes one step away from the volume, then two steps away, etc.
Now, to do a nearest neighbor search, you can do the following. Start off by hashing your point in space to the volume that contains it. If that volume contains any points, iterate over all of them to find which one is closest. Then, for each of the volumes that you found in the first step of this process, iterate over those points to see if any of them are closer. The resulting closest point is the nearest neighbor to your test point.
If your volumes end up containing too many points, you can refine this approach by subdividing those volumes into even smaller volumes and repeating this same process. You could alternatively create a bunch of smaller k-d trees, one for each volume, to do the nearest-neighbor search. In this way, each k-d tree holds a much smaller number of points than your original k-d tree, and the points within each volume are all reasonable candidates for a nearest neighbor. Therefore, the search should be much, much faster.
This setup is similar in spirit to an octree, except that you divide space into a bunch of smaller regions rather than just eight.
Hope this helps!
Well, this is not an issue of the index structures used, but of your query:
the nearest neighbor becomes just much more fuzzy the further you are away from your data set.
So I doubt that any other index will help you much.
However, you may be able to plug in a threshold in your search. I.e. "find nearest neighbor, but only when within a maximum distance x".
For static, in-memory, 3-d point double vector data, with euclidean distance, the k-d-tree is hard to beat, actually. It just splits the data very very fast. An octree may sometimes be faster, but mostly for window queries I guess.
Now if you really have very few objects but millions of queries, you could try to do some hybrid approach. Roughly something like this: compute all points on the convex hull of your data set. Compute the center and radius. Whenever a query point is x times further away (you need to do the 3d math yourself to figure out the correct x), it's nearest neighbor must be one of the convex hull points. Then again use a k-d-tree, but one containing the hull points only.
Or even simpler. Find the min/max point in each dimension. Maybe add some additional extremes (in x+y, x-y, x+z, x-y, y+z, y-z etc.). So you get a small set of candidates. So lets for now assume that is 8 points. Precompute the center and the distances of these 6 points. Let m be the maximum distance from the center to these 8 points. For a query compute the distance to the center. If this is larger than m, compute the closest of these 6 candidates first. Then query the k-d-tree, but bound the search to this distance. This costs you 1 (for close) and 7 (for far neighbors) distance computations, and may significantly speed up your search by giving a good candidate early. For further speedups, organize these 6-26 candidates in a k-d-tree, too, to find the best bound quickly.

Nearest neighbor search with periodic boundary conditions

In a cubic box I have a large collection points in R^3. I'd like to find the k nearest neighbors for each point. Normally I'd think to use something like a k-d tree, but in this case I have periodic boundary conditions. As I understand it, a k-d tree works by partitioning the space by cutting it into hyper planes of one less dimension, i.e. in 3D we would split the space by drawing 2D planes. For any given point, it is either on the plane, above it, or below it. However, when you split the space with periodic boundary conditions a point could be considered to be on either side!
What's the most efficient method of finding and maintaining a list of nearest neighbors with periodic boundary conditions in R^3?
Approximations are not sufficient, and the points will only be moved one at a time (think Monte Carlo not N-body simulation).
Even in the Euclidean case, a point and its nearest neighbor may be on opposite sides of a hyperplane. The core of nearest-neighbor search in a k-d tree is a primitive that determines the distance between a point and a box; the only modification necessary for your case is to take the possibility of wraparound into account.
Alternatively, you could implement cover trees, which work on any metric.
(I'm posting this answer even though I'm not fully sure it works. Intuitively it seems right, but there might be an edge case I haven't considered)
If you're working with periodic boundary conditions, then you can think of space as being cut into a series of blocks of some fixed size that are all then superimposed on top of one another. Suppose that we're in R2. Then one option would be to replicate that block nine times and arrange them into a 3x3 grid of duplicates of the block. Given this, if we find the nearest neighbor of any single node in the central square, then either
The nearest neighbor is inside the central square, in which case the neighbor is a nearest neighbor, or
The nearest neighbor is in a square other than the central square. In that case, if we find the point in the central square that the neighbor corresponds to, that point should be the nearest neighbor of the original test point under the periodic boundary condition.
In other words, we just replicate the elements enough times so that the Euclidean distance between points lets us find the corresponding distance in the modulo space.
In n dimensions, you would need to make 3n copies of all the points, which sounds like a lot, but for R3 is only a 27x increase over the original data size. This is certainly a huge increase, but if it's within acceptable limits you should be able to use this trick to harness a standard kd-tree (or other spacial tree).
Hope this helps! (And hope this is correct!)

Shortest distance to rectangle caching

I have a list of rectangles that don't have to be parallel to the axes. I also have a master rectangle that is parallel to the axes.
I need an algorithm that can tell which rectangle is a point closest to(the point must be in the master rectangle). the list of rectangles and master rectangle won't change during the algorithm and will be called with many points so some data structure should be created to make the lookup faster.
To be clear: distance from a rectangle to a point is the distance between the closest point in the rectangle to the point.
What algorithm/data structure can be used for this? memory is on higher priority on this, n log n is ok but n^2 is not.
You should be able to do this with a Voronoi diagram with O(n log n) preprocessing time with O(log n) time queries. Because the objects are rectangles, not points, the cells may be curved. Nevertheless, a Voronoi diagram should work fine for your purposes. (See http://en.wikipedia.org/wiki/Voronoi_diagram)
For a quick and dirty solution that you could actually get working within a day, you could do something inspired by locality sensitive hashing. For example, if the rectangles are somewhat well-spaced, you could hash them into square buckets with a few different offsets, and then for each query examine each rectangle that falls in one of the handful of buckets that contain the query point.
You should be able to do this in O(n) time and O(n) memory.
Calculate the closest point on each edge of each rectangle to the point in question. To do this, see my detailed answer in the this question. Even though the question has to do with a point inside of the polygon (rather than outside of it), the algorithm still can be applied here.
Calculate the distance between each of these closest points on the edges, and find the closest point on the entire rectangle (for each rectangle) to the point in question. See the link above for more details.
Find the minimum distance between all of the rectangles. The rectangle corresponding with your minimum distance is the winner.
If memory is more valuable than speed, use brute force: for a given point S, compute the distance from S to each edge. Choose the rectangle with the shortest distance.
This solution requires no additional memory, while its execution time is in O(n).
Depending on your exact problem specification, you may have to adjust this solution if the rectangles are allowed to overlap with the master rectangle.
As you described, a distance between one point to a rectangle is the minimum length of all lines through that point which is perpendicular with all four edges of a rectangle and all lines connect that point with one of four vertices of the rectangle.
(My English is not good at describing a math solution, so I think you should think more deeply for understanding my explanation).
For each rectangle, you should save four vertices and four edges function for fast calculation distance between them with the specific point.

Largest triangle from a set of points [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
How to find largest triangle in convex hull aside from brute force search
I have a set of random points from which i want to find the largest triangle by area who's verticies are each on one of those points.
So far I have figured out that the largest triangle's verticies will only lie on the outside points of the cloud of points (or the convex hull) so i have programmed a function to do just that (using Graham scan in nlogn time).
However that's where I'm stuck. The only way I can figure out how to find the largest triangle from these points is to use brute force at n^3 time which is still acceptable in an average case as the convex hull algorithm usually kicks out the vast majority of points. However in a worst case scenario where points are on a circle, this method would fail miserably.
Dose anyone know an algorithm to do this more efficiently?
Note: I know that CGAL has this algorithm there but they do not go into any details on how its done. I don't want to use libraries, i want to learn this and program it myself (and also allow me to tweak it to exactly the way i want it to operate, just like the graham scan in which other implementations pick up collinear points that i don't want).
Don't know if this help, but if you choose two points from the convex hull and rotate all points of the hull so that the connecting line of the two points is parallel to the x-Axis, either the point with the maximum or the one with the minimum y-coordinate forms the triangle with the largest area together with the two points chosen first.
Of course once you have tested one point for all possible base lines, you can remove it from the list.
Here's a thought on how to get it down to O(n2 log n). I don't really know anything about computational geometry, so I'll mark it community wiki; please feel free to improve on this.
Preprocess the convex hull by finding for each point the range of slopes of lines through that point such that the set lies completely on one side of the line. Then invert this relationship: construct an interval tree for slopes with points in leaf nodes, such that when querying with a slope you find the points such that there is a tangent through those points.
If there are no sets of three or more collinear points on the convex hull, there are at most four points for each slope (two on each side), but in case of collinear points we can just ignore the intermediate points.
Now, iterate through all pairs of points (P,Q) on the convex hull. We want to find the point R such that triangle PQR has maximum area. Taking PQ as the base of the triangle, we want to maximize the height by finding R as far away from the line PQ as possible. The line through R parallel to PQ must be such that all points lie on one side of the line, so we can find a bounded number of candidates in time O(log n) using the preconstructed interval tree.
To improve this further in practice, do branch-and-bound in the set of pairs of points: find an upper bound for the height of any triangle (e.g. the maximum distance between two points), and discard any pair of points whose distance multiplied by this upper bound is less than the largest triangle found so far.
I think the rotating calipers method may apply here.
Off the top of my head, perhaps you could do something involving gridding/splitting the collection of points up into groups? Maybe... separating the points into three groups (not sure what the best way to do that in this case would be, though), doing something to discard those points in each group that are closer to the other two groups than other points in the same group, and then using the remaining points to find the largest triangle that can be made having one vertex in each group? This would actually make the case of all points being on a circle a lot simpler, because you'd just focus on the points that are near the center of the arcs contained within each group, as those would be the ones in each group furthest from the other two groups.
I'm not sure if this would give you the proper result for certain triangles/distributions of points, though. There may be situations where the resultant triangle isn't of optimal area, either because the grouping and/or the vertex choosing aren't/isn't optimal. Something like that.
Anyway, those are my thoughts on the problem. I hope I've at least been able to give you ideas for how to work on it.
How about dropping a point at a time from the convex hull? Starting with the convex hull, calculate the area of the triangle formed by each triple of adjacent points (p1p2p3, p2p3p4, etc.). Find the triangle with minimum area, then drop the middle of the three points that formed that triangle. (In other words, if the smallest area triangle is p3p4p5, drop P4.) Now you have a convex polygon with N-1 points. Repeat the same procedure until you are left with three points. This should take O(N^2) time.
I would not be at all surprised if there is some pathological case where this doesn't work, but I expect that it would work for the majority of cases. (In other words, I haven't proven this, and I have no source to cite.)

Resources