If I have a set of N points in 2D space, defined by vectors X and Y of their locations. What is an efficient algorithm that will
Select a fixed number (M) points to remove so as to maximize the shortest nearest-neighbor distance among the remaining points.
Remove a minimum number of points so that the shortest nearest-neighbor distance among the remaining points is greater than a fixed distance (D).
Sorting by points by their shortest nearest neighbor distance and removing the points with the smallest values does not give the correct answer, as then you remove both points of close pairs, while you may only need to remove one of the points in those pairs.
For my case, I am usually dealing with 1,000-10,000 points, and I may remove 50-90% of points.
You shouldn't need to store (or compute) the entire distance matrix: a Delaunay triangulation should efficiently (O(n log n) worst case) give you the closest neighbors of your point set. You should also be able to update it efficiently as you delete points.
For most cases of close pairs, you should be able to check to see which of the pair would be farthest from its neighbors if the other is removed. This is not an exact solution; especially if you remove a large proportion of points, removing a locally optimum point may exclude the globally optimum solution. Also, you should be able to deal with clusters of 3 or more locally close points. However, if you are only removing a small proportion of points from a randomly distributed set, both these cases may be relatively rare.
There may or may not be a better way (i.e., an exact and efficient algorithm) to solve your problem, but the above suggestions should lead to an approximate and/or combinatorial approach which works best when the points that need deleting are sparsely distributed.
Noam
One method is to break your 2D space into N partitions. Within each partition, determine an average position for each X,Y. Then perform the nearest neighbor algorightm on the averaged points. Then repeat the nearest neighbor test on the full point set of the partitions that matched.
Here's the catch. The larger the partitions, the fewer points you will have but the less accurate. The smaller the partitions, it will be more accurate but with more points to process.
I can't think of anything other than a brute force approach. But you can probably shorten the data set you are looking at significantly before any analysis.
So, what I would do is. First work out the nearest neighbour distance for each point. Let's call that P_in. Then work out the maximum distance of each point to its M nearest neighbours, call it P_iM. If P_in is greater than P_iM for any point then it can be excluded from the analysis. Basically if you have one point that is a distance of 10 from any other point, and you have another point that is a distance of 9 from the nearest M points then you should remove the first point.
Depending on the level of clustering or how big M is, this might reduce your data set quite a bit.
Related
I have a point p, and n line segments in the 2d space. Is there a way I can preprocess the line segments so that I can efficiently (i.e. sublineraly) find the line segment closest (i.e with lowest perpendicular distance) to P?
This is a real-world problem we're trying to solve. The best (approximate) answer we have is to preprocess the ends of the line segments of the points into a quad tree/2d kd tree, and find the nearest point. This should lead to a nearly optimal answer (or maybe even correct answer) in most cases.
Alternately one can use Mongodb's geonear, which works with points as well.
Can we do better than this, particularly in terms of accuracy?
If your segments are uniformly spread and not too long, you can think of a gridding approach: choose a cell size and determine for every cell which segment crosses it (this is done by "drawing" the segments on the grid). Then for a query point, find the nearest non-empty cell, by visiting neighborhoods of increasing size, and compute the exact nearest distance to the segment(s) so found. You need to continue the search as long as the distance between the query point and the next cells does not exceed the shortest distance found so far.
If the distribution is not uniform, a quad-tree decomposition can be better.
More generally, a suitable strategy is to use any acceleration device that quickly will report a small number of candidate segments, with a guaranty: the nearest segment must be among the candidates.
Consider this question relative to graph theory:
Let G a complete (every vertex is connected to all the other vertices) non-directed graph of size N x N. Two "salesmen" travel this way: the first always visits the nearest non visited vertex, the second the farthest, until they have both visited all the vertices. We must generate a matrix of distances and the starting points for the two salesmen (they can be different) such that:
All the distances are unique Edit: positive integers
The distance from a vertex to itself is always 0.
The difference between the total distance covered by the two salesmen must be a specific number, D.
The distance from A to B is equal to the distance from B to A
What efficient algorithms cn be useful to help me? I can only think of backtracking, but I don't see any way to reduce the work to be done by the program.
Geometry is helpful.
Using the distances of points on a circle seems like it would work. Seems like you could determine adjust D by making the circle radius larger or smaller.
Alternatively really any 2D shape, where the distances are all different could probably used as well. In this case you should scale up or down the shape to obtain the correct D.
Edit: Now that I think about it, the simplest solution may be to simply pick N random 2D points, say 32 bit integer coordinates to lower the chances of any distances being too close to equal. If two distances are too close, just pick a different point for one of them until it's valid.
Ideally, you'd then just need to work out a formula to determine the relationship between D and the scaling factor, which I'm not sure of offhand. If nothing else, you could also just use binary search or interpolation search or something to search for scaling factor to obtain the required D, but that's a slower method.
I'm looking for effective algorithm to find a vertex nearest to point P(x, y, z). The set of vertices is fixed, each request comes with new point P. I tried kd-tree and others known methods and I've got same problem everywhere: if P is closer then all is fine, search is performed for few tree nodes only. However if P is far enough, then more and more nodes should be scanned and finally speed becomes unacceptable slow. In my task I've no ability to specify a small search radius. What are solutions for such case?
Thanks
Igor
One possible way to speed up your search would be to discretize space into a large number of rectangular prisms spaced apart at regular intervals. For example, you could split space up into lots of 1 × 1 × 1 unit cubes. You then distribute the points in space into these volumes. This gives you a sort of "hash function" for points that distributes points into the volume that contains them.
Once you have done this, do a quick precomputation step and find, for each of these volumes, the closest nonempty volumes. You could do this by checking all volumes one step away from the volume, then two steps away, etc.
Now, to do a nearest neighbor search, you can do the following. Start off by hashing your point in space to the volume that contains it. If that volume contains any points, iterate over all of them to find which one is closest. Then, for each of the volumes that you found in the first step of this process, iterate over those points to see if any of them are closer. The resulting closest point is the nearest neighbor to your test point.
If your volumes end up containing too many points, you can refine this approach by subdividing those volumes into even smaller volumes and repeating this same process. You could alternatively create a bunch of smaller k-d trees, one for each volume, to do the nearest-neighbor search. In this way, each k-d tree holds a much smaller number of points than your original k-d tree, and the points within each volume are all reasonable candidates for a nearest neighbor. Therefore, the search should be much, much faster.
This setup is similar in spirit to an octree, except that you divide space into a bunch of smaller regions rather than just eight.
Hope this helps!
Well, this is not an issue of the index structures used, but of your query:
the nearest neighbor becomes just much more fuzzy the further you are away from your data set.
So I doubt that any other index will help you much.
However, you may be able to plug in a threshold in your search. I.e. "find nearest neighbor, but only when within a maximum distance x".
For static, in-memory, 3-d point double vector data, with euclidean distance, the k-d-tree is hard to beat, actually. It just splits the data very very fast. An octree may sometimes be faster, but mostly for window queries I guess.
Now if you really have very few objects but millions of queries, you could try to do some hybrid approach. Roughly something like this: compute all points on the convex hull of your data set. Compute the center and radius. Whenever a query point is x times further away (you need to do the 3d math yourself to figure out the correct x), it's nearest neighbor must be one of the convex hull points. Then again use a k-d-tree, but one containing the hull points only.
Or even simpler. Find the min/max point in each dimension. Maybe add some additional extremes (in x+y, x-y, x+z, x-y, y+z, y-z etc.). So you get a small set of candidates. So lets for now assume that is 8 points. Precompute the center and the distances of these 6 points. Let m be the maximum distance from the center to these 8 points. For a query compute the distance to the center. If this is larger than m, compute the closest of these 6 candidates first. Then query the k-d-tree, but bound the search to this distance. This costs you 1 (for close) and 7 (for far neighbors) distance computations, and may significantly speed up your search by giving a good candidate early. For further speedups, organize these 6-26 candidates in a k-d-tree, too, to find the best bound quickly.
List1 contains a high number (~7^10) of N-dimensional points (N <=10), List2 contains the same or fewer number of N-dimensional points (N <=10).
My task is this: I want to check which point in List2 is closest (euclidean distance) to a point in List1 for every point in List1 and subsequently perform some operation on it. I have been doing it the simple- the nested loop way when I didn't have more than 50 points in List1, but with 7^10 points, this obviously takes up a lot of time.
What is the fastest way to do this? Any concepts from Computational Geometry might help?
EDIT: I have the following in place, I have built a kd-tree out of List2 and then now I am doing a nearest-neighborhood search for each point in List1. Now as I originally pointed out, List1 has 7^10 points, and hence though I am saving on the brute force, Euclidean distance method for every pair, the sheer large number of points in List1 is causing a lot of time consumption. Is there any way I can improve this?
Well a good way would be to use something like a kd-tree and perform nearest neighbour searching. Fortunately you do not have to implement this data structure yourself, it has been done before. I recommend this one, but there are others:
http://www.cs.umd.edu/~mount/ANN/
It's not possible to tell you which is the most efficient algorithm without knowing anything about the distribution of points in the two solutions. However, for a first guess...
First algorithm doesn't work — for two reasons: (1) a wrong assumption - I assume the bounding hulls are disjoint, and (2) a misreading of the question - it doesn't find the shortest edge for every pair of points.
...compute the convex hull of the two sets: the closest points must be on the hyperface on the two hulls through which the line between the two centres of gravity passes.
You can compute the convex hull by computing the centre points, the centre of gravity assuming all points have equal mass, and ordering the lists from furthest from the centre to least far. Then take the furthest away point in the list, add this to the convex hull, and then remove all points that are within the so-far computed convex hull (you will need to compute lots of 10d hypertriangles to do this). Repeat unil there is nothing left in the list that is not on the convex hull.
Second algorithm: partial
Compute the convex hull for List2. For each point of List1, if the point is outside the convex hull, then find the hyperface as for first algorithm: the nearest point must be on this face. If it is on the face, likewise. If it is inside, you can still find the hyperface by extending the line past the point from List1: the nearest point must be inside the ball that includes the hyperface to List2's centre of gravity: here, though, you need a new algorithm to get the nearest point, perhaps the kd-tree approach.
Perfomance
When List2 is something like evenly distributed, or normally distributed, through some fairly oblique shape, this will do a good job of reducing the number of points under consideration, and it should be compatible with the kd-tree suggestion.
There are some horrible worts cases, though: if List2 contains only points on the surface of a torus whose geometric centre is the centre of gravity of the list, then the convex hull will be very expensive to calculate, and will not help much in reducing the number of points under consideration.
My evaluation
These kinds of geometric techniques may be a useful complement to the kd-trees approach of other posters, but you need to know a little about the distribution of points before you can determine whether they are worth applying.
kd-tree is pretty fast. I've used the algorithm in this paper and it works well Bentley - K-d trees for semidynamic point sets
I'm sure there are libraries around, but it's nice to know what's going on sometimes - Bentley explains it well.
Basically, there are a number of ways to search a tree: Nearest N neighbors, All neighbors within a given radius, nearest N neighbors within a radius. Sometimes you want to search for bounded objects.
The idea is that the kdTree partitions the space recursively. Each node is split in 2 down the axis in one of the dimensions of the space you are in. Ideally it splits perpendicular to the node's longest dimension. You should keep splitting the space until you have about 4 points in each bucket.
Then for every query point, as you recursively visit nodes, you check the distance from to the partition wall for the particular node you are in. You descend both nodes (the one you are in and its sibling) if the distance to the partition wall is closer than the search radius. If the wall is beyond the radius, just search children of the node you are in.
When you get to a bucket (leaf node), you test the points in there to see if they are within the radius.
If you want the closest point, you can start with a massive radius, and pass a pointer or reference to it as you recurse - and in that way you can shrink the search radius as you find close points - and home in on the closest point pretty fast.
(A year later) kd trees that quit early, after looking at say 1M of all 200M points,
can be much faster in high dimensions.
The results are only statistically close to the absolute nearest, depending on the data and metric;
there's no free lunch.
(Note that sampling 1M points, and kd tree only those 1M, is quite different, worse.)
FLANN does this for image data with dim=128,
and is I believe in opencv. A local mod of the fast and solid
SciPy cKDTree also has cutoff= .
Problem: I have two overlapping 2D shapes, A and B, each shape having the same number of pixels, but differing in shape. Some portion of the shapes are overlapping, and there are some pieces of each that are not overlapping. My goal is to move all the non-overlapping pixels in shape A to the non-overlapping pixels in shape B. Since the number of pixels in each shape is the same, I should be able to find a 1-to-1 mapping of pixels. The restriction is that I want to find the mapping that minimizes the total distance traveled by all the pixels that moved.
Brute Force: The brute force approach to solving this problem is obviously out of the question, since I would have to compute the total distance of all possible mappings of which I think there are n! (where n is the number of non-overlapping pixels in one shape) times the computation of calculating a distance for each pair of points in the mapping, n, giving a total of O( n * n! ) or something similar.
Backtracking: The only "better" solution I could think of was to use backtracking, where I would keep track of the current minimum so far and at any point when I'm evaluating a certain mapping, if I reach or exceed that minimum, I move on to the next mapping. Even this won't do any better than O( n! ).
Is there any way to solve this problem with a reasonable complexity?
Also note that the "obvious" approach of simply mapping a point to it's closest matching neighbour does not always yield the optimum solution.
Simpler Approach?: As a secondary question, if a feasible solution doesn't exist, one possibility might be to partition each non-overlapping section into small regions, and map these regions, greatly reducing the number of mappings. To calculate the distance between two regions I would use the center of mass (average of the pixel locations in the region). However, this presents the problem of how I should go about doing the partitioning in order to get a near-optimal answer.
Any ideas are appreciated!!
This is the Minimum Matching problem, and you are correct that it is a hard problem in general. However for the 2D Euclidean Bipartite Minimum Matching case it is solvable in close to O(n²) (see link).
For fast approximations, FryGuy is on the right track with Simulated Annealing. This is one approach.
Also take a look at Approximation algorithms for bipartite and non-bipartite matching in the plane for a O((n/ε)^1.5*log^5(n)) (1+ε)-randomized approximation scheme.
You might consider simulated annealing for this. Start off by assigning A[x] -> B[y] for each pixel, randomly, and calculate the sum of squared distances. Then swap a pair of x<->y mappings, randomly. Then choose to accept this with probability Q, where Q is higher if the new mapping is better, and tends towards zero over time. See the wikipedia article for a better explanation.
Sort pixels in shape A: in increasing order of 'x' and then 'y' ordinates
Sort pixels in shape B: in decreasing order of 'x' and then increasing 'y'
Map pixels at the same index: in the sorted list the first pixel in A will map to first pixel in B. Is this not the mapping you are looking for?