I have a static set of simple polygons (they may be nonconvex but are not self-intersecting) and a large number of query ellipses. Assume that this is all being done in 2D. I need to find the distance between each ellipse and the closest polygon to that ellipse. Distance is defined as the shortest distance between any two points on the ellipse and polygon respectively. If the ellipse intersects a polygon, then we can say the distance is 0 or assign some negative value.
A brute force approach would simply compute the distance between each ellipse and each polygon and return the lowest distance, in O(mn) time per ellipse, where m is the number of polygons and n is the average number of vertices per polygon. I would like to reduce the m term here because I think I can cull the number of polygons being considered through some spatial analysis.
I've considered a few approaches including Voronoi diagrams, R-trees, and kd-trees. However, most of these seem to involve points and I'm not sure how to extend them to polygons. I think the most promising approach involves computing bounding boxes for each polygon and the ellipse and using an R-tree to find some set of nearby polygons. However, I'm not quite sure about the best way to find this close set of polygons. Or perhaps there's a better way that I'm overlooking.
Using bounding boxes or disks has the benefit of reducing the work to compute an ellipse/polygon distance estimate to O(1). And it allows you to obtain a lower and an upper bound on the true distance.
Assume you use disks and also enclose the ellipse in a disk. You will need to perform a modified nearest neighbors search, that enumerates the disks such that the lower bound of their distance to the query disk is smaller than the best upper bound found so far.
This can be accelerated by means of a k-D tree (D=2) built on the disk centers. You can enhance every node in the k-D tree with the radius of the largest and smallest disks in the subtree it roots. During the search you will use this information to evaluate the bounds without knowing the exact radii of the disks, but the deeper you go in the tree, the better you know them.
Perform a search to obtain the tightest upper bound on the distance, then a second search to enumerate all the disks with a lower bound smaller than the tightest upper bound. This will reduce the number of disks to be considered.
You can also use bounding boxes, and store the min/max width/height of the boxes in the tree nodes.
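As a minimal sketch of the pruning idea (with made-up centers and radii, and with the k-d tree acceleration left out): for each polygon's bounding disk, the center distance minus both radii is a lower bound on the true ellipse/polygon distance, and the center distance plus both radii is an upper bound.

```python
import numpy as np

rng = np.random.default_rng(0)
centers = rng.uniform(0, 100, size=(50, 2))   # bounding-disk centers, one per polygon
radii = rng.uniform(1, 5, size=50)            # bounding-disk radii

q_center = np.array([50.0, 50.0])             # center of the disk enclosing the ellipse
q_radius = 4.0

d = np.linalg.norm(centers - q_center, axis=1)
lower = np.maximum(d - radii - q_radius, 0.0)  # no polygon can be closer than this
upper = d + radii + q_radius                   # some polygon point is within this

best_upper = upper.min()
candidates = np.flatnonzero(lower <= best_upper)  # only these need the exact O(n) test
```

Only the candidate polygons then need the exact ellipse/polygon distance computation; everything else is provably farther away.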
Given a 3D point cloud, how can I find the smallest bounding sphere that contains a given percentage of points?
I.e. if I have a point cloud with some noise, and I want to ignore 5% of outliers, how can I get the smallest sphere that contains the 95% remaining points, if I do not know which points are the outliers?
Example: I want to find the green sphere, not the red sphere:
I am looking for a reasonably fast and simple algorithm. It does not have to find the optimal solution; a reasonable approximation is fine as well.
I know how to calculate the approximate bounding sphere for 100% of points, e.g. with Ritter's algorithm.
How can I generalize this to an algorithm that finds the smallest sphere containing x% of points?
Just an idea: binary search.
First, use one of the bounding-sphere algorithms to find the 100% bounding sphere.
Fix the centerpoint of the 95% sphere to be the same as the centerpoint of the 100% sphere. (There is no guarantee it is, but you say you're OK with an approximate answer.) Then use binary search on the radius of the sphere until you get 95% +- epsilon of the points inside.
Assuming the points are sorted by their distance (or squared distance, to be slightly faster) from the centerpoint, for a fixed radius r it takes O(log n) operations to find the number of points inside the sphere with radius r, e.g. by using another binary search. The binary search for the right r itself requires a logarithmic number of such evaluations. Therefore the whole search should take just O(log² n) steps after you have found the 100% sphere.
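A minimal sketch of the two nested binary searches, on made-up data and using the centroid as a stand-in for the 100%-sphere centerpoint:

```python
import numpy as np
from bisect import bisect_right

rng = np.random.default_rng(1)
pts = rng.normal(size=(1000, 3))
center = pts.mean(axis=0)            # stand-in for the 100%-sphere centerpoint

d2 = np.sort(((pts - center) ** 2).sum(axis=1))   # sorted squared distances
target = int(0.95 * len(pts))        # want >= 95% of the points inside

lo, hi = 0.0, float(d2[-1])
for _ in range(60):                  # outer binary search on the squared radius
    mid = 0.5 * (lo + hi)
    inside = bisect_right(d2, mid)   # inner O(log n) count of points within sqrt(mid)
    if inside < target:
        lo = mid
    else:
        hi = mid
radius = hi ** 0.5
```

Since the distances are already sorted, the same radius can also be read off directly as sqrt(d2[target - 1]); the binary search converges to that value.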
Edit: if you think the center of the reduced sphere could be too far away from the full sphere's, you can recalculate the bounding sphere, or just the center of mass of the point set, each time after throwing away some points. Each recalculation should take no more than O(n). After recalculation, re-sort the points by their distance from the new centerpoint. Since you expect them to be already nearly sorted, you can rely on bubble sort, which for nearly-sorted data works in O(n + epsilon). Remember that there will be just a logarithmic number of these passes needed, so you should be able to get away with close to O(n log² n) for the whole thing.
It depends on exactly what performance you're looking for and what you're willing to sacrifice for it. (I would be happy to learn that I'm wrong and that there's a good exact algorithm for this.)
The algorithm of ryann is not that bad. I suggested robustifying it with a geometric median, then came to this sketch:
compute the NxN inter-distances in O(N^2)
sum each row of this matrix (= the distance of one point to the others) in O(N^2)
sort the obtained "crowd" distance in O(N*log N)
(the point with smallest distance is an approximation of the geometric median)
remove the 5% with the largest crowd-distance, O(1) once sorted
(here we treat the points with the largest crowd-distance as the outliers,
instead of taking those with the largest distance from the median)
compute radius of obtained sphere in O(N)
Of course, it also suffers from sub-optimality, but it should perform a bit better in the case of far outliers. Overall cost is O(N^2).
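A sketch of these steps on made-up data (ten planted outliers among 200 points):

```python
import numpy as np

rng = np.random.default_rng(2)
pts = rng.normal(size=(200, 3))
pts[:10] += 20.0                     # plant 10 far outliers (5% of the points)

diff = pts[:, None, :] - pts[None, :, :]
dist = np.sqrt((diff ** 2).sum(-1))  # N x N inter-distances, O(N^2)
crowd = dist.sum(axis=1)             # "crowd" distance: each row's sum

order = np.argsort(crowd)            # O(N log N)
center = pts[order[0]]               # smallest crowd-distance ~ geometric median
kept = pts[order[: int(0.95 * len(pts))]]   # drop the 5% largest
radius = float(np.linalg.norm(kept - center, axis=1).max())
```

The planted outliers end up with by far the largest crowd-distances, so exactly they are dropped and the resulting sphere hugs the inlier cluster.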
I would iterate the following two steps until convergence:
1) Given a group of points, find the smallest sphere enclosing 100% of the points and work out its centre.
2) Given a centre, find the group of points containing 95% of the original number which is closest to the centre.
Each step reduces (or at least does not increase) the radius of the sphere involved, so you can declare convergence when the radius stops decreasing.
In fact, I would iterate from multiple random starts, each start produced by finding the smallest sphere that contains all of a small subset of points. I note that if you have 10 outliers, and you divide your set of points into 11 parts, at least one of those parts will not have any outliers.
(This is very loosely based on https://en.wikipedia.org/wiki/Random_sample_consensus)
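A sketch of the two-step iteration on made-up data; the bounding-box midpoint is used here as a crude stand-in for a proper smallest-enclosing-sphere routine, and six symmetric far-away points are planted so the effect is visible:

```python
import numpy as np

rng = np.random.default_rng(3)
pts = rng.normal(size=(500, 3))
pts[:6] = [[30.0, 0, 0], [-30.0, 0, 0], [0, 30.0, 0],
           [0, -30.0, 0], [0, 0, 30.0], [0, 0, -30.0]]   # planted outliers
keep_n = int(0.95 * len(pts))

def sphere_center(p):
    # crude stand-in for a smallest-enclosing-sphere centre:
    # the midpoint of the axis-aligned bounding box
    return 0.5 * (p.min(axis=0) + p.max(axis=0))

group = pts
prev_radius = np.inf
for _ in range(100):
    c = sphere_center(group)                       # step 1: centre of enclosing sphere
    d = np.linalg.norm(pts - c, axis=1)
    group = pts[np.argsort(d)[:keep_n]]            # step 2: the 95% closest to c
    radius = np.linalg.norm(group - sphere_center(group), axis=1).max()
    if radius >= prev_radius - 1e-12:
        break                                      # radius stopped decreasing
    prev_radius = radius
```

With the crude centre approximation, monotonic shrinkage isn't guaranteed, so the loop also stops as soon as the radius fails to improve.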
The distance from the average point location would probably give a reasonable indication of whether a point is an outlier or not.
The algorithm might look something like:
1) Find the bounding sphere of the points
2) Find the average point location
3) Choose the point on the bounding sphere that is farthest from the average location and remove it as an outlier
4) Repeat steps 1-3 until you've removed 5% of the points
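A sketch of those steps on made-up data; for simplicity, each round removes the point farthest from the average location, rather than restricting the choice to points lying on the bounding sphere:

```python
import numpy as np

rng = np.random.default_rng(4)
pts = rng.normal(size=(200, 2))
pts[0] += 30.0                        # one far outlier

n_remove = int(0.05 * len(pts))       # the 5% to discard
work = pts.copy()
for _ in range(n_remove):
    centroid = work.mean(axis=0)                  # average point location
    far = np.linalg.norm(work - centroid, axis=1).argmax()
    work = np.delete(work, far, axis=0)           # remove the farthest point

center = work.mean(axis=0)
radius = float(np.linalg.norm(work - center, axis=1).max())
```

The planted outlier is by far the farthest point, so it is discarded in the first round, and the final sphere covers only the tight cluster.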
Find the Euclidean minimum spanning tree, and check the edges in descending order of length. For each edge, consider the sets of points in the two connected trees you get by deleting the edge.
If the smaller set of points is less that 5% of the total, and the bounding sphere around the larger set of points doesn't overlap it, then delete the smaller set of points. (This condition is necessary in case you have an 'oasis' of empty space in the center of your point cloud).
Repeat this until you hit your threshold or the lengths are getting 'small enough' that you don't care to delete them.
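A sketch of this on made-up data, using SciPy's graph routines to build the EMST (from the dense distance matrix, which is fine at this size); the bounding-sphere overlap check for the 'oasis' case is omitted for brevity:

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components
from scipy.spatial import distance_matrix

rng = np.random.default_rng(5)
pts = rng.normal(size=(100, 2))
pts[:3] += 25.0                         # a small clump of far-away points

n = len(pts)
mst = minimum_spanning_tree(distance_matrix(pts, pts)).tocoo()
rows, cols, lengths = mst.row, mst.col, mst.data

budget = int(0.05 * n)                  # at most 5% may be deleted
active = np.ones(len(lengths), dtype=bool)
outlier = np.zeros(n, dtype=bool)
for k in np.argsort(lengths)[::-1]:     # longest edge first
    active[k] = False                   # tentatively delete this edge
    g = coo_matrix((np.ones(int(active.sum())), (rows[active], cols[active])),
                   shape=(n, n))
    _, labels = connected_components(g, directed=False)
    side_a = labels == labels[rows[k]]
    side_b = labels == labels[cols[k]]
    small = side_a if side_a.sum() < side_b.sum() else side_b
    if small.sum() <= budget - outlier.sum():
        outlier |= small                # small enough: drop the smaller side
    else:
        active[k] = True                # would delete too much: stop here
        break
```

The longest MST edge is the bridge to the planted clump, so the clump is separated and marked first; the loop stops once a cut would exceed the 5% budget.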
The input is a series of point coordinates (x0,y0),(x1,y1) .... (xn,yn) (n is not very large, say ~1000). We need to create some rectangles as bounding boxes of these points. There's no need to find the globally optimal solution. The only requirement is that if the Euclidean distance between two points is less than R, they should be in the same bounding rectangle. I've searched for some time and it seems to be a clustering problem, for which the K-means method might be useful.
However, the input point coordinates don't follow any specific pattern from one instance to the next, so it may not be possible to set a specific K for K-means. I am wondering if there is any algorithm or method that can solve this problem?
The only requirement is that if the Euclidean distance between two points is less than R, they should be in the same bounding rectangle
This is the definition of single-linkage hierarchical clustering cut at a height of R.
Note that this may yield overlapping rectangles.
For much faster and highly efficient methods, have a look at bulk loading strategies for R*-trees, such as sort-tile-recursive. It won't satisfy your "only" requirement above, but it will yield well balanced, non-overlapping rectangles.
K-means is obviously not appropriate for your requirements.
With only 1000 points I would do the following:
1) Work out the distance between all pairs of points. If the distance of a pair is less than R, they need to go in the same bounding rectangle, so use http://en.wikipedia.org/wiki/Disjoint-set_data_structure to record this.
2) For each subset that comes out of your Disjoint set data structure, work out the min and max co-ordinates of the points in it and use this to create a bounding box for the points in this subset.
If you have more points or are worried about efficiency, you will want to make stage (1) more efficient. One easy way is to sweep through the points in order of x coordinate, keeping only points at most R to the left of the most recent point seen. Use a balanced tree structure to find, among these, the points at most R above or below the most recent point, and only then compute exact distances to it. A step up from this would be a proper spatial data structure for finding all pairs within distance R of each other even more efficiently.
Note that for some inputs you will get just one huge bounding box because you have long chains of points, and for some other inputs you will get bounding boxes inside bounding boxes, for instance if your points are in concentric circles.
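A sketch of both stages on made-up points, with a small hand-rolled disjoint-set structure:

```python
import numpy as np

def find(parent, i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]   # path halving
        i = parent[i]
    return i

def union(parent, a, b):
    parent[find(parent, a)] = find(parent, b)

rng = np.random.default_rng(6)
pts = rng.uniform(0, 100, size=(300, 2))
R = 5.0
n = len(pts)

parent = list(range(n))
for i in range(n):                      # stage 1: union all pairs closer than R
    for j in range(i + 1, n):
        if np.linalg.norm(pts[i] - pts[j]) < R:
            union(parent, i, j)

boxes = {}                              # stage 2: min/max box per subset
for i in range(n):
    root = find(parent, i)
    x, y = pts[i]
    if root in boxes:
        b = boxes[root]
        b[0] = min(b[0], x); b[1] = max(b[1], x)
        b[2] = min(b[2], y); b[3] = max(b[3], y)
    else:
        boxes[root] = [x, x, y, y]      # xmin, xmax, ymin, ymax
```

Every pair of points closer than R ends up in the same subset, and hence in the same box; as noted above, chains of points can make a single box very large.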
In a cubic box I have a large collection of points in R^3. I'd like to find the k nearest neighbors of each point. Normally I'd think to use something like a k-d tree, but in this case I have periodic boundary conditions. As I understand it, a k-d tree works by recursively partitioning the space with hyperplanes of one less dimension, i.e. in 3D we split the space with 2D planes. Any given point is either on the plane, above it, or below it. However, when you split a space with periodic boundary conditions, a point can be considered to be on either side!
What's the most efficient method of finding and maintaining a list of nearest neighbors with periodic boundary conditions in R^3?
Approximations are not sufficient, and the points will only be moved one at a time (think Monte Carlo not N-body simulation).
Even in the Euclidean case, a point and its nearest neighbor may be on opposite sides of a hyperplane. The core of nearest-neighbor search in a k-d tree is a primitive that determines the distance between a point and a box; the only modification necessary for your case is to take the possibility of wraparound into account.
Alternatively, you could implement cover trees, which work on any metric.
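As an aside, SciPy's cKDTree already implements exactly this wraparound-aware point-to-box primitive: passing boxsize switches all distance computations to the torus (the data here are made up):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(7)
pts = rng.uniform(0, 1, size=(500, 3))     # unit box with wraparound
pts[0] = [0.001, 0.5, 0.5]
pts[1] = [0.999, 0.5, 0.5]                 # close to pts[0] only across the boundary

tree = cKDTree(pts, boxsize=1.0)           # boxsize turns on periodic distances
dist, idx = tree.query(pts[0], k=2)        # k=2: the first hit is the point itself
```

The two planted points are 0.998 apart in ordinary Euclidean distance but only 0.002 apart on the torus, and the periodic tree reports the latter.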
(I'm posting this answer even though I'm not fully sure it works. Intuitively it seems right, but there might be an edge case I haven't considered)
If you're working with periodic boundary conditions, then you can think of space as being cut into a series of blocks of some fixed size that are all superimposed on top of one another. Suppose that we're in R^2. Then one option would be to replicate that block nine times and arrange the copies into a 3x3 grid of duplicates of the block. Given this, if we find the nearest neighbor of any single node in the central square, then either
The nearest neighbor is inside the central square, in which case it is the true nearest neighbor, or
The nearest neighbor is in a square other than the central square. In that case, the point in the central square that this neighbor is a copy of is the nearest neighbor of the original test point under the periodic boundary conditions.
In other words, we just replicate the elements enough times so that the Euclidean distance between points lets us find the corresponding distance in the modulo space.
In n dimensions, you would need to make 3^n copies of all the points, which sounds like a lot, but for R^3 it is only a 27x increase over the original data size. This is certainly a huge increase, but if it's within acceptable limits you should be able to use this trick to harness a standard kd-tree (or other spatial tree).
Hope this helps! (And hope this is correct!)
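A sketch of the replication trick in R^2 on made-up data, querying a standard kd-tree built over the 3^2 = 9 shifted copies:

```python
import numpy as np
from itertools import product
from scipy.spatial import cKDTree

L = 1.0                                     # side length of the periodic box
rng = np.random.default_rng(8)
pts = rng.uniform(0, L, size=(200, 2))
pts[0] = [0.001, 0.5]
pts[1] = [0.999, 0.5]                       # neighbors only via wraparound

offsets = np.array(list(product([-L, 0.0, L], repeat=2)))   # 3^2 = 9 shifts
tiled = (pts[None, :, :] + offsets[:, None, :]).reshape(-1, 2)
tree = cKDTree(tiled)                       # an ordinary, non-periodic kd-tree

d, i = tree.query(pts[0], k=2)              # k=2: the first hit is the point itself
orig = i % len(pts)                         # map each replica back to its original
```

Because the tiled index is offset_index * n + point_index, a simple modulo recovers which original point each replica corresponds to.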
I have a list of rectangles that don't have to be parallel to the axes. I also have a master rectangle that is parallel to the axes.
I need an algorithm that can tell which rectangle a given point is closest to (the point is guaranteed to be in the master rectangle). The list of rectangles and the master rectangle won't change during the algorithm, and it will be called with many points, so some data structure should be created up front to make lookups faster.
To be clear: distance from a rectangle to a point is the distance between the closest point in the rectangle to the point.
What algorithm/data structure can be used for this? Memory is the higher priority here: O(n log n) is OK but O(n^2) is not.
You should be able to do this with a Voronoi diagram with O(n log n) preprocessing time with O(log n) time queries. Because the objects are rectangles, not points, the cells may be curved. Nevertheless, a Voronoi diagram should work fine for your purposes. (See http://en.wikipedia.org/wiki/Voronoi_diagram)
For a quick and dirty solution that you could actually get working within a day, you could do something inspired by locality sensitive hashing. For example, if the rectangles are somewhat well-spaced, you could hash them into square buckets with a few different offsets, and then for each query examine each rectangle that falls in one of the handful of buckets that contain the query point.
You should be able to do this in O(n) time and O(n) memory.
Calculate the closest point on each edge of each rectangle to the point in question. To do this, see my detailed answer in this question. Even though that question has to do with a point inside the polygon (rather than outside of it), the algorithm can still be applied here.
Calculate the distance between each of these closest points on the edges, and find the closest point on the entire rectangle (for each rectangle) to the point in question. See the link above for more details.
Find the minimum distance between all of the rectangles. The rectangle corresponding with your minimum distance is the winner.
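A sketch of the per-rectangle distance primitive used in these steps (the rectangle and query points are made up): it returns 0 for points inside the rectangle, and otherwise the minimum distance to the four edges.

```python
import numpy as np

def cross2(u, v):
    return u[0] * v[1] - u[1] * v[0]        # z-component of the 2D cross product

def seg_dist(p, a, b):
    """Distance from point p to the segment from a to b."""
    ab, ap = b - a, p - a
    t = np.clip(np.dot(ap, ab) / np.dot(ab, ab), 0.0, 1.0)
    return float(np.linalg.norm(p - (a + t * ab)))

def rect_dist(p, corners):
    """Distance from p to a (possibly rotated) rectangle given by its four
    corners in order; returns 0.0 if p lies inside the rectangle."""
    signs = [cross2(corners[(k + 1) % 4] - corners[k], p - corners[k])
             for k in range(4)]
    if all(s >= 0 for s in signs) or all(s <= 0 for s in signs):
        return 0.0                           # same side of all four edges: inside
    return min(seg_dist(p, corners[k], corners[(k + 1) % 4]) for k in range(4))

rect = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 1.0], [0.0, 1.0]])
d_out = rect_dist(np.array([3.0, 0.5]), rect)   # 1.0: nearest to the right edge
```

Since only the corner coordinates are used, this works unchanged for rectangles that are not axis-aligned.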
If memory is more valuable than speed, use brute force: for a given point S, compute the distance from S to each edge of each rectangle, and choose the rectangle with the shortest distance.
This solution requires no additional memory, while its execution time is in O(n).
Depending on your exact problem specification, you may have to adjust this solution if the rectangles are allowed to overlap with the master rectangle.
As you described, the distance between a point and a rectangle is the minimum of the perpendicular distances from the point to the lines through the four edges (counting only those whose perpendicular foot lands on the edge itself) and the distances from the point to the four vertices.
For each rectangle, store the four vertices and the four edge-line equations, so this distance can be computed quickly against any query point.
List1 contains a high number (~7^10) of N-dimensional points (N <=10), List2 contains the same or fewer number of N-dimensional points (N <=10).
My task is this: for every point in List1, I want to find which point in List2 is closest (in Euclidean distance) to it, and subsequently perform some operation on it. I had been doing it the simple nested-loop way when I didn't have more than 50 points in List1, but with 7^10 points this obviously takes up a lot of time.
What is the fastest way to do this? Any concepts from Computational Geometry might help?
EDIT: I have the following in place: I have built a kd-tree out of List2 and I am now doing a nearest-neighbor search for each point in List1. As I originally pointed out, List1 has 7^10 points, so although I am saving over the brute-force Euclidean-distance method for every pair, the sheer number of points in List1 still causes a lot of time consumption. Is there any way I can improve this?
Well a good way would be to use something like a kd-tree and perform nearest neighbour searching. Fortunately you do not have to implement this data structure yourself, it has been done before. I recommend this one, but there are others:
http://www.cs.umd.edu/~mount/ANN/
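A sketch of exactly this setup with SciPy's cKDTree on made-up lists; one batched, parallel query call replaces the per-point loop, which is often the easy win here (the workers parameter needs SciPy >= 1.6; older releases call it n_jobs):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(9)
list1 = rng.uniform(size=(5000, 10))        # stand-ins for the real point lists
list2 = rng.uniform(size=(2000, 10))

tree = cKDTree(list2)                       # built once over List2
# one batched call answers every query; workers=-1 uses all CPU cores
dist, idx = tree.query(list1, k=1, workers=-1)
```

dist[i] and idx[i] give the distance to, and the List2 index of, the nearest neighbor of list1[i].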
It's not possible to tell you which is the most efficient algorithm without knowing anything about the distribution of points in the two lists. However, for a first guess...
First algorithm: doesn't work. It fails for two reasons: (1) a wrong assumption (I assumed the bounding hulls are disjoint), and (2) a misreading of the question (it doesn't find the shortest edge for every pair of points).
...compute the convex hull of the two sets: the closest points must be on the hyperfaces of the two hulls through which the line between the two centres of gravity passes.
You can compute the convex hull by computing the centre of gravity (assuming all points have equal mass) and ordering the list from furthest from the centre to least far. Then take the furthest-away point in the list, add it to the convex hull, and remove all points that lie within the hull computed so far (you will need to compute lots of 10-d hypertriangles to do this). Repeat until nothing is left in the list that is not on the convex hull.
Second algorithm: partial
Compute the convex hull of List2. For each point of List1: if the point is outside the convex hull, find the hyperface as in the first algorithm; the nearest point must be on that face. If it is on the face, likewise. If it is inside, you can still find the hyperface by extending the line from List2's centre of gravity past the point: the nearest point must lie inside the ball that contains that hyperface. Here, though, you need a different algorithm to get the nearest point, perhaps the kd-tree approach.
Performance
When List2 is something like evenly distributed, or normally distributed, through some fairly oblique shape, this will do a good job of reducing the number of points under consideration, and it should be compatible with the kd-tree suggestion.
There are some horrible worst cases, though: if List2 contains only points on the surface of a torus whose geometric centre is the centre of gravity of the list, then the convex hull will be very expensive to calculate, and will not help much in reducing the number of points under consideration.
My evaluation
These kinds of geometric techniques may be a useful complement to the kd-trees approach of other posters, but you need to know a little about the distribution of points before you can determine whether they are worth applying.
The kd-tree is pretty fast. I've used the algorithm in this paper and it works well: Bentley, "K-d trees for semidynamic point sets".
I'm sure there are libraries around, but it's nice to know what's going on sometimes - Bentley explains it well.
Basically, there are a number of ways to search a tree: nearest N neighbors, all neighbors within a given radius, nearest N neighbors within a radius. Sometimes you want to search for bounded objects.
The idea is that the kd-tree partitions the space recursively. Each node is split in two along one of the dimensions of the space you are in, ideally perpendicular to the node's longest dimension. You should keep splitting the space until you have about 4 points in each bucket.
Then for every query point, as you recursively visit nodes, you check the distance from the query point to the partition wall of the node you are in. You descend both children (the one you are in and its sibling) if the distance to the partition wall is smaller than the search radius. If the wall is beyond the radius, just search the child containing the query point.
When you get to a bucket (leaf node), you test the points in there to see if they are within the radius.
If you want the closest point, you can start with a massive radius, and pass a pointer or reference to it as you recurse - and in that way you can shrink the search radius as you find close points - and home in on the closest point pretty fast.
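The whole scheme sketched above (median splits, ~4-point buckets, descending the near side first, and shrinking the radius as closer points are found) fits in a few dozen lines; this toy version is for illustration, not performance:

```python
import numpy as np

def build(points, idx, depth=0):
    """Split on the median along axis depth % d until buckets hold ~4 points."""
    if len(idx) <= 4:
        return ("leaf", idx)
    axis = depth % points.shape[1]
    idx = idx[np.argsort(points[idx, axis])]
    m = len(idx) // 2
    wall = points[idx[m], axis]             # the partition wall for this node
    return ("node", axis, wall,
            build(points, idx[:m], depth + 1),
            build(points, idx[m:], depth + 1))

def nearest(points, tree, q, best=(np.inf, -1)):
    """Return (distance, index) of the point nearest q, shrinking the
    search radius (best[0]) as closer points are found."""
    if tree[0] == "leaf":
        for i in tree[1]:
            d = float(np.linalg.norm(points[i] - q))
            if d < best[0]:
                best = (d, int(i))
        return best
    _, axis, wall, lo, hi = tree
    near, far = (lo, hi) if q[axis] < wall else (hi, lo)
    best = nearest(points, near, q, best)     # descend the near side first
    if abs(q[axis] - wall) < best[0]:         # wall closer than current radius:
        best = nearest(points, far, q, best)  # the far side may hold a closer point
    return best

rng = np.random.default_rng(10)
pts = rng.uniform(size=(300, 3))
tree = build(pts, np.arange(len(pts)))
q = rng.uniform(size=3)
d, i = nearest(pts, tree, q)
```

Starting from an effectively infinite radius and passing the best-so-far down the recursion is exactly the "home in on the closest point" behavior described above.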
(A year later) k-d trees that quit early, after looking at say 1M of all 200M points, can be much faster in high dimensions. The results are only statistically close to the absolute nearest, depending on the data and metric; there's no free lunch. (Note that sampling 1M points, and building a k-d tree on only those 1M, is quite different, and worse.)
FLANN does this for image data with dim=128, and is, I believe, in OpenCV. A local mod of the fast and solid SciPy cKDTree also has cutoff= .