Given a 3D point cloud, how can I find the smallest bounding sphere that contains a given percentage of points?
I.e. if I have a point cloud with some noise, and I want to ignore 5% of outliers, how can I get the smallest sphere that contains the 95% remaining points, if I do not know which points are the outliers?
Example: I want to find the green sphere, not the red sphere:
I am looking for a reasonably fast and simple algorithm. It does not have to find the optimal solution, a reasonable approximation is fine as well.
I know how to calculate the approximate bounding sphere for 100% of points, e.g. with Ritter's algorithm.
How can I generalize this to an algorithm that finds the smallest sphere containing x% of points?
Just an idea: binary search.
First, use one of the bounding sphere algorithms to find the 100% bounding sphere.
Fix the centerpoint of the 95% sphere to be the same as the centerpoint of the 100% sphere. (There is no guarantee it is, but you say you're ok with an approximate answer.) Then use binary search on the radius of the sphere until you get 95% +- epsilon points inside.
Assuming the points are sorted by their distance (or squared distance, to be slightly faster) from the centerpoint, for a fixed radius r it takes O(log n) operations to find the number of points inside the sphere with radius r, e.g. by using another binary search. The binary search for the right r itself requires a logarithmic number of such evaluations. Therefore the whole search should take just O(log² n) steps after you have found the 100% sphere.
Edit: if you think the center of the reduced sphere could be too far away from that of the full sphere, you can recalculate the bounding sphere, or just the center of mass of the point set, each time after throwing away some points. Each recalculation should take no more than O(n). After recalculation, re-sort the points by their distance from the new centerpoint. Since you expect them to be already nearly sorted, you can rely on bubble sort, which for nearly sorted data works in O(n + epsilon). Remember that there will be just a logarithmic number of these tests needed, so you should be able to get away with close to O(n log² n) for the whole thing.
It depends on exactly what performance you're looking for and what you're willing to sacrifice for it. (I would be happy to learn that I'm wrong and there's a good exact algorithm for this.)
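A minimal sketch of this idea, under some assumptions of mine: the centre is the centroid of the points currently kept (the "center of mass" variant from the edit), and instead of a binary search on r the radius is read directly off the sorted distances, which gives the same result for a fixed centre. The function name and the `rounds` parameter are made up for illustration.

```python
import math

def trimmed_sphere(points, keep=0.95, rounds=10):
    """Approximate sphere holding a `keep` fraction of `points` (tuples of floats)."""
    n_keep = max(1, round(keep * len(points)))
    dim = len(points[0])
    kept = list(points)
    best = None
    for _ in range(rounds):
        # Re-centre on the centroid of the points currently kept.
        center = tuple(sum(p[i] for p in kept) / len(kept) for i in range(dim))
        # Sort all points by distance from that centre and keep the closest ones.
        kept = sorted(points, key=lambda p: math.dist(p, center))[:n_keep]
        radius = math.dist(kept[-1], center)
        if best is not None and radius >= best[1]:
            break  # no further improvement
        best = (center, radius)
    return best
```

Python's built-in sort is already fast on nearly sorted input, so it stands in for the bubble sort mentioned above.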
The algorithm of ryann is not that bad. I suggested robustifying with a geometric median then came to this sketch:
compute the NxN inter-distances in O(N^2)
sum each row of this matrix (= the distance of one point to the others) in O(N^2)
sort the obtained "crowd" distance in O(N*log N)
(the point with smallest distance is an approximation of the geometric median)
remove the 5% largest in O(1)
here we just consider largest crowd-distance as outliers,
instead of taking the largest distance from the median.
compute radius of obtained sphere in O(N)
Of course, it is also sub-optimal, but it should perform a bit better in the case of far outliers. Overall cost is O(N^2).
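The sketch above could look roughly like this in plain Python. The function name is hypothetical, and centring the final sphere on the point with the smallest crowd distance is my reading of the "approximation of the geometric median" remark, not something the sketch states explicitly.

```python
import math

def crowd_trimmed_sphere(points, keep=0.95):
    """O(N^2) sketch: drop the points with the largest summed distance to all others."""
    n = len(points)
    # Row sums of the N x N distance matrix: each point's "crowd" distance.
    crowd = [sum(math.dist(p, q) for q in points) for p in points]
    order = sorted(range(n), key=crowd.__getitem__)
    n_keep = max(1, round(keep * n))
    kept = [points[i] for i in order[:n_keep]]
    # Assumption: use the smallest-crowd-distance point as the centre.
    center = points[order[0]]
    radius = max(math.dist(center, p) for p in kept)
    return center, radius, kept
```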
I would iterate the following two steps until convergence:
1) Given a group of points, find the smallest sphere enclosing 100% of the points and work out its centre.
2) Given a centre, find the group of points containing 95% of the original number which is closest to the centre.
Each step reduces (or at least does not increase) the radius of the sphere involved, so you can declare convergence when the radius stops decreasing.
In fact, I would iterate from multiple random starts, each start produced by finding the smallest sphere that contains all of a small subset of points. I note that if you have 10 outliers, and you divide your set of points into 11 parts, at least one of those parts will not have any outliers.
(This is very loosely based on https://en.wikipedia.org/wiki/Random_sample_consensus)
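Here is a rough sketch of the two-step iteration with random starts. It assumes Ritter's approximate sphere (mentioned in the question) as the "smallest sphere" routine, so the radius is not strictly guaranteed to shrink each round; the loop simply stops at the first round that does not improve. All function names and the `starts`/`subset` parameters are made up for illustration.

```python
import math
import random

def ritter_sphere(points):
    """Ritter's approximate bounding sphere (not the exact minimum)."""
    p0 = points[0]
    a = max(points, key=lambda p: math.dist(p, p0))
    b = max(points, key=lambda p: math.dist(p, a))
    center = tuple((x + y) / 2 for x, y in zip(a, b))
    radius = math.dist(a, b) / 2
    for p in points:
        d = math.dist(p, center)
        if d > radius:
            # Grow the sphere just enough to include p.
            new_radius = (radius + d) / 2
            shift = (d - new_radius) / d
            center = tuple(c + (x - c) * shift for c, x in zip(center, p))
            radius = new_radius
    return center, radius

def shrink_from(points, start_center, n_keep, max_iter=50):
    center, best = start_center, None
    for _ in range(max_iter):
        # Step 2: the n_keep points closest to the current centre.
        group = sorted(points, key=lambda p: math.dist(p, center))[:n_keep]
        radius = math.dist(group[-1], center)
        if best is not None and radius >= best[1]:
            break
        best = (center, radius)
        # Step 1: re-fit a sphere to that group and take its centre.
        center, _ = ritter_sphere(group)
    return best

def trimmed_sphere_multistart(points, keep=0.95, starts=20, subset=5, seed=0):
    rng = random.Random(seed)
    n_keep = max(1, round(keep * len(points)))
    results = []
    for _ in range(starts):
        # Each start: the sphere around a small random subset of the points.
        sample = rng.sample(points, min(subset, len(points)))
        start_center, _ = ritter_sphere(sample)
        results.append(shrink_from(points, start_center, n_keep))
    return min(results, key=lambda cr: cr[1])
```

The random starts matter: a single start from the 100% sphere can sit halfway between the cloud and a far outlier and never escape.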
The distance from the average point location would probably give a reasonable indication if a point is an outlier or not.
The algorithm might look something like:
1. Find bounding sphere of points
2. Find average point location
3. Choose the point on the bounding sphere that is farthest from the average location, remove it as an outlier
4. Repeat steps 1-3 until you've removed 5% of points
Find the Euclidean minimum spanning tree, and check the edges in descending order of length. For each edge, consider the sets of points in the two connected trees you get by deleting the edge.
If the smaller set of points is less than 5% of the total, and the bounding sphere around the larger set of points doesn't overlap it, then delete the smaller set of points. (This condition is necessary in case you have an 'oasis' of empty space in the center of your point cloud).
Repeat this until you hit your threshold or the lengths are getting 'small enough' that you don't care to delete them.
I'm looking for an algorithm that can quickly (I'm heavily constrained by performance) find a point inside of a circle, where this point is outside of all rectangles in a provided set (these rectangles can be rotated).
Or alternatively, to find a circle A with its center inside a circle B, where circle A does not intersect with a set of line segments.
The only solution I can come up with is to just loop through samples of points and then loop through the rectangles for each of them. But since my space is continuous, that's quite a pain. I'm basically satisfied with just a single point that doesn't intersect, but there will be cases where no such points exist. In the latter case I would ideally try to find a point with the least amount of intersections, or be able to find the answer that no such point exists.
Does anyone know of any algorithms that can accomplish this in something less than O(n^2)? Anything that would help identify good candidate points would be awesome too.
A typical example of the situation is this:
Lots of big rectangles, with small circle in which I hope to find a point (here indicated with blue). It's common that many of the rectangles fall completely outside of the circle, and also common that the circle is completely covered. There's only a small set of lengths and widths that tend to be used for the rectangles.
There are probably several interesting ways to do this. The simplest algorithm I can think of that gives a decent runtime is an algorithm as follows:
1. Treat all rectangles as a set of line segments.
2. Use an efficient algorithm to find the intersections of all line segments (for example the Bentley-Ottmann algorithm).
3. Create a list of points of interest (POIs) that are either a) the corners of a rectangle or b) the intersection points computed in 2.
4. Create a finer set of line segments such that each line segment terminates at a POI defined in 3.
5. Using the POIs and the finer set of line segments from 4, compute a constrained triangulation (for example a Constrained Delaunay Triangulation).
6. Pick any (unlabeled) triangle to start. Determine if the triangle lies within at least one rectangle (label it as a COVERED triangle) or not (label it as a FREE triangle). To do this you can use any point in polygon algorithm, for example ray-casting.
7. Run a depth-first or breadth-first search starting at this triangle and expanding to neighbors, taking care not to cross between any triangle pair that would require crossing a line segment defined in 4. Give every triangle visited the same label as the starting triangle.
8. Repeat 6-7 until all triangles are labeled (or all triangles covering the circle of interest are labeled).
9. The union of all FREE triangles intersected with the circle of interest yields precisely the points that are not covered by any rectangle and are within the circle.
Note, this algorithm is a bit general and can be improved by focusing only on the area around the circle (for example, only a bounding box region need be considered, with the bounding box encompassing all rectangles intersecting the circle).
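The seed-labeling test in step 6 can be sketched as follows. This is only the labeling step, not the whole pipeline. It assumes each rectangle is given as its four corners in order, and it swaps ray-casting for a convex-polygon sign test, which is enough for (possibly rotated) rectangles. Testing the centroid alone is valid only because the constrained triangulation guarantees a triangle never straddles a rectangle edge. Function names are hypothetical.

```python
def point_in_convex_polygon(pt, corners):
    """True if pt is inside or on the boundary of a convex polygon (corners in order)."""
    px, py = pt
    sign = 0
    n = len(corners)
    for i in range(n):
        ax, ay = corners[i]
        bx, by = corners[(i + 1) % n]
        # Cross product tells on which side of edge a->b the point lies.
        cross = (bx - ax) * (py - ay) - (by - ay) * (px - ax)
        if cross != 0:
            if sign == 0:
                sign = 1 if cross > 0 else -1
            elif (cross > 0) != (sign > 0):
                return False  # point is on different sides of two edges
    return True

def label_triangle(triangle, rectangles):
    """Label a triangle COVERED or FREE by testing its centroid against every rectangle."""
    cx = sum(x for x, _ in triangle) / 3
    cy = sum(y for _, y in triangle) / 3
    covered = any(point_in_convex_polygon((cx, cy), r) for r in rectangles)
    return "COVERED" if covered else "FREE"
```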
To analyze the runtime, consider the runtime of each key step:
Step 2 has a runtime of O((n+k) log n), where k is the number of intersections and n is the number of line segments.
Step 5 has a runtime of O(m log m), where m is the number of POIs; m is O(n+k).
Steps 6 and 7 should be analyzed together. In the worst case, each triangle would need O(n) computations to check for containment in a rectangle. Given that there would be O(m) triangles, this would yield an O(nm) bound. However, the purpose of the triangulation is to reuse the point in polygon computation for the seeding triangle to label as many neighboring triangles as possible. In practice the number of triangles that would require a point in polygon computation should be negligible. Therefore the runtime of this step is O(tn), where t is the number of triangles for which point in polygon computations are performed.
The runtime expected is, therefore, O((n+k) log n + t(n+k)), where k is the number of intersections in step 2 and t is the number of triangles for which point in polygon computations are performed. In the worst case this is O(n^2 log n), as you can create a pathological example with n^2 intersections, but this should be unlikely in practice. Likewise, the number t should be kept to a minimum to make this as efficient as possible. If both t << n and k << n^2, this would be quite efficient.
One approximation that could yield performance improvement:
Consider approximating the circle by a set of r line segments, and including these line segments in steps 1-5. While this is an approximation, it would potentially improve the runtime, as only triangles inside the circle would ever need to be considered.
I have a rectangular area where there are circles with equal radius. I want to find which circles overlap with other circles (the output is a list of 2-element sets of overlapping circles).
I know how to check if two of the circles overlap (the distance between their centers is less than the diameter). I can perform this check for every pair of circles, but I was wondering if there is a better algorithm (faster than O(n^2)).
EDIT
The number of circles is usually about 100 and overlappings won't happen very often.
Here is some context:
The rectangle is a battlefield in a game. The movement of the units is done on small steps and I'm trying to detect collisions between units.
Given the new explanation of the problem statement, I would recommend a different approach.
Overlay a square grid over the battlefield, with a grid step equal to one circle diameter. Every circle can overlap at most four cells. In each cell, keep a list of the overlapping circles (and update it on every move).
Detecting potential collisions will now take about four cell/circle tests per circle, i.e. close to linear time.
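A small sketch of the grid idea, assuming circle centres are (x, y) tuples and all circles share one radius. For simplicity it rebuilds the grid from scratch on each call rather than updating cell lists on every move as suggested above; the function name is made up.

```python
import math
from collections import defaultdict
from itertools import combinations

def overlapping_pairs(centers, radius):
    """Return index pairs (i, j), i < j, of circles whose centres are closer than 2*radius."""
    cell = 2 * radius  # grid step = one circle diameter
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(centers):
        # The circle's bounding box spans at most 2 x 2 cells.
        for gx in {math.floor((x - radius) / cell), math.floor((x + radius) / cell)}:
            for gy in {math.floor((y - radius) / cell), math.floor((y + radius) / cell)}:
                grid[(gx, gy)].append(idx)
    pairs = set()
    for members in grid.values():
        # Only circles registered in the same cell can possibly overlap.
        for i, j in combinations(members, 2):
            if math.dist(centers[i], centers[j]) < cell:
                pairs.add((i, j))
    return pairs
```

The set removes duplicates when two circles share more than one cell.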
For a simple solution, insert the centers in a 2d-tree and perform circular range queries around every center with a query radius 2R. In good conditions, this can be O(N Log(N)).
Alternatively, just sort the centers on X and try all circles in turn: by dichotomic search, locate the abscissa Xc and scan to Xc-2R and to Xc+2R, then check the 2D distance.
The cost of the dichotomic searches will be O(N Log(N)). If the circles are uniformly spread out in a square of side S, a stripe of width 4R contains 4RN/S circles, hence a total comparison cost of 4RN²/S. This is a good performance if S is large (think that for N tightly packed circles in a square, S~2R√N, hence 2N√N comparisons).
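The sort-on-X variant might be sketched like this. As a small deviation from the description, it scans only to the right of each circle (up to Xc+2R), so every pair is examined once instead of twice. The function name is hypothetical.

```python
import bisect
import math

def overlapping_pairs_sorted(centers, radius):
    """Sort centres on X, then compare each circle only with those within 2R in X."""
    order = sorted(range(len(centers)), key=lambda i: centers[i][0])
    xs = [centers[i][0] for i in order]
    reach = 2 * radius
    pairs = set()
    for pos, i in enumerate(order):
        # Dichotomic search for the last abscissa <= Xc + 2R.
        hi = bisect.bisect_right(xs, xs[pos] + reach)
        for j in order[pos + 1:hi]:
            if math.dist(centers[i], centers[j]) < reach:
                pairs.add((min(i, j), max(i, j)))
    return pairs
```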
Direct answer: You cannot get better than O(n^2) in general since the circles could potentially all overlap, so you have to generate n^2 answers.
If you get more specific, you might get better answers. For example, if what you are really trying to do is find bounding spheres in a 2D simulation, you can profit from the fact that entities only move so far between frames, if the circles are sparse it's different from when they are tightly packed, etc. So let us know more about what it's all about.
EDIT: You edited your question - you indeed are looking for collision detection in a 2D simulation. If you check out https://en.wikipedia.org/wiki/Collision_detection , they point to several algorithms for exactly your case.
I like the one detailed right on that page where you keep one list of bounding intervals per axis (2 in "2D") and only need to "work hard" when those bounding intervals (which are themselves by definition one-dimensional) change their "overlap state". This removes the O(n²) for well-behaved cases. They don't give an estimate for the complexity of that, but as it basically comes down to sorting, it looks more or less O(n log n) to me, and less when there are only minimal changes between frames.
I am interested in finding the diameter of two point sets, in 128 dimensions. The first has 10000 points and the second 1000000. For that reason I would like to do something better than the naive approach, which takes O(n²). The algorithm should be able to handle any number of points and dimensions, but I am currently most interested in these two particular data sets.
I am much more interested in speed than in accuracy. Based on this, I would find the (approximate) bounding box of the point set by computing the min and max value per coordinate, which takes O(n*d) time. Then, if I find the diameter of this box, the problem is solved.
In the 3D case, I could find the diagonal of one face, since I know its two edges, and then apply the Pythagorean theorem again with the third edge, which is perpendicular to that face. I am not sure about this, however, and I can't see how to generalize it to d dimensions.
An interesting answer can be found here, but it seems to be specific for 3 dimensions and I want a method for d dimensions.
Interesting paper: On computing the diameter of a point set in high dimensional Euclidean space. Link. However, implementing the algorithm seems too much for me in this phase.
The classic 2-approximation algorithm for this problem, with running time O(nd), is to choose an arbitrary point and then return the maximum distance to another point. The diameter is no smaller than this value and no larger than twice this value.
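As a minimal sketch (points as tuples of floats, the first point taken as the arbitrary one, function name made up):

```python
import math

def diameter_2approx(points):
    """O(n*d) estimate: the true diameter lies between the result and twice the result."""
    anchor = points[0]  # any point of the set will do
    return max(math.dist(anchor, p) for p in points)
```

The bound follows from the triangle inequality: any two points are each within the returned distance of the anchor, so they are at most twice that distance apart.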
I would like to add a comment, but not enough reputation for that...
I just want to warn other readers that the "bounding box" solution is very inaccurate. Take for example the Euclidean ball of radius one. This set has diameter two, but its bounding box is [-1, 1]^d, which has diameter twice the square root of d. For d = 128, this is already a very bad approximation.
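To make the warning concrete, here is a small sketch of the box diagonal in d dimensions: the Pythagorean theorem generalizes to the square root of the sum of the squared side lengths. The function name is made up for illustration.

```python
import math

def bounding_box_diagonal(points):
    """Diagonal of the axis-aligned bounding box: sqrt of the sum of squared side lengths."""
    dim = len(points[0])
    return math.sqrt(sum(
        (max(p[i] for p in points) - min(p[i] for p in points)) ** 2
        for i in range(dim)
    ))
```

For the 256 points +-e_i on the unit sphere in 128 dimensions, every side of the box has length 2, so the diagonal is 2*sqrt(128), roughly 22.6, while the true diameter of that set is 2.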
For a crude estimate, I would stay with David Eisenstat's answer.
There is a precision-based algorithm which performs very well in any dimension, based on computing the dimensions of an axial bounding box.
The idea is that it's possible to find lower and upper bounds for the axis bounding box length function, since its partial derivatives are bounded and depend on the angle between the axes.
The limit of the local maxima derivatives between two axes in 2D space can be computed as:
sin(a/2)*(1 + tan(a/2))
That means that, for example, for 90 degrees between axes the bound is 1.42 (sqrt(2)).
This reduces to a/2 as a => 0, so the upper bound is proportional to the angle.
For a multidimensional case the formula varies slightly, but still it's easy to compute.
So, the search for local minima converges in logarithmic time.
The good news is that we can run the search for such local maxima in parallel.
Also, we can filter out both the regions of the search, based on the best result achieved so far, and the points themselves that are below the lower limit of the search in the worst region.
The worst case of the algorithm is where all of the points are placed on the surface of a sphere.
This can be further improved: when we detect a local search which operates on just a few points, we switch to brute force for that particular axis. It works fast, because we need only the points which are subject to that particular local search, which can be determined as the points bounded by two opposite spherical cones of a particular angle sharing the same axis.
It's hard to figure out the big O notation, because it depends on the desired precision and the distribution of points (bad when most of the points are on a sphere's surface).
The algorithm I use is here:
1. Set the initial angle a = pi/2.
2. Take one axis for each dimension. The angle and the axes form the initial 'bucket'.
3. For each axis, compute the span on that axis by projecting all the points onto the axis and finding the min and max of the coordinates on the axis.
4. Compute the upper and lower bounds of the diameter of interest. They are based on the formula sin(a/2)*(1 + tan(a/2)), multiplied by an asymmetry coefficient computed from the length of the current axis projections.
5. For the next step, kill all of the points which fall under the lower bound in each dimension at the same time.
6. For each axis, if the number of points above the upper bound is less than some reasonable amount (experimentally computed), then compute using brute force (N^2) on the set of points in question, adjust the lower bound, and kill the axis for the next step.
7. For the next step, kill all of the axes which have all of their points under the lower bound.
8. If the precision is satisfactory, i.e. (upper bound - lower bound) < epsilon, then return the upper bound as the result.
9. For each surviving axis, there is a virtual cone on that axis (actually, two opposite cones), which covers some area on a virtual sphere which encloses a face of the cube. If I'm not mistaken, its angle would be a * sqrt(2). Set the new angle to a / sqrt(2). Create a whole bucket of new axes (2 * number of dimensions), so that the new cone areas cover the initial cone area. It's the hard part for me, as I don't have enough imagination for the n>3-dimensional case.
10. Continue from step (3).
You can parallelize the procedure, synchronizing the limits computed so far for the points from (5) through (7).
I'm going to summarize the algorithm proposed by Timothy Shields.
1. Pick random point x.
2. Pick point y furthest from x.
3. If not done, let x = y, and go to step 2.
The more times you repeat, the more accurate the result will be... ??
EDIT: actually this algorithm is not very good. Think about a 2D rectangle with vertices ABCD. There are two maxima: between AC and BD, which are separated by a sizable valley. This algorithm will get stuck at one or the other 50/50. If AC is slightly larger than BD, you'll be getting the wrong answer 50% of the time no matter how many times you iterate. Other regular polygons have the same issue, and in higher dimensions it is even worse.
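A small sketch that reproduces the problem. The stopping rule (stop when the distance no longer grows) is my assumption for "done", and the function name is made up. The example is a near-rectangle in which diagonal AC is slightly longer than BD.

```python
import math

def iterated_farthest(points, start, max_iter=100):
    """Hop to the farthest point repeatedly; stop when the distance stops growing."""
    x, best = start, 0.0
    for _ in range(max_iter):
        y = max(points, key=lambda p: math.dist(x, p))
        d = math.dist(x, y)
        if d <= best:
            break
        best, x = d, y
    return best

A, B, C, D = (0.0, 0.0), (4.0, 0.0), (4.2, 3.0), (0.0, 3.0)
quad = [A, B, C, D]
from_a = iterated_farthest(quad, A)  # finds AC, the true diameter
from_b = iterated_farthest(quad, B)  # bounces between B and D and reports BD
```

Starting from B or D the search never leaves the shorter diagonal, however many iterations are allowed.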
I'm looking for an efficient solution to the following problem: for a given set of points in n-dimensional Euclidean space, find the member of this set that minimizes the total distance to the other points in the set.
The obvious naïve approach is quadratic, so I'm looking for something less than quadratic.
My first thought was that all I need is to find the center of the bounding sphere and then find the point in the set closest to it. But this is actually not true: imagine a right triangle. All its vertices are equidistant from that center, yet exactly one vertex meets our requirements.
It would be nice if someone could shed some light on this issue.
What minimizes the distance to all of the points is their average. Only a guess, but after you find the average you could find the point closest to it. As correctly pointed out in comments, the median instead of the average will actually minimize the distance (the average will minimize squared distance). The median can also be calculated in O(n). For high dimensional datasets this solution would be O(n*m) of course, where m is the number of dimensions.
Also some links:
See accepted answer here: Algorithm to find point of minimum total distance from locations
And link provided by mcdowella: http://en.wikipedia.org/wiki/Geometric_median
I am making this up as I go along, but there appears to be a close connection between "best point of a set" and "best point" in convex optimization.
Your score function is a sum of distances. Each distance is convex U-shaped (OK V-shaped in this case) so their sum is convex U-shaped. In particular it has a perfectly good derivative everywhere except at points in the set, and this derivative is optimistic - if you take the value at a point and its derivative, neglecting any point at the point you are looking at, then predictions based on this will be optimistic - the line formed using the derivative lies almost entirely beneath the correct answer but grazes it at a single point.
This leads to the following algorithm:
Repeatedly
Pick a point at random and look to see if it is the best point so far. If so, take note of it. Take the derivative of the sum of distances at this point. Use this, and the value at that point, to work out the predicted sum of distances at every other point, and discard as possible answers the points where this prediction is worse than the best answer so far (although you still need to take them into account when working out distances and derivatives). These will be the points on the far side of a plane drawn through the chosen point normal to the derivative.
Now discard the chosen point as a contender as well and repeat if there are any points left to consider.
I would expect this to be something like n log n on randomly chosen points. However, if the set of points form the vertices of a regular polygon in n dimensions then it will cost N^2, discarding only the chosen point each time - any of the N points is in fact a correct answer and they all have the same sum of distances from each other.
I will of course up-vote anybody who can confirm or deny this general principle for finding the best of a set of given points under a convex objective function.
OK - I was interested enough in this to program this up - so I have 200+ lines of Java to dump in here if anybody cares. In 2 dimensions it's very fast, but at 20 dimensions you gain only a factor of two or so - this is reasonably understandable - each iteration cuts off points by projecting the problem down to a line and chopping off a fraction of the points outside the line. A randomly chosen point will be about half as far away from the centre as the other points - and very roughly you can expect the projection to cut off all but some multiple of the d-th root of 1/2 so as d increases the fraction of points you can discard in each iteration reduces.
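This is not the Java program mentioned above, just a short sketch of the same idea as I read it. The gradient skips the chosen point itself, which is a valid subgradient choice. The small tolerance is an assumption added to guard against floating-point rounding. The function name is made up.

```python
import math
import random

def best_point_by_cuts(points, seed=0):
    """Find the set member with the smallest sum of distances, discarding by linear bounds."""
    rng = random.Random(seed)
    dim = len(points[0])
    alive = set(range(len(points)))
    best_idx, best_val = None, math.inf
    while alive:
        i = rng.choice(sorted(alive))
        p = points[i]
        # Value and (sub)gradient of the sum of distances at p, over ALL points.
        val, grad = 0.0, [0.0] * dim
        for q in points:
            d = math.dist(p, q)
            val += d
            if d > 0:
                for k in range(dim):
                    grad[k] += (p[k] - q[k]) / d
        if val < best_val:
            best_idx, best_val = i, val
        alive.discard(i)
        # Convexity gives f(x) >= val + grad . (x - p); drop x if that bound beats nothing.
        alive = {
            j for j in alive
            if val + sum(g * (points[j][k] - p[k]) for k, g in enumerate(grad))
            <= best_val + 1e-9
        }
    return best_idx, best_val
```

Each round still costs O(n*d), so the saving comes entirely from how many candidates each cut removes.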
List1 contains a high number (~7^10) of N-dimensional points (N <=10), List2 contains the same or fewer number of N-dimensional points (N <=10).
My task is this: for every point in List1, I want to find which point in List2 is closest to it (Euclidean distance) and subsequently perform some operation on it. I was doing it the simple nested-loop way when I didn't have more than 50 points in List1, but with 7^10 points this obviously takes up a lot of time.
What is the fastest way to do this? Any concepts from Computational Geometry might help?
EDIT: I have the following in place: I have built a kd-tree out of List2 and I am now doing a nearest-neighbor search for each point in List1. As I originally pointed out, List1 has 7^10 points, so although I am saving on the brute-force Euclidean distance computation for every pair, the sheer number of points in List1 still takes a lot of time. Is there any way I can improve this?
Well a good way would be to use something like a kd-tree and perform nearest neighbour searching. Fortunately you do not have to implement this data structure yourself, it has been done before. I recommend this one, but there are others:
http://www.cs.umd.edu/~mount/ANN/
It's not possible to tell you which is the most efficient algorithm without knowing anything about the distribution of points in the two solutions. However, for a first guess...
First algorithm doesn't work — for two reasons: (1) a wrong assumption - I assume the bounding hulls are disjoint, and (2) a misreading of the question - it doesn't find the shortest edge for every pair of points.
...compute the convex hull of the two sets: the closest points must be on the hyperface on the two hulls through which the line between the two centres of gravity passes.
You can compute the convex hull by computing the centre point (the centre of gravity, assuming all points have equal mass) and ordering the lists from furthest from the centre to least far. Then take the furthest away point in the list, add this to the convex hull, and then remove all points that are within the so-far computed convex hull (you will need to compute lots of 10d hypertriangles to do this). Repeat until there is nothing left in the list that is not on the convex hull.
Second algorithm: partial
Compute the convex hull for List2. For each point of List1, if the point is outside the convex hull, then find the hyperface as for first algorithm: the nearest point must be on this face. If it is on the face, likewise. If it is inside, you can still find the hyperface by extending the line past the point from List1: the nearest point must be inside the ball that includes the hyperface to List2's centre of gravity: here, though, you need a new algorithm to get the nearest point, perhaps the kd-tree approach.
Performance
When List2 is something like evenly distributed, or normally distributed, through some fairly oblique shape, this will do a good job of reducing the number of points under consideration, and it should be compatible with the kd-tree suggestion.
There are some horrible worst cases, though: if List2 contains only points on the surface of a torus whose geometric centre is the centre of gravity of the list, then the convex hull will be very expensive to calculate, and will not help much in reducing the number of points under consideration.
My evaluation
These kinds of geometric techniques may be a useful complement to the kd-trees approach of other posters, but you need to know a little about the distribution of points before you can determine whether they are worth applying.
kd-tree is pretty fast. I've used the algorithm in this paper and it works well: Bentley, "K-d trees for semidynamic point sets".
I'm sure there are libraries around, but it's nice to know what's going on sometimes - Bentley explains it well.
Basically, there are a number of ways to search a tree: Nearest N neighbors, All neighbors within a given radius, nearest N neighbors within a radius. Sometimes you want to search for bounded objects.
The idea is that the kdTree partitions the space recursively. Each node is split in 2 down the axis in one of the dimensions of the space you are in. Ideally it splits perpendicular to the node's longest dimension. You should keep splitting the space until you have about 4 points in each bucket.
Then for every query point, as you recursively visit nodes, you check the distance from the query point to the partition wall for the particular node you are in. You descend both nodes (the one you are in and its sibling) if the distance to the partition wall is smaller than the search radius. If the wall is beyond the radius, just search children of the node you are in.
When you get to a bucket (leaf node), you test the points in there to see if they are within the radius.
If you want the closest point, you can start with a massive radius, and pass a pointer or reference to it as you recurse - and in that way you can shrink the search radius as you find close points - and home in on the closest point pretty fast.
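A compact sketch of that search, with two simplifications of mine: it cycles through the axes instead of splitting the longest dimension, and it returns the shrinking radius rather than passing a reference. This is not Bentley's semidynamic structure, only the basic bucketed kd-tree. Names are made up for illustration.

```python
import math

def build_kdtree(points, depth=0, leaf_size=4):
    """Split recursively until each bucket holds at most leaf_size points."""
    if len(points) <= leaf_size:
        return {"leaf": points}
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "axis": axis,
        "split": points[mid][axis],  # position of the partition wall
        "left": build_kdtree(points[:mid], depth + 1, leaf_size),
        "right": build_kdtree(points[mid:], depth + 1, leaf_size),
    }

def nearest(node, query, best=(math.inf, None)):
    """Return (distance, point); best[0] acts as the shrinking search radius."""
    if "leaf" in node:
        for p in node["leaf"]:
            d = math.dist(p, query)
            if d < best[0]:
                best = (d, p)
        return best
    diff = query[node["axis"]] - node["split"]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Visit the sibling only if the wall is closer than the current radius.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best
```

Build the tree once from List2, then call `nearest(tree, q)` for each point q in List1.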
(A year later) kd trees that quit early, after looking at say 1M of all 200M points,
can be much faster in high dimensions.
The results are only statistically close to the absolute nearest, depending on the data and metric;
there's no free lunch.
(Note that sampling 1M points, and kd tree only those 1M, is quite different, worse.)
FLANN does this for image data with dim=128,
and is I believe in opencv. A local mod of the fast and solid
SciPy cKDTree also has cutoff= .