Point Set Similarity Metrics - computational-geometry

I am looking for some quick metrics to measure the similarity between two sets of points, such as the distance between their centroids, the amount of overlap, and so on.
I would ideally like to compute a few more metrics to improve on the basic similarity measurements, but the key requirement is that they need to be fast and scalable.

Not knowing the application, similarity could mean anything.
The number of points.
The dimension of the points.
The average, variances, and covariances of the coordinates.
The position, volume, or orientation of a bounding box of type ...
The radius and position of a smallest enclosing ball.
The volume and ... of the convex hull.
The ... of clusters of kind ...
The density of points in ...
The ... of ...
...
There is probably a real-world reason R why a set of points P is distributed in a certain way. So try to deduce R, and compare the R's for similarity instead.
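To make the quick metrics above concrete, here is a minimal sketch in Python/NumPy (an illustration under my own assumptions rather than a prescription) that computes a few of the O(n*d) summaries listed above and compares them between two sets; each summary is a single pass over the points, so it scales linearly.
import numpy as np
def quick_summary(points):
    # Cheap O(n*d) summaries of a point set (one point per row).
    points = np.asarray(points, dtype=float)
    return {
        "count": len(points),
        "centroid": points.mean(axis=0),
        "bbox_min": points.min(axis=0),
        "bbox_max": points.max(axis=0),
    }
def bbox_overlap(a, b):
    # Volume of the bounding-box intersection divided by the volume
    # of the box enclosing both sets (a value in [0, 1]).
    lo = np.maximum(a["bbox_min"], b["bbox_min"])
    hi = np.minimum(a["bbox_max"], b["bbox_max"])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    outer = np.prod(np.maximum(a["bbox_max"], b["bbox_max"]) -
                    np.minimum(a["bbox_min"], b["bbox_min"]))
    return inter / outer if outer > 0 else 1.0
def quick_similarity(P, Q):
    a, b = quick_summary(P), quick_summary(Q)
    return {
        "centroid_distance": float(np.linalg.norm(a["centroid"] - b["centroid"])),
        "bbox_overlap": float(bbox_overlap(a, b)),
        "count_ratio": min(a["count"], b["count"]) / max(a["count"], b["count"]),
    }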

Related

Find smallest circle with maximum density of points from a given set

Given the (lat, lon) coordinates of a group of n locations on the surface of the earth, find a (lat, lon) point c and a value r > 0 such that we maximize the density d of locations per square mile, say, over the surface area contained by the circle defined by c and r.
At first I thought maybe you could solve this using linear programming. However, the density depends on the area, which depends on r squared, a quadratic term, so I don't think the problem is amenable to linear programming.
Is there a known method for solving this kind of thing? Suppose you simplify the problem to (x, y) coordinates on the Cartesian plane. Does that make it easier?
You've got two variables, c and r, that you're trying to find so as to maximize the density, which is a function of c and r (and of the locations, which are constant). So maybe a hill-climbing, gradient-descent, or simulated-annealing approach might work. You can make a pretty good guess for your first value: just use the centroid of the locations. I think the local maximum you reach from there would be a global maximum.
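For the simplified (x, y) version, a rough hill-climbing sketch of that suggestion might look like the following; note that the r_min floor (needed to keep the density objective bounded as r shrinks) and the step-size schedule are my own assumptions, so treat this as a starting point rather than a finished optimizer.
import numpy as np
def density(points, c, r):
    # Locations per unit area inside the circle with center c and radius r.
    inside = np.linalg.norm(points - c, axis=1) <= r
    return inside.sum() / (np.pi * r * r)
def hill_climb_densest_circle(points, r_min=1.0, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    c = points.mean(axis=0)                          # start at the centroid
    r = max(r_min, np.linalg.norm(points - c, axis=1).mean())
    best = density(points, c, r)
    step = np.ptp(points, axis=0).max() * 0.1        # initial step ~10% of the extent
    for _ in range(iters):
        dc = rng.normal(scale=step, size=2)          # random perturbation of the center
        dr = rng.normal(scale=step * 0.5)            # random perturbation of the radius
        r_new = max(r_min, r + dr)
        d = density(points, c + dc, r_new)
        if d > best:                                 # keep only improving moves
            c, r, best = c + dc, r_new, d
        else:
            step *= 0.98                             # slowly shrink the step size
    return c, r, best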
Steps:
Cluster your points using a density-based clustering algorithm¹;
Calculate the density of each cluster;
Recursively (or iteratively) sub-cluster the points in the most dense cluster;
The clustering algorithm should ignore the outliers, or put them in a cluster of their own. This way, outliers that sit in dense regions are kept while low-density outliers are weeded out.
Keep track of the cluster with the highest density observed so far. Return when you finally reach a cluster made of a single point (see the code sketch below).
This algorithm will only work when you have clusters like the ones shown below, because the recursive exploration keeps producing similarly shaped sub-clusters:
The algorithm will fail with awkwardly shaped clusters like this one: even though the triangles are the most densely placed points, because they form a donut shape they report a far lower density with respect to a circle centered at [0, 0]:
¹ One density-based clustering algorithm that will work for you is DBSCAN.
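A rough sketch of that recursion, using scikit-learn's DBSCAN; the eps-halving schedule and the convex-hull area used as the density denominator are assumptions I have made to keep the example short, not part of the answer above.
import numpy as np
from scipy.spatial import ConvexHull
from sklearn.cluster import DBSCAN
def cluster_density(points):
    # Points per unit area of the cluster's convex hull (2-D points assumed).
    if len(points) < 3:
        return float("inf")
    return len(points) / ConvexHull(points).volume   # .volume is the area in 2-D
def densest_subcluster(points, eps=1.0, min_samples=5):
    points = np.asarray(points, dtype=float)
    best_points, best_density = points, cluster_density(points)
    while True:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
        clusters = [points[labels == k] for k in set(labels) if k != -1]   # -1 marks noise
        if not clusters:
            return best_points, best_density
        densest = max(clusters, key=cluster_density)
        d = cluster_density(densest)
        if d <= best_density:
            return best_points, best_density         # no further improvement
        points, best_points, best_density = densest, densest, d
        eps *= 0.5                                   # tighten eps before sub-clustering again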

Retrieve k-nearest spheres within a limited range

I would like to know if I am missing any acceleration structure that is designed for retrieving k-nearest spheres within a range.
The context of my question is molecular visualization, specifically, I need to retrieve k-nearest spheres to a point to produce a function that will be used to guide sphere tracing step length.
To simplify, the search can be limited in range to the point being tested.
Everything I have seen in the articles handles k-nearest points to a point, but my case is different, since I want the spheres closest to a point. It seems possible to adapt k-d trees by changing the point test to a sphere test, but I believe that would hurt performance. So I wonder if there is a better structure, or whether I should use and adapt k-d trees.
Currently I am using a hybrid Bounding Volume Hierarchy, but I think the search performance could be better with another structure, since I have a lot of overlap between bounding volumes due to the nature of the molecules.
PS: I don't care much about construction time. I want good search performance and decent memory usage.
You could use a 3-step approach:
Find the nearest neighbor using the center-points of the spheres.
For this nearest neighbor you subtract its radius and add the maximum radius over all spheres. Then you perform a spherical range query with this new distance. This will return the center points of all spheres that may be the closest to your query point.
Then you manually calculate the actual distance for each of these spheres using its actual radius.
This should be reasonably efficient assuming that the radius of spheres is not massively bigger than their average distance.
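In code, the three steps map directly onto a k-d tree over the sphere centers. The sketch below uses SciPy's cKDTree and generalizes the first step to the k nearest centers so the bound also works for k > 1; the (centers, radii) data layout is an assumption.
import numpy as np
from scipy.spatial import cKDTree
def k_nearest_spheres(centers, radii, q, k):
    # Indices of the k spheres whose surfaces are closest to the point q.
    centers = np.asarray(centers, dtype=float)
    radii = np.asarray(radii, dtype=float)
    q = np.asarray(q, dtype=float)
    tree = cKDTree(centers)
    d_center, idx = tree.query(q, k=k)               # step 1: k nearest centers
    bound = np.max(d_center - radii[idx])            # worst candidate surface distance
    # Step 2: any better sphere has its center within bound + max radius.
    candidates = np.array(tree.query_ball_point(q, bound + radii.max()))
    # Step 3: exact surface distances for the candidates only.
    surface = np.linalg.norm(centers[candidates] - q, axis=1) - radii[candidates]
    return candidates[np.argsort(surface)[:k]]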

Find the diameter of a set of n points in d-dimensional space

I am interested in finding the diameter of two point sets in 128 dimensions. The first has 10,000 points and the second 1,000,000. For that reason I would like to do something better than the naive approach, which takes O(n²). The algorithm should be able to handle any number of points and dimensions, but I am currently most interested in these two particular data sets.
I am much more interested in speed than in accuracy. Based on that, I would find the (approximate) bounding box of the point set by computing the min and max value per coordinate, in O(n*d) time. Then, if I could find the diameter of this box, the problem would be solved.
In the 3-d case I could find the diagonal of one face, since I know its two edge lengths, and then apply the Pythagorean theorem with the remaining edge, which is perpendicular to that face. I am not sure about this, however, and I certainly can't see how to generalize it to d dimensions.
An interesting answer can be found here, but it seems to be specific to 3 dimensions, and I want a method for d dimensions.
Interesting paper: On computing the diameter of a point set in high dimensional Euclidean space. Link. However, implementing the algorithm seems like too much for me at this stage.
The classic 2-approximation algorithm for this problem, with running time O(nd), is to choose an arbitrary point and then return the maximum distance to another point. The diameter is no smaller than this value and no larger than twice this value.
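In code the 2-approximation is just a few lines (a sketch; which point you start from does not affect the guarantee):
import numpy as np
def diameter_2_approx(points):
    # O(n*d): the true diameter lies between this value and twice this value.
    points = np.asarray(points, dtype=float)
    return float(np.linalg.norm(points - points[0], axis=1).max())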
I would like to add a comment, but not enough reputation for that...
I just want to warn other readers that the "bounding box" solution is very inaccurate. Take for example the Euclidean ball of radius one. This set has diameter two, but its bounding box is [-1, 1]^d, which has diameter twice the square root of d. For d = 128, this is already a very bad approximation.
For a crude estimate, I would stay with David Eisenstat's answer.
There is a precision-based algorithm which performs very well in any dimension, based on computing the dimensions of an axis-aligned bounding box.
The idea is that it is possible to find lower and upper bounds for the axis-aligned bounding-box length function, since its partial derivatives are bounded and depend on the angle between the axes.
The bound on the local-maximum derivative between two axes in 2-d space can be computed as:
sin(a/2)*(1 + tan(a/2))
This means that, for example, for a 90-degree angle between axes the bound is 1.42 (sqrt(2)).
It reduces to a/2 as a => 0, so the upper bound is proportional to the angle.
For the multidimensional case the formula varies slightly, but it is still easy to compute.
So the search for local maxima converges in logarithmic time.
The good news is that we can run the search for such local maxima in parallel.
Also, we can prune both the search regions, based on the best result achieved so far, and the points themselves, namely those which fall below the lower bound of the search in the worst region.
The worst case for the algorithm is when all of the points lie on the surface of a sphere.
This can be further improved: when we detect a local search which operates on just a few points, we switch to brute force for that particular axis. This works fast because we only need the points involved in that particular local search, which can be determined as the points actually bounded by two opposite spherical cones of a particular angle sharing the same axis.
It is hard to figure out the big-O complexity, because it depends on the desired precision and on the distribution of the points (bad when most of the points lie on a sphere's surface).
The algorithm I use is as follows:
1. Set the initial angle a = pi/2.
2. Take one axis for each dimension. The angle and the axes form the initial 'bucket'.
3. For each axis, compute the span on that axis by projecting all the points onto the axis and finding the min and max of the projected coordinates.
4. Compute the upper and lower bounds of the diameter of interest. This is based on the formula sin(a/2)*(1 + tan(a/2)), multiplied by an asymmetry coefficient computed from the lengths of the current axis projections.
5. For the next step, discard all of the points which fall under the lower bound in every dimension at the same time.
6. For each axis, if the number of points above the upper bound is less than some reasonable amount (determined experimentally), compute the result by brute force (N^2) on that set of points, adjust the lower bound, and drop the axis from the next step.
7. For the next step, discard all of the axes whose points all fall under the lower bound.
8. If the precision is satisfactory, i.e. (upper bound - lower bound) < epsilon, return the upper bound as the result.
9. For each of the surviving axes, there is a virtual cone on that axis (actually, two opposite cones) which covers some area of a virtual sphere enclosing a face of the cube. If I'm not mistaken, its angle would be a * sqrt(2). Set the new angle to a / sqrt(2). Create a whole bucket of new axes (2 * the number of dimensions), so that the new cone areas cover the initial cone area. This is the hard part for me, as I don't have enough imagination for the n > 3-dimensional case.
10. Continue from step (3).
You can parallelize the procedure, synchronizing the limits computed so far for the points in steps (5) through (7).
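I won't attempt the full bucket/cone refinement above here, but a much-simplified sketch of the underlying projection idea is the following: the widest span over a set of unit directions is a lower bound on the diameter, and the bounding-box diagonal is an upper bound. Using random directions instead of the refined axis buckets is my own simplification.
import numpy as np
def diameter_bounds(points, n_directions=64, seed=0):
    # Lower/upper bounds on the diameter via projections: a crude
    # stand-in for the full axis-refinement algorithm described above.
    points = np.asarray(points, dtype=float)
    n, d = points.shape
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_directions, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)    # random unit directions
    spans = points @ dirs.T                                # projections, shape (n, n_directions)
    lower = (spans.max(axis=0) - spans.min(axis=0)).max()  # widest projected span
    upper = np.linalg.norm(points.max(axis=0) - points.min(axis=0))   # bounding-box diagonal
    return lower, upper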
I'm going to summarize the algorithm proposed by Timothy Shields.
Pick a random point x.
Pick the point y furthest from x.
If not done, let x = y and go to step 2.
The more times you repeat, the more accurate the result will be... ??
EDIT: actually this algorithm is not very good. Think about a 2D rectangle with vertices ABCD. There are two maxima, between A and C and between B and D, separated by a sizable valley. This algorithm will get stuck at one or the other, 50/50. If AC is slightly larger than BD, you'll get the wrong answer 50% of the time no matter how many times you iterate. Other regular polygons have the same issue, and in higher dimensions it is even worse.
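For reference, the iteration itself is only a few lines (a sketch; as the edit above warns, it can get stuck in a local maximum):
import numpy as np
def iterated_farthest_pair(points, max_iters=10):
    # Heuristic diameter: repeatedly jump to the farthest point from the current one.
    points = np.asarray(points, dtype=float)
    x = points[0]
    best = 0.0
    for _ in range(max_iters):
        dists = np.linalg.norm(points - x, axis=1)
        j = int(dists.argmax())
        if dists[j] <= best:          # no improvement: stop
            break
        best, x = dists[j], points[j]
    return best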

algorithm to create bounding rectangles for 2D points

The input is a series of point coordinates (x0,y0), (x1,y1), ..., (xn,yn) (n is not very large, say ~1000). We need to create some rectangles as bounding boxes for these points. There's no need to find the globally optimal solution; the only requirement is that if the Euclidean distance between two points is less than R, they should be in the same bounding rectangle. I've searched for some time and it seems to be a clustering problem, and the K-means method might be a useful one.
However, the input point coordinates don't follow any specific pattern from one run to the next, so it may not be possible to fix a specific K for K-means. I am wondering if there is any algorithm or method that could solve this problem?
The only requirement is if the euclidean distance between two point is less than R, they should be in the same bounding rectangle
This is the definition of single-linkage hierarchical clustering cut at a height of R.
Note that this may yield overlapping rectangles.
For much faster and highly efficient methods, have a look at bulk loading strategies for R*-trees, such as sort-tile-recursive. It won't satisfy your "only" requirement above, but it will yield well balanced, non-overlapping rectangles.
K-means is obviously not appropriate for your requirements.
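With SciPy the single-linkage variant is only a few lines: run single linkage on the points, cut the dendrogram at height R, and take the axis-aligned extent of each resulting cluster (a sketch; as noted above, the rectangles may overlap).
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
def bounding_rectangles(points, R):
    # Group points so that any pair closer than R shares a rectangle,
    # then return one (min_corner, max_corner) box per group.
    points = np.asarray(points, dtype=float)
    labels = fcluster(linkage(points, method="single"), t=R, criterion="distance")
    boxes = []
    for label in np.unique(labels):
        cluster = points[labels == label]
        boxes.append((cluster.min(axis=0), cluster.max(axis=0)))
    return boxes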
With only 1000 points I would do the following:
1) Work out the distance between all pairs of points. If the distance between a pair is less than R, they need to go in the same bounding rectangle, so use a disjoint-set data structure (http://en.wikipedia.org/wiki/Disjoint-set_data_structure) to record this.
2) For each subset that comes out of your disjoint-set data structure, work out the min and max coordinates of the points in it and use these to create a bounding box for that subset.
If you have more points or are worried about efficiency, you will want to make stage (1) more efficient. One easy way is to sweep through the points in order of x coordinate, keeping only the points at most R to the left of the most recent point seen, and using a balanced tree structure to find, among these, the points at most R above or below the most recent point before calculating their distance to it. A step up from this would be to use a spatial data structure to find pairs within distance R of each other even more efficiently.
Note that for some inputs you will get just one huge bounding box because you have long chains of points, and for some other inputs you will get bounding boxes inside bounding boxes, for instance if your points are in concentric circles.
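A direct sketch of steps (1) and (2), with a small hand-rolled union-find (fine for ~1000 points, where the O(n²) pair loop is cheap):
import numpy as np
def bounding_rectangles_dsu(points, R):
    points = np.asarray(points, dtype=float)
    n = len(points)
    parent = list(range(n))                      # disjoint-set forest
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]        # path halving
            i = parent[i]
        return i
    # Stage 1: union every pair of points closer than R.
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) < R:
                parent[find(i)] = find(j)
    # Stage 2: one bounding box per connected subset.
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [(points[idx].min(axis=0), points[idx].max(axis=0))
            for idx in groups.values()]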

Calculation of centroid & volume of a polyhedron when the vertices are given

Given the locations of the vertices of a convex polyhedron (3D), I need to calculate its centroid and volume. The following code is available on the MathWorks site.
function C = centroid(P)
% Centroid of the convex polyhedron whose vertices are the rows of P.
k = convhulln(P);
if length(unique(k(:))) < size(P,1)
    error('Polyhedron is not convex.');
end
T = delaunayn(P);               % tessellate the polyhedron into simplices
n = size(T,1);
W = zeros(n,1);                 % simplex volumes
C = 0;
for m = 1:n
    sp = P(T(m,:),:);           % vertices of the m-th simplex
    [~, W(m)] = convhulln(sp);  % second output of convhulln is the volume
    C = C + W(m) * mean(sp);    % accumulate volume-weighted simplex centroids
end
C = C ./ sum(W);                % divide by the total volume
end
The code is elegant but terribly slow. I need to calculate the volume and centroid of thousands of polyhedra, hundreds of times, so using this code in its current state is not feasible. Does anyone know a better approach, or can this code be made faster? There are some minor changes I can think of, such as replacing the call to mean with an explicit expression for the mean.
There is a much simpler approach to calculating the volume with minimal effort. The first flavour uses three sets of local topological information about the polyhedron: the unit tangent vectors of the edges, the unit in-plane normals to those tangents, and the unit normals of the facets themselves (all of which are very simple to extract from the vertices). Please refer to Volume of a Polyhedron for further details.
The second flavour uses the face areas, the face normal vectors, and the face barycenters to calculate the polyhedron's volume, according to this Wikipedia article.
Both algorithms are rather simple and very easy to implement, and thanks to their simple summation structure they are easy to vectorize too. I suppose that both approaches will be much faster than doing a fully fledged tessellation of the polyhedron.
The centroid of the polyhedron can then be calculated by applying the divergence theorem, transferring the integration over the full polyhedron volume into an integration over the polyhedron's surface. A detailed description can be found in Calculating the volume and centroid of a polyhedron in 3d. I did not check whether the tessellation of the polyhedron into triangles is really necessary or whether one could work with the more complex polygonal faces of the polyhedron directly, but in any case the area tessellation of the faces is much simpler than the volume tessellation.
In total, such a combined approach should be much faster than the volume-tessellation approach.
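For a convex polyhedron given only by its vertices, this surface-based idea can be prototyped quickly. The sketch below (written in Python/NumPy purely to illustrate the approach, not as a drop-in replacement for the MATLAB code above) triangulates only the hull surface and fans tetrahedra from an interior point, accumulating the volume and a volume-weighted centroid in one vectorized pass; it is not the edge-based formula from the linked article.
import numpy as np
from scipy.spatial import ConvexHull
def polyhedron_volume_centroid(P):
    # Volume and solid centroid of the convex hull of the points P (one vertex per row).
    P = np.asarray(P, dtype=float)
    hull = ConvexHull(P)
    ref = P.mean(axis=0)                          # a point inside the convex body
    a = P[hull.simplices[:, 0]] - ref             # fan a tetrahedron from ref
    b = P[hull.simplices[:, 1]] - ref             # to each triangular hull facet
    c = P[hull.simplices[:, 2]] - ref
    tet_vol = np.abs(np.einsum("ij,ij->i", a, np.cross(b, c))) / 6.0
    tet_cen = ref + (a + b + c) / 4.0             # centroid of each tetrahedron
    volume = tet_vol.sum()
    centroid = (tet_vol[:, None] * tet_cen).sum(axis=0) / volume
    return volume, centroid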
I think your only option, if quickhull isn't good enough, is cudahull, if you want exact solutions. Although even then you're only going to get about a 40x speedup at most (it seems).
I'm assuming that the convex hulls you compute each have at least 10 vertices (if it's much fewer than that, there isn't much you can do). If you don't mind "close enough" solutions, you could create a version of quickhull that limits the number of vertices per polyhedron. The number of vertices you limit the calculation to would also allow you to compute the maximum error if needed.
The point is that as the number of vertices on the convex hull approaches infinity, you end up with a sphere. This means that, due to the way quickhull works, each additional vertex you add to the convex hull has less of an effect* than the ones before it.
*Depending on how quickhull is coded, this may only be true in a general sense. Making it true in practice would require modifying quickhull's recursion so that, while the "next vertex" is always calculated (except after the last vertex is added, or when no points remain for that section), vertices are actually added to the convex hull in the order that maximizes the increase in the polyhedron's volume (likely in order of most distant to least distant). You'll incur some performance cost for keeping track of the insertion order, but as long as the ratio of pending convex-hull points to pending points is high enough, it should be worth it. As for error, the best option is probably to stop the algorithm either when the actual convex hull is reached, or when the maximum increase in volume becomes smaller than a certain fraction of the current total volume. If performance is more important, then simply limit the number of convex-hull points per polyhedron.
You could also look at the various approximate convex hull algorithms, but the method I outlined above should work well for volume/centroid approximation, with the ability to determine the error.
How much you can speed up your code depends a lot on how you want to calculate your centroid. See this answer about centroid calculation for your options. It turns out that if you need the centroid of the solid polyhedron, you're basically out of luck. If, however, only the vertices of the polyhedron have weight, then you could simply write
[k, volume] = convhulln(P);
centroid = mean(P(unique(k(:)), :));   % average each hull vertex exactly once
