I have a large number of points in 3D space (x,y,z), represented as an array of structs of three floats each. I also have access to a strong graphics card with CUDA capability. I want the following:
Divide the points in the array into clusters so that every point within a cluster is within a Euclidean distance of at most X of at least one other point in the cluster.
Example in 2D:
The "brute force" way of doing this is of course to calculate the distance between every point and every other point, to see if any of the distances is below the threshold X, and if so mark those points as belonging to the same cluster. This is an O(n²) algorithm.
This can of course be done in parallel in CUDA with n² threads, but is there a better way?
The algorithm can be reduced to O(n) by using binning:
impose a 3D grid with spacing X, that is, a 3D lattice (each cell of the lattice is a cubic bin);
assign each point in space to the corresponding bin (the bin that geometrically contains that point);
every time you need to evaluate the distances from one point, you use only the points in the bin of the point itself and the ones in the 26 neighbouring bins (3x3x3 = 27).
The points in the other bins are farther away than X, so you don't need to evaluate those distances at all.
In this way, assuming a constant density of points, you compute distances for only a constant number of pairs per point, so the total work is proportional to the number of points.
Assigning the points to the bins is O(n) as well.
If the points are not uniformly distributed, the bins can be made smaller (in which case you must consider more than 26 neighbours when evaluating the distances) and, eventually, stored sparsely.
This is a typical trick used in molecular dynamics, ray tracing, meshing, and so on. I know the term binning from molecular dynamics simulation; the name can change (link-cell, for example, and kd-trees use the same principle in a more elaborate form), but the algorithm remains the same!
And, good news, the algorithm is well suited for parallel implementation.
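A minimal single-threaded sketch of the binning idea, in Python rather than CUDA, just to make the steps concrete. The function name `cluster_points` and the use of a union-find structure to merge linked points into clusters are my own assumptions, not something stated above. A point with no neighbour within X ends up with a label of its own.

```python
from collections import defaultdict
from itertools import product
from math import floor

def cluster_points(points, x):
    """Label points so that points linked by chains of distance <= x share a label."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Step 1: assign every point to the cubic bin of side x that contains it.
    bins = defaultdict(list)
    for idx, (px, py, pz) in enumerate(points):
        bins[(floor(px / x), floor(py / x), floor(pz / x))].append(idx)

    # Step 2: compare each point only with its own bin and the 26 neighbouring bins.
    x2 = x * x
    for (bx, by, bz), members in bins.items():
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            neighbours = bins.get((bx + dx, by + dy, bz + dz))
            if not neighbours:
                continue
            for i in members:
                for j in neighbours:
                    if i < j:  # handle each unordered pair once
                        d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                        if d2 <= x2:
                            parent[find(i)] = find(j)

    return [find(i) for i in range(len(points))]
```

On a GPU the same two steps would typically become a binning pass followed by one thread per point scanning its 27 bins, but that mapping is not shown here.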
refs:
https://en.wikipedia.org/wiki/Cell_lists
For my simulation purposes, I want to generate k randomly distributed spheres (all with the same radius) in a confined 3D space (inside a rectangular box), where k is on the order of 1000. The spheres should not overlap one another.
So, I want to generate k random points in a 3D space at least d distance away from one another; considering the number of points and the frequency at which I need those points for simulation, I don't want to apply brute force; I'm looking for some efficient algorithms achieving this.
How about just starting with some regular tessellation of the space (i.e. some primitive 3d lattice) and putting a single point somewhere in each tile? You'd then only need to check a small number of neighboring tiles for proximity.
To get a more statistically uniform, i.e. less regular, set of points, you could:
perturb points in space
generate an overly dense lattice and reject some points
"warp" the space so that the lattice was more dense in certain areas
You could perturb the points sequentially, giving you a Monte Carlo chain over their coordinates, and potentially saving work elsewhere. Presumably you could tailor this so that the equilibrium distribution was what you wanted.
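One way the lattice idea could look in code: a sketch under the assumption that the box starts at the origin and that the tile edge is larger than d. The names `jittered_points`, `spacing`, and `min_pairwise_distance` are made up for this example. Each chosen tile gets one point, jittered only within a corner sub-cube of edge (spacing - d), so no neighbour checks are needed at all: points in different tiles are always at least d apart along some axis.

```python
import random
from itertools import combinations
from math import dist

def jittered_points(k, d, box, spacing=None, rng=random):
    """Return k points inside box = (bx, by, bz), pairwise at least d apart."""
    s = 2.0 * d if spacing is None else spacing  # tile edge (assumed default)
    if s <= d:
        raise ValueError("spacing must be larger than d")
    counts = [int(side // s) for side in box]  # whole tiles per axis
    tiles = [(i, j, m)
             for i in range(counts[0])
             for j in range(counts[1])
             for m in range(counts[2])]
    if k > len(tiles):
        raise ValueError("box too small for k points at this spacing")
    # Pick k distinct tiles at random and jitter one point inside each.
    return [tuple(t * s + rng.random() * (s - d) for t in tile)
            for tile in rng.sample(tiles, k)]

def min_pairwise_distance(points):
    """Brute-force check, only meant for verifying small outputs."""
    return min(dist(p, q) for p, q in combinations(points, 2))
```

For spheres of radius r, taking d = 2r and adding r to every coordinate should keep the spheres inside the walls as well. The result is less regular than a plain lattice but still not a uniform hard-sphere distribution; the perturbation or rejection ideas above would be needed for that.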
Given the (lat, lon) coordinates of a group of n locations on the surface of the earth, find a (lat, lon) point c and a value of r > 0 such that we maximize the density, d, of locations per square mile, say, in the surface area described and contained by the circle defined by c and r.
At first I thought maybe you could solve this using linear programming. However, density depends on area, which depends on r squared, a quadratic term. So I don't think the problem is amenable to linear programming.
Is there a known method for solving this kind of thing? Suppose you simplify the problem to (x, y) coordinates on the Cartesian plane. Does that make it easier?
You've got two variables, c and r, that you're trying to find so as to maximize the density, which is a function of c and r (and the locations, which are constant). So maybe a hill-climbing, gradient descent, or simulated annealing approach might work? You can make a pretty good guess for your first value: just use the centroid of the locations. I think the local maximum you reach from there would be a global maximum.
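A rough sketch of the objective and of a very plain hill climb, on the simplified Cartesian plane. Everything here is an assumption made for illustration: the names `circle_density` and `hill_climb_centre`, the axis-aligned moves, and in particular the choice to keep r fixed. The reason for fixing r is that, as defined, the density is unbounded: a circle centred on any single location has density at least 1/(πr²), which grows without limit as r shrinks, so some lower bound on r (or on the point count) seems to be needed.

```python
from math import hypot, pi

def circle_density(points, c, r):
    """Number of points inside the circle (centre c, radius r) divided by its area."""
    cx, cy = c
    inside = sum(1 for (x, y) in points if hypot(x - cx, y - cy) <= r)
    return inside / (pi * r * r)

def hill_climb_centre(points, r, step=1.0, min_step=1e-3):
    """Move the centre in axis directions while the density improves (r is fixed)."""
    # Initial guess: the centroid of the locations.
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    best = circle_density(points, (cx, cy), r)
    while step >= min_step:
        moved = False
        for dx, dy in ((step, 0.0), (-step, 0.0), (0.0, step), (0.0, -step)):
            candidate = (cx + dx, cy + dy)
            density = circle_density(points, candidate, r)
            if density > best:
                (cx, cy), best, moved = candidate, density, True
                break
        if not moved:
            step /= 2.0  # no neighbour is better: refine the step size
    return (cx, cy), best
```

Because the count is piecewise constant in c, a climb like this can stall on a plateau, so the result is only a local optimum.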
Steps:
Cluster your points using a density based clustering algorithm (see note 1);
Calculate the density of each cluster;
Recursively (or iteratively) sub-cluster the points in the most dense cluster;
The algorithm has to ignore the outliers by making them clusters of their own. This way, all the outliers with high density will be kept and outliers with low density will be weeded out.
Keep track of the cluster with highest density observed till now. Return when you finally reach a cluster made of a single point.
This algorithm will work only when you have clusters like the ones shown below, as the recursive exploration will result in similarly shaped clusters:
The algorithm will fail with awkwardly shaped clusters like this: even though the triangles are most densely placed when you calculate the density in the donut shape, they will report a far lower density with respect to the circle centered at [0, 0]:
1. One density based clustering algorithm that will work for you is DBSCAN.
To efficiently find the n nearest neighbors of a point in d-dimensional space, I selected the dimension with the greatest scatter (i.e. the coordinate in which the differences between points are largest). The whole range from the minimal to the maximal value in this dimension was split into k bins. Each bin contains the points whose coordinates (in this dimension) are within the range of that bin. It was ensured that there are at least 2n points in each bin.
The algorithm for finding the n nearest neighbors of point x is as follows:
Identify the bin kx in which point x lies (its projection, to be precise).
Compute distances between x and all the points in bin kx.
Sort computed distances in ascending order.
Select the first n distances. The points to which these distances were measured are returned as the n nearest neighbors of x.
This algorithm does not work in all cases. When can the algorithm fail to compute the nearest neighbors?
Can anyone propose a modification of the algorithm to ensure proper operation in all cases?
Where KNN fails:
If the data is a jumble of all different classes, then KNN will fail because it will try to find the k nearest neighbours but all points are random.
Outlier points:
Let's say you have two clusters of different classes. Then if you have an outlier point as the query, KNN will assign one of the classes even though the query point is far away from both clusters.
This is failing because (any of) the k nearest neighbors of x could be in a different bin than x.
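A small sketch that reproduces this failure and shows one possible exact variant. `knn_single_bin` follows the scheme from the question (with hypothetical bin `edges` passed in by hand); `knn_sorted_scan` is my own modification, not the questioner's method: it visits points in order of their gap to x along the chosen dimension and stops once that gap alone exceeds the current n-th best distance, which is safe because the full distance can never be smaller than the gap in one coordinate.

```python
from math import dist

def knn_brute(points, x, n):
    """Reference answer: sort all points by distance to x."""
    return sorted(points, key=lambda p: dist(p, x))[:n]

def knn_single_bin(points, x, n, edges, dim=0):
    """Scheme from the question: search only the bin that contains x along dim."""
    def bin_of(p):
        return sum(1 for e in edges if p[dim] >= e)
    same_bin = [p for p in points if bin_of(p) == bin_of(x)]
    return sorted(same_bin, key=lambda p: dist(p, x))[:n]

def knn_sorted_scan(points, x, n, dim=0):
    """Exact variant: scan by increasing gap along dim, with an early stop."""
    best = []  # (distance, point) pairs, kept sorted, at most n long
    for p in sorted(points, key=lambda p: abs(p[dim] - x[dim])):
        gap = abs(p[dim] - x[dim])
        if len(best) == n and gap > best[-1][0]:
            break  # every remaining point is at least this far away
        best.append((dist(p, x), p))
        best.sort()
        del best[n:]
    return [p for _, p in best]

# Query just left of the bin edge at 5.0; its second neighbour is just right of it.
points = [(4.9, 0.0), (5.1, 0.0), (1.0, 0.0), (2.0, 0.0), (4.0, 3.0), (9.0, 0.0)]
query = (4.95, 0.0)
print(knn_single_bin(points, query, 2, edges=[5.0]))
print(knn_sorted_scan(points, query, 2))
```

The sort inside `knn_sorted_scan` is repeated for every query only to keep the sketch short; in practice the points would presumably be sorted once and the scan started from a binary search position.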
What do you mean by "not working"? You do understand that what you are doing is only an approximate method.
Try normalising the data and then choosing the dimension; otherwise scatter makes no sense.
The best vector for discrimination or for clustering may not be one of the original dimensions, but any combination of dimensions.
Use PCA (Principal Component Analysis) or LDA (Linear Discriminant Analysis), to identify a discriminative dimension.
Given N points (in 2D) with x and y coordinates, you have to find a point P (among the N given points) such that the sum of distances from the other (N-1) points to P is minimum.
For example, N points are given: p1(x1,y1), p2(x2,y2), ..., pN(xN,yN).
We have to find a point P among p1, p2, ..., pN whose sum of distances from all other points is minimum.
I used a brute-force approach, but I need a better one. I also tried finding the median, mean, etc., but that does not work for all cases.
Then I came up with the idea of treating X as the vertices of a polygon, finding the centroid of this polygon, and choosing the point from Y nearest to that centroid. But I'm not sure whether the centroid minimizes the sum of its distances to the vertices of the polygon, so I'm not sure whether this is a good way. Is there any algorithm for solving this problem?
If your points are nicely distributed, and if there are so many of them that brute force (calculating the total distance from each point to every other point) is unappealing, the following might give you a good enough answer. By 'nicely distributed' I mean (approximately) uniformly or (approximately) randomly and without marked clustering in multiple locations.
Create a uniform k*k grid, where k is an odd integer, across your space. If your points are nicely distributed the one which you are looking for is (probably) in the central cell of this grid. For all the other cells in the grid count the number of points in each cell and approximate the average position of the points in each cell (either use the cell centre or calculate the average (x,y) for points in the cell).
For each point in the central cell, compute the distance to every other point in the central cell, and the weighted average distance to the points in the other cells. This will, of course, be the distance from the point to the 'average' position of points in the other cells, weighted by the number of points in the other cells.
You'll have to juggle the increased accuracy of higher values for k against the increased computational load and figure out what works best for your points. If the distribution of points across cells is far from uniform then this approach may not be suitable.
This sort of approach is quite widely used in large-scale simulations where points have properties, such as gravity and charge, which operate over distances. Whether it suits your needs, I don't know.
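A sketch of how the grid approximation above might look, next to the exact brute-force answer for comparison. The function names, the bounding-box grid, the use of the mean position (rather than the cell centre) as each cell's summary, and the fallback to brute force when the central cell is empty are all assumptions on my part. It ranks candidates by the count-weighted total distance, which orders them the same way as the weighted average.

```python
from math import dist

def brute_force_best(points):
    """Exact answer: the input point with the smallest total distance to all others."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

def grid_approx_best(points, k=3):
    """Approximate answer using a k*k grid (k odd) over the bounding box."""
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    x0, y0 = min(xs), min(ys)
    w, h = (max(xs) - x0) or 1.0, (max(ys) - y0) or 1.0

    def cell(p):
        # Column and row of the cell containing p; the far edge is clamped inward.
        return (min(k - 1, int((p[0] - x0) * k / w)),
                min(k - 1, int((p[1] - y0) * k / h)))

    cells = {}
    for p in points:
        cells.setdefault(cell(p), []).append(p)

    centre = (k // 2, k // 2)
    candidates = cells.get(centre, [])
    if not candidates:
        return brute_force_best(points)  # fallback: the central cell is empty

    # Summarise every non-central cell by its point count and mean position.
    summaries = [(len(ps), (sum(p[0] for p in ps) / len(ps),
                            sum(p[1] for p in ps) / len(ps)))
                 for c, ps in cells.items() if c != centre]

    def approx_total(p):
        near = sum(dist(p, q) for q in candidates)       # exact, inside central cell
        far = sum(n * dist(p, m) for n, m in summaries)  # weighted, other cells
        return near + far

    return min(candidates, key=approx_total)
```

On a regular 9x9 lattice of points both functions pick the middle point, but on clustered data the two can disagree, as the answer warns.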
The point in consideration is known as the Geometric Median.
The centroid or center of mass, defined similarly to the geometric median as minimizing the sum of the squares of the distances to each sample, can be found by a simple formula: its coordinates are the averages of the coordinates of the samples. No such formula is known for the geometric median, and it has been shown that no explicit formula, nor an exact algorithm involving only arithmetic operations and kth roots, can exist in general.
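Written out, the two definitions being compared look as follows. The symbols (p_i for the samples, \bar{p} for the centroid, m for the geometric median, P for the point asked about) are my own notation. Note that the question restricts the minimizer to the N given points; that restricted version is usually called the medoid, and it can always be found exactly by comparing the N candidates.

```latex
\begin{align*}
\bar{p} &= \arg\min_{x \in \mathbb{R}^2} \sum_{i=1}^{N} \lVert x - p_i \rVert^2
         = \frac{1}{N} \sum_{i=1}^{N} p_i
  && \text{(centroid)} \\
m &= \arg\min_{x \in \mathbb{R}^2} \sum_{i=1}^{N} \lVert x - p_i \rVert
  && \text{(geometric median)} \\
P &= \arg\min_{p_j,\; 1 \le j \le N} \sum_{i \ne j} \lVert p_j - p_i \rVert
  && \text{(restricted to the given points)}
\end{align*}
```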
I'm not sure if I understand your question but when you calculate the minimum spanning tree the sum from any point to any other point from the tree is minimum.
I'm trying to design an implementation of Vector Quantization as a C++ template class that can handle different types and dimensions of vectors (e.g. 16-dimensional vectors of bytes, or 4D vectors of doubles, etc.).
I've been reading up on the algorithms, and I understand most of it:
here and here
I want to implement the Linde-Buzo-Gray (LBG) Algorithm, but I'm having difficulty figuring out the general algorithm for partitioning the clusters. I think I need to define a plane (hyperplane?) that splits the vectors in a cluster so there is an equal number on each side of the plane.
[edit to add more info]
This is an iterative process, but I think I start by finding the centroid of all the vectors, then use that centroid to define the splitting plane, get the centroid of each of the sides of the plane, continuing until I have the number of clusters needed for the VQ algorithm (iterating to optimize for less distortion along the way). The animation in the first link above shows it nicely.
My questions are:
What is an algorithm to find the plane once I have the centroid?
How can I test a vector to see if it is on either side of that plane?
If you start with one centroid, then you'll have to split it, basically by doubling it and slightly moving the points apart in an arbitrary direction. The plane is just the plane orthogonal to that direction.
But you don't need to compute that plane.
More generally, the region (i) is defined as the set of points which are closer to the centroid c_i than to any other centroid. When you have two centroids, each region is a half space, thus separated by a (hyper)plane.
How do you test a vector x to see which side of the plane it is on? (That's with two centroids.)
Just compute the distances ||x-c1|| and ||x-c2||; the index of the minimum value (1 or 2) tells you which region the point x belongs to.
More generally, if you have n centroids, you would compute all the distances ||x-c_i||, and the centroid that x is closest to (i.e., the one for which the distance is minimal) tells you which region x belongs to.
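A compact sketch of the split-and-assign loop described above, in Python rather than as a C++ template, just to show the control flow. The names `nearest_centroid`, `mean`, and `lbg`, the additive nudge of size `eps` along the all-ones direction, and the fixed iteration count are assumptions chosen for illustration; a real implementation would more likely stop when the distortion no longer improves.

```python
from math import dist

def nearest_centroid(x, centroids):
    """Index i of the centroid c_i with minimal distance ||x - c_i||."""
    return min(range(len(centroids)), key=lambda i: dist(x, centroids[i]))

def mean(vectors):
    """Component-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[d] for v in vectors) / n for d in range(len(vectors[0])))

def lbg(vectors, levels, eps=1e-3, iterations=20):
    """Grow a codebook to 2**levels centroids by repeated splitting and refinement."""
    centroids = [mean(vectors)]
    for _ in range(levels):
        # Split: duplicate every centroid and nudge the two copies apart.
        centroids = [tuple(v + s * eps for v in c)
                     for c in centroids for s in (+1, -1)]
        for _ in range(iterations):
            # Assign each vector to the region of its nearest centroid.
            regions = [[] for _ in centroids]
            for x in vectors:
                regions[nearest_centroid(x, centroids)].append(x)
            # Move each centroid to the mean of its region (unchanged if empty).
            centroids = [mean(r) if r else c for r, c in zip(regions, centroids)]
    return centroids
```

No plane is ever computed: the splitting plane exists only implicitly, as the boundary between the regions that `nearest_centroid` produces.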
I don't quite understand the algorithm, but the second question is easy:
Let's call V a vector which extends from any point on the plane to the point-in-question. Then the point-in-question lies on the same side of the (hyper)plane as the normal N iff V·N > 0
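The sign test above is short enough to sketch directly. The function names are made up; `bisector` additionally assumes the two-centroid case, where the separating (hyper)plane is the perpendicular bisector of the segment between c1 and c2, so it passes through their midpoint and has normal c1 - c2.

```python
def side_of_plane(point, point_on_plane, normal):
    """Return +1 if point is on the side the normal points to, -1 if opposite, 0 if on the plane."""
    v = [p - q for p, q in zip(point, point_on_plane)]  # V: from the plane to the point
    s = sum(a * b for a, b in zip(v, normal))           # V . N
    return (s > 0) - (s < 0)

def bisector(c1, c2):
    """Plane halfway between two centroids, as (point_on_plane, normal towards c1)."""
    midpoint = tuple((a + b) / 2 for a, b in zip(c1, c2))
    normal = tuple(a - b for a, b in zip(c1, c2))
    return midpoint, normal

# A point is on the positive side of the bisector exactly when it is closer to c1.
mid, n = bisector((0.0, 0.0), (4.0, 0.0))
print(side_of_plane((1.0, 5.0), mid, n))
```

This works unchanged in any number of dimensions, since it only uses a dot product.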