How to select points at a regular density - algorithm

How do I select a subset of points at a regular density? More formally, given
a set A of irregularly spaced points,
a distance metric dist (e.g., Euclidean distance),
and a target density d,
how can I select a smallest subset B that satisfies the condition below?
For every point x in A,
there exists a point y in B
that satisfies dist(x,y) <= d.
My current best shot is to
start with A itself
pick out the closest (or just a particularly close) pair of points
randomly exclude one of them
repeat as long as the condition holds
and repeat the whole procedure for best luck. But are there better ways?
I'm trying to do this with 280,000 18-D points, but my question is about general strategy, so I would also like to know how to do it with 2-D points. I don't really need a guarantee of a smallest subset; any useful method is welcome. Thank you.
bottom-up method
select a random point
select, among the unselected points, the y for which min(d(x,y) for x in selected) is largest
keep going!
I'll call it bottom-up and the one I originally posted top-down. This is much faster in the beginning, so for sparse sampling this should be better?
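A minimal Python sketch of the bottom-up idea, assuming Euclidean distance and a non-empty list of coordinate tuples. The name `bottom_up`, the `seed` parameter, and the stopping rule (stop as soon as every point is within d of a selected point) are illustrative assumptions, not part of the original description:

```python
import math
import random

def bottom_up(points, d, seed=0):
    """Greedy farthest-point selection; returns indices of the selected points."""
    rng = random.Random(seed)
    first = rng.randrange(len(points))
    selected = [first]
    # nearest[i] = distance from point i to its closest selected point
    nearest = [math.dist(p, points[first]) for p in points]
    while max(nearest) > d:
        # The next pick is the point farthest from everything selected so far.
        far = max(range(len(points)), key=nearest.__getitem__)
        selected.append(far)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[far]))
    return selected
```

Keeping the `nearest` array up to date makes each step O(n), so selecting k points costs O(nk) distance evaluations.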
performance measure
If guarantee of optimality is not required, I think these two indicators could be useful:
radius of coverage (RC): max over unselected y of min(d(x,y) for x in selected)
radius of economy (RE): min over selected y of min(d(x,y) for x in selected, x != y)
RC is the minimum allowed d, and there is no fixed inequality between the two, but RC <= RE is more desirable.
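The two indicators can be computed directly from their definitions. This is a brute-force sketch assuming Euclidean distance; the function names are hypothetical, and `selected` is a list of indices into `points`:

```python
import math

def coverage_radius(points, selected):
    """RC: largest distance from an unselected point to its nearest selected point."""
    chosen = set(selected)
    return max(
        (min(math.dist(points[y], points[x]) for x in chosen)
         for y in range(len(points)) if y not in chosen),
        default=0.0,  # every point is selected
    )

def economy_radius(points, selected):
    """RE: smallest distance between two distinct selected points."""
    sel = list(selected)
    return min(
        (math.dist(points[a], points[b])
         for i, a in enumerate(sel) for b in sel[i + 1:]),
        default=math.inf,  # fewer than two selected points
    )
```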
my little methods
For a little demonstration of that "performance measure," I generated 256 2-D points distributed uniformly or by standard normal distribution. Then I tried my top-down and bottom-up methods with them. And this is what I got:
RC is red, RE is blue. X axis is number of selected points. Did you think bottom-up could be as good? I thought so watching the animation, but it seems top-down is significantly better (look at the sparse region). Nevertheless, not too horrible given that it's much faster.
Here I packed everything.

You can model your problem as a graph: treat the points as nodes and connect two nodes with an edge if their distance is at most d. You then need the minimum number of vertices that, together with their adjacent vertices, cover all nodes of the graph. Strictly speaking this is the minimum dominating set problem, a close relative of minimum vertex cover, and it is NP-hard in general. Any vertex cover of a graph without isolated vertices is also a dominating set, so the fast 2-approximation for vertex cover gives a valid (though not necessarily near-minimum) subset: repeatedly take both endpoints of an edge into the cover, then remove them from the graph.
P.S.: you must of course first select the nodes that are fully disconnected from the graph. After removing these nodes (that is, selecting them), what remains can be handled as a vertex cover problem.
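A minimal Python sketch of the edge-based selection described above, assuming Euclidean distance and a brute-force neighbour scan (both illustrative choices; `cover_by_matching` is a hypothetical name). It pairs up points greedily, takes both endpoints of each pair, and then adds any point that still has no selected point within d, which handles the isolated nodes:

```python
import math

def cover_by_matching(points, d):
    """Return indices of a subset B such that every point is within d of B."""
    n = len(points)
    used = [False] * n
    chosen = []
    # Greedily build a maximal matching and keep both endpoints of each edge.
    for i in range(n):
        if used[i]:
            continue
        for j in range(i + 1, n):
            if not used[j] and math.dist(points[i], points[j]) <= d:
                used[i] = used[j] = True
                chosen += [i, j]
                break
    # Unmatched points with no chosen neighbour are isolated: select them.
    for i in range(n):
        if not used[i] and not any(
            math.dist(points[i], points[c]) <= d for c in chosen
        ):
            chosen.append(i)
    return chosen
```

The result always satisfies the coverage condition, but it can be noticeably larger than necessary, since both points of every matched pair are kept.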

A genetic algorithm may produce good results here.
I have been playing a little with this problem and these are my findings:
A simple method (call it random-selection) to obtain a set of points fulfilling the stated condition is as follows:
1) start with B empty
2) select a random point x from A and place it in B
3) remove from A every point y such that dist(x, y) <= d
4) while A is not empty, go to 2
A kd-tree can be used to perform the look ups in step 3 relatively fast.
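The random-selection steps can be sketched in a few lines of Python. This version assumes Euclidean distance and replaces the kd-tree with a plain linear scan to stay self-contained; it removes points within distance d inclusive, to match the condition dist(x,y) <= d in the question. The function name and `seed` parameter are hypothetical:

```python
import math
import random

def random_selection(points, d, seed=None):
    """Return indices of a subset B covering all points within distance d."""
    rng = random.Random(seed)
    remaining = list(range(len(points)))  # plays the role of A
    chosen = []                           # plays the role of B
    while remaining:
        x = rng.choice(remaining)
        chosen.append(x)
        # Drop x itself and every point it covers.
        remaining = [y for y in remaining
                     if math.dist(points[x], points[y]) > d]
    return chosen
```

By construction, any two selected points are more than d apart, which is what keeps the subsets reasonably small.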
The experiments I have run in 2D show that the subsets generated are approximately half the size of the ones generated by your top-down approach.
Then I have used this random-selection algorithm to seed a genetic algorithm that resulted in a further 25% reduction on the size of the subsets.
For mutation, given a chromosome representing a subset B, I randomly choose a hyperball inside the minimal axis-aligned hyperbox that covers all the points in A. Then I remove from B all the points that are also in the hyperball and use random-selection to complete it again.
For crossover I employ a similar approach, using a random hyperball to divide the mother and father chromosomes.
I have implemented everything in Perl using my wrapper for the GAUL library (GAUL can be obtained from here).
The script is here:
It accepts a list of n-dimensional points from stdin and generates a collection of pictures showing the best solution for every iteration of the genetic algorithm. The companion script can be used to generate the random points with a uniform distribution.

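Here is a proposal that assumes the Chebyshev (maximum-coordinate) distance metric; for the Manhattan metric the same scheme works with a grid of granularity d/n instead of d: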
1) Divide up the entire space into a grid of granularity d. Formally: partition A so that points (x1,...,xn) and (y1,...,yn) are in the same partition exactly when (floor(x1/d),...,floor(xn/d))=(floor(y1/d),...,floor(yn/d)).
2) Pick one point (arbitrarily) from each grid space -- that is, choose a representative from each set in the partition created in step 1. Don't worry if some grid spaces are empty! Simply don't choose a representative for this space.
Actually, the implementation won't have to do any real work to do step one, and step two can be done in one pass through the points, using a hash of the partition identifier (the (floor(x1/d),...,floor(xn/d))) to check whether we have already chosen a representative for a particular grid space, so this can be very, very fast.
Some other distance metrics may be able to use an adapted approach. For example, the Euclidean metric could use d/sqrt(n)-size grids. In this case, you might want to add a post-processing step that tries to reduce the cover a bit (since the grids described above are no longer exactly radius-d balls -- the balls overlap neighboring grids a bit), but I'm not sure how that part would look.
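The one-pass hashing scheme can be sketched as follows. This is an illustrative Python version; `grid_representatives` is a hypothetical name, and the representative of each cell is simply the first point encountered:

```python
import math

def grid_representatives(points, d):
    """Return the index of one representative point per occupied grid cell."""
    reps = {}
    for idx, p in enumerate(points):
        # Partition identifier: (floor(x1/d), ..., floor(xn/d))
        cell = tuple(math.floor(c / d) for c in p)
        reps.setdefault(cell, idx)  # keep only the first point seen in a cell
    return list(reps.values())
```

For example, with d = 1 the points (0.1, 0.1) and (0.9, 0.9) share cell (0, 0), so only the first of them is kept, while (1.5, 0.2) falls in cell (1, 0) and gets its own representative. For the Euclidean variant, the same function can be called with d/sqrt(n) in place of d.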

To be lazy, this can be cast as a set cover problem, which can be handled by mixed-integer programming solvers. Here is a GNU MathProg model for the GLPK LP/MIP solver. Here C[i] denotes the set of points that can "satisfy" point i.
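param N, integer, > 0;                            # number of points
set C{1..N};                                      # C[i]: points within distance d of point i
var x{i in 1..N}, binary;                         # x[i] = 1 if point i is selected into B
s.t. cover{i in 1..N}: sum{j in C[i]} x[j] >= 1;  # every point must be covered
minimize goal: sum{i in 1..N} x[i];               # select as few points as possible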
With 1000 normally distributed points, it didn't find the optimal subset in 4 minutes, but it reported that it knew the true minimum, and its selection contained only one point more than that.
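The sets C[i] that the model needs as input can be computed beforehand. A brute-force Python sketch, assuming Euclidean distance and the 1-based numbering used in the model (`cover_sets` is a hypothetical helper, not part of GLPK):

```python
import math

def cover_sets(points, d):
    """Map each 1-based point index i to the list C[i] of points within d of it."""
    n = len(points)
    return {
        i + 1: [j + 1 for j in range(n)
                if math.dist(points[i], points[j]) <= d]
        for i in range(n)
    }
```

Every C[i] contains i itself, so the cover constraints are always satisfiable by selecting all points.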


Two salesmen - one always visits the nearest neighbour, the other the farthest

Consider this question relative to graph theory:
Let G be a complete (every vertex is connected to all the other vertices) undirected graph with N vertices, described by an N x N distance matrix. Two "salesmen" travel this way: the first always visits the nearest unvisited vertex, the second the farthest, until they have both visited all the vertices. We must generate a matrix of distances and the starting points for the two salesmen (they can be different) such that:
All the distances are unique. Edit: they are positive integers.
The distance from a vertex to itself is always 0.
The difference between the total distance covered by the two salesmen must be a specific number, D.
The distance from A to B is equal to the distance from B to A
What efficient algorithms can be useful to help me? I can only think of backtracking, but I don't see any way to reduce the work to be done by the program.
Geometry is helpful.
Using the distances between points on a circle seems like it would work. It seems you could adjust D by making the circle radius larger or smaller.
Alternatively, really any 2D shape where the distances are all different could probably be used as well. In this case you would scale the shape up or down to obtain the correct D.
Edit: Now that I think about it, the simplest solution may be to simply pick N random 2D points, say 32 bit integer coordinates to lower the chances of any distances being too close to equal. If two distances are too close, just pick a different point for one of them until it's valid.
Ideally, you'd then just need to work out a formula to determine the relationship between D and the scaling factor, which I'm not sure of offhand. If nothing else, you could also just use binary search or interpolation search or something to search for scaling factor to obtain the required D, but that's a slower method.
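Whichever way the matrix is generated, checking a candidate requires simulating both salesmen. A small Python sketch, assuming the tours do not return to the starting vertex (the question does not say either way) and that `dist` is a symmetric matrix given as a list of lists; the function names are hypothetical:

```python
def tour_length(dist, start, pick):
    """Total distance when always moving to the unvisited vertex chosen by
    `pick`: pass `min` for the nearest-first salesman, `max` for farthest-first."""
    unvisited = set(range(len(dist))) - {start}
    current, total = start, 0
    while unvisited:
        nxt = pick(unvisited, key=lambda v: dist[current][v])
        total += dist[current][nxt]
        unvisited.remove(nxt)
        current = nxt
    return total

def tour_difference(dist, start_near, start_far):
    """Absolute difference between the two salesmen's total distances."""
    return abs(tour_length(dist, start_far, max)
               - tour_length(dist, start_near, min))
```

Because all distances are required to be unique, `min` and `max` never have to break ties, so the simulation is deterministic. Each tour costs O(N^2), which makes it cheap enough to call inside a search over scaling factors.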

Finding all points in certain radius of another point

I am making a simple game and stumbled upon this problem. Assume several points in 2D space. What I want is to make points close to each other interact in some way.
Let me throw a picture here for better understanding of the problem:
Now, the problem isn't about computing the distance. I know how to do that.
At first I had around 10 points and I could simply check every combination, but as you can already guess, this becomes extremely inefficient as the number of points increases. What if I had a million points in total, but all of them were very distant from each other?
I'm trying to find a suitable data structure, or a way to look at this problem, so that every point only has to mind its surroundings and not the whole space. Are there any known algorithms for this? I don't know exactly what this problem is called, so I can't google for what I want.
If you don't know of such an algorithm, all ideas are very welcome.
This is a range searching problem. More specifically - the 2-d circular range reporting problem.
Quoting from "Solving Query-Retrieval Problems by Compacting Voronoi Diagrams" [Aggarwal, Hansen, Leighton, 1990]:
Input: A set P of n points in the Euclidean plane E²
Query: Find all points of P contained in a disk in E² with radius r centered at q.
The best results were obtained in "Optimal Halfspace Range Reporting in Three Dimensions" [Afshani, Chan, 2009]. Their method uses an O(n)-space data structure that supports queries in O(log n + k) worst-case time. The structure can be built by a randomized algorithm that runs in O(n log n) expected time (n is the number of input points, and k is the number of output points).
The CGAL library supports circular range search queries. See here.
You're still going to have to iterate through every point, but there are two optimizations you can perform:
1) You can eliminate obvious points by first checking whether the difference in x and the difference in y are each smaller than the radius (like Brent already mentioned in another answer).
2) Instead of calculating the distance, you can calculate the square of the distance and compare it to the square of the allowed radius. This saves you from performing expensive square root calculations.
This is probably the best performance you're gonna get.
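Both optimizations fit in a short linear scan. A Python sketch with hypothetical names, where `points` is a list of (x, y) tuples:

```python
def within_radius(points, center, radius):
    """Return the points lying within `radius` of `center`."""
    cx, cy = center
    r2 = radius * radius
    hits = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        # 1) Cheap rejection: outside the bounding square of the circle.
        if abs(dx) > radius or abs(dy) > radius:
            continue
        # 2) Compare squared distances, so no square root is needed.
        if dx * dx + dy * dy <= r2:
            hits.append((x, y))
    return hits
```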
This looks like a nearest neighbor problem. You should be using the kd tree for storing the points.
Space partitioning is what you want.
If you could get those points to be sorted by x and y values, then you could quickly pick out those points (binary search?) which are within a box of the central point: x +- r, y +- r. Once you have that subset of points, then you can use the distance formula to see if they are within the radius.
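A sketch of the sorted-coordinates idea in Python, using the standard `bisect` module. For simplicity it sorts by x only (ties broken by y) and binary-searches the x range, then filters on y and on the squared distance; the function names are hypothetical:

```python
from bisect import bisect_left, bisect_right

def build_index(points):
    """Sort (x, y) tuples by x, then y."""
    return sorted(points)

def query(sorted_pts, center, r):
    """Return the points within distance r of center."""
    cx, cy = center
    # Binary search for the slice with cx - r <= x <= cx + r.
    lo = bisect_left(sorted_pts, (cx - r, float("-inf")))
    hi = bisect_right(sorted_pts, (cx + r, float("inf")))
    return [(x, y) for x, y in sorted_pts[lo:hi]
            if abs(y - cy) <= r and (x - cx) ** 2 + (y - cy) ** 2 <= r * r]
```

This only prunes along one axis, so it helps most when the points are spread widely in x; a grid or kd-tree prunes in both dimensions.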
I assume you have a minimum and maximum X and Y coordinate? If so how about this.
Call our radius R, Xmax-Xmin X, and Ymax-Ymin Y.
Have a 2D matrix of size [X/R, Y/R] of doubly linked lists. Put each dot structure on the correct linked list.
To find dots you need to interact with, you only need check your cell plus your 8 neighbors.
Example: if X and Y are 100 each, and R is 1, then put a dot at 43.2, 77.1 in cell [43,77]. You'll check cells [42,76] [43,76] [44,76] [42,77] [43,77] [44,77] [42,78] [43,78] [44,78] for matches. Note that not all cells in your own box will match (for instance 43.9,77.9 is in the same list but more than 1 unit distant), and you'll always need to check all 8 neighbors.
As dots move (it sounds like they'd move?) you'd simply unlink them (fast and easy with a doubly linked list) and relink them in their new location. Moving any dot is O(1). Moving them all is O(n).
If that array size gives too many cells, you can make bigger cells with the same algo and probably same code; just be prepared for fewer candidate dots to actually be close enough. For instance if R=1 and the map is a million times R by a million times R, you wouldn't be able to make a 2D array that big. Better perhaps to have each cell be 1000 units wide? As long as density was low, the same code as before would probably work: check each dot only against other dots in this cell plus the neighboring 8 cells. Just be prepared for more candidates failing to be within R.
If some cells will have a lot of dots, then instead of each cell having a linked list, perhaps the cell should have a red-black tree indexed by X coordinate. Even in the same cell the vast majority of other cell members will be too far away, so just traverse the tree from X-R to X+R. Rather than loop over all dots and go diving into each one's tree, perhaps you could instead iterate through the tree looking for X coords within R and, if and when you find them, calculate the distance. As you traverse one cell's tree from low to high X, you need only check the tree of the neighboring cell to the left while in the first R entries.
You could also go to cells smaller than R. You'd have fewer candidates that fail to be close enough. For instance with R/2, you'd check 25 linked lists instead of 9, but have on average (if randomly distributed) 25/36ths as many dots to check. That might be a minor gain.
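The basic cell scheme (cells of width R, check your own cell plus the 8 neighbors) can be sketched in Python with a dictionary standing in for the 2D matrix, which also sidesteps the problem of very large, mostly empty maps. Names are hypothetical, and plain lists replace the linked lists:

```python
from collections import defaultdict

def build_grid(points, r):
    """Map each cell (floor(x/r), floor(y/r)) to the indices of its points."""
    grid = defaultdict(list)
    for i, (x, y) in enumerate(points):
        grid[(int(x // r), int(y // r))].append(i)
    return grid

def close_pairs(points, r):
    """Return all index pairs (i, j), i < j, whose points are within r."""
    grid = build_grid(points, r)
    pairs = []
    for (cx, cy), members in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                # .get avoids creating empty cells while iterating the grid
                for j in grid.get((cx + dx, cy + dy), ()):
                    for i in members:
                        if i < j:  # report each pair once
                            px, py = points[i]
                            qx, qy = points[j]
                            if (px - qx) ** 2 + (py - qy) ** 2 <= r * r:
                                pairs.append((i, j))
    return pairs
```

Using the example above with R = 1, a dot at (43.2, 77.1) lands in cell (43, 77) and is only compared against dots in that cell and its 8 neighbors.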

How does one decide the final clusters when using the mean shift algorithm?

I am reading a bit about the mean shift clustering algorithm, and this is what I've got so far. For each point in your data set: select all points within a certain distance of it (including the original point), calculate the mean of all these points, and repeat until these means stabilize.
What I'm confused about is how one goes from here to deciding what the final clusters are, and under what conditions these means merge. Also, does the distance used to select the points fluctuate through the iterations or does it remain constant?
Thanks in advance
The mean shift cluster finding is a simple iterative process which is actually guaranteed to converge. The iteration starts from a starting point x, and the iteration steps are (note that x may have several components, as the algorithm will work in higher dimensions, as well):
calculate the weighted mean position x' of all points around x; maybe the simplest form is to calculate the average of the positions of all points within distance d of x, but the Gaussian function is also commonly used and mathematically beneficial.
set x <- x'
repeat until the difference between x and x' is very small
This can be used in cluster analysis by starting with different values of x. The final values will end up at different cluster centers. The number of clusters cannot be known (other than it is <= number of points).
The upper level algorithm is:
go through a selection of starting values
for each value, calculate the convergence value as shown above
if the value is not already in the list of convergence values, add it to the list (allow some reasonable tolerance for numerical imprecision)
And then you have the list of clusters. The only difficult thing is finding a reasonable selection of starting values. It is easy with one or two dimensions, but with higher dimensionalities exhaustive searches are not quite possible.
All starting points that end up at the same mode (point of convergence) belong to the same cluster.
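Putting the iteration and the upper-level loop together, here is a minimal Python sketch. It assumes the flat kernel (plain average of all points within a constant distance d), uses every data point as a starting value, and merges modes that agree within a tolerance; the names and the tolerance values are illustrative assumptions:

```python
import math

def shift_to_mode(x, points, d, tol=1e-6, max_iter=1000):
    """Repeat x <- mean of the points within d of x until x stops moving."""
    for _ in range(max_iter):
        near = [p for p in points if math.dist(p, x) <= d]
        new_x = tuple(sum(c) / len(near) for c in zip(*near))
        if math.dist(new_x, x) < tol:
            return new_x
        x = new_x
    return x

def mean_shift(points, d, merge_tol=1e-3):
    """Return (modes, labels); labels[i] is the index of the mode point i reaches."""
    modes, labels = [], []
    for p in points:
        m = shift_to_mode(p, points, d)
        for k, known in enumerate(modes):
            if math.dist(m, known) < merge_tol:  # same mode within tolerance
                labels.append(k)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return modes, labels
```

With two well-separated groups of three points each and d = 3, this yields two modes and the labels [0, 0, 0, 1, 1, 1]. In this sketch the distance d stays constant across iterations.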
It may be of interest that if you are doing this on a 2D image, it should be sufficient to calculate the gradient (i.e. the first iteration) for each pixel. This is a fast operation with common convolution techniques, and then it is relatively easy to group the pixels into clusters.

Combinatorial optimization

Suppose we have a connected and undirected graph: G=(V,E).
Definition of connected-set: a group of points belonging to V of G forms a valid connected-set iff every point in this group is within T-1 edges of any other point in the same group, where T is the number of points in the group.
Please note that a connected-set is just a connected subgraph of G without the edges but with the points.
And we have an arbitrary function F defined on connected-set, i.e given an arbitrary connected-set CS F(CS) will give us a real value.
Two connected-sets are said disjoint if their union is not a connected set.
For a visual explanation, please see the graph below:
In the graph, the red,black,green point sets are all valid connected-sets, green set is disjoint to red set, but black set is not disjoint to the red one.
Now the question:
We want to find a bunch of disjoint connected-sets from G so that:
(1)every connected-set has at least K points. (K is a global parameter).
(2) the sum of their function values, Σ F(CS), is maximized.
Is there any efficient algorithm to tackle such a problem other than an exhaustive search?
For example, the graph can be a planar graph in the 2D Euclidean plane, and the function value F of a connected-set CS can be defined as the area of the minimum bounding rectangle of all the points in CS(minimum bounding rectangle is the smallest rectangle enclosing all the points in the CS).
If you can define your function and prove it is a Submodular Function (property analogous to that of Convexity in continuous Optimization) then there are very efficient (strongly polynomial) algorithms that will solve your problem e.g. Minimum Norm Point.
To prove that your function is Submodular you only need to prove the following:
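For reference, the standard definition of a submodular set function F over the subsets of V (assumed here to be the condition intended) is:

```latex
F(A) + F(B) \;\ge\; F(A \cup B) + F(A \cap B)
\qquad \text{for all } A, B \subseteq V .
```

An equivalent formulation is the diminishing-returns property: for all $A \subseteq B \subseteq V$ and every $v \in V \setminus B$, $F(A \cup \{v\}) - F(A) \ge F(B \cup \{v\}) - F(B)$.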
There are several available implementations of the Minimum Norm Point algorithm e.g. Matlab Toolbox for Submodular Function Optimization
I doubt there is an efficient algorithm since for a complete graph for instance, you cannot solve the problem without knowing the value of F on every subgraph (except if you have assumptions on F: monotonicity for instance).
Nevertheless, I'd go for a non-deterministic algorithm. Try simulated annealing, with transitions being:
Remove a point from a set (if it stays connected)
Move a point from a set to another (if they stay connected)
Remove a set
Add a set with one point
Good luck, this seems to be a difficult problem.
For such a general F, it is not an easy task to draft an optimized algorithm, far from the brute force approach.
For instance, since we want to find a bunch of CS where F(CS) is maximized, should we assume we want actually to find max(Σ F(CS)) for all CS or the highest F value from all possible CS, max(F(csi))? We don't know for sure.
Also, F being arbitrary, we cannot estimate the probability of having F(cs+p1) > F(cs) => F(cs+p1+p2) > F(cs).
However, we can still discuss it:
It seems we can deduce from the problem that we can treat each CS independently, meaning if n = F(cs1) adding any cs2 (being disjoint from cs1) will have no impact on the n value.
It seems also believable, and this is where we should be able to get some gain, that the calculation of F can be made starting from any point of a CS, and, in general, if CS = cs1+cs2, F(CS) = F(cs1+cs2) = F(cs2+cs1).
Then we want to inject memoization in the algorithm in order to speed up the process when a CS is grown up little by little in order to find max(F(cs)) [considering F general, the dynamic programming approach, for instance starting from a CS made of all points, then reducing it little by little, doesn't seem to have a big interest].
Ideally, we could start with a CS made of one point and extend it by one point at a time, checking and storing F values (for each subset). Each test would first check whether the F value already exists in order not to recalculate it; then repeat the process for another point, and so on, to find the best subsets that maximize F. For a large number of points, this is a very lengthy process.
A more reasonable approach would be to try random points and grow the CS up to a given size, then try another area distinct from the bigger CS obtained at the previous stage. One could try to assess the probability explained above, and direct the algorithm in a certain way depending on the result.
But, again due to lack of F properties, we can expect an exponential space need via memoization (like storing F(p1,...,pn), for all subsets). And an exponential complexity.
I would use dynamic programming. You can start out rephrasing your problem as a node coloring problem:
Your goal is to assign a color to each node. (In other words you are looking for a coloring of the nodes)
The available colors are black and white.
In order to judge a coloring you have to examine the set of "maximal connected sets of black nodes".
A set of black nodes is called connected if the induced subgraph is connected
A connected set of black nodes is called maximal if none of the nodes in the set has a black neighbor in the original graph that is not contained in the set.
Your goal is to find the coloring that maximizes ΣF(CS). (Here you sum over the "maximal connected sets of black nodes")
There are some extra constraints, as specified in your original post.
Perhaps you could look for an algorithm that does something like the following
Pick a node
Try to color the chosen node white
Look for a coloring of the remaining nodes that maximizes ΣF(CS)
Try to color the chosen node black
Look for a coloring of the remaining nodes that maximizes ΣF(CS)
Each time you have colored a node white then you can examine whether or not the graph has become "decomposable" (I made up this word. It is not official):
A partially colored graph is called "decomposable" if it contains a pair of non-white nodes that are not connected by any path that does not contain a white node.
If your partially colored graph is decomposable then you can split your problem into two sub-problems.
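The decomposability test amounts to deleting the white nodes and counting the connected components of what remains. A Python sketch, assuming the graph is given as an adjacency dictionary `adj` (node -> list of neighbors) and `white` is the set of nodes already colored white; both names are illustrative:

```python
from collections import deque

def nonwhite_components(adj, white):
    """Connected components of the graph after deleting the white nodes."""
    seen = set(white)  # white nodes are never entered
    components = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        comp, queue = [start], deque([start])
        while queue:  # breadth-first search over non-white nodes
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.append(v)
                    queue.append(v)
        components.append(comp)
    return components

def is_decomposable(adj, white):
    """True if some pair of non-white nodes is separated by white nodes."""
    return len(nonwhite_components(adj, white)) > 1
```

Each component returned can then be treated as an independent sub-problem, since no connected set of black nodes can span two of them.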
EDIT: I added an alternative idea and deleted it again. :)

3D clustering Algorithm

Problem Statement:
I have the following problem:
There are more than a billion points in 3D space. The goal is to find the top N points that have the largest number of neighbors within a given distance R. Another condition is that the distance between any two of those top N points must be greater than R. The distribution of the points is not uniform; it is very common for certain regions of the space to contain a lot of points.
The aim is to find an algorithm that scales well to many processors and has a small memory requirement.
Normal spatial decomposition is not sufficient for this kind of problem because of the non-uniform distribution. An irregular spatial decomposition that evenly divides the number of points may help with the problem. I would really appreciate it if someone could shed some light on how to solve this problem.
Use an Octree. For 3D data with a limited value domain that scales very well to huge data sets.
Many of the aforementioned methods such as locality sensitive hashing are approximate versions designed for much higher dimensionality where you can't split sensibly anymore.
Splitting at each level into 8 bins (2^d for d=3) works very well. And since you can stop when there are too few points in a cell, and build a deeper tree where there are a lot of points, that should fit your requirements quite well.
For more details, see Wikipedia:
Alternatively, you could try to build an R-tree. But the R-tree tries to balance, making it harder to find the most dense areas. For your particular task, this drawback of the Octree is actually helpful! The R-tree puts a lot of effort into keeping the tree depth equal everywhere, so that each point can be found at approximately the same time. However, you are only interested in the dense areas, which will be found on the longest paths in the Octree without even having to look at the actual points yet!
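To illustrate how the depth of an octree points at dense regions, here is a toy recursive sketch in Python. It keeps the points in memory, which would not be practical for a billion points, and assumes all points lie in the half-open cube [lo, hi); `deepest_leaf`, `leaf_size` and `max_depth` are hypothetical names and parameters:

```python
def deepest_leaf(points, lo, hi, leaf_size=4, max_depth=16, depth=0):
    """Split the box [lo, hi) into octants until a cell holds at most
    leaf_size points; return (depth, count, lo, hi) of the deepest leaf."""
    if len(points) <= leaf_size or depth == max_depth:
        return depth, len(points), lo, hi
    mid = tuple((a + b) / 2 for a, b in zip(lo, hi))
    octants = {}
    for p in points:
        # One boolean per axis: is the point in the upper half?
        key = tuple(c >= m for c, m in zip(p, mid))
        octants.setdefault(key, []).append(p)
    best = None
    for key, pts in octants.items():
        sub_lo = tuple(m if k else a for k, a, m in zip(key, lo, mid))
        sub_hi = tuple(b if k else m for k, b, m in zip(key, hi, mid))
        leaf = deepest_leaf(pts, sub_lo, sub_hi, leaf_size, max_depth, depth + 1)
        # Prefer deeper leaves, then fuller ones.
        if best is None or leaf[:2] > best[:2]:
            best = leaf
    return best
```

Only occupied octants are created, so sparse regions cost nothing, and the recursion on different octants is independent, which is what makes this kind of structure easy to distribute.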
I don't have a definite answer for you, but I have a suggestion for an approach that might yield a solution.
I think it's worth investigating locality-sensitive hashing. I think dividing the points evenly and then applying this kind of LSH to each set should be readily parallelisable. If you design your hashing algorithm such that the bucket size is defined in terms of R, it seems likely that for a given set of points divided into buckets, the points satisfying your criteria are likely to exist in the fullest buckets.
Having performed this locally, perhaps you can apply some kind of map-reduce-style strategy to combine spatial buckets from different parallel runs of the LSH algorithm in a step-wise manner, making use of the fact that you can begin to exclude parts of your problem space by discounting entire buckets. Obviously you'll have to be careful about edge cases that span different buckets, but I suspect that at each stage of merging, you could apply different bucket sizes/offsets such that you remove this effect (e.g. perform merging spatially equivalent buckets, as well as adjacent buckets). I believe this method could be used to keep memory requirements small (i.e. you shouldn't need to store much more than the points themselves at any given moment, and you are always operating on small(ish) subsets).
If you're looking for some kind of heuristic then I think this result will immediately yield something resembling a "good" solution - i.e. it will give you a small number of probable points which you can check satisfy your criteria. If you are looking for an exact answer, then you are going to have to apply some other methods to trim the search space as you begin to merge parallel buckets.
Another thought I had was that this could relate to finding the metric k-center. It's definitely not the exact same problem, but perhaps some of the methods used in solving that are applicable in this case. The problem is that this assumes you have a metric space in which computing the distance metric is possible - in your case, however, the presence of a billion points makes it undesirable and difficult to perform any kind of global traversal (e.g. sorting of the distances between points). As I said, just a thought, and perhaps a source of further inspiration.
Here are some possible parts of a solution.
There are various choices at each stage,
which will depend on Ncluster, on how fast the data changes,
and on what you want to do with the means.
3 steps: quantize, box, K-means.
1) quantize: reduce the input XYZ coordinates to say 8 bits each,
by taking 2^8 percentiles of X,Y,Z separately.
This will speed up the whole flow without much loss of detail.
You could sort all 1G points, or just a random 1M,
to get 8-bit x0 < x1 < ... x256, y0 < y1 < ... y256, z0 < z1 < ... z256
with 2^(30-8) points in each range.
To map float X -> 8 bit x, unrolled binary search is fast —
see Bentley, Pearls p. 95.
Added: Kd trees
split any point cloud into different-sized boxes, each with ~ Leafsize points —
much better than splitting X Y Z as above.
But afaik you'd have to roll your own Kd tree code
to split only the first say 16M boxes, and keep counts only, not the points.
2) box: count the number of points in each 3d box,
[xj .. xj+1, yj .. yj+1, zj .. zj+1].
The average box will have 2^(30-3*8) points;
the distribution will depend on how clumpy the data is.
If some boxes are too big or get too many points, you could
a) split them into 8,
b) track the centre of the points in each box,
otherwise just take box midpoints.
3) K-means clustering
on the 2^(3*8) box centres.
(Google parallel "k means" -> 121k hits.)
This depends strongly on K aka Ncluster, also on your radius R.
A rough approach would be to grow a
of the say 27*Ncluster boxes with the most points,
then take the biggest ones subject to your Radius constraint.
(I like to start with a
Minimum spanning tree,
then remove the K-1 longest links to get K clusters.)
See also
Color quantization .
I'd make Nbit, here 8, a parameter from the beginning.
What is your Ncluster ?
Added: if your points are moving in time, see
collision-detection-of-huge-number-of-circles on SO.
I would also suggest using an octree. The OctoMap framework is very good at dealing with huge 3D point clouds. It does not store all the points directly, but updates the occupancy density of every node (aka 3D box).
After the tree is built, you can use a simple iterator to find the node with the highest density. If you would like to model the point density or distribution inside the nodes, OctoMap is very easy to adapt.
Here you can see how it was extended to model the point distribution using a planar model.
Just an idea: create a graph with the given points as vertices and an edge between two points when their distance is < R.
Creating this kind of graph is similar to spatial decomposition. Your questions can be answered with a local search in the graph. The first is finding the vertices with maximum degree; the second is finding a maximal set of mutually unconnected vertices among the maximum-degree vertices.
I think the creation of the graph and the search can be made parallel. This approach can have a large memory requirement. Splitting the domain and working with graphs for smaller volumes can reduce the memory need.
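On a small scale the graph idea can be sketched as a greedy pass over the vertices in order of decreasing degree. This brute-force Python version is quadratic in the number of points and is only meant to show the selection logic, not to scale to a billion points; `top_n_dense` is a hypothetical name:

```python
import math

def top_n_dense(points, r, n):
    """Greedily pick up to n point indices with the most neighbors within r,
    keeping every pair of picked points more than r apart."""
    # degree[i] = number of other points closer than r (the vertex degree)
    degree = [
        sum(1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) < r)
        for i, p in enumerate(points)
    ]
    chosen = []
    for i in sorted(range(len(points)), key=lambda k: -degree[k]):
        if all(math.dist(points[i], points[c]) > r for c in chosen):
            chosen.append(i)
            if len(chosen) == n:
                break
    return chosen
```

In a scaled-up version the degree computation would be done per spatial cell (with a margin of R around each cell), which is the part that parallelizes naturally.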
