3D clustering algorithm

Problem Statement:
I have the following problem:
There are more than a billion points in 3D space. The goal is to find the top N points which have the largest number of neighbors within a given distance R. Another condition is that the distance between any two of those top N points must be greater than R. The distribution of the points is not uniform; it is very common for certain regions of the space to contain a lot of points.
Goal:
To find an algorithm that can scale well to many processors and has a small memory requirement.
Thoughts:
Normal spatial decomposition is not sufficient for this kind of problem due to the non-uniform distribution. An irregular spatial decomposition that evenly divides the number of points may help with the problem. I would really appreciate it if someone could shed some light on how to solve this problem.

Use an Octree. For 3D data with a limited value domain it scales very well to huge data sets.
Many of the aforementioned methods such as locality sensitive hashing are approximate versions designed for much higher dimensionality where you can't split sensibly anymore.
Splitting at each level into 8 bins (2^d for d=3) works very well. Since you can stop splitting when a cell contains too few points, and build a deeper tree where there are a lot of points, this should fit your requirements quite well.
For more details, see Wikipedia:
https://en.wikipedia.org/wiki/Octree
Alternatively, you could try to build an R-tree. But the R-tree tries to balance, making it harder to find the most dense areas; for your particular task, the Octree's lack of balancing is actually helpful. The R-tree puts a lot of effort into keeping the tree depth equal everywhere, so that each point can be found in approximately the same time. You, however, are only interested in the dense areas, which will be found on the longest paths in the Octree without even having to look at the actual points yet!
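To make the idea concrete, here is a minimal count-only octree sketch in Python. The class and function names, and the leaf_size and max_depth parameters, are my own choices rather than anything from a particular library; it only illustrates the point that dense regions keep splitting and therefore end up on the deepest paths.

```python
# Count-only octree sketch: each node keeps a point count and its cube bounds.
class OctreeNode:
    def __init__(self, center, half, depth=0, leaf_size=64, max_depth=12):
        self.center, self.half = center, half
        self.depth, self.leaf_size, self.max_depth = depth, leaf_size, max_depth
        self.count = 0
        self.children = None   # list of 8 children once this node splits
        self.points = []       # kept only while this node is a leaf

    def _child_index(self, p):
        cx, cy, cz = self.center
        return (p[0] >= cx) + 2 * (p[1] >= cy) + 4 * (p[2] >= cz)

    def insert(self, p):
        self.count += 1
        if self.children is not None:
            self.children[self._child_index(p)].insert(p)
            return
        self.points.append(p)
        if len(self.points) > self.leaf_size and self.depth < self.max_depth:
            self._split()

    def _split(self):
        h = self.half / 2.0
        self.children = []
        for i in range(8):
            off = (h if i & 1 else -h, h if i & 2 else -h, h if i & 4 else -h)
            child_center = tuple(c + o for c, o in zip(self.center, off))
            self.children.append(OctreeNode(child_center, h, self.depth + 1,
                                            self.leaf_size, self.max_depth))
        for q in self.points:
            self.children[self._child_index(q)].insert(q)
        self.points = []

def densest_leaves(root, top=10):
    """Leaves ordered by depth then count: the longest paths hold the densest cells."""
    leaves, stack = [], [root]
    while stack:
        n = stack.pop()
        if n.children is None:
            leaves.append(n)
        else:
            stack.extend(n.children)
    return sorted(leaves, key=lambda n: (n.depth, n.count), reverse=True)[:top]
```

Building the tree is a single pass over the points, and only counts plus small leaf buffers are kept; inserting disjoint spatial chunks on different workers and merging subtree counts afterwards is one plausible way to parallelise it.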

I don't have a definite answer for you, but I have a suggestion for an approach that might yield a solution.
I think it's worth investigating locality-sensitive hashing. I think dividing the points evenly and then applying this kind of LSH to each set should be readily parallelisable. If you design your hashing algorithm such that the bucket size is defined in terms of R, it seems likely that for a given set of points divided into buckets, the points satisfying your criteria are likely to exist in the fullest buckets.
Having performed this locally, perhaps you can apply some kind of map-reduce-style strategy to combine spatial buckets from different parallel runs of the LSH algorithm in a step-wise manner, making use of the fact that you can begin to exclude parts of your problem space by discounting entire buckets. Obviously you'll have to be careful about edge cases that span different buckets, but I suspect that at each stage of merging, you could apply different bucket sizes/offsets such that you remove this effect (e.g. perform merging spatially equivalent buckets, as well as adjacent buckets). I believe this method could be used to keep memory requirements small (i.e. you shouldn't need to store much more than the points themselves at any given moment, and you are always operating on small(ish) subsets).
If you're looking for some kind of heuristic then I think this result will immediately yield something resembling a "good" solution - i.e. it will give you a small number of probable points which you can check satisfy your criteria. If you are looking for an exact answer, then you are going to have to apply some other methods to trim the search space as you begin to merge parallel buckets.
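As a rough illustration of the "bucket size defined in terms of R" idea, the simplest locality-sensitive hash for this metric is just the integer grid cell of side R. The function names below are mine and this is only a sketch:

```python
from collections import defaultdict

def bucket_points(points, R):
    """Hash each 3D point into a cube-shaped bucket of side R."""
    buckets = defaultdict(list)
    for p in points:
        key = (int(p[0] // R), int(p[1] // R), int(p[2] // R))
        buckets[key].append(p)
    return buckets

def candidate_count(buckets, key):
    """Upper bound on neighbours-within-R for points in `key`: every neighbour
    must lie in this bucket or one of its 26 surrounding buckets."""
    x, y, z = key
    return sum(len(buckets.get((x + dx, y + dy, z + dz), ()))
               for dx in (-1, 0, 1) for dy in (-1, 0, 1) for dz in (-1, 0, 1))
```

The fullest buckets (plus their 26-bucket shells) give the candidate regions to merge in the map-reduce step; exact neighbour counts still need a distance check, which is also what handles points falling near bucket boundaries.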
Another thought I had was that this could relate to finding the metric k-center. It's definitely not the exact same problem, but perhaps some of the methods used in solving that are applicable in this case. The problem is that this assumes you have a metric space in which computing the distance metric is possible - in your case, however, the presence of a billion points makes it undesirable and difficult to perform any kind of global traversal (e.g. sorting of the distances between points). As I said, just a thought, and perhaps a source of further inspiration.

Here are some possible parts of a solution.
There are various choices at each stage,
which will depend on Ncluster, on how fast the data changes,
and on what you want to do with the means.
3 steps: quantize, box, K-means.
1) quantize: reduce the input XYZ coordinates to say 8 bits each,
by taking 2^8 percentiles of X,Y,Z separately.
This will speed up the whole flow without much loss of detail.
You could sort all 1G points, or just a random 1M,
to get 8-bit x0 < x1 < ... x256, y0 < y1 < ... y256, z0 < z1 < ... z256
with 2^(30-8) points in each range.
To map float X -> 8 bit x, unrolled binary search is fast —
see Bentley, Pearls p. 95.
Added: Kd trees
split any point cloud into different-sized boxes, each with ~ Leafsize points —
much better than splitting X Y Z as above.
But afaik you'd have to roll your own Kd tree code
to split only the first say 16M boxes, and keep counts only, not the points.
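A sketch of the quantize step with NumPy, assuming the points sit in an (n, 3) float array; np.searchsorted plays the role of the unrolled binary search, and the function names and the random-sample size are my own choices:

```python
import numpy as np

def build_quantizer(points, nbit=8, sample=1_000_000, rng=np.random.default_rng(0)):
    """Per-axis percentile edges from a random subsample (step 1 above)."""
    idx = rng.choice(len(points), size=min(sample, len(points)), replace=False)
    sub = points[idx]                                   # (m, 3) float array
    qs = np.linspace(0.0, 100.0, 2**nbit + 1)           # 257 percentile edges per axis
    return [np.percentile(sub[:, d], qs) for d in range(3)]

def quantize(points, edges, nbit=8):
    """Map float XYZ to nbit codes; searchsorted stands in for the binary search."""
    codes = np.empty((len(points), 3), dtype=np.uint8)
    for d in range(3):
        codes[:, d] = np.clip(np.searchsorted(edges[d], points[:, d]) - 1,
                              0, 2**nbit - 1)
    return codes
```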
2) box: count the number of points in each 3d box,
[xj .. xj+1, yj .. yj+1, zj .. zj+1].
The average box will have 2^(30-3*8) points;
the distribution will depend on how clumpy the data is.
If some boxes are too big or get too many points, you could
a) split them into 8,
b) track the centre of the points in each box,
otherwise just take box midpoints.
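A small sketch of the box-counting step, assuming the codes array from the previous sketch (pure-Python dicts for clarity; for 2^30 points you would pack the three 8-bit codes into one integer and use numpy.bincount instead):

```python
from collections import defaultdict
import numpy as np

def count_boxes(codes, points=None):
    """Count points per (x, y, z) box; optionally accumulate per-box centres."""
    counts = defaultdict(int)
    sums = defaultdict(lambda: np.zeros(3)) if points is not None else None
    for i, c in enumerate(map(tuple, codes)):
        counts[c] += 1
        if sums is not None:
            sums[c] += points[i]
    centres = {c: sums[c] / counts[c] for c in counts} if sums is not None else None
    return counts, centres
```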
3) K-means clustering on the 2^(3*8) box centres.
(Google parallel "k means" -> 121k hits.)
This depends strongly on K aka Ncluster, also on your radius R.
A rough approach would be to grow a heap of the say 27*Ncluster boxes with the most points,
then take the biggest ones subject to your Radius constraint.
(I like to start with a Minimum spanning tree,
then remove the K-1 longest links to get K clusters.)
See also Color quantization.
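A hedged sketch of the rough approach in step 3, assuming the counts and centres dictionaries from the previous sketch: keep the ~27*Ncluster fullest boxes in a heap, then greedily take the biggest ones whose centres stay more than R apart.

```python
import heapq
import numpy as np

def top_boxes_with_separation(counts, centres, n_cluster, R):
    """Greedy sketch: the fullest boxes, filtered by the pairwise > R condition."""
    k = 27 * n_cluster
    biggest = heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])
    chosen = []
    for key, cnt in biggest:                 # already sorted, largest count first
        c = np.asarray(centres[key])
        if all(np.linalg.norm(c - np.asarray(centres[o])) > R for o, _ in chosen):
            chosen.append((key, cnt))
        if len(chosen) == n_cluster:
            break
    return chosen
```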
I'd make Nbit, here 8, a parameter from the beginning.
What is your Ncluster ?
Added: if your points are moving in time, see
collision-detection-of-huge-number-of-circles on SO.

I would also suggest using an octree. The OctoMap framework is very good at dealing with huge 3D point clouds. It does not store all the points directly, but updates the occupancy density of every node (aka 3D box).
After the tree is built, you can use a simple iterator to find the node with the highest density. If you would like to model the point density or distribution inside the nodes, the OctoMap is very easy to adapt.
Here you can see how it was extended to model the point distribution using a planar model.

Just an idea: create a graph with the given points and an edge between two points whenever their distance < R.
Creating this kind of graph is similar to a spatial decomposition. Your questions can then be answered with local search in the graph: the first is finding the vertices with maximum degree, the second is finding a maximal unconnected set among those max-degree vertices.
I think both the graph creation and the search can be made parallel. This approach can have a large memory requirement; splitting the domain and working with graphs for smaller volumes can reduce the memory needed.
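A minimal sketch of this idea in Python (names are mine, and the pure-Python loops are only illustrative at billion-point scale): a grid of cell size R limits the distance checks needed to compute vertex degrees, and a greedy pass then picks high-degree points that are pairwise farther than R apart.

```python
from collections import defaultdict
from itertools import product
import math

def neighbor_degrees(points, R):
    """Degree of each point in the 'distance < R' graph, via an R-sized grid."""
    grid = defaultdict(list)
    for i, p in enumerate(points):
        grid[(int(p[0] // R), int(p[1] // R), int(p[2] // R))].append(i)
    deg = [0] * len(points)
    for i, p in enumerate(points):
        cx, cy, cz = int(p[0] // R), int(p[1] // R), int(p[2] // R)
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in grid.get((cx + dx, cy + dy, cz + dz), ()):
                if j != i and math.dist(p, points[j]) < R:
                    deg[i] += 1
    return deg

def top_n_separated(points, deg, n, R):
    """Greedy 'maximal unconnected set': highest-degree points, pairwise farther than R."""
    chosen = []
    for i in sorted(range(len(points)), key=lambda i: deg[i], reverse=True):
        if all(math.dist(points[i], points[j]) > R for j in chosen):
            chosen.append(i)
        if len(chosen) == n:
            break
    return chosen
```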

Related

Using a spatial index to find points within range of each other

I'm trying to find a spatial index structure suitable for a particular problem: using a union-find data structure, I want to connect/associate points that are within a certain range of each other.
I have a lot of points and I'm trying to optimize an existing solution by using a better spatial index.
Right now, I'm using a simple 2D grid indexing each square of width [threshold distance] of my point map, and I look for potential unions by searching for points in adjacent squares in the grid.
Then I compute the squared Euclidean distance to the adjacent cells combinations, which I compare to my squared threshold, and I use the union-find structure (optimized using path compression and etc.) to build groups of points.
Here is some illustration of the method. The single black points actually represent the set of points that belong to a cell of the grid, and the outgoing colored arrows represent the actual distance comparisons with the outside points.
(I'm also checking for potential connected points that belong to the same cells).
By using this pattern I make sure I'm not doing any distance comparison twice by using a proper "neighbor cell" pattern that doesn't overlap with already tested stuff when I iterate over the grid cells.
Issue is : this approach is not even close to being fast enough, and I'm trying to replace the "spatial grid index" method with something that could maybe be faster.
I've looked into quadtrees as a suitable spatial index for this problem, but I don't think it is suitable to solve it (I don't see any way of performing repeated "neighbours" checks for a particular cell more effectively using a quadtree), but maybe I'm wrong on that.
Therefore, I'm looking for a better algorithm/data structure to effectively index my points and query them for proximity.
Thanks in advance.
I have some comments:
1) I think your problem is equivalent to a "spatial join". A spatial join takes two sets of geometries, for example a set R of rectangles and a set P of points, and finds for every rectangle all points in that rectangle. In your case, R would be the rectangles (edge length = 2 * max distance) around each point and P the set of your points. Searching for spatial join may give you some useful references.
2) You may want to have a look at space filling curves. Space filling curves create a linear order for a set of spatial entities (points) with the property that points that are close in the linear ordering are usually also close in space (and vice versa). This may be useful when developing an algorithm; a small Z-order example follows after this list.
3) Have look at OpenVDB. OpenVDB has a spatial index structure that is highly optimized to traverse 'voxel'-cells and their neighbors.
4) Have a look at the PH-Tree (disclaimer: this is my own project). The PH-Tree is somewhat like a quadtree but uses low level bit operations to optimize navigation. It is also Z-ordered/Morton-ordered (see space filling curves above). You can create a window-query for each point which returns all points within that rectangle. To my knowledge, the PH-Tree is the fastest index structure for this kind of operation, especially if you typically have only 9 points in a rectangle. If you are interested in the code, the V13 implementation is probably the fastest, however the V16 should be much easier to understand and modify.
I tried on my rather old desktop machine: using about 1,000,000 points I can do about 200,000 window queries per second, so it should take about 5 seconds to find all neighbors for every point.
If you are using Java, my spatial index collection may also be useful.
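As a small illustration of point 2), a Z-order (Morton) key interleaves the bits of the x and y coordinates, so sorting points by this key keeps spatially close points near each other in the linear order most of the time. The function name and bit width below are my own choices:

```python
def morton_key_2d(x, y, bits=16):
    """Interleave the low `bits` bits of integer x and y into one Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

# Example: order grid points by their Morton key before indexing or partitioning work.
points = [(3, 5), (100, 7), (4, 4), (101, 8)]
points.sort(key=lambda p: morton_key_2d(p[0], p[1]))
```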
A standard approach to this is the "sweep and prune" algorithm. Sort all the points by X coordinate, then iterate through them. As you do, maintain the lowest index of the point which is within the threshold distance (in X) of the current point. The points within that range are candidates for merging. You then do the same thing sorting by Y. Then you only need to check the Euclidean distance for those pairs which showed up in both the X and Y scans.
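Here is a rough sketch of that sweep for 2D points, with my own function names; the candidate pairs from the X sweep are intersected with those from the Y sweep before the exact squared-distance test:

```python
def axis_candidate_pairs(points, axis, threshold):
    """Sweep along one axis: all pairs whose coordinate difference is <= threshold."""
    order = sorted(range(len(points)), key=lambda i: points[i][axis])
    pairs = set()
    lo = 0
    for hi in range(len(order)):
        while points[order[hi]][axis] - points[order[lo]][axis] > threshold:
            lo += 1                      # lowest index still within threshold of `hi`
        for k in range(lo, hi):
            pairs.add(frozenset((order[k], order[hi])))
    return pairs

def close_pairs(points, threshold):
    """Sweep-and-prune sketch: intersect X and Y candidates, then do the exact check."""
    cand = (axis_candidate_pairs(points, 0, threshold)
            & axis_candidate_pairs(points, 1, threshold))
    t2 = threshold * threshold
    out = []
    for pair in cand:
        i, j = tuple(pair)
        dx, dy = points[i][0] - points[j][0], points[i][1] - points[j][1]
        if dx * dx + dy * dy <= t2:
            out.append((i, j))
    return out
```

Each surviving pair is then fed to the union-find structure as before; note the candidate sets can still grow large when many points share nearly the same X or Y coordinate.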
Note that with your current union-find approach, you can end up unioning points which are quite far from each other, if there are a bunch of nearby points "bridging" them. So your basic approach -- of unioning groups of points based on proximity -- can induce an arbitrary amount of distance error, not just the threshold distance.

Clustering while trying to minimise spare capacity

I am trying to cluster ~30 million points (x and y co-ordinates) into clusters - the addition that makes it challenging is that I am trying to minimise the spare capacity of each cluster while also ensuring the maximum distance between the cluster and any one point is not huge (>5km or so).
Each cluster is made from equipment that can serve 64 points, if a cluster contains less than 65 points then we need one of these pieces of equipment. However if a cluster contains 65 points then we need two of these pieces of equipment, this means we have a spare capacity of 63 for that cluster. We also need to connect each point to the cluster, so the distance from each point to the cluster is also a factor in the equipment cost.
Ultimately I am trying to minimise the cost of equipment which seems to be an equivalent problem to minimising the average spare capacity whilst also ensuring the distance from the cluster to any one point is less than 5km (an approximation, but will do for the thought experiment - maybe there are better ways to impose this restriction).
I have tried multiple approaches:
K-means
Most should know how this works
Average spare capacity of 32
Runs in O(n^2)
Sorted list of a-b distances
I tried an alternative approach like so:
Initialise cluster points by randomly selecting points from the data, then:
1) Determine the distance matrix between every point and every cluster
2) Flatten it into a list
3) Sort the list
4) Go from smallest to longest distance assigning points to clusters
5) Assign points to clusters until they reach 64, then no more can be assigned
6) Stop iterating through the list once all points have been assigned
7) Update the cluster centroid based on the assigned points
Repeat steps 1 - 7 until the cluster locations converge (as in K-means)
Collect cluster locations that are nearby into one cluster
This had an average spare capacity of approximately 0, by design
This worked well for my test data set, but as soon as I expanded to the full set (30 million points) it took far too long, probably because we have to sort the full list O(NlogN) and then iterate over it until all points have been assigned O(NK) and then repeat that until convergence
Linear Programming
This was quite simple to implement using libraries, but also took far too long again because of the complexity
I am open to any suggestions on possible algorithms/languages best suited to do this. I have experience with machine learning, but couldn't think of an obvious way of doing this using that.
Let me know if I missed any information out.
Since you have both pieces already, my first new suggestion would be to partition the points with k-means for k = n/6400 (you can tweak this parameter) and then use integer programming on each super-cluster. When I get a chance I'll write up my other suggestion, which involves a randomly shifted quadtree dissection.
Old pre-question-edit answer below.
You seem more concerned with minimizing equipment and running time than having the tightest possible clusters, so here's a suggestion along those lines.
The idea is to start with 1-node clusters and then use (almost) perfect matchings to pair clusters with each other, doubling the size. Do this 6 times to get clusters of 64.
To compute the matching, we use the centroid of each cluster to represent it. Now we just need an approximate matching on a set of points in the Euclidean plane. With apologies to the authors of many fine papers on Euclidean matching, here's an O(n log n) heuristic. If there are two or fewer points, match them in the obvious way. Otherwise, choose a random point P and partition the other points by comparing their (alternate between x- and y-) coordinate with P (as in kd-trees), breaking ties by comparing the other coordinate. Assign P to a half with an odd number of points if possible. (If both are even, let P be unmatched.) Recursively match the halves.
Let p = ceil(N/64).
That is the optimum number of equipment.
Let s = ceil(sqrt(p)).
Sort the data by the x axis. Slice the data into slices of 64*s entries each (except the last slice).
In each slice, sort the data by the y axis. Take 64 objects each and assign them to one equipment. It's easy to see that all but possibly the last equipment are optimally used, and close together.
Sorting is so incredibly cheap that this will be extremely fast. Give it a try, and you'll likely be surprised by the quality vs. runtime trade-off! I wouldn't be surprised if it finds competitive results to most that you tried except the LP approach, and it will run in just few seconds.
Alternatively: sort all objects by their Hilbert curve coordinate. Partition into p partitions, assign one equipment each.
The second one is much harder to implement and likely slower. It can sometimes be better, but also sometimes worse.
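A short sketch of the sort-and-slice approach above, in plain Python with names of my own choosing:

```python
import math

def slice_into_groups(points, capacity=64):
    """Sort-and-slice heuristic: about ceil(N/64) groups, each of at most 64 points."""
    n = len(points)
    p = math.ceil(n / capacity)              # optimum number of equipment
    s = math.ceil(math.sqrt(p))              # number of x-slices
    by_x = sorted(points, key=lambda pt: pt[0])
    groups = []
    for i in range(0, n, capacity * s):      # each slice holds 64*s points (last one fewer)
        strip = sorted(by_x[i:i + capacity * s], key=lambda pt: pt[1])
        for j in range(0, len(strip), capacity):
            groups.append(strip[j:j + capacity])
    return groups
```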
If distance is more important to you, try the following strategy: build a spatial index (e.g., a k-d-tree, or if you have Haversine, an R*-tree). For each point, find the 63 nearest neighbors and store this. Sort by distance, descending. This will give you a "difficulty" score. Now don't put equipment at the most difficult point, but nearby - at its neighbor with the smallest max(distance to the difficult point, distance to its 63rd nearest neighbor). Repeat this for a few points, but after about 10% of the data, begin again the entire procedure with the remaining points.
The problem is that you didn't clearly specify when to prefer keeping the distances small, even when using more equipment... You could incorporate this by only considering neighbors within a certain bound. The point with the fewest neighbors within the bound is then the hardest; and it's best covered by a neighbor with the most uncovered points within the bound, etc.

Clustering elements based on highest similarity

I'm working with Docker images which consist of a set of re-usable layers. Now given a collection of images, I would like to combine images which have a large amount of shared layers.
To be more exact: Given a collection of N images, I want to create clusters where all images in a cluster share more than X percent of services with each other. Each image is only allowed to belong to one cluster.
My own research points in the direction of cluster algorithms where I use a similarity measure to decide which images belong in a cluster together. The similarity measure I know how to write. However, I'm having difficulty finding an exact algorithm or pseudo-algorithm to get started.
Can someone recommend an algorithm to solve this problem or provide pseudo-code please?
EDIT: after some more searching I believe I'm looking for something like this hierarchical clustering ( https://github.com/lbehnke/hierarchical-clustering-java ) but with a threshold X so that neighbors with less than X% similarity don't get combined and stay in a separate cluster.
I believe you are a developer and you have no experience with data science?
There are a number of clustering algorithms and they have their advantages and disadvantages (please consult https://en.wikipedia.org/wiki/Cluster_analysis), but I think the solution to your problem is simpler than one might think.
I assume that N is small enough that you can store a matrix with N^2 float values in RAM? If this is the case, you are in a very comfortable situation. You write that you know how to implement the similarity measure, so just calculate the measure for all N^2 pairs and store it in a matrix (it is a symmetric matrix, so only half of it needs to be stored). Please ensure that your similarity measure assigns a special value to pairs of images whose similarity is below the X% threshold, like 0 or infinity (it depends on whether you treat the function as a similarity measure or as a distance). I think the cleanest solution is to assign 1 to pairs where the similarity is greater than the X% threshold and 0 otherwise.
After that, treat it just like a graph. Take the first vertex and run, e.g., a depth-first search or any other graph walking routine. This is your first cluster. After that, take the first not-yet-visited vertex and repeat the graph walk. Of course you can store the graph as an adjacency list to save memory.
This algorithm assumes that you do not really care how similar the images are, or which pairs are more similar than others, only whether they are similar enough (the similarity measure is greater than a given threshold).
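A minimal sketch of the procedure, with `similarity` standing in for your own measure; all names here are mine, and it recomputes each pair twice for the sake of brevity:

```python
def cluster_by_threshold(items, similarity, threshold):
    """Connect items whose similarity exceeds the threshold, then take
    connected components via depth-first search."""
    n = len(items)
    adj = [[j for j in range(n) if j != i and similarity(items[i], items[j]) > threshold]
           for i in range(n)]
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(adj[v])
        clusters.append(comp)
    return clusters
```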
Unfortunately, in cluster analysis it is common that 100% of the possible pairs have to be computed. It is possible to save some number of distance calls using some fancy data structures for k-nearest neighbor search, but you have to ensure that your similarity measure satisfies the triangle inequality.
If you are not satisfied with this answer, please specify more details of your problem and read about:
K-means (main disadvantage: you have to specify number of clusters)
Hierarchical clustering (slow computation time; at the top all images are in one cluster, and you have to cut the dendrogram at the proper distance)
Spectral clustering (for graphs, but I think it is too complicated for this easy problem)
I ended up solving the problem by using hierarchical clustering and then traversing each branch of the dendrogram top to bottom until I find a cluster where the distance is below a threshold. Worst case there is no such cluster but then I'll end up in a leaf of the dendrogram which means that element is in a cluster of its own.

Clustering 2d integer coordinates into sets of at most N points

I have a number of points on a relatively small 2-dimensional grid, which wraps around in both dimensions. The coordinates can only be integers. I need to divide them into sets of at most N points that are close together, where N will be quite a small cut-off, I suspect 10 at most.
I'm designing an AI for a game, and I'm 99% certain using minimax on all the game pieces will give me a usable lookahead of about 1 move, if that. However distant game pieces should be unable to affect each other until we're looking ahead by a large number of moves, so I want to partition the game into a number of sub-games of N pieces at a time. However, I need to ensure I select a reasonable N pieces at a time, i.e. ones that are close together.
I don't care whether outliers are left on their own or lumped in with their least-distant cluster. Breaking up natural clusters larger than N is inevitable, and only needs to be sort-of reasonable. Because this is used in a game AI with limited response time, I'm looking for as fast an algorithm as possible, and willing to trade off accuracy for performance.
Does anyone have any suggestions for algorithms to look at adapting? K-means and relatives don't seem appropriate, as I don't know how many clusters I want to find but I have a bound on how large clusters I want. I've seen some evidence that approximating a solution by snapping points to a grid can help some clustering algorithms, so I'm hoping the integer coordinates makes the problem easier. Hierarchical distance-based clustering will be easy to adapt to the wrap-around coordinates, as I just plug in a different distance function, and also relatively easy to cap the size of the clusters. Are there any other ideas I should be looking at?
I'm more interested in algorithms than libraries, though libraries with good documentation of how they work would be welcome.
EDIT: I originally asked this question when I was working on an entry for the Fall 2011 AI Challenge, which I sadly never got finished. The page I linked to has a reasonably short reasonably high-level description of the game.
The two key points are:
Each player has a potentially large number of ants
Every ant is given orders every turn, moving 1 square either north, south, east or west; this means the branching factor of the game is O(4^ants).
In the contest there were also strict time constraints on each bot's turn. I had thought to approach the game by using minimax (the turns are really simultaneous, but as a heuristic I thought it would be okay), but I feared there wouldn't be time to look ahead very many moves if I considered the whole game at once. But as each ant moves only one square each turn, two ants that are N spaces apart by the shortest route cannot possibly interfere with one another until we're looking ahead N/2 moves.
So the solution I was searching for was a good way to pick smaller groups of ants at a time and minimax each group separately. I had hoped this would allow me to search deeper into the move-tree without losing much accuracy. But obviously there's no point using a very expensive clustering algorithm as a time-saving heuristic!
I'm still interested in the answer to this question, though more in what I can learn from the techniques than for this particular contest, since it's over! Thanks for all the answers so far.
The median-cut algorithm is very simple to implement in 2D and would work well here. Your outliers would end up as groups of 1 which you could discard or whatever.
Further explanation requested:
Median cut is a quantization algorithm, but all quantization algorithms are special-case clustering algorithms. In this case the algorithm is extremely simple: find the smallest bounding box containing all points, split the box along its longest side (and shrink it to fit the points), repeat until the target number of boxes is achieved.
A more detailed description and coded example
Wiki on color quantization has some good visuals and links
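Here is a hedged 2D median-cut sketch that stops once every box holds at most max_size points, matching the "at most N" constraint of the question (names are mine, and it ignores the wrap-around coordinates):

```python
def median_cut(points, max_size):
    """Keep splitting any group larger than max_size along the longer side of its
    shrunk-to-fit bounding box, cutting at the median point."""
    groups, done = [list(points)], []
    while groups:
        g = groups.pop()
        if len(g) <= max_size:
            done.append(g)
            continue
        xs = [p[0] for p in g]
        ys = [p[1] for p in g]
        axis = 0 if (max(xs) - min(xs)) >= (max(ys) - min(ys)) else 1
        g.sort(key=lambda p: p[axis])
        mid = len(g) // 2
        groups.append(g[:mid])
        groups.append(g[mid:])
    return done
```

Outliers end up as tiny groups, which you can discard or merge into their nearest neighbour, as suggested above.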
Since you are writing a game where (I assume) only a constant number of pieces move between each clustering, you can take advantage of an online algorithm to get constant update times.
The property of not locking yourself to a number of clusters is called Nonstationary, I believe.
This paper seems to have a good algorithm with both of the above two properties: Improving the Robustness of 'Online Agglomerative Clustering Method' Based on Kernel-Induce Distance Measures (you might be able to find it elsewhere as well).
Here is a nice video showing the algorithm at work.
Construct a graph G=(V, E) over your grid, and partition it.
Since you are interested in algorithms rather than libraries, here is a recent paper:
Daniel Delling, Andrew V. Goldberg, Ilya Razenshteyn, and Renato F. Werneck. Graph Partitioning with Natural Cuts. In 25th International Parallel and Distributed Processing Symposium (IPDPS'11). IEEE Computer Society, 2011.
From the text:
The goal of the graph partitioning problem is to find a minimum-cost partition P such that the size of each cell is bounded by U.
So you will set U=10.
You can calculate a minimum spanning tree and remove the longest edges. Then you can calculate the k-means. Remove another long edge and calculate the k-means. Rinse and repeat until you have N=10. I believe this algorithm is named single-link k-means and the clusters are similar to Voronoi diagrams:
"The single-link k-clustering algorithm ... is precisely Kruskal's algorithm ... equivalent to finding an MST and deleting the k-1 most expensive edges."
See for example here: https://stats.stackexchange.com/questions/1475/visualization-software-for-clustering
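A small sketch of the single-link/Kruskal view of this for modest point counts (it enumerates all pairs, and the names are mine): adding edges shortest-first and stopping at k components is equivalent to building the MST and deleting its k-1 longest edges.

```python
def single_link_clusters(points, k):
    """Kruskal-style single-link clustering into k clusters."""
    n = len(points)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    edges = sorted((sum((a - b) ** 2 for a, b in zip(points[i], points[j])), i, j)
                   for i in range(n) for j in range(i + 1, n))
    components = n
    for _, i, j in edges:
        if components == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())
```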
Consider the case where you only want two clusters. If you run k-means, then you will get two points, and the division between the two clusters is a plane orthogonal to the line between the centres of the two clusters. You can find out which cluster a point is in by projecting it down to the line and then comparing its position on the line with a threshold (e.g. take the dot product between the direction of the line and the vector from one of the two cluster centres to the point).
For two clusters, this means that you can adjust the sizes of the clusters by moving the threshold. You can sort the points on their distance along the line connecting the two cluster centres and then move the threshold along the line quite easily, trading off the inequality of the split with how neat the clusters are.
You probably don't have k=2, but you can run this hierarchically, by dividing into two clusters, and then sub-dividing the clusters.
(After comment)
I'm not good with pictures, but here is some relevant algebra.
With k-means we divide points according to their distance from cluster centres, so for a point Xi and two centres Ai and Bi we might be interested in
SUM_i (Xi - Ai)^2 - SUM_i(Xi - Bi)^2
This is SUM_i Ai^2 - SUM_i Bi^2 + 2 SUM_i (Bi - Ai)Xi
So a point gets assigned to either cluster depending on the sign of K + 2(B - A).X - a constant plus the dot product between the vector to the point and the vector joining the two cluster centres. In two dimensions, the dividing line between the points on the plane that end up in one cluster and the points on the plane that end up in the other cluster is a line perpendicular to the line between the two cluster centres. What I am suggesting is that, in order to control the number of points after your division, you compute (B - A).X for each point X and then choose a threshold that divides all points in one cluster from all points in the other cluster. This amounts to sliding the dividing line up or down the line between the two cluster centres, while keeping it perpendicular to the line between them.
Once you have dot products Yi, where Yi = SUM_j (Bj - Aj) Xij, a measure of how closely grouped a cluster is is SUM_i (Yi - Ym)^2, where Ym is the mean of the Yi in the cluster. I am suggesting that you use the sum of these values for the two clusters to tell how good a split you have. To move a point into or out of a cluster and get the new sum of squares without recomputing everything from scratch, note that SUM_i (Si + T)^2 = SUM_i Si^2 + 2T SUM_i Si + nT^2 (where n is the number of terms), so if you keep track of sums and sums of squares you can work out what happens to a sum of squares when you add or subtract a value from every component, as the mean of the cluster changes when you add or remove a point.
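A compact NumPy sketch of the idea (names mine; it assumes 0 < left_fraction < 1): project every point onto B - A, cut at the quantile that gives the sizes you want, and score the split by the projected sums of squares described above.

```python
import numpy as np

def balanced_two_split(X, A, B, left_fraction=0.5):
    """Threshold split along the line joining two cluster centres A and B."""
    X, A, B = np.asarray(X), np.asarray(A), np.asarray(B)
    y = X @ (B - A)                        # Yi = (B - A) . Xi for every point
    order = np.argsort(y)
    cut = int(len(X) * left_fraction)      # slide the dividing plane along the line
    left, right = order[:cut], order[cut:]
    # tightness of each half, measured along the projection as in the text
    score = (((y[left] - y[left].mean()) ** 2).sum()
             + ((y[right] - y[right].mean()) ** 2).sum())
    return left, right, score
```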

Space partitioning algorithm

I have a set of points which are contained within a rectangle. I'd like to split the rectangle into subrectangles based on point density (giving a number of subrectangles or a desired density, whichever is easiest).
The partitioning doesn't have to be exact (almost any approximation better than regular grid would do), but the algorithm has to cope with the large number of points - approx. 200 millions. The desired number of subrectangles however is substantially lower (around 1000).
Does anyone know any algorithm which may help me with this particular task?
Just to understand the problem.
The following is crude and performs badly, but I want to know if the result is what you want:
Assumption> Number of rectangles is even
Assumption> Point distribution is markedly 2D (no big accumulation in one line)
Procedure>
Bisect n/2 times in either axis, looping from one end to the other of each previously determined rectangle counting "passed" points and storing the number of passed points at each iteration. Once counted, bisect the rectangle selecting by the points counted in each loop.
Is that what you want to achieve?
I think I'd start with the following, which is close to what #belisarius already proposed. If you have any additional requirements, such as preferring 'nearly square' rectangles to 'long and thin' ones you'll need to modify this naive approach. I'll assume, for the sake of simplicity, that the points are approximately randomly distributed.
1) Split your initial rectangle in 2 with a line parallel to the short side of the rectangle and running exactly through the mid-point.
2) Count the number of points in both half-rectangles. If they are equal (enough) then go to step 4. Otherwise, go to step 3.
3) Based on the distribution of points between the half-rectangles, move the line to even things up again. So if, perchance, the first cut split the points 1/3, 2/3, move the line half-way into the heavy half of the rectangle. Go to step 2. (Be careful not to get trapped here, moving the line in ever decreasing steps first in one direction, then the other.)
4) Now, pass each of the half-rectangles in to a recursive call to this function, at step 1.
I hope that outlines the proposal well enough. It has limitations: it will produce a number of rectangles equal to some power of 2, so adjust it if that's not good enough. I've phrased it recursively, but it's ideal for parallelisation. Each split creates two tasks, each of which splits a rectangle and creates two more tasks.
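A sketch of this recursive approach, under the simplification that instead of nudging the line iteratively we cut directly at the point-count median of the longer side, which is where steps 2-3 settle (the names and the depth parameter are my own):

```python
def split_rectangle(points, rect, depth):
    """Recursively bisect rect = ((x0, y0), (x1, y1)) so each half gets about
    half the points; depth levels give 2**depth subrectangles."""
    (x0, y0), (x1, y1) = rect
    if depth == 0 or len(points) <= 1:
        return [(rect, points)]
    axis = 0 if (x1 - x0) >= (y1 - y0) else 1
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    cut = 0.5 * (pts[mid - 1][axis] + pts[mid][axis])
    if axis == 0:
        left_rect, right_rect = ((x0, y0), (cut, y1)), ((cut, y0), (x1, y1))
    else:
        left_rect, right_rect = ((x0, y0), (x1, cut)), ((x0, cut), (x1, y1))
    return (split_rectangle(pts[:mid], left_rect, depth - 1)
            + split_rectangle(pts[mid:], right_rect, depth - 1))
```

As noted above, each of the two recursive calls is an independent task, so this parallelises naturally.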
If you don't like that approach, perhaps you could start with a regular grid with some multiple (10 - 100 perhaps) of the number of rectangles you want. Count the number of points in each of these tiny rectangles. Then start gluing the tiny rectangles together until the less-tiny rectangle contains (approximately) the right number of points. Or, if it satisfies your requirements well enough, you could use this as a discretisation method and integrate it with my first approach, but only place the cutting lines along the boundaries of the tiny rectangles. This would probably be much quicker as you'd only have to count the points in each tiny rectangle once.
I haven't really thought about the running time of either of these; I have a preference for the former approach 'cos I do a fair amount of parallel programming and have oodles of processors.
You're after a standard Kd-tree or binary space partitioning tree, I think. (You can look it up on Wikipedia.)
Since you have very many points, you may wish to only approximately partition the first few levels. In this case, you should take a random sample of your 200M points--maybe 200k of them--and split the full data set at the midpoint of the subsample (along whichever axis is longer). If you actually choose the points at random, the probability that you'll miss a huge cluster of points that need to be subdivided will be approximately zero.
Now you have two problems of about 100M points each. Divide each along the longer axis. Repeat until you stop taking subsamples and split along the whole data set. After ten breadth-first iterations you'll be done.
If you have a different problem--you must provide tick marks along the X and Y axis and fill in a grid along those as best you can, rather than having the irregular decomposition of a Kd-tree--take your subsample of points and find the 0/32, 1/32, ..., 32/32 percentiles along each axis. Draw your grid lines there, then fill the resulting 1024-element grid with your points.
R-tree
Good question.
I think the area you need to investigate is "computational geometry" and the "k-partitioning" problem. There's a link that might help get you started here
You might find that the problem itself is NP-hard which means a good approximation algorithm is the best you're going to get.
Would K-means clustering or a Voronoi diagram be a good fit for the problem you are trying to solve?
That looks like cluster analysis.
Would a QuadTree work?
A quadtree is a tree data structure in which each internal node has exactly four children. Quadtrees are most often used to partition a two dimensional space by recursively subdividing it into four quadrants or regions. The regions may be square or rectangular, or may have arbitrary shapes. This data structure was named a quadtree by Raphael Finkel and J.L. Bentley in 1974. A similar partitioning is also known as a Q-tree. All forms of Quadtrees share some common features:
They decompose space into adaptable cells
Each cell (or bucket) has a maximum capacity. When maximum capacity is reached, the bucket splits
The tree directory follows the spatial decomposition of the Quadtree
