How do I group objects in a set by proximity? - algorithm

I have a set containing thousands of addresses. If I can get the longitude and latitude of each address, how do I split the set into groups by proximity?
Further, I may want to retry the 'clustering' according to different rules:
N groups
M addresses per group
maximum distance between any address in a group

You could try the k-means clustering algorithm.
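As a rough illustration of the k-means suggestion, here is a minimal sketch of Lloyd's algorithm in plain Python. It treats each address as an (x, y) pair, which assumes the coordinates are close enough together to be used as a flat grid. The function name `kmeans` and its parameters are made up for this example.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Lloyd's algorithm on 2D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        for i, (x, y) in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: (x - centroids[c][0]) ** 2 + (y - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels
```

This covers the "N groups" rule only. Plain k-means does not control the group size or the maximum distance within a group.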

You want vector quantization:
http://en.wikipedia.org/wiki/Vector_quantization
"It works by dividing a large set of points (vectors) into groups having approximately the same number of points closest to them. Each group is represented by its centroid point, as in k-means and some other clustering algorithms."
Here the vectors are the geographic coordinates of each address, and you can feed your algorithms with other parameters depending on your constraints (proximity, group size, number of groups...).
You can start with k-means, but in my experience a Voronoi-based algorithm is more flexible. A good introduction here.

It depends a bit on the scale of the data you want to cluster. The brute-force approach is to calculate the distance between every pair of points and store the results in a distance matrix. The full matrix has N^2 entries, but since the distance from A to B is the same as from B to A (and the distance from a point to itself is zero), you only need N(N-1)/2 of them, roughly N^2/2.
For latitude/longitude coordinates that are relatively close together, you can sometimes get away with treating them as an x,y grid and calculating the Cartesian distance. Since the real world is not flat, the Cartesian distance will have some error. For a more exact calculation, which you should use if your addresses are spread across the country, see this link from Mathforum.com.
If you don't have the scale to handle the entire distance matrix, you will need to do some algorithm programming to increase efficiency.
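To make the two distance options concrete, here is a small sketch. It uses the haversine formula for the "more exact" great-circle distance (the linked Mathforum page is not reproduced here, so this choice is an assumption) and an equirectangular approximation for the flat-grid shortcut. The function names and the Earth radius constant are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; itself an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def flat_km(lat1, lon1, lat2, lon2):
    """Flat-grid approximation: scale longitude by cos(latitude), then use Pythagoras."""
    x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
    y = math.radians(lat2 - lat1)
    return EARTH_RADIUS_KM * math.hypot(x, y)

def distance_pairs(coords, dist=haversine_km):
    """Distances for each unordered pair (i < j): N*(N-1)/2 entries."""
    return {
        (i, j): dist(*coords[i], *coords[j])
        for i in range(len(coords))
        for j in range(i + 1, len(coords))
    }
```

For a few thousand addresses the pair dictionary holds a few million entries, which is usually manageable. Beyond that, a spatial index is the more realistic route.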

The "N groups" and "M addresses per group" constraints are not independent: for a fixed number of addresses, choosing one determines the other, so you cannot set both freely.

Build a matrix of distances between all addresses.
Starting with a random address, sort the matrix by ascending distance to that address.
Removing the addresses from the matrix as you go along, place the addresses closest to the start address into a new group until you reach your criteria (size of group or max distance).
Once a group is full, choose another random address and re-sort the matrix by distance to that address.
Continue like this until all addresses are taken out of the matrix.
If addresses were distributed evenly, each group would have a roughly circular shape around the start address. The problem comes when start addresses are near existing groups. When this happens, the new group will wrap around the old one and could even circle it completely if your stop criterion is only group size. If you use the max-distance constraint, then this is not going to happen (assuming no other constraints).
I don't really know if this is a good way of doing it but it's what I'd try. I'm sure lots of optimization would be required. Especially for addresses on the edges.
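The greedy procedure described in this answer could be sketched as follows. This is only one reading of it: instead of keeping a full distance matrix it re-sorts the unassigned points around each new start address, and `max_dist` is measured from the start address rather than between every pair in the group. The names are hypothetical.

```python
import math
import random

def greedy_groups(points, max_size=None, max_dist=None, seed=0):
    """Group 2D points around randomly chosen start points, nearest first.

    Returns a list of groups, each a list of indices into `points`.
    """
    rng = random.Random(seed)
    remaining = list(range(len(points)))
    groups = []
    while remaining:
        start = remaining.pop(rng.randrange(len(remaining)))
        # Sort the unassigned points by ascending distance to the start point.
        remaining.sort(key=lambda i: math.dist(points[start], points[i]))
        group = [start]
        while remaining:
            if max_size is not None and len(group) >= max_size:
                break  # group is full
            if max_dist is not None and math.dist(points[start], points[remaining[0]]) > max_dist:
                break  # nearest unassigned point is already too far away
            group.append(remaining.pop(0))
        groups.append(group)
    return groups
```

Calling it with only `max_size` gives the "M addresses per group" rule, and calling it with only `max_dist` gives a distance-bounded variant. With neither set, everything ends up in one group.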

Related

Clustering while trying to minimise spare capacity

I am trying to cluster ~30 million points (x and y co-ordinates) into clusters - the addition that makes it challenging is I am trying to minimise the spare capacity of each cluster while also ensuring the maximum distance between the cluster and any one point is not huge (>5km or so).
Each cluster is served by equipment that can handle 64 points. If a cluster contains fewer than 65 points, we need one of these pieces of equipment. However, if a cluster contains 65 points, we need two of them, which means a spare capacity of 63 for that cluster. We also need to connect each point to the cluster, so the distance from each point to the cluster is also a factor in the equipment cost.
Ultimately I am trying to minimise the cost of equipment which seems to be an equivalent problem to minimising the average spare capacity whilst also ensuring the distance from the cluster to any one point is less than 5km (an approximation, but will do for the thought experiment - maybe there are better ways to impose this restriction).
I have tried multiple approaches:
K-means
Most should know how this works
Average spare capacity of 32
Runs in O(n^2)
Sorted list of a-b distances
I tried an alternative approach like so:
Initialise cluster points by randomly selecting points from the data
Determine the distance matrix between every point and every cluster
Flatten it into a list
Sort the list
Go from smallest to longest distance assigning points to clusters
Assign clusters points until they reach 64, then no more can be assigned
Stop iterating through the list once all points have been assigned
Update the cluster centroid based on the assigned points
Repeat steps 1 - 7 until the cluster locations converge (as in K-means)
Collect cluster locations that are nearby into one cluster
This had an average spare capacity of approximately 0, by design
This worked well for my test data set, but as soon as I expanded to the full set (30 million points) it took far too long, probably because we have to sort the full list, O(NlogN), then iterate over it until all points have been assigned, O(NK), and then repeat that until convergence.
Linear Programming
This was quite simple to implement using libraries, but also took far too long again because of the complexity
I am open to any suggestions on possible algorithms/languages best suited to do this. I have experience with machine learning, but couldn't think of an obvious way of doing this using that.
Let me know if I missed any information out.
Since you have both pieces already, my first new suggestion would be to partition the points with k-means for k = n/6400 (you can tweak this parameter) and then use integer programming on each super-cluster. When I get a chance I'll write up my other suggestion, which involves a randomly shifted quadtree dissection.
Old pre-question-edit answer below.
You seem more concerned with minimizing equipment and running time than having the tightest possible clusters, so here's a suggestion along those lines.
The idea is to start with 1-node clusters and then use (almost) perfect matchings to pair clusters with each other, doubling the size. Do this 6 times to get clusters of 64.
To compute the matching, we use the centroid of each cluster to represent it. Now we just need an approximate matching on a set of points in the Euclidean plane. With apologies to the authors of many fine papers on Euclidean matching, here's an O(n log n) heuristic. If there are two or fewer points, match them in the obvious way. Otherwise, choose a random point P and partition the other points by comparing their (alternate between x- and y-) coordinate with P (as in kd-trees), breaking ties by comparing the other coordinate. Assign P to a half with an odd number of points if possible. (If both are even, let P be unmatched.) Recursively match the halves.
Let p = ceil(N/64).
That is the optimum number of equipment.
Let s = ceil(sqrt(p)).
Sort the data by the x axis. Slice the data into slices of 64*s entries each (except the last slice, which may be shorter).
In each slice, sort the data by the y axis. Take 64 objects each and assign them to one equipment. It's easy to see that all but possibly the last equipment are optimally used, and close.
Sorting is so incredibly cheap that this will be extremely fast. Give it a try, and you'll likely be surprised by the quality vs. runtime trade-off! I wouldn't be surprised if its results are competitive with most of what you tried, except the LP approach, and it will run in just a few seconds.
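The sort-and-slice recipe above can be sketched in a few lines. The function name and the `capacity` parameter (64 in the question) are illustrative, and points are assumed to be (x, y) tuples.

```python
import math

def sort_tile_groups(points, capacity=64):
    """Partition 2D points into groups of at most `capacity` by sorting on x, then y."""
    if not points:
        return []
    p = math.ceil(len(points) / capacity)  # minimum number of groups
    s = math.ceil(math.sqrt(p))            # number of vertical slices
    by_x = sorted(points, key=lambda pt: pt[0])
    slice_len = capacity * s
    groups = []
    for i in range(0, len(by_x), slice_len):
        # Within a slice, sort by y and cut into runs of `capacity` points.
        strip = sorted(by_x[i:i + slice_len], key=lambda pt: pt[1])
        groups.extend(strip[j:j + capacity] for j in range(0, len(strip), capacity))
    return groups
```

Because every slice except possibly the last holds a multiple of `capacity` points, only the final group can be partially filled. Nothing here enforces the 5 km distance limit, so that would need a separate check.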
Alternatively: sort all objects by their Hilbert curve coordinate. Partition into p partitions, assign one equipment each.
The second one is much harder to implement and likely slower. It can sometimes be better, but also sometimes worse.
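For the Hilbert curve alternative, a minimal sketch might look like this. It uses the well-known iterative conversion from grid cell to curve position and assumes the coordinates have already been scaled to integers in an n-by-n grid with n a power of two. That scaling step, and the names used, are assumptions of this example.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve on an n-by-n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_groups(points, capacity=64, n=1024):
    """Sort integer grid points (0 <= x, y < n) by Hilbert index and cut into runs."""
    ordered = sorted(points, key=lambda pt: hilbert_index(n, pt[0], pt[1]))
    return [ordered[i:i + capacity] for i in range(0, len(ordered), capacity)]
```

Cutting into fixed runs of `capacity` points gives the same number of groups as the sort-and-slice approach. Only the ordering differs.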
If distance is more important to you, try the following strategy: build a spatial index (e.g., a k-d tree, or if you use Haversine distance, an R*-tree). For each point, find the 63 nearest neighbors and store them. Sort by distance, descending. This will give you a "difficulty" score. Now don't put equipment at the most difficult point, but nearby: at its neighbor with the smallest max(distance to the difficult point, distance to its 63rd nearest neighbor). Repeat this for a few points, but after about 10% of the data, begin the entire procedure again with the remaining points.
The problem is that you didn't specify clearly when to prefer keeping the distances small, even at the cost of using more equipment. You could incorporate this by only considering neighbors within a certain bound. The point with the fewest neighbors within the bound is then the hardest, and it is best covered by a neighbor with the most uncovered points within the bound, and so on.

Clustering elements based on highest similarity

I'm working with Docker images which consist of a set of re-usable layers. Now given a collection of images, I would like to combine images which have a large amount of shared layers.
To be more exact: Given a collection of N images, I want to create clusters where all images in a cluster share more than X percent of services with each other. Each image is only allowed to belong to one cluster.
My own research points in the direction of cluster algorithms where I use a similarity measure to decide which images belong in a cluster together. The similarity measure I know how to write. However, I'm having difficulty finding an exact algorithm or pseudo-algorithm to get started.
Can someone recommend an algorithm to solve this problem or provide pseudo-code please?
EDIT: after some more searching I believe I'm looking for something like this hierarchical clustering ( https://github.com/lbehnke/hierarchical-clustering-java ) but with a threshold X so that neighbors with less than X% similarity don't get combined and stay in a separate cluster.
I believe you are a developer and you have no experience with data science?
There are a number of clustering algorithms, each with advantages and disadvantages (please consult https://en.wikipedia.org/wiki/Cluster_analysis), but I think the solution to your problem is simpler than you might expect.
I assume that N is small enough that you can store a matrix of N^2 float values in RAM? If so, you are in a very comfortable situation. You write that you know how to implement the similarity measure, so just calculate it for all pairs and store the results in a matrix (the matrix is symmetric, so only half of it needs to be stored). Make sure the measure assigns a special value to pairs whose similarity is below X%, such as 0 or infinity (depending on whether you treat the function as a similarity measure or as a distance). I think the cleanest solution is to assign 1 to pairs whose similarity is greater than the X% threshold and 0 otherwise.
After that, treat it just like a graph. Take the first vertex and run, e.g., a depth-first search or any other graph traversal. Every vertex it reaches forms your first cluster. Then take the first unvisited vertex and repeat the traversal. Of course, you can store the graph as an adjacency list to save memory.
This algorithm assumes that you do not care how similar the images are or which pairs are more similar than others, only whether they are similar enough (the similarity measure is greater than a given threshold).
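A minimal sketch of this graph approach, assuming each image is represented as a set of layer IDs. The Jaccard function is a hypothetical stand-in for whatever similarity measure you already have, and the other names are made up too.

```python
def threshold_components(items, similarity, threshold):
    """Connected components of the graph linking pairs with similarity > threshold."""
    n = len(items)
    # Adjacency list: an edge joins i and j when they are similar enough.
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if similarity(items[i], items[j]) > threshold:
                adj[i].append(j)
                adj[j].append(i)
    seen, clusters = [False] * n, []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:  # iterative depth-first search
            v = stack.pop()
            component.append(items[v])
            for w in adj[v]:
                if not seen[w]:
                    seen[w] = True
                    stack.append(w)
        clusters.append(component)
    return clusters

def jaccard(a, b):
    """Hypothetical similarity: shared layers divided by all layers of two images."""
    return len(a & b) / len(a | b) if a | b else 1.0
```

Note that a component can chain images together: A may be similar to B and B to C without A and C passing the threshold. If every pair in a cluster must pass it, this is only a first approximation.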
Unfortunately, in cluster analysis it is common that all possible pairs have to be computed. It is possible to save some distance computations by using data structures for k-nearest-neighbor search, but you have to make sure that your measure satisfies the triangle inequality.
If you are not satisfied with this answer, please specify more details of your problem and read about:
K-means (main disadvantage: you have to specify number of clusters)
Hierarchical clustering (slow computation time, at the top all images are in one cluster, you have to cut a dendrogram at proper distance)
Spectral clustering (for graphs, but I think it is too complicated for this easy problem)
I ended up solving the problem by using hierarchical clustering and then traversing each branch of the dendrogram top to bottom until I find a cluster where the distance is below a threshold. Worst case there is no such cluster but then I'll end up in a leaf of the dendrogram which means that element is in a cluster of its own.
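The original code for this solution is not shown, so the following is only a sketch of the same idea under stated assumptions: a naive agglomerative clustering with complete linkage (chosen here because it keeps every pair inside a cluster within the threshold) that stops merging once the threshold is exceeded. `distance` would be something like 1 minus your similarity, and all names are hypothetical.

```python
def complete_linkage(items, distance, max_dist):
    """Agglomerative clustering that stops once no merge keeps all pairs within max_dist."""
    clusters = [[item] for item in items]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(distance(x, y) for x in a for y in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters under complete linkage.
        d, i, j = min(
            ((linkage(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda t: t[0],
        )
        if d > max_dist:
            break  # same effect as cutting the dendrogram at max_dist
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

This naive version recomputes linkages on every pass and is only practical for small N. A library implementation would be the better choice for larger collections.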

A special case of grouping coordinates

I'm trying to write a program to place students in cars for carpooling to an event. I have the addresses for each student, and can geocode each address to get coordinates (the addresses are close enough that I can simply use Euclidean distances between coordinates). Some of the students have cars and can drive others. How can I efficiently group students in cars? I know that grouping is usually done using algorithms like k-means, but I can only find algorithms to group N points into M arbitrary-sized groups. My groups are of a specific size and positioning. Where can I start? A simple greedy algorithm will ensure the first cars assigned have minimum pick-up distance, but the average will be high, I imagine.
Say that you are trying to minimize the total distance traveled. Clearly traveling salesman problem is a special instance of your problem so your problem is NP-hard. That puts us in the heuristics/approximation algorithms domain.
The problem also needs some more specification, for example how many students can fit in a given car. Let's say as many as you want.
How about solving it as a minimum spanning tree rooted at the final destination? Then each student with a car is responsible for collecting all of its child nodes. The total distance traveled is at most 2x the total length of the spanning tree, which is a 2x bound right there. Of course this is ridiculous, because the nodes next to the root would be driving a mega bus instead of a car in this case.
So then you start playing the packing game where you try to fill the cars greedily.
I know this is not a solution, but this might help you specify the problem better.
This is an old question, but since I found it, others will as well.
Group students together by distance. Find the distance between every pair of students. Start with the closest students and put them in a group, and continue adding until all students are in groups. If students are farther apart than a threshold distance, like 50 miles, don't combine them into a group (this will cause a few students to go solo). If students have different-sized cars, stop adding to a group once it reaches the capacity of the largest car among its students (including the one you're trying to add).
Finding the optimal (you asked for efficient) solution would require a more defined problem, which it seems like you don't have. If you wanted to eliminate individual drivers though, taking the above solution and special casing the outliers, working them individually into groups and swapping people around adjacent groups to fit them in, could find a very strong solution.
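One way to read the grouping procedure above is as a closest-pair-first merge with a capacity check. The sketch below assumes each student is a tuple (x, y, seats), where seats counts the driver and is 0 for students without a car. That data layout and the names are assumptions of this example.

```python
import math

def carpool_groups(students, max_dist=50.0):
    """students: list of (x, y, seats); seats is 0 for students without a car.

    Merges the closest pairs first. A merge is allowed only if the largest
    car in the combined group can seat everyone in it.
    Returns groups as lists of indices into `students`.
    """
    n = len(students)
    group_of = list(range(n))             # group id for each student
    members = {i: [i] for i in range(n)}  # group id -> student indices
    pairs = sorted(
        (math.dist(students[i][:2], students[j][:2]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    for d, i, j in pairs:
        if d > max_dist:
            break  # all remaining pairs are even farther apart
        gi, gj = group_of[i], group_of[j]
        if gi == gj:
            continue
        merged = members[gi] + members[gj]
        if len(merged) <= max(students[k][2] for k in merged):
            members[gi] = merged
            for k in members.pop(gj):
                group_of[k] = gi
    return list(members.values())
```

Students without a car who never fit into a group stay in single-member groups, which are the outliers the answer suggests handling separately.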

3D clustering Algorithm

Problem Statement:
I have the following problem:
There are more than a billion points in 3D space. The goal is to find the top N points that have the largest number of neighbors within a given distance R. Another condition is that the distance between any two of those top N points must be greater than R. The distribution of the points is not uniform. It is very common that certain regions of the space contain a lot of points.
Goal:
To find an algorithm that can scale well to many processors and has a small memory requirement.
Thoughts:
Normal spatial decomposition is not sufficient for this kind of problem due to the non-uniform distribution. An irregular spatial decomposition that evenly divides the number of points may help with the problem. I would really appreciate it if someone could shed some light on how to solve this problem.
Use an Octree. For 3D data with a limited value domain that scales very well to huge data sets.
Many of the aforementioned methods such as locality sensitive hashing are approximate versions designed for much higher dimensionality where you can't split sensibly anymore.
Splitting at each level into 8 bins (2^d for d=3) works very well. And since you can stop splitting when a cell has too few points and build a deeper tree where there are a lot of points, that should fit your requirements quite well.
For more details, see Wikipedia:
https://en.wikipedia.org/wiki/Octree
Alternatively, you could try to build an R-tree. But the R-tree tries to balance, making it harder to find the most dense areas. For your particular task, this drawback of the Octree is actually helpful! The R-tree puts a lot of effort into keeping the tree depth equal everywhere, so that each point can be found at approximately the same time. However, you are only interested in the dense areas, which will be found on the longest paths in the Octree without even having to look at the actual points yet!
I don't have a definite answer for you, but I have a suggestion for an approach that might yield a solution.
I think it's worth investigating locality-sensitive hashing. I think dividing the points evenly and then applying this kind of LSH to each set should be readily parallelisable. If you design your hashing algorithm such that the bucket size is defined in terms of R, it seems likely that for a given set of points divided into buckets, the points satisfying your criteria are likely to exist in the fullest buckets.
Having performed this locally, perhaps you can apply some kind of map-reduce-style strategy to combine spatial buckets from different parallel runs of the LSH algorithm in a step-wise manner, making use of the fact that you can begin to exclude parts of your problem space by discounting entire buckets. Obviously you'll have to be careful about edge cases that span different buckets, but I suspect that at each stage of merging, you could apply different bucket sizes/offsets such that you remove this effect (e.g. perform merging spatially equivalent buckets, as well as adjacent buckets). I believe this method could be used to keep memory requirements small (i.e. you shouldn't need to store much more than the points themselves at any given moment, and you are always operating on small(ish) subsets).
If you're looking for some kind of heuristic then I think this result will immediately yield something resembling a "good" solution - i.e. it will give you a small number of probable points which you can check satisfy your criteria. If you are looking for an exact answer, then you are going to have to apply some other methods to trim the search space as you begin to merge parallel buckets.
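To illustrate the idea of buckets sized in terms of R, here is a single-process sketch that uses a plain uniform grid hash with cell size R. This is a simple stand-in, not a true LSH scheme, and it ignores the parallel merge step. The final selection is a greedy heuristic, and all names are hypothetical.

```python
from collections import defaultdict
from itertools import product

def dense_points(points, r, top_n):
    """Greedy pick of up to top_n points with the most neighbors within r, pairwise more than r apart."""
    def cell(p):
        return tuple(int(c // r) for c in p)

    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    grid = defaultdict(list)  # grid cell -> indices of the points inside it
    for i, p in enumerate(points):
        grid[cell(p)].append(i)

    def neighbor_count(i):
        # Any point within r lies in the same cell or one of the 26 adjacent cells.
        cx, cy, cz = cell(points[i])
        return sum(
            1
            for dx, dy, dz in product((-1, 0, 1), repeat=3)
            for j in grid.get((cx + dx, cy + dy, cz + dz), ())
            if j != i and dist2(points[i], points[j]) <= r * r
        )

    ranked = sorted(range(len(points)), key=neighbor_count, reverse=True)
    chosen = []
    for i in ranked:  # densest first, skipping points too close to an earlier pick
        if all(dist2(points[i], points[j]) > r * r for j in chosen):
            chosen.append(i)
            if len(chosen) == top_n:
                break
    return [points[i] for i in chosen]
```

In very dense regions a single cell can still hold a large share of the points, which is where the adaptive splitting discussed in the other answers would come in.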
Another thought I had was that this could relate to finding the metric k-center. It's definitely not the exact same problem, but perhaps some of the methods used in solving that are applicable in this case. The problem is that this assumes you have a metric space in which computing the distance metric is possible - in your case, however, the presence of a billion points makes it undesirable and difficult to perform any kind of global traversal (e.g. sorting of the distances between points). As I said, just a thought, and perhaps a source of further inspiration.
Here are some possible parts of a solution.
There are various choices at each stage,
which will depend on Ncluster, on how fast the data changes,
and on what you want to do with the means.
3 steps: quantize, box, K-means.
1) quantize: reduce the input XYZ coordinates to say 8 bits each,
by taking 2^8 percentiles of X,Y,Z separately.
This will speed up the whole flow without much loss of detail.
You could sort all 1G points, or just a random 1M,
to get 8-bit x0 < x1 < ... x256, y0 < y1 < ... y256, z0 < z1 < ... z256
with 2^(30-8) points in each range.
To map float X -> 8 bit x, unrolled binary search is fast —
see Bentley, Pearls p. 95.
Added: Kd trees
split any point cloud into different-sized boxes, each with ~ Leafsize points —
much better than splitting X Y Z as above.
But afaik you'd have to roll your own Kd tree code
to split only the first say 16M boxes, and keep counts only, not the points.
2) box: count the number of points in each 3d box,
[xi .. xi+1, yj .. yj+1, zk .. zk+1].
The average box will have 2^(30-3*8) points;
the distribution will depend on how clumpy the data is.
If some boxes are too big or get too many points, you could
a) split them into 8,
b) track the centre of the points in each box,
otherwise just take box midpoints.
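Steps 1 and 2 might be sketched like this. The sketch uses Python's `bisect` module instead of the unrolled binary search mentioned above, and computes the percentile edges from the full input rather than a sample. `nbits` plays the role of the 8 bits per axis, and the function names are made up.

```python
from bisect import bisect_right
from collections import Counter

def percentile_edges(values, nbits=8):
    """Interior bin edges so that each of the 2**nbits bins holds about equally many values."""
    ordered = sorted(values)
    bins = 2 ** nbits
    return [ordered[len(ordered) * k // bins] for k in range(1, bins)]

def box_counts(points, nbits=8):
    """Map each 3D point to its quantized box and count the points per box."""
    edges = [percentile_edges([p[axis] for p in points], nbits) for axis in range(3)]
    return Counter(
        tuple(bisect_right(edges[axis], p[axis]) for axis in range(3))
        for p in points
    )
```

Only non-empty boxes appear in the counter, so clumpy data uses far fewer than 2^(3*nbits) entries.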
3) K-means clustering on the 2^(3*8) box centres.
(Google parallel "k means" -> 121k hits.)
This depends strongly on K aka Ncluster, also on your radius R.
A rough approach would be to grow a heap of the say 27*Ncluster boxes with the most points,
then take the biggest ones subject to your Radius constraint.
(I like to start with a minimum spanning tree,
then remove the K-1 longest links to get K clusters.)
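The minimum-spanning-tree idea can be sketched with Kruskal's algorithm stopped early: adding edges in ascending order until K components remain gives the same clusters as building the full tree and removing its K-1 longest links (up to ties). The sketch builds all pairwise edges, so it only suits small inputs such as a modest number of box centres. The names are illustrative.

```python
import math

def mst_clusters(points, k):
    """Kruskal's algorithm stopped at k components (the MST minus its k-1 longest links)."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(len(points)) for j in range(i + 1, len(points))
    )
    components = len(points)
    for _, i, j in edges:
        if components <= k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge joins two different components
            parent[ri] = rj
            components -= 1
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(points[i])
    return list(groups.values())
```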
See also color quantization.
I'd make Nbit, here 8, a parameter from the beginning.
What is your Ncluster ?
Added: if your points are moving in time, see
collision-detection-of-huge-number-of-circles on SO.
I would also suggest using an octree. The OctoMap framework is very good at dealing with huge 3D point clouds. It does not store all the points directly, but updates the occupancy density of every node (aka 3D box).
After the tree is built, you can use a simple iterator to find the node with the highest density. If you would like to model the point density or distribution inside the nodes, the OctoMap is very easy to adapt.
Here you can see how it was extended to model the point distribution using a planar model.
Just an idea. Create a graph with the given points as vertices and an edge between two points when their distance is less than R.
Creating this kind of graph is similar to spatial decomposition. Your questions can then be answered with a local search in the graph. The first step is to find the vertices with maximum degree; the second is to find a maximal set of high-degree vertices that are pairwise unconnected.
I think both the graph creation and the search can be parallelised. This approach can have a large memory requirement. Splitting the domain and working with graphs for smaller volumes can reduce the memory needed.

Appropriate similarity metrics for multiple sets of 2D coordinates

I have a collection of 2D coordinate sets (on the scale of a 100K-500K points in each set) and I am looking for the most efficient way to measure the similarity of 1 set to the other. I know of the usuals: Cosine, Jaccard/Tanimoto, etc. However I am hoping for some suggestions on any fast/efficient ones to measure similarity, especially ones that can cluster by similarity.
Edit 1: The image shows what I need to do. I need to cluster all the reds, blues and greens by their shape/orientation, etc.
Image: http://img402.imageshack.us/img402/8121/curves.png
It seems that the first step of any solution is going to be to find the centroid, or other reference point, of each shape, so that they can be compared regardless of absolute position.
One algorithm that comes to mind would be to start at the point nearest the centroid and walk to its nearest neighbors. Compare the offsets of those neighbors (from the centroid) between the sets being compared. Keep walking to the next-nearest neighbors of the centroid, or the nearest not-already-compared neighbors of the ones previously compared, and keep track of the aggregate difference (perhaps RMS?) between the two shapes. Also, at each step of this process calculate the rotational offset that would bring the two shapes into closest alignment [and whether mirroring affects it as well?]. When you are finished you will have three values for every pair of sets, including their direct similarity, their relative rotational offset (mostly only useful if they are close matches after rotation), and their similarity after rotation.
Try the k-means algorithm. It iteratively recalculates the centroid of each cluster, computes the distance from each centroid to all the points, and associates each point with the nearest cluster.
Since your clustering is based on a nearness-to-shape metric, perhaps you need some form of connected component labeling. UNION-FIND can give you a fast basic set primitive.
For union-only, start every point in a different set, and merge them if they meet some criterion of nearness, influenced by local colinearity since that seems important to you. Then keep merging until you pass some over-threshold condition for how difficult your merge is. If you treat it like line-growing (only join things at their ends) then some data structures become simpler. Are all your clusters open lines and curves? No closed curves, like circles?
The crossing lines are trickier to get right: you either have to find some way to merge and then split, or you set your merge criteria to strongly favor colinearity and hope you luck out on the crossing lines.
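A bare-bones version of the union-find labeling might look like the sketch below. It merges on distance alone, with `max_gap` as a made-up nearness threshold, and leaves out the colinearity weighting and the end-only joining discussed above. The all-pairs loop is only for illustration: with 100K-500K points you would look up candidate neighbors through a grid or k-d tree instead.

```python
import math

class UnionFind:
    """Disjoint sets with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return False
        if self.size[ri] < self.size[rj]:
            ri, rj = rj, ri
        self.parent[rj] = ri  # attach the smaller tree under the larger one
        self.size[ri] += self.size[rj]
        return True

def label_by_nearness(points, max_gap):
    """Label 2D points so that points linked by hops of at most max_gap share a label."""
    uf = UnionFind(len(points))
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= max_gap:
                uf.union(i, j)
    return [uf.find(i) for i in range(len(points))]
```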
