How to cluster objects (without coordinates) - algorithm

I have a list of opaque objects. I am only able to calculate the distance between them (not true, just setting the conditions for the problem):
class Thing {
public double DistanceTo(Thing other);
}
I would like to cluster these objects. I would like to control the number of clusters and I would like for "close" objects to be in the same cluster:
List<Cluster> cluster(int numClusters, List<Thing> things);
Can anyone suggest (and link to ;-)) some clustering algorithms (the simpler, the better!) or libraries that can help me?
Clarification: Most clustering algorithms require that the objects be laid out in some N-dimensional space. This space is used to find "centroids" of clusters. In my case, I do not know what N is, nor do I know how to extract a coordinate system from the objects. All I know is how far apart 2 objects are. I would like to find a good clustering algorithm that uses only that information.
Imagine that you are clustering based upon the "smell" of an object. You don't know how to lay "smells" out on a 2D plane, but you do know whether two smells are similar or not.

I think you are looking for K-Medoids. It's like K-means in that you specify the number of clusters, K, in advance, but it doesn't require that you have a notion of "averaging" the objects you're clustering like K-means does.
Instead, every cluster has a representative medoid, which is the member of the cluster closest to the middle. You could think of it as a version of K-means that finds "medians" instead of "means". All you need is a distance metric to cluster things, and I've used this in some of my own work for exactly the same reasons you cite.
Naive K-medoids is not the fastest algorithm, but there are fast variants that are probably good enough for your purposes. Here are descriptions of the algorithms and links to the documentation for their implementations in R:
PAM is the basic O(n^2) implementation of K-medoids.
CLARA is a much faster, sampled version of PAM. It works by clustering a randomly sampled subset of the objects with PAM and then grouping the entire set of objects based on that subset. You should still be able to get very good clusterings quickly this way.
If you need more information, here's a paper that gives an overview of these and other K-medoids methods.
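To make the idea concrete, here is a minimal sketch of the simple alternating ("assign, then re-pick medoids") variant of K-medoids. It is not PAM or CLARA, just the easiest version to write down. The function name `k_medoids` and its parameters are made up for illustration, and `dist` stands in for whatever DistanceTo-style function you have.

```python
import random

def k_medoids(items, k, dist, max_iter=100, seed=0):
    """Alternating k-medoids on arbitrary items, given only dist(a, b)."""
    rng = random.Random(seed)
    n = len(items)
    # Precompute all pairwise distances (O(n^2) calls to dist).
    d = [[dist(a, b) for b in items] for a in items]
    medoids = rng.sample(range(n), k)
    for _ in range(max_iter):
        # Assignment step: a medoid stays in its own cluster,
        # every other item joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            nearest = i if i in clusters else min(medoids, key=lambda m: d[i][m])
            clusters[nearest].append(i)
        # Update step: the new medoid is the member with the smallest
        # total distance to the other members of its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(d[c][j] for j in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return [[items[i] for i in members] for members in clusters.values()]

# Toy usage: plain numbers, with absolute difference as the distance.
values = [1.0, 1.2, 0.9, 10.0, 10.3, 9.8]
print(k_medoids(values, 2, lambda a, b: abs(a - b)))
```

Like K-means, this can get stuck in a local optimum, so in practice you would probably run it from several seeds and keep the best result.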

Here's an outline for a clustering algorithm that doesn't have the K-means requirement of finding a centroid.
Determine the distance between all pairs of objects. Record the n most separated objects. [finds roots of our clusters, time O(n^2)]
Assign each of these n objects to its own new cluster.
For every other object:[assign objects to clusters, time O(n^2)]
For each cluster:
Calculate the average distance from a cluster to that object by averaging the distance of each object in the cluster to the object.
Assign the object to the closest cluster.
This algorithm will certainly cluster the objects. But its runtime is O(n^2). Plus it is guided by those first n points chosen.
Can anyone improve upon this (better runtime perf, less dependent upon initial choices)? I would love to see your ideas.
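For reference, the outline above could be sketched roughly as follows. One assumption is mine: "the n most separate objects" is interpreted as a greedy farthest-first pick (start from the farthest-apart pair, then keep adding the object farthest from the seeds chosen so far). The name `seed_and_assign` is hypothetical.

```python
def seed_and_assign(items, k, dist):
    """Seed k clusters with well-separated items, then assign the rest
    by smallest average distance. Assumes 2 <= k <= len(items)."""
    n = len(items)
    d = [[dist(a, b) for b in items] for a in items]
    # First two seeds: the farthest-apart pair of items.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: d[pair[0]][pair[1]],
    )
    seeds = [first, second]
    # Remaining seeds: the item farthest from its nearest existing seed.
    while len(seeds) < k:
        candidates = [i for i in range(n) if i not in seeds]
        seeds.append(max(candidates, key=lambda i: min(d[i][s] for s in seeds)))
    clusters = [[s] for s in seeds]
    for i in range(n):
        if i in seeds:
            continue
        # Assign to the cluster with the smallest average distance to i.
        closest = min(clusters, key=lambda c: sum(d[i][j] for j in c) / len(c))
        closest.append(i)
    return [[items[i] for i in c] for c in clusters]

values = [1.0, 1.2, 0.9, 10.0, 10.3, 9.8]
print(seed_and_assign(values, 2, lambda a, b: abs(a - b)))
```

The result still depends on the order in which the remaining objects are visited, which is one of the weaknesses mentioned above.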

Here's a quick algorithm.
While (points_left > 0) {
Select a random point that is not already clustered
Add point and all points within x distance
that aren't already clustered to a new cluster.
}
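A runnable version of that loop might look like the sketch below. The names `radius_clusters`, `x` and `dist` are placeholders of mine. Note that it takes a distance threshold rather than a cluster count, so the number of clusters you get depends on `x` and on the random picks.

```python
import random

def radius_clusters(items, x, dist, seed=0):
    """Repeatedly pick a random unclustered item and group it with every
    unclustered item within distance x of it."""
    rng = random.Random(seed)
    unclustered = list(range(len(items)))
    clusters = []
    while unclustered:
        center = rng.choice(unclustered)
        # The center always joins its own cluster, plus everything within x.
        members = [i for i in unclustered
                   if i == center or dist(items[center], items[i]) <= x]
        clusters.append([items[i] for i in members])
        taken = set(members)
        unclustered = [i for i in unclustered if i not in taken]
    return clusters

values = [1.0, 1.2, 0.9, 10.0, 10.3, 9.8]
print(radius_clusters(values, 1.0, lambda a, b: abs(a - b)))
```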
Alternatively, read the wikipedia page. K-means clustering is a good choice:
The K-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest. The center is the average of all the points in the cluster — that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster.
The algorithm steps are:
* Choose the number of clusters, k.
* Randomly generate k clusters and determine the cluster centers, or directly generate k random points as cluster centers.
* Assign each point to the nearest cluster center.
* Recompute the new cluster centers.
* Repeat the two previous steps until some convergence criterion is met (usually that the assignment hasn't changed).
The main advantages of this algorithm are its simplicity and speed which allows it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments. It minimizes intra-cluster variance, but does not ensure that the result has a global minimum of variance. Another disadvantage is the requirement for the concept of a mean to be definable which is not always the case. For such datasets the k-medoids variant is appropriate.

How about this approach:
Assign all objects to one cluster.
Find the two objects, a and b, that are within the same cluster, k, and that are maximum distance apart. To clarify, there should be one a and b for the whole set, not one a and b for each cluster.
Split cluster k into two clusters, k1 and k2, one with object a and one with object b.
For all other objects in cluster k, add them to either k1 or k2 by determining the minimum average distance to all other objects in that cluster.
Repeat steps 2-4 until N clusters are formed.
I think this algorithm should give you a fairly good clustering, although the efficiency might be pretty bad. To improve the efficiency you could alter step 4 so that you find the minimum distance to only the original object that started the cluster, rather than the average distance to all objects already in the cluster.

Phylogenetic DNA sequence analysis regularly uses hierarchical clustering on text strings, with [alignment] distance matrices. Here's a nice R tutorial for clustering:
http://www.statmethods.net/advstats/cluster.html
(Shortcut: Go straight to the "Hierarchical Agglomerative" section...)
Here are some other [language] libraries :
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/
http://code.google.com/p/scipy-cluster/
This approach could help determine how many [k] "natural" clusters there are and which objects to use as roots for the k-means approaches above.

Related

Clustering elements based on highest similarity

I'm working with Docker images which consist of a set of re-usable layers. Now given a collection of images, I would like to combine images which have a large amount of shared layers.
To be more exact: Given a collection of N images, I want to create clusters where all images in a cluster share more than X percent of services with each other. Each image is only allowed to belong to one cluster.
My own research points in the direction of cluster algorithms where I use a similarity measure to decide which images belong in a cluster together. The similarity measure I know how to write. However, I'm having difficulty finding an exact algorithm or pseudo-algorithm to get started.
Can someone recommend an algorithm to solve this problem or provide pseudo-code please?
EDIT: after some more searching I believe I'm looking for something like this hierarchical clustering ( https://github.com/lbehnke/hierarchical-clustering-java ) but with a threshold X so that neighbors with less than X% similarity don't get combined and stay in a separate cluster.
I believe you are a developer and you have no experience with data science?
There are a number of clustering algorithms, each with its own advantages and disadvantages (please consult https://en.wikipedia.org/wiki/Cluster_analysis), but I think the solution to your problem is easier than you might expect.
I assume that N is small enough that you can store a matrix of N^2 float values in RAM. If so, you are in a very comfortable situation. You write that you know how to implement the similarity measure, so just calculate it for all N^2 pairs and store the results in a matrix (it is a symmetric matrix, so you only need to store half of it). Make sure to assign a special value to pairs whose similarity is below the X% threshold, such as 0 or infinity (depending on whether you treat the function as a similarity measure or as a distance). I think the ideal solution is to assign 1 to pairs whose similarity is greater than the X% threshold and 0 otherwise.
After that, treat it just like a graph. Take the first vertex and run, e.g., a depth-first search or any other graph traversal. Everything you reach is your first cluster. Then take the first unvisited vertex and repeat the traversal. Of course, you can store the graph as an adjacency list to save memory.
This algorithm assumes that you do not care how similar the images are or which pairs are more similar than others, only whether they are similar enough (similarity measure greater than the given threshold).
Unfortunately, in cluster analysis it is common that all possible pairs have to be computed. It is possible to save some distance computations by using fancy data structures for k-nearest-neighbor search, but then you have to make sure that your measure satisfies the triangle inequality.
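As a rough illustration of the graph approach, here is a sketch that thresholds a similarity function and collects connected components with an iterative depth-first search. The Jaccard similarity on sets of layer IDs is only my assumption about what the measure might look like, and the names `threshold_components` and `jaccard` are invented for the example.

```python
def threshold_components(items, similarity, threshold):
    """Cluster items as connected components of the graph that has an edge
    wherever similarity(a, b) > threshold."""
    n = len(items)
    adj = [[] for _ in range(n)]  # adjacency list
    for i in range(n):
        for j in range(i + 1, n):
            if similarity(items[i], items[j]) > threshold:
                adj[i].append(j)
                adj[j].append(i)
    seen = [False] * n
    clusters = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:  # iterative depth-first search
            v = stack.pop()
            component.append(items[v])
            for w in adj[v]:
                if not seen[w]:
                    seen[w] = True
                    stack.append(w)
        clusters.append(component)
    return clusters

def jaccard(a, b):
    # Hypothetical similarity: shared layers / all layers (non-empty sets).
    return len(a & b) / len(a | b)

images = [frozenset("abcd"), frozenset("abce"), frozenset("xyz"),
          frozenset("xyw"), frozenset("q")]
print(threshold_components(images, jaccard, 0.4))
```

One caveat: components can chain. If A is similar to B and B to C, all three land in one cluster even when A and C are not similar enough to each other.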
If you are not satisfied with this answer, please specify more details of your problem and read about:
K-means (main disadvantage: you have to specify number of clusters)
Hierarchical clustering (slow computation time, at the top all images are in one cluster, you have to cut a dendrogram at proper distance)
Spectral clustering (for graphs, but I think it is too complicated for this easy problem)
I ended up solving the problem by using hierarchical clustering and then traversing each branch of the dendrogram top to bottom until I find a cluster where the distance is below a threshold. Worst case there is no such cluster but then I'll end up in a leaf of the dendrogram which means that element is in a cluster of its own.
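For anyone wanting to try the same thing without a library, the sketch below is a naive bottom-up version under my own assumptions: complete linkage (cluster distance = farthest pair), merging until the next merge would exceed the threshold. With complete linkage, every pair inside a cluster ends up within the threshold of each other. This is not the code from the linked Java project; `complete_linkage_clusters` and `jaccard_distance` are illustrative names.

```python
def complete_linkage_clusters(items, dist, max_dist):
    """Agglomerative clustering that stops once the closest pair of clusters
    is farther apart than max_dist (complete linkage, naive implementation)."""
    d = [[dist(a, b) for b in items] for a in items]
    clusters = [[i] for i in range(len(items))]

    def linkage(c1, c2):
        # Complete linkage: distance between the farthest pair of members.
        return max(d[i][j] for i in c1 for j in c2)

    while len(clusters) > 1:
        best, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best > max_dist:
            break  # remaining clusters are too dissimilar to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [[items[i] for i in c] for c in clusters]

def jaccard_distance(a, b):
    # Hypothetical distance between two sets of layer IDs (non-empty sets).
    return 1.0 - len(a & b) / len(a | b)

images = [frozenset("abcd"), frozenset("abce"), frozenset("xyz"),
          frozenset("xyw"), frozenset("q")]
print(complete_linkage_clusters(images, jaccard_distance, 0.6))
```

The loop recomputes all cluster-to-cluster distances on every merge, so it is only reasonable for a few hundred items; a library implementation would be the better choice beyond that.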

Can k-means clustering do classification?

I want to know whether the k-means clustering algorithm can do classification?
Say I have done a simple k-means clustering.
Assume I have a lot of data, I run k-means clustering, and I get 2 clusters A and B. The distance used to compute the centroids is Euclidean distance.
Cluster A is on the left side.
Cluster B is on the right side.
So, if I have one new data point, what should I do?
1. Run the k-means clustering algorithm again and see which cluster the new data point belongs to?
2. Record the last centroids and use Euclidean distance to decide which cluster the new data point belongs to?
3. Some other method?
The simplest method of course is 2., assign each object to the closest centroid (technically, use sum-of-squares, not Euclidean distance; this is more correct for k-means, and saves you a sqrt computation).
Method 1. is fragile, as k-means may give you a completely different solution; in particular if it didn't fit your data well in the first place (e.g. too high dimensional, clusters of too different size, too many clusters, ...)
However, the following method may be even more reasonable:
3. Train an actual classifier.
Yes, you can use k-means to produce an initial partitioning, then assume that the k-means partitions could be reasonable classes (you really should validate this at some point though), and then continue as you would if the data had been user-labeled.
I.e. run k-means, train an SVM on the resulting clusters. Then use the SVM for classification.
k-NN classification, or even assigning each object to the nearest cluster center (option 2) can be seen as very simple classifiers. The latter is a 1NN classifier, "trained" on the cluster centroids only.
Yes, we can do classification.
I wouldn't say the algorithm itself (like #1) is particularly well-suited to classifying points, as incorporating data to be classified into your training data tends to be frowned upon (unless you have a real-time system, but I think elaborating on this would get a bit far from the point).
To classify a new point, simply calculate the Euclidean distance to each cluster centroid to determine the closest one, then classify it under that cluster.
There are data structures that allow you to determine the closest centroid more efficiently (like a kd-tree), but the above is the basic idea.
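In code, the basic idea is only a few lines. This sketch assumes you kept the final centroids from your k-means run; the centroid values and the name `nearest_centroid` are made up. It compares squared distances, which gives the same winner as Euclidean distance without the square root.

```python
def nearest_centroid(point, centroids):
    """Return the index of the centroid closest to point."""
    def sq_dist(p, q):
        # Squared Euclidean distance; the ordering matches Euclidean distance.
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))

# Hypothetical centroids saved from an earlier k-means run.
centroids = [(-2.0, 0.0), (3.0, 0.5)]
labels = ["A", "B"]
new_point = (2.1, -0.3)
print(labels[nearest_centroid(new_point, centroids)])
```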
If you've already done k-means clustering on your data to get two clusters, then you could use k Nearest Neighbors on the new data point to find out which class it belongs to.
Here is another method:
I saw it in "The Elements of Statistical Learning". I'll change the notation a little bit. Let C be the number of classes and K the number of clusters. Now, follow these steps:
Apply K-means clustering to the training data in each class separately, using K clusters per class.
Assign a class label to each of the C*K clusters.
Classify observation x to the class of the closest cluster.
It seems like a nice approach for classification that reduces data observations by using clusters.
If you are doing real-time analysis where you want to recognize new conditions during use (or adapt to a changing system), then you can choose some radius around the centroids to decide whether a new point starts a new cluster or should be included in an existing one. (That's a common need in monitoring of plant data, for instance, where it may take years after installation before some operating conditions occur.) If real-time monitoring is your case, check RTEFC or RTMAC, which are efficient, simple real-time variants of K-means. RTEFC in particular, which is non-iterative. See http://gregstanleyandassociates.com/whitepapers/BDAC/Clustering/clustering.htm
Yes, you can use that for classification. If you've decided you have collected enough data for all possible cases, you can stop updating the clusters, and just classify new points based on the nearest centroid. As in any real-time method, there will be sensitivity to outliers, e.g., those caused by sensor error or failure when using sensor data. If you create new clusters, outliers could be considered legitimate if one purpose of the clustering is to identify faults in the sensors, although that is most useful when you can do some labeling of clusters.
You are confusing the concepts of clustering and classification. When you have labeled data, you already know how the data is clustered according to the labels and there is no point in clustering the data unless you want to find out how well your features can discriminate the classes.
If you run the k-means algorithm to find the centroid of each class and then use the distances from the centroids to classify a new data point, you in fact implement a form of the linear discriminant analysis algorithm assuming the same multiple-of-identity covariance matrix for all classes.
After the k-means clustering algorithm converges, it can be used for classification with only a few labeled exemplars/training data.
It is a very common approach when the number of labeled training instances is very limited due to the high cost of labeling.

Trajectory Clustering: Which Clustering Method?

As a newbie in Machine Learning, I have a set of trajectories that may be of different lengths. I wish to cluster them, because some of them are actually the same path and they just SEEM different due to the noise.
In addition, not all of them are of the same lengths. So maybe although Trajectory A is not the same as Trajectory B, yet it is part of Trajectory B. I wish to present this property after the clustering as well.
I have only a little knowledge of K-means Clustering and Fuzzy N-means Clustering. How should I choose between the two? Or should I adopt other methods?
Any method that takes the "belongness" into consideration?
(e.g. After the clustering, I have 3 clusters A, B and C. One particular trajectory X belongs to cluster A. And a shorter trajectory Y, although is not clustered in A, is identified as part of trajectory B.)
=================== UPDATE ======================
The aforementioned trajectories are the pedestrians' trajectories. They can be either presented as a series of (x, y) points or a series of step vectors (length, direction). The presentation form is under my control.
It might be a little late but I am also working on the same problem.
I suggest you take a look at TRACLUS, an algorithm created by Jae-Gil Lee, Jiawei Han and Kyu-Young Whang, published at SIGMOD '07.
http://web.engr.illinois.edu/~hanj/pdf/sigmod07_jglee.pdf
This is so far the best approach I have seen for clustering trajectories because:
Can discover common sub-trajectories.
Focuses on Segments instead of points (so it filters out noise-outliers).
It works over trajectories of different length.
Basically, it is a two-phase approach:
Phase one - Partition: Divide the trajectories into segments. This is done using MDL optimization with a complexity of O(n), where n is the number of points in a given trajectory. Here the input is a set of trajectories and the output is a set of segments.
Complexity: O(n), where n is the number of points in a trajectory
Input: Set of trajectories.
Output: Set D of segments
Phase two - Group: This phase discovers the clusters using a version of density-based clustering, as in DBSCAN. The input in this phase is the set of segments obtained from phase one, plus parameters that define what constitutes a neighborhood and the minimum number of lines that can constitute a cluster. The output is a set of clusters. Clustering is done over segments. They define their own distance measure made of 3 components: parallel distance, perpendicular distance and angular distance. This phase has a complexity of O(n log n), where n is the number of segments.
Complexity: O(n log n), where n is the number of segments in set D
Input: Set D of segments, parameter E that sets the neighborhood threshold, and parameter MinLns that is the minimum number of lines.
Output: Set C of clusters, where each cluster is a set of segments (trajectories clustered).
Finally, they calculate a representative trajectory for each cluster, which is nothing other than the common sub-trajectory discovered in that cluster.
They have pretty cool examples and the paper is very well explained. Once again this is not my algorithm, so don't forget to cite them if you are doing research.
PS: I made some slides based on their work, just for educational purposes:
http://www.slideshare.net/ivansanchez1988/trajectory-clustering-traclus-algorithm
Every clustering algorithm needs a metric. You need to define distance between your samples. In your case simple Euclidean distance is not a good idea, especially if the trajectories can have different lengths.
If you define a metric, then you can use any clustering algorithm that allows a custom metric. You probably do not know the correct number of clusters beforehand, so hierarchical clustering is a good option. K-means doesn't allow a custom metric, but there are modifications of K-means that do (like K-medoids).
The hard part is defining the distance between two trajectories (time series). A common approach is DTW (Dynamic Time Warping). To improve performance, you can approximate each trajectory with a smaller number of points (there are many algorithms for that).
Neither will work. Because what is a proper mean here?
Have a look at distance based clustering methods, such as hierarchical clustering (for small data sets, but you probably don't have thousands of trajectories) and DBSCAN.
Then you only need to choose an appropriate distance function that allows e.g. differences in time and spatial resolution of trajectories.
Distance functions such as dynamic time warping (DTW) distance can accommodate this.
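To give a feel for it, here is a minimal, unoptimized sketch of the textbook DTW recurrence for trajectories given as lists of (x, y) points. The name `dtw_distance` is mine, and there is no warping window or other speed-up, so it costs O(len(a) * len(b)) per pair.

```python
import math

def dtw_distance(a, b):
    """Dynamic time warping distance between two point sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i points of a
    # with the first j points of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(a[i - 1], b[j - 1])  # Euclidean point distance
            cost[i][j] = step + min(cost[i - 1][j],      # a advances
                                    cost[i][j - 1],      # b advances
                                    cost[i - 1][j - 1])  # both advance
    return cost[n][m]

path = [(0, 0), (1, 0), (2, 0)]
same_path_resampled = [(0, 0), (0, 0), (1, 0), (2, 0), (2, 0)]
shifted_path = [(0, 1), (1, 1), (2, 1)]
print(dtw_distance(path, same_path_resampled), dtw_distance(path, shifted_path))
```

The resulting pairwise distances can then be fed to any of the distance-based methods mentioned here (hierarchical clustering, DBSCAN, K-medoids).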
This is a good concept with potential for real-time applications. In my view, one can adopt any clustering algorithm, but one needs to select an appropriate dissimilarity measure and then think about the computational complexity.
One paper (http://link.springer.com/chapter/10.1007/978-81-8489-203-1_15) uses the Hausdorff distance and suggests a technique for reducing complexity, and another paper (http://www.cit.iit.bas.bg/CIT_2015/v-15-2/4-5-TCMVS%20A-edited-md-Gotovop.pdf) describes a "Trajectory Clustering Technique Based on Multi-View Similarity".

3D clustering Algorithm

Problem Statement:
I have the following problem:
There are more than a billion points in 3D space. The goal is to find the top N points that have the largest number of neighbors within a given distance R. Another condition is that the distance between any two of those top N points must be greater than R. The distribution of the points is not uniform; it is very common for certain regions of the space to contain a lot of points.
Goal:
To find an algorithm that can scale well to many processors and has a small memory requirement.
Thoughts:
Normal spatial decomposition is not sufficient for this kind of problem due to the non-uniform distribution. An irregular spatial decomposition that evenly divides the number of points may help with the problem. I would really appreciate it if someone could shed some light on how to solve this problem.
Use an Octree. For 3D data with a limited value domain that scales very well to huge data sets.
Many of the aforementioned methods such as locality sensitive hashing are approximate versions designed for much higher dimensionality where you can't split sensibly anymore.
Splitting at each level into 8 bins (2^d for d=3) works very well. And since you can stop when there are too few points in a cell, and build a deeper tree where there are a lot of points, that should fit your requirements quite well.
For more details, see Wikipedia:
https://en.wikipedia.org/wiki/Octree
Alternatively, you could try to build an R-tree. But the R-tree tries to balance, making it harder to find the most dense areas. For your particular task, this drawback of the Octree is actually helpful! The R-tree puts a lot of effort into keeping the tree depth equal everywhere, so that each point can be found at approximately the same time. However, you are only interested in the dense areas, which will be found on the longest paths in the Octree without even having to look at the actual points yet!
I don't have a definite answer for you, but I have a suggestion for an approach that might yield a solution.
I think it's worth investigating locality-sensitive hashing. I think dividing the points evenly and then applying this kind of LSH to each set should be readily parallelisable. If you design your hashing algorithm such that the bucket size is defined in terms of R, it seems likely that for a given set of points divided into buckets, the points satisfying your criteria are likely to exist in the fullest buckets.
Having performed this locally, perhaps you can apply some kind of map-reduce-style strategy to combine spatial buckets from different parallel runs of the LSH algorithm in a step-wise manner, making use of the fact that you can begin to exclude parts of your problem space by discounting entire buckets. Obviously you'll have to be careful about edge cases that span different buckets, but I suspect that at each stage of merging, you could apply different bucket sizes/offsets such that you remove this effect (e.g. perform merging spatially equivalent buckets, as well as adjacent buckets). I believe this method could be used to keep memory requirements small (i.e. you shouldn't need to store much more than the points themselves at any given moment, and you are always operating on small(ish) subsets).
If you're looking for some kind of heuristic then I think this result will immediately yield something resembling a "good" solution - i.e. it will give you a small number of probable points which you can check satisfy your criteria. If you are looking for an exact answer, then you are going to have to apply some other methods to trim the search space as you begin to merge parallel buckets.
Another thought I had was that this could relate to finding the metric k-center. It's definitely not the exact same problem, but perhaps some of the methods used in solving that are applicable in this case. The problem is that this assumes you have a metric space in which computing the distance metric is possible - in your case, however, the presence of a billion points makes it undesirable and difficult to perform any kind of global traversal (e.g. sorting of the distances between points). As I said, just a thought, and perhaps a source of further inspiration.
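To illustrate the "bucket size defined in terms of R" idea on a small scale, here is a single-machine sketch. It is plain grid bucketing rather than a real LSH scheme: points are hashed to cubic cells of side R, so all neighbours within R of a point lie in the 27 surrounding cells. The greedy pick of the top N is my own reading of the problem statement, and the name `top_dense_points` is invented.

```python
import math
from collections import defaultdict
from itertools import product

def top_dense_points(points, r, n_top):
    """Count neighbours within r using a grid of cell size r, then greedily
    pick high-count points that are more than r apart from each other."""
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(math.floor(c / r) for c in p)].append(idx)
    counts = []
    for idx, p in enumerate(points):
        cx, cy, cz = (math.floor(c / r) for c in p)
        count = 0
        # Only the 27 cells around the point can hold neighbours within r.
        for dx, dy, dz in product((-1, 0, 1), repeat=3):
            for j in cells.get((cx + dx, cy + dy, cz + dz), ()):
                if j != idx and math.dist(p, points[j]) <= r:
                    count += 1
        counts.append(count)
    chosen = []
    # Visit points from most to fewest neighbours; skip any point that is
    # within r of one already chosen.
    for idx in sorted(range(len(points)), key=lambda i: -counts[i]):
        if all(math.dist(points[idx], points[c]) > r for c in chosen):
            chosen.append(idx)
            if len(chosen) == n_top:
                break
    return [(points[i], counts[i]) for i in chosen]

pts = [(0, 0, 0), (0.1, 0, 0), (0, 0.1, 0), (0.1, 0.1, 0),
       (5, 5, 5), (5.1, 5, 5), (5, 5.1, 5), (10, 10, 10)]
print(top_dense_points(pts, 0.5, 2))
```

For a billion points the cells would have to be distributed across workers, with each worker also seeing the boundary cells of its neighbours, which is the edge-case handling mentioned above.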
Here are some possible parts of a solution.
There are various choices at each stage,
which will depend on Ncluster, on how fast the data changes,
and on what you want to do with the means.
3 steps: quantize, box, K-means.
1) quantize: reduce the input XYZ coordinates to say 8 bits each,
by taking 2^8 percentiles of X,Y,Z separately.
This will speed up the whole flow without much loss of detail.
You could sort all 1G points, or just a random 1M,
to get 8-bit x0 < x1 < ... x256, y0 < y1 < ... y256, z0 < z1 < ... z256
with 2^(30-8) points in each range.
To map float X -> 8 bit x, unrolled binary search is fast —
see Bentley, Pearls p. 95.
Added: Kd trees
split any point cloud into different-sized boxes, each with ~ Leafsize points —
much better than splitting X Y Z as above.
But afaik you'd have to roll your own Kd tree code
to split only the first say 16M boxes, and keep counts only, not the points.
2) box: count the number of points in each 3d box,
[xj .. xj+1, yj .. yj+1, zj .. zj+1].
The average box will have 2^(30-3*8) points;
the distribution will depend on how clumpy the data is.
If some boxes are too big or get too many points, you could
a) split them into 8,
b) track the centre of the points in each box,
otherwise just take box midpoints.
3)
K-means clustering
on the 2^(3*8) box centres.
(Google parallel "k means" -> 121k hits.)
This depends strongly on K aka Ncluster, also on your radius R.
A rough approach would be to grow a
heap
of the say 27*Ncluster boxes with the most points,
then take the biggest ones subject to your Radius constraint.
(I like to start with a
Minimum spanning tree,
then remove the K-1 longest links to get K clusters.)
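The minimum-spanning-tree trick can be sketched as follows. Running Kruskal's algorithm and stopping as soon as K components remain gives the same clusters as building the full tree and deleting its K-1 longest links. The name `mst_clusters` is hypothetical, and the code enumerates all pairs, so it only suits modest inputs such as the box centres.

```python
def mst_clusters(items, k, dist):
    """Kruskal's algorithm stopped early: merge the closest components
    until only k remain."""
    n = len(items)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (dist(items[i], items[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    components = n
    for _, i, j in edges:
        if components == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(items[i])
    return list(groups.values())

values = [1.0, 1.2, 0.9, 10.0, 10.3, 9.8]
print(mst_clusters(values, 2, lambda a, b: abs(a - b)))
```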
See also
Color quantization .
I'd make Nbit, here 8, a parameter from the beginning.
What is your Ncluster ?
Added: if your points are moving in time, see
collision-detection-of-huge-number-of-circles on SO.
I would also suggest to use an octree. The OctoMap framework is very good at dealing with huge 3D point clouds. It does not store all the points directly, but updates the occupancy density of every node (aka 3D box).
After the tree is built, you can use a simple iterator to find the node with the highest density. If you would like to model the point density or distribution inside the nodes, the OctoMap is very easy to adapt.
Here you can see how it was extended to model the point distribution using a planar model.
Just an idea: create a graph with the given points as vertices and an edge between two points whenever their distance is < R.
Creating this kind of graph is similar to spatial decomposition. Your questions can be answered with a local search in the graph: first find the vertices with maximum degree, then find a maximal set of mutually unconnected max-degree vertices.
I think both the graph creation and the search can be parallelized. This approach can have a large memory requirement; splitting the domain and working with graphs for smaller volumes can reduce the memory needed.

Clustering [assessment] algorithm with distance matrix as an input

Can anyone suggest some clustering algorithm which can work with distance matrix as an input? Or the algorithm which can assess the "goodness" of the clustering also based on the distance matrix?
At this moment I'm using a modification of Kruskal's algorithm (http://en.wikipedia.org/wiki/Kruskal%27s_algorithm) to split data into two clusters. It has a problem though. When the data has no distinct clusters the algorithm will still create two clusters with one cluster containing one element and the other containing all the rest. In this case I would rather have one cluster containing all the elements and another one which is empty.
Are there any algorithms which are capable of doing this type of clustering?
Are there any algorithms which can estimate how well the clustering was done or even better how many clusters are there in the data?
The algorithms should work only with distance(similarity) matrices as an input.
Or the algorithm which can assess the
"goodness" of the clustering also
based on the distance matrix?
KNN should be useful in assessing the “goodness” of a clustering assignment. Here's how:
Given a distance matrix with each point labeled according to the cluster it belongs to (its “cluster label”):
Test the cluster label of each point against the cluster labels implied from k-nearest neighbors classification
If the k-nearest neighbors imply an alternative cluster, that classified point lowers the overall “goodness” rating of the cluster
Sum up the “goodness rating” contributions from each one of your points to get a total “goodness rating” for the whole cluster
Unlike k-means cluster analysis, your algorithm will return information about poorly categorized points. You can use that information to reassign certain points to a new cluster thereby improving the overall "goodness" of your clustering.
Since the algorithm knows nothing about the placement of the centroids of the clusters and hence, nothing about the global cluster density, the only way to ensure clusters that are both locally and globally dense would be to run the algorithm for a range of k values and find an arrangement that maximizes the goodness over that range.
For a significant number of points, you'll probably need to optimize this algorithm, possibly with a hash table that keeps track of the nearest points relative to each point. Otherwise this algorithm will take quite a while to compute.
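A bare-bones version of this check might look like the sketch below. It assumes a full square distance matrix `d` and one label per point; the score is simply the fraction of points whose label agrees with the majority label of their k nearest neighbours. The name `knn_goodness` and the exact scoring are my own choices, not a standard measure.

```python
from collections import Counter

def knn_goodness(d, labels, k):
    """Fraction of points whose cluster label matches the majority label
    among their k nearest neighbours (d is a full distance matrix)."""
    n = len(labels)
    agree = 0
    for i in range(n):
        # The k nearest other points, by distance from point i.
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: d[i][j])[:k]
        majority = Counter(labels[j] for j in neighbours).most_common(1)[0][0]
        if majority == labels[i]:
            agree += 1
    return agree / n

# Toy data: six points on a line, as a distance matrix.
xs = [0, 1, 2, 10, 11, 12]
d = [[abs(a - b) for b in xs] for a in xs]
print(knn_goodness(d, [0, 0, 0, 1, 1, 1], 2))
print(knn_goodness(d, [0, 0, 1, 1, 1, 1], 2))
```

Sorting all neighbours for every point is O(n^2 log n) overall, which is where the optimization mentioned above would come in.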
Some approaches that can be used to estimate the number of clusters are:
Minimum Description Length
Bayesian Information Criterion
The gap statistic
scipy.cluster.hierarchy runs 3 steps, just like Matlab(TM)
clusterdata:
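import scipy.spatial.distance
import scipy.cluster.hierarchy as hier
Y = scipy.spatial.distance.pdist( pts ) # condensed distance matrix; you have this already
Z = hier.linkage( Y, method ) # Z has N-1 rows, one per merge
T = hier.fcluster( Z, ncluster, criterion=criterion )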
Here linkage might be a modified Kruskal, dunno.
This SO answer
(ahem) uses the above.
As a measure of clustering, radius = rms distance to cluster centre is fast and reasonable,
for 2d/3d points.
Tell us about your Npt, ndim, ncluster, hier/flat ?
Clustering is a largish area, one size does not fit all.
