A clustering algorithm that accepts an arbitrary distance function

A clustering algorithm that accepts an arbitrary distance function - algorithm

I have about 200 points in Cartesian plane (2D). I want to cluster these points to k clusters with respect to arbitrary distance function (not matrix) and get the so-called centroid or representatives of these clusters. I know kmeans does this with respect to some special distance functions such as Euclidean, Manhattan, Cosine, etc. But, kmeans cannot handle arbitrary distance function because for example in centroid-updating phase of kmeans with respect to Euclidean distance function, mean of the points in each cluster is the LSE and minimizes the sum of distances of the nodes in the cluster to its centroid (mean); however, mean of the points may not minimize the ditances when the distance function is something arbitrary. Could you please help me about it and tell me if you know about any clustering algorithms that can work for me?

If you replace "mean" with "most central point in cluster", then you get the k-medoids algorithm. Wikipedia claims that a metric is required, but I believe that to be incorrect, since I can't see where the majorization-minimization proof needs the triangle inequality or even symmetry.

There are various clustering algorithms that can work with arbitrary distance functions, in particular:
hierarchical clustering
k-medoids (PAM)
DBSCAN
OPTICS
many many more - get some good clustering book and/or software
But the only one which enforces k clusters and uses a "cluster representative" model is k-medoids. You may be putting too many constraints on the cluster model to get a wider choice.

Since you want something that represents a centroid but is not one of the data points, a technique I once used was to perform something like Kmedoids on N random samples, then I took all the members of each cluster and used them as samples to build a classifier which returned a class label... in the end each class label returned from the classifier ends up being an abstract notion of a set of cluster/centroids. I did this for a very specific and nuanced reason, I know the flaws.
If you don't want to have to specify K, and your vectors are not enormous and super sparse, then I would take a look at the cobweb clustering in JavaML, JavaML also has a decent KMedoids.

Related

Clustering elements based on highest similarity

I'm working with Docker images which consist of a set of re-usable layers. Now given a collection of images, I would like to combine images which have a large amount of shared layers.
To be more exact: Given a collection of N images, I want to create clusters where all images in a cluster share more than X percent of services with eachother. Each image is only allowed to belong to one cluster.
My own research points in the direction of cluster algorithms where I use a similarity measure to decide which images belong in a cluster together. The similarity measure I know how to write. However, I'm having difficulty finding an exact algorithm or pseudo-algorithm to get started.
Can someone recommend an algorithm to solve this problem or provide pseudo-code please?
EDIT: after some more searching I believe I'm looking for something like this hierarchical clustering ( https://github.com/lbehnke/hierarchical-clustering-java ) but with a threshold X so that neighbors with less than X% similarity don't get combined and stay in a separate cluster.

I believe you are a developer and you have no experience with data science?
There are a number of clustering algorithms and they have their advantages and disadvantages (please consult https://en.wikipedia.org/wiki/Cluster_analysis), but I think solution for your problem is easier than one can think.
I assume that N is small enough so you can store a matrix with N^2 float values in RAM memory? If this is the case, you are in a very comfortable situation. You write that you know how to implement similarity measure, so just calculate the measure for all N^2 pairs and store it in a matrix (it is a symmetric matrix, so only half of it can be stored). Please ensure that your similarity measure assigns special value for pair of images, where similarity measure is less than some X%, like 0 or infinity (it depends on that you treat a function like similarity measure or like a distance). I think perfect solution is to assign 1 for pairs, where similarity is greater than X% threshold and 0 otherwise.
After that, treat is just like a graph. Get first vertex and make, e.g., deep first search or any other graph walking routine. This is your first cluster. After that get first not visited vertex and repeat graph walking. Of course you can store graph as an adjacency list to save memory.
This algorithm assumes that you really do not pay attention to that how much images are similar and which pairs are more similar than other, but if they are similar enough (similarity measure is greater than a given threshold).
Unfortunately in cluster analysis it is common that 100% of possible pairs has to be computed. It is possible to save some number of distance calls using some fancy data structures for k-nearest neighbor search, but you have to assure that your similarity measure hold triangle inequality.
If you are not satisfied with this answer, please specify more details of your problem and read about:
K-means (main disadvantage: you have to specify number of clusters)
Hierarchical clustering (slow computation time, at the top all images are in one cluster, you have to cut a dendrogram at proper distance)
Spectral clustering (for graphs, but I think it is too complicated for this easy problem)

I ended up solving the problem by using hierarchical clustering and then traversing each branch of the dendrogram top to bottom until I find a cluster where the distance is below a threshold. Worst case there is no such cluster but then I'll end up in a leaf of the dendrogram which means that element is in a cluster of its own.

Can k-means clustering do classification?

I want to know whether the k-means clustering algorithm can do classification?
If I have done a simple k-means clustering .
Assume I have many data , I use k-means clusterings, then get 2 clusters A, B. and the centroid calculating method is Euclidean distance.
Cluster A at left side.
Cluster B at right side.
So, if I have one new data. What should I do?
Run k-means clustering algorithm again, and can get which cluster does the new data belong to?
Record the last centroid and use Euclidean distance to calculating to decide the new data belong to?
other method?

The simplest method of course is 2., assign each object to the closest centroid (technically, use sum-of-squares, not Euclidean distance; this is more correct for k-means, and saves you a sqrt computation).
Method 1. is fragile, as k-means may give you a completely different solution; in particular if it didn't fit your data well in the first place (e.g. too high dimensional, clusters of too different size, too many clusters, ...)
However, the following method may be even more reasonable:
3. Train an actual classifier.
Yes, you can use k-means to produce an initial partitioning, then assume that the k-means partitions could be reasonable classes (you really should validate this at some point though), and then continue as you would if the data would have been user-labeled.
I.e. run k-means, train a SVM on the resulting clusters. Then use SVM for classification.
k-NN classification, or even assigning each object to the nearest cluster center (option 1) can be seen as very simple classifiers. The latter is a 1NN classifier, "trained" on the cluster centroids only.

Yes, we can do classification.
I wouldn't say the algorithm itself (like #1) is particularly well-suited to classifying points, as incorporating data to be classified into your training data tends to be frowned upon (unless you have a real-time system, but I think elaborating on this would get a bit far from the point).
To classify a new point, simply calculate the Euclidean distance to each cluster centroid to determine the closest one, then classify it under that cluster.
There are data structures that allows you to more efficiently determine the closest centroid (like a kd-tree), but the above is the basic idea.

If you've already done k-means clustering on your data to get two clusters, then you could use k Nearest Neighbors on the new data point to find out which class it belongs to.

Here another method:
I saw it on "The Elements of Statistical Learning". I'll change the notation a little bit. Let C be the number of classes and K the number of clusters. Now, follow these steps:
Apply K-means clustering to the training data in each class seperately, using K clusters per class.
Assign a class label to each of the C*K clusters.
Classify observation x to the class of the closest cluster.
It seems like a nice approach for classification that reduces data observations by using clusters.

If you are doing real-time analysis where you want to recognize new conditions during use (or adapt to a changing system), then you can choose some radius around the centroids to decide whether a new point starts a new cluster or should be included in an existing one. (That's a common need in monitoring of plant data, for instance, where it may take years after installation before some operating conditions occur.) If real-time monitoring is your case, check RTEFC or RTMAC, which are efficient, simple real-time variants of K-means. RTEFC in particular, which is non-iterative. See http://gregstanleyandassociates.com/whitepapers/BDAC/Clustering/clustering.htm
Yes, you can use that for classification. If you've decided you have collected enough data for all possible cases, you can stop updating the clusters, and just classify new points based on the nearest centroid. As in any real-time method, there will be sensitivity to outliers - e.g., a caused by sensor error or failure when using sensor data. If you create new clusters, outliers could be considered legitimate if one purpose of the clustering is identify faults in the sensors, although that the most useful when you can do some labeling of clusters.

You are confusing the concepts of clustering and classification. When you have labeled data, you already know how the data is clustered according to the labels and there is no point in clustering the data unless if you want to find out how well your features can discriminate the classes.
If you run the k-means algorithm to find the centroid of each class and then use the distances from the centroids to classify a new data point, you in fact implement a form of the linear discriminant analysis algorithm assuming the same multiple-of-identity covariance matrix for all classes.

After k-means Clustering algorithm converges, it can be used for classification, with few labeled exemplars/training data.
It is a very common approach when the number of training instances(data) with labels are very limited due to high cost of labeling.

Trajectory Clustering: Which Clustering Method?

As a newbie in Machine Learning, I have a set of trajectories that may be of different lengths. I wish to cluster them, because some of them are actually the same path and they just SEEM different due to the noise.
In addition, not all of them are of the same lengths. So maybe although Trajectory A is not the same as Trajectory B, yet it is part of Trajectory B. I wish to present this property after the clustering as well.
I have only a bit knowledge of K-means Clustering and Fuzzy N-means Clustering. How may I choose between them two? Or should I adopt other methods?
Any method that takes the "belongness" into consideration?
(e.g. After the clustering, I have 3 clusters A, B and C. One particular trajectory X belongs to cluster A. And a shorter trajectory Y, although is not clustered in A, is identified as part of trajectory B.)
=================== UPDATE ======================
The aforementioned trajectories are the pedestrians' trajectories. They can be either presented as a series of (x, y) points or a series of step vectors (length, direction). The presentation form is under my control.

It might be a little late but I am also working on the same problem.
I suggest you take a look at TRACLUS, an algorithm created by Jae-Gil Lee, Jiawei Han and Kyu-Young Wang, published on SIGMOD’07.
http://web.engr.illinois.edu/~hanj/pdf/sigmod07_jglee.pdf
This is so far the best approach I have seen for clustering trajectories because:
Can discover common sub-trajectories.
Focuses on Segments instead of points (so it filters out noise-outliers).
It works over trajectories of different length.
Basically is a 2 phase approach:
Phase one - Partition: Divide trajectories into segments, this is done using MDL Optimization with complexity of O(n) where n is the numbers of points in a given trajectory. Here the input is a set of trajectories and output is a set of segments.
Complexity: O(n) where n is number of points on a trajectory
Input: Set of trajectories.
Output: Set D of segments
Phase two - Group: This phase discovers the clusters using some version of density-based clustering like in DBSCAN. Input in this phase is the set of segments obtained from phase one and some parameters of what constitutes a neighborhood and the minimum amount of lines that can constitute a cluster. Output is a set of clusters. Clustering is done over segments. They define their own distance measure made of 3 components: Parallel distance, perpendicular distance and angular distance. This phase has a complexity of O(n log n) where n is the number of segments.
Complexity: O(n log n) where n is number of segments on set D
Input: Set D of segments, parameter E that sets neighborhood treshold and parameter MinLns that is the minimun number of lines.
Output: Set C of Cluster, that is a Cluster of segments (trajectories clustered).
Finally they calculate a for each cluster a representative trajectory, which is nothing else that a discovered common sub-trajectory in each cluster.
They have pretty cool examples and the paper is very well explained. Once again this is not my algorithm, so don't forget to cite them if you are doing research.
PS: I made some slides based on their work, just for educational purposes:
http://www.slideshare.net/ivansanchez1988/trajectory-clustering-traclus-algorithm

Every clustering algorithm needs a metric. You need to define distance between your samples. In your case simple Euclidean distance is not a good idea, especially if the trajectories can have different lengths.
If you define a metric, than you can use any clustering algorithm that allows for custom metric. Probably you do not know the correct number of clusters beforehand, then hierarchical clustering is a good option. K-means doesn't allow for custom metric, but there are modifications of K-means that do (like K-medoids)
The hard part is defining distance between two trajectories (time series). Common approach is DTW (Dynamic Time Warping). To improve performance you can approximate your trajectory by smaller amount of points (many algorithms for that).

Neither will work. Because what is a proper mean here?
Have a look at distance based clustering methods, such as hierarchical clustering (for small data sets, but you probably don't have thousands of trajectories) and DBSCAN.
Then you only need to choose an appropriate distance function that allows e.g. differences in time and spatial resolution of trajectories.
Distance functions such as dynamic time warping (DTW) distance can accomodate this.

This is good concept and having possibility for real-time applications. In my view, one can adopt any clustering but need to select appropriate dissimilarity measure, later need to think about computational complexity.
Paper (http://link.springer.com/chapter/10.1007/978-81-8489-203-1_15) used Hausdorff and suggest the technique for reducing complexity, and paper (http://www.cit.iit.bas.bg/CIT_2015/v-15-2/4-5-TCMVS%20A-edited-md-Gotovop.pdf) described the use of "Trajectory Clustering Technique Based on Multi-View Similarity"

Clustering [assessment] algorithm with distance matrix as an input

Can anyone suggest some clustering algorithm which can work with distance matrix as an input? Or the algorithm which can assess the "goodness" of the clustering also based on the distance matrix?
At this moment I'm using a modification of Kruskal's algorithm (http://en.wikipedia.org/wiki/Kruskal%27s_algorithm) to split data into two clusters. It has a problem though. When the data has no distinct clusters the algorithm will still create two clusters with one cluster containing one element and the other containing all the rest. In this case I would rather have one cluster containing all the elements and another one which is empty.
Are there any algorithms which are capable of doing this type of clustering?
Are there any algorithms which can estimate how well the clustering was done or even better how many clusters are there in the data?
The algorithms should work only with distance(similarity) matrices as an input.

Or the algorithm which can assess the
"goodness" of the clustering also
based on the distance matrix?
KNN should be useful in assessing the “goodness” of a clustering assignment. Here's how:
Given a distance matrix with each point labeled according to the cluster it belongs to (its “cluster label”):
Test the cluster label of each point against the cluster labels implied from k-nearest neighbors classification
If the k-nearest neighbors imply an alternative cluster, that classified point lowers the overall “goodness” rating of the cluster
Sum up the “goodness rating” contributions from each one of your pixels to get a total “goodness rating” for the whole cluster
Unlike k-means cluster analysis, your algorithm will return information about poorly categorized points. You can use that information to reassign certain points to a new cluster thereby improving the overall "goodness" of your clustering.
Since the algorithm knows nothing about the placement of the centroids of the clusters and hence, nothing about the global cluster density, the only way to insure clusters that are both locally and globally dense would be to run the algorithm for a range of k values and finding an arrangement that maximizes the goodness over the range of k values.
For a significant amount of points, you'll probably need to optimize this algorithm; possibly with a hash-table to keep track of the the nearest points relative to each point. Otherwise this algorithm will take quite awhile to compute.

Some approaches that can be used to estimate the number of clusters are:
Minimum Description Length
Bayesian Information Criterion
The gap statistic

scipy.cluster.hierarchy runs 3 steps, just like Matlab(TM)
clusterdata:
Y = scipy.spatial.distance.pdist( pts ) # you have this already
Z = hier.linkage( Y, method ) # N-1
T = hier.fcluster( Z, ncluster, criterion=criterion )
Here linkage might be a modified Kruskal, dunno.
This SO answer
(ahem) uses the above.
As a measure of clustering, radius = rms distance to cluster centre is fast and reasonable,
for 2d/3d points.
Tell us about your Npt, ndim, ncluster, hier/flat ?
Clustering is a largish area, one size does not fit all.

How to cluster objects (without coordinates)

I have a list of opaque objects. I am only able to calculate the distance between them (not true, just setting the conditions for the problem):
class Thing {
public double DistanceTo(Thing other);
}
I would like to cluster these objects. I would like to control the number of clusters and I would like for "close" objects to be in the same cluster:
List<Cluster> cluster(int numClusters, List<Thing> things);
Can anyone suggest (and link to ;-)) some clustering algorithms (the simpler, the better!) or libraries that can help me?
Clarification Most clustering algorithms require that the objects be laid out in some N-dimensional space. This space is used to find "centroids" of clusters. In my case, I do not know what N is, nor do I know how to extract a coordinate system from the objects. All I know is how far apart 2 objects are. I would like to find a good clustering algorithm that uses only that information.
Imagine that you are clustering based upon the "smell" of an object. You don't know how to lay "smells out" on a 2D plane, but you do know whether two smells are similar or not.

I think you are looking for K-Medoids. It's like K-means in that you specify the number of clusters, K, in advance, but it doesn't require that you have a notion of "averaging" the objects you're clustering like K-means does.
Instead, every cluster has a representative medoid, which is the member of the cluster closest to the middle. You could think of it as a version of K-means that finds "medians" instead of "means". All you need is a distance metric to cluster things, and I've used this in some of my own work for exactly the same reasons you cite.
Naive K-medoids is not the fastest algorithm, but there are fast variants that are probably good enough for your purposes. Here are descriptions of the algorithms and links to the documentation for their implementations in R:
PAM is the basic O(n^2) implementation of K-medoids.
CLARA is a much faster, sampled version of PAM. It works by clustering randomly sampled subset of objects with PAM and grouping the entire set of objects based on the subset. You should still be able to get very good clusterings fast with this.
If you need more information, here's a paper that gives an overview of these and other K-medoids methods.

Here's an outline for a clustering algorithm that doesn't have the K-means requirement of finding a centroid.
Determine the distance between all objects. Record the n most separate objects. [finds roots of our clusters, time O(n^2)]
Assign each of these n random points to n new distinct clusters.
For every other object:[assign objects to clusters, time O(n^2)]
For each cluster:
Calculate the average distance from a cluster to that object by averaging the distance of each object in the cluster to the object.
Assign the object to the closest cluster.
This algorithm will certainly cluster the objects. But its runtime is O(n^2). Plus it is guided by those first n points chosen.
Can anyone improve upon this (better runtime perf, less dependent upon initial choices)? I would love to see your ideas.

Here's a quick algorithm.
While (points_left > 0) {
Select a random point that is not already clustered
Add point and all points within x distance
that aren't already clustered to a new cluster.
}
Alternatively, read the wikipedia page. K-means clustering is a good choice:
The K-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest. The center is the average of all the points in the cluster — that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster.
The algorithm steps are:
* Choose the number of clusters, k.
* Randomly generate k clusters and determine the cluster centers, or
directly generate k random points as cluster centers.
* Assign each point to the nearest cluster center.
* Recompute the new cluster centers.
* Repeat the two previous steps until some convergence criterion is
met (usually that the assignment hasn't changed).
The main advantages of this algorithm
are its simplicity and speed which
allows it to run on large datasets.
Its disadvantage is that it does not
yield the same result with each run,
since the resulting clusters depend on
the initial random assignments. It
minimizes intra-cluster variance, but
does not ensure that the result has a
global minimum of variance. Another
disadvantage is the requirement for
the concept of a mean to be definable
which is not always the case. For such
datasets the k-medoids variant is
appropriate.

How about this approach:
Assign all objects to one cluster.
Find the two objects, a and b, that are within the same cluster, k, and that are maximum distance apart. To clarify, there should be one a and b for the whole set, not one a and b for each cluster.
Split cluster k into two clusters, k1 and k2, one with object a and one with object b.
For all other objects in cluster k, add them to either k1 or k2 by determining the minimum average distance to all other objects in that cluster.
Repeat steps 2-5 until N clusters are formed.
I think this algorithm should give you a fairly good clustering, although the efficiency might be pretty bad. To improve the efficiency you could alter step 3 so that you find the minimum distance to only the original object that started the cluster, rather than the average distance to all objects already in the cluster.

Phylogenetic DNA sequence analysis regularly uses hierarchical clustering on text strings, with [alignment] distance matrices. Here's a nice R tutorial for clustering:
http://www.statmethods.net/advstats/cluster.html
(Shortcut: Go straight to the "Hierarchical Agglomerative" section...)
Here are some other [language] libraries :
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/
http://code.google.com/p/scipy-cluster/
This approach could help determine how many [k] "natural" clusters there are and which objects to use as roots for the k-means approaches above.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio