A density-based clustering library that takes a distance matrix as input

Need help with finding an open/free density-based clustering library that takes a distance matrix as input and returns clusters in which each element is at most "x" distance away from each of the other elements in the cluster (basically returning clusters with a specified density).
I checked out the DBSCAN algorithm, and it seems to suit my needs. Are there any clean implementations of DBSCAN that you might know of, which can work with a pre-computed distance matrix and output clusters with the desired density?
Your inputs will be really useful.

ELKI (at http://elki.dbs.ifi.lmu.de/ ) can load external distance matrices, in either a binary or an ASCII format, and then run distance-based clustering algorithms on them.
Certain algorithms such as k-means cannot work, however, as they rely on the distance to the /mean/, which is obviously not precomputed. But e.g. DBSCAN and OPTICS work fine with precomputed distances.

I haven't tried it out yet, but I'm looking for something similar and came across this python implementation of DBSCAN:
http://scikit-learn.org/dev/auto_examples/cluster/plot_dbscan.html#example-cluster-plot-dbscan-py
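For what it's worth, scikit-learn's DBSCAN can also consume a precomputed distance matrix directly via metric='precomputed'. A minimal sketch (the eps and min_samples values are placeholders to tune):

    import numpy as np
    from sklearn.cluster import DBSCAN

    # D is assumed to be a symmetric (n x n) distance matrix you computed yourself
    D = np.array([[0.0, 0.2, 5.0],
                  [0.2, 0.0, 4.8],
                  [5.0, 4.8, 0.0]])

    db = DBSCAN(eps=0.5, min_samples=2, metric='precomputed').fit(D)
    print(db.labels_)  # one cluster label per point; -1 marks noise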

The Matlab File Exchange has an implementation which is straightforward to adapt to precomputed matrices: just remove the call to pdist1 and compute the distance matrix outside the function in your own code.

Related

Clustering elements based on highest similarity

I'm working with Docker images, which consist of a set of re-usable layers. Now, given a collection of images, I would like to combine images which have a large number of shared layers.
To be more exact: given a collection of N images, I want to create clusters where all images in a cluster share more than X percent of their layers with each other. Each image is only allowed to belong to one cluster.
My own research points in the direction of cluster algorithms where I use a similarity measure to decide which images belong in a cluster together. The similarity measure I know how to write. However, I'm having difficulty finding an exact algorithm or pseudo-algorithm to get started.
Can someone recommend an algorithm to solve this problem or provide pseudo-code please?
EDIT: after some more searching I believe I'm looking for something like this hierarchical clustering ( https://github.com/lbehnke/hierarchical-clustering-java ) but with a threshold X so that neighbors with less than X% similarity don't get combined and stay in a separate cluster.
I take it you are a developer and have no experience with data science?
There are a number of clustering algorithms, each with advantages and disadvantages (please consult https://en.wikipedia.org/wiki/Cluster_analysis), but I think the solution to your problem is simpler than you might think.
I assume that N is small enough that you can store a matrix of N^2 float values in RAM? If so, you are in a very comfortable situation. You write that you know how to implement the similarity measure, so just calculate the measure for all N^2 pairs and store it in a matrix (the matrix is symmetric, so only half of it needs to be stored). Make sure your similarity measure assigns a special value, such as 0 or infinity, to pairs of images whose similarity is below the X% threshold (which value depends on whether you treat the function as a similarity measure or as a distance). I think the cleanest solution is to assign 1 to pairs whose similarity is greater than the X% threshold and 0 otherwise.
After that, treat it just like a graph. Take the first vertex and run, e.g., a depth-first search or any other graph traversal routine. That is your first cluster. Then take the first unvisited vertex and repeat the traversal. Of course you can store the graph as an adjacency list to save memory.
This algorithm assumes that you do not really care how similar the images are, or which pairs are more similar than others, only whether they are similar enough (similarity measure greater than the given threshold).
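A minimal Python sketch of this thresholded-graph approach, where similarity() and threshold are hypothetical stand-ins for your own measure and X:

    def cluster_by_threshold(items, similarity, threshold):
        # Build an adjacency list: edge between i and j if they are "similar enough".
        n = len(items)
        adj = [[] for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                if similarity(items[i], items[j]) > threshold:
                    adj[i].append(j)
                    adj[j].append(i)

        # Each connected component, found by depth-first search, is one cluster.
        visited = [False] * n
        clusters = []
        for start in range(n):
            if visited[start]:
                continue
            stack, component = [start], []
            visited[start] = True
            while stack:
                v = stack.pop()
                component.append(v)
                for w in adj[v]:
                    if not visited[w]:
                        visited[w] = True
                        stack.append(w)
            clusters.append(component)
        return clusters  # lists of indices into items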
Unfortunately, in cluster analysis it is common that 100% of the possible pairs have to be computed. It is possible to save some distance calls by using fancy data structures for k-nearest-neighbour search, but you have to make sure that your similarity measure satisfies the triangle inequality.
If you are not satisfied with this answer, please specify more details of your problem and read about:
K-means (main disadvantage: you have to specify the number of clusters)
Hierarchical clustering (slow computation time; at the top, all images are in one cluster, so you have to cut the dendrogram at the proper distance)
Spectral clustering (for graphs, but I think it is too complicated for this easy problem)
I ended up solving the problem by using hierarchical clustering and then traversing each branch of the dendrogram top to bottom until I find a cluster where the distance is below the threshold. In the worst case there is no such cluster, but then I'll end up in a leaf of the dendrogram, which means that element is in a cluster of its own.

Can k-means clustering do classification?

I want to know whether the k-means clustering algorithm can do classification?
Suppose I have done a simple k-means clustering.
Assume I have a lot of data; I use k-means clustering and get 2 clusters, A and B, and the centroid calculation method is Euclidean distance.
Cluster A at left side.
Cluster B at right side.
So, if I have a new data point, what should I do?
1. Run the k-means clustering algorithm again and see which cluster the new data point belongs to?
2. Record the final centroids and use the Euclidean distance to them to decide which cluster the new data point belongs to?
3. Some other method?
The simplest method is of course 2.: assign each object to the closest centroid (technically, use the sum of squares, not the Euclidean distance; this is more correct for k-means and saves you a sqrt computation).
Method 1. is fragile, as k-means may give you a completely different solution; in particular if it didn't fit your data well in the first place (e.g. too high dimensional, clusters of too different size, too many clusters, ...)
However, the following method may be even more reasonable:
3. Train an actual classifier.
Yes, you can use k-means to produce an initial partitioning, then assume that the k-means partitions could be reasonable classes (you really should validate this at some point though), and then continue as you would if the data had been user-labeled.
I.e. run k-means, then train an SVM on the resulting clusters, and use that SVM for classification.
k-NN classification, or even assigning each object to the nearest cluster center (option 2), can be seen as very simple classifiers. The latter is a 1NN classifier, "trained" on the cluster centroids only.
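A minimal sketch of that k-means-then-SVM pipeline with scikit-learn (the data, number of clusters, and SVM settings here are placeholders):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.svm import SVC

    X = np.random.rand(200, 2)              # stand-in for your training data
    km = KMeans(n_clusters=2, n_init=10).fit(X)
    svm = SVC().fit(X, km.labels_)          # treat the cluster labels as classes

    X_new = np.random.rand(5, 2)            # new points to classify
    print(svm.predict(X_new))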
Yes, we can do classification.
I wouldn't say the algorithm itself (like option 1) is particularly well-suited to classifying points, as incorporating the data to be classified into your training data tends to be frowned upon (unless you have a real-time system, but I think elaborating on this would get a bit far from the point).
To classify a new point, simply calculate the Euclidean distance to each cluster centroid to determine the closest one, then classify it under that cluster.
There are data structures that allow you to determine the closest centroid more efficiently (such as a kd-tree), but the above is the basic idea.
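A minimal numpy sketch of that nearest-centroid assignment (the centroid values are placeholders; in practice they come from your k-means run):

    import numpy as np

    def assign_to_nearest_centroid(X_new, centroids):
        # squared Euclidean distance to every centroid; argmin picks the cluster
        d2 = ((X_new[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    centroids = np.array([[0.0, 0.0], [5.0, 5.0]])    # e.g. clusters A and B
    X_new = np.array([[0.5, 0.2], [4.8, 5.1]])
    print(assign_to_nearest_centroid(X_new, centroids))  # -> [0 1]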
If you've already done k-means clustering on your data to get two clusters, then you could use k Nearest Neighbors on the new data point to find out which class it belongs to.
Here is another method:
I saw it in "The Elements of Statistical Learning". I'll change the notation a little bit. Let C be the number of classes and K the number of clusters. Now, follow these steps:
Apply K-means clustering to the training data of each class separately, using K clusters per class.
Assign a class label to each of the C*K clusters.
Classify observation x to the class of the closest cluster.
It seems like a nice approach for classification that reduces data observations by using clusters.
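A minimal sketch of that scheme in Python, assuming a feature matrix X and class labels y as numpy arrays (K is a placeholder you would tune):

    import numpy as np
    from sklearn.cluster import KMeans

    def fit_class_centroids(X, y, K=3):
        # Run K-means within each class and tag every centroid with that class.
        centroids, labels = [], []
        for c in np.unique(y):
            km = KMeans(n_clusters=K, n_init=10).fit(X[y == c])
            centroids.append(km.cluster_centers_)
            labels.extend([c] * K)
        return np.vstack(centroids), np.array(labels)

    def predict(X_new, centroids, centroid_labels):
        # Classify each observation to the class of its closest labeled centroid.
        d2 = ((X_new[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        return centroid_labels[d2.argmin(axis=1)]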
If you are doing real-time analysis where you want to recognize new conditions during use (or adapt to a changing system), then you can choose some radius around the centroids to decide whether a new point starts a new cluster or should be included in an existing one. (That's a common need in monitoring of plant data, for instance, where it may take years after installation before some operating conditions occur.) If real-time monitoring is your case, check RTEFC or RTMAC, which are efficient, simple real-time variants of K-means. RTEFC in particular, which is non-iterative. See http://gregstanleyandassociates.com/whitepapers/BDAC/Clustering/clustering.htm
Yes, you can use that for classification. If you've decided you have collected enough data for all possible cases, you can stop updating the clusters and just classify new points based on the nearest centroid. As in any real-time method, there will be sensitivity to outliers - e.g., those caused by sensor error or failure when using sensor data. If you create new clusters, outliers could be considered legitimate if one purpose of the clustering is to identify faults in the sensors, although that is most useful when you can do some labeling of clusters.
You are confusing the concepts of clustering and classification. When you have labeled data, you already know how the data is grouped according to the labels, and there is no point in clustering the data unless you want to find out how well your features can discriminate between the classes.
If you run the k-means algorithm to find the centroid of each class and then use the distances from the centroids to classify a new data point, you in fact implement a form of the linear discriminant analysis algorithm assuming the same multiple-of-identity covariance matrix for all classes.
After the k-means clustering algorithm converges, it can be used for classification, with only a few labeled exemplars / training data.
This is a very common approach when the number of labeled training instances is very limited due to the high cost of labeling.

A clustering algorithm that accepts an arbitrary distance function

I have about 200 points in the Cartesian plane (2D). I want to cluster these points into k clusters with respect to an arbitrary distance function (not a matrix) and get the so-called centroids or representatives of these clusters. I know k-means does this with respect to some special distance functions such as Euclidean, Manhattan, Cosine, etc. But k-means cannot handle an arbitrary distance function: for example, in the centroid-updating phase of k-means with the Euclidean distance, the mean of the points in each cluster is the least-squares estimate and minimizes the sum of squared distances of the points in the cluster to its centroid (the mean); however, the mean of the points may not minimize the distances when the distance function is arbitrary. Could you please help me with this and tell me if you know of any clustering algorithms that could work for me?
If you replace "mean" with "most central point in cluster", then you get the k-medoids algorithm. Wikipedia claims that a metric is required, but I believe that to be incorrect, since I can't see where the majorization-minimization proof needs the triangle inequality or even symmetry.
There are various clustering algorithms that can work with arbitrary distance functions, in particular:
hierarchical clustering
k-medoids (PAM)
DBSCAN
OPTICS
many, many more - get a good clustering book and/or software
But the only one which enforces k clusters and uses a "cluster representative" model is k-medoids. You may be putting too many constraints on the cluster model to get a wider choice.
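For illustration, a naive k-medoids sketch in Python that only needs a precomputed matrix of pairwise distances from your arbitrary distance function (a simple alternating scheme, not a full PAM implementation):

    import numpy as np

    def k_medoids(D, k, max_iter=100, seed=None):
        """Naive k-medoids on a precomputed (n x n) distance matrix D."""
        rng = np.random.default_rng(seed)
        n = D.shape[0]
        medoids = rng.choice(n, size=k, replace=False)
        for _ in range(max_iter):
            # assignment step: each point goes to its nearest medoid
            labels = D[:, medoids].argmin(axis=1)
            # update step: the new medoid of a cluster is the member with the
            # smallest total distance to all other members
            new_medoids = medoids.copy()
            for c in range(k):
                members = np.where(labels == c)[0]
                if len(members) == 0:
                    continue
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[within.argmin()]
            if np.array_equal(new_medoids, medoids):
                break
            medoids = new_medoids
        labels = D[:, medoids].argmin(axis=1)
        return medoids, labels  # medoids are indices of actual data points

The medoids it returns are actual data points, which is as close as you can get to a "representative" when no mean is available.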
Since you want something that represents a centroid but is not one of the data points, a technique I once used was to perform something like k-medoids on N random samples, then take all the members of each cluster and use them as samples to build a classifier which returns a class label... in the end, each class label returned from the classifier ends up being an abstract notion of a set of clusters/centroids. I did this for a very specific and nuanced reason; I know the flaws.
If you don't want to have to specify K, and your vectors are not enormous and super sparse, then I would take a look at the Cobweb clustering in JavaML; JavaML also has a decent KMedoids.

Measuring distance between vectors

I have a set of 300,000 or so vectors which I would like to compare in some way. Given one vector, I want to be able to find the closest vector. I have thought of the following methods:
Simple Euclidian distance
Cosine similarity
Use a kernel (for instance Gaussian) to calculate the Gram matrix.
Treat the vector as a discrete probability distribution (which makes sense to do) and calculate some divergence measure.
I do not really understand when it is useful to do one rather than the other. My data has a lot of zero elements. With that in mind, is there some general rule of thumb as to which of these methods is best?
Sorry for the weak question, but I had to start somewhere...
Thank you!
Your question is not quite clear: are you looking for a distance metric between vectors, or for an algorithm to efficiently find the nearest neighbour?
If your vectors just contain a numeric type such as doubles or integers, you can find a nearest neighbour efficiently using a structure such as a kd-tree (since you are just looking at points in d-dimensional space). See http://en.wikipedia.org/wiki/Nearest_neighbor_search for other methods.
Otherwise, choosing a distance metric and algorithm is very much dependent on the content of the vectors.
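For the plain-numeric case, a minimal SciPy kd-tree sketch (the random data is a stand-in for your vectors):

    import numpy as np
    from scipy.spatial import cKDTree

    X = np.random.rand(300_000, 10)     # stand-in for your 300,000 vectors
    tree = cKDTree(X)                   # build the kd-tree once

    query = np.random.rand(10)
    dist, idx = tree.query(query, k=1)  # nearest neighbour and its distance
    print(idx, dist)

Note that kd-trees degrade in high dimensions, which ties in with the curse-of-dimensionality point below.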
If your vectors are very sparse in nature and binary, you can use the Hamming or Hellinger distance. When your vector dimensions are large, avoid using the Euclidean distance (see http://en.wikipedia.org/wiki/Curse_of_dimensionality).
Please refer to http://citeseerx.ist.psu.edu/viewdoc/download?rep=rep1&type=pdf&doi=10.1.1.154.8446 for a survey of distance/similarity measures, although the paper limits it to pair of probability distributions.

Clustering Algorithm - Applied to a set of Earthquake Data

So I'm looking to apply a clustering algorithm to the earthquake data provided by the USGS.
http://earthquake.usgs.gov/earthquakes/feed/
My main goal is to determine the top 10 most dangerous places to be (either by the number of earthquakes or by the magnitude of the earthquakes a place experiences), based on the earthquake feed.
Are there any suggestions on how to do this? I'm looking at k-means and then taking a sum over each cluster (with each earthquake weighted by its magnitude) to identify the most dangerous clusters.
I'm also writing this in ruby as a code reference.
Thanks
K-means can't handle outliers in the data set very well.
Furthermore, it is designed around variance, but variance in latitude and longitude is not really meaningful. In fact, k-means cannot handle the longitude ±180° wrap-around. Instead, you will want to use the great-circle distance.
So try a density-based clustering algorithm that allows you to use distances such as the great-circle distance!
Read up on Wikipedia and a good book on cluster analysis.
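For instance, scikit-learn's DBSCAN supports the haversine (great-circle) metric on latitude/longitude given in radians. A minimal sketch, with the coordinates, eps, and min_samples as placeholder values you would tune:

    import numpy as np
    from sklearn.cluster import DBSCAN

    # coords: rows of (latitude, longitude) in degrees, e.g. from the USGS feed
    coords = np.array([[38.3, 142.4], [36.1, -120.7], [38.4, 142.5]])
    X = np.radians(coords)                        # haversine expects radians

    earth_radius_km = 6371.0
    eps_km = 100.0                                # neighbourhood radius in km
    db = DBSCAN(eps=eps_km / earth_radius_km,     # convert km to radians
                min_samples=2,
                metric='haversine',
                algorithm='ball_tree').fit(X)
    print(db.labels_)                             # -1 marks noise/outliers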
