I am trying to cluster my data using KMeans. The problem is that when I look at the distribution of the clusters on the first two principal components (which cover 82% of the variance in the data), the figure shows that the clustering is good. But the silhouette coefficient is very low. What is the reason?
I am using the below code for Kmeans clustering and calculating the Silhouette score:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

kmeans = KMeans(n_clusters=n_clusters, n_init=20, init='random')
cluster_labels = kmeans.fit_predict(x)
silhouette_avg = silhouette_score(x, cluster_labels)
I'm new to MATLAB and I want to know how to perform the k-means algorithm in MATLAB, and also how to define the cluster centers when performing k-means.
For example, let's say I'm creating an array as given below.
m = [11.00; 10.60; 11.00; 12.00; 11.75; 11.55; 11.20; 12.10; 11.60; 11.40;...
11.10; 11.60; 11.00; 10.15; 9.55; 9.35; 8.55; 7.65; 7.80; 8.45; 8.05]
I want to cluster the above values into 5 clusters where k = 5.
And I would like to take the cluster centers as 2, 5, 10, 20, 40.
So my question is how can I define the cluster centers and perform k-means algorithm in MATLAB?
Is there a specific parameter to set the cluster centers in MATLAB kmeans() function?
Please help me to solve the above problems.
The goal of k-means clustering is to find the k cluster centers to minimize the overall distance of all points from their respective cluster centers.
With this goal, you'd write
[clusterIndex, clusterCenters] = kmeans(m,5,'start',[2;5;10;20;40])
This would adjust the cluster centers from their start position until an optimal position and assignment had been found.
If you instead want to associate your points in m to fixed cluster centers, you would not use kmeans, but calculate the clusterIndex directly using min
distanceToCenter = bsxfun(@minus, m, [2 5 10 20 40]);
[~, clusterIndex] = min(abs(distanceToCenter),[],2);
I.e., you calculate the difference between every point and every center, and find, for each point, the center with the minimum (absolute) distance.
To plot results, you can align the points on a line and color them according to the center:
nCenters = length(clusterCenters);
cmap = hsv(nCenters);
figure
hold on
for iCenter = 1:nCenters
    inCluster = (clusterIndex == iCenter);
    plot(m(inCluster), ones(nnz(inCluster),1), '.', 'color', cmap(iCenter,:));
    plot(clusterCenters(iCenter), 1, '*', 'color', cmap(iCenter,:));
end
You can see a classic implementation in my post here - K-Means Algorithm with Arbitrary Distance Function Matlab (Chebyshev Distance).
Enjoy...
I currently have 2 clusters which essentially lie along 2 lines on a 3D surface:
I've tried a simple kmeans algorithm which yielded the above separation. (Big red dots are the means)
I have tried clustering with soft k-means, with a different variance for each mean along the three dimensions. However, it also failed to model this, presumably because a diagonal (axis-aligned) Gaussian cannot capture the tilted, correlated shape of the clusters.
Is there another algorithm that can take into account that the data is diagonal? Alternatively is there a way to essentially "rotate" the data such that soft k-means could be made to work?
K-means does not handle correlated, elongated clusters well.
But these clusters look reasonably Gaussian to me, you should try Gaussian Mixture Modeling instead.
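As a sketch of that suggestion in Python with scikit-learn (the synthetic 3D data below merely stands in for the surface points, which aren't in the question; variable names are illustrative): `GaussianMixture` with `covariance_type='full'` lets each component rotate to match an elongated, correlated cluster.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two synthetic elongated ("diagonal") clusters in 3D, offset in y
line = rng.normal(size=(200, 1)) * np.array([3.0, 3.0, 3.0])
X = np.vstack([
    line + rng.normal(scale=0.2, size=(200, 3)),
    line + rng.normal(scale=0.2, size=(200, 3)) + np.array([0.0, 4.0, 0.0]),
])

# covariance_type='full' allows tilted (correlated) Gaussians --
# exactly what a diagonal-covariance soft k-means cannot model
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0)
labels = gmm.fit_predict(X)
```

With full covariances, each component's ellipsoid can align with the diagonal direction of its cluster, so no manual "rotation" of the data is needed.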
I want to know whether the k-means clustering algorithm can do classification?
Suppose I have done a simple k-means clustering: I have a lot of data, I run k-means, and I get 2 clusters, A and B, with the centroids computed using Euclidean distance.
Cluster A is on the left side; cluster B is on the right side.
So, if I have one new data point, what should I do?

1. Run the k-means clustering algorithm again, and see which cluster the new data point belongs to?
2. Record the last centroids and use Euclidean distance to decide which cluster the new data point belongs to?
3. Some other method?
The simplest method, of course, is 2.: assign each object to the closest centroid (technically, use the sum of squared differences, not Euclidean distance; this is more correct for k-means, and saves you a sqrt computation).
Method 1. is fragile, as k-means may give you a completely different solution; in particular if it didn't fit your data well in the first place (e.g. too high dimensional, clusters of too different size, too many clusters, ...)
However, the following method may be even more reasonable:
3. Train an actual classifier.
Yes, you can use k-means to produce an initial partitioning, then assume that the k-means partitions are reasonable classes (you really should validate this at some point though), and then continue as you would if the data had been user-labeled.
I.e., run k-means, train an SVM on the resulting clusters, then use the SVM for classification.
k-NN classification, or even assigning each object to the nearest cluster center (option 1) can be seen as very simple classifiers. The latter is a 1NN classifier, "trained" on the cluster centroids only.
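As a minimal sketch of option 2 with scikit-learn (toy data; variable names are illustrative): `KMeans.predict` does exactly this, assigning new points to the nearest learned centroid without re-clustering.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two well-separated blobs: cluster "A" on the left, cluster "B" on the right
X = np.vstack([
    rng.normal(loc=-5.0, scale=0.5, size=(100, 2)),
    rng.normal(loc=+5.0, scale=0.5, size=(100, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Classify new points by the nearest centroid -- no re-clustering needed
new_points = np.array([[-4.8, 0.1], [5.2, -0.3]])
labels = kmeans.predict(new_points)
```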
Yes, we can do classification.
I wouldn't say the algorithm itself (like #1) is particularly well-suited to classifying points, as incorporating data to be classified into your training data tends to be frowned upon (unless you have a real-time system, but I think elaborating on this would get a bit far from the point).
To classify a new point, simply calculate the Euclidean distance to each cluster centroid to determine the closest one, then classify it under that cluster.
There are data structures that allow you to determine the closest centroid more efficiently (like a k-d tree), but the above is the basic idea.
If you've already done k-means clustering on your data to get two clusters, then you could use k Nearest Neighbors on the new data point to find out which class it belongs to.
Here's another method:
I saw it on "The Elements of Statistical Learning". I'll change the notation a little bit. Let C be the number of classes and K the number of clusters. Now, follow these steps:
1. Apply K-means clustering to the training data in each class separately, using K clusters per class.
2. Assign a class label to each of the C*K clusters.
3. Classify observation x to the class of the closest cluster.
It seems like a nice approach for classification that reduces data observations by using clusters.
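The three steps above can be sketched in Python (an illustrative sketch, not the book's code; `fit_class_prototypes` and `predict` are made-up names):

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_class_prototypes(X, y, K=3, seed=0):
    """Run K-means separately within each class; return (centroids, centroid_labels)."""
    centroids, centroid_labels = [], []
    for c in np.unique(y):
        km = KMeans(n_clusters=K, n_init=10, random_state=seed).fit(X[y == c])
        centroids.append(km.cluster_centers_)
        centroid_labels.extend([c] * K)
    return np.vstack(centroids), np.array(centroid_labels)

def predict(X_new, centroids, centroid_labels):
    """Classify each point to the class of its closest centroid."""
    d = np.linalg.norm(X_new[:, None, :] - centroids[None, :, :], axis=2)
    return centroid_labels[d.argmin(axis=1)]

# Tiny demo with two separable classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 0.5, (60, 2)), rng.normal(3, 0.5, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
C, L = fit_class_prototypes(X, y, K=2)
preds = predict(np.array([[-3.1, 0.0], [2.9, 0.2]]), C, L)
```

With several prototypes per class, this can approximate non-convex class regions that a single centroid per class would miss.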
If you are doing real-time analysis where you want to recognize new conditions during use (or adapt to a changing system), then you can choose some radius around the centroids to decide whether a new point starts a new cluster or should be included in an existing one. (That's a common need in monitoring of plant data, for instance, where it may take years after installation before some operating conditions occur.) If real-time monitoring is your case, check RTEFC or RTMAC, which are efficient, simple real-time variants of K-means. RTEFC in particular, which is non-iterative. See http://gregstanleyandassociates.com/whitepapers/BDAC/Clustering/clustering.htm
Yes, you can use that for classification. If you've decided you have collected enough data for all possible cases, you can stop updating the clusters and just classify new points based on the nearest centroid. As in any real-time method, there will be sensitivity to outliers - e.g., one caused by sensor error or failure when using sensor data. If you create new clusters, outliers could be considered legitimate if one purpose of the clustering is to identify faults in the sensors, although that is most useful when you can do some labeling of clusters.
You are confusing the concepts of clustering and classification. When you have labeled data, you already know how the data is grouped according to the labels, and there is no point in clustering it unless you want to find out how well your features can discriminate the classes.
If you run the k-means algorithm to find the centroid of each class and then use the distances from the centroids to classify a new data point, you in fact implement a form of the linear discriminant analysis algorithm assuming the same multiple-of-identity covariance matrix for all classes.
After the k-means clustering algorithm converges, it can be used for classification with a few labeled exemplars/training data.
It is a very common approach when the number of labeled training instances is very limited due to the high cost of labeling.
So I'm looking to apply a clustering algorithm to the earthquake data provided by the USGS.
http://earthquake.usgs.gov/earthquakes/feed/
My main goal is to determine the top 10 most dangerous places (either by amount of earthquakes or the magnitude of an earthquake that a place experiences) to be based on an earthquake feed.
Are there any suggestions on how to do it? I'm looking at k-means, then taking a per-cluster sum (with each earthquake's magnitude weighted within its cluster) to identify the most dangerous clusters.
I'm also writing this in ruby as a code reference.
Thanks
K-means can't handle outliers in the data set very well.
Furthermore, it is designed around variance, but variance in latitude and longitude is not really meaningful. In fact, k-means cannot handle the ±180° longitude wrap-around. Instead, you will want to use the great-circle distance.
So try to use a density based clustering algorithm that allows you to use distances such as the great-circle distance!
Read up on Wikipedia and a good book on cluster analysis.
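As a sketch of the suggestion (in Python rather than the asker's Ruby, since scikit-learn makes it concise): DBSCAN accepts `metric='haversine'` on (lat, lon) coordinates in radians, giving density-based clustering under great-circle distance. The coordinates and `eps` value below are made up for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative (lat, lon) pairs in degrees -- replace with the USGS feed data
coords_deg = np.array([
    [35.7, 139.7], [35.6, 139.8], [35.8, 139.6],    # one dense region
    [37.8, -122.4], [37.7, -122.5], [37.9, -122.3], # another dense region
    [-10.0, 60.0],                                  # isolated point -> noise
])

earth_radius_km = 6371.0
eps_km = 100.0  # neighborhood radius in kilometers

# haversine works on [lat, lon] in radians; eps is an angle in radians
db = DBSCAN(eps=eps_km / earth_radius_km,
            min_samples=2,
            metric='haversine').fit(np.radians(coords_deg))
labels = db.labels_  # -1 marks noise/outliers
```

Unlike k-means, this handles the wrap-around, ignores outliers (label -1), and does not force a fixed number of clusters; magnitudes could then be summed per cluster to rank danger.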
Can anyone suggest some clustering algorithm which can work with distance matrix as an input? Or the algorithm which can assess the "goodness" of the clustering also based on the distance matrix?
At this moment I'm using a modification of Kruskal's algorithm (http://en.wikipedia.org/wiki/Kruskal%27s_algorithm) to split data into two clusters. It has a problem though. When the data has no distinct clusters the algorithm will still create two clusters with one cluster containing one element and the other containing all the rest. In this case I would rather have one cluster containing all the elements and another one which is empty.
Are there any algorithms which are capable of doing this type of clustering?
Are there any algorithms which can estimate how well the clustering was done or even better how many clusters are there in the data?
The algorithms should work only with distance(similarity) matrices as an input.
Or the algorithm which can assess the "goodness" of the clustering also based on the distance matrix?
KNN should be useful in assessing the “goodness” of a clustering assignment. Here's how:
Given a distance matrix with each point labeled according to the cluster it belongs to (its “cluster label”):
Test the cluster label of each point against the cluster labels implied from k-nearest neighbors classification
If the k-nearest neighbors imply an alternative cluster, that classified point lowers the overall “goodness” rating of the cluster
Sum up the "goodness rating" contributions from each one of your points to get a total "goodness rating" for the whole clustering
Unlike k-means cluster analysis, your algorithm will return information about poorly categorized points. You can use that information to reassign certain points to a new cluster thereby improving the overall "goodness" of your clustering.
Since the algorithm knows nothing about the placement of the cluster centroids, and hence nothing about the global cluster density, the only way to ensure clusters that are both locally and globally dense is to run the algorithm for a range of k values and find an arrangement that maximizes the goodness over that range.
For a significant number of points, you'll probably need to optimize this algorithm, possibly with a hash table to keep track of the nearest points relative to each point; otherwise it will take quite a while to compute.
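The procedure above can be sketched with NumPy, working directly from a distance matrix `D` (the function name and the agreement-based score are illustrative):

```python
import numpy as np

def knn_goodness(D, labels, k=3):
    """Fraction of points whose cluster label matches the majority label
    of their k nearest neighbors.

    D: (n, n) symmetric distance matrix; labels: (n,) integer cluster labels.
    """
    n = D.shape[0]
    agree = 0
    for i in range(n):
        order = np.argsort(D[i])
        neighbors = [j for j in order if j != i][:k]  # skip the point itself
        votes = labels[neighbors]
        majority = np.bincount(votes).argmax()
        agree += int(majority == labels[i])
    return agree / n

# Toy example: two tight groups on a line, so the score should be high
pts = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D = np.abs(pts[:, None] - pts[None, :])
labels = np.array([0, 0, 0, 1, 1, 1])
score = knn_goodness(D, labels, k=2)
```

Points whose neighbors vote for a different cluster drag the score down; those are the candidates for reassignment mentioned above.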
Some approaches that can be used to estimate the number of clusters are:
Minimum Description Length
Bayesian Information Criterion
The gap statistic
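For instance, the Bayesian Information Criterion can be applied via Gaussian mixtures in scikit-learn: fit models with an increasing number of components and keep the one with the lowest BIC. A sketch on toy data (the blob locations and range of k are made up for illustration):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data with 3 well-separated blobs
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in ((0, 0), (5, 5), (0, 5))])

# Fit mixtures with 1..6 components and record the BIC of each
bics = [GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 7)]
best_k = int(np.argmin(bics)) + 1  # lowest BIC wins
```

The same loop works from a distance matrix only indirectly (e.g., after embedding the points), since mixture models need coordinates; for pure distance-matrix input, the gap statistic on a hierarchical clustering is the more natural fit.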
scipy.cluster.hierarchy runs 3 steps, just like Matlab(TM) clusterdata:

import scipy.spatial.distance
import scipy.cluster.hierarchy as hier

Y = scipy.spatial.distance.pdist(pts)   # you have this already
Z = hier.linkage(Y, method)             # the (N-1)-row linkage matrix
T = hier.fcluster(Z, ncluster, criterion=criterion)
Here linkage might be a modified Kruskal, dunno.
This SO answer (ahem) uses the above.
As a measure of clustering quality, radius = RMS distance to the cluster centre is fast and reasonable for 2d/3d points.
Tell us about your Npt, ndim, ncluster, hier/flat ?
Clustering is a largish area, one size does not fit all.