I am working on a project where I have to build a k-means model from some training observations. I have 380 observations with 700 features each, and I am using the k-means algorithm from Spark MLlib. When I choose a k (number of clusters) greater than 10, some of my clusters get only 1 point assigned to them (for example, at k = 25, 6 of them get only 1 point). At first I thought some points simply lie far away from the others, but it is not always the same points that end up in their own cluster.
Is this expected behavior? If it is a problem, how serious is it?
This is typical for k-means.
In particular, it happens when you have many more features than data points, and when you have non-continuous features. It is a kind of overfitting: because of the high dimensionality, many points are "unique" in one sense or another.
Since k-means involves random initialization, you don't get the same result every time.
You need to explore more advanced algorithms - k-means is really old and limited. Spark may not be the best tool for you, because it has so few algorithms to offer.
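If you want to see how unstable those singleton clusters are, a quick check is to re-run the model with a few different seeds and count the cluster sizes. A minimal PySpark sketch, assuming a DataFrame `data` with a "features" vector column (the names and k value are illustrative):

```python
# Re-run k-means with different seeds and report the singleton clusters.
from pyspark.ml.clustering import KMeans

for seed in [1, 2, 3]:
    model = KMeans(k=25, seed=seed).fit(data)
    sizes = (model.transform(data)
                  .groupBy("prediction")
                  .count()
                  .orderBy("count")
                  .collect())
    singletons = [row["prediction"] for row in sizes if row["count"] == 1]
    print(f"seed={seed}: {len(singletons)} singleton clusters: {singletons}")
```

If the singletons change from seed to seed, that supports the "overfitting plus random initialization" explanation above rather than genuine outliers.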
Related
I am trying to understand PCA and K-Means algorithms in order to extract some relevant features from a set of features.
I don't know which branch of computer science studies these topics; there don't seem to be good resources on the internet, just some papers that I don't understand well. An example of such a paper: http://www.ifp.illinois.edu/~qitian/e_paper/icip02/icip02.pdf
I have CSV files of people's walks, composed as follows:
TIME, X, Y, Z, where the values are recorded by the accelerometer.
What I did
I transformed the dataset into a table in Python.
I used tsfresh, a Python library, to extract a vector of features from each walk; there are a lot of them, 2k+ features per walk.
I have to use PFA (Principal Feature Analysis) to select the relevant features from the set of feature vectors.
In order to do the last step, I have to reduce the dimension of the walk feature set with PCA (PCA makes the data different from the original, because it transforms the data using the eigenvectors and eigenvalues of the covariance matrix of the original data). Here is my first question:
How should the input of PCA look? Should the rows be the walks and the columns the features, or vice versa, so that the rows are the features and the columns are the walks of the people?
After I have reduced this data, I should use the K-Means algorithm on the reduced 'features' data. How should the K-Means input look? And what is the purpose of using this algorithm? All I know is that it is used to 'cluster' some data, so each cluster contains some 'points' grouped by some rule. What I did and what I think is:
If the PCA input has the walks as rows and the features as columns, then for K-Means I should swap rows and columns, because that way each point is a feature (but this is no longer the original feature data, it is the reduced data, so I am not sure). Then, for each cluster, I would use Euclidean distance to see which point has the lowest distance from the centroid and select that feature. But how many clusters should I declare? If I declare as many clusters as there are features, I will always extract the same number of features. And how can I say which feature in the original set a point in the reduced data corresponds to?
I know that what I am saying may not be correct, but I am trying to understand it. Can some of you help me? Am I on the right track? Thanks!
For PCA, make sure you separate your understanding of the method the algorithm uses (eigenvectors and such) from the result. The result is a linear mapping from the original space A to a space A', whose dimension (the number of features, in your case) is possibly smaller than that of A.
So the first feature/element in space A' is a linear combination of the features of A.
The row/column orientation depends on the implementation, but if you use scikit-learn's PCA, the rows are the samples (walks) and the columns are the features.
You can feed the PCA output, the A' space, to K-means, and it will cluster the points in this (usually reduced-dimension) space.
Each point will be part of a cluster, and the idea is that if you ran K-Means on A, you would probably end up with the same or similar clusters as with A'. Computationally, A' is a lot cheaper. You now have a clustering on both A' and A, since we agree that points that are similar in A' are also similar in A.
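To make the row/column question concrete, here is a minimal scikit-learn sketch, assuming `walks` is an array with one row per walk and one column per tsfresh feature (the array and parameter values are placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

walks = np.random.rand(50, 2000)        # placeholder: 50 walks x 2000 features

A_prime = PCA(n_components=20).fit_transform(walks)   # reduced space A'
labels = KMeans(n_clusters=5, n_init=10).fit_predict(A_prime)

# labels[i] is the cluster of walk i: rows stay "walks" through both steps.
```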
The number of clusters is a difficult question; if you have nothing to go on, look up the elbow method. But if you just want to get a sense of the different types of things you have, I would argue for 3-8 clusters and not many more: compare the 2-3 points closest to each center and you have something consumable. The number of clusters can also be larger than the number of features. For example, if we want to find the densest spots in some 2D area, you can easily have 50 clusters, to get a sense of where 50 cities could be. Here the number of clusters is far higher than the space dimension, and it makes sense.
I am trying to apply k-means (or other algorithms) clustering to some data. I want the silhouette score of the clustering results to be good and, at the same time, I prefer a smaller number of clusters. So I am wondering how I can jointly evaluate the number of clusters together with the silhouette score (or other metrics).
For example, the clustering model got these results below:
size = 2: score = 0.534
size = 7: score = 0.617
size = 20: score = 0.689
I think that the model with a cluster size of 7 is the best compared with the others. Although the score of the last model is the highest, its number of clusters is too large. I have tried dividing the silhouette score by the cluster size, but that seems too simplistic.
Don't hack. Do it properly.
That means defining mathematically what is "good" in your personal opinion (and, of course, why the proposed equations capture this well). Then use this evaluation measure, but be prepared that others may disagree with your view that many clusters are bad.
And yes, silhouette divided by the number of clusters is not a good idea; in particular, it is not a theoretically well-founded measure, is it?
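Purely as an illustration of "defining it mathematically": you could write down an explicit trade-off, e.g. average silhouette minus a penalty per cluster, where the penalty weight is your own subjective choice that you would then have to justify. A hypothetical sketch (the weight `lam` and the data are made up):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def penalized_score(X, k, lam=0.01):
    """Average silhouette minus a (subjective) penalty per cluster."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    return silhouette_score(X, labels) - lam * k

X = np.random.rand(500, 4)                      # placeholder data
best_k = max(range(2, 21), key=lambda k: penalized_score(X, k))
```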
In short: I am using k-means clustering with a correlation distance. How do I check how many clusters should be used, if any?
There are many indices and answers on how to establish the number of clusters when grouping data:
example 1, example 2, etc. For now I am using Dunn's index, but it is not sufficient, for one of the reasons described below.
All those approaches exhibit at least one of the following problems, which I have to avoid:
Indexes:
the derivation of the clustering quality index makes assumptions about the data covariance matrix, from which point on only Euclidean or Euclidean-like metrics apply - the correlation distance is no longer an option
it requires at least two non-empty clusters to compare already-calculated partitions - there is no way to state whether there is any reason to divide the data into groups at all
Clustering approaches:
clustering approaches that estimate the number of clusters themselves (e.g. affinity propagation) are much slower and do not scale well
To sum up: is there any criterion or index that allows checking for the existence of groups in the data (perhaps also estimating their number), without restricting the metric used?
EDIT: The space I am operating on has up to a few thousand features.
I have a method, but it is my own invention and rather experimental. Whilst theoretically it works in multi-dimensions, I've only had any success in 2D (take the first two principal components if clustering multi-dimensional data).
I call it gravitational clustering. You pass in a smear parameter, then you produce an attraction around each point using 1 / (d + smear)^2 (the smear prevents values going to infinity and controls the granularity of the clustering). Points then move uphill to their local maximum on this energy field. If they all move to the same point, you have no clusters; if they move to a few different points, you have clusters; and if they each stay at their own local maximum, again you have no clusters.
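For what it's worth, here is a rough Python sketch of that idea, my own interpretation of the description above (numerical gradient, field built from the original positions, made-up parameter values), not a polished implementation:

```python
import numpy as np

def field(x, points, smear):
    # Total attraction felt at location x from all points: sum of 1/(d + smear)^2.
    d = np.linalg.norm(points - x, axis=1)
    return np.sum(1.0 / (d + smear) ** 2)

def gravitational_clusters(points, smear=0.5, step=0.01, iters=200, merge_tol=0.1):
    """Move each point uphill on the attraction field, then group points
    whose final positions (roughly) coincide."""
    moved = points.astype(float).copy()
    eps = 1e-4
    for _ in range(iters):
        for i in range(len(moved)):
            x = moved[i].copy()
            grad = np.zeros_like(x)
            for dim in range(x.size):          # numerical gradient of the field
                e = np.zeros_like(x)
                e[dim] = eps
                grad[dim] = (field(x + e, points, smear)
                             - field(x - e, points, smear)) / (2 * eps)
            norm = np.linalg.norm(grad)
            if norm > 0:
                moved[i] = x + step * grad / norm   # small uphill step
    # Points ending up within merge_tol of each other share a cluster label.
    labels = -np.ones(len(moved), dtype=int)
    current = 0
    for i in range(len(moved)):
        if labels[i] == -1:
            labels[i] = current
            for j in range(i + 1, len(moved)):
                if labels[j] == -1 and np.linalg.norm(moved[i] - moved[j]) < merge_tol:
                    labels[j] = current
            current += 1
    return labels
```

Whether the points end up in one basin, a few basins, or one basin each then corresponds to the three outcomes described above.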
I am coding every function of my application myself, so I am not using tools that do everything for you.
I have been looking for a solution for when to cut my agglomerative hierarchical clustering.
How do I cluster?
I have coded the application in C# 4.5.2.
So far I am using standard hierarchical clustering, which uses Euclidean distance to calculate the distance between document pairs.
It also uses UPGMA to calculate the distance between clusters, to decide which ones to merge.
I have also coded the Rand index and F-measure to test the success against my manually labeled data set.
However, the problem is when to stop merging clusters.
I am really bad at understanding mathematical equations without a real data example or well-explained pseudocode.
There are mathematical equations everywhere, but no real-life examples.
So I am looking for your answers. For example, it is written in many places that the Bayesian information criterion (BIC) is a good solution, but I can't figure out how to apply it to my software.
I also have other distance or similarity metrics, such as cosine similarity or Sørensen-Dice distance, etc.
There are so many questions on Stack Exchange or Stack Overflow about this, but all the answers use tools like MATLAB or R, etc.
Try to compute some measure of how well each particular clustering fits - for example, the sum of distances from the cluster centres, or the sum of squared errors. You should find that this error decreases as you increase the number of clusters (it is easier to fit with more clusters) and increases as you decrease the number of clusters.
Now draw a graph and look for an "elbow" where the error starts to get large more quickly as the number of clusters decreases. You could then assume that the minimum number of clusters before the error starts increasing very rapidly is the true number of clusters in the data.
See for example the graph in Cluster analysis in R: determine the optimal number of clusters just below the text "We might conclude that 4 clusters would be indicated by this method:"
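As a concrete version of that loop, here is a sketch in Python (scikit-learn and matplotlib are used purely for brevity; the same idea carries over to your own C# implementation, and the data here is a placeholder):

```python
# Plot the sum of squared errors (k-means inertia) against k and look for the bend.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)               # placeholder data
ks = range(1, 15)
sse = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("sum of squared errors")
plt.show()
```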
I need to do some cluster analysis on a set of 2-dimensional data (I may add extra dimensions along the way).
The analysis itself will form part of the data being fed into a visualisation, rather than the inputs into another process (e.g. Radial Basis Function Networks).
To this end, I'd like to find a set of clusters which primarily "looks right", rather than elucidating some hidden patterns.
My intuition is that k-means would be a good starting place for this, but that finding the right number of clusters to run the algorithm with would be problematic.
The problem I'm coming to is this:
How to determine the 'best' value for k such that the clusters formed are stable and visually verifiable?
Questions:
Assuming that this isn't NP-complete, what is the time complexity of finding a good k (probably reported as the number of times the k-means algorithm needs to be run)?
Is k-means a good starting point for this type of problem? If so, what other approaches would you recommend? A specific example, backed by an anecdote/experience, would be maxi-bon.
What shortcuts/approximations would you recommend to improve performance?
For problems with an unknown number of clusters, agglomerative hierarchical clustering is often a better route than k-means.
Agglomerative clustering produces a tree structure, where the closer you are to the trunk, the fewer clusters there are, so it's easy to scan through all numbers of clusters. The algorithm starts by assigning each point to its own cluster, and then repeatedly merges the two clusters with the closest centroids. Keeping track of the merging sequence gives an instant snapshot for any number of possible clusters. Therefore, it's often preferable to use this technique over k-means when you don't know how many groups you'll want.
There are other hierarchical clustering methods (see the paper suggested in Imran's comments). The primary advantage of an agglomerative approach is that there are many implementations out there, ready-made for your use.
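As a rough illustration of "one tree, many cuts", using SciPy (the data here is a placeholder, and `average` linkage corresponds to UPGMA):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(100, 2)            # example data
Z = linkage(X, method="average")      # UPGMA-style agglomerative tree

# Cut the same tree at several cluster counts without re-running anything.
for k in (2, 3, 5, 10):
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, np.bincount(labels)[1:])  # cluster sizes
```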
In order to use k-means, you should know how many clusters there are. You can't try a naive meta-optimisation, since the more clusters you add (up to 1 cluster per data point), the more you will overfit. You may look for some cluster validation methods and optimize the k hyperparameter with them, but in my experience it rarely works well. It's very costly too.
If I were you, I would do a PCA, possibly on a polynomial feature space (mind your available time), depending on what you know about your input, and cluster along the most representative components.
More information about your data set would be very helpful for a more precise answer.
Here's my approximate solution:
1. Start with k = 2.
2. For a number of tries:
run the k-means algorithm to find k clusters, and
find the mean square distance from the origin to the cluster centroids.
3. Take the standard deviation of those distances across the tries. This is a proxy for the stability of the clusters.
4. If the stability of the clusters for k is lower than the stability for k - 1, return k - 1.
5. Otherwise, increment k by 1 and repeat from step 2.
The thesis behind this algorithm is that the number of sets of k clusters is small for "good" values of k.
If we can find a local optimum for this stability, or an optimal delta for the stability, then we can find a good set of clusters which cannot be improved by adding more clusters.
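A possible reading of this procedure in code (my own interpretation: I treat a lower standard deviation across tries as "more stable"; the function names and parameters are made up):

```python
import numpy as np
from sklearn.cluster import KMeans

def centroid_spread_std(X, k, n_tries=10, seed=0):
    """Run k-means n_tries times and return the standard deviation of the
    mean squared centroid-to-origin distance across runs (stability proxy)."""
    rng = np.random.RandomState(seed)
    dists = []
    for _ in range(n_tries):
        km = KMeans(n_clusters=k, n_init=1,
                    random_state=rng.randint(1 << 30)).fit(X)
        dists.append(np.mean(np.sum(km.cluster_centers_ ** 2, axis=1)))
    return np.std(dists)

def pick_k(X, k_max=15):
    """Increase k until the clustering becomes less stable (higher std)
    than it was for k - 1, then return k - 1."""
    prev_std = None
    for k in range(2, k_max + 1):
        cur_std = centroid_spread_std(X, k)
        if prev_std is not None and cur_std > prev_std:
            return k - 1
        prev_std = cur_std
    return k_max
```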
In a previous answer, I explained how Self-Organizing Maps (SOM) can be used in visual clustering.
Otherwise, there exists a variation of the K-Means algorithm called X-Means, which is able to find the number of clusters by optimizing the Bayesian Information Criterion (BIC), in addition to addressing scalability by using KD-trees.
Weka includes an implementation of X-Means along with many other clustering algorithms, all in an easy-to-use GUI tool.
Finally, you might refer to this page, which discusses the Elbow Method among other techniques for determining the number of clusters in a dataset.
You might look at papers on cluster validation. Here's one that is cited in papers that involve microarray analysis, which involves clustering genes with related expression levels.
One such technique is the Silhouette measure, which evaluates how close a labeled point is to its centroid. The general idea is that if a point is assigned to one centroid but is still close to others, perhaps it was assigned to the wrong centroid. By counting these events across training sets and looking across various k-means clusterings, one looks for the k such that the labeled points overall fall into the "best", or minimally ambiguous, arrangement.
It should be said that clustering is more of a data visualization and exploration technique. It can be difficult to elucidate with certainty that one clustering explains the data correctly, above all others. It's best to merge your clusterings with other relevant information. Is there something functional or otherwise informative about your data, such that you know some clusterings are impossible? This can reduce your solution space considerably.
From your wikipedia link:
Regarding computational complexity, the k-means clustering problem is:
NP-hard in a general Euclidean space (d dimensions), even for 2 clusters
NP-hard for a general number of clusters k, even in the plane
If k and d are fixed, the problem can be exactly solved in time O(n^(dk+1) log n), where n is the number of entities to be clustered.
Thus, a variety of heuristic algorithms are generally used.
That said, finding a good value of k is usually a heuristic process (i.e. you try a few and select the best).
I think k-means is a good starting point; it is simple and easy to implement (or copy). Only look further if you have serious performance problems.
If the set of points you want to cluster is exceptionally large, a first-order optimisation would be to randomly select a small subset and use that set to find your k-means centres.
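A sketch of that subsampling idea (scikit-learn is used for illustration; the sizes are arbitrary):

```python
# Fit k-means on a random subset, then assign the full data set to the
# learned centres.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100_000, 2)                     # full data set
subset = X[np.random.choice(len(X), 5_000, replace=False)]

km = KMeans(n_clusters=8, n_init=10).fit(subset)   # cheap fit on the sample
labels_full = km.predict(X)                        # assign every point
```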
Choosing the best K can be seen as a model selection problem. One possible approach is Minimum Description Length, which in this context means: you could store a table with all the points (in which case K = N); at the other extreme, you have K = 1, and all the points are stored as their distances from a single centroid. This section from Introduction to Information Retrieval by Manning and Schütze suggests minimising the Akaike Information Criterion as a heuristic for an optimal K.
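If I remember that chapter correctly, the heuristic amounts to picking the K that minimises RSS(K) + 2·M·K, where RSS is the residual sum of squares and M is the dimensionality; please check the exact form against the text. A sketch under that assumption, using k-means inertia as the RSS:

```python
import numpy as np
from sklearn.cluster import KMeans

def pick_k_aic(X, k_max=20):
    """Return the K minimising an AIC-style criterion RSS(K) + 2*M*K
    (assumed form; M is the number of dimensions of X)."""
    M = X.shape[1]
    scores = {}
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10).fit(X)
        scores[k] = km.inertia_ + 2 * M * k    # RSS plus the penalty term
    return min(scores, key=scores.get)
```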
This problem belongs to the "internal evaluation" class of clustering optimisation problems, for which the current state-of-the-art solution seems to be the Silhouette coefficient, as stated here:
https://en.wikipedia.org/wiki/Cluster_analysis#Applications
and here:
https://en.wikipedia.org/wiki/Silhouette_(clustering) :
"silhouette plots and averages may be used to determine the natural number of clusters within a dataset"
scikit-learn provides a sample implementation of the methodology here:
http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
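A condensed version of what that scikit-learn example does (the data here is a placeholder; the linked page also shows the per-sample silhouette plots):

```python
# Compute the average silhouette for several k and keep the best one.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(500, 3)                     # placeholder data
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```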