Is K-means appropriate for clustering data with many zero values?

I need to cluster a matrix which contains mostly zero values... Is K-means appropriate for this kind of data, or do I need to consider a different algorithm?

No. The reason is that the mean is not sensible on sparse data. The resulting mean vectors will have very different characteristics than your actual data; they will often end up being more similar to each other than to actual documents!
There are some modifications that improve k-means for sparse data such as spherical k-means.
But largely, k-means on such data is just a crude heuristic. The results aren't entirely useless, but they are not the best that you can do either. It works, but by chance, not by design.
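If you want to try the spherical variant, the core idea is to keep both the data and the centroids on the unit sphere and assign by cosine similarity. A rough NumPy sketch of that idea (not an optimized or canonical implementation; the demo data at the bottom is made up):

    import numpy as np

    def spherical_kmeans(X, k, n_iter=50, seed=0):
        """Minimal spherical k-means sketch: rows of X are L2-normalized,
        assignment uses cosine similarity, centroids are re-normalized."""
        rng = np.random.default_rng(seed)
        # Normalize the data to unit length (all-zero rows are left untouched).
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        X = X / np.where(norms == 0, 1, norms)
        # Initialize centroids from randomly chosen data points.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # On unit vectors, cosine similarity is just a dot product.
            labels = (X @ centroids.T).argmax(axis=1)
            for j in range(k):
                members = X[labels == j]
                if len(members):
                    c = members.mean(axis=0)
                    # Re-normalize the centroid back onto the unit sphere.
                    centroids[j] = c / (np.linalg.norm(c) or 1)
        return labels, centroids

    # Example: 200 sparse-ish random "documents" over 50 terms, 5 clusters.
    docs = np.random.rand(200, 50) * (np.random.rand(200, 50) < 0.1)
    labels, centroids = spherical_kmeans(docs, k=5)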

k-means is widely used to cluster sparse data such as document-term vectors, so I'd say go ahead. Whether you get good results depends on the data and what you're looking for, of course.
There are a few things to keep in mind:
If you have very sparse data, then a sparse representation of your input can reduce memory usage and runtime by many orders of magnitude, so pick a good k-means implementation.
Euclidean distance isn't always the best metric for sparse vectors, so normalizing them to unit length (after which Euclidean distance ranks neighbours the same way cosine distance does) may give better results.
The cluster centroids are in all likelihood going to be dense regardless of the input sparsity, so don't use too many features.
Applying dimensionality reduction, e.g. SVD, to the samples may improve both the running time and the cluster quality a lot (see the sketch below).
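For example, with scikit-learn the whole pipeline might look roughly like this (tiny placeholder corpus and arbitrary parameter values, so treat it as a sketch rather than a recipe):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.preprocessing import Normalizer
    from sklearn.pipeline import make_pipeline
    from sklearn.cluster import KMeans

    docs = ["first document text", "second document text", "more text here"]  # placeholder corpus

    # Sparse tf-idf representation (scikit-learn's KMeans can also run on this directly).
    X = TfidfVectorizer().fit_transform(docs)

    # Reduce dimensionality with truncated SVD (works on sparse input),
    # then normalize to unit length so Euclidean k-means behaves like
    # cosine-based clustering.
    lsa = make_pipeline(TruncatedSVD(n_components=2), Normalizer(copy=False))
    X_reduced = lsa.fit_transform(X)

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)
    print(labels)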

Related

Distance metric for algorithms

I am currently working on a project in which I need to quantify the (dis)similarity between algorithms - that is, I have a few tens of algorithms that are used for the same purpose and I would like to quantify which ones are closest (i.e., more similar) to others, and which are truly 'novel'.
Both my Google-Fu and my SO-Jutsu have failed me, so I would appreciate it if anyone could shed some light on this. Does such a metric even exist?
As one measure of similarity, you could create n datasets, somewhat intelligently constructed, and then run each of your algorithms on all of these datasets. You then get an n-dimensional vector of runtimes associated with each algorithm, which you can then slap any old distance on. I'd imagine something like cosine distance would be a good first guess, since if your datasets are of various sizes you would sort of be classifying your algorithms by the way that they scale. In addition to runtimes, you could monitor maximum memory usage or whatever else you can think of measuring.
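A quick sketch of that, assuming you have already collected the measurements (the runtime matrix below is made up):

    import numpy as np
    from scipy.spatial.distance import pdist, squareform

    # One row per algorithm, one column per benchmark dataset
    # (placeholder runtimes in seconds).
    runtimes = np.array([
        [0.10, 0.90, 8.5],   # algorithm A
        [0.12, 1.10, 9.0],   # algorithm B: scales much like A
        [0.50, 0.60, 0.7],   # algorithm C: scales very differently
    ])

    # Pairwise cosine distances: algorithms that scale similarly end up close.
    D = squareform(pdist(runtimes, metric="cosine"))
    print(np.round(D, 3))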

Which clustering algorithm is best when the number of clusters is unknown but there is no noise?

I have a dataset with an unknown number of clusters and I aim to cluster it. Since I don't know the number of clusters in advance, I tried to use density-based algorithms, especially DBSCAN. The problem I have with DBSCAN is how to choose an appropriate epsilon. The method suggested in the DBSCAN paper assumes there is some noise: when we plot the sorted k-dist graph we can detect a valley and use it to define the threshold for epsilon. But my dataset was obtained from a controlled environment and there is no noise.
Does anybody have an idea of how to choose epsilon? Or can you suggest a better clustering algorithm that could fit this problem?
In general, there is no unsupervised epsilon detection. From what little you describe, DBSCAN is a very appropriate approach.
Real-world data tend to have a gentle gradient of distances; deciding what distance should be the cut-off is a judgement call requiring knowledge of the paradigm and end-use. In short, the problem requires knowledge not contained in the raw data.
I suggest that you use a simple stepping approach to converge on the solution you want. Set epsilon to some simple value that your observation suggests will be appropriate. If you get too much fragmentation, increase epsilon by a factor of 3; if the clusters are too large, decrease by a factor of 3. Repeat your runs until you get the desired results.
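A rough sketch of that loop with scikit-learn's DBSCAN; the starting epsilon, the "too fragmented" / "too coarse" thresholds, and the data are placeholders you would replace with your own judgement:

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.random.rand(500, 2)      # placeholder data
    eps = 0.1                       # initial guess from inspecting the data
    for _ in range(10):
        labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
        # DBSCAN labels noise as -1; don't count it as a cluster.
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        if n_clusters > 20:         # too fragmented: merge clusters by growing eps
            eps *= 3
        elif n_clusters < 2:        # too coarse: split clusters by shrinking eps
            eps /= 3
        else:
            break
    print("chosen eps:", eps, "clusters:", n_clusters)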

clustering or k-medians of points on a graph

I'm plotting frametimes of my application and I'd like to automatically work out medians. I think the k-medians algorithm is exactly what I'm after, but not sure how my problem applies. My data points are at regular intervals, so I don't have arbitrary 2D data but I also don't have just 1D data as the time dimension matters.
How should I go about computing these clusters (I'd be more than happy with just 2-medians instead of k-medians)? The data can be quite noisy, which is why I want medians instead of means, and I don't want the noise to interfere with the clustering.
Also, is there a more in-depth article than Wikipedia's k-medians clustering entry?
Don't use clustering.
Cluster analysis is really designed for multivariate data.
1 dimensional data is fundamentally different, because it is ordered. Multivariate data is not. This means that you can construct much more efficient algorithms for 1-dimensional data than for multivariate data.
Here, you want to perform time series segmentation. You may want to look into methods such as natural breaks optimization, but also e.g. kernel density estimation.
The simplest approach is to keep track of the standard deviation, and once a number of points deviates from this substantially, segment there.
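A rough NumPy sketch of that last idea; the window size and the "substantially" thresholds are guesses you would have to tune to your frametimes:

    import numpy as np

    def segment_on_deviation(values, window=30, k=3.0, min_run=5):
        """Scan with a trailing window of `window` points; once `min_run`
        consecutive points deviate from the window mean by more than k
        standard deviations, record a boundary and restart after it."""
        boundaries = []
        start = 0                    # start of the current segment
        i = window
        while i + min_run <= len(values):
            baseline = values[max(start, i - window):i]
            mu, sigma = baseline.mean(), baseline.std()
            # Require min_run consecutive outliers so single noisy frames
            # don't trigger a boundary.
            if sigma > 0 and np.all(np.abs(values[i:i + min_run] - mu) > k * sigma):
                boundaries.append(i)
                start = i
                i = start + window
            else:
                i += 1
        return boundaries

    # Placeholder frametimes: ~16.7 ms for 500 frames, then ~33.3 ms.
    rng = np.random.default_rng(0)
    frametimes = np.concatenate([rng.normal(16.7, 0.5, 500),
                                 rng.normal(33.3, 0.5, 500)])
    print(segment_on_deviation(frametimes))   # expect a boundary near index 500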

Is kd-Tree an alternative to K-means clustering?

I'm working on BOW object detection, specifically the encoding stage. I have seen some implementations that use a kd-Tree in the encoding stage, but most writings suggest that K-means clustering is the way to go.
What is the difference between the two?
In object detection, k-means is used to quantize descriptors. A kd-tree can be used to search for descriptors with or without quantization. Each approach has its pros and cons. Specifically, kd-trees are not much better than brute-force search when the number of descriptor dimensions exceeds 20.
AFAIK the kd-tree is used for the labeling phase; it's much faster than the naive approach of simply taking the argmin of all the distances to each group when you are clustering over a large number of groups (hundreds if not thousands). K-means (http://en.wikipedia.org/wiki/K-means_clustering) is the actual clustering algorithm; it's fast, though not always very precise. Some implementations return just the groups, while others return the groups and the labels of the training data set. I usually use http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.cKDTree.html in conjunction with http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans2.html.
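For example (untested sketch; the descriptor array and the parameter values are placeholders):

    import numpy as np
    from scipy.cluster.vq import kmeans2
    from scipy.spatial import cKDTree

    # Placeholder training descriptors (e.g. SIFT would be 128-dimensional).
    train_descriptors = np.random.rand(2000, 64)

    # Build the vocabulary with k-means; kmeans2 returns the centroids
    # and the label of each training descriptor.
    centroids, labels = kmeans2(train_descriptors, k=500, minit='points')

    # At encoding time, label new descriptors by nearest centroid via a
    # kd-tree instead of computing distances to all 500 centroids explicitly.
    tree = cKDTree(centroids)
    new_descriptors = np.random.rand(200, 64)
    _, nearest_centroid = tree.query(new_descriptors)

    # Bag-of-words histogram for the image the new descriptors came from.
    bow = np.bincount(nearest_centroid, minlength=len(centroids))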
kd-Tree and the K-means algorithm are two different types of clustering method.
Here are several types of clustering method:
kd-Tree is a hierarchical-clustering method (median-based).
K-means is a means-based clustering method.
GMM (Gaussian mixture model) is a probability-based clustering method (soft-clustering).
etc.
[UPDATE]:
Generally, there are two types of clustering method: soft clustering and hard clustering. Probabilistic methods like the GMM are soft-clustering methods that assign objects to clusters with probabilities, while the others are hard-clustering methods that assign each object to exactly one cluster.
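A small scikit-learn sketch of the soft/hard difference, with made-up 2-D data: k-means gives one label per point, while the GMM gives per-cluster probabilities.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)),   # placeholder blob 1
                   rng.normal(5, 1, (100, 2))])  # placeholder blob 2

    # Hard clustering: exactly one label per point.
    hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

    # Soft clustering: a probability per point per cluster.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    soft_probs = gmm.predict_proba(X)

    print(hard_labels[:5])            # five hard assignments (single integers)
    print(soft_probs[:2].round(3))    # per-cluster probabilities for two points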

Using a smoother with the L Method to determine the number of K-Means clusters

Has anyone tried to apply a smoother to the evaluation metric before applying the L-method to determine the number of k-means clusters in a dataset? If so, did it improve the results? Or allow a lower number of k-means trials and hence much greater increase in speed? Which smoothing algorithm/method did you use?
The "L-Method" is detailed in:
Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms, Salvador & Chan
This calculates the evaluation metric for a range of different trial cluster counts. Then, to find the knee (which occurs for an optimum number of clusters), two lines are fitted using linear regression. A simple iterative process is applied to improve the knee fit - this uses the existing evaluation metric calculations and does not require any re-runs of the k-means.
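To make the two-line fit concrete, here is a rough sketch of how I understand it (the metric values below are made-up placeholders, not my actual Dunn-index numbers):

    import numpy as np

    def l_method_knee(ks, scores):
        """Return the k at the knee: for every candidate split, fit one line
        to the left part and one to the right, and pick the split whose
        size-weighted RMSE is smallest (the L-method idea)."""
        ks, scores = np.asarray(ks, float), np.asarray(scores, float)
        n = len(ks)
        best_k, best_err = None, np.inf
        for c in range(2, n - 2):                 # each side needs >= 3 points
            left_x, left_y = ks[:c + 1], scores[:c + 1]
            right_x, right_y = ks[c:], scores[c:]
            err = 0.0
            for x, y in ((left_x, left_y), (right_x, right_y)):
                slope, intercept = np.polyfit(x, y, 1)
                rmse = np.sqrt(np.mean((y - (slope * x + intercept)) ** 2))
                err += len(x) / n * rmse          # weight by segment size
            if err < best_err:
                best_k, best_err = ks[c], err
        return best_k

    # Placeholder metric values with a knee around k = 5.
    ks = np.arange(2, 21)
    scores = np.where(ks <= 5, 10 - 1.8 * ks, 1.0 - 0.02 * ks)
    print(l_method_knee(ks, scores))   # -> 5.0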
For the evaluation metric, I am using a reciprocal of a simplified version of the Dunn Index. Simplified for speed (basically my diameter and inter-cluster calculations are simplified). The reciprocal is so that the index works in the correct direction (i.e. lower is generally better).
K-means is a stochastic algorithm, so typically it is run multiple times and the best fit chosen. This works pretty well, but when you are doing this for 1..N clusters the time quickly adds up. So it is in my interest to keep the number of runs in check. Overall processing time may determine whether my implementation is practical or not - I may ditch this functionality if I cannot speed it up.
I had asked a similar question in the past here on SO. My question was about coming up with a consistent way of finding the knee to the L-shape you described. The curves in question represented the trade-off between complexity and a fit measure of the model.
The best solution was to find the point with the maximum distance d according to the figure in that answer (not reproduced here).
Note: I haven't read the paper you linked to yet.
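If that figure shows what such knee-finding figures usually show (an assumption on my part), d is the perpendicular distance from each point of the curve to the straight line joining its endpoints, which is easy to compute directly (using the same placeholder curve as in the L-method sketch above):

    import numpy as np

    def knee_by_max_distance(x, y):
        """Index of the point farthest from the line joining the first and
        last points of the curve (x, y)."""
        x, y = np.asarray(x, float), np.asarray(y, float)
        p1 = np.array([x[0], y[0]])
        p2 = np.array([x[-1], y[-1]])
        line = (p2 - p1) / np.linalg.norm(p2 - p1)
        pts = np.column_stack([x, y]) - p1
        # Perpendicular distance = length of the component orthogonal to the line.
        proj = np.outer(pts @ line, line)
        d = np.linalg.norm(pts - proj, axis=1)
        return int(d.argmax())

    # Placeholder L-shaped curve with a knee at x = 5.
    x = np.arange(2, 21)
    y = np.where(x <= 5, 10 - 1.8 * x, 1.0 - 0.02 * x)
    print(x[knee_by_max_distance(x, y)])   # -> 5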
