I'm working on the encoding stage of BOW object detection. I have seen some implementations that use a kd-tree in the encoding stage, but most writings suggest that K-means clustering is the way to go.
What is the difference between the two?
In object detection, k-means is used to quantize descriptors. A kd-tree can be used to search for descriptors with or without quantization. Each approach has its pros and cons. Specifically, kd-trees are not much better than brute-force search when the number of descriptor dimensions exceeds 20.
AFAIK the kd-tree is used for the labeling phase: when clustering over a large number of groups (hundreds if not thousands), it is much faster than the naive approach of simply taking the argmin of the distances to each group. K-means (http://en.wikipedia.org/wiki/K-means_clustering) is the actual clustering algorithm; it is fast, though not always very precise. Some implementations return just the groups, while others return the groups and the labels of the training data set. I usually use http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.cKDTree.html in conjunction with http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans2.html.
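A minimal sketch of that combination, assuming SIFT-like 128-dimensional descriptors and an illustrative 500-word vocabulary; the random arrays just stand in for real descriptors.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial import cKDTree

np.random.seed(0)
# Hypothetical stand-in for real descriptors: 10,000 vectors of dim 128
descriptors = np.random.rand(10000, 128)

# Clustering phase: build the visual vocabulary (codebook) with k-means
codebook, _ = kmeans2(descriptors, 500, minit='points')

# Labeling phase: a kd-tree over the codebook assigns each new descriptor
# to its nearest visual word without brute-forcing every distance
tree = cKDTree(codebook)
new_descriptors = np.random.rand(200, 128)
_, words = tree.query(new_descriptors, k=1)  # index of nearest visual word
print(words[:10])
```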
kd-Tree and K-means are two different types of clustering methods.
Here are several types of clustering methods:
kd-Tree is a hierarchical-clustering method (median-based).
K-means is a means-based clustering method.
GMM (Gaussian mixture model) is a probability-based clustering method (soft-clustering).
etc.
[UPDATE]:
Generally, there are two types of clustering methods: soft clustering and hard clustering. Probabilistic methods like the GMM are soft clustering, assigning objects to clusters with probabilities; the others are hard clustering, assigning each object to exactly one cluster.
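To make the distinction concrete, here is a minimal scikit-learn sketch on made-up two-blob data: k-means gives each point exactly one label, while a GMM gives each point a probability for every cluster.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Hypothetical 2-d data: two overlapping groups (illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
               rng.normal(2.5, 1.0, (100, 2))])

# Hard clustering: exactly one label per point
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: a probability per cluster, rows summing to 1
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X)

print(hard_labels[:5])
print(probs[:5].round(2))
```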
Related
I have read that the K-medoid algorithm (PAM) is a partition-based clustering algorithm and a variant of the K-means algorithm. It addresses problems of K-means such as producing empty clusters and sensitivity to outliers/noise.
However, the time complexity of K-medoid is O(n^2), unlike K-means (Lloyd's algorithm), which has a time complexity of O(n). I would like to ask whether there are other drawbacks of the K-medoid algorithm aside from its time complexity.
The main disadvantage of K-medoid algorithms (whether PAM, CLARA, or CLARANS) is that they are not suitable for clustering non-spherical (arbitrarily shaped) groups of objects.
This is because they rely on minimizing the distances between the non-medoid objects and the medoid (the cluster center); briefly, they use compactness as the clustering criterion instead of connectivity.
Another disadvantage of PAM is that it may obtain different results for different runs on the same dataset because the first k medoids are chosen randomly.
In addition to the aforementioned disadvantages, you must also specify the value for k (the number of clusters) in advance.
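For illustration, here is a minimal Voronoi-iteration sketch of k-medoids in Python (a simplification, not the full PAM swap search). Note the full n-by-n distance matrix, which is where the O(n^2) cost comes from, and the randomly chosen initial medoids that make results run-dependent.

```python
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    """Naive Voronoi-iteration k-medoids; a sketch, not full PAM."""
    rng = np.random.default_rng(seed)
    # Full pairwise distance matrix: the source of the O(n^2) cost
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Random initial medoids: why different runs can give different results
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            # The new medoid minimizes total distance to its cluster's members
            new_medoids[c] = members[np.argmin(dist[np.ix_(members, members)].sum(axis=1))]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids, labels

X = np.random.default_rng(1).random((200, 2))
medoids, labels = k_medoids(X, k=3)
print(medoids, np.bincount(labels))
```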
Which clustering machine learning algorithm is best for clustering one-dimensional numerical features (scalar values)?
Is it BIRCH, spectral clustering, k-means, DBSCAN... or something else?
All of these methods are better suited to multivariate data. Except for k-means, which historically was used on one-dimensional data, they were all designed with the multivariate problem in mind, and none of them is well optimized for the particular case of 1-dimensional data.
For one-dimensional data, use kernel density estimation. KDE is a nice technique in 1d, has strong statistical support, and only becomes hard to use for clustering in multiple dimensions.
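A minimal SciPy sketch of KDE-based 1-d clustering on made-up bimodal data: estimate the density, then cut clusters at its local minima.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.signal import argrelextrema

# Hypothetical 1-d data with two modes (illustrative only)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 1, 300)])

# Estimate the density on a grid
kde = gaussian_kde(x)
grid = np.linspace(x.min(), x.max(), 1000)
density = kde(grid)

# Local minima of the density act as cluster boundaries
minima = grid[argrelextrema(density, np.less)[0]]
labels = np.digitize(x, minima)  # cluster id per point

print(minima, np.bincount(labels))
```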
Take a look at the K-means clustering algorithm. It works really well for clustering one-dimensional feature vectors. But K-means doesn't work very well when there are outliers in your training dataset, in which case you can use more advanced machine learning algorithms.
I'd suggest that before implementing a machine learning algorithm (classification, clustering, etc.) for your dataset and problem statement, you use the Weka toolkit to check which algorithm best fits your problem. Weka is a collection of a large number of machine learning and data mining algorithms that can be easily applied to a given problem. Once you have identified which algorithm works best, you can modify or write your own implementation of it; by tweaking it, you can achieve even better accuracy. Weka is free to download.
I have a dataset with an unknown number of clusters and I aim to cluster it. Since I don't know the number of clusters in advance, I tried to use density-based algorithms, especially DBSCAN. The problem I have with DBSCAN is how to choose an appropriate epsilon. The method suggested in the DBSCAN paper assumes there is some noise: when we plot the sorted k-dist graph, we can detect a valley and use it to define the threshold for epsilon. But my dataset was obtained from a controlled environment and there is no noise.
Does anybody have an idea of how to choose epsilon? Or can you suggest a better clustering algorithm that fits this problem?
In general, there is no unsupervised epsilon detection. From what little you describe, DBSCAN is a very appropriate approach.
Real-world data tend to have a gentle gradient of distances; deciding what distance should be the cut-off is a judgement call requiring knowledge of the paradigm and end-use. In short, the problem requires knowledge not contained in the raw data.
I suggest that you use a simple stepping approach to converge on the solution you want. Set epsilon to some simple value that your observation suggests will be appropriate. If you get too much fragmentation, increase epsilon by a factor of 3; if the clusters are too large, decrease it by a factor of 3. Repeat your runs until you get the desired results.
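A rough sketch of that stepping loop using scikit-learn's DBSCAN; the data, starting epsilon, and fragmentation thresholds below are all illustrative assumptions, not recommended values.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Hypothetical data: two tight blobs (replace with your own points)
X = np.vstack([rng.normal(0.0, 0.05, (250, 2)),
               rng.normal(1.0, 0.05, (250, 2))])

eps = 10.0          # deliberately poor initial guess, to show the stepping
expected_max = 10   # illustrative upper bound on plausible cluster counts

for _ in range(10):
    labels = DBSCAN(eps=eps, min_samples=5).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps:.4f}: {n_clusters} clusters, {n_noise} noise points")
    if n_clusters > expected_max or n_noise > 0.5 * len(X):
        eps *= 3    # too fragmented: widen the neighborhood
    elif n_clusters <= 1:
        eps /= 3    # everything merged: tighten the neighborhood
    else:
        break
```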
I'm plotting frame times of my application and I'd like to automatically work out medians. I think the k-medians algorithm is exactly what I'm after, but I'm not sure how my problem maps onto it. My data points are at regular intervals, so I don't have arbitrary 2D data, but I also don't have just 1D data, as the time dimension matters.
How should I go about computing these clusters (I'd be more than happy with just 2-medians instead of k-medians)? The data can be quite noisy, which is why I want medians instead of means, and I don't want the noise to interfere with the clustering.
Also, is there a more in-depth article than Wikipedia's on k-medians clustering?
Don't use clustering.
Cluster analysis is really designed for multivariate data.
1 dimensional data is fundamentally different, because it is ordered. Multivariate data is not. This means that you can construct much more efficient algorithms for 1-dimensional data than for multivariate data.
Here, you want to perform time series segmentation. You may want to look into methods such as natural breaks optimization, but also e.g. kernel density estimation.
The simplest approach is to keep track of the standard deviation and, once a number of points deviate from it substantially, segment there.
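A minimal sketch of that idea; the `min_len` and `threshold` values are illustrative defaults, not tuned recommendations.

```python
import numpy as np

def segment(series, min_len=30, threshold=3.0):
    """Start a new segment when a point deviates more than `threshold`
    standard deviations from the current segment's mean."""
    boundaries = [0]
    for i in range(1, len(series)):
        seg = series[boundaries[-1]:i]
        if len(seg) < min_len:
            continue  # require a minimum segment before testing deviations
        mu, sigma = seg.mean(), seg.std()
        if sigma > 0 and abs(series[i] - mu) > threshold * sigma:
            boundaries.append(i)
    boundaries.append(len(series))
    return boundaries

# Example: a step change in frame times around index 500
frametimes = np.concatenate([np.random.normal(16.7, 0.5, 500),
                             np.random.normal(33.3, 0.5, 500)])
print(segment(frametimes))  # expect a boundary near index 500
```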
I need to cluster a matrix that contains mostly zero values... Is K-means appropriate for this kind of data, or do I need to consider a different algorithm?
No. The reason is that the mean is not sensible on sparse data. The resulting mean vectors will have very different characteristics than your actual data; they will often end up being more similar to each other than to actual documents!
There are some modifications that improve k-means for sparse data such as spherical k-means.
But largely, k-means on such data is just a crude heuristic. The results aren't entirely useless, but they are not the best that you can do either. It works, but by chance, not by design.
k-means is widely used to cluster sparse data such as document-term vectors, so I'd say go ahead. Whether you get good results depends on the data and what you're looking for, of course.
There are a few things to keep in mind:
If you have very sparse data, then a sparse representation of your input can reduce memory usage and runtime by many orders of magnitude, so pick a good k-means implementation.
Euclidean distance isn't always the best metric for sparse vectors, but normalizing them to unit length may give better results.
The cluster centroids are in all likelihood going to be dense regardless of the input sparsity, so don't use too many features.
Doing dimensionality reduction, e.g. SVD, on the samples may improve the running time and cluster quality a lot. (The sketch after this list combines these points.)
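Putting the sparse representation, unit-length normalization, and SVD points together, here is a hedged scikit-learn sketch on made-up data; all sizes and parameters are illustrative.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import normalize
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Hypothetical document-term matrix: 1,000 docs x 10,000 terms, ~0.5% nonzero
X = sp.random(1000, 10000, density=0.005, format='csr', random_state=0)

# Normalize rows to unit length so Euclidean distances behave like cosine
X = normalize(X)

# Reduce dimensionality (as in LSA) before clustering; 100 is illustrative
X_reduced = TruncatedSVD(n_components=100, random_state=0).fit_transform(X)

labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X_reduced)
print(np.bincount(labels))
```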