Which clustering machine learning algorithm is best suited for clustering one-dimensional numerical features (scalar values)?
Is it Birch, Spectral clustering, k-means, DBSCAN...or something else?
All of these methods are aimed at multivariate data. Except for k-means, which historically was used on one-dimensional data, they were all designed with the multivariate problem in mind, and none of them is well optimized for the particular case of 1-dimensional data.
For one-dimensional data, use kernel density estimation. KDE is a nice technique in 1d, has strong statistical support, and only becomes hard to use for clustering in multiple dimensions.
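For example, here is a minimal sketch of that idea with SciPy's gaussian_kde (the two-bump sample data and the default bandwidth are illustrative assumptions):

    # Estimate the density of the 1-d data, then cut the data at
    # local minima of the density estimate.
    import numpy as np
    from scipy.stats import gaussian_kde

    x = np.concatenate([np.random.normal(0, 1, 200),
                        np.random.normal(8, 1, 200)])

    kde = gaussian_kde(x)                       # default (Scott) bandwidth
    grid = np.linspace(x.min(), x.max(), 1000)
    density = kde(grid)

    # Local minima of the density are natural 1-d cluster boundaries.
    minima = grid[1:-1][(density[1:-1] < density[:-2]) &
                        (density[1:-1] < density[2:])]

    labels = np.searchsorted(minima, x)         # segment index per point
    print(minima, np.bincount(labels))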
Take a look at the k-means clustering algorithm. It works really well for clustering one-dimensional feature vectors. However, k-means doesn't cope well with outliers in your training dataset, in which case you may want a more robust, median-based alternative such as k-medoids.
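As a quick illustration (a scikit-learn sketch; the values and k=2 are made up, and note the reshape, since scikit-learn expects a 2-d array of shape (n_samples, n_features)):

    import numpy as np
    from sklearn.cluster import KMeans

    values = np.array([1.1, 0.9, 1.4, 7.8, 8.2, 8.0, 1.2, 7.9])
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(values.reshape(-1, 1))
    print(labels)  # the low values and the high values land in separate clusters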
I'd suggest that before implementing a machine learning algorithm (classification, clustering, etc.) for your dataset and problem statement, you use the Weka toolkit to check which algorithm best fits your problem. The Weka toolkit is a collection of a large number of machine learning and data mining algorithms that can be easily applied to a given problem. Once you have identified which algorithm works best, you can modify or write your own implementation of it; by tweaking it, you can achieve even better accuracy. You can download Weka from the official Weka website.
I'm currently working on a Machine Learning project for my Artificial Intelligence exam. The goal is to choose two classification algorithms to compare using WEKA, bearing in mind that the two algorithms must be different enough to make the comparison meaningful. Besides, the algorithms must handle both nominal and numeric data (I suppose this is mandatory for the comparison to be possible).
My professor suggested choosing a statistical classifier and a decision tree classifier, for example, or delving into a comparison between a bottom-up classifier and a top-down one.
Since I have very little experience in the Machine Learning field, I am doing some research on the various algorithms WEKA offers, and I stumbled upon kNN, that is, the k-nearest neighbors algorithm.
Is it statistical? And could it be compared with a Decision Stump algorithm, for example?
Or else, can you suggest a couple of algorithms that match these requirements I have pointed out above?
P.S.: The handled data must be both numerical and nominal. In WEKA there are numerical/nominal features and numerical/nominal classes. Do I have to choose algorithms with both numerical/nominal features AND classes, or just one of them?
I would really appreciate any help guys, thanks for your patience!
Based on your professor's description, I would not consider k-Nearest Neighbors (kNN) a statistical classifier. In most contexts, a statistical classifier is one that generalizes via statistics of the training data (either by using statistics directly or by transforming them). An example of this is the Naïve Bayes Classifier.
By contrast, kNN is an example of Instance-Based Learning. It doesn't use statistics of the training data; rather, it compares new observations directly to the training instances to perform classification.
With regard to comparison, yes, you can compare the performance of kNN with a Decision Stump (or any other classifier). Since any two supervised classifiers will yield classification accuracies with respect to your training/testing data, you can compare their performance.
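Outside WEKA, here is what such a comparison could look like as a scikit-learn sketch (the Iris data, n_neighbors=3, and 10-fold cross-validation are illustrative assumptions; a depth-1 decision tree stands in for a decision stump):

    # kNN (instance-based) versus a decision stump, scored with
    # 10-fold cross-validation on the same data.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    knn = KNeighborsClassifier(n_neighbors=3)
    stump = DecisionTreeClassifier(max_depth=1)  # a decision stump

    for name, clf in [("kNN", knn), ("stump", stump)]:
        print(name, cross_val_score(clf, X, y, cv=10).mean())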
Please do bear with me if you find my query a little stupid, but I am currently doing a high school research project on how the Fourier transform can be used in recognizing human speech (similar to how Shazam works). I need two different Fast Fourier Transform (FFT) algorithms for this project. One of them would definitely be the Cooley-Tukey FFT algorithm. However, I am unsure which other FFT algorithm to use. What would be a good choice, and is there any pseudocode/source code for that particular algorithm? I have only been able to find algorithms for Cooley-Tukey so far.
Thanks!
If you don't have hard performance constraints, then a plain DFT (a straight matrix multiply) should produce very similar results (differing only by rounding noise) using a very different algorithm.
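For illustration, here is a minimal sketch of the straight-matrix-multiply DFT, checked against NumPy's FFT (the random input is an arbitrary assumption):

    # Naive O(n^2) DFT: multiply the signal by the DFT matrix
    # W[j, k] = exp(-2*pi*i*j*k / n).
    import numpy as np

    def dft(x):
        n = len(x)
        k = np.arange(n)
        W = np.exp(-2j * np.pi * np.outer(k, k) / n)
        return W @ x

    x = np.random.rand(64)
    print(np.allclose(dft(x), np.fft.fft(x)))  # True, up to rounding noise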
I'm plotting the frametimes of my application and I'd like to automatically work out medians. I think the k-medians algorithm is exactly what I'm after, but I'm not sure how it applies to my problem. My data points are at regular intervals, so I don't have arbitrary 2D data, but I also don't have just 1D data, as the time dimension matters.
How should I go about computing these clusters (I'd be more than happy with just 2-medians instead of k-medians)? The data can be quite noisy, which is why I want medians instead of means, and I don't want the noise to interfere with the clustering.
Also, is there a more in-depth article on k-medians clustering than Wikipedia's?
Don't use clustering.
Cluster analysis is really designed for multivariate data.
1-dimensional data is fundamentally different, because it is ordered; multivariate data is not. This means that you can construct much more efficient algorithms for 1-dimensional data than for multivariate data.
Here, you want to perform time series segmentation. You may want to look into methods such as natural breaks optimization, but also e.g. kernel density estimation.
The simplest approach is to keep track of the standard deviation, and once a point (or a run of points) deviates substantially from it, segment there.
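A rough sketch of that idea (the threshold k and the minimum window length are arbitrary assumptions that would need tuning):

    import numpy as np

    def segment(series, k=4.0, min_len=10):
        """Start a new segment when a point falls more than k standard
        deviations from the running mean of the current segment."""
        boundaries = []
        start = 0
        for i in range(1, len(series)):
            window = series[start:i]
            if len(window) >= min_len:
                mu, sigma = window.mean(), window.std()
                if sigma > 0 and abs(series[i] - mu) > k * sigma:
                    boundaries.append(i)
                    start = i
        return boundaries

    frametimes = np.concatenate([np.random.normal(16, 0.5, 100),
                                 np.random.normal(33, 0.5, 100)])
    print(segment(frametimes))  # roughly [100]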
I need to cluster a matrix which contains mostly zero values. Is k-means appropriate for this kind of data, or do I need to consider a different algorithm?
No. The reason is that the mean is not sensible on sparse data. The resulting mean vectors will have very different characteristics than your actual data; they will often end up being more similar to each other than to actual documents!
There are some modifications that improve k-means for sparse data such as spherical k-means.
But largely, k-means on such data is just a crude heuristic. The results aren't entirely useless, but they are not the best that you can do either. It works, but by chance, not by design.
k-means is widely used to cluster sparse data such as document-term vectors, so I'd say go ahead. Whether you get good results depends on the data and what you're looking for, of course.
There are a few things to keep in mind (a sketch combining them follows the list):
If you have very sparse data, then a sparse representation of your input can reduce memory usage and runtime by many orders of magnitude, so pick a good k-means implementation.
Euclidean distance isn't always the best metric for sparse vectors, but normalizing them to unit length may give better results.
The cluster centroids are in all likelihood going to be dense regardless of the input sparsity, so don't use too many features.
Doing dimensionality reduction, e.g. SVD, on the samples may greatly improve both the running time and the cluster quality.
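Putting the points above together, here is a minimal scikit-learn sketch (the toy documents, k=2, and n_components=2 are illustrative assumptions, not recommendations):

    # Sparse TF-IDF input, SVD dimensionality reduction, unit-length
    # normalization, then k-means -- a sketch of the checklist above.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.preprocessing import Normalizer
    from sklearn.pipeline import make_pipeline
    from sklearn.cluster import KMeans

    docs = ["the cat sat", "a cat and a dog",
            "stock markets fell", "markets rallied today"]

    X = TfidfVectorizer().fit_transform(docs)  # sparse matrix; saves memory

    # Reduce dimensionality, then re-normalize to unit length.
    lsa = make_pipeline(TruncatedSVD(n_components=2), Normalizer(copy=False))
    X_reduced = lsa.fit_transform(X)

    labels = KMeans(n_clusters=2, n_init=10).fit_predict(X_reduced)
    print(labels)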
I'm working on BoW (bag-of-words) object detection, specifically the encoding stage. I have seen some implementations that use a kd-tree in the encoding stage, but most writings suggest that k-means clustering is the way to go.
What is the difference between the two?
In object detection, k-means is used to quantize descriptors. A kd-tree can be used to search for descriptors with or without quantization. Each approach has its pros and cons. Specifically, kd-trees are not much better than brute-force search when the number of descriptor dimensions exceeds 20.
AFAIK, the kd-tree is used for the labeling phase: when clustering over a large number of groups (hundreds, if not thousands), it is much faster than the naive approach of simply taking the argmin of the distances to every group. k-means (http://en.wikipedia.org/wiki/K-means_clustering) is the actual clustering algorithm; it is fast, though not always very precise. Some implementations return just the groups, while others return the groups plus the labels of the training data set. I usually use http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.cKDTree.html in conjunction with http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans2.html.
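For illustration, here is a minimal sketch of that combination using the two linked SciPy functions (the random descriptors and k=50 are made-up placeholders):

    import numpy as np
    from scipy.cluster.vq import kmeans2
    from scipy.spatial import cKDTree

    descriptors = np.random.rand(1000, 2)        # stand-in for real descriptors
    centroids, labels = kmeans2(descriptors, k=50, minit='points')

    # Label new descriptors quickly via a kd-tree over the centroids.
    tree = cKDTree(centroids)
    new_descriptors = np.random.rand(10, 2)
    _, new_labels = tree.query(new_descriptors)  # nearest centroid per row
    print(new_labels)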
kd-Tree and k-means are two different types of clustering methods.
Here are several types of clustering methods:
kd-Tree is a hierarchical-clustering method (median-based).
K-means is a means-based clustering method.
GMM (Gaussian mixture model) is a probability-based clustering method (soft-clustering).
etc.
[UPDATE]:
Generally, there are two types of clustering methods: soft clustering and hard clustering. Probabilistic clustering methods like the GMM are of the soft clustering type, assigning objects to clusters with probabilities; the others are hard clustering methods, assigning each object to exactly one cluster.
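To illustrate the soft/hard distinction, here is a small scikit-learn sketch (the synthetic two-blob data is an illustrative assumption):

    # k-means assigns each point to exactly one cluster (hard), while a
    # GMM returns a probability per cluster for each point (soft).
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    X = np.concatenate([np.random.normal(0, 1, (100, 2)),
                        np.random.normal(5, 1, (100, 2))])

    hard = KMeans(n_clusters=2, n_init=10).fit_predict(X)
    soft = GaussianMixture(n_components=2).fit(X).predict_proba(X)

    print(hard[:3])  # e.g. [0 0 0]
    print(soft[:3])  # e.g. rows like [0.99 0.01]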