clustering or k-medians of points on a graph - algorithm

I'm plotting frame times of my application and I'd like to automatically work out the typical levels. I think the k-medians algorithm is exactly what I'm after, but I'm not sure how my problem maps onto it. My data points are at regular intervals, so I don't have arbitrary 2D data, but I also don't have plain 1D data, since the time dimension matters.
How should I go about computing these clusters (I'd be more than happy with just 2-medians instead of general k-medians)? The data can be quite noisy, which is why I want medians rather than means, and I don't want the noise to interfere with the clustering.
Also, is there a more in-depth article on k-medians clustering than Wikipedia's?

Don't use clustering.
Cluster analysis is really designed for multivariate data.
One-dimensional data is fundamentally different, because it is ordered and multivariate data is not. This means that you can construct much more efficient algorithms for 1-dimensional data than for multivariate data.
Here, you want to perform time series segmentation. You may want to look into methods such as natural breaks optimization, but also e.g. kernel density estimation.
The simplest approach is to keep track of the standard deviation and, once a run of points deviates from it substantially, segment there.
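A minimal sketch of that idea in Python (the function name `segment`, the thresholds, and the frame-time numbers are my own illustrative choices, not from the answer):

```python
import numpy as np

def segment(values, z_thresh=3.0, min_run=5):
    """Start a new segment once min_run consecutive points deviate from the
    running mean of the current segment by more than z_thresh std devs.
    Illustrative sketch, not a tuned changepoint detector."""
    boundaries, start, deviant = [0], 0, 0
    for i in range(1, len(values)):
        window = values[start:i]
        mu, sd = window.mean(), window.std()
        if sd > 0 and abs(values[i] - mu) > z_thresh * sd:
            deviant += 1
        else:
            deviant = 0
        if deviant >= min_run:
            start = i - min_run + 1      # the segment began where the run began
            boundaries.append(start)
            deviant = 0
    boundaries.append(len(values))
    return boundaries

# toy frame times: ~16 ms for a while, then a sustained jump to ~33 ms
rng = np.random.default_rng(0)
vals = np.concatenate([rng.normal(16, 0.3, 60), rng.normal(33, 0.3, 60)])
b = segment(vals)
```

Within each resulting segment you can then take a plain `np.median`, which gives the per-level medians the question asks for without any 2D clustering.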

Related

Which clustering algorithm is best when the number of clusters is unknown but there is no noise?

I have a dataset with an unknown number of clusters and I aim to cluster it. Since I don't know the number of clusters in advance, I tried to use density-based algorithms, especially DBSCAN. The problem I have with DBSCAN is how to detect an appropriate epsilon. The method suggested in the DBSCAN paper assumes there is some noise, so that when we plot the sorted k-dist graph we can detect a valley and define the threshold for epsilon. But my dataset was obtained from a controlled environment and there is no noise.
Does anybody have an idea of how to detect epsilon? Or can you suggest a better clustering algorithm that could fit this problem?
In general, there is no unsupervised epsilon detection. From what little you describe, DBSCAN is a very appropriate approach.
Real-world data tend to have a gentle gradient of distances; deciding what distance should be the cut-off is a judgement call requiring knowledge of the paradigm and end-use. In short, the problem requires knowledge not contained in the raw data.
I suggest that you use a simple stepping approach to converge on the solution you want. Set epsilon to some simple value that your observation suggests will be appropriate. If you get too much fragmentation, increase epsilon by a factor of 3; if the clusters are too large, decrease by a factor of 3. Repeat your runs until you get the desired results.
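The stepping loop might look something like this sketch. Note that `count_clusters` here is a hypothetical stand-in for a real DBSCAN run (a simple gap-based clustering on 1-D data), just so the factor-of-3 logic is concrete:

```python
import numpy as np

def count_clusters(points, eps):
    """Stand-in for DBSCAN on 1-D data: points closer than eps are linked."""
    xs = np.sort(points)
    return 1 + int(np.sum(np.diff(xs) > eps))

def tune_eps(points, target_lo, target_hi, eps=1.0, max_steps=20):
    """Factor-of-3 stepping: shrink eps if clusters merge, grow it if the
    result is too fragmented, until the cluster count lands in range."""
    for _ in range(max_steps):
        k = count_clusters(points, eps)
        if k > target_hi:        # too much fragmentation -> merge more
            eps *= 3
        elif k < target_lo:      # clusters too large -> split more
            eps /= 3
        else:
            break
    return eps, count_clusters(points, eps)

rng = np.random.default_rng(0)
pts = np.concatenate([rng.normal(0, 0.1, 30),
                      rng.normal(10, 0.1, 30),
                      rng.normal(20, 0.1, 30)])
eps, k = tune_eps(pts, target_lo=2, target_hi=5, eps=100.0)
```

In practice you would replace `count_clusters` with an actual DBSCAN call and keep the same stepping shell around it.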

Is K-means suitable for clustering data with many zero values?

I need to cluster a matrix which contains mostly zero values. Is k-means appropriate for this kind of data, or do I need to consider a different algorithm?
No. The reason is that the mean is not sensible on sparse data. The resulting mean vectors will have very different characteristics than your actual data; they will often end up being more similar to each other than to the actual data points!
There are some modifications that improve k-means for sparse data such as spherical k-means.
But largely, k-means on such data is just a crude heuristic. The results aren't entirely useless, but they are not the best that you can do either. It works, but by chance, not by design.
k-means is widely used to cluster sparse data such as document-term vectors, so I'd say go ahead. Whether you get good results depends on the data and what you're looking for, of course.
There are a few things to keep in mind:
If you have very sparse data, then a sparse representation of your input can reduce memory usage and runtime by many orders of magnitude, so pick a good k-means implementation.
Euclidean distance isn't always the best metric for sparse vectors, but normalizing them to unit length may give better results.
The cluster centroids are in all likelihood going to be dense regardless of the input sparsity, so don't use too many features.
Doing dimensionality reduction, e.g. SVD, on the samples may improve both the running time and the cluster quality a lot.
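To make the normalization point above concrete, here is a toy sketch of the "spherical" variant mentioned earlier: rows are unit-normalized and assignment uses cosine similarity. The function name and the naive first-k-rows initialization are my own simplifications, not a production recipe:

```python
import numpy as np

def spherical_kmeans(X, k, iters=10):
    """Toy spherical k-means: rows are normalized to unit length, assignment
    uses cosine similarity, centroids are re-normalized after each update.
    (Naive init from the first k rows; fine for a sketch only.)"""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    C = X[:k].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.argmax(X @ C.T, axis=1)       # most similar centroid
        for j in range(k):
            members = X[labels == j]
            if len(members):
                c = members.sum(axis=0)
                C[j] = c / np.linalg.norm(c)      # keep centroids on the sphere
    return labels, C

# two sparse-ish directions, interleaved so the naive init sees one of each
A = np.tile([3.0, 1, 0, 0, 0, 0], (10, 1))
B = np.tile([0, 0, 0, 0, 1.0, 2], (10, 1))
X = np.empty((20, 6))
X[0::2], X[1::2] = A, B
labels, C = spherical_kmeans(X, 2)
```

Because distances are computed on unit vectors, the differing row norms that plague plain k-means on sparse data no longer dominate the assignment step.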

Difference between quadratic split and linear split

I am trying to understand how an R-tree works, and saw that there are two types of splits: quadratic and linear.
What actually are the differences between the linear and quadratic splits? And in which case would one be preferred over the other?
The original R-Tree paper describes the differences between PickSeeds and LinearPickSeeds in sections 3.5.2 and 3.5.3, and the charts in section 4 show the performance differences between the two algorithms. Note that figure 4.2 uses an exponential scale for the Y-axis.
http://www.cs.bgu.ac.il/~atdb082/wiki.files/paper6.pdf
I would personally use LinearPickSeeds for cases where the R-Tree has high "churn" and memory usage is not critical, and QuadraticPickSeeds for cases where the R-Tree is relatively static or in a limited memory environment. But that's just a rule of thumb; I don't have benchmarks to back that up.
Both are heuristics for finding a split with small total area.
In the quadratic split you choose the two objects that would create as much empty space as possible if grouped together. In the linear split you choose the two objects that are farthest apart.
Quadratic provides a slightly better quality of split. However, for many practical purposes, linear is nearly as good while being simpler and faster.
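A sketch of the two seed-picking heuristics on 2-D rectangles `(x1, y1, x2, y2)` may make the difference clearer. The function names follow Guttman's description but the code itself is my own illustration:

```python
from itertools import combinations

def area(r):
    x1, y1, x2, y2 = r
    return (x2 - x1) * (y2 - y1)

def mbr(a, b):
    """Minimum bounding rectangle of two rectangles."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def quadratic_pick_seeds(rects):
    """O(n^2): the pair whose combined box wastes the most area (dead space)."""
    def waste(pair):
        i, j = pair
        return area(mbr(rects[i], rects[j])) - area(rects[i]) - area(rects[j])
    return max(combinations(range(len(rects)), 2), key=waste)

def linear_pick_seeds(rects):
    """O(n): per axis, the pair with the greatest normalized separation."""
    best, best_sep = None, float("-inf")
    for lo, hi in ((0, 2), (1, 3)):                 # x axis, then y axis
        width = max(r[hi] for r in rects) - min(r[lo] for r in rects)
        if width <= 0:
            continue
        i = max(range(len(rects)), key=lambda n: rects[n][lo])  # highest low side
        j = min(range(len(rects)), key=lambda n: rects[n][hi])  # lowest high side
        sep = (rects[i][lo] - rects[j][hi]) / width
        if i != j and sep > best_sep:
            best_sep, best = sep, (i, j)
    return best

rects = [(0, 0, 1, 1), (10, 0, 11, 1), (0.5, 0, 1.5, 1)]
q_seeds = quadratic_pick_seeds(rects)
l_seeds = linear_pick_seeds(rects)
```

Quadratic examines every pair, which is where the better split quality (and the extra cost) comes from; linear makes one pass per axis.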
There are even more variants: Exhaustive search, Greenes split, Ang Tan split and the R*-tree split.
All of them are heuristics to find a good split in acceptable time.
In my experiments, the R*-tree split works best, because it produces more rectangular pages. Ang-Tan, while being "linear", produces slices that are actually a pain for most queries. Often, the cost at construction/insertion time is not too important, but query cost is.

Is kd-Tree an alternative to K-means clustering?

I'm working on the encoding stage of a bag-of-words (BOW) object detection pipeline. I have seen some implementations that use a kd-tree in the encoding stage, but most writings suggest that k-means clustering is the way to go.
What is the difference between the two?
In object detection, k-means is used to quantize descriptors. A kd-tree can be used to search for descriptors with or without quantization. Each approach has its pros and cons. Specifically, kd-trees are not much better than brute-force search when the number of descriptor dimensions exceeds 20.
AFAIK the kd-tree is used for the labeling phase: when clustering over a large number of groups (hundreds if not thousands), it is much faster than the naive approach of simply taking the argmin of the distances to every group. k-means (http://en.wikipedia.org/wiki/K-means_clustering) is the actual clustering algorithm; it's fast, though not always very precise. Some implementations return only the groups, while others return the groups plus the labels of the training data set. I usually use http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.cKDTree.html in conjunction with http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.vq.kmeans2.html
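A short sketch of that combination, assuming SciPy is available (the blob data is invented for illustration): `kmeans2` builds the vocabulary once, and a `cKDTree` over the centroids then labels new descriptors quickly.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
# two well-separated blobs of toy "descriptors"
train = np.vstack([rng.normal(0, 0.5, (100, 2)),
                   rng.normal(100, 0.5, (100, 2))])

# cluster the training descriptors once with k-means ("++" init)...
centroids, _ = kmeans2(train, 2, minit="++", seed=1)

# ...then label new descriptors via nearest-centroid lookup in a kd-tree
tree = cKDTree(centroids)
_, labels = tree.query(np.array([[0.3, -0.2], [99.8, 100.1]]))
```

So the two are complementary, not alternatives: k-means defines the vocabulary, the kd-tree accelerates the quantization lookups.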
The kd-tree and the k-means algorithm are two different types of clustering method.
Here are several types of clustering method as follows:
kd-Tree is a hierarchical-clustering method (median-based).
K-means is a means-based clustering method.
GMM (Gaussian mixture model) is a probability-based clustering method (soft-clustering).
etc.
[UPDATE]:
Generally, there are two types of clustering methods: soft clustering and hard clustering. Probabilistic methods like the GMM are soft clustering, assigning each object to the clusters with probabilities; the others are hard clustering, assigning each object to exactly one cluster.

How to efficiently find k-nearest neighbours in high-dimensional data?

So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using Euclidean distance; currently k = 2, if this makes it easier).
My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimensions grows. In my sample implementation, it's only slightly faster than exhaustive search.
My next idea would be to use PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: is there some clever algorithm or data structure to solve this exactly in reasonable time?
The Wikipedia article for kd-trees has a link to the ANN library:
ANN is a library written in C++, which supports data structures and algorithms for both exact and approximate nearest neighbor searching in arbitrarily high dimensions. Based on our own experience, ANN performs quite efficiently for point sets ranging in size from thousands to hundreds of thousands, and in dimensions as high as 20. (For applications in significantly higher dimensions, the results are rather spotty, but you might try it anyway.)
As far as algorithms/data structures are concerned:
The library implements a number of different data structures, based on kd-trees and box-decomposition trees, and employs a couple of different search strategies.
I'd try it first directly, and if that doesn't produce satisfactory results I'd use it on the data set after applying PCA/ICA (since it's quite unlikely you're going to end up with few enough dimensions for a kd-tree to handle).
use a kd-tree
Unfortunately, in high dimensions this data structure suffers severely from the curse of dimensionality, which causes its search time to be comparable to the brute force search.
reduce the number of dimensions
Dimensionality reduction is a good approach, which offers a fair trade-off between accuracy and speed. You lose some information when you reduce your dimensions, but gain some speed.
By accuracy I mean finding the exact Nearest Neighbor (NN).
Principal Component Analysis (PCA) is a good choice when you want to reduce the dimensionality of the space your data live in.
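A sketch of the PCA-then-kd-tree pipeline, assuming SciPy is available (the synthetic 75-d data, which secretly lives near a 5-d subspace, is my own construction):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
# 75-d points that actually lie near a 5-d subspace, as real data often does
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 75)) \
    + 0.01 * rng.normal(size=(1000, 75))

# PCA via SVD: project the centered data onto the top 5 principal components
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:5].T

# a kd-tree in 5-d is effective again (in 75-d it degrades to brute force)
tree = cKDTree(Z)
dists, idx = tree.query(Z[0], k=3)
```

The neighbours found in the reduced space are only approximations of the true 75-d neighbours, which is exactly the accuracy/speed trade-off described above.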
Is there some clever algorithm or data structure to solve this exactly in reasonable time?
Approximate nearest neighbor search (ANNS), where you are satisfied with finding a point that might not be the exact Nearest Neighbor, but a good approximation of it (for example, the 4th NN to your query when you were looking for the 1st).
That approach costs you accuracy, but increases performance significantly. Moreover, the probability of finding a good NN (close enough to the query) is relatively high.
You could read more about ANNS in the introduction of our kd-GeRaF paper.
A good idea is to combine ANNS with dimensionality reduction.
Locality Sensitive Hashing (LSH) is a modern approach to the Nearest Neighbor problem in high dimensions. The key idea is that points lying close to each other are hashed to the same bucket. So when a query arrives, it is hashed to a bucket, and that bucket (and usually its neighboring ones) contains good NN candidates.
FALCONN is a good C++ implementation, which focuses on cosine similarity. Another good implementation is our DOLPHINN, which is a more general library.
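The random-hyperplane flavor of LSH for cosine similarity fits in a few lines; this is a toy sketch (the function name and the 8-plane choice are mine, not from either library):

```python
import numpy as np

def hyperplane_hash(X, planes):
    """Random-hyperplane LSH for cosine similarity: a point's bucket is the
    bit pattern recording which side of each hyperplane it lies on."""
    bits = (X @ planes.T) > 0
    return [tuple(row) for row in bits]

rng = np.random.default_rng(0)
planes = rng.normal(size=(8, 50))             # 8 random hyperplanes in 50-d

base = rng.normal(size=50)
near = base + 1e-4 * rng.normal(size=50)      # nearly the same direction
far = -base                                   # opposite direction

buckets = hyperplane_hash(np.vstack([base, near, far]), planes)
```

Two nearly parallel vectors almost always land in the same bucket, while an antipodal vector gets the complementary bit pattern; more planes make buckets finer, and multiple hash tables raise the recall.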
You could conceivably use Morton Codes, but with 75 dimensions they're going to be huge. And if all you have is 16,000 data points, exhaustive search shouldn't take too long.
No reason to believe this is NP-complete. You're not really optimizing anything, and I'd have a hard time figuring out how to convert this to another NP-complete problem (I have Garey and Johnson on my shelf and can't find anything similar). Really, I'd just pursue more efficient methods of searching and sorting. If you have n observations, you have to calculate n × n distances right up front. Then for every observation, you need to pick out its top k nearest neighbours. That's n squared for the distance calculations and n log n for each sort, but you have to do the sort n times (once for every point). Messy, but still polynomial time to get your answers.
A BK-tree isn't such a bad thought. Take a look at Nick's blog on Levenshtein automata. While his focus is strings, it should give you a springboard for other approaches. The other thing I can think of is R-trees; however, I don't know whether they've been generalized to high dimensions. I can't say more than that, since I have neither used them directly nor implemented them myself.
One very common implementation would be to sort the nearest-neighbour array that you have computed for each data point.
As sorting the entire array can be very expensive, you can use indirect sorting, e.g. numpy.argpartition in the NumPy library, to find only the closest K values you are interested in. There is no need to sort the entire array.
Grembo's answer above can be sped up significantly this way: you only need the K nearest values, and there is no need to sort all the distances from each point.
If you just need the K neighbours, this method reduces your computational cost and time complexity very well; if you need them sorted, sort the small output again.
See the documentation for argpartition.
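Putting the pieces together, here is a small sketch of the whole approach (vectorized pairwise distances, then `argpartition` instead of a full sort; the function name is my own):

```python
import numpy as np

def knn_indices(X, k):
    """Indices of each row's k nearest other rows (Euclidean), using
    argpartition so only the k smallest distances per row get sorted."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # squared distances
    np.fill_diagonal(d2, np.inf)                       # exclude self-matches
    part = np.argpartition(d2, k, axis=1)[:, :k]       # k smallest, unsorted
    rows = np.arange(len(X))[:, None]
    order = np.argsort(d2[rows, part], axis=1)         # sort just those k
    return part[rows, order]
```

This keeps the unavoidable O(n²) distance work but replaces the n·O(n log n) sorting with n·O(n) partitioning plus an O(k log k) sort of each row's k candidates.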
