I see that for k-means, we have Lloyd's algorithm, Elkan's algorithm, and we also have hierarchical version of k-means.
For all these algorithms, I see that Elkan's algorithm could provide a boost in term of speed. But what I want to know, is the quality from all these k-means algorithms. Each time, we run these algorithms, the result would be different, due to their heuristic and probabilistic nature. Now, my question is, when it comes to clustering algorithm like k-means, if we want to have a better quality result (as in lesser distortion, etc.) between all these k-means algorithms, which algorithm would be able to give you the better quality? Is it possible to measure such thing?
A better solution is usually one that has a better (lower) J(x,c) value, where:
J(x,c) = 1/|x| * Sum(distance(x(i),c(centroid(i)))) for each i in [1,|x|]
Wherre:
x is the list of samples
|x| is the size of x (number of elements)
[1,|x|] all the numbers from 1 to |x| (inclusive)
c is the list of centroids (or means) of clusters (i.e., for k clusters |c| = k)
distance(a,b) (sometimes denoted as ||a-b|| is the distance between "point" a to "point" b (In euclidean 2D space it is sqrt((a.x-b.x)^2 + (a.y-b.y)^2))
centroid(i) - the centroid/mean which is closest to x(i)
Note that this approach does not require switching to supervised technique and can be fully automated!
As I understand it, you need some data with labels to cross-validate you clustering algorithm.
How about the pathological case of the two-moons dataset? unsupervised k-means will fail badly. A high quality method I am aware of employs a more probabilistic approach using mutual information and combinatorial optimization. Basically you cast the clustering problem as the problem of finding the optimal [cluster] subset of the full point-set for the case of two clusters.
You can find the relevant paper here (page 42) and the corresponding Matlab code here to play with (checkout the two-moons case). If you are interested in a C++ high-performance implementation of that with a speed up of >30x then you can find it here HPSFO.
To compare the quality, you should have a labeled dataset and measure the results by some criteria like NMI
Related
I built a classifier model using KNN as learners for an ensemble based on the random subspace method.
I have three predictors, whose dimension is 541 samples, and I develop an optimization procedure to find the best k (number of neighbours).
I chose the k that maximize the AUC of the classifier, whose performance is computed with 10-fold-cross-validation.
The result for the best k was 269 for each single weak learners (that are 60 as a result of a similar optimization).
Now, my question is:
Are 269 neighbours too many? I trust the results of the optimization, but I have never used so many neighbours and I am worried about overfitting.
Thank you in advance,
MP
The choice of k-value in k-NN is rather data dependent. We can argue about more general characteristics of smaller or bigger choices of k-values, but specifying a certain number as good/bad is not very accurate to tell. Because of this, if your CV implementation is correct, you can trust the results and move further with it because the CV will give the optimal for your specific case. For more of a general discussion, we can say these about the choice of the k-value:
1- Smaller choice of k-value : Small choice of k-values might increase the overall accuracy and are less costly to implement, but will make the system less robust to noisy input.
2- Bigger choice of k-value : Bigger choice of k-values will make the system more robust against noisy input, but will be more costly to execute and have weaker decision boundaries compared to smaller k-values.
You can always compare these general characteristics while choosing the k-value in your application. However, for choosing the optimal values using an algorithm like CV will give you a definite answer.
I read a a paper that mention max min clustering algorithm, but i don't really quite understand what this algorithm does. Googling "max min clustering algorithm" doesn't yield any helpful result. does anybody know what this algorithm mean? this is an excerpt of the paper:
Max-min clustering proceeds by choosing an observation at random as the first centroid c1, and by setting the set C of centroids to {c1}. During the ith iteration, ci is chosen such that it maximizes the minimum Euclidean distance between ci and observations in C. Max-min clustering is preferable to a density-based clustering algorithm (e.g. k-means) which would tend to select many examples from the dense group of non-seizure data points.
I don't quite understand the bolded part.
link to paper is here
We choose each new centroid to be as far as possible from the existing centroids. Here's some Python code.
def maxminclustering(observations, k):
observations = set(observations)
if k < 1 or not observations: return set()
centroids = set([observations.pop()])
for i in range(min(k - 1, len(observations))):
newcentroid = max(observations,
key=lambda observation:
min(distance(observation, centroid)
for centroid in centroids))
observations.remove(newcentroid)
centroids.add(newcentroid)
return centroids
This sounds a lot like the farthest-points heuristic for seeding k-means, but then not performing any k-means iterations at all.
This is a surprisingly simple, but quite effective strategy. Basically it will find a number of data points that are well spread out, which can make k-means converge fast. Usually, one would discard the first (random) data point.
It only works well for low values of k though (it avoids placing centroids in the center of the data set!), and it is not very favorable to multiple runs - it tends to choose the same initial centroids again.
K-means++ can be seen as a more randomized version of this. Instead of always choosing the farthes object, it chooses far objects with increased likelihood, but may at random also choose a near neighbor. This way, you get more diverse results when running it multiple times.
You can try it out in ELKI, it is named FarthestPointsInitialMeans. If you choose the algorithm SingleAssignmentKMeans, then it will not perform k-means iterations, but only do the initial assignment. That will probably give you this "MaxMin clustering" algorithm.
What is the best way to test a clustering algorithm? I am using an agglomerative clustering algorithm with a stop criterion. How do I test if the clusters are formed correctly or not?
A good rule of thumb for evaluating how much a graph can be clustered (on a coarse grained level) has to do with the "eigenvalue gap". Given a weighted graph A, calculate the eigenvalues and sort them (this is the eigenvalue spectrum). When plotted, if there is a large jump in the spectrum at some point, there is a natural corresponding block to partition the graph.
Below is an example (in numpy python) that shows, given an almost block diagonal matrix there a large gap in the eigenvalue spectrum at the number of blocks (parameterized by c in the code). Note that a matrix permutation (identical to labeling your graph nodes) still gives the same spectral gap:
from numpy import *
import pylab as plt
# Make a block diagonal matrix
N = 30
c = 5
A = zeros((N*c,N*c))
for m in xrange(c):
A[m*N:(m+1)*N, m*N:(m+1)*N] = random.random((N,N))
# Add some noise
A += random.random(A.shape) * 0.1
# Make symmetric
A += A.T - diag(A.diagonal())
# Show the original matrix
plt.subplot(131)
plt.imshow(A.copy(), interpolation='nearest')
# Permute the matrix for effect
idx = random.permutation(N*c)
A = A[idx,:][:,idx]
# Compute eigenvalues
L = linalg.eigvalsh(A)
# Show the results
plt.subplot(132)
plt.imshow(A, interpolation='nearest')
plt.subplot(133)
plt.plot(sorted(L,reverse=True))
plt.plot([c-.5,c-.5],[0,max(L)],'r--')
plt.ylim(0,max(L))
plt.xlim(0,20)
plt.show()
It depends on what you want to test against.
When testing your own implementation of a known algorithm, you might want to compare the results with that of a known good implementation.
Hierarchical clustering is hard to test with respect to quality, as it is hierarchical. The common measures such as Rand index etc. are only valid for strict partitionings. You can get a strict partitioning from a hierarchical clustering, but then you need to fix the height to cut at.
Ideally you have some kind of pre-clustered data (supervised learning) and test the results of your clustering algorithm on that. Simply count the number of correct classifications divided by the total number of classifications performed to get an accuracy score.
If you are doing unsupervised learning, then there is really no way to evaluate your algorithm.
It is sometimes useful to construct input data where there is a known, and perhaps obvious, answer by construction. For a clustering algorithm, you might construct data with N clusters such that the maximum distance between any two points in the same cluster is smaller than the minimum distance between any two points in different clusters. Another option would be to generate a number of different data sets plotable as 2-d scatter diagrams with clusters obvious to the eye, then compare the result from your algorithm with this structure, perhaps moving the clusters together to see when the algorithm fails to see them.
You might be able to do better given knowledge of your particular clustering algorithm, but the above might at least have some chance of flushing obvious bugs from cover.
I am looking for a general algorithm to help in situations with similar constraints as this example :
I am thinking of a system where images are constructed based on a set of operations. Each operation has a set of parameters. The total "gene" of the image is then the sequential application of the operations with the corresponding parameters. The finished image is then given a vote by one or more real humans according to how "beautiful" it is.
The question is what kind of algorithm would be able to do better than simply random search if you want to find the most beautiful image? (and hopefully improve the confidence over time as votes tick in and improve the fitness function)
Given that the operations will probably be correlated, it should be possible to do better than random search. So for example operation A with parameters a1 and a2 followed by B with parameters b1 could generally be vastly superior to B followed by A. The order of operations will matter.
I have tried googling for research papers on random walk and markov chains as that is my best guesses about where to look, but so far have found no scenarios similar enough. I would really appreciate even just a hint of where to look for such an algorithm.
I think what you are looking for fall in a broad research area called metaheuristics (which include many non-linear optimization algorithms such as genetic algorithms, simulated annealing or tabu search).
Then if your raw fitness function is just giving a statistical value somehow approximating a real (but unknown) fitness function, you can probably still use most metaheuristics by (somehow) smoothing your fitness function (averaging results would do that).
Do you mean the Metropolis algorithm?
This approach uses a random walk, weighted by the fitness function. It is useful for locating local extrema in complicated fitness landscapes, but is generally slower than deterministic approaches where those will work.
You're pretty much describing a genetic algorithm in which the sequence of operations represents the "gene" ("chromosome" would be a better term for this, where the parameter[s] passed to each operation represents a single "gene", and multiple genes make up a chromosome), the image produced represents the phenotypic expression of the gene, and the votes from the real humans represent the fitness function.
If I understand your question, you're looking for an alternative algorithm of some sort that will evaluate the operations and produce a "beauty" score similar to what the real humans produce. Good luck with that - I don't think there really is any such thing, and I'm not surprised that you didn't find anything. Human brains, and correspondingly human evaluations of aesthetics, are much too staggeringly complex to be reducible to a simplistic algorithm.
Interestingly, your question seems to encapsulate the bias against using real human responses as the fitness function in genetic-algorithm-based software. This is a subject of relevance to me, since my namesake software is specifically designed to use human responses (or "votes") to evaluate music produced via a genetic process.
Simple Markov Chain
Markov chains, which you mention, aren't a bad way to go. A Markov chain is just a state machine, represented as a graph with edge weights which are transition probabilities. In your case, each of your operations is a node in the graph, and the edges between the nodes represent allowable sequences of operations. Since order matters, your edges are directed. You then need three components:
A generator function to construct the graph of allowed transitions (which operations are allowed to follow one another). If any operation is allowed to follow any other, then this is easy to write: all nodes are connected, and your graph is said to be complete. You can initially set all the edge weights to 1.
A function to traverse the graph, crossing N nodes, where N is your 'gene-length'. At each node, your choice is made randomly, but proportionally weighted by the values of the edges (so better edges have a higher chance of being selected).
A weighting update function which can be used to adjust the weightings of the edges when you get feedback about an image. For example, a simple update function might be to give each edge involved in a 'pleasing' image a positive vote each time that image is nominated by a human. The weighting of each edge is then normalised, with the currently highest voted edge set to 1, and all the others correspondingly reduced.
This graph is then a simple learning network which will be refined by subsequent voting. Over time as votes accumulate, successive traversals will tend to favour the more highly rated sequences of operations, but will still occasionally explore other possibilities.
Advantages
The main advantage of this approach is that it's easy to understand and code, and makes very few assumptions about the problem space. This is good news if you don't know much about the search space (e.g. which sequences of operations are likely to be favourable).
It's also easy to analyse and debug - you can inspect the weightings at any time and very easily calculate things like the top 10 best sequences known so far, etc. This is a big advantage - other approaches are typically much harder to investigate ("why did it do that?") because of their increased abstraction. Although very efficient, you can easily melt your brain trying to follow and debug the convergence steps of a simplex crawler!
Even if you implement a more sophisticated production algorithm, having a simple baseline algorithm is crucial for sanity checking and efficiency comparisons. It's also easy to tinker with, by messing with the update function. For example, an even more baseline approach is pure random walk, which is just a null weighting function (no weighting updates) - whatever algorithm you produce should perform significantly better than this if its existence is to be justified.
This idea of baselining is very important if you want to evaluate the quality of your algorithm's output empirically. In climate modelling, for example, a simple test is "does my fancy simulation do any better at predicting the weather than one where I simply predict today's weather will be the same as yesterday's?" Since weather is often correlated on a timescale of several days, this baseline can give surprisingly good predictions!
Limitations
One disadvantage of the approach is that it is slow to converge. A more agressive choice of update function will push promising results faster (for example, weighting new results according to a power law, rather than the simple linear normalisation), at the cost of giving alternatives less credence.
This is equivalent to fiddling with the mutation rate and gene pool size in a genetic algorithm, or the cooling rate of a simulated annealing approach. The tradeoff between 'climbing hills or exploring the landscape' is an inescapable "twiddly knob" (free parameter) which all search algorithms must deal with, either directly or indirectly. You are trying to find the highest point in some fitness search space. Your algorithm is trying to do that in less tries than random inspection, by looking at the shape of the space and trying to infer something about it. If you think you're going up a hill, you can take a guess and jump further. But if it turns out to be a small hill in a bumpy landscape, then you've just missed the peak entirely.
Also note that since your fitness function is based on human responses, you are limited to a relatively small number of iterations regardless of your choice of algorithmic approach. For example, you would see the same issue with a genetic algorithm approach (fitness function limits the number of individuals and generations) or a neural network (limited training set).
A final potential limitation is that if your "gene-lengths" are long, there are many nodes, and many transitions are allowed, then the size of the graph will become prohibitive, and the algorithm impractical.
I have need to do some cluster analysis on a set of 2 dimensional data (I may add extra dimensions along the way).
The analysis itself will form part of the data being fed into a visualisation, rather than the inputs into another process (e.g. Radial Basis Function Networks).
To this end, I'd like to find a set of clusters which primarily "looks right", rather than elucidating some hidden patterns.
My intuition is that k-means would be a good starting place for this, but that finding the right number of clusters to run the algorithm with would be problematic.
The problem I'm coming to is this:
How to determine the 'best' value for k such that the clusters formed are stable and visually verifiable?
Questions:
Assuming that this isn't NP-complete, what is the time complexity for finding a good k. (probably reported in number of times to run the k-means algorithm).
is k-means a good starting point for this type of problem? If so, what other approaches would you recommend. A specific example, backed by an anecdote/experience would be maxi-bon.
what short cuts/approximations would you recommend to increase the performance.
For problems with an unknown number of clusters, agglomerative hierarchical clustering is often a better route than k-means.
Agglomerative clustering produces a tree structure, where the closer you are to the trunk, the fewer the number of clusters, so it's easy to scan through all numbers of clusters. The algorithm starts by assigning each point to its own cluster, and then repeatedly groups the two closest centroids. Keeping track of the grouping sequence allows an instant snapshot for any number of possible clusters. Therefore, it's often preferable to use this technique over k-means when you don't know how many groups you'll want.
There are other hierarchical clustering methods (see the paper suggested in Imran's comments). The primary advantage of an agglomerative approach is that there are many implementations out there, ready-made for your use.
In order to use k-means, you should know how many cluster there is. You can't try a naive meta-optimisation, since the more cluster you'll add (up to 1 cluster for each data point), the more it will brought you to over-fitting. You may look for some cluster validation methods and optimize the k hyperparameter with it but from my experience, it rarely work well. It's very costly too.
If I were you, I would do a PCA, eventually on polynomial space (take care of your available time) depending on what you know of your input, and cluster along the most representatives components.
More infos on your data set would be very helpful for a more precise answer.
Here's my approximate solution:
Start with k=2.
For a number of tries:
Run the k-means algorithm to find k clusters.
Find the mean square distance from the origin to the cluster centroids.
Repeat the 2-3, to find a standard deviation of the distances. This is a proxy for the stability of the clusters.
If stability of clusters for k < stability of clusters for k - 1 then return k - 1
Increment k by 1.
The thesis behind this algorithm is that the number of sets of k clusters is small for "good" values of k.
If we can find a local optimum for this stability, or an optimal delta for the stability, then we can find a good set of clusters which cannot be improved by adding more clusters.
In a previous answer, I explained how Self-Organizing Maps (SOM) can be used in visual clustering.
Otherwise, there exist a variation of the K-Means algorithm called X-Means which is able to find the number of clusters by optimizing the Bayesian Information Criterion (BIC), in addition to solving the problem of scalability by using KD-trees.
Weka includes an implementation of X-Means along with many other clustering algorithm, all in an easy to use GUI tool.
Finally you might to refer to this page which discusses the Elbow Method among other techniques for determining the number of clusters in a dataset.
You might look at papers on cluster validation. Here's one that is cited in papers that involve microarray analysis, which involves clustering genes with related expression levels.
One such technique is the Silhouette measure that evaluates how closely a labeled point is to its centroid. The general idea is that, if a point is assigned to one centroid but is still close to others, perhaps it was assigned to the wrong centroid. By counting these events across training sets and looking across various k-means clusterings, one looks for the k such that the labeled points overall fall into the "best" or minimally ambiguous arrangement.
It should be said that clustering is more of a data visualization and exploration technique. It can be difficult to elucidate with certainty that one clustering explains the data correctly, above all others. It's best to merge your clusterings with other relevant information. Is there something functional or otherwise informative about your data, such that you know some clusterings are impossible? This can reduce your solution space considerably.
From your wikipedia link:
Regarding computational complexity,
the k-means clustering problem is:
NP-hard in general Euclidean
space d even for 2 clusters
NP-hard for a general number of
clusters k even in the plane
If k and d are fixed, the problem can be
exactly solved in time O(ndk+1 log n),
where n is the number of entities to
be clustered
Thus, a variety of heuristic
algorithms are generally used.
That said, finding a good value of k is usually a heuristic process (i.e. you try a few and select the best).
I think k-means is a good starting point, it is simple and easy to implement (or copy). Only look further if you have serious performance problems.
If the set of points you want to cluster is exceptionally large a first order optimisation would be to randomly select a small subset, use that set to find your k-means.
Choosing the best K can be seen as a Model Selection problem. One possible approach is Minimum Description Length, which in this context means: You could store a table with all the points (in which case K=N). At the other extreme, you have K=1, and all the points are stored as their distances from a single centroid. This Section from Introduction to Information Retrieval by Manning and Schutze suggest minimising the Akaike Information Criterion as a heuristic for an optimal K.
This problematic belongs to the "internal evaluation" class of "clustering optimisation problems" which curent state of the art solution seems to use the **Silhouette* coeficient* as stated here
https://en.wikipedia.org/wiki/Cluster_analysis#Applications
and here:
https://en.wikipedia.org/wiki/Silhouette_(clustering) :
"silhouette plots and averages may be used to determine the natural number of clusters within a dataset"
scikit-learn provides a sample usage implementation of the methodology here
http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html