Using a smoother with the L Method to determine the number of K-Means clusters - algorithm

Has anyone tried to apply a smoother to the evaluation metric before applying the L-method to determine the number of k-means clusters in a dataset? If so, did it improve the results? Or allow a lower number of k-means trials and hence much greater increase in speed? Which smoothing algorithm/method did you use?
The "L-Method" is detailed in:
Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms, Salvador & Chan
This calculates the evaluation metric for a range of different trial cluster counts. Then, to find the knee (which occurs for an optimum number of clusters), two lines are fitted using linear regression. A simple iterative process is applied to improve the knee fit - this uses the existing evaluation metric calculations and does not require any re-runs of the k-means.
For the evaluation metric, I am using a reciprocal of a simplified version of the Dunns Index. Simplified for speed (basically my diameter and inter-cluster calculations are simplified). The reciprocal is so that the index works in the correct direction (ie. lower is generally better).
K-means is a stochastic algorithm, so typically it is run multiple times and the best fit chosen. This works pretty well, but when you are doing this for 1..N clusters the time quickly adds up. So it is in my interest to keep the number of runs in check. Overall processing time may determine whether my implementation is practical or not - I may ditch this functionality if I cannot speed it up.

I had asked a similar question in the past here on SO. My question was about coming up with a consistent way of finding the knee to the L-shape you described. The curves in question represented the trade-off between complexity and a fit measure of the model.
The best solution was to find the point with the maximum distance d according to the figure shown:
Note: I haven't read the paper you linked to yet..

Related

Expectation vs. direct numerical optimization of likelihood function for estimating high-dimensional Markov-Switching /HMM model

I am currently estimating a Markov-switching model with many parameters using direct optimization of the log likelihood function (through the forward-backward algorithm). I do the numerical optimization using matlab's genetic algorithm, since other approaches such as the (mostly gradient or simplex-based) algorithms in fmincon and fminsearchbnd were not very useful, given that likelihood function is not only of very high dimension but also shows many local maxima and is highly nonlinear.
The genetic algorithm seems to work very well. However, I am planning to further increase the dimension of the problem. I have read about an EM algorithm to estimate Markov-switching models. From what I understand this algorithm releases a sequence of increasing log-likelhood values. It thus seems suitable to estimate models with very many parameters.
My question is if the EM algorithm is suitable for my application involving many parameters (perhaps better suitable as the genetic algorithm). Speed is not the main limitation (the genetic algorithm is altready extremely slow) but I would need to have some certainty to end up close to the global optimum and not run into one of the many local optima. Do you have any experience or suggestions regarding this?
The EM algorithm finds local optima, and does not guarantee that they are global optima. In fact, if you start it off with a HMM where one of the transition probabilities is zero, that probability will typically never change from zero, because those transitions will appear only with expectation zero in the expectation step, so those starting points have no hope of finding a global optimum which does not have that transition probability zero.
The standard workaround for this is to start it off from a variety of different random parameter settings, pick the highest local optima found, and hope for the best. You might be slightly reassured if a significant proportion of the runs converged to the same (or to equivalent) best local optimum found, on the not very reliable theory that anything better would be found from at least the same fraction of random starts, and so would have showed up by now.
I haven't worked it out in detail, but the EM algorithm solves such a general set of problems that I expect that if it guaranteed to find the global optimum then it would be capable of finding the solution to NP-complete problems with unprecedented efficiency.

Distance metric for algorithms

I am currently working on a project in which I need to quantify the (dis)similarity between algorithms - that is, I have a few tens of algorithms that are used for the same purpose and I would like to quantify which ones are closest (i.e., more similar) to others, and which are truly 'novel'.
Both my Google-Fu and my SO-Jutsu have failed me, so I would appreciate if anyone could shed a light on this. Does such a metric even exist?
As one measure of similarity, you could create n datasets, somewhat intelligently constructed, and then run each of your algorithms on all of these datasets. You then get an n-dimensional vector of runtimes associated with each algorithm, which you can then slap any old distance on. I'd imagine something like cosine distance would be a good first guess, since if your datasets are of various sizes you would sort of be classifying your algorithms by the way that they scale. In addition to runtimes, you could monitor maximum memory usage or whatever else you can think of measuring.

Which clustering algorithm is the best when number of cluster is unknown but there is no noise?

I have a dataset with unknown number of clusters and I aim to cluster them. Since, I don't know the number of clusters in advance, I tried to use density-based algorithms especially DBSCAN. The problem that I have with DBSCAN is that how to detect appropriate epsilon. The method suggested in the DBSCAN paper assume there are some noises and when we plot sorted k-dist graph we can detect valley and define the threshold for epsilon. But, my dataset obtained from a controlled environment and there are no noise.
Does anybody have an idea of how to detect epsilon? Or, suggest better clustering algorithm could fit this problem.
In general, there is no unsupervised epsilon detection. From what little you describe, DBSCAN is a very appropriate approach.
Real-world data tend to have a gentle gradient of distances; deciding what distance should be the cut-off is a judgement call requiring knowledge of the paradigm and end-use. In short, the problem requires knowledge not contained in the raw data.
I suggest that you use a simple stepping approach to converge on the solution you want. Set epsilon to some simple value that your observation suggests will be appropriate. If you get too much fragmentation, increase epsilon by a factor of 3; if the clusters are too large, decrease by a factor of 3. Repeat your runs until you get the desired results.

What algorithms have high time complexity, to help "burn" more CPU cycles?

I am trying to write a demo for an embedded processor, which is a multicore architecture and is very fast in floating point calculations. The problem is that the current hardware I have is the processor connected through an evaluation board where the DRAM to chip rate is somewhat limited, and the board to PC rate is very slow and inefficient.
Thus, when demonstrating big matrix multiplication, I can do, say, 128x128 matrices in a couple of milliseconds, but the I/O takes (lots of) seconds kills the demo.
So, I am looking for some kind of a calculation with higher complexity than n^3, the more the better (but preferably easy to program and to explain/understand) to make the computation part more dominant in the time budget, where the dataset is preferably bound to about 16KB per thread (core).
Any suggestion?
PS: I think it is very similar to this question in its essence.
You could generate large (256-bit) numbers and factor them; that's commonly used in "stress-test" tools. If you specifically want to exercise floating point computation, you can build a basic n-body simulator with a Runge-Kutta integrator and run that.
What you can do is
Declare a std::vector of int
populate it with N-1 to 0
Now keep using std::next_permutation repeatedly until they are sorted again i..e..next_permutation returns false.
With N integers this will need O(N !) calculations and also deterministic
PageRank may be a good fit. Articulated as a linear algebra problem, one repeatedly squares a certain floating-point matrix of controllable size until convergence. In the graphical metaphor, one "ripples" change coming into each node onto the other edges. Both treatments can be made parallel.
You could do a least trimmed squares fit. One use of this is to identify outliers in a data set. For example you could generate samples from some smooth function (a polynomial say) and add (large) noise to some of the samples, and then the problem is to find a subset H of the samples of a given size that minimises the sum of the squares of the residuals (for the polynomial fitted to the samples in H). Since there are a large number of such subsets, you have a lot of fits to do! There are approximate algorithms for this, for example here.
Well one way to go would be to implement brute-force solver for the Traveling Salesman problem in some M-space (with M > 1).
The brute-force solution is to just try every possible permutation and then calculate the total distance for each permutation, without any optimizations (including no dynamic programming tricks like memoization).
For N points, there are (N!) permutations (with a redundancy factor of at least (N-1), but remember, no optimizations). Each pair of points requires (M) subtractions, (M) multiplications and one square root operation to determine their pythagorean distance apart. Each permutation has (N-1) pairs of points to calculate and add to the total distance.
So order of computation is O(M((N+1)!)), whereas storage space is only O(N).
Also, this should not be either too hard, nor too intensive to parallelize across the cores, though it does take some overhead. (I can demonstrate, if needed).
Another idea might be to compute a fractal map. Basically, choose a grid of whatever dimensionality you want. Then, for each grid point, do the fractal iteration to get the value. Some points might require only a few iterations; I believe some will iterate forever (chaos; of course, this can't really happen when you have a finite number of floating-point numbers, but still). The ones that don't stop you'll have to "cut off" after a certain number of iterations... just make this preposterously high, and you should be able to demonstrate a high-quality fractal map.
Another benefit of this is that grid cells are processed completely independently, so you will never need to do communication (not even at boundaries, as in stencil computations, and definitely not O(pairwise) as in direct N-body simulations). You can usefully use O(gridcells) number of processors to parallelize this, although in practice you can probably get better utilization by using gridcells/factor processors and dynamically scheduling grid points to processors on an as-ready basis. The computation is basically all floating-point math.
Mandelbrot/Julia and Lyupanov come to mind as potential candidates, but any should do.

How to efficiently find k-nearest neighbours in high-dimensional data?

So I have about 16,000 75-dimensional data points, and for each point I want to find its k nearest neighbours (using euclidean distance, currently k=2 if this makes it easiser)
My first thought was to use a kd-tree for this, but as it turns out they become rather inefficient as the number of dimension grows. In my sample implementation, its only slightly faster than exhaustive search.
My next idea would be using PCA (Principal Component Analysis) to reduce the number of dimensions, but I was wondering: Is there some clever algorithm or data structure to solve this exactly in reasonable time?
The Wikipedia article for kd-trees has a link to the ANN library:
ANN is a library written in C++, which
supports data structures and
algorithms for both exact and
approximate nearest neighbor searching
in arbitrarily high dimensions.
Based on our own experience, ANN
performs quite efficiently for point
sets ranging in size from thousands to
hundreds of thousands, and in
dimensions as high as 20. (For applications in significantly higher
dimensions, the results are rather
spotty, but you might try it anyway.)
As far as algorithm/data structures are concerned:
The library implements a number of
different data structures, based on
kd-trees and box-decomposition trees,
and employs a couple of different
search strategies.
I'd try it first directly and if that doesn't produce satisfactory results I'd use it with the data set after applying PCA/ICA (since it's quite unlikely your going to end up with few enough dimensions for a kd-tree to handle).
use a kd-tree
Unfortunately, in high dimensions this data structure suffers severely from the curse of dimensionality, which causes its search time to be comparable to the brute force search.
reduce the number of dimensions
Dimensionality reduction is a good approach, which offers a fair trade-off between accuracy and speed. You lose some information when you reduce your dimensions, but gain some speed.
By accuracy I mean finding the exact Nearest Neighbor (NN).
Principal Component Analysis(PCA) is a good idea when you want to reduce the dimensional space your data live on.
Is there some clever algorithm or data structure to solve this exactly in reasonable time?
Approximate nearest neighbor search (ANNS), where you are satisfied with finding a point that might not be the exact Nearest Neighbor, but rather a good approximation of it (that is the 4th for example NN to your query, while you are looking for the 1st NN).
That approach cost you accuracy, but increases performance significantly. Moreover, the probability of finding a good NN (close enough to the query) is relatively high.
You could read more about ANNS in the introduction our kd-GeRaF paper.
A good idea is to combine ANNS with dimensionality reduction.
Locality Sensitive Hashing (LSH) is a modern approach to solve the Nearest Neighbor problem in high dimensions. The key idea is that points that lie close to each other are hashed to the same bucket. So when a query arrives, it will be hashed to a bucket, where that bucket (and usually its neighboring ones) contain good NN candidates).
FALCONN is a good C++ implementation, which focuses in cosine similarity. Another good implementation is our DOLPHINN, which is a more general library.
You could conceivably use Morton Codes, but with 75 dimensions they're going to be huge. And if all you have is 16,000 data points, exhaustive search shouldn't take too long.
No reason to believe this is NP-complete. You're not really optimizing anything and I'd have a hard time figure out how to convert this to another NP-complete problem (I have Garey and Johnson on my shelf and can't find anything similar). Really, I'd just pursue more efficient methods of searching and sorting. If you have n observations, you have to calculate n x n distances right up front. Then for every observation, you need to pick out the top k nearest neighbors. That's n squared for the distance calculation, n log (n) for the sort, but you have to do the sort n times (different for EVERY value of n). Messy, but still polynomial time to get your answers.
BK-Tree isn't such a bad thought. Take a look at Nick's Blog on Levenshtein Automata. While his focus is strings it should give you a spring board for other approaches. The other thing I can think of are R-Trees, however I don't know if they've been generalized for large dimensions. I can't say more than that since I neither have used them directly nor implemented them myself.
One very common implementation would be to sort the Nearest Neighbours array that you have computed for each data point.
As sorting the entire array can be very expensive, you can use methods like indirect sorting, example Numpy.argpartition in Python Numpy library to sort only the closest K values you are interested in. No need to sort the entire array.
#Grembo's answer above should be reduced significantly. as you only need K nearest Values. and there is no need to sort the entire distances from each point.
If you just need K neighbours this method will work very well reducing your computational cost, and time complexity.
if you need sorted K neighbours, sort the output again
see
Documentation for argpartition

Resources