Single Pass Seed Selection Algorithm for k-Means

I recently read the Single Pass Seed Selection Algorithm for k-Means paper, but I don't really understand the algorithm, which is:
1. Calculate the distance matrix Dist, in which Dist(i, j) represents the distance from i to j.
2. Find Sumv, in which Sumv(i) is the sum of the distances from the ith point to all other points.
3. Find the point i which is min(Sumv) and set Index = i.
4. Add First to C as the first centroid.
5. For each point xi, set D(xi) to be the distance between xi and the nearest point in C.
6. Find y as the sum of distances of the first n/k nearest points from the Index.
7. Find the unique integer i so that D(x1)^2 + D(x2)^2 + ... + D(xi)^2 >= y > D(x1)^2 + D(x2)^2 + ... + D(x(i-1))^2.
8. Add xi to C.
9. Repeat steps 5-8 until k centers are found.
In particular, for step 6: do we keep using the same Index (the same point) over and over, or do we use the newly added point from C? And about step 8: does i have to be larger than 1?

Honestly, I wouldn't worry about understanding that paper - it's not very good.
The algorithm is poorly described.
It's not actually a single pass; it needs to do n^2/2 pairwise computations plus one additional pass through the data.
They don't report the runtime of their seed selection scheme, probably because doing O(n^2) work makes it very slow.
They are evaluating on very simple data sets that don't have a lot of bad solutions for k-Means to fall into.
One of their metrics of "better"-ness is how many iterations k-means takes to converge given the seed selection. While it is an interesting metric, the small differences they report are meaningless (k-means++ seeding could take more iterations but do less work per iteration), and they don't report the run time or which k-means algorithm they use.
You will get a lot more benefit from learning and understanding the k-means++ algorithm they are comparing against, and reading some of the history from that.
If you really want to understand what they are doing, I would brush up on your MATLAB and read their provided MATLAB code. But it's not really worth it. If you look up the quantile seed selection algorithm, they are essentially doing something very similar. Instead of using the distance to the first seed to sort the points, they appear to be using the sum of pairwise distances (which means they don't need an initial seed, hence the unique solution).
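For comparison, here is a minimal sketch of k-means++ seeding in Python/numpy (the standard D²-weighted sampling the paper compares against; the function name is mine, not from the paper):

import numpy as np

def kmeanspp_seeds(X, k, rng=None):
    # X: (n, d) array of points; returns k seed points chosen by D^2 sampling
    rng = rng or np.random.default_rng()
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                      # first seed: uniform at random
    for _ in range(k - 1):
        C = np.array(centers)
        # squared distance from every point to its nearest chosen center
        d2 = np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1), axis=1)
        centers.append(X[rng.choice(n, p=d2 / d2.sum())])  # sample proportional to D^2
    return np.array(centers)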

The Single Pass Seed Selection algorithm is a novel algorithm. "Single pass" means that the first seed can be selected without any iterations. k-means++ performance depends on the first seed; this is overcome in SPSS. Please go through the paper "Robust Seed Selection Algorithm for k-means" from the same authors.
John J. Louis

Related

Calculate "moving" Covariance

I've been trying to figure out how to efficiently calculate the covariance in a moving window, i.e. moving from a set of values (x[0], y[0])..(x[n-1], y[n-1]) to a new set of values (x[1], y[1])..(x[n], y[n]). In other words, the value (x[0], y[0]) gets replaced by the value (x[n], y[n]). For performance reasons I need to calculate the covariance incrementally, in the sense that I'd like to express the new covariance Cov(x[1]..x[n], y[1]..y[n]) in terms of the previous covariance Cov(x[0]..x[n-1], y[0]..y[n-1]).
Starting off with the naive formula for covariance as described here:
https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Covariance
All I can come up with is:
Cov(x[1]..x[n], y[1]..y[n]) =
Cov(x[0]..x[n-1], y[0]..y[n-1]) +
(x[n]*y[n] - x[0]*y[0]) / n -
AVG(x[1]..x[n]) * AVG(y[1]..y[n]) +
AVG(x[0]..x[n-1]) * AVG(y[0]..y[n-1])
I'm sorry about the notation, I hope it's more or less clear what I'm trying to express.
However, I'm not sure whether this is sufficiently numerically stable. Dealing with large values, I might run into arithmetic overflow or other issues (cancellation, for example).
Is there a better way to do this?
Thanks for any help.
It looks like you are trying some form of "add the new value and subtract the old one". You are correct to worry: this method is not numerically stable. Keeping sums this way is subject to drift, but the real killer is the fact that at each step you are subtracting a large number from another large number to get what is likely a very small number.
One improvement would be to maintain your sums (of x_i, y_i, and x_i*y_i) independently, and recompute the naive formula from them at each step. Your running sums would still drift, and the naive formula is still numerically unstable, but at least you would only have one step of numerical instability.
A stable way to solve this problem would be to implement a formula for (stably) merging statistical sets, and evaluate your overall covariance using a merge tree. Moving your window would update one of your leaves, requiring an update of each node from that leaf to the root. For a window of size n, this method would take O(log n) time per update instead of the O(1) naive computation, but the result would be stable and accurate. Also, if you don't need the statistics for each incremental step, you can update the tree once per each output sample instead of once per input sample. If you have k input samples per output sample, this reduces the cost per input sample to O(1 + (log n)/k).
From the comments: the Wikipedia page you reference includes a section on Knuth's online algorithm, which is relatively stable, though still prone to drift. You should be able to do something comparable for covariance; and resetting your computation every K*n samples should limit the drift at minimal cost.
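For illustration, here is a sketch of the merge step such a tree would use, based on the standard pairwise formula for combining co-moments (each summary holds the count, the two means, and C = sum((x - mean_x) * (y - mean_y)); the function name is mine):

def merge_stats(a, b):
    # a and b are (count, mean_x, mean_y, C) summaries of two disjoint sets,
    # where C is the co-moment sum((x - mean_x) * (y - mean_y)) over that set
    n_a, mx_a, my_a, c_a = a
    n_b, mx_b, my_b, c_b = b
    if n_a == 0:
        return b
    if n_b == 0:
        return a
    n = n_a + n_b
    dx, dy = mx_b - mx_a, my_b - my_a
    mean_x = mx_a + dx * n_b / n
    mean_y = my_a + dy * n_b / n
    c = c_a + c_b + dx * dy * n_a * n_b / n
    return (n, mean_x, mean_y, c)      # population covariance of the merged set is c / n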
Not sure why no one has mentioned this, but you can use the Welford online algorithm, which relies on the running mean.
The equations should look like this: the online mean is given by mean_x[n] = mean_x[n-1] + (x[n] - mean_x[n-1]) / n (and similarly for mean_y), and the co-moment by C[n] = C[n-1] + (x[n] - mean_x[n-1]) * (y[n] - mean_y[n]); the covariance is then C[n] / n (or C[n] / (n-1) for the sample covariance).
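A minimal sketch of that update in Python (add-only, following the standard online formula; stably removing the oldest point of a moving window is the harder part and is not shown here):

def welford_cov_update(state, x, y):
    # state is (count, mean_x, mean_y, C) with C = sum((x - mean_x) * (y - mean_y))
    count, mean_x, mean_y, c = state
    count += 1
    dx = x - mean_x                    # uses the old mean_x
    mean_x += dx / count
    mean_y += (y - mean_y) / count
    c += dx * (y - mean_y)             # (y - mean_y) here uses the updated mean_y
    return (count, mean_x, mean_y, c)

# population covariance after n updates: C / n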

maxmin clustering algorithm

I read a paper that mentions the max-min clustering algorithm, but I don't quite understand what this algorithm does. Googling "max min clustering algorithm" doesn't yield any helpful results. Does anybody know what this algorithm means? This is an excerpt from the paper:
Max-min clustering proceeds by choosing an observation at random as the first centroid c1, and by setting the set C of centroids to {c1}. During the ith iteration, ci is chosen such that it maximizes the minimum Euclidean distance between ci and observations in C. Max-min clustering is preferable to a density-based clustering algorithm (e.g. k-means) which would tend to select many examples from the dense group of non-seizure data points.
I don't quite understand the bolded part.
link to paper is here
We choose each new centroid to be as far as possible from the existing centroids. Here's some Python code.
def maxminclustering(observations, k):
    observations = set(observations)
    if k < 1 or not observations:
        return set()
    centroids = set([observations.pop()])
    for i in range(min(k - 1, len(observations))):
        newcentroid = max(observations,
                          key=lambda observation:
                              min(distance(observation, centroid)
                                  for centroid in centroids))
        observations.remove(newcentroid)
        centroids.add(newcentroid)
    return centroids
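The distance function is left open in the answer; a hypothetical usage with plain Euclidean distance on 2-D tuples might look like this:

import math

def distance(a, b):
    # Euclidean distance between two equal-length tuples
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

points = [(0, 0), (1, 0), (0, 1), (10, 10), (10, 11), (-5, 4)]
print(maxminclustering(points, k=3))    # three well-spread points (the first one depends on set ordering)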
This sounds a lot like the farthest-points heuristic for seeding k-means, but then not performing any k-means iterations at all.
This is a surprisingly simple, but quite effective strategy. Basically it will find a number of data points that are well spread out, which can make k-means converge fast. Usually, one would discard the first (random) data point.
It only works well for low values of k though (it avoids placing centroids in the center of the data set!), and it is not very favorable to multiple runs - it tends to choose the same initial centroids again.
K-means++ can be seen as a more randomized version of this. Instead of always choosing the farthest object, it chooses far objects with increased likelihood, but may at random also choose a near neighbor. This way, you get more diverse results when running it multiple times.
You can try it out in ELKI, it is named FarthestPointsInitialMeans. If you choose the algorithm SingleAssignmentKMeans, then it will not perform k-means iterations, but only do the initial assignment. That will probably give you this "MaxMin clustering" algorithm.

Find correlation in large dataset

I have a huge dataset. We are talking about 100 3D matrices with 121x145x121 cells. Each cell has a value between 0 and 1, and I need a way to cluster these cells according to their correlation. The problem is that the dataset is too big for any algorithm I know; even using just half of it (each matrix is an MRI scan of a brain) we have around 400 billion pairs. Any ideas?
As a first step I would be tempted to try K-means clustering.
This appears in the Matlab statistics toolbox as the function kmeans.
In this algorithm you only end up computing the distances between the K current centres and the data, so the number of pairs is much smaller than comparing all choices.
In Matlab, I've also found that the speed of the operation can be quite dependent on the organisation of your matrix (due to memory caching and optimisation issues). I would recommend transforming your 3d matrices so that the columns (held together in memory) correspond to the 100 values for a particular cell.
This can be done with the permute function.
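The same data-layout idea expressed in Python/numpy terms for comparison (a sketch with a small random stand-in array; the real volumes would be stacked into shape (100, 121, 145, 121), and scikit-learn's KMeans is used only as an example):

import numpy as np
from sklearn.cluster import KMeans

# stand-in for the 100 stacked MRI volumes; kept small so the sketch runs quickly
scans = np.random.rand(100, 12, 14, 12)

# one row per cell, one column per volume: shape (n_cells, 100)
features = scans.reshape(scans.shape[0], -1).T

labels = KMeans(n_clusters=10, n_init=10).fit_predict(features)
print(labels.shape)                     # one cluster label per cell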
Try a weighted K-means++ clustering algorithm. Create one matrix of the sum of the values of all 100 input matrices at every point, to produce one "grey scale" matrix, then adjust the K-means++ algorithm to work with weighted (wt) values.
In the initialization phase, choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)^2 * wt^2.
The assignment step should be okay, but when computing the centroids in the update step adjust the formula to account for the weights. (Or use the same formula but each point is used wt times).
You may not be able to use a library function to do this, but you start with a 100-fold decrease in the number of points and matrices to work with.
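A rough sketch of the weighted initialization described above (the function name and the weight-proportional choice of the first seed are my own reading; wt would come from the summed grey-scale matrix):

import numpy as np

def weighted_kmeanspp_seeds(X, wt, k, rng=None):
    # X: (n, d) array of points, wt: (n,) non-negative weights
    rng = rng or np.random.default_rng()
    n = X.shape[0]
    centers = [X[rng.choice(n, p=wt / wt.sum())]]        # first seed: weight-proportional
    for _ in range(k - 1):
        C = np.array(centers)
        d2 = np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1), axis=1)
        w = d2 * wt ** 2                                  # D(x)^2 * wt^2, as in the answer
        centers.append(X[rng.choice(n, p=w / w.sum())])
    return np.array(centers)

In the update step, the weighted centroid of a cluster would then be sum(wt_i * x_i) / sum(wt_i) over the points assigned to it, which matches the "each point is used wt times" formulation.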

Algorithm for nearest point

I've got a list of ~5000 points (specified as longitude/latitude pairs), and I want to find the nearest 5 of these to another point, specified by the user.
Can anyone suggest an efficient algorithm for working this out? I'm implementing this in Ruby, so if there's a suitable library then that would be good to know, but I'm still interested in the algorithm!
UPDATE: A couple of people have asked for more specific details on the problem. So here goes:
The 5000 points are mostly within the same city. There might be a few outside it, but it's safe to assume that 99% of them lie within a 75km radius, and that all of them lie within a 200km radius.
The list of points changes rarely. For the sake of argument, let's say it gets updated once per day, and we have to deal with a few thousand requests in that time.
You could accelerate the search by partitioning the 2D space with a quad-tree or a kd-tree; once you've reached a leaf node you compare the remaining distances one by one until you find the closest match.
See also this blog post which refers to this other blog post which both discuss nearest neighbors searches with kd-trees in Ruby.
You can get a very fast upper-bound estimator on distance using the Manhattan distance (scaled for latitude); this should be good enough for rejecting 99.9% of candidates if they're not close. (EDIT: you've since told us they are close. In that case, your metric should be distance-squared, as per Lars H's comment.)
Consider this equivalent to rejecting anything outside a spherical-rectangle bounding-box (as an approximation to a circle bounding-box).
I don't do Ruby, so here is the algorithm in pseudocode:
Let the latitude and longitude of your reference point P be (pa, po), and of another point X be (xa, xo).
Precompute ka, the latitude scaling factor for longitudinal distances: ka = cos(pa in degrees). (Strictly, treating ka as constant is a linearized approximation in the vicinity of P.)
Then the distance estimator is: D(X,P) = |xa-pa| + ka*|xo-po| = da + ka*do
where |z| means abs(z). At worst this overestimates the true (scaled-Euclidean) distance by a factor of √2 (when da == ka*do), hence we allow for that as follows:
Do a running search and keep Dmin, the fifth-smallest scaled-Manhattan-distance-estimate.
Hence you can reject upfront all points for which D(X,P) > √2 * Dmin (their true distance √(da² + (ka*do)²) is at least D/√2 > Dmin); that should eliminate 99.9% of points.
Keep a list of all remaining candidate points with D(X,P) <= √2 * Dmin. Update Dmin if you found a new fifth-smallest D. Priority-queue, or else a list of (coord,D) are good data structures.
Note that we never computed Euclidean distance, we only used float multiplication and addition.
(Consider this similar to quadtree except filtering out everything except the region that interests us, hence no need to compute accurate distances upfront or build the data structure.)
It would help if you told us the expected spread in latitudes and longitudes (degrees, minutes, or what?). If all the points are close, the √2 factor in this estimator will be too conservative and mark every point as a candidate; a lookup-table-based distance estimator would then be preferable.
Pseudocode:

    initialize Dmin with the fifth-smallest D from the first five points in the list
    for point X in list:
        if D(X,P) <= √2 * Dmin:
            insert the tuple (X, D) in the priority-queue of candidates
            if (Dmin > D): Dmin = D
    # after the first pass, reject candidates with D > √2 * Dmin (use the final value of Dmin)
    # ...
    # then do a second pass on the candidates to find the lowest 5 exact distances
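A compact Python version of the same idea (illustrative only: it filters in two simple passes rather than keeping a running Dmin, and heapq.nsmallest stands in for the priority queue):

import heapq
import math

def nearest5(points, p, k=5):
    # points: list of (lat, lon) in degrees; p: reference (lat, lon)
    pa, po = p
    ka = math.cos(math.radians(pa))                  # latitude scaling factor for longitude

    def d_est(x):                                    # scaled Manhattan estimate (upper bound)
        xa, xo = x
        return abs(xa - pa) + ka * abs(xo - po)

    def d_exact(x):                                  # scaled Euclidean distance
        xa, xo = x
        return math.hypot(xa - pa, ka * (xo - po))

    # pass 1: the fifth-smallest estimate gives the rejection threshold
    dmin = heapq.nsmallest(k, (d_est(x) for x in points))[-1]
    candidates = [x for x in points if d_est(x) <= math.sqrt(2) * dmin]

    # pass 2: exact distances on the few surviving candidates
    return heapq.nsmallest(k, candidates, key=d_exact)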
Since your list is quite short, I'd highly recommend brute force. Just compare all 5000 to the user-specified point. It'll be O(n) and you'll get paid.
Other than that, a quad-tree or kd-tree are the usual approaches to spatial subdivision. But in your case, you'll end up doing a linear number of insertions into the tree, and then a constant number of logarithmic lookups... a bit of a waste, when you're probably better off just doing a linear number of distance comparisons and being done with it.
Now, if you want to find the N nearest points, you're looking at sorting on the computed distances and taking the first N, but that's still O(n log n)ish.
EDIT: It's worth noting that building the spatial tree becomes worthwhile if you're going to reuse the list of points for multiple queries.
Rather than pure brute force, for 5000 nodes I would calculate the individual x+y (Manhattan) distance for every node instead of the straight-line distance.
Once you've sorted that list, if e.g. x+y for the 5th node is 38, you can rule out any node where either x or y distance is > 38. This way, you can rule out a lot of nodes without having to calculate the straight line distance. Then brute force calculate the straight line distance for the remaining nodes.
These algorithms are not easily explained, thus I will only give you some hints in the right direction. You should look for Voronoi Diagrams. With a Voronoi Diagram you can easily precompute a graph in O(n^2 log n) time and search the closest point in O(log n) time.
Precomputation is done with a cron job at night and searching is live. This corresponds to your specification.
Now you could save the k closest neighbours of each of your 5000 points, then start from the nearest point given by the Voronoi diagram and search among its stored neighbours for the remaining 4 points.
But be warned that these algorithms are not very easy to implement.
A good reference is:
de Berg et al.: Computational Geometry: Algorithms and Applications (2008), sections 7.1 and 7.2
Since you have that few points, I would recommend doing a brute-force search, to the effect of trying all points against each other, which is an O(n^2) operation; with n = 5000 that is roughly 12.5 million iterations of a suitable algorithm, and you just store the relevant results. This would have sub-100 ms execution time in C, so we are looking at a second or two at most in Ruby.
When the user picks a point, you can use your stored data to give the results in constant time.
EDIT: I re-read your question, and it seems the user provides their own point. In that case it's faster to just do an O(n) linear search through your set each time the user provides a point.
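For reference, the brute-force lookup really is only a few lines. A Python sketch (the asker would translate it to Ruby) using the haversine formula for the great-circle distance:

import heapq
import math

def haversine_km(a, b):
    # great-circle distance in km between two (lat, lon) pairs given in degrees
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * math.asin(math.sqrt(h))

def nearest_k(points, query, k=5):
    # O(n) scan; plenty fast for ~5000 points per request
    return heapq.nsmallest(k, points, key=lambda p: haversine_km(p, query))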
If you need to repeat this multiple times, with different user-entered locations, but don't want to implement a quad-tree (or can't find a library implementation), then you can use a locality-sensitive-hashing (kind of) approach that's fairly intuitive:
- take your (x, y) pairs and create two lists, one of (x, i) and one of (y, i), where i is the index of the point
- sort both lists
- then, when given a point (X, Y):
  - binary-search for X and Y in the sorted lists
  - expand outwards on both lists, looking for common indices
  - for common indices, calculate exact distances
  - stop expanding when the differences in X and Y exceed the exact distance of the most-distant of the current 5 points
all you're doing is saying that a nearby point must have a similar x and a similar y value...
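A rough Python sketch of that idea, assuming planar (x, y) points; the names and the simplified stopping rule (stop once both expansion windows are wider than the current fifth-best exact distance) are mine:

import bisect
import heapq
import math

def nearest5_sorted(points, query, k=5):
    X, Y = query
    xs = sorted((p[0], i) for i, p in enumerate(points))    # (x, index)
    ys = sorted((p[1], i) for i, p in enumerate(points))    # (y, index)

    def outward(lst, q):
        # yield (|coord - q|, index) in increasing order of |coord - q|
        pos = bisect.bisect_left(lst, (q, -1))
        lo, hi = pos - 1, pos
        while lo >= 0 or hi < len(lst):
            left = q - lst[lo][0] if lo >= 0 else math.inf
            right = lst[hi][0] - q if hi < len(lst) else math.inf
            if left <= right:
                yield left, lst[lo][1]
                lo -= 1
            else:
                yield right, lst[hi][1]
                hi += 1

    gens = [outward(xs, X), outward(ys, Y)]
    deltas = [0.0, 0.0]       # current half-width of each expansion window
    seen = [set(), set()]     # indices reached by each expansion
    best = []                 # max-heap (negated distances) of the k best exact distances

    while True:
        side = 0 if deltas[0] <= deltas[1] else 1            # expand the narrower window
        step = next(gens[side], None)
        if step is None:                                     # this list is exhausted
            deltas[side] = math.inf
            if deltas[1 - side] == math.inf:
                break
            continue
        deltas[side], idx = step
        seen[side].add(idx)
        if idx in seen[1 - side]:                            # common index: compute exact distance
            d = math.hypot(points[idx][0] - X, points[idx][1] - Y)
            heapq.heappush(best, (-d, idx))
            if len(best) > k:
                heapq.heappop(best)
        # stop once both windows are wider than the k-th best exact distance
        if len(best) == k and min(deltas) > -best[0][0]:
            break

    return [points[i] for _, i in sorted(best, reverse=True)]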

Fast way to compute the minimal distance of two sets of k-dimensional vectors

I have two sets of k-dimensional vectors, where k is around 500 and the number of vectors is usually smaller. I want to compute the (arbitrarily defined) minimal distance between the two sets.
A naive approach would be this:
(loop for a in set1
      minimizing (loop for b in set2
                       minimizing (distance a b)))
However, this requires O(n² * distance) computations. Is there a faster way of doing this?
I don't think you can do better than O(n^2) when the distance is arbitrary (you have to examine each of the possible distances!). For a given distance function we might be able to exploit the properties of the function, but there won't be any general algorithm which works with any distance function in better than O(n^2) (i.e. in o(n^2); note the little-o).
If your data is dynamic and you have to keep obtaining the closest pair of points at different times, for arbitrary distance function the following papers by Eppstein will probably help (which have special update operations in order to make finding the closest pair of points quick):
http://www.ics.uci.edu/~eppstein/projects/pairs/Papers/Epp-SODA-98.pdf. [O(nlog^2(n)) update time]
http://academic.research.microsoft.com/Paper/1847461.aspx
You should be able to adapt the above one-set algorithms to the two-set problem (for instance, by defining the distance between points of the same set to be infinity).
For Euclidean-type (L^p) distances, there are known O(n log n) time algorithms which work with a given set of points (i.e. you don't need any special update algorithms):
http://www.cse.iitd.ernet.in/~ssen/cs852/scribe/scribe2/lec.pdf
http://en.wikipedia.org/wiki/Closest_pair_of_points_problem
Of course, the L^p is for one set, but you might be able to adapt it for two sets.
If you give your distance function, it might be easier for us to help you.
Hope it helps. Good luck!
If the components of your vectors are scalars, I would guess that for your case of a moderate k = 500 the O(n²) approach is probably as fast as you can get. You can simplify your calculation by minimizing distance². Also, distance(A_i, B_j) = distance(B_j, A_i), so make sure you only compare each pair once (you only have 500*499/2 pairs, not 500²).
If the components are m-dimensional vectors A and B instead, you could store the components of vector A in an R-tree or a kd-tree and then find the closest pair by iterating over all components of vector B and finding its closest partner from A; this would be O(n). Don't forget that big-O is for n -> infinity, so the trees might come with a pretty expensive constant term (i.e. this approach might only make sense for large k or if vector A is always the same).
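For an L^p distance, the tree-based idea above looks roughly like this with scipy (a sketch; note that kd-trees degrade badly in very high dimensions, so for k around 500 the brute-force pairwise computation may well be faster in practice):

import numpy as np
from scipy.spatial import cKDTree

def min_distance(set1, set2):
    # set1: (n1, k) array, set2: (n2, k) array
    tree = cKDTree(set1)
    dists, _ = tree.query(set2)      # nearest point of set1 for each point of set2
    return dists.min()

a = np.random.rand(100, 500)
b = np.random.rand(80, 500)
print(min_distance(a, b))

For comparison, the brute-force version is just cdist(set1, set2).min() using scipy.spatial.distance.cdist.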
Put the two sets of coordinates into a Spatial Index, e.g. a KD-tree.
You then compute the intersection of these two indices.
