How to adjust K-means clustering? - algorithm

I have tried to apply k-means clustering to a problem formulated similarly to the one on Wikipedia:
minimize the within-cluster sum of squares (WCSS),
but in my formulation the within-cluster sum of absolute differences has to be minimized.
A set of integers X and a number of clusters k are given. I need to choose k integer cluster centers mu such that the within-cluster sum of absolute differences is minimized.
I was doing it iteratively, picking the initial mu values at random and then moving each one to the mean of the elements assigned to its cluster.
However, this approach gives the correct answer only for simple test cases.

What do you mean by "correct answer"? K-means depends strongly on the initial conditions (the randomly selected initial centers) and on the distribution of the data. It is not guaranteed to return the same centers on every run for a given distribution.
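One more thing that may matter here: the mean minimizes the sum of squared differences, whereas the sum of absolute differences is minimized by a median. Below is a minimal sketch of the same iteration with a median update (often called k-medians). The function name k_medians_1d and its parameters are made up for illustration, and like k-means it can still stop in a local optimum depending on the starting centers.

```python
import random
from statistics import median_low

def k_medians_1d(xs, k, iters=100, seed=0):
    """Lloyd-style iteration for integers, using the median as the update.

    Hypothetical helper: assumes xs holds at least k distinct integers.
    """
    rng = random.Random(seed)
    centers = rng.sample(sorted(set(xs)), k)  # k distinct starting centers
    for _ in range(iters):
        # Assignment step: each x goes to its nearest center.
        clusters = [[] for _ in range(k)]
        for x in xs:
            j = min(range(k), key=lambda i: abs(x - centers[i]))
            clusters[j].append(x)
        # Update step: median_low keeps the center an integer from the data.
        new_centers = [median_low(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    cost = sum(min(abs(x - c) for c in centers) for x in xs)
    return centers, cost
```

Running it from several seeds and keeping the lowest cost is a cheap way to reduce the dependence on initialization.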

Related

Weighted sampling without replacement and negative weights

I have an unusual sampling problem that I'm trying to implement for a Monte Carlo technique. I am aware there are related questions and answers regarding the fully-positive problem.
I have a list of n weights w_1,...,w_n and I need to choose k elements, labelled s_1,...,s_k say. The probability distribution that I want to sample from is
p(s_1,...,s_k) = |w_s_1 + ... + w_s_k| / P_total
where P_total is a normalization factor (the sum of all possible p(s,...) without P_total). I don't really care about how the elements are ordered for my purpose.
Note the absolute value signs above: some of the w_i may be less than zero. With purely non-negative w_i, sampling from this distribution is relatively straightforward using sampling without replacement, a tree method being the most efficient as far as I can tell. With some negative weights, though, I feel like I must resort to explicitly writing out each possibility and sampling from this exponentially large set. Any suggestions or insights would be appreciated!
Rejection sampling is worth a try. Compute an upper bound on the weight of any sample: the larger of |sum of the k smallest weights| and |sum of the k largest weights|. Then repeatedly generate a uniformly random k-subset and accept it with probability equal to its weight, |w_s_1 + ... + w_s_k|, divided by that bound, until a sample is accepted.
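A minimal sketch of that rejection sampler, in case it helps. The function name sample_subset and the rng parameter are assumptions for illustration. How many attempts it needs depends on how small typical subset weights are compared with the bound.

```python
import random

def sample_subset(weights, k, rng=random):
    """Draw a k-subset of indices with probability proportional to
    the absolute value of the sum of its weights (rejection sampling)."""
    n = len(weights)
    ordered = sorted(weights)
    # The largest |sum| over all k-subsets is attained by either the
    # k smallest or the k largest weights.
    bound = max(abs(sum(ordered[:k])), abs(sum(ordered[-k:])))
    if bound == 0:
        raise ValueError("every k-subset has weight zero")
    while True:
        subset = rng.sample(range(n), k)            # uniform proposal
        w = abs(sum(weights[i] for i in subset))
        if rng.random() * bound < w:                # accept with prob w / bound
            return sorted(subset)
```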

maxmin clustering algorithm

I read a paper that mentions a max-min clustering algorithm, but I don't quite understand what this algorithm does. Googling "max min clustering algorithm" doesn't yield any helpful results. Does anybody know what this algorithm means? This is an excerpt from the paper:
Max-min clustering proceeds by choosing an observation at random as the first centroid c1, and by setting the set C of centroids to {c1}. During the ith iteration, ci is chosen such that it maximizes the minimum Euclidean distance between ci and observations in C. Max-min clustering is preferable to a density-based clustering algorithm (e.g. k-means) which would tend to select many examples from the dense group of non-seizure data points.
I don't quite understand the bolded part.
link to paper is here
We choose each new centroid to be as far as possible from the existing centroids. Here's some Python code.
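import math

def distance(a, b):
    # Euclidean distance between two points given as coordinate tuples
    return math.dist(a, b)

def maxminclustering(observations, k):
    # observations must be hashable, e.g. tuples of coordinates
    observations = set(observations)
    if k < 1 or not observations:
        return set()
    # an arbitrary observation serves as the first centroid
    centroids = {observations.pop()}
    for _ in range(min(k - 1, len(observations))):
        # pick the observation whose nearest centroid is farthest away
        newcentroid = max(observations,
                          key=lambda observation:
                              min(distance(observation, centroid)
                                  for centroid in centroids))
        observations.remove(newcentroid)
        centroids.add(newcentroid)
    return centroids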
This sounds a lot like the farthest-points heuristic for seeding k-means, but then not performing any k-means iterations at all.
This is a surprisingly simple, but quite effective strategy. Basically it will find a number of data points that are well spread out, which can make k-means converge fast. Usually, one would discard the first (random) data point.
It only works well for low values of k though (it avoids placing centroids in the center of the data set!), and it is not very favorable to multiple runs - it tends to choose the same initial centroids again.
K-means++ can be seen as a more randomized version of this. Instead of always choosing the farthest object, it chooses far objects with increased likelihood, but may at random also choose a near neighbor. This way, you get more diverse results when running it multiple times.
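To make the difference concrete, here is a rough sketch of k-means++ style seeding, where each new seed is drawn with probability proportional to its squared distance to the nearest seed chosen so far. The function name kmeanspp_seeds is made up, and the sketch assumes the points are coordinate tuples with at least k distinct values.

```python
import math
import random

def kmeanspp_seeds(points, k, rng=random):
    """Pick k seeds; far points are likely, but not certain, to be chosen."""
    centers = [rng.choice(points)]  # first seed: uniformly at random
    while len(centers) < k:
        # Squared distance from every point to its nearest current seed.
        d2 = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        # Already chosen points have weight 0, so they are not drawn again.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers
```

Replacing the weighted draw with a plain max over d2 turns this back into the deterministic farthest-point choice described above.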
You can try it out in ELKI, it is named FarthestPointsInitialMeans. If you choose the algorithm SingleAssignmentKMeans, then it will not perform k-means iterations, but only do the initial assignment. That will probably give you this "MaxMin clustering" algorithm.

Divide 2D array into continuous regions of as-equal-as-possible sums?

I have a 2D array of floating-point numbers, and I'd like to divide this array into an arbitrary number of regions such that the sums of the regions' elements are more or less equal. The regions must be continuous. By as-equal-as-possible, I mean that the standard deviation of the region sums should be reduced as much as possible.
I'm doing this because I have a map of values corresponding to the "population" in an area, and I want to divide this area into groups of relatively equal population.
Thanks!
I would do it like this:
1. Compute the whole sum.
2. Compute local centers of mass (coordinates).
3. Now compute the region sum, for example:
   region sum = whole sum / number of centers of mass
4. For each center of mass:
   - start a region
   - incrementally increase its size until its sum matches the region sum
   - avoid intersection of regions (use some map of usage for that)
   - if the region has the desired sum or has nowhere to grow, stop
You will have to tweak this algorithm a little to suit your needs and input data.
Hope it helps a little ...
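A rough sketch of the growing step, under some assumptions: the seed cells are passed in directly (finding the local centers of mass is left out), regions grow one 4-connected cell at a time in round-robin order, and the name grow_regions is made up. Cells can end up unassigned if every neighbouring region stops early, so treat it as a starting point only.

```python
def grow_regions(grid, seeds):
    """Grow one region per seed until it reaches total / len(seeds)."""
    rows, cols = len(grid), len(grid[0])
    target = sum(sum(row) for row in grid) / len(seeds)
    owner = [[-1] * cols for _ in range(rows)]  # usage map, -1 = free
    sums = [0] * len(seeds)
    cells = [[] for _ in seeds]
    for idx, (r, c) in enumerate(seeds):
        owner[r][c] = idx
        sums[idx] = grid[r][c]
        cells[idx].append((r, c))
    active = set(range(len(seeds)))
    while active:
        for idx in sorted(active):
            if sums[idx] >= target:
                active.discard(idx)      # desired sum reached
                continue
            # Free cells adjacent to the region (4-connectivity).
            free = [(r + dr, c + dc)
                    for r, c in cells[idx]
                    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                    if 0 <= r + dr < rows and 0 <= c + dc < cols
                    and owner[r + dr][c + dc] == -1]
            if not free:
                active.discard(idx)      # nowhere left to grow
                continue
            nr, nc = free[0]
            owner[nr][nc] = idx
            sums[idx] += grid[nr][nc]
            cells[idx].append((nr, nc))
    return owner, sums
```

Choosing which free cell to take (here simply the first one found) is where most of the tweaking would go.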
Standard deviation is a way to measure whether the divisions are close to equal: the lower the standard deviation, the closer the sums are.
As the problem seems NP-hard, like clustering problems, genetic algorithms can be used to get good solutions:
Standard deviation can be used as the fitness measure for chromosomes.
Consider k contiguous regions; each gene (element) then takes one of the k values, subject to keeping the regions contiguous.
Apply the genetic algorithm to the chromosomes and take the best chromosome for that value of k after a fixed number of generations.
Vary k from 2 to n and take the best chromosome found by the genetic algorithm.
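For the fitness measure, something along these lines might do. It is only a sketch: the name region_sum_stdev and the encoding of a chromosome as a label grid with the same shape as the data are assumptions, and it does not check that regions are contiguous.

```python
from statistics import pstdev

def region_sum_stdev(grid, labels, k):
    """Standard deviation of the k region sums (lower is better).

    labels[r][c] is the region index (0..k-1) of cell (r, c).
    """
    sums = [0.0] * k
    for row_values, row_labels in zip(grid, labels):
        for value, label in zip(row_values, row_labels):
            sums[label] += value
    return pstdev(sums)
```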

Single Pass Seed Selection Algorithm for k-Means

I've recently read the Single Pass Seed Selection Algorithm for k-Means article, but I don't really understand the algorithm, which is:
1. Calculate distance matrix Dist in which Dist (i,j) represents distance from i to j
2. Find Sumv in which Sumv (i) is the sum of the distances from ith point to all other points.
3. Find the point i which is min (Sumv) and set Index = i
4. Add First to C as the first centroid
5. For each point xi, set D (xi) to be the distance between xi and the nearest point in C
6. Find y as the sum of distances of first n/k nearest points from the Index
7. Find the unique integer i so that D(x1)^2+D(x2)^2+...+D(xi)^2 >= y > D(x1)^2+D(x2)^2+...+D(x(i-1))^2
8. Add xi to C
9. Repeat steps 5-8 until k centers
Especially step 6, do we still use the same Index (same point) over and over or we use the newly added point from C? And about step 8, does i have to be larger than 1?
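For reference, this is how I read the first three steps of the list above, as a small sketch. It assumes Euclidean distance and points given as coordinate tuples; the function name is made up, and the later steps are left out because I am unsure how to read them.

```python
import math

def min_total_distance_index(points):
    """Distance matrix, per-point distance sums, and the index of the
    point whose summed distance to all other points is smallest."""
    n = len(points)
    # Dist(i, j): distance from point i to point j.
    dist = [[math.dist(points[i], points[j]) for j in range(n)]
            for i in range(n)]
    # Sumv(i): sum of distances from point i to all other points.
    sumv = [sum(row) for row in dist]
    # Index: the point with the smallest Sumv.
    index = min(range(n), key=sumv.__getitem__)
    return index, dist, sumv
```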
Honestly, I wouldn't worry about understanding that paper - it's not very good.
The algorithm is poorly described.
It's not actually a single pass: it needs to do n^2/2 pairwise computations plus one additional pass through the data.
They don't report the runtime of their seed selection scheme, probably because it is very bad, doing O(n^2) work.
They are evaluating on very simple data sets that don't have a lot of bad solutions for k-Means to fall into.
One of their metrics of "better"ness is how many iterations it takes k-means to run given the seed selection. While it is an interesting metric, the small differences they report are meaningless (k-means++ seeding could be more iterations, but less work done per iteration), and they don't report the run time or which k-means algorithm they use.
You will get a lot more benefit from learning and understanding the k-means++ algorithm they are comparing against, and reading some of the history from that.
If you really want to understand what they are doing, I would brush up on your MATLAB and read their provided MATLAB code. But it's not really worth it. If you look up the quantile seed selection algorithm, they are essentially doing something very similar. Instead of using the distance to the first seed to sort the points, they appear to be using the sum of pairwise distances (which means they don't need an initial seed, hence the unique solution).
The Single Pass Seed Selection (SPSS) algorithm is a novel algorithm. "Single pass" means that the first seed can be selected without any iterations. The performance of k-means++ depends on the first seed; SPSS overcomes this. Please go through the paper "Robust Seed Selection Algorithm for k-means" from the same authors.
John J. Louis

Find correlation in large dataset

I have a huge dataset: 100 3D matrices of 121x145x121 cells each. Every cell has a value between 0 and 1, and I need a way to cluster these cells according to their correlation. The problem is that the dataset is too big for any algorithm I know; even using just half of it (each matrix is an MRI scan of a brain) we have around 400 billion pairs. Any ideas?
As a first step I would be tempted to try K-means clustering.
This appears in the Matlab statistics toolbox as the function kmeans.
In this algorithm you only end up computing the distances between the K current centres and the data, so the number of pairs is much smaller than comparing all choices.
In Matlab, I've also found that the speed of the operation can be quite dependent on the organisation of your matrix (due to memory caching and optimisation issues). I would recommend transforming your 3d matrices so that the columns (held together in memory) correspond to the 100 values for a particular cell.
This can be done with the permute function.
Try a weighted K-means++ clustering algorithm. Sum the values of all 100 input matrices at every point to produce one "grey scale" matrix, then adjust the K-means++ algorithm to work with weighted values (weights wt).
In the initialization phase, choose each new center at random from the data points, using a weighted probability distribution in which a point x is chosen with probability proportional to D(x)^2 * wt^2.
The assignment step should be okay, but when computing the centroids in the update step adjust the formula to account for the weights. (Or use the same formula but each point is used wt times).
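The adjusted update step could look roughly like this: a weighted mean, where each point counts in proportion to its weight. The function name weighted_centroid is made up, and points are assumed to be coordinate tuples.

```python
def weighted_centroid(points, weights):
    """Weighted mean of the points assigned to one cluster."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights sum to zero")
    dims = len(points[0])
    # Each coordinate is sum(wt * x) / sum(wt).
    return tuple(
        sum(w * p[d] for p, w in zip(points, weights)) / total
        for d in range(dims)
    )
```

With integer weights this gives the same result as repeating each point wt times in the ordinary centroid formula.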
You may not be able to use a library function to do this, but you start with a 100-fold decrease in the number of points and matrices to work with.

Resources