Does the k-means clusterer of Apache Commons Math contain a means method? - algorithm

I have to get the means of a k-means clustering. Currently I'm using the Apache Commons Math library, which implements a k-means++ clustering algorithm. Does anybody know if there is a simple way to get the means after the clustering with this library, or do I have to implement it myself?
If not, can you explain how to calculate them or give me a code example?

The output of the clustering algorithm must at least contain the cluster assignments, i.e. which cluster each point belongs to. If you have that, then the k-means cluster centers are simply given by the mean of the points that belong to each cluster.

The KMeansPlusPlusClusterer (package org.apache.commons.math3.ml.clustering, version 3.2+) returns a List of CentroidCluster objects. From a CentroidCluster you can get the cluster center (= mean of the cluster's points) by calling the getCenter() method.
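For example, a minimal sketch of that call sequence (the sample points and the choice of k = 2 are just illustrative):

    import java.util.Arrays;
    import java.util.List;
    import org.apache.commons.math3.ml.clustering.CentroidCluster;
    import org.apache.commons.math3.ml.clustering.DoublePoint;
    import org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer;

    public class KMeansMeansExample {
        public static void main(String[] args) {
            List<DoublePoint> points = Arrays.asList(
                    new DoublePoint(new double[] {1.0, 1.0}),
                    new DoublePoint(new double[] {1.5, 2.0}),
                    new DoublePoint(new double[] {8.0, 8.0}),
                    new DoublePoint(new double[] {8.5, 9.0}));

            // Cluster into k = 2 clusters with the default Euclidean distance.
            KMeansPlusPlusClusterer<DoublePoint> clusterer = new KMeansPlusPlusClusterer<>(2);
            List<CentroidCluster<DoublePoint>> clusters = clusterer.cluster(points);

            for (CentroidCluster<DoublePoint> cluster : clusters) {
                // getCenter() is the mean of the points assigned to this cluster.
                double[] mean = cluster.getCenter().getPoint();
                System.out.println("Cluster mean: " + Arrays.toString(mean));
            }
        }
    }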

Related

ELKI cluster extraction HiSC HiCO

I'm computing the HiCO and HiSC clustering algorithms on my dataset. If I'm not mistaken, the algorithms use different approaches to define relevant subspaces for clusters in the first step, and in the second step they apply OPTICS for clustering. I'm only getting a cluster order file after I run the algorithms.
Is there any way to extract clusters from it, for example like OPTICSXi? (I know there are three extraction methods under hierarchical clustering, but I can't see anything for HiCO or HiSC.)
Thank you in advance for any hints.
Use OPTICSXi as the algorithm, then use HiCO or HiSC "inside".
The Xi extraction can be parameterized to use a different OPTICS variant such as HiCO, HiSC, or DeLi-Clu; it just defaults to regular OPTICS.
-algorithm clustering.optics.OPTICSXi
-opticsxi.algorithm de.lmu.ifi.dbs.elki.algorithm.clustering.correlation.HiCO
respectively
-algorithm clustering.optics.OPTICSXi
-opticsxi.algorithm de.lmu.ifi.dbs.elki.algorithm.clustering.subspace.HiSC
We currently don't have implementations of the other extraction methods in ELKI yet, sorry.

Distributed cross correlation matrix computation

How can I calculate the Pearson cross-correlation matrix of a large (>10 TB) data set, possibly in a distributed manner? Any suggestion for an efficient distributed algorithm will be appreciated.
Update:
I read the implementation of the Apache Spark MLlib correlation.
Pearson Computation:
/home/d066537/codespark/spark/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala
Covariance Computation:
/home/d066537/codespark/spark/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
but to me it looks like all the computation is happening at one node and it is not distributed in the real sense.
Please shed some light on this. I also tried executing it on a 3-node Spark cluster, and below are the screenshots:
As you can see from the 2nd image, the data is pulled to one node and then the computation is done. Am I right here?
To start with, have a look at this to see if things are going right. You may then refer to any of these implementations: MPI/OpenMP: Agomezl or Meismyles; MapReduce: Vangjee or Seawolf42. It'd also be interesting to read this before you proceed. On a different note, James's thesis provides some pointers if you're interested in computing correlations that are robust to outliers.
Each local data set can be converted into standard deviations and covariances.
The standard deviations, covariances, and sums can then be combined into the correlation.
Here is a working example:
https://github.com/jeesim2/distributed-correlation
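Sketching the idea above in Java (the class and method names here are my own illustration, not taken from the linked repository): each partition accumulates local sums, the partial sums are merged, and the Pearson correlation is derived from the merged totals.

    // Hedged sketch: per-partition accumulation of sums, mergeable into a
    // global result, from which the Pearson correlation is computed.
    public class PartialSums {
        long n;
        double sx, sy, sxx, syy, sxy;

        // Accumulate one (x, y) observation into the local sums.
        void add(double x, double y) {
            n++; sx += x; sy += y; sxx += x * x; syy += y * y; sxy += x * y;
        }

        // Merge partial sums from another partition (associative and commutative,
        // so this can be done in any order, e.g. as a distributed reduce).
        PartialSums merge(PartialSums o) {
            n += o.n; sx += o.sx; sy += o.sy; sxx += o.sxx; syy += o.syy; sxy += o.sxy;
            return this;
        }

        // Pearson correlation from the merged totals.
        double correlation() {
            double cov = n * sxy - sx * sy;
            double varX = n * sxx - sx * sx;
            double varY = n * syy - sy * sy;
            return cov / Math.sqrt(varX * varY);
        }
    }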

Elbow method for determining number of clusters on Mahout

I'm using Mahout for clustering, and I have implemented the elbow method for determining the number of clusters so that I don't have to specify it.
I have tried this on one machine, but now I'm having doubts when it comes to a cluster of computers.
I have planned to use Oozie to simulate looping (running the clustering algorithm each time, incrementing the number of clusters by one). I read that Oozie is used for DAGs, but I see no other way of doing this.
The question is, does this look like a sound approach? If not, any alternatives?
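As a rough illustration of the elbow loop itself (a single-machine sketch using Apache Commons Math's KMeansPlusPlusClusterer rather than Mahout, so it says nothing about the Oozie part): run the clustering for increasing k, record the within-cluster sum of squares, and look for the k where the curve flattens.

    import java.util.List;
    import org.apache.commons.math3.ml.clustering.CentroidCluster;
    import org.apache.commons.math3.ml.clustering.DoublePoint;
    import org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer;

    public class ElbowSketch {

        // Within-cluster sum of squares (WCSS) for one clustering result.
        static double wcss(List<CentroidCluster<DoublePoint>> clusters) {
            double total = 0.0;
            for (CentroidCluster<DoublePoint> c : clusters) {
                double[] center = c.getCenter().getPoint();
                for (DoublePoint p : c.getPoints()) {
                    double[] x = p.getPoint();
                    for (int i = 0; i < x.length; i++) {
                        double d = x[i] - center[i];
                        total += d * d;
                    }
                }
            }
            return total;
        }

        // Print the WCSS curve for k = 1..maxK; the "elbow" is where it flattens.
        static void elbowCurve(List<DoublePoint> points, int maxK) {
            for (int k = 1; k <= maxK; k++) {
                List<CentroidCluster<DoublePoint>> clusters =
                        new KMeansPlusPlusClusterer<DoublePoint>(k).cluster(points);
                System.out.println("k=" + k + "  WCSS=" + wcss(clusters));
            }
        }
    }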

Is there any Hadoop implementation of the Louvain method?

This is the Louvain method for finding communities in a social graph.
https://sites.google.com/site/findcommunities/
I want to run it on a big graph.
If you are not tied to Hadoop, I saw this implementation for Apache Spark.
https://github.com/Sotera/spark-distributed-louvain-modularity
I don't know of a Hadoop implementation of this clustering method, which looks to be based on modularity. The main source of clustering algorithms in the Hadoop ecosystem is Mahout.
Take a look here: https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
Perhaps one of the clustering algorithms listed would work or provide the basis for your own implementation.

Clustering geo-data for heatmap

I have a list of tweets with their geo locations.
They are going to be displayed in a heatmap image transparently placed over Google Map.
The trick is to find groups of locations residing next to each other and display
them as a single heatmap circle/figure of a certain heat/color, based on cluster size.
Is there some library ready for grouping locations on a map into clusters?
Or should I rather decide on my clustering parameters and build a custom algorithm?
I don't know if there is a "library ready for grouping locations on a map into clusters"; maybe there is, maybe there isn't. Anyway, I don't recommend building a custom clustering algorithm, since there are plenty of libraries already implemented for this.
#recursive sent you a link with PHP code for k-means (one clustering algorithm). There is also a large Java library with other techniques (Java-ML), including k-means, hierarchical clustering, k-means++ (to select the centroids), and more.
Finally, I'd like to point out that clustering is an unsupervised algorithm, which means that it will give you a set of clusters with data inside them, but at first glance you don't know how the algorithm clustered your data. It may be clustered by location as you want, but it can also be clustered by another characteristic you don't need, so it's all about playing with the parameters of the algorithm and tuning your solution.
I'm interested in the final solution you find for this problem :) Maybe you can share it in a comment when you finish this project!
K-means clustering is a technique often used for such problems.
The basic idea is this:
Given an initial set of k means m1,…,mk, the
algorithm proceeds by alternating between two steps:
Assignment step: assign each observation to the cluster with the closest mean.
Update step: calculate the new means as the centroids of the observations in each cluster.
Here is some sample code for PHP.
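In case the PHP link is not available, here is a minimal Java sketch of one round of those two steps (the plain-array representation and the handling of empty clusters are simplifying assumptions of mine):

    public class KMeansSketch {

        // One round of the two steps: assign each observation to its closest mean,
        // then recompute each mean as the centroid of its assigned observations.
        static double[][] iterate(double[][] points, double[][] means) {
            int k = means.length, dim = means[0].length;
            int[] assignment = new int[points.length];

            // Assignment step: closest mean by squared Euclidean distance.
            for (int i = 0; i < points.length; i++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = 0;
                    for (int j = 0; j < dim; j++) {
                        double diff = points[i][j] - means[c][j];
                        d += diff * diff;
                    }
                    if (d < best) { best = d; assignment[i] = c; }
                }
            }

            // Update step: new mean = centroid of the observations in each cluster.
            double[][] newMeans = new double[k][dim];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                counts[assignment[i]]++;
                for (int j = 0; j < dim; j++) newMeans[assignment[i]][j] += points[i][j];
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) newMeans[c] = means[c].clone(); // keep an empty cluster's old mean
                else for (int j = 0; j < dim; j++) newMeans[c][j] /= counts[c];
            }
            return newMeans;
        }
    }

Repeating iterate() until the means stop moving gives the usual k-means loop.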
heatmap.js is an HTML5 library for rendering heatmaps, and has a sample for doing it on top of the Google Maps API. It's pretty robust, but only works in browsers that support canvas:
The heatmap.js library is currently supported in Firefox 3.6+, Chrome 10, Safari 5, Opera 11 and IE 9+.
You can try my PHP class for the Hilbert curve at phpclasses.org. It's a monster curve that reduces 2D complexity to 1D complexity. I use a quadkey to address a coordinate, and it has 21 zoom levels like Google Maps.
This isn't really a clustering problem. Heat maps don't work by creating clusters. Instead, they convolve the data with a Gaussian kernel. If you're not familiar with image processing, think of it as using a normal or Gaussian "stamp" and stamping it over each point. Since the overlapping stamps add up on top of each other, areas of high density will have higher values.
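A small Java sketch of that "stamping" idea, assuming a pre-sized pixel grid and an illustrative kernel radius and sigma (mapping lat/long to pixel coordinates is left out):

    public class GaussianStampSketch {

        // Build a (2r+1) x (2r+1) Gaussian kernel with standard deviation sigma.
        static double[][] gaussianKernel(int r, double sigma) {
            double[][] k = new double[2 * r + 1][2 * r + 1];
            for (int dy = -r; dy <= r; dy++) {
                for (int dx = -r; dx <= r; dx++) {
                    k[dy + r][dx + r] = Math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma));
                }
            }
            return k;
        }

        // "Stamp" the kernel onto the grid at (px, py); overlapping stamps add up,
        // so dense areas accumulate higher values.
        static void stamp(double[][] grid, double[][] kernel, int px, int py) {
            int r = kernel.length / 2;
            for (int dy = -r; dy <= r; dy++) {
                for (int dx = -r; dx <= r; dx++) {
                    int x = px + dx, y = py + dy;
                    if (y >= 0 && y < grid.length && x >= 0 && x < grid[0].length) {
                        grid[y][x] += kernel[dy + r][dx + r];
                    }
                }
            }
        }
    }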
One simple alternative for heatmaps is to just round the lat/long to some number of decimals and group by that.
See this explanation of lat/long decimal accuracy:
1 decimal - 11km
2 decimals - 1.1km
3 decimals - 110m
etc.
For a low-zoom-level heatmap with lots of data, rounding to 1 or 2 decimals and grouping the results by that should do the trick.
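A minimal Java sketch of this rounding-and-grouping approach (the key format and the choice of decimals are illustrative assumptions):

    import java.util.HashMap;
    import java.util.Map;

    public class RoundAndGroup {

        // Round a lat/long pair to 'decimals' places and use the result as a cell key;
        // at 2 decimals a cell is roughly 1.1 km wide.
        static String key(double lat, double lon, int decimals) {
            double f = Math.pow(10, decimals);
            return (Math.round(lat * f) / f) + "," + (Math.round(lon * f) / f);
        }

        // Count how many points fall into each rounded cell; the counts drive the heat.
        static Map<String, Integer> heatCounts(double[][] latLons, int decimals) {
            Map<String, Integer> counts = new HashMap<>();
            for (double[] p : latLons) {
                counts.merge(key(p[0], p[1], decimals), 1, Integer::sum);
            }
            return counts;
        }
    }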
