Distributed hierarchical clustering - algorithm

Are there any algorithms that can help with hierarchical clustering?
Google's map-reduce has only an example of k-clustering. In the case of hierarchical clustering, I'm not sure how the work can be divided between nodes.
Another resource I found is: http://issues.apache.org/jira/browse/MAHOUT-19
But it's not apparent which algorithms are used.

First, you have to decide if you're going to build your hierarchy bottom-up or top-down.
Bottom-up is called Hierarchical agglomerative clustering. Here's a simple, well-documented algorithm: http://nlp.stanford.edu/IR-book/html/htmledition/hierarchical-agglomerative-clustering-1.html.
Distributing a bottom-up algorithm is tricky because each distributed process needs the entire dataset to make choices about appropriate clusters. It also needs a list of clusters at its current level so it doesn't add a data point to more than one cluster at the same level.
Top-down hierarchy construction is called Divisive clustering. K-means is one option to decide how to split your hierarchy's nodes. This paper looks at K-means and Principal Direction Divisive Partitioning (PDDP) for node splitting: http://scgroup.hpclab.ceid.upatras.gr/faculty/stratis/Papers/tm07book.pdf. In the end, you just need to split each parent node into relatively well-balanced child nodes.
A top-down approach is easier to distribute. After your first node split, each node created can be shipped to a distributed process to be split again and so on... Each distributed process needs only to be aware of the subset of the dataset it is splitting. Only the parent process is aware of the full dataset.
In addition, each split could be performed in parallel. Two examples for k-means:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.1882&rep=rep1&type=pdf
http://www.ece.northwestern.edu/~wkliao/Kmeans/index.html.
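The top-down scheme described above can be sketched roughly as follows. This is only an illustration under stated assumptions: points are plain tuples, each node is split with a simple 2-means, and a thread pool stands in for real distributed workers. The names `split_two_means` and `divisive_levels` are made up for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def split_two_means(points, iters=20):
    """Split one node into two children with 2-means (Lloyd's algorithm)."""
    # Deterministic seeding: the first point, plus the point farthest from it.
    c0 = points[0]
    c1 = max(points, key=lambda p: dist2(p, c0))
    left, right = [], []
    for _ in range(iters):
        left = [p for p in points if dist2(p, c0) <= dist2(p, c1)]
        right = [p for p in points if dist2(p, c0) > dist2(p, c1)]
        if not left or not right:
            break
        new0, new1 = mean(left), mean(right)
        if (new0, new1) == (c0, c1):
            break  # centroids stopped moving
        c0, c1 = new0, new1
    return left, right

def divisive_levels(points, min_size=2, max_depth=3, workers=4):
    """Return the hierarchy as a list of levels; each level is a list of clusters."""
    levels = [[list(points)]]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_depth):
            current = levels[-1]
            splittable = [c for c in current if len(c) > min_size]
            if not splittable:
                break
            # Each split only sees its own subset, so the splits are independent.
            results = list(pool.map(split_two_means, splittable))
            next_level = [c for c in current if len(c) <= min_size]
            for left, right in results:
                next_level.extend(part for part in (left, right) if part)
            levels.append(next_level)
    return levels
```

Only the first call sees the full dataset; every later split receives just the subset it has to divide, which is what makes this easier to distribute than the bottom-up approach.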

Clark Olson reviews several distributed algorithms for hierarchical clustering:
C. F. Olson. "Parallel Algorithms for Hierarchical Clustering." Parallel Computing, 21:1313-1325, 1995, doi:10.1016/0167-8191(95)00017-I.
Parunak et al. describe an algorithm inspired by how ants sort their nests:
H. Van Dyke Parunak, Richard Rohwer, Theodore C. Belding, and Sven Brueckner: "Dynamic Decentralized Any-Time Hierarchical Clustering." In Proc. 4th International Workshop on Engineering Self-Organising Systems (ESOA), 2006, doi:10.1007/978-3-540-69868-5

Check out this very readable if a bit dated review by Olson (1995). Most papers since then require a fee to access. :-)
If you use R, I recommend trying pvclust, which achieves parallelism using snow, another R package.

You can also see Finding and evaluating community structure in networks by Newman and Girvan, where they propose an approach for evaluating communities in networks (and a set of algorithms based on this approach) and a measure of the quality of a network's division into communities (graph modularity).
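To make the modularity measure concrete, here is a small sketch that computes it for a given partition of a weighted, undirected graph. It uses the usual formulation Q = sum over communities of (internal weight / m) - (community degree / 2m)^2, where m is the total edge weight. The function name and the edge-list input format are assumptions made for this example.

```python
from collections import defaultdict

def modularity(edges, community):
    """Modularity of a partition.

    edges: list of (u, v, weight) for an undirected graph, each edge listed once.
    community: dict mapping node -> community label.
    """
    total = 0.0                      # m: total edge weight
    degree = defaultdict(float)      # weighted degree of each node
    internal = defaultdict(float)    # edge weight inside each community
    for u, v, w in edges:
        total += w
        degree[u] += w
        degree[v] += w
        if community[u] == community[v]:
            internal[community[u]] += w
    comm_degree = defaultdict(float)  # summed degree per community
    for node, k in degree.items():
        comm_degree[community[node]] += k
    # Q = sum_c [ w_in(c) / m - (deg(c) / 2m)^2 ]
    return sum(internal[c] / total - (comm_degree[c] / (2 * total)) ** 2
               for c in comm_degree)
```

For two triangles joined by a single edge, putting each triangle in its own community gives Q = 5/14, while putting every node in one community gives Q = 0.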

You could look at some of the work being done with Self-Organizing maps (Kohonen's neural network method)... the guys at Vienna University of Technology have done some work on distributed calculation of their growing hierarchical map algorithm.
This is a little on the edge of your clustering question, so it may not help, but I can't think of anything closer ;)

Related

Community Detection in complete and weighted networks

I have a complete graph in which every vertex is connected to every other vertex, so the edges differ only in their weights. An example would be a trade network, where every country is connected to every other country and the connections differ only in trading volume.
Now the question is how I could perform community detection in that kind of network. The usual algorithms only perform well on unweighted or incomplete networks. The main problem is that the geodesic distance is the same everywhere.
Two options came to mind:
Cut the network into smaller pieces by cutting it at a certain "weight-threshold-level"
Or use a hierarchical clustering algorithm to turn the whole network into a blockmodel. But I think the problem "no variance in geodesic terms" will remain.
Several methods were suggested.
One simple yet effective method was suggested in Fast unfolding of communities in large networks (Blondel et al., 2008). It supports weighted networks. Quoting from the abstract:
We propose a simple method to extract the community structure of large
networks. Our method is a heuristic method that is based on modularity
optimization. It is shown to outperform all other known community
detection method in terms of computation time. Moreover, the quality
of the communities detected is very good, as measured by the so-called
modularity.
Quoting from the paper:
We now introduce our algorithm that finds high modularity partitions
of large networks in short time and that unfolds a complete
hierarchical community structure for the network, thereby giving
access to different resolutions of community detection.
So it is supposed to work well for a complete graph, but you should check this yourself.
A C++ implementation is available here (now maintained here).
Your other idea, using a weight threshold, may prove to be a good pre-processing step, especially for algorithms that won't partition complete graphs. I believe it is best to set it to some percentile (e.g. the median) of the weights.
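As a rough illustration of the thresholding idea, the sketch below drops every edge at or below the median weight and then reports the connected components that remain. The function names and the (u, v, weight) edge format are assumptions for this example; a real pipeline would more likely pass the thresholded graph to a community detection algorithm rather than stop at components.

```python
from statistics import median

def threshold_graph(edges):
    """Keep only edges whose weight is strictly above the median weight."""
    cut = median(w for _, _, w in edges)
    return [(u, v, w) for u, v, w in edges if w > cut]

def connected_components(nodes, edges):
    """Connected components via union-find, returned as a list of sets."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for u, v, _ in edges:
        parent[find(u)] = find(v)
    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=lambda g: sorted(g))
```

On a complete four-node graph with two heavy edges (A-B and C-D) and four light ones, the median cut leaves exactly those two heavy edges, so the components are {A, B} and {C, D}.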

How to cluster large datasets

I have a very large dataset (500 million documents) and want to cluster all documents according to their content.
What would be the best way to approach this?
I tried using k-means but it does not seem suitable because it needs all documents at once in order to do the calculations.
Are there any cluster algorithms suitable for larger datasets?
For reference: I am using Elasticsearch to store my data.
According to Prof. J. Han, who is currently teaching the Cluster Analysis in Data Mining class at Coursera, the most common methods for clustering text data are:
Combination of k-means and agglomerative clustering (bottom-up)
topic modeling
co-clustering.
But I can't tell how to apply these on your dataset. It's big - good luck.
For k-means clustering, I recommend reading the dissertation of Ingo Feinerer (2008). This guy is the developer of the tm package (used in R) for text mining via document-term matrices.
The thesis contains case studies (Ch. 8.1.4 and 9) on applying k-means and then the Support Vector Machine classifier to some documents (mailing lists and law texts). The case studies are written in tutorial style, but the datasets are not available.
The process contains lots of intermediate steps of manual inspection.
There are k-means variants that process documents one by one,
MacQueen, J. B. (1967). Some Methods for classification and Analysis of Multivariate Observations. Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability 1.
and k-means variants that repeatedly draw a random sample.
D. Sculley (2010). Web Scale K-Means clustering. Proceedings of the 19th international conference on World Wide Web
Bahmani, B., Moseley, B., Vattani, A., Kumar, R., & Vassilvitskii, S. (2012). Scalable k-means++. Proceedings of the VLDB Endowment, 5(7), 622-633.
But in the end, it's still useless old k-means. It's a good quantization approach, but it is not very robust to noise and not capable of handling clusters of different sizes, non-convex shapes, hierarchy (e.g. baseball inside sports), etc. It's a signal processing technique, not a data organization technique.
So the practical impact of all these is 0. Yes, they can run k-means on insane data - but if you can't make sense of the result, why would you do so?
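For reference, the sampling variant cited above (Sculley, 2010) can be sketched in a few lines. This is a simplified reading of the mini-batch update, not the paper's exact pseudocode, and `sample_batch` is a hypothetical callable that returns a random batch of document vectors from wherever the data lives. The point is that only one batch has to be in memory at a time.

```python
import random

def nearest(centers, x):
    """Index of the center closest to x (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(centers[j], x)))

def minibatch_kmeans(sample_batch, init_centers, n_iters=100):
    """Mini-batch k-means: update centers from small random batches."""
    centers = [list(c) for c in init_centers]
    counts = [0] * len(centers)           # how many points each center has absorbed
    for _ in range(n_iters):
        batch = sample_batch()
        # Assign the whole batch first, with the centers held fixed.
        assigned = [nearest(centers, x) for x in batch]
        for x, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]         # per-center learning rate
            centers[j] = [(1 - eta) * c + eta * xi
                          for c, xi in zip(centers[j], x)]
    return centers
```

A usage example on toy data, with an in-memory list standing in for the document store:

```python
rng = random.Random(0)
data = ([(rng.gauss(0, 0.1), rng.gauss(0, 0.1)) for _ in range(200)] +
        [(rng.gauss(5, 0.1), rng.gauss(5, 0.1)) for _ in range(200)])
centers = minibatch_kmeans(lambda: rng.sample(data, 20), [data[0], data[-1]])
```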

Can humans cluster data sets manually? Clustering algorithms that are closest to human clustering

Can humans cluster data sets manually? For example, consider the Iris data set, depicted below:
http://i.stack.imgur.com/Ae6qa.png
Instead of using clustering algorithms such as connectivity-based clustering (hierarchical clustering), centroid-based clustering, distribution-based clustering, density-based clustering, etc., can a human manually cluster the Iris dataset? For our convenience, let us consider it as a two-dimensional dataset. By what means would a human cluster the dataset?
I am concerned that "human clustering" might not be well-defined and could vary according to different people's intuitions and opinions. I would like to know which clustering algorithms are closest to human clustering, and how humans actually cluster a data set. Is there a clustering algorithm that would cluster just like humans do?
Humans can and do cluster data manually, but as you say there will be a lot of variation and subjective decisions. Assuming that you could get an algorithm that will use the same features as a human, it's in principle possible to have a computer cluster like a human.
To a first approximation, nearest neighbor algorithms are probably close to how humans cluster, in that they group things that look similar under some measure. Keep in mind that without training and significant ongoing effort, humans really don't do well on consistency. We seem to be biased toward looking for novelty, so we tend to break things into two big clusters: the stuff we encounter all of the time, and everything else.

Detail of all MPI Algorithm?

Is there any document about how MPI functions such as MPI_Allgather, MPI_Alltoall, MPI_Allreduce, etc. are implemented?
I would like to learn about their algorithms and compute their complexity in terms of uni-directional or bi-directional bandwidth and total data transfer size, for a given number of nodes and a fixed data size.
I think the exact implementation of those algorithms varies depending on the communication mechanism: for example, a network will have tree-based reduction algorithms, while shared memory models will have different ones.
I'm not exactly sure where to find answers to such questions, but I think that a good search for papers in Google Scholar, or a look at this paper list at open-mpi.org, should be useful.
http://www.amazon.com/Parallel-Programming-MPI-Peter-Pacheco/dp/1558603395/ref=sr_1_10?s=books&ie=UTF8&qid=1314807638&sr=1-10
The book linked above is a great resource that explains all the basic MPI algorithms and allows you to implement a simple version yourself. However, when you compare the algorithms you have implemented with the MPI ones, you will see that MPI implementations make many optimizations depending on the size of the message and the number of nodes you are running on. Hopefully this helps.
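As an example of the kind of "simple version" one might write to study these collectives, here is a pure-Python simulation of a recursive-doubling allreduce. It is one textbook scheme, not a claim about what any particular MPI library does for a given message size or node count, and it assumes the number of ranks is a power of two. The function name is made up for this sketch.

```python
def recursive_doubling_allreduce(values, op=lambda a, b: a + b):
    """Simulate an allreduce over len(values) ranks.

    Returns (per-rank results, number of communication rounds).
    Assumes the rank count is a power of two and op is commutative.
    """
    p = len(values)
    assert p > 0 and p & (p - 1) == 0, "this sketch assumes a power-of-two rank count"
    data = list(values)
    rounds = 0
    step = 1
    while step < p:
        # In each round, rank r exchanges its partial result with rank r XOR step.
        data = [op(data[r], data[r ^ step]) for r in range(p)]
        step *= 2
        rounds += 1
    return data, rounds
```

With 8 ranks holding the values 1 to 8, every rank ends up with 36 after 3 rounds. In this scheme each rank sends its full buffer once per round, so for P ranks and n bytes of data it sends about n * log2(P) bytes in total; that is the sort of figure the question asks about, under these assumptions only.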

What type of problems can mapreduce solve?

Is there a theoretical analysis available which describes what kind of problems mapreduce can solve?
In Map-Reduce for Machine Learning on Multicore, Chu et al. describe how "algorithms that fit the Statistical Query model can be written in a certain “summation form,” which allows them to be easily parallelized on multicore computers." They specifically implement 10 algorithms, including e.g. weighted linear regression, k-means, Naive Bayes, and SVM, using a map-reduce framework.
The Apache Mahout project has released a recent Hadoop (Java) implementation of some methods based on the ideas from this paper.
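To illustrate what "summation form" means, here is a minimal sketch for ordinary (unweighted) least-squares fitting of a line, which is a simpler case than the weighted regression in the paper. Each mapper computes a handful of sums over its own partition, the reducer adds them up, and the final fit needs only the totals. The function names are illustrative and not taken from the paper or from Mahout.

```python
from functools import reduce

def map_partial_sums(chunk):
    """Mapper: sufficient statistics for one partition of (x, y) pairs."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)

def reduce_sums(a, b):
    """Reducer: element-wise addition of two tuples of partial sums."""
    return tuple(u + v for u, v in zip(a, b))

def fit_line(chunks):
    """Least-squares slope and intercept from the aggregated sums."""
    n, sx, sy, sxx, sxy = reduce(reduce_sums, map(map_partial_sums, chunks))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept
```

Because the partial sums can be combined in any order, the partitions can be processed on different cores or machines without changing the result (up to floating-point rounding).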
It suits problems that require processing and generating large data sets: for example, running an interest generation query over all the accounts a bank holds, or processing audit data for all transactions that happened in a bank over the past year. The best use case is from Google: generating the search index for the Google search engine.
Many problems that are "Embarrassingly Parallel" (great phrase!) can use MapReduce. http://en.wikipedia.org/wiki/Embarrassingly_parallel
From this article....
http://www.businessweek.com/magazine/content/07_52/b4064048925836.htm
...
Doug Cutting, founder of Hadoop (an open source implementation of MapReduce) says...
“Facebook uses Hadoop to analyze user behavior and the effectiveness of ads on the site”
and... “the tech team at The New York Times rented computing power on Amazon’s cloud and used Hadoop to convert 11 million archived articles, dating back to 1851, to digital and searchable documents. They turned around in a single day a job that otherwise would have taken months.”
Anything that involves doing operations on a large set of data, where the problem can be broken down into smaller independent sub-problems whose results can then be aggregated to produce the answer to the larger problem.
A trivial example would be calculating the sum of a huge set of numbers. You split the set into smaller sets, calculate the sums of those smaller sets in parallel (which can involve splitting those into yet even smaller sets), then sum those results to reach the final answer.
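The summing example above might look like this in miniature. A thread pool stands in for the cluster here, and the function names and chunk size are arbitrary choices for the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def chunked(seq, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [seq[i:i + size] for i in range(0, len(seq), size)]

def parallel_sum(numbers, chunk_size=1000, workers=4):
    """Sum a list by summing independent chunks, then combining the results."""
    chunks = chunked(numbers, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(sum, chunks))  # "map": independent partial sums
    return sum(partial)                        # "reduce": combine them
```

The chunks never need to talk to each other, which is exactly the property that makes a problem a good fit.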
The answer really lies in the name of the algorithm. MapReduce is not a general-purpose parallel programming framework or batch execution framework, as some of the answers suggest. MapReduce is really useful when large data sets need to be processed to derive certain attributes (the map phase) and then summarized on those derived attributes (the reduce phase).
You can also watch the videos at Google; I'm watching them myself and I find them very educational.
Sort of a hello world introduction to MapReduce
http://blog.diskodev.com/parallel-processing-using-the-map-reduce-prog
This question was asked before its time. Since 2009 there has actually been a theoretical analysis of MapReduce computations. This 2010 paper by Howard Karloff et al. formalizes MapReduce as a complexity class, in the same way that theoreticians study P and NP. They prove some relationships between MapReduce and a class called NC (which can be thought of either as shared-memory parallel machines or as a certain class of restricted circuits). But the main piece of work is their formal definitions.

Resources