How to cluster large datasets - algorithm

I have a very large dataset (500 million documents) and want to cluster all the documents according to their content.
What would be the best way to approach this?
I tried using k-means but it does not seem suitable because it needs all documents at once in order to do the calculations.
Are there any clustering algorithms suitable for larger datasets?
For reference: I am using Elasticsearch to store my data.

According to Prof. J. Han, who is currently teaching the Cluster Analysis in Data Mining class on Coursera, the most common methods for clustering text data are:
a combination of k-means and agglomerative clustering (bottom-up),
topic modeling, and
co-clustering.
But I can't tell how to apply these to your dataset; it's big, so good luck.
For k-means clustering, I recommend reading the dissertation of Ingo Feinerer (2008). He is the developer of the tm package for text mining in R via document-term matrices.
The thesis contains case studies (Ch. 8.1.4 and 9) on applying k-means and then a Support Vector Machine classifier to document collections (mailing lists and law texts). The case studies are written in tutorial style, but the datasets are not available.
The process contains many intermediate steps of manual inspection.

There are k-means variants that process documents one by one:
MacQueen, J. B. (1967). Some Methods for Classification and Analysis of Multivariate Observations. Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1.
and k-means variants that repeatedly draw a random sample:
D. Sculley (2010). Web-Scale K-Means Clustering. Proceedings of the 19th International Conference on World Wide Web.
Bahmani, B., Moseley, B., Vattani, A., Kumar, R., & Vassilvitskii, S. (2012). Scalable k-means++. Proceedings of the VLDB Endowment, 5(7), 622-633.
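To make the streaming idea concrete, here is a minimal sketch using scikit-learn's MiniBatchKMeans, which implements Sculley's mini-batch algorithm. The fetch_batches() generator is a hypothetical stand-in for paging documents out of Elasticsearch (e.g. via its scroll API):

```python
# Minimal sketch: mini-batch k-means over a document stream, in the
# spirit of Sculley (2010). fetch_batches() is a hypothetical helper
# that pages documents out of Elasticsearch.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans

# HashingVectorizer is stateless, so no vocabulary pass over all
# 500M documents is needed before clustering can start.
vectorizer = HashingVectorizer(n_features=2**18, norm="l2",
                               alternate_sign=False)
km = MiniBatchKMeans(n_clusters=100, batch_size=1000)

for batch_of_texts in fetch_batches(size=1000):  # hypothetical ES paging
    X = vectorizer.transform(batch_of_texts)     # sparse term vectors
    km.partial_fit(X)                            # incremental center update

# Later, assign any documents to the learned clusters:
labels = km.predict(vectorizer.transform(some_texts))  # some_texts: placeholder
```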
But in the end, it's still useless old k-means. It's a good quantization approach, but it is not very robust to noise, and it cannot handle clusters of different sizes, non-convex shapes, or hierarchy (e.g. baseball nested inside sports). It's a signal-processing technique, not a data-organization technique.
So the practical impact of all of these is close to zero. Yes, they can run k-means on insanely large data - but if you can't make sense of the result, why would you do so?

Related

Looking at the results of clustering algorithms on Protein Interaction Networks

I am working on a project involving the clustering of protein interaction networks. Having implemented several clustering algorithms on graphs of interacting proteins, I am somewhat confused about how to go about judging whether the clusters created are any good or not.
To put this into context: protein interaction networks represent pairwise connections between proteins, and the goal is to isolate groups of interacting proteins that participate in the same biological processes or that together perform specific functions. This is significant because many proteins and interactions are unlabelled, so their function can be inferred when many labelled proteins sharing a function fall into one cluster.
Unlike typical supervised machine learning tasks, where a labelled dataset can show which groupings are correct, there is no precedent for what a good clustering of proteins and their interactions looks like; hypothetically, a clustering where every protein sits in its own cluster is as "good" as one where all proteins are in a single cluster (though neither carries any informational significance). There are also no feature vectors for distance calculations, only binary information about whether one protein interacts with another or not, so this is quite difficult.
This problem is completely exploratory, and it is hard to see whether a clustering is significant or just bogus.
Most academic papers use cluster analysis techniques to see how good the clusters and the algorithms are, i.e. whether they are robust to edge deletion or node deletion, cluster correlation, etc. I would like to see if there is any information one can fish out of protein databases: say, input a large number of interactions (from one cluster) and see if the labelled ones tend to be involved in the same metabolic process. If a significantly high number of proteins are involved in one metabolic process, one can surmise that the unlabelled proteins may be involved in a similar process or function, or may be part of the same protein domain.
I have just begun delving into bioinformatics and research in general, so there is a very high chance that this has been done before and I haven't looked around extensively enough. If that is the case, I would be grateful for links. I would appreciate any help possible, or ideas on how one could think about this problem.
If I understand your question: you would like to know if your clustered protein interaction network identifies biologically relevant protein complexes...
I can think of three ways to do this:
1) Use the primary research literature. Take a cluster, search PubMed for each member of the cluster, and see if there are any reports of interactions with other members of the cluster. This will be time-consuming but the most rigorous.
2) Submit each cluster to GO term enrichment analysis (DAVID, FuncAssociate, etc.) or pathway analysis (KEGG). If a cluster is "biologically" relevant, it should be enriched for specific GO/KEGG terms. This will only work if most of your proteins have annotations. (A sketch of the statistic behind such enrichment tests follows after this list.)
3) Look at expression data. Biological complexes tend to have correlated gene expression patterns. Therefore, the expression of a cluster's members should correlate more strongly with each other than with non-members of the cluster.
I thought of a 4th:
4) Find homologs in an organism with a rich and deep annotation database and look for correlations there. Yeast (S. cerevisiae or S. pombe), fly (D. melanogaster), worm (C. elegans), mouse, and human all have large protein interaction databases (e.g. BioGRID).
And a 5th:
5) Use genetic screen data. Genetic epistasis data show distinct relationships within complexes: proteins in the same complex tend not to have a genetic interaction with each other, while proteins in separate, independently acting complexes can have a genetic interaction component. See the work of Dr. Charles Boone (Univ. of Toronto) on how this can be modeled.
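For option 2, the statistic behind most GO/KEGG enrichment tools is a one-sided hypergeometric test. Here is a minimal sketch with scipy; all the counts below are hypothetical placeholders:

```python
# Minimal sketch of the over-representation test behind GO/KEGG
# enrichment analysis: a one-sided hypergeometric test.
from scipy.stats import hypergeom

N = 6000   # annotated proteins in the background set (placeholder)
K = 120    # background proteins carrying the GO term of interest
n = 40     # proteins in your cluster
k = 9      # cluster members carrying that GO term

# P(X >= k): chance of drawing at least k term-carriers in a random
# sample of n proteins from the background.
p_value = hypergeom.sf(k - 1, N, K, n)
print(f"enrichment p-value: {p_value:.3g}")
# In practice, correct for multiple testing across all GO terms
# (e.g. Benjamini-Hochberg), as DAVID and similar tools do.
```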
Final thoughts:
A little bit of domain-specific knowledge will go a long way toward helping others believe your results. Do well-known/well-studied complexes form clusters? There has been a lot of work done in this field; PubMed will be your friend. Start at BioGRID and work out from there.
Good luck

Can humans cluster data sets manually? Which clustering algorithms are closest to human clustering?

Can humans cluster data sets manually? For example, consider the Iris data set, depicted below:
http://i.stack.imgur.com/Ae6qa.png
Instead of using clustering algorithms like connectivity-based clustering (hierarchical clustering), centroid-based clustering, distribution-based clustering, density-based clustering, etc., can a human manually cluster the Iris dataset? For our convenience, let us consider it as a two-dimensional dataset. By which means, and how, would a human cluster the dataset?
I am concerned that "human clustering" might not be well-defined and could vary according to different people's intuitions and opinions. I would like to know which clustering algorithms are closest to human clustering, or how humans actually perform dataset clustering. Is there a clustering algorithm that would perform just like humans do?
Humans can and do cluster data manually, but as you say there will be a lot of variation and subjective decisions. Assuming that you could get an algorithm that will use the same features as a human, it's in principle possible to have a computer cluster like a human.
At a first approximation, nearest-neighbor algorithms are probably close to how humans cluster, in that they group things that look similar under some measure. Keep in mind that without training and significant ongoing effort, humans really don't do well on consistency. We seem to be biased toward looking for novelty, so we tend to break things into two big clusters: the stuff we encounter all of the time, and everything else.
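As a rough illustration of that nearest-neighbor intuition, here is a sketch using single-linkage agglomerative clustering (which always merges whichever clusters are currently closest) on a 2D view of Iris; the choice of the two petal features is just for illustration:

```python
# Sketch: single-linkage ("nearest neighbour") agglomerative
# clustering on two Iris features, roughly mimicking the
# group-what-looks-close intuition described above.
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering

X = load_iris().data[:, 2:4]          # petal length & width: a 2D view
labels = AgglomerativeClustering(
    n_clusters=3, linkage="single"    # merge the closest pair of clusters
).fit_predict(X)
print(labels[:10])
```

Single linkage chains easily, so it tends to merge the two overlapping Iris species; that fragility is itself a good illustration of how inconsistent "looks similar" can be.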

Clustering in High Dimensions + some basic stuff [closed]

Closed. This question needs to be more focused. It is not currently accepting answers. Closed 8 years ago.
I've been studying Support Vector Machines (SVMs) for a while, and recently started reading articles on clustering. When using SVMs, we did not need to worry about the dimensionality of the data; however, I learned that in clustering, due to the "Curse of Dimensionality", dimensionality is a big issue. Furthermore, the sparsity and the size of the data greatly affect which clustering algorithm you choose. So I understand that there is no "best algorithm" for clustering; it all depends on the nature of the data.
Having said that, I want to ask some really basic questions on Clustering.
When people say "high dimension", what do they mean specifically? Is 100d a high dimension? Or does this depend on the type of data you have?
I've seen answers on this website saying things like "using k-means on data with 100's of dimensions is very usual"; if this is true, does it also hold for other clustering algorithms that use the same distance metric as k-means?
On p. 649 of the paper "Survey of Clustering Algorithms" (http://goo.gl/WQyuxo) by Rui Xu et al., the table shows that CURE has "the capability of tackling high dimensional data", and I was wondering if anybody has any idea how high a dimension they are talking about.
If I wanted to perform clustering on high-dimensional data of adequate size, randomly sampled from the initial big data, what kinds of algorithms would be appropriate to use? I understand that density-based algorithms such as DBSCAN do not perform well under random sampling.
Can anybody tell me how well or badly CURE performs on high-dimensional data? Intuitively, I guess CURE does not perform well considering the "Curse of Dimensionality"; however, it would be great if there were some detailed results.
Are there any websites/papers/textbooks explaining the pros and cons of clustering algorithms? I've seen some papers on the pros/cons of basic algorithms, i.e. k-means, hierarchical clustering, DBSCAN, etc., but want to know more about other algorithms such as CURE, CLIQUE, CHAMELEON, etc.
Sorry for asking so many questions all at once!
It would be awesome if anybody could answer any one of my questions. Also, if I have ill-stated a question or asked a completely pointless question, don't hesitate to tell me.
And if anybody knows a great textbook/survey paper on clustering that elaborates on these subjects, please tell me!
Thank you in advance.
You may be interested in this survey:
Kriegel, H. P., Kröger, P., & Zimek, A. (2009). Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(1), 1.
One of the authors is a co-author of DBSCAN, so it will likely help shed some light on your DBSCAN questions.
100-dimensional data can be high-dimensional data - if it isn't sparse. For the NLP people, 100d is laughably little, but their data is special: it is derived from an essentially binary nature (word present, or not present), so it actually carries less than 1 bit of information per dimension. If you have dense 100-dimensional data, you are usually in trouble.
There are some nice figures in a related / follow up survey by the same authors:
Zimek, A., Schubert, E., & Kriegel, H. P. (2012). A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining, 5(5), 363-387.
They have analyzed the behavior of distance functions nicely for such data. The essence is:
high-dimensional data can be hard - or easy; it all depends on the signal-to-noise ratio. If all your dimensions carry signal, additional dimensions can actually make your problem easier. If the additional dimensions are distracting noise, things can break down.
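A small numpy experiment illustrates the noise side of this point: with noise-only data, the relative contrast between a query's nearest and farthest neighbors shrinks as dimensions are added:

```python
# Illustration: the relative contrast (d_max - d_min) / d_min between
# a query point and random points shrinks as pure-noise dimensions
# are added ("distance concentration").
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((1000, d))              # noise-only data
    q = rng.random(d)                      # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:5d}  relative contrast={contrast:.3f}")
# Contrast drops toward 0 with d: nearest and farthest neighbours
# become nearly indistinguishable when every dimension is noise.
```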
This may also explain why the "kernel trick" with SVMs works: it does not really add information content; the increased dimensionality is only virtual, not intrinsic. You get a larger search and solution space, but your data still lies on a lower-dimensional manifold within this space.
k-means results on high-dimensional data tend to become meaningless. In many cases they still work "well enough", because often quality does not really matter a lot and any convex partitioning will do (e.g. bag-of-words approaches for image similarity don't seem to improve substantially with "better" k-means clusterings).
CURE, which also seems to use sum-of-squares (like k-means), should suffer from the same problems: in high dimensions, all sum-of-squares values become increasingly similar (i.e. any partitioning is about as good as any other).
Yes, there are plenty of textbooks, surveys, and studies that have tried to compare clustering algorithms. But in the end, there are too many factors involved: what your data looks like, how you preprocessed it, whether you have a well-chosen and appropriate distance measure, how good your implementation is, whether you have index acceleration to speed up some algorithms, etc. There is no rule of thumb; you will have to try things out.

Anomaly Detection Algorithms

I am tasked with detecting anomalies (known or unknown) using machine-learning algorithms on data in various formats - e.g. emails, IMs, etc.
What are your favorite and most effective anomaly detection algorithms?
What are their limitations and sweet-spots?
How would you recommend those limitations be addressed?
All suggestions very much appreciated.
Statistical filters, like Bayesian filters or the bastardised versions employed by some spam filters, are easy to implement, and there is lots of online documentation about them.
The big downside is that it cannot really detect unknown things. You train it with a large sample of known data so that it can categorize new incoming data. But you can turn the traditional spam filter upside down: train it to recognize legitimate data instead of illegitimate data so that anything it doesn't recognize is an anomaly.
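A toy sketch of that flipped approach: model only legitimate traffic with smoothed token frequencies, then flag messages the model finds too unlikely. The corpus, the smoothing, and the threshold below are hypothetical placeholders:

```python
# Toy sketch of the "flipped" filter: learn token frequencies from
# legitimate messages only, then flag anything too unlikely under
# that model. legit_corpus, new_msg, and THRESHOLD are placeholders.
import math
from collections import Counter

legit_tokens = Counter()
total = 0
for msg in legit_corpus:                 # hypothetical training messages
    for tok in msg.lower().split():
        legit_tokens[tok] += 1
        total += 1

def avg_log_likelihood(msg, alpha=1.0):
    """Per-token log-likelihood under the legitimate-data model,
    with Laplace smoothing so unseen tokens don't yield -inf."""
    toks = msg.lower().split()
    vocab = len(legit_tokens) + 1
    ll = sum(math.log((legit_tokens[t] + alpha) / (total + alpha * vocab))
             for t in toks)
    return ll / max(len(toks), 1)

THRESHOLD = -12.0                        # tune on held-out legitimate data
is_anomaly = avg_log_likelihood(new_msg) < THRESHOLD
```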
There are various types of anomaly detection algorithms, depending on the type of data and the problem you are trying to solve:
Anomalies in time series signals:
A time series signal is anything you can draw as a line graph over time (e.g., CPU utilization, temperature, emails per minute, rate of visitors on a webpage, etc.). Example algorithms are Holt-Winters, ARIMA models, Markov models, and more. I gave a talk on this subject a few months ago - it might give you more ideas about algorithms and their limitations.
The video is at: https://www.youtube.com/watch?v=SrOM2z6h_RQ
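As one concrete instance of the Holt-Winters idea mentioned above, here is a sketch using statsmodels: fit the model, then flag points whose residual exceeds three standard deviations. The hourly series and its daily seasonality are assumptions:

```python
# Sketch of Holt-Winters-based anomaly detection on a time series:
# fit the model, then flag points whose residual exceeds 3 sigma.
# `series` is a hypothetical hourly pandas Series with daily seasonality.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing

fit = ExponentialSmoothing(series, trend="add",
                           seasonal="add", seasonal_periods=24).fit()
residuals = series - fit.fittedvalues     # observed minus model fit
sigma = residuals.std()
anomalies = series[np.abs(residuals) > 3 * sigma]
print(anomalies)
```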
Anomalies in tabular data: These are cases where you have feature vectors that describe something (e.g., transforming an email into a feature vector that describes it: number of recipients, number of words, number of capitalized words, counts of keywords, etc.). Given a large set of such feature vectors, you want to detect the ones that are anomalous compared to the rest (sometimes called "outlier detection"). Almost any clustering algorithm is suitable in these cases, but which one is most suitable depends on the type of features and their behavior - real-valued, ordinal, nominal, or otherwise. The type of features determines whether certain distance functions are suitable (the basic requirement for most clustering algorithms), and some algorithms work better with certain types of features than others.
The simplest algorithm to try is k-means clustering, where anomalous samples show up either as very small clusters or as vectors far from all cluster centers (see the sketch below). A one-class SVM can also detect outliers, and has the flexibility of choosing different kernels (and, effectively, different distance functions). Another popular algorithm is DBSCAN.
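A minimal sketch of that k-means scoring idea: rank each vector by its distance to the nearest cluster center and flag the tail. The data matrix X, the number of clusters, and the 1% cutoff are placeholders:

```python
# Sketch: k-means outlier scoring. Each feature vector is scored by
# its distance to the nearest cluster center; the farthest tail is
# flagged. X (n_samples x n_features) is a placeholder.
import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters=20, n_init=10).fit(X)
dist_to_center = np.min(km.transform(X), axis=1)  # transform() returns
                                                  # distances to all centers
cutoff = np.quantile(dist_to_center, 0.99)        # top 1% = anomalies
anomaly_idx = np.where(dist_to_center > cutoff)[0]
```

Checking for very small clusters (few members assigned to a center) catches the other anomaly pattern the paragraph mentions.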
When the anomalies are known, the problem becomes a supervised learning problem, so you can use classification algorithms and train them on the known anomaly examples. However, as mentioned, this would only detect those known anomalies, and if the number of training samples for anomalies is very small, the trained classifiers may not be accurate. Also, because the number of anomalies is typically very small compared to "non-anomalies", when training the classifiers you might want to use techniques like boosting/bagging with oversampling of the anomaly class(es), while optimizing for a very small false positive rate. There are various techniques for this in the literature - one idea that I found to work very well many times is what Viola-Jones used for face detection: a cascade of classifiers. See: http://www.vision.caltech.edu/html-files/EE148-2005-Spring/pprs/viola04ijcv.pdf
(DISCLAIMER: I am the chief data scientist for Anodot, a commercial company doing real time anomaly detection for time series data).

Distributed hierarchical clustering

Are there any algorithms that can help with distributed hierarchical clustering?
Google's MapReduce has only an example of k-means clustering. In the case of hierarchical clustering, I'm not sure how it's possible to divide the work between nodes.
Another resource I found is http://issues.apache.org/jira/browse/MAHOUT-19,
but it's not apparent which algorithms are used.
First, you have to decide if you're going to build your hierarchy bottom-up or top-down.
Bottom-up is called Hierarchical agglomerative clustering. Here's a simple, well-documented algorithm: http://nlp.stanford.edu/IR-book/html/htmledition/hierarchical-agglomerative-clustering-1.html.
Distributing a bottom-up algorithm is tricky because each distributed process needs the entire dataset to make choices about appropriate clusters. It also needs a list of clusters at its current level so it doesn't add a data point to more than one cluster at the same level.
Top-down hierarchy construction is called Divisive clustering. K-means is one option to decide how to split your hierarchy's nodes. This paper looks at K-means and Principal Direction Divisive Partitioning (PDDP) for node splitting: http://scgroup.hpclab.ceid.upatras.gr/faculty/stratis/Papers/tm07book.pdf. In the end, you just need to split each parent node into relatively well-balanced child nodes.
A top-down approach is easier to distribute. After your first node split, each node created can be shipped to a distributed process to be split again and so on... Each distributed process needs only to be aware of the subset of the dataset it is splitting. Only the parent process is aware of the full dataset.
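Here is a minimal single-machine sketch of this top-down idea, using repeated 2-means splits. In a distributed setting, each recursive call needs only its own subset of the data, so it could be shipped to a separate worker; the stopping parameters below are arbitrary placeholders:

```python
# Minimal sketch of top-down (divisive) clustering via repeated
# 2-means splits. Single-machine here; in a distributed setting,
# each recursive call would be shipped to its own worker, since it
# only needs its own subset of the data.
import numpy as np
from sklearn.cluster import KMeans

def divisive(X, indices, min_size=20, depth=0, max_depth=5):
    """Return a nested (tree-shaped) list of index arrays."""
    if len(indices) <= min_size or depth >= max_depth:
        return indices                          # leaf cluster
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(X[indices])
    left, right = indices[labels == 0], indices[labels == 1]
    if len(left) == 0 or len(right) == 0:       # degenerate split: stop
        return indices
    return [divisive(X, left, min_size, depth + 1, max_depth),
            divisive(X, right, min_size, depth + 1, max_depth)]

# tree = divisive(X, np.arange(len(X)))   # X: your data matrix
```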
In addition, each split could be performed in parallel. Two examples for k-means:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.1882&rep=rep1&type=pdf
http://www.ece.northwestern.edu/~wkliao/Kmeans/index.html.
Clark Olson reviews several distributed algorithms for hierarchical clustering:
C. F. Olson. "Parallel Algorithms for Hierarchical Clustering." Parallel Computing, 21:1313-1325, 1995. doi:10.1016/0167-8191(95)00017-I
Parunak et al. describe an algorithm inspired by how ants sort their nests:
H. Van Dyke Parunak, Richard Rohwer, Theodore C. Belding, and Sven Brueckner: "Dynamic Decentralized Any-Time Hierarchical Clustering." In Proc. 4th International Workshop on Engineering Self-Organising Systems (ESOA), 2006. doi:10.1007/978-3-540-69868-5
Check out the very readable, if a bit dated, review by Olson (1995). Most papers since then require a fee to access. :-)
If you use R, I recommend trying pvclust, which achieves parallelism using snow, another R module.
See also "Finding and evaluating community structure in networks" by Newman and Girvan, where they propose an approach for evaluating communities in networks (and a set of algorithms based on this approach), along with a measure of the quality of a network's division into communities (graph modularity).
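As a sketch of what that looks like in practice, networkx ships implementations of both the Girvan-Newman algorithm and graph modularity; the karate-club graph below is just a stand-in network:

```python
# Sketch: Girvan-Newman community detection plus graph modularity,
# using networkx's implementations. The karate-club graph stands in
# for whatever network you are actually clustering.
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

G = nx.karate_club_graph()
communities = next(girvan_newman(G))        # first split, into 2 communities
print([sorted(c) for c in communities])
print("modularity:", modularity(G, communities))
```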
You could look at some of the work being done with Self-Organizing Maps (Kohonen's neural network method)... the folks at the Vienna University of Technology have done some work on distributed calculation of their growing hierarchical map algorithm.
This is a little on the edge of your clustering question, so it may not help, but I can't think of anything closer ;)
