How can I calculate the Pearson cross-correlation matrix of a large (>10 TB) dataset, possibly in a distributed manner? Any efficient distributed algorithm suggestion would be appreciated.
Update:
I read the implementation of the Apache Spark MLlib correlation.
Pearson computation:
/home/d066537/codespark/spark/mllib/src/main/scala/org/apache/spark/mllib/stat/correlation/Correlation.scala
Covariance computation:
/home/d066537/codespark/spark/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala
but to me it looks like all the computation happens on one node and is not distributed in a real sense.
Please shed some light on this. I also tried executing it on a 3-node Spark cluster; below are the screenshots:
As you can see from the 2nd image, the data is pulled onto one node and the computation is then done there. Am I right here?
To start with, have a look at this to see if things are going right. You may then refer to any of these implementations: MPI/OpenMP: Agomezl or Meismyles; MapReduce: Vangjee or Seawolf42. It would also be interesting to read this before you proceed. On a different note, James's thesis provides some pointers if you're interested in computing correlations that are robust to outliers.
Each local dataset can be reduced to standard deviations and covariances.
Standard deviations and covariances (together with the sums needed to merge them) are enough to compute the correlation.
Here is a working example:
https://github.com/jeesim2/distributed-correlation
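To make that concrete, here is a minimal PySpark sketch (my own illustration, not the linked project's code) of the sufficient-statistics approach: each partition folds its rows into local counts, sums, and outer products, and those merge associatively into the full Pearson matrix. The column count and random input are assumptions for the example:

```python
import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="distributed-pearson")

d = 4  # number of columns; an assumption for this toy example
rows = sc.parallelize(np.random.rand(1000, d))  # stand-in for the real dataset

# Each partition folds its rows into (count, sum, sum of outer products);
# these sufficient statistics merge associatively across the cluster.
def seq_op(acc, x):
    n, s, ss = acc
    return (n + 1, s + x, ss + np.outer(x, x))

def comb_op(a, b):
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

zero = (0, np.zeros(d), np.zeros((d, d)))
n, s, ss = rows.aggregate(zero, seq_op, comb_op)

mean = s / n
cov = ss / n - np.outer(mean, mean)   # E[xy] - E[x]E[y]
std = np.sqrt(np.diag(cov))
corr = cov / np.outer(std, std)       # Pearson correlation matrix
print(corr)
```

Because only the (count, sum, outer-product) triples travel over the network, the shuffle stays tiny no matter how large the dataset is.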
Related
Hello, I'm new here; I hope I have entered everything correctly and this question is in the right forum. I have also checked, and no previous question seems comparable with this one.
To my question:
I am currently working on the validation of clustering methods using the package clValid. The dataset I work with is very large (1,000 to 25,000); it is gene expression data. The question is which methods for validating high-dimensional datasets come into question at all. Maybe there is another package for validating clustering in high-dimensional space? Do I have to do a PCA first? How big can my dataset be so that I can still use clValid on it? (I don't want my computer to run for hours; or should I just let it run and wait for a result with a small 100x500 dataset?) I am grateful for every suggestion; maybe there are solutions I haven't thought about yet.
I would rather not rely on any of these indexes.
These measures usually require clusters to be complete and disjoint, and that does not hold for typical gene biclusters: there are genes not involved in any of the effects observed in the experiment.
The measures were also usually designed with low-dimensional Gaussian data in mind; once you have such high-dimensional data, where all distances are large, they report that there is no contrast between clusters (because their measure does not see contrast between any two data points).
I fear that you may need to evaluate by complex, domain-specific analysis.
I have written K-means clustering code in MapReduce on Hadoop. If I have a small number of clusters, say 2, and the data is very large, the whole data is divided into two sets and each Reducer receives too many values for a particular key, i.e. the cluster centroid. How do I solve this?
Note: I use the iterative approach to calculate the new centers.
Algorithmically, there is not much you can do, as the nature of this algorithm is what you describe. The only option, in this respect, is to use more clusters and divide your data among more reducers, but that yields a different result.
So the only thing that you can do, in my opinion, is compression. And I do not only mean using a compression codec of Hadoop.
For instance, you could find a compact representation of your data. E.g., give an integer id to each element and only pass this id to the reducers. This will save network traffic (store elements as VIntWritables, or define your own VIntArrayWritable extending ArrayWritable) and memory on each reducer.
In this case of k-means, I think that a combiner is not applicable, but if it is, it would greatly reduce the network and the reducer's overhead.
EDIT: It seems that you CAN use a combiner if you follow this iterative implementation. Please edit your question to describe the algorithm that you have implemented.
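To make the combiner idea concrete, here is a hedged sketch in Python using the mrjob library (my assumption; the same pattern works in plain Java MapReduce). The mapper emits (partial sum, count) pairs keyed by centroid id, and the combiner collapses them locally so each reducer receives only a few pre-aggregated records per centroid instead of every single point:

```python
import numpy as np
from mrjob.job import MRJob

# Current centroids; in a real job these would be shipped to every task at
# the start of each iteration (e.g. via a side file). Assumed here.
CENTROIDS = np.array([[0.0, 0.0], [10.0, 10.0]])

class KMeansStep(MRJob):
    def mapper(self, _, line):
        point = np.array([float(v) for v in line.split(',')])
        cid = int(((CENTROIDS - point) ** 2).sum(axis=1).argmin())
        yield cid, (point.tolist(), 1)  # (partial sum, count)

    def combiner(self, cid, values):
        # Collapse partial sums locally, before the shuffle.
        total, count = np.zeros(CENTROIDS.shape[1]), 0
        for vec, n in values:
            total += vec
            count += n
        yield cid, (total.tolist(), count)

    def reducer(self, cid, values):
        total, count = np.zeros(CENTROIDS.shape[1]), 0
        for vec, n in values:
            total += vec
            count += n
        yield cid, (total / count).tolist()  # the new centroid

if __name__ == '__main__':
    KMeansStep.run()
```

Each combiner output is a single (sum, count) pair, so the reducer sees at most one record per map task per centroid rather than every point.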
If you have too much shuffle, you will run into OOM issues.
Try splitting the dataset into smaller chunks, and tune
yarn.app.mapreduce.client.job.retry-interval
and
mapreduce.reduce.shuffle.retry-delay.max.ms
so that there are more splits, but the retries of the job wait long enough that there are no OOM issues.
I have a set of data I have generated that consists of extracted mass values (well, m/z, but that's not so important) and a time. I extract the data from a file; however, it is possible to get repeat measurements, and this results in a large amount of redundancy within the dataset. I am looking for a method to cluster these in order to group those that are related based on either similarity in mass alone, or similarity in mass and time.
An example of data that should be grouped together is:
m/z time
337.65 1524.6
337.65 1524.6
337.65 1604.3
However, I have no way to determine how many clusters I will have. Does anyone know of an efficient way to accomplish this, possibly using a simple distance metric? I am sadly not familiar with clustering algorithms.
http://en.wikipedia.org/wiki/Cluster_analysis
http://en.wikipedia.org/wiki/DBSCAN
Read the section about hierarchical clustering, and also look into DBSCAN if you really don't want to specify the number of clusters in advance. You will need to define a distance metric, and that step is where you determine which features, or combination of features, you will be clustering on.
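As a hedged illustration with scikit-learn (my assumption, nothing the question requires): once each dimension is divided by its tolerance, DBSCAN groups the rows without a preset cluster count. The eps and tolerance values below are made up and would need tuning:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# The three example rows from the question.
data = np.array([[337.65, 1524.6],
                 [337.65, 1524.6],
                 [337.65, 1604.3]])

# Scale each dimension by its tolerance (0.1 m/z units, 100 s; both made up)
# so that a Euclidean eps of 1.0 means "within tolerance on both axes".
tolerances = np.array([0.1, 100.0])
labels = DBSCAN(eps=1.0, min_samples=1).fit_predict(data / tolerances)
print(labels)  # rows that share a label belong to the same group
```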
Why don't you just set a threshold?
If successive values (by time) do not differ by more than ±0.1 (in m/z), they are grouped together. Alternatively, use a relative threshold: differ by less than ±0.1%. Set these thresholds according to your domain knowledge.
That sounds like the straightforward way of preprocessing this data to me.
Using a "clustering" algorithm here seems total overkill to me. Clustering algorithms will try to discover much more complex structures than what you are trying to find here. The result will likely be surprising and hard to control. The straightforward change-threshold approach (which I would not call clustering!) is very simple to explain, understand and control.
For the simple one-dimensional case, K-means clustering (http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm) is appropriate and can be used directly. The only issue is selecting an appropriate K. One good way to select K is to plot K vs. residual variance and pick the K that "dramatically" reduces the variance. Another strategy is to use an information criterion (e.g., the Bayesian Information Criterion).
You can extend K-means to multi-dimensional data easily, but you should beware of the scaling of the individual dimensions. E.g., among the items (1 kg, 1 km) and (2 kg, 2 km), the nearest point to (1.7 kg, 1.4 km) is (2 kg, 2 km) at these scales. But once you start expressing the second coordinate in meters, the opposite is true.
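As a hedged sketch of the "plot K vs. residual variance" idea, using scikit-learn (my assumption) on made-up one-dimensional data with three obvious groups:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up one-dimensional data with three well-separated groups.
values = np.concatenate([np.random.normal(m, 0.1, 50) for m in (1.0, 5.0, 9.0)])
X = values.reshape(-1, 1)  # sklearn expects a 2-D column vector

for k in range(1, 7):
    inertia = KMeans(n_clusters=k, n_init=10).fit(X).inertia_
    print(k, round(inertia, 2))  # the drop should flatten out after k=3
```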
I have a list of tweets with their geo locations.
They are going to be displayed in a heatmap image transparently placed over Google Map.
The trick is to find groups of locations residing next to each other and display
them as a single heatmap circle/figure of a certain heat/color, based on cluster size.
Is there some library ready for grouping locations on a map into clusters?
Or should I rather decide on my clustering parameters and build a custom algorithm?
I don't know if there is a library ready for grouping locations on a map into clusters; maybe there is, maybe there isn't. Anyway, I don't recommend that you build your own clustering algorithm, since there are a lot of libraries already implemented for this.
#recursive sent you a link with PHP code for k-means (one clustering algorithm). There is also a huge Java library with other techniques (Java-ML), including k-means, hierarchical clustering, k-means++ (to select the centroids), etc.
Finally, I'd like to point out that clustering is an unsupervised algorithm, which means that it will give you a set of clusters with data inside them, but at first glance you don't know how the algorithm clustered your data. It may be clustered by location as you want, but it can also be clustered by some other characteristic you don't need, so it's all about playing with the parameters of the algorithm and tuning your solution.
I'm interested in the final solution you find to this problem :) Maybe you can share it in a comment when you finish this project!
K-means clustering is a technique often used for such problems.
The basic idea is this: given an initial set of k means m1,…,mk, the algorithm proceeds by alternating between two steps:
Assignment step: assign each observation to the cluster with the closest mean.
Update step: calculate the new means as the centroids of the observations in each cluster.
Here is some sample code for PHP.
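For reference, here is a minimal sketch of those two steps in Python (not the linked PHP sample); the coordinates are made up, and empty clusters are not handled:

```python
import numpy as np

def kmeans(points, k, iters=20):
    # Start from k randomly chosen points as the initial means.
    means = points[np.random.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point goes to the cluster with the closest mean.
        dists = ((points[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each mean becomes the centroid of its cluster.
        means = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return means, labels

pts = np.array([[40.71, -74.00], [40.72, -74.01],     # two points near NYC
                [34.05, -118.24], [34.06, -118.25]])  # two points near LA
centers, labels = kmeans(pts, k=2)
print(centers, labels)
```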
heatmap.js is an HTML5 library for rendering heatmaps, and has a sample for doing it on top of the Google Maps API. It's pretty robust, but only works in browsers that support canvas:
The heatmap.js library is currently supported in Firefox 3.6+, Chrome 10, Safari 5, Opera 11 and IE 9+.
You can try my PHP class "Hilbert curve" at phpclasses.org. It's a monster curve that reduces 2-D complexity to 1-D. I use a quadkey to address a coordinate, and it has 21 zoom levels, like Google Maps.
This isn't really a clustering problem. Heat maps don't work by creating clusters. Instead, they convolve the data with a Gaussian kernel. If you're not familiar with image processing, think of it as using a normal or Gaussian "stamp" and stamping it over each point. Since the overlapping stamps add up on top of each other, areas of high density get higher values.
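A small sketch of that idea (using scipy, which is my assumption): bin the points into a grid, then convolve the grid with a Gaussian kernel, which is equivalent to stamping a Gaussian over each point:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Made-up coordinates standing in for the tweet locations.
lats = np.random.uniform(40.70, 40.80, 500)
lngs = np.random.uniform(-74.05, -73.95, 500)

# Bin the points into a grid, then "stamp" a gaussian over each point by
# convolving the whole grid with a gaussian kernel.
grid, _, _ = np.histogram2d(lats, lngs, bins=200)
heat = gaussian_filter(grid, sigma=3)  # sigma controls the stamp radius

# `heat` can now be colour-mapped and drawn as a transparent overlay.
```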
One simple alternative for heatmaps is to just round the lat/long to some number of decimals and group by that.
See this explanation of lat/long decimal accuracy:
1 decimal - 11km
2 decimals - 1.1km
3 decimals - 110m
etc.
For a low zoom level heatmap with lots of data, rounding to 1 or 2 decimals and grouping the results by that should do the trick.
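A minimal sketch of that rounding approach (the coordinates are made up):

```python
from collections import Counter

points = [(40.7128, -74.0060), (40.7131, -74.0058), (40.7306, -73.9352)]

# Round to 2 decimals (~1.1 km cells) and count points per cell.
cells = Counter((round(lat, 2), round(lng, 2)) for lat, lng in points)
print(cells)  # Counter({(40.71, -74.01): 2, (40.73, -73.94): 1})
```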
Here's my scenario. Consider a set of events that happen at various places and times. As an example, consider someone high above recording the lightning strikes in a city during a storm. For my purpose, lightning strikes are instantaneous and can only hit certain locations (such as tall buildings). Also imagine that each strike has a unique id, so one can reference it later. There are about 100,000 such locations in this city (as you may guess, this is an analogy, as my current employer is sensitive about the actual problem).
For phase 1, my input is the set of (strike id, strike time, strike location) tuples. The desired output is the set of clusters of more than one event that hit the same location within a short time. The number of clusters is not known in advance (so k-means is not that useful here). What counts as 'short' could be predefined for a given clustering attempt; that is, I can set it to, say, 3 minutes, then run the algorithm, and later try with 4 minutes or 10 minutes. Perhaps a nice touch would be for the algorithm to determine a 'strength' of clustering and recommend the value of 'short' that yields the most compact clustering for a given input, but this is not required initially.
For phase 2, I'd like to take the amplitude of the strike (i.e., a real number) into consideration and look for clusters that are both within a short time and of similar amplitude.
I googled and checked the answers here about data clustering. The information is a bit bewildering (below is the list of links I found useful). AFAIK, k-means and related algorithms would not be useful because they require the number of clusters to be specified a priori. I'm not asking anyone to solve my problem (I like solving it), but some orientation in the large world of data clustering algorithms would be useful in order to save time. Specifically: which clustering algorithms are appropriate when the number of clusters is unknown?
Edit: I realized the location is irrelevant, in the sense that although events happen all the time, I only need to cluster them per location. So each location has its own time series of events that can thus be analyzed independently.
Some technical details:
- as the dataset is not that large, it can fit all in memory.
- parallel processing is a nice-to-have, but not essential. I only have a 4-core machine, and MapReduce and Hadoop would be too much.
- the language I'm most familiar with is Java. I haven't yet used R, and its learning curve would probably be too steep for the time I was given. I'll have a look at it anyway in my spare time.
- for the time being, using tools to run the analysis is OK; I don't have to produce code. I'm mentioning this because Weka will probably be suggested.
- visualization would be useful. As the dataset is too large to view all at once, the visualization should at least support zooming and panning. And to clarify: I don't need to build a visualization GUI; it's just a nice capability to use for checking the results produced with a tool.
Thank you. Questions that I found useful are: "How to find center of clusters of numbers? statistics problem?", "Clustering Algorithm for Paper Boys", "Java Clustering Library", "How to cluster objects (without coordinates)", "Algorithm for detecting 'clusters' of dots".
I would suggest you look into mean shift clustering. The basic idea behind mean shift is to take the data and perform a kernel density estimation, then find the modes of the density estimate; the regions of convergence of data points towards the modes define the clusters.
The nice thing about mean shift clustering is that the number of clusters does not have to be specified ahead of time.
I have not used Weka, so I am not sure if it has mean shift clustering. However, if you are using MATLAB, here is a toolbox (KDE toolbox) to do it. Hope that helps.
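If Python is an option, scikit-learn has a MeanShift implementation; here is a hedged sketch on made-up event times for a single location (the bandwidth plays the role of the 'short' window):

```python
import numpy as np
from sklearn.cluster import MeanShift

# Made-up event times (in seconds) at one location; bursts are separated
# by gaps much larger than the bandwidth.
times = np.array([[10.0], [15.0], [22.0], [600.0], [610.0], [2000.0]])

ms = MeanShift(bandwidth=180.0).fit(times)  # bandwidth ~ the "short" window
print(ms.labels_)           # cluster id per event; no cluster count given
print(ms.cluster_centers_)  # the density modes found
```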
Couldn't you just use hierarchical clustering with the difference in times of strikes as part of the distance metric?
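For example (a hedged sketch with scipy, again on made-up strike times for one location): single-linkage hierarchical clustering on the times, cutting the dendrogram at the chosen 'short' threshold:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Single-linkage merges events that are close in time; the tree is then
# cut at the chosen "short" threshold (180 s here).
times = np.array([10.0, 15.0, 22.0, 600.0, 610.0, 2000.0]).reshape(-1, 1)
labels = fcluster(linkage(times, method='single'), t=180.0, criterion='distance')
print(labels)  # events within 180 s of each other share a cluster id
```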
It is a bit late, but I would still add it:
In R, there is a package fpc that has a method pamk() which gives you the clusters. With pamk(), you do not need to specify the number of clusters initially; it estimates the number of clusters in the input data by itself.