I'm trying to find out which statistical/data-mining algorithms in R, or in R packages on CRAN/GitHub/R-Forge, can handle large datasets, either in parallel on one server, or sequentially without running into out-of-memory issues, or across several machines at once.
This is in order to evaluate whether I can easily port them to work with ff/ffbase, like ffbase::bigglm.ffdf.
I would like to split these up into three parts:

1. Algorithms that update or work on parameter estimates in parallel:
   - Buckshot (https://github.com/lianos/buckshot)
   - lm.fit from pbdR, Programming with Big Data in R (https://github.com/RBigData)

2. Algorithms that work sequentially (get data into R, but only one process is used and only that process updates the parameters):
   - bigglm (http://cran.r-project.org/web/packages/biglm/index.html)
   - Compound Poisson linear models (http://cran.r-project.org/web/packages/cplm/index.html)
   - k-means from biganalytics (http://cran.r-project.org/web/packages/biganalytics/index.html)

3. Algorithms that work on part of the data:
   - Distributed text processing (http://www.jstatsoft.org/v51/i05/paper)
I would like to exclude simple parallelisation, like optimising over a hyperparameter by e.g. cross-validating.
Any other pointers to these kinds of models/optimisers or algorithms? Maybe Bayesian ones? Maybe a package called RGraphlab (http://graphlab.org/)?
Have you read through the High Performance Computing Task View on CRAN?
It covers many of the points you mention and gives overviews of packages in those areas.
Random forests are trivial to run in parallel. It's one of the examples in the foreach vignette (shown here with a parallel backend registered, which %dopar% needs in order to actually run in parallel):

library(randomForest); library(foreach); library(doParallel)
registerDoParallel(cores = 4)   # register a backend so %dopar% runs in parallel

x <- matrix(runif(500), 100)
y <- gl(2, 50)

# grow 4 sub-forests of 250 trees each and combine them into a single forest
rf <- foreach(ntree = rep(250, 4), .combine = combine,
              .packages = 'randomForest') %dopar%
  randomForest(x, y, ntree = ntree)
You can use this construct to split your forest over every core in your cluster.
Related
I am running an Elastic Net model using sklearn. My dataset has 70k observations and 20 features. I want to test different parameters and use the following code:
import numpy as np
from sklearn.linear_model import ElasticNet

# min_xlim / max_xlim are defined elsewhere
alpha_plot, l1_ratio_plot = np.linspace(min_xlim, max_xlim, 50), np.linspace(0, 1, 10)
alpha_grid, l1_ratio_grid = np.meshgrid(alpha_plot, l1_ratio_plot)
l1_ratio_alpha_grid = np.array([l1_ratio_grid.ravel(), alpha_grid.ravel()]).T

model_coefficients_analysis = []
for i in l1_ratio_alpha_grid:
    model_analysis = ElasticNet(alpha=i[1], l1_ratio=i[0], fit_intercept=True, max_iter=10000).fit(self.features_train_std, self.labels_train)
    model_coefficients_analysis.append(model_analysis.coef_)
I am aware that this can be done with GridSearchCV, but it doesn't do the job for me, as I need to store the coefficients for every combination of parameters tested. The current code snippet is exceptionally slow: it takes roughly 10 minutes for each of the 50*10 iterations. Is there a way to speed up the process? For example, GridSearchCV has a parameter n_jobs which can be set to -1 to speed things up, but here I do not seem to find an equivalent.
It takes roughly 10 minutes for each of the 50*10 iterations
That seems very high, but you also have rather large data; I can't fit a randomly generated dataset of that size in memory in Colab (where I usually run examples for answers here). You might not be able to shrink the first fit time very much, but maybe you can reduce the subsequent fit times by using warm-starting.
Setting warm_start=True and using the same model object for each iteration, the coefficients will be saved and used as a starting point for the solver in the next iteration:

model_analysis = ElasticNet(warm_start=True, fit_intercept=True, max_iter=10000)
for i in l1_ratio_alpha_grid:
    model_analysis.set_params(alpha=i[1], l1_ratio=i[0])
    model_analysis.fit(self.features_train_std, self.labels_train)
    # store a copy, since the same estimator object is reused across fits
    model_coefficients_analysis.append(model_analysis.coef_.copy())
You might consider using ElasticNetCV, since it uses warm-starting internally, and it provides some other niceties. You can use a PredefinedSplit if adding k-fold cross-validation is too much of an added expense, but I believe the n_jobs parameter is only useful in splitting up jobs across hyperparameters and folds, so using more cores might mitigate the issues with k-fold (but then you'll also have k times as many coefficients).
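For illustration, a minimal sketch of that route, reusing alpha_plot and l1_ratio_plot from the question (cv=5 and n_jobs=-1 here are assumptions, not requirements):

from sklearn.linear_model import ElasticNetCV

# Sketch only: searches the same grid, warm-starting along the alpha path
# internally; n_jobs parallelizes over the l1_ratio values and CV folds.
# (sklearn's docs suggest l1_ratio values closer to 1 are more reliable.)
model_cv = ElasticNetCV(alphas=alpha_plot, l1_ratio=list(l1_ratio_plot),
                        fit_intercept=True, max_iter=10000,
                        cv=5, n_jobs=-1)
model_cv.fit(self.features_train_std, self.labels_train)
print(model_cv.alpha_, model_cv.l1_ratio_)  # best combination found

Keep in mind that this only retains the coefficients of the best model (coef_) plus per-combination errors (mse_path_), not per-combination coefficients, so the warm-started loop above may still fit your use case better.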
Your large max_iter is a bit worrying; do you get nonconvergence? From your independent variable's name it seems like you're scaling, but if not, that's the place to start: fast (and maybe correct) convergence depends on features with similar scales. You might also consider loosening the convergence criterion tol. I have no experience with the selection parameter, but its docstring suggests that changing it to 'random' may speed up convergence.
I am trying to use Apache Spark to load up a file, distribute the file to several nodes in my cluster, and then aggregate the results and obtain them. I don't quite understand how to do this.
From my understanding the reduce action enables Spark to combine the results from different nodes and aggregate them together. Am I understanding this correctly?
From a programming perspective, I don't understand how I would code this reduce function.
How exactly do I partition the main dataset into N pieces and have them processed in parallel using a list of transformations?
reduce is supposed to take in two elements and a function for combining them. Are these two elements supposed to be RDDs in the context of Spark, or can they be any type of element? Also, if you have N different partitions running in parallel, how would reduce aggregate all their results into one final result (since the reduce function aggregates only two elements)?
Also, I don't understand this example. The example from the Spark website uses reduce, but I don't see the data being processed in parallel. So what is the point of the reduce? If I could get a detailed explanation of the loop in this example, I think that would clear up most of my questions.
// Gradient contribution of a single data point (logistic regression)
class ComputeGradient extends Function<DataPoint, Vector> {
    private Vector w;
    ComputeGradient(Vector w) { this.w = w; }
    public Vector call(DataPoint p) {
        return p.x.times(p.y * (1 / (1 + Math.exp(w.dot(p.x))) - 1));
    }
}

JavaRDD<DataPoint> points = spark.textFile(...).map(new ParsePoint()).cache();
Vector w = Vector.random(D);  // current separating plane
for (int i = 0; i < ITERATIONS; i++) {
    // map: each partition computes per-point gradients in parallel;
    // reduce: the gradients are summed pairwise into one vector
    Vector gradient = points.map(new ComputeGradient(w)).reduce(new AddVectors());
    w = w.subtract(gradient);
}
System.out.println("Final separating plane: " + w);
Also, I have been trying to find the source code for reduce on the Apache Spark GitHub, but the source is pretty huge and I haven't been able to pinpoint it. Could someone please direct me towards which file I could find it in?
That is a lot of questions. In the future, you should break this up into multiple questions. I will give a high-level overview that should answer them for you.
First, here is the file with reduce.
Second, most of your problems come from trying to micromanage too much (only necessary if you need to performance-tune). You first need to understand the core of what Spark is about and what an RDD is: it is a collection that is parallelized under the hood, but from your programming perspective it is just another collection. And reduce is just a function on that collection, a common one in functional programming. All it does is run an operator across your whole collection, turning it into one result, like below:
((item1 op item2) op item3) op ....
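In other words, the two things the combining function receives are elements of the collection, not RDDs, and because the operator is associative (and, for Spark, commutative), it can be applied inside each partition in parallel and the per-partition results combined afterwards. A minimal sketch of the same fold in plain Python (functools.reduce is sequential; it only illustrates the semantics):

from functools import reduce

items = [1, 2, 3, 4, 5]
op = lambda a, b: a + b    # must be associative and commutative for Spark's reduce

total = reduce(op, items)  # ((((1 + 2) + 3) + 4) + 5) = 15
print(total)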
Last, in the example, the code is merely running an iterative algorithm over the data to converge on some point. This is a common task for machine learning algorithms.
Again, I wouldn't focus on the details until you get a better high-level understanding of distributed programming. Spark is just an abstraction on top to turn this type of programming back into regular code :)
My topic is similarity and clustering of (a bunch of) texts. In a nutshell: I want to cluster collected texts together, and they should appear in meaningful clusters at the end. To do this, my approach up to now is as follows; my problem is in the clustering. The current software is written in PHP.
1) Similarity:
I treat every document as a "bag of words" and convert the words into vectors. I use
filtering (only "real" words),
tokenization (splitting sentences into words),
stemming (reducing words to their base form; Porter's stemmer),
pruning (cutting off words with too high and too low frequency)
as methods for dimensionality reduction. After that, I'm using cosine similarity (as suggested/described on various sites on the web and here).
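For reference, cosine similarity between two term-count vectors is just their normalized dot product. A minimal sketch (Python for brevity, though the project is in PHP):

import math

def cosine_similarity(u, v):
    # dot(u, v) / (|u| * |v|): 1.0 = same direction, 0.0 = no shared terms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# two tiny bag-of-words count vectors over the same vocabulary
print(cosine_similarity([2, 1, 0, 1], [1, 1, 1, 0]))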
The result then is a similarity matrix like this:
    A   B   C   D   E
A   0  30  51  75  80
B   X   0  21  55  70
C   X   X   0  25  10
D   X   X   X   0  15
E   X   X   X   X   0
A…E are my texts and the numbers are similarities in percent; the higher, the more similar the two texts are. Because sim(A,B) == sim(B,A), only half of the matrix is filled in. So, for example, the similarity of text A to text D is 75%.
I now want to generate an a priori unknown(!) number of clusters out of this matrix. The clusters should group the similar items together (up to a certain stopping criterion).
I tried a basic implementation myself, which was basically like this (with 60% as a fixed similarity threshold):

foreach article
    get similar entries where sim > 60
    foreach similar entry
        check if one of the entries already has a cluster number
        if no: assign new cluster number to all similar entries
        if yes: use that number
It worked (somehow), but wasn't good at all, and the results were often monster clusters.
So I want to redo this, and I have already looked into all kinds of clustering algorithms, but I'm still not sure which one will work best. I think it should be an agglomerative algorithm, because every pair of texts can be seen as a cluster in the beginning. But the questions remain: what is the stopping criterion, and should the algorithm divide and/or merge existing clusters?
Sorry if some of the stuff seems basic, but I am relatively new in this field. Thanks for the help.
Since you're new to the field, have an unknown number of clusters, and are already using cosine distance, I would recommend the FLAME clustering algorithm.
It's intuitive, easy to implement, and has implementations in a large number of languages (not PHP though, largely because very few people use PHP for data science).
Not to mention, it's actually good enough to be used in research by a large number of people. If nothing else, you can get an idea of exactly which shortcomings of this clustering algorithm you want to address when moving on to another one.
Just try some. There are so many clustering algorithms out there that nobody knows all of them. It also depends a lot on your data set and the clustering structure that is in it.
In the end, there may also be just one monster cluster with respect to cosine distance and bag-of-words features.
Maybe you can transform your similarity matrix into a dissimilarity matrix, for example by transforming x to 1/x (or 1 - x for normalized similarities); then your problem is to cluster a dissimilarity matrix. I think hierarchical clustering may work. These may help you: hierarchical clustering and clustering a dissimilarity matrix.
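To make that concrete, a minimal sketch with SciPy (Python for brevity; the similarity values are the ones from the question, converted via 1 - x/100, and the 0.4 cut mirrors the 60% threshold already tried):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Full symmetric similarity matrix (percent) for texts A..E from the question
sim = np.array([
    [100,  30,  51,  75,  80],
    [ 30, 100,  21,  55,  70],
    [ 51,  21, 100,  25,  10],
    [ 75,  55,  25, 100,  15],
    [ 80,  70,  10,  15, 100],
], dtype=float)

dist = 1.0 - sim / 100.0                           # similarity -> dissimilarity
Z = linkage(squareform(dist), method='average')    # agglomerative clustering
labels = fcluster(Z, t=0.4, criterion='distance')  # cut tree at dissimilarity 0.4
print(labels)  # with this data and cut, only A and E share a cluster; raise t to merge more

The cut height t plays the role of your stopping criterion, and you can inspect the full merge tree Z to pick it rather than fixing it a priori.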
Let's say I have two fairly large data sets. The first, called "Base", contains 200 million tab-delimited rows; the second, called "MatchSet", contains 10 million tab-delimited rows of similar data.
Let's say I also have an arbitrary function Match(row1, row2), where Match() essentially contains some heuristics for looking at row1 (from MatchSet), comparing it to row2 (from Base), and determining if they are similar in some way.
Let's say the rules implemented in Match() are custom and complex, i.e. not a simple string match, involving some proprietary methods. Let's say for now that Match(row1, row2) is written in pseudo-code, so implementation in another language is not a problem (though it's in C++ today).
In a linear model, i.e. a program running on one giant processor, we would read each line from MatchSet and each line from Base, compare one to the other using Match(), and write out our match stats. For example, we might capture: X records from MatchSet are strong matches, Y records from MatchSet are weak matches, Z records from MatchSet do not match. We would also write the strong/weak/no-match values to separate files for inspection. In other words, a nested loop of sorts:
for each row1 in MatchSet
{
    for each row2 in Base
    {
        var type = Match(row1, row2);
        switch (type)
        {
            // do something based on type
        }
    }
}
I've started considering Hadoop Streaming as a method for running these comparisons as a batch job in a short amount of time. However, I'm having a bit of a hard time getting my head around the map-reduce paradigm for this type of problem.
I understand pretty clearly at this point how to take a single input from Hadoop, crunch the data using a mapping function, and then emit the results to reduce. However, the "nested-loop" approach of comparing two sets of records is messing with me a bit.
The closest I'm coming to a solution is that I would basically still have to do a 10 million record comparison in parallel across the 200 million records, i.e. 200 million / n nodes * 10 million iterations per node. Is that the most efficient way to do this?
From your description, it seems to me that your problem can be arbitrarily complex and could be a victim of the curse of dimensionality.
Imagine for example that your rows represent n-dimensional vectors, and that your matching function is "strong", "weak" or "no match" based on the Euclidean distance between a Base vector and a MatchSet vector. There are great techniques to solve these problems with a trade-off between speed, memory and the quality of the approximate answers. Critically, these techniques typically come with known bounds on time and space, and the probability to find a point within some distance around a given MatchSet prototype, all depending on some parameters of the algorithm.
Rather than for me to ramble about it here, please consider reading the following:
Locality Sensitive Hashing
The first few hits on Google Scholar when you search for "locality sensitive hashing map reduce". In particular, I remember reading [Das, Abhinandan S., et al. "Google news personalization: scalable online collaborative filtering." Proceedings of the 16th international conference on World Wide Web. ACM, 2007] with interest.
Now, on the other hand, if you can devise a scheme that is directly amenable to some form of hashing, then you can easily produce a key for each record with such a hash (or even a small number of possible hash keys, one of which would match the query "Base" data), and the problem becomes a simple large(-ish) scale join. (I say "largish" because joining 200M rows with 10M rows is quite small if the problem is indeed a join.) As an example, consider the way CDDB computes the 32-bit ID for any music CD (CDDB1 calculation). Sometimes a given title may yield slightly different IDs (i.e. different CDs of the same title, or even the same CD read several times), but by and large there is a small set of distinct IDs for that title. At the cost of a small replication of the MatchSet, in that case you can get very fast search results.
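As a sketch of that join shape (plain Python, no Hadoop; signature() is a hypothetical stand-in for whatever hashing scheme you devise, with the property that true matches share at least one key):

from collections import defaultdict

def candidate_pairs(base_rows, matchset_rows, signature):
    # Index Base rows by every hash key they produce
    buckets = defaultdict(list)
    for row2 in base_rows:
        for key in signature(row2):
            buckets[key].append(row2)
    # Only rows sharing a bucket ever reach the expensive Match()
    for row1 in matchset_rows:
        seen = set()
        for key in signature(row1):
            for row2 in buckets.get(key, ()):
                if id(row2) not in seen:   # avoid yielding the same pair twice
                    seen.add(id(row2))
                    yield row1, row2

Instead of 200M x 10M calls to Match(), you pay one pass to build the index plus only the within-bucket comparisons; in a MapReduce setting, the hash key becomes the shuffle key, so rows from both sets with the same key meet at the same reducer.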
Check Section 3.5, Relational Joins, in the paper 'Data-Intensive Text Processing with MapReduce'. I haven't gone into detail, but it might help you.
This is an old question, but your proposed solution is correct, assuming that your single-stream job does 200M * 10M Match() computations. By doing N batches of (200M / N) * 10M computations, you've achieved a factor-of-N speedup. By doing the computations in the map phase and then thresholding and steering the results to Strong/Weak/No Match reducers, you can gather the results for output to separate files.
If additional optimizations could be utilized, they'd likely apply to both the single-stream and parallel versions. Examples include blocking, so that you need to do fewer than 200M * 10M computations, or precomputing the constant portions of the algorithm for the 10M match set.
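As a rough illustration of that map phase in Hadoop Streaming terms (Python; matchset.tsv and the match() stub are placeholders: the real MatchSet would be shipped to every node, e.g. via the distributed cache, and match() would wrap the C++ heuristics):

import sys

# Placeholder: load the 10M-row MatchSet once per mapper (assumes it fits in memory)
match_set = [line.rstrip('\n').split('\t') for line in open('matchset.tsv')]

def match(row1, row2):
    # stand-in for the proprietary Match() heuristics from the question
    return 'strong'  # or 'weak' / 'none'

# Each mapper receives its slice of Base on stdin (the 200M / N split)
for line in sys.stdin:
    row2 = line.rstrip('\n').split('\t')
    for row1 in match_set:
        kind = match(row1, row2)
        if kind != 'none':
            # key by match strength so the shuffle routes strong/weak separately
            print('%s\t%s' % (kind, '\t'.join(row1)))

A reducer keyed on the first field then just writes each group to its own strong or weak output file.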
In order to compute the product of two matrices A and B (each of dimension n x m) in parallel, I have the following restrictions: the server sends each client a number of rows from matrix A and a number of rows from matrix B. This cannot be changed. Further, the clients may exchange information between each other so that the matrix product can be computed, but they cannot ask the server to send any other data.
This should be done as efficiently as possible, meaning by minimizing the number of messages sent between processes (considered an expensive operation) and by doing the small calculations in parallel, as much as possible.
From what I have researched, practically the highest number of messages exchanged between the clients is n^2, in the case where each process broadcasts its rows to all the others. Now, the problem is that if I minimize the number of messages sent (this would be around log(n) for distributing the input data), the computation is then only done by one process, or a few; in any case, it is no longer done in parallel, which was the main idea of the problem.
What could be a more efficient algorithm, that would compute this product?
(I am using MPI, if it makes any difference).
To compute the matrix product C = A x B element by element, you simply calculate C(i,j) = dot_product(A(i,:), B(:,j)). That is, the (i,j) element of C is the dot product of row i of A and column j of B.
If you insist on sending rows of A and rows of B around, then you are going to have a tough time writing a parallel program whose performance exceeds that of a straightforward serial program. Rather, what you ought to do is send rows of A and columns of B to processors for the computation of elements of C. If you are constrained to send rows of A and rows of B, then I suggest you do that, but compute the product on the server. That is, ignore all the worker processors and just perform the calculation serially.
One alternative would be to compute partial dot products on the worker processors and accumulate the partial results. This will require some tricky programming; it can be done, but I will be very surprised if, at your first attempt, you can write a program which outperforms (in execution speed) a simple serial program.
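For what it's worth, here is one way the partial-products idea can be organized: a minimal sketch in Python/NumPy that simulates a ring of p workers in a single process (real code would use MPI messages; square n x n matrices with n divisible by p are assumed purely for brevity):

import numpy as np

# Sketch only: each worker keeps its rows of A and of B; the B-row blocks
# circulate around a ring while each worker accumulates its rows of C.
p, n = 4, 8
b = n // p
A = np.random.rand(n, n)
B = np.random.rand(n, n)

A_rows = [A[r*b:(r+1)*b, :] for r in range(p)]   # each worker's rows of A
B_rows = [B[r*b:(r+1)*b, :] for r in range(p)]   # each worker's rows of B
owner  = list(range(p))                          # which B block each worker holds
C_rows = [np.zeros((b, n)) for _ in range(p)]    # each worker's rows of C

for step in range(p):
    for r in range(p):
        s = owner[r]
        # B rows s*b:(s+1)*b pair with columns s*b:(s+1)*b of this worker's A rows
        C_rows[r] += A_rows[r][:, s*b:(s+1)*b] @ B_rows[s]
    if step < p - 1:
        owner = owner[-1:] + owner[:-1]          # pass B blocks around the ring

assert np.allclose(np.vstack(C_rows), A @ B)     # each worker ends up with rows of C

In real MPI this would be p-1 rounds of point-to-point sends (e.g. MPI_Sendrecv between ring neighbours), so O(p) communication steps in total, with every worker computing throughout.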
(Yes, there are other approaches to decomposing matrix-matrix products for parallel execution, but they are more complicated than the foregoing. If you want to investigate these then Matrix Computations is the place to start reading.)
You also need to think hard about your proposed measures of efficiency: the most efficient message-passing program is the one which passes no messages. If the cost of message passing far outweighs the cost of computation, then the no-message-passing implementation will be the most efficient by both measures. Generally, though, measures of the efficiency of parallel programs are ratios of speedup to number of processors: an 8x speedup on 8 processors is perfectly efficient (and usually impossible to achieve).
As stated, yours is not a sensible problem. Either the problem-setter has mis-specified it, or you have mis-stated (or misunderstood) a correct specification.
Something's not right: if both matrices have dimensions n x m, then they cannot be multiplied together (unless n = m). For A*B, A has to have as many columns as B has rows. Are you sure the server isn't sending rows of B's transpose? That would be equivalent to sending columns of B, in which case the solution is trivial.
Assuming all of that checks out, and your clients do indeed get rows from A and B: probably the easiest solution would be for each client to send its rows of matrix B to client #0, which reassembles the original matrix B and then sends its columns back out to the other clients. Basically, client #0 would act as a server that actually knows how to decompose the data efficiently. This would be 2*(n-1) messages (not counting the ones used to reunite the product matrix), but considering that you already need n messages to distribute the A and B matrices between the clients, there's no significant performance loss (it's still O(n) messages).
The biggest bottleneck here is obviously the initial gathering and redistribution of matrix B, which scales terribly, so if you have fairly small matrices and a lot of processes, you might just be better off calculating the product serially on the server.
I don't know if this is homework. But if it is not, then you should probably use a library. One option is ScaLAPACK:
http://www.netlib.org/scalapack/scalapack_home.html
ScaLAPACK is written in Fortran, but you can call it from C++.