Using Reduce in Apache Spark

I am trying to use Apache Spark to load a file, distribute it to several nodes in my cluster, and then aggregate and retrieve the results. I don't quite understand how to do this.
From my understanding the reduce action enables Spark to combine the results from different nodes and aggregate them together. Am I understanding this correctly?
From a programming perspective, I don't understand how I would code this reduce function.
How exactly do I partition the main dataset into N pieces and ask for them to be processed in parallel using a list of transformations?
reduce is supposed to take in two elements and a function for combining them. Are these 2 elements supposed to be RDDs in the context of Spark, or can they be any type of element? Also, if you have N different partitions running in parallel, how would reduce aggregate all their results into one final result (since the reduce function aggregates only 2 elements)?
Also, I don't understand this example. The example from the Spark website uses reduce, but I don't see the data being processed in parallel. So, what is the point of the reduce? If I could get a detailed explanation of the loop in this example, I think that would clear up most of my questions.
class ComputeGradient extends Function<DataPoint, Vector> {
  private Vector w;
  ComputeGradient(Vector w) { this.w = w; }
  public Vector call(DataPoint p) {
    return p.x.times(p.y * (1 / (1 + Math.exp(w.dot(p.x))) - 1));
  }
}

JavaRDD<DataPoint> points = spark.textFile(...).map(new ParsePoint()).cache();
Vector w = Vector.random(D); // current separating plane
for (int i = 0; i < ITERATIONS; i++) {
  Vector gradient = points.map(new ComputeGradient(w)).reduce(new AddVectors());
  w = w.subtract(gradient);
}
System.out.println("Final separating plane: " + w);
Also, I have been trying to find the source code for reduce in the Apache Spark GitHub repository, but the source is pretty huge and I haven't been able to pinpoint it. Could someone please point me to the file it is in?

That is a lot of questions. In the future, you should break this up into multiple questions. I will give a high-level overview that should answer them for you.
First, here is the file with reduce.
Second, most of your problems come from trying to micromanage too much (only necessary if you need to performance tune). You first need to understand the core of what Spark is about and what an RDD is. It is a collection that is parallelized under the hood; from your programming perspective it is just another collection. And reduce is just a function on that collection, a common one in functional programming. All it does is run an operator across your whole collection, turning it into one result, like below:
((item1 op item2) op item3) op ....
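For example, here is a minimal, self-contained sketch of reduce on an RDD of integers (the app name, local master, and sample data are just illustrative):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import java.util.Arrays;

public class ReduceExample {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("reduce-example").setMaster("local[*]");
    JavaSparkContext sc = new JavaSparkContext(conf);

    // The RDD is split across partitions; reduce combines elements within each
    // partition first, then combines the per-partition results into one value.
    JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));
    int total = numbers.reduce((a, b) -> a + b); // 15

    System.out.println("total = " + total);
    sc.close();
  }
}

The operator you pass to reduce must be associative and commutative, which is exactly what lets Spark combine partial results from different partitions in any order.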
Last, in the example, the code is merely running an iterative algorithm over the data to converge on some point. This is a common task for machine learning algorithms.
Again, I wouldn't focus on the details until you get a better understanding of the high level of distributed programming. Spark is just an abstraction on top to turn this type of programming back into regular code :)

Related

Normalize SPARK RDD partitions using reduceByKey(numPartitions) or repartition

Using Spark 2.4.0.
My production data is extremely skewed, so one of the tasks was taking 7x longer than everything else.
I tried different strategies to normalize the data so that all executors worked equally -
spark.default.parallelism
reduceByKey(numPartitions)
repartition(numPartitions)
My expectation was that all three of them should partition evenly; however, playing with some dummy non-production data on Spark local/standalone suggests that the first two normalize better than repartition.
Data is as below (I am trying to do a simple reduce on the balance per account + ccy combination):
account}date}ccy}amount
A1}2020/01/20}USD}100.12
A2}2010/01/20}SGD}200.24
A2}2010/01/20}USD}300.36
A1}2020/01/20}USD}400.12
Expected result should be [A1-USD,500.24], [A2-SGD,200.24], [A2-USD,300.36]. Ideally these should be partitioned into 3 different partitions.
javaRDDWithoutHeader
.mapToPair((PairFunction<Balance, String, Integer>) balance -> new Tuple2<>(balance.getAccount() + balance.getCcy(), 1))
.mapToPair(new MyPairFunction())
.reduceByKey(new ReductionFunction())
Code to check partitions
System.out.println("b4 = " +pairRDD.getNumPartitions());
System.out.println(pairRDD.glom().collect());
JavaPairRDD<DummyString, BigDecimal> newPairRDD = pairRDD.repartition(3);
System.out.println("Number of partitions = " +newPairRDD.getNumPartitions());
System.out.println(newPairRDD.glom().collect());
Option 1: Doing nothing
Option 2: Setting spark.default.parallelism to 3
Option 3: reduceByKey with numPartitions = 3
Option 4: repartition(3)
For Option 1
Number of partitions = 2
[
[(DummyString{account='A2', ccy='SGD'},200.24), (DummyString{
account='A2', ccy='USD'},300.36)],
[(DummyString{account='A1', ccy='USD'},500.24)]
]
For option 2
Number of partitions = 3
[
[(DummyString{account='A1', ccy='USD'},500.24)],
[(DummyString{account='A2', ccy='USD'},300.36)],
[(DummyString{account='A2', ccy='SGD'},200.24)]]
For option 3
Number of partitions = 3
[
[(DummyString{account='A1', ccy='USD'},500.24)],
[(DummyString{account='A2', ccy='USD'},300.36)],
[(DummyString{account='A2', ccy='SGD'},200.24)]
]
For option 4
Number of partitions = 3
[[], [(DummyString{
account='A2', ccy='SGD'},200.24)], [(DummyString{
account='A2', ccy='USD'},300.36), (DummyString{
account='A1', ccy='USD'},500.24)]]
Conclusion: options 2 (spark.default.parallelism) and 3 (reduceByKey(numPartitions)) normalized much better than option 4 (repartition).
The results are fairly deterministic; I never saw option 4 normalize into 3 partitions.
Question:
Is reduceByKey(numPartitions) much better than repartition, or
is this just because the sample data set is so small, or
is this behavior going to be different when we submit via a YARN cluster?
I think there are a few things running through the question, which makes it harder to answer.
Firstly there are the partitioning and parallelism related to the data at rest and thus when read in; without re-boiling the ocean, here is an excellent SO answer that addresses this: How spark read a large file (petabyte) when file can not be fit in spark's main memory. In any event, there is no hashing or anything going on, just "as is".
Also, RDDs are not well optimized compared to DFs.
Various operations in Spark cause shuffling after an action is invoked:
reduceByKey causes less shuffling, since it aggregates locally within each partition and then uses hashing for the final aggregation, which is more efficient
repartition causes shuffling as well, and it uses randomness
partitionBy(new HashPartitioner(n)), etc., which you do not allude to
reduceByKey(aggr. function, N partitions), which oddly enough appears to be more efficient than doing a repartition first; a short sketch contrasting the two follows this list
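To make the last two bullets concrete, here is a minimal sketch; keyedRDD is a hypothetical JavaPairRDD<DummyString, BigDecimal> holding the keyed records before aggregation, and ReductionFunction is the one from your snippet:

import org.apache.spark.api.java.JavaPairRDD;
import java.math.BigDecimal;

// One shuffle: values are combined locally within each input partition first,
// then hashed into 3 partitions for the final aggregation.
JavaPairRDD<DummyString, BigDecimal> direct =
        keyedRDD.reduceByKey(new ReductionFunction(), 3);

// Two shuffles: a random full shuffle into 3 partitions, followed by
// another shuffle for the aggregation itself.
JavaPairRDD<DummyString, BigDecimal> viaRepartition =
        keyedRDD.repartition(3).reduceByKey(new ReductionFunction());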
Your latter comment typically alludes to data skewness: too many entries hash to the same "bucket" / partition for the reduceByKey. Alleviate it by:
In general, trying a larger number of partitions up front (when reading in), using suitable hashing - but I cannot see your transforms and methods here, so we leave this as general advice.
Or, in some cases, "salting" the key by adding a suffix, doing a reduceByKey, and then a second reduceByKey to "unsalt" and get back the original key; a sketch of this follows below. Whether it is worth it depends on the extra time taken vs. leaving things as is or performing the other options.
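As a rough sketch of the salting idea (keyedRDD is again hypothetical, and plain String keys with Double values are used for simplicity instead of your DummyString/BigDecimal pair):

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;
import java.util.concurrent.ThreadLocalRandom;

// Stage 1: append a random suffix so one hot key is spread over several
// partitions, then aggregate the salted keys.
JavaPairRDD<String, Double> salted = keyedRDD
        .mapToPair(t -> new Tuple2<>(t._1 + "#" + ThreadLocalRandom.current().nextInt(10), t._2))
        .reduceByKey(Double::sum);

// Stage 2: strip the suffix and aggregate again to recover the original keys.
JavaPairRDD<String, Double> unsalted = salted
        .mapToPair(t -> new Tuple2<>(t._1.substring(0, t._1.lastIndexOf('#')), t._2))
        .reduceByKey(Double::sum);

This trades one extra aggregation pass for a more even spread of the skewed key.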
repartition(n) applies random ordering, so you shuffle and then need to shuffle again. Unnecessary, imo. As another post shows (see the comments on your question), it looks like unnecessary work, but these are old-style RDDs.
All of this is easier to do with DataFrames, BTW.
As we are not privy to your complete code, I hope this helps.

Finding regions in scattered data

I have a number of scattered data sets in Nx3 matrices, a simple example plotted with scatter3 is shown below (pastebin of the raw values):
Each of my data sets have an arbitrary number of regions/blobs; the example above for instance has 4.
Does anyone know of a simple method to programmatically find the number of regions in this form of data?
My initial idea was to use a delaunayTriangulation, convexHull approach, but without any data treatment this will still only find the outer volume of the entire plot rather than each region.
The next idea I have would involve grabbing nearest-neighbour statistics for each point, asking whether it's within a grid-size distance of another point, then lumping those that are into separate blobs/clusters.
Is there a higher level Matlab function I'm not aware of that could assist me here, or does anyone have a better suggestion of how to pull the region count out of data like this?
It sounds like you need a clustering algorithm. Fortunately for you, MATLAB provides a number of these out of the box. There are plenty of algorithms to choose from, and it sounds like you need something where the number of clusters is unknown beforehand, correct?
If this is the case, and your data is as "nice" as your example, I would suggest kmeans combined with a technique to properly choose "k", as suggested here.
There are other options of course, I recommend you learn more about the clustering options in MATLAB, here's a nice reference for more reading.
Determining the number of different clusters in a dataset is a tricky problem, and probably harder than it might seem at first sight. In fact, algorithms like k-means depend heavily on it. Wikipedia has a nice article on it, but no clear and easy method.
The Elbow method mentioned there seems comparatively easy to do, although it might be computationally costly. In essence, you just try different numbers of clusters and choose the number at which the explained variance stops growing much and plateaus.
Also, the notion of a cluster needs to be clearly defined - what if zooming into any of the blobs displays a similar structure as the corner structure in the picture?
I would suggest implementing a "light" version of a Gaussian Mixture Model.
Have each point "vote" for a cube. In the above example, all points centered around (-1.5,-1.5,0) would each add a +1 to the cube [-2,-1]x[-2,-1]x[-0.2,0.2]. Finally, you can analyze the peaks in the voting matrix.
In the interest of completeness, there is a vastly simpler answer to this problem (which I have built) than hierarchical clustering; it gives much better results and can differentiate between 1 cluster and 2 (an issue that I couldn't manage to fix with MarkV's suggestions). This assumes your data is on a regular grid of known size, and that you have an unknown number of clusters that are separated by at least 2*(grid size):
% Idea is as follows:
% * We have a known grid size, dx.
% * A random point [may as well be minima(1,:)] will be in a cluster of
%   values if any others in the list lie dx away (with one dimension
%   varied), sqrt(2)*dx (two dimensions varied) or sqrt(3)*dx (three
%   dimensions varied).
% * Chain these objects together until all are found; any with distances
%   beyond sqrt(3)*dx of the cluster are ignored for now.
% * Set this cluster aside, repeat until no minima data is left.
function [blobs, clusterIdx] = findClusters(minima,dx)
    % problem setup
    dx2 = sqrt(2)*dx;
    dx3 = sqrt(3)*dx;
    eqf = @(list,dx,dx2,dx3)(abs(list-dx) < 0.001 | abs(list-dx2) < 0.001 | abs(list-dx3) < 0.001);
    notDoneClust = true;
    notDoneMinima = true;
    clusterIdx = zeros(size(minima,1),1);
    point = minima(1,:);
    list = minima(2:end,:);
    blobs = 0;
    while notDoneMinima
        cluster = nan(1,3);
        while notDoneClust
            [~, dist] = knnsearch(point,list); % distances from every remaining point to the current front
            nnidx = eqf(dist,dx,dx2,dx3);      % finds indexes of nn values to point
            cluster = cat(1,cluster,point,list(nnidx,:)); % add points to current cluster
            point = list(nnidx,:);  % points to check are now all values that are nn to the initial point
            list = list(~nnidx,:);  % list is now all other values that are not in that set
            notDoneClust = ~isempty(point); % if there are no more points to check, all values of the cluster have been found
        end
        blobs = blobs+1;
        % ismemberf (File Exchange) is ismember with a floating-point tolerance
        clusterIdx(ismemberf(minima,cluster(2:end,:),'rows')) = blobs;
        % reset points and list for a new cluster
        if ~isempty(list)
            if length(list) > 1
                point = list(1,:);
                list = list(2:end,:);
                notDoneClust = true;
            else
                % point is a cluster of its own. Don't reset loops, add point in
                % as a cluster and exit (NOTE: I have yet to test this portion).
                blobs = blobs+1;
                clusterIdx(ismemberf(minima,point,'rows')) = blobs;
                notDoneMinima = false;
            end
        else
            notDoneMinima = false;
        end
    end
end
I fully understand this method is useless for clustering data in the general sense, as any outlying data will be marked as a separate cluster. This (if it happens) is what I need anyway, so this may just be an edge case scenario.

what parallel algorithms exist in R, working on large data

I'm trying to find out which statistical/data-mining algorithms, in R or in R packages on CRAN/GitHub/R-Forge, can handle large datasets, either in parallel on one server, sequentially without running into out-of-memory issues, or on several machines at once.
This is in order to evaluate whether I can easily port them to work with ff/ffbase, like ffbase::bigglm.ffdf.
I would like to split these up into 3 parts:
Algorithms that update or work on parameter estimates in parallel
Buckshot (https://github.com/lianos/buckshot)
lm.fit # Programming For Big Data (https://github.com/RBigData)
Algorithms that work sequentially (get data in R but only use 1 process and only 1 process updates the parameters)
bigglm (http://cran.r-project.org/web/packages/biglm/index.html)
Compound Poisson linear models (http://cran.r-project.org/web/packages/cplm/index.html)
Kmeans # biganalytics (http://cran.r-project.org/web/packages/biganalytics/index.html)
Work on part of the data
Distributed text processing (http://www.jstatsoft.org/v51/i05/paper)
And I would like to exclude simple parallelisation, like optimising over a hyperparameter by e.g. cross-validating.
Any other pointers to these kind of models/optimisers or algorithms? Maybe Bayesian? Maybe a package called RGraphlab (http://graphlab.org/)?
Have you read through the High Performance Computing Task View on CRAN?
It covers many of the points you mention and gives overviews of packages in those areas.
Random forests are trivial to run in parallel. It's one of the examples in the foreach vignette:
x <- matrix(runif(500), 100)
y <- gl(2, 50)
library(randomForest); library(foreach)
rf <- foreach(ntree=rep(250, 4), .combine=combine,
.packages='randomForest') %dopar% randomForest(x, y, ntree=ntree)
You can use this construct to split your forest over every core in your cluster.

An understandable clusterization

I have a dataset. Each element of this set consists of numerical and categorical variables. Categorical variables are nominal and ordinal.
There is some natural structure in this dataset. Commonly, experts clusterize datasets such as mine using their 'expert knowledge', but I want to automate this process of clusterization.
Most algorithms for clusterization use a distance (Euclidean, Mahalanobis and so on) between objects to group them into clusters. But it is hard to find a reasonable metric for mixed data types, i.e. we can't find a distance between 'glass' and 'steel'. So I came to the conclusion that I have to use conditional probabilities P(feature = 'something' | Class) and some utility function that depends on them. This is reasonable for categorical variables, and it works fine with numeric variables assuming they are normally distributed.
So it became clear to me that algorithms like K-means will not produce good results.
At this time I am trying to work with the COBWEB algorithm, which fully matches my idea of using conditional probabilities. But I have faced another obstacle: the results of the clusterization are really hard to interpret, if not impossible. As a result I wanted to get something like a set of rules that describes each cluster (e.g. if feature1 = 'a' and feature2 in [30, 60], it is cluster1), like decision trees for classification.
So, my question is:
Is there any existing clusterization algorithm that works with mixed data types and produces an understandable (and reasonable for humans) description of the clusters?
Additional info:
As I understand it, my task is in the field of conceptual clustering. I can't define a similarity function as was suggested (that is an ultimate goal of the whole project), because of the field of study - it is very complicated and merciless in terms of formalization. As far as I understand, the most reasonable approach is the one used in COBWEB, but I'm not sure how to adapt it so that I can get an understandable description of the clusters.
Decision Tree
As was suggested, I tried to train a decision tree on the clustering output, thus getting a description of the clusters as a set of rules. But unfortunately interpreting these rules is almost as hard as interpreting the raw clustering output. First of all, only the first few levels of rules from the root node make any sense: the closer to the leaves, the less sense they make. Secondly, these rules don't match any expert knowledge.
So, I came to the conclusion that clustering is a black box, and it is not worth trying to interpret its results.
Also
I had an interesting idea to modify a 'decision tree for regression' algorithm in a certain way: instead of calculating the intra-group variance, calculate a category utility function and use it as the split criterion. As a result we should get a decision tree with leaves as clusters and a description of the clusters out of the box. But I haven't tried to do so, and I am not sure about the accuracy and everything else.
For most algorithms, you will need to define similarity. It doesn't need to be a proper distance function (e.g. satisfy triangle inequality).
K-means is particularly bad, because it also needs to compute means. So it's better to stay away from it if you cannot compute means, or are using a different distance function than Euclidean.
However, consider defining a distance function that captures your domain knowledge of similarity. It can be composed of other distance functions, say using the harmonic mean of the Euclidean distance (maybe weighted with some scaling factor) and a categorical similarity function.
Once you have a decent similarity function, a whole bunch of algorithms will become available to you. e.g. DBSCAN (Wikipedia) or OPTICS (Wikipedia). ELKI may be of interest to you, they have a Tutorial on writing custom distance functions.
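As a rough, made-up illustration of such a composed distance in plain Java (the record layout, weights, and scaling are invented for the example; a simple weighted sum stands in for whatever composition you choose, such as the harmonic mean mentioned above):

import java.util.Objects;

// A toy record with one numeric, one nominal and one ordinal attribute.
class Item {
    double weightKg;   // numeric
    String material;   // nominal, e.g. "glass", "steel"
    String sizeClass;  // ordinal, e.g. "S", "M", "L"

    Item(double weightKg, String material, String sizeClass) {
        this.weightKg = weightKg;
        this.material = material;
        this.sizeClass = sizeClass;
    }
}

class MixedDistance {
    // Combine a scaled numeric difference with simple categorical mismatch terms.
    static double distance(Item a, Item b) {
        double numeric = Math.abs(a.weightKg - b.weightKg) / 10.0; // 10.0 is an assumed scale
        double nominal = Objects.equals(a.material, b.material) ? 0.0 : 1.0;
        double ordinal = Math.abs(rank(a.sizeClass) - rank(b.sizeClass)) / 2.0;
        return 0.5 * numeric + 0.3 * nominal + 0.2 * ordinal; // assumed weights
    }

    static int rank(String sizeClass) {
        switch (sizeClass) {
            case "S": return 0;
            case "M": return 1;
            default:  return 2;
        }
    }

    public static void main(String[] args) {
        Item glass = new Item(4.2, "glass", "M");
        Item steel = new Item(7.8, "steel", "L");
        System.out.println(distance(glass, steel)); // a small, unit-free dissimilarity
    }
}

A function like this can then be plugged into DBSCAN, OPTICS or hierarchical clustering, for instance via ELKI's custom distance function mechanism.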
Interpretation is a separate thing. Unfortunately, few clustering algorithms will give you a human-readable interpretation of what they found. They may give you things such as a representative (e.g. the mean of a cluster in k-means), but little more. But of course you could next train a decision tree on the clustering output and try to interpret the decision tree learned from the clustering. The one really nice feature of decision trees is that they are somewhat human-understandable. But just like a Support Vector Machine will not give you an explanation, most (if not all) clustering algorithms will not do that either, sorry, unless you do this kind of post-processing. Plus, it will actually work with any clustering algorithm, which is a nice property if you want to compare multiple algorithms.
There was a related publication last year. It is a bit obscure and experimental (on a workshop at ECML-PKDD), and requires the data set to have a quite extensive ground truth in form of rankings. In the example, they used color similarity rankings and some labels. The key idea is to analyze the cluster and find the best explanation using the given ground truth(s). They were trying to use it to e.g. say "this cluster found is largely based on this particular shade of green, so it is not very interesting, but the other cluster cannot be explained very well, you need to investigate it closer - maybe the algorithm discovered something new here". But it was very experimental (Workshops are for work-in-progress type of research). You might be able to use this, by just using your features as ground truth. It should then detect if a cluster can be easily explained by things such as "attribute5 is approx. 0.4 with low variance". But it will not forcibly create such an explanation!
H.-P. Kriegel, E. Schubert, A. Zimek
Evaluation of Multiple Clustering Solutions
In 2nd MultiClust Workshop: Discovering, Summarizing and Using Multiple Clusterings Held in Conjunction with ECML PKDD 2011. http://dme.rwth-aachen.de/en/MultiClust2011
A common approach to solve this type of clustering problem is to define a statistical model that captures relevant characteristics of your data. Cluster assignments can be derived by using a mixture model (as in the Gaussian Mixture Model) then finding the mixture component with the highest probability for a particular data point.
In your case, each example is a vector that has both real and categorical components. A simple approach is to model each component of the vector separately.
I generated a small example dataset where each example is a vector of two dimensions. The first dimension is a normally distributed variable and the second is a choice of five categories (see graph):
There are a number of frameworks available for running Monte Carlo inference on statistical models. BUGS is probably the most popular (http://www.mrc-bsu.cam.ac.uk/bugs/). I created this model in Stan (http://mc-stan.org/), which uses a different sampling technique than BUGS and is more efficient for many problems:
data {
  int<lower=0> N; // number of data points
  int<lower=0> C; // number of categories
  real x[N];      // normally distributed component data
  int y[N];       // categorical component data
}
parameters {
  real<lower=0,upper=1> theta; // mixture probability
  real mu[2];                  // means for the normal component
  simplex[C] phi[2];           // categorical distributions for the categorical component
}
transformed parameters {
  real log_theta;
  real log_one_minus_theta;
  vector[C] log_phi[2];
  vector[C] alpha;

  log_theta <- log(theta);
  log_one_minus_theta <- log(1.0 - theta);
  for (c in 1:C)
    alpha[c] <- .5;
  for (k in 1:2)
    for (c in 1:C)
      log_phi[k,c] <- log(phi[k,c]);
}
model {
  theta ~ uniform(0,1); // equivalently, ~ beta(1,1);
  for (k in 1:2) {
    mu[k] ~ normal(0,10);
    phi[k] ~ dirichlet(alpha);
  }
  for (n in 1:N) {
    lp__ <- lp__ + log_sum_exp(log_theta + normal_log(x[n],mu[1],1) + log_phi[1,y[n]],
                               log_one_minus_theta + normal_log(x[n],mu[2],1) + log_phi[2,y[n]]);
  }
}
I compiled and ran the Stan model and used the parameters from the final sample to compute the probability of each datapoint under each mixture component. I then assigned each datapoint to the mixture component (cluster) with higher probability to recover the cluster assignments below:
Basically, the parameters for each mixture component will give you the core characteristics of each cluster if you have created a model appropriate for your dataset.
For heterogeneous, non-Euclidean data vectors as you describe, hierarchical clustering algorithms often work best. The conditional probability condition you describe can be incorporated as an ordering of attributes used to perform cluster agglomeration or division. The semantics of the resulting clusters are easy to describe.

CouchDB - passing MapReduce results into second MapReduce function

Rank amateur at both Map/Reduce and CouchDB here. I have a CouchDB populated with ~600,000 rows of data which indicate views of records. My desire is to produce a graph showing hits per record, over the entire data set.
I have implemented Map/Reduce functions to do the grouping, as so:
function(doc) {
  emit(doc.id, doc);
}
and:
function(key, values) {
  return values.length;
}
Now, because there are still a fair number of reduced values and we only want, say, 100 data points on the graph, this isn't very usable. Plus, it takes forever to run.
I could just retrieve every Xth row, but what would be ideal would be to pass these reduced results back into another reduce function that takes the mean of its values, so I eventually get a nice set of, say, 100 results, which are useful for throwing into a high-level overview graph to see the distribution of hits.
Is this possible? (and if so, what would the keys be?) Or have I just messed something up in my MapReduce code that's making it grossly non-performant, thus allowing me to do this in my application code? There are only 33,500 results returned.
Thanks,
Matt
To answer my own question:
According to this article, CouchDB doesn't support passing Map/Reduce output as input to another Map/Reduce function, although the article notes that other projects such as Disco do support this.
Custom server-side processing can be performed by way of CouchDB lists - like, for example, sorting by value.
