Normalize Spark RDD partitions using reduceByKey(numPartitions) or repartition - performance

Using Spark 2.4.0.
My production data is extremely skewed, so one of the tasks was taking 7x longer than everything else.
I tried different strategies to normalize the data so that all executors worked equally:
spark.default.parallelism
reduceByKey(numPartitions)
repartition(numPartitions)
My expectation was that all three should partition the data evenly; however, playing with some dummy non-production data on Spark local/standalone suggests that the first two normalize better than repartition.
Data is as below (I am trying to do a simple reduce on balance per account+ccy combination):
account}date}ccy}amount
A1}2020/01/20}USD}100.12
A2}2010/01/20}SGD}200.24
A2}2010/01/20}USD}300.36
A1}2020/01/20}USD}400.12
The expected result is [A1-USD,500.24], [A2-SGD,200.24], [A2-USD,300.36]. Ideally these should end up in 3 different partitions.
javaRDDWithoutHeader
.mapToPair((PairFunction<Balance, String, Integer>) balance -> new Tuple2<>(balance.getAccount() + balance.getCcy(), 1))
.mapToPair(new MyPairFunction())
.reduceByKey(new ReductionFunction())
Code to check partitions
System.out.println("b4 = " +pairRDD.getNumPartitions());
System.out.println(pairRDD.glom().collect());
JavaPairRDD<DummyString, BigDecimal> newPairRDD = pairRDD.repartition(3);
System.out.println("Number of partitions = " +newPairRDD.getNumPartitions());
System.out.println(newPairRDD.glom().collect());
Option 1: Doing nothing
Option 2: Setting spark.default.parallelism to 3
Option 3: reduceByKey with numPartitions = 3
Option 4: repartition(3)
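For reference, a sketch of how options 2-4 can be expressed (mappedPairs is a placeholder for the pair RDD produced by MyPairFunction above; the surrounding driver code is omitted):
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;

// Option 2: set spark.default.parallelism on the SparkConf before building the context
SparkConf conf = new SparkConf().set("spark.default.parallelism", "3");

// Option 3: pass the target partition count directly to reduceByKey
JavaPairRDD<DummyString, BigDecimal> opt3 =
        mappedPairs.reduceByKey(new ReductionFunction(), 3);

// Option 4: reduce with defaults, then repartition the result (as in the snippet above)
JavaPairRDD<DummyString, BigDecimal> opt4 =
        mappedPairs.reduceByKey(new ReductionFunction()).repartition(3);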
For Option 1
Number of partitions = 2
[
[(DummyString{account='A2', ccy='SGD'},200.24), (DummyString{account='A2', ccy='USD'},300.36)],
[(DummyString{account='A1', ccy='USD'},500.24)]
]
For option 2
Number of partitions = 3
[
[(DummyString{account='A1', ccy='USD'},500.24)],
[(DummyString{account='A2', ccy='USD'},300.36)],
[(DummyString{account='A2', ccy='SGD'},200.24)]
]
For option 3
Number of partitions = 3
[
[(DummyString{account='A1', ccy='USD'},500.24)],
[(DummyString{account='A2', ccy='USD'},300.36)],
[(DummyString{account='A2', ccy='SGD'},200.24)]
]
For option 4
Number of partitions = 3
[
[],
[(DummyString{account='A2', ccy='SGD'},200.24)],
[(DummyString{account='A2', ccy='USD'},300.36), (DummyString{account='A1', ccy='USD'},500.24)]
]
Conclusion: options 2 (spark.default.parallelism) and 3 (reduceByKey(numPartitions)) normalized much better than option 4 (repartition).
The results were fairly deterministic; I never saw option 4 normalize into 3 partitions.
Questions:
Is reduceByKey(numPartitions) much better than repartition, or
is this just because the sample data set is so small, or
will this behavior be different when we submit via a YARN cluster?

I think there are a few things running through the question, which makes it harder to answer.
Firstly, there is the partitioning and parallelism related to the data at rest and thus when it is read in; without re-boiling the ocean, here is an excellent SO answer that addresses this: How spark read a large file (petabyte) when file can not be fit in spark's main memory. In any event, there is no hashing or anything going on at that point; the data is just read "as is".
Also, RDDs are not well optimized compared to DataFrames.
Various operations in Spark cause shuffling after an Action is invoked:
reduceByKey will cause less shuffling, as it uses hashing for the final aggregation and aggregates locally within each partition first, which is more efficient
repartition shuffles as well, and it distributes records randomly rather than by key
partitionBy(new HashPartitioner(n)), etc., which you do not allude to (see the sketch after this list)
reduceByKey(aggr. function, N partitions), which oddly enough appears to be more efficient than repartitioning first
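A minimal sketch of that partitionBy route, in the same Java style as the question (pairRDD and ReductionFunction are borrowed from your snippet; the exact types are assumptions):
import org.apache.spark.HashPartitioner;

// Hash-partition by key into 3 partitions, then reduce; because the RDD now
// carries a partitioner, the subsequent reduceByKey reuses it and does not
// trigger another shuffle (keys need stable hashCode/equals).
JavaPairRDD<DummyString, BigDecimal> partitioned =
        pairRDD.partitionBy(new HashPartitioner(3));
JavaPairRDD<DummyString, BigDecimal> reduced =
        partitioned.reduceByKey(new ReductionFunction());
System.out.println(reduced.getNumPartitions()); // 3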
Your latter comment alludes to data skew, typically: too many entries hash to the same "bucket" / partition for the reduceByKey. Alleviate this by:
In general, trying a larger number of partitions up front (when reading in), using suitable hashing - but I cannot see your transforms and methods here, so we leave this as general advice.
Or, in some cases, "salting" the key by adding a suffix, then reduceByKey, then "unsalting" (stripping the suffix) and reduceByKey again to get back to the original key - sketched just below. Whether this pays off depends on the extra time taken vs. leaving it as is or performing the other options.
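A minimal sketch of the salting approach in the same Java RDD style as the question; the key/value types, the salt width of 10, and the variable names are assumptions, not your code:
import java.math.BigDecimal;
import java.util.Random;
import scala.Tuple2;

final int SALT = 10;                 // assumed fan-out for hot keys
Random rnd = new Random();

// 1) Salt: append a random suffix so one hot key spreads over up to SALT buckets.
JavaPairRDD<String, BigDecimal> salted = pairs.mapToPair(
        t -> new Tuple2<>(t._1 + "#" + rnd.nextInt(SALT), t._2));

// 2) First reduce on the salted keys.
JavaPairRDD<String, BigDecimal> partial = salted.reduceByKey(BigDecimal::add, 3);

// 3) Unsalt and reduce again to recover the original keys.
JavaPairRDD<String, BigDecimal> result = partial
        .mapToPair(t -> new Tuple2<>(t._1.substring(0, t._1.lastIndexOf('#')), t._2))
        .reduceByKey(BigDecimal::add, 3);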
repartition(n) applies random ordering, so you shuffle and then need to shuffle again - unnecessarily, imo. As another post shows (see the comments on your question), it looks like unnecessary work, but these are old-style RDDs.
All of this is easier to do with DataFrames, BTW.
As we are not privy to your complete code, I hope this helps.

Related

What is the best way to lag a value in a Dask Dataframe?

I have a Dask DataFrame called data which is extremely large, cannot fit into main memory, and, importantly, is not sorted. The dataframe is unique on the following key: [strike, expiration, type, time]. What I need to accomplish in Dask is the equivalent of the following in Pandas:
data1 = data[['strike', 'expiration', 'type', 'time', 'value']].sort_values(by=['strike', 'expiration', 'type', 'time'])
data1['lag_value'] = data1.groupby(['strike', 'expiration', 'type', 'time'])['value'].shift(1)
In other words, I need to lag the variable value within a by-group. What is the best way to do this in Dask? I know that sorting is going to be very computationally expensive, but I don't think there is a way around it given what I would like to do.
Thank you in advance!
I'll make a few assumptions, but my guess is that the data is 'somewhat' sorted. So you might have file partitions that are specific to a day or a week or maybe an hour if you are working with high-frequency data. This means that you can do sorting within those partitions, which is often a more manageable task.
If this guess is wrong, then it might be a good idea to incur the fixed cost of sorting (and persisting) the data since it will speed up your downstream analysis.
Since you have only one large file and it's not very big (25GB should be manageable if you have access to a cluster), the best thing might be to load into memory with regular pandas, sort and save the data with partitioning on dates/expirations/tickers (if available) or some other column division that makes sense for your downstream analysis.
It might be possible to reduce memory footprint by using appropriate dtypes, for example strike, type, expiration columns might take less space as categories (vs strings).
If there is no way at all of loading it into memory at once, then it's possible to iterate over chunks of rows with pandas and then save the relevant bits in smaller chunks; here's rough pseudocode:
import pandas as pd

df = pd.read_csv('big_file', iterator=True, chunksize=10**4)
for rows in df:
    # here we want to split into smaller sets based on some logic
    # note the mode is append, so some additional check on file
    # existence (and header handling) should be added
    for group_label, group_df in rows.groupby(['type', 'strike']):
        group_df.to_csv(f"{group_label}.csv", mode='a')
Now, the above might sound weird, since the question is tagged with dask and I'm focusing on pandas, but the idea is to save time downstream by partitioning the data on the relevant variables. This is probably achievable with dask as well, but in my experience, in situations like these I would run into memory problems due to data shuffling among workers. Of course, if in your situation there were many files rather than one, then some parallelisation with dask.delayed would be helpful.
Now, after you partition/index your data, then dask will work great when operating on the many smaller chunks. For example, if you partitioned the data based on date and your downstream analysis is primarily using dates, then operations like groupby and shift will be very fast because the workers will not need to check with each other whether they have overlapping dates, so most processing will occur within partitions.

A join operation using Hadoop MapReduce

How do you join two record sets using MapReduce? Most of the solutions, including those posted on SO, suggest that I emit the records based on a common key, add them to, say, a HashMap in the reducer, and then take a cross product. (e.g. Join of two datasets in Mapreduce/Hadoop)
This solution is very good and works for the majority of cases, but in my case my issue is rather different. I am dealing with data which has billions of records, and taking a cross product of the two sets is impossible because in many cases the HashMap would end up with a few million objects, so I encounter a Heap Space Error.
I need a much more efficient solution. The whole point of MR is to deal with very large amounts of data, and I want to know if there is any solution that can help me avoid this issue.
I don't know if this is still relevant for anyone, but I am facing a similar issue these days. My intention is to use a key-value store, most likely Cassandra, and use it for the cross product. This means:
When processing a line of type A, look up the key in Cassandra. If it exists, merge the A record into the existing value (the B elements). If not, create the key and add the A elements as the value.
When processing a line of type B, look up the key in Cassandra. If it exists, merge the B record into the existing value (the A elements). If not, create the key and add the B elements as the value.
This would require an additional server for Cassandra, and probably some disk space, but since I'm running in the cloud (Google's bdutil Hadoop framework), I don't think it should be much of a problem.
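A rough sketch of that merge step against an external key-value store; KeyValueStore here is a hypothetical wrapper around whatever client you use (e.g. the Cassandra driver), not a real API:
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper around the external store; get/put are illustrative only.
interface KeyValueStore {
    List<String> get(String joinKey);              // returns null if the key is absent
    void put(String joinKey, List<String> rows);
}

// Merge one input line into the store instead of buffering the whole
// cross-product side in a reducer-local HashMap.
static void mergeRecord(KeyValueStore store, String joinKey, String row) {
    List<String> existing = store.get(joinKey);
    if (existing == null) {
        existing = new ArrayList<>();              // first time we see this key
    }
    existing.add(row);                             // A and B rows accumulate side by side
    store.put(joinKey, existing);
}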
You should look into how Pig does skew joins. The idea is that if your data contains too many values with the same key (even if there is no data skew), you can create artificial keys and spread the key distribution. This makes sure that each reducer gets fewer records than it otherwise would. For example, if you were to suffix "1" to 50% of the occurrences of key "K1" and "2" to the other 50%, half the records (K1-1) end up on one reducer and the other half (K1-2) on another.
If the distribution of the key values is not known beforehand, you could use some kind of sampling algorithm.
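A minimal sketch of the artificial-key idea as a Hadoop mapper; the salt width, the tab-delimited layout, and the class name are assumptions:
import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Spreads a hot join key over up to SALT reducers by appending a random suffix.
// The smaller data set must then be emitted once per possible suffix so both
// sides still meet in the same reducer.
public class SaltingMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final int SALT = 10;            // assumed fan-out
    private final Random rnd = new Random();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String joinKey = line.toString().split("\t")[0];   // assumes key is the first column
        context.write(new Text(joinKey + "-" + rnd.nextInt(SALT)), line);
    }
}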

Hadoop. Reducing result to the single value

I have started learning Hadoop, and am a bit confused by MapReduce. For tasks where the result is naturally a list of key-value pairs, everything seems clear. But I don't understand how I should solve tasks where the result is a single value (say, the sum of squared input decimals, or the centre of mass of input points).
On the one hand, I can put all the mapper results under the same key. But as far as I understand, in this case the only reducer will process the whole set of data (calculate the sum, or the mean coordinates). It doesn't look like a good solution.
Another approach I can imagine is to group the mapper results. Say, the mapper that processed examples 0-999 will produce key 0, the one for 1000-1999 will produce key 1, and so on. Since there will still be multiple reducer results, it will be necessary to build a chain of reducers (reducing is repeated until only one result remains). It looks much more computationally effective, but a bit complicated.
I still hope that Hadoop has an off-the-shelf tool that executes this superposition of reducers to maximise the efficiency of reducing the whole data set to a single value, although I have failed to find one.
What is the best practice for solving tasks where the result is a single value?
If you are able to reformulate your task in terms of a commutative (and associative) reduce, you should look at Combiners. Either way you should take a look at them; they can significantly reduce the amount of data to shuffle.
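For example, a sketch of the job wiring, assuming a sum-style reduce with hypothetical SumMapper/SumReducer classes (the reducer doubles as the combiner because summing is commutative and associative):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

Job job = Job.getInstance(new Configuration(), "sum-of-squares");
job.setMapperClass(SumMapper.class);        // hypothetical mapper
job.setCombinerClass(SumReducer.class);     // pre-aggregates each mapper's output locally
job.setReducerClass(SumReducer.class);      // hypothetical reducer, same logic as the combiner
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class);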
From my point of view, you are tackling the problem from the wrong angle.
Take the problem where you need to sum the squares of your input, and let's assume you have many large text input files consisting of one number per line.
Then ideally you want to parallelize your sums in the mappers and then just sum up the sums in the reducer.
e.g.:
map: (input "x", temporary sum "s") -> s += (x * x)
At the end of the map phase, you emit that temporary sum of every mapper with a single global key.
In the reduce stage, you basically get all the sums from your mappers and sum them up. Note that this is fairly small (n times a single value, where n is the number of mappers) in relation to your huge input files, and therefore a single reducer really is not a scalability bottleneck.
You want to cut down the communication cost between the mappers and the reducer, not proxy all your data to a single reducer and read through it there; that would not parallelize anything.
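A minimal sketch of that pattern: accumulate in the mapper, emit one partial sum per mapper from cleanup(), then sum the handful of partials in a single reducer (class and key names are illustrative):
import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SquareSumMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    private static final Text GLOBAL_KEY = new Text("sum");
    private double partialSum = 0.0;

    @Override
    protected void map(LongWritable offset, Text line, Context ctx) {
        double x = Double.parseDouble(line.toString().trim());
        partialSum += x * x;                        // accumulate locally, emit nothing yet
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        ctx.write(GLOBAL_KEY, new DoubleWritable(partialSum));   // one record per mapper
    }
}

public class SquareSumReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> partials, Context ctx)
            throws IOException, InterruptedException {
        double total = 0.0;
        for (DoubleWritable p : partials) {
            total += p.get();                       // n values, one per mapper
        }
        ctx.write(key, new DoubleWritable(total));  // the single final result
    }
}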
I think your analysis of the specific use cases you bring up is spot on. These use cases still fall into the rather inclusive scope of what you can do with Hadoop, and there are certainly other things that Hadoop just wasn't designed to handle. If I had to solve the same problem, I would follow your first approach unless I knew the data was too big; then I'd follow your two-step approach.

Comparing two large datasets using a MapReduce programming model

Let's say I have two fairly large data sets - the first is called "Base" and it contains 200 million tab-delimited rows, and the second is called "MatchSet", which has 10 million tab-delimited rows of similar data.
Let's say I then also have an arbitrary function called Match(row1, row2), and Match() essentially contains some heuristics for looking at row1 (from MatchSet), comparing it to row2 (from Base), and determining if they are similar in some way.
Let's say the rules implemented in Match() are custom and complex rules, aka not a simple string match, involving some proprietary methods. Let's say for now that Match(row1, row2) is written in pseudo-code, so implementation in another language is not a problem (though it's in C++ today).
In a linear model, aka a program running on one giant processor, we would read each line from MatchSet and each line from Base, compare one to the other using Match(), and write out our match stats. For example, we might capture: X records from MatchSet are strong matches, Y records from MatchSet are weak matches, Z records from MatchSet do not match. We would also write the strong/weak/non-match values to separate files for inspection. Aka, a nested loop of sorts:
for each row1 in MatchSet
{
    for each row2 in Base
    {
        var type = Match(row1, row2);
        switch (type)
        {
            // do something based on type
        }
    }
}
I've started considering Hadoop streaming as a method for running these comparisons as a batch job in a short amount of time. However, I'm having a bit of a hard time getting my head around the map-reduce paradigm for this type of problem.
I understand pretty clearly at this point how to take a single input from Hadoop, crunch the data using a mapping function, and then emit the results to reduce. However, the "nested-loop" approach of comparing two sets of records is messing with me a bit.
The closest I'm coming to a solution is that I would basically still have to do a 10 million record comparison in parallel across the 200 million records, so 200 million / n nodes * 10 million iterations per node. Is that the most efficient way to do this?
From your description, it seems to me that your problem can be arbitrarily complex and could be a victim of the curse of dimensionality.
Imagine for example that your rows represent n-dimensional vectors, and that your matching function is "strong", "weak" or "no match" based on the Euclidean distance between a Base vector and a MatchSet vector. There are great techniques to solve these problems with a trade-off between speed, memory and the quality of the approximate answers. Critically, these techniques typically come with known bounds on time and space, and the probability to find a point within some distance around a given MatchSet prototype, all depending on some parameters of the algorithm.
Rather than have me ramble about it here, please consider reading the following:
Locality Sensitive Hashing
The first few hits on Google Scholar when you search for "locality sensitive hashing map reduce". In particular, I remember reading [Das, Abhinandan S., et al. "Google news personalization: scalable online collaborative filtering." Proceedings of the 16th international conference on World Wide Web. ACM, 2007] with interest.
Now, on the other hand, if you can devise a scheme that is directly amenable to some form of hashing, then you can easily produce a key for each record with such a hash (or even a small number of possible hash keys, one of which would match the query "Base" data), and the problem becomes a simple large(-ish) scale join. (I say "largish" because joining 200M rows with 10M rows is quite small if the problem is indeed a join.) As an example, consider the way CDDB computes the 32-bit ID for any music CD (the CDDB1 calculation). Sometimes, a given title may yield slightly different IDs (i.e. different CDs of the same title, or even the same CD read several times). But by and large there is a small set of distinct IDs for that title. At the cost of a small replication of the MatchSet, in that case you can get very fast search results.
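If such a hashing scheme exists, the reduce side of that join might look roughly like this; the "B|"/"M|" tags (added by the Base and MatchSet mappers, not shown) and the placeholder match() are assumptions standing in for the proprietary Match():
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Both inputs are mapped to (hashKey, taggedRecord); the reducer then only
// compares Base and MatchSet rows that landed in the same hash bucket.
public class HashJoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text hashKey, Iterable<Text> tagged, Context ctx)
            throws IOException, InterruptedException {
        List<String> base = new ArrayList<>();
        List<String> matchSet = new ArrayList<>();
        for (Text t : tagged) {                     // rows are prefixed "B|" or "M|" by their mappers
            String s = t.toString();
            (s.startsWith("B|") ? base : matchSet).add(s.substring(2));
        }
        for (String m : matchSet) {
            for (String b : base) {
                ctx.write(hashKey, new Text(match(m, b)));   // tiny per-bucket cross product
            }
        }
    }

    private String match(String row1, String row2) {
        return "strong";                            // placeholder for the real Match() heuristics
    }
}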
Check Section 3.5 - Relational Joins in the paper 'Data-Intensive Text Processing with MapReduce'. I haven't gone into detail, but it might help you.
This is an old question, but your proposed solution is correct assuming that your single stream job does 200M * 10M Match() computations. By doing N batches of (200M / N) * 10M computations, you've achieved a factor of N speedup. By doing the computations in the map phase and then thresholding and steering the results to Strong/Weak/No Match reducers, you can gather the results for output to separate files.
If additional optimizations could be utilized, they'd likely apply to both the single-stream and parallel versions. Examples include blocking, so that you need to do fewer than 200M * 10M computations, or precomputing constant portions of the algorithm for the 10M-row match set.
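For the steering-to-separate-files part, Hadoop's MultipleOutputs can write strong/weak/no-match records to different named outputs; a rough sketch (the output names and value types are assumptions):
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Driver side: declare one named output per match strength.
MultipleOutputs.addNamedOutput(job, "strong", TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, "weak",   TextOutputFormat.class, Text.class, Text.class);
MultipleOutputs.addNamedOutput(job, "none",   TextOutputFormat.class, Text.class, Text.class);

// Task side (reducer, or mapper of a map-only job), in outline:
//   setup():     mos = new MultipleOutputs<>(context);
//   per record:  mos.write("strong", key, value);   // pick the name from Match()'s result
//   cleanup():   mos.close();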

How does the MapReduce sort algorithm work?

One of the main examples that is used in demonstrating the power of MapReduce is the Terasort benchmark. I'm having trouble understanding the basics of the sorting algorithm used in the MapReduce environment.
To me, sorting simply involves determining the relative position of an element in relation to all other elements. So sorting involves comparing "everything" with "everything". Your average sorting algorithm (quick, bubble, ...) simply does this in a smart way.
In my mind splitting the dataset into many pieces means you can sort a single piece and then you still have to integrate these pieces into the 'complete' fully sorted dataset. Given the terabyte dataset distributed over thousands of systems I expect this to be a huge task.
So how is this really done? How does this MapReduce sorting algorithm work?
Thanks for helping me understand.
Here are some details on Hadoop's implementation for Terasort:
"TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N - 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i - 1] <= key < sample[i] are sent to reduce i. This guarantees that the output of reduce i are all less than the output of reduce i + 1."
So their trick is in the way they determine the keys during the map phase. Essentially they ensure that every value in a single reducer is guaranteed to be 'pre-sorted' against all other reducers.
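A minimal sketch of such a range partitioner; in real TeraSort the split points come from sampling the input and are distributed to the tasks, whereas here they are hard-coded purely for illustration:
import java.util.Arrays;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Binary-search the sampled split points to decide which reducer a key goes to,
// so reducer i's keys all sort before reducer i+1's keys.
public class RangePartitioner extends Partitioner<Text, Text> {
    private static final String[] SPLIT_POINTS = { "g", "n", "t" };   // N-1 sampled boundaries (assumed)

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        int i = Arrays.binarySearch(SPLIT_POINTS, key.toString());
        int bucket = (i >= 0) ? i + 1 : -(i + 1);   // keys equal to split[i] go to reducer i+1
        return Math.min(bucket, numPartitions - 1); // clamp in case fewer reducers are configured
    }
}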
I found the paper reference through James Hamilton's Blog Post.
Google Reference: MapReduce: Simplified Data Processing on Large Clusters
Appeared in:
OSDI'04: Sixth Symposium on Operating System Design and Implementation,
San Francisco, CA, December, 2004.
That link has a PDF and HTML-Slide reference.
There is also a Wikipedia page with description with implementation references.
There is also criticism:
David DeWitt and Michael Stonebraker, pioneering experts in parallel databases and shared nothing architectures, have made some controversial assertions about the breadth of problems that MapReduce can be used for. They called its interface too low-level, and questioned whether it really represents the paradigm shift its proponents have claimed it is. They challenge the MapReduce proponents' claims of novelty, citing Teradata as an example of prior art that has existed for over two decades; they compared MapReduce programmers to Codasyl programmers, noting both are "writing in a low-level language performing low-level record manipulation". MapReduce's use of input files and lack of schema support prevents the performance improvements enabled by common database system features such as B-trees and hash partitioning, though projects such as PigLatin and Sawzall are starting to address these problems.
I had the same question while reading Google's MapReduce paper. #Yuval F 's answer pretty much solved my puzzle.
One thing I noticed while reading the paper is that the magic happens in the partitioning (after map, before reduce).
The paper uses hash(key) mod R as the partitioning example, but this is not the only way to partition intermediate data to different reduce tasks.
Just add boundary conditions to #Yuval F 's answer to make it complete: suppose min(S) and max(S) are the minimum and maximum keys among the sampled keys; all keys < min(S) are partitioned to one reduce task; likewise, all keys >= max(S) are partitioned to another.
There is no hard limitation on the sampled keys, like a min or max. It's just that the more evenly these R keys are distributed among all the keys, the more "parallel" this distributed system is, and the less likely a reduce operator is to have memory overflow issues.
Just guessing...
Given a huge set of data, you would partition the data into some chunks to be processed in parallel (perhaps by record number i.e. record 1 - 1000 = partition 1, and so on).
Assign / schedule each partition to a particular node in the cluster.
Each cluster node will further break (map) the partition into its own mini partitions, perhaps by alphabetical order of the key. So, in partition 1, get me all the things that start with A and output them into mini partition A of x. Create a new A(x) if there is an A(x) already, replacing x with a sequential number (perhaps this is the scheduler's job), i.e. give me the next unique A(x) id.
Hand over (schedule) jobs completed by the mapper (previous step) to the "reduce" cluster nodes. The reduce node cluster will then further refine the sort of each A(x) part, which will only happen when all the mapper tasks are done (it can't actually start sorting all the words starting with A while there is still a possibility that another A mini partition is in the making). Output the result in the final sorted partition (i.e. Sorted-A, Sorted-B, etc.).
Once done, combine the sorted partitions into a single dataset again. At this point it is just a simple concatenation of n files (where n could be 26 if you are only doing A - Z), etc.
There might be intermediate steps in between... I'm not sure :). I.e. further map and reduce after the initial reduce step.
