How does the MapReduce sort algorithm work?

One of the main examples that is used in demonstrating the power of MapReduce is the Terasort benchmark. I'm having trouble understanding the basics of the sorting algorithm used in the MapReduce environment.
To me, sorting simply involves determining the relative position of an element in relation to all other elements. So sorting involves comparing "everything" with "everything". Your average sorting algorithm (quicksort, bubble sort, ...) simply does this in a smart way.
In my mind splitting the dataset into many pieces means you can sort a single piece and then you still have to integrate these pieces into the 'complete' fully sorted dataset. Given the terabyte dataset distributed over thousands of systems I expect this to be a huge task.
So how is this really done? How does this MapReduce sorting algorithm work?
Thanks for helping me understand.

Here are some details on Hadoop's implementation for Terasort:
"TeraSort is a standard map/reduce sort, except for a custom partitioner that uses a sorted list of N − 1 sampled keys that define the key range for each reduce. In particular, all keys such that sample[i − 1] <= key < sample[i] are sent to reduce i. This guarantees that the output of reduce i are all less than the output of reduce i+1."
So their trick is in how keys are assigned to reducers during the map phase: every key handled by a given reducer is guaranteed to be 'pre-sorted' with respect to the keys handled by all other reducers.
I found the paper reference through James Hamilton's Blog Post.
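To make that concrete, here is a minimal, hypothetical range partitioner in the spirit of what is described above (the cut points are hard-coded for brevity; the real TeraSort samples them from the input, ships them to every map task, and uses a more elaborate partitioner):

import java.util.Arrays;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical sampled-key range partitioner: N - 1 sorted cut points define N
// contiguous key ranges, one per reducer. Each reducer sorts its own range, so
// concatenating the reducer outputs in order yields a globally sorted result.
public class SampledRangePartitioner extends Partitioner<Text, Text> {

    // In a real job these would be sampled from the input, not hard-coded.
    private static final String[] CUT_POINTS = { "g", "n", "u" };   // 3 cuts => 4 ranges

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        // An exact hit on cut point i means sample[i] <= key, so it belongs to reducer i + 1;
        // otherwise binarySearch returns -(insertionPoint) - 1.
        int pos = Arrays.binarySearch(CUT_POINTS, key.toString());
        int partition = pos >= 0 ? pos + 1 : -(pos + 1);
        return Math.min(partition, numPartitions - 1);
    }
}

With such a partitioner, reducer 0 only ever sees keys below "g", reducer 1 sees keys in ["g", "n"), and so on, so each reducer's locally sorted output can simply be concatenated.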

Google Reference: MapReduce: Simplified Data Processing on Large Clusters
Appeared in:
OSDI'04: Sixth Symposium on Operating System Design and Implementation,
San Francisco, CA, December, 2004.
That link has a PDF and HTML-Slide reference.
There is also a Wikipedia page with description with implementation references.
Also criticism,
David DeWitt and Michael Stonebraker, pioneering experts in parallel databases and shared nothing architectures, have made some controversial assertions about the breadth of problems that MapReduce can be used for. They called its interface too low-level, and questioned whether it really represents the paradigm shift its proponents have claimed it is. They challenge the MapReduce proponents' claims of novelty, citing Teradata as an example of prior art that has existed for over two decades; they compared MapReduce programmers to Codasyl programmers, noting both are "writing in a low-level language performing low-level record manipulation". MapReduce's use of input files and lack of schema support prevents the performance improvements enabled by common database system features such as B-trees and hash partitioning, though projects such as PigLatin and Sawzall are starting to address these problems.

I had the same question while reading Google's MapReduce paper. Yuval F's answer pretty much solved my puzzle.
One thing I noticed while reading the paper is that the magic happens in the partitioning (after map, before reduce).
The paper uses hash(key) mod R as the partitioning example, but this is not the only way to partition intermediate data to different reduce tasks.
Just add boundary conditions to Yuval F's answer to make it complete: suppose min(S) and max(S) are the minimum and maximum keys among the sampled keys; all keys < min(S) are partitioned to one reduce task, and likewise all keys >= max(S) are partitioned to another reduce task.
There is no hard constraint on the sampled keys, such as having to include the true min or max. The more evenly these sampled keys are distributed among all the keys, the more "parallel" this distributed system is and the less likely any single reduce operator is to run into a memory overflow issue.
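To spell those boundary conditions out, here is a toy helper (not from the paper; a linear scan is used for clarity):

// Toy partition function making the boundary cases above explicit. sortedSamples
// holds the sampled keys S in ascending order; with R reduce tasks there are R - 1 samples.
public class BoundaryPartition {

    static int partitionFor(String key, String[] sortedSamples, int numReducers) {
        if (key.compareTo(sortedSamples[0]) < 0) {
            return 0;                              // key < min(S)  -> first reduce task
        }
        if (key.compareTo(sortedSamples[sortedSamples.length - 1]) >= 0) {
            return numReducers - 1;                // key >= max(S) -> last reduce task
        }
        for (int i = 1; i < sortedSamples.length; i++) {
            if (key.compareTo(sortedSamples[i]) < 0) {
                return i;                          // sample[i-1] <= key < sample[i]
            }
        }
        return numReducers - 1;                    // not reached when samples are consistent
    }

    public static void main(String[] args) {
        String[] samples = { "g", "n", "u" };      // min(S) = "g", max(S) = "u", R = 4
        System.out.println(partitionFor("a", samples, 4)); // 0 (below min(S))
        System.out.println(partitionFor("k", samples, 4)); // 1 ("g" <= "k" < "n")
        System.out.println(partitionFor("z", samples, 4)); // 3 (at or above max(S))
    }
}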

Just guessing...
Given a huge set of data, you would partition the data into some chunks to be processed in parallel (perhaps by record number i.e. record 1 - 1000 = partition 1, and so on).
Assign / schedule each partition to a particular node in the cluster.
Each cluster node will further break (map) its partition into mini partitions, perhaps by the first letter of the key. So, in partition 1, get everything that starts with A and output it into mini partition A(x). Create a new A(x) if there is already an A(x); replace x with a sequential number (perhaps this is the scheduler's job to do), i.e. give me the next unique A(x) id.
Hand over (schedule) the jobs completed by the mappers (previous step) to the "reduce" cluster nodes. The reduce nodes will then further refine the sort of each A(x) part, which will only happen when all the mapper tasks are done (you can't actually start sorting all the words starting with A while there is still a possibility that another A mini partition is in the making). Output the result into the final sorted partition (i.e. Sorted-A, Sorted-B, etc.).
Once done, combine the sorted partitions into a single dataset again. At this point it is just a simple concatenation of n files (where n could be 26 if you are only doing A - Z), etc.
There might be intermediate steps in between... I'm not sure :). I.e. further map and reduce after the initial reduce step.
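Just to illustrate the bucket-then-sort idea being guessed at here, a toy single-machine version (no Hadoop involved; the buckets play the role of the per-letter mini partitions):

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Toy version of the scheme sketched above: bucket records by a key prefix
// (first letter), sort each bucket independently (what the reducers would do
// in parallel), then concatenate the buckets in prefix order.
public class BucketThenSort {
    public static void main(String[] args) {
        List<String> input = List.of("banana", "apple", "cherry", "avocado", "blueberry");

        TreeMap<Character, List<String>> buckets = new TreeMap<>();   // "mini partitions"
        for (String s : input) {
            buckets.computeIfAbsent(s.charAt(0), c -> new ArrayList<>()).add(s);
        }

        List<String> sorted = new ArrayList<>();
        for (List<String> bucket : buckets.values()) {
            Collections.sort(bucket);   // per-bucket sort (the "reduce" work)
            sorted.addAll(bucket);      // concatenation of Sorted-A, Sorted-B, ...
        }
        System.out.println(sorted);     // [apple, avocado, banana, blueberry, cherry]
    }
}

Note that bucketing by first letter gives badly skewed bucket sizes on real text, which is exactly why TeraSort samples the data to pick its cut points instead.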

Related

Normalize SPARK RDD partitions using reduceByKey(numPartitions) or repartition

Using Spark 2.4.0.
My production data is extremely skewed, so one of the tasks was taking 7x longer than everything else.
I tried different strategies to normalize the data so that all executors worked equally -
spark.default.parallelism
reduceByKey(numPartitions)
repartition(numPartitions)
My expectation was that all three of them should partition evenly; however, playing with some dummy non-production data on Spark Local/Standalone suggests that the first two (spark.default.parallelism and reduceByKey(numPartitions)) normalize better than repartition.
Data as below (I am trying to do a simple reduce on balance per account+ccy combination):
account}date}ccy}amount
A1}2020/01/20}USD}100.12
A2}2010/01/20}SGD}200.24
A2}2010/01/20}USD}300.36
A1}2020/01/20}USD}400.12
Expected result should be [A1-USD,500.24], [A2-SGD,200.24], [A2-USD,300.36]. Ideally these should be partitioned into 3 different partitions.
javaRDDWithoutHeader
.mapToPair((PairFunction<Balance, String, Integer>) balance -> new Tuple2<>(balance.getAccount() + balance.getCcy(), 1))
.mapToPair(new MyPairFunction())
.reduceByKey(new ReductionFunction())
Code to check partitions
System.out.println("b4 = " +pairRDD.getNumPartitions());
System.out.println(pairRDD.glom().collect());
JavaPairRDD<DummyString, BigDecimal> newPairRDD = pairRDD.repartition(3);
System.out.println("Number of partitions = " +newPairRDD.getNumPartitions());
System.out.println(newPairRDD.glom().collect());
Option 1: Doing nothing
Option 2: Setting spark.default.parallelism to 3
Option 3: reduceByKey with numPartitions = 3
Option 4: repartition(3)
For Option 1
Number of partitions = 2
[
[(DummyString{account='A2', ccy='SGD'},200.24), (DummyString{account='A2', ccy='USD'},300.36)],
[(DummyString{account='A1', ccy='USD'},500.24)]
]
For option 2
Number of partitions = 3
[
[(DummyString{account='A1', ccy='USD'},500.24)],
[(DummyString{account='A2', ccy='USD'},300.36)],
[(DummyString{account='A2', ccy='SGD'},200.24)]]
For option 3
Number of partitions = 3
[
[(DummyString{account='A1', ccy='USD'},500.24)],
[(DummyString{account='A2', ccy='USD'},300.36)],
[(DummyString{account='A2', ccy='SGD'},200.24)]
]
For option 4
Number of partitions = 3
[
[],
[(DummyString{account='A2', ccy='SGD'},200.24)],
[(DummyString{account='A2', ccy='USD'},300.36), (DummyString{account='A1', ccy='USD'},500.24)]
]
Conclusion: options 2 (spark.default.parallelism) and 3 (reduceByKey(numPartitions)) normalized much better than option 4 (repartition).
The results were fairly deterministic; I never saw option 4 normalize into 3 partitions.
Questions:
Is reduceByKey(numPartitions) much better than repartition, or
is this just because the sample data set is so small, or
is this behavior going to be different when we submit via a YARN cluster?
I think there are a few things running through the question, which makes it harder to answer.
Firstly, there is the partitioning and parallelism related to the data at rest and thus when it is read in; without re-boiling the ocean, here is an excellent SO answer that addresses this: How spark read a large file (petabyte) when file can not be fit in spark's main memory. In any event, there is no hashing or anything going on there, just "as is".
Also, RDDs are not well optimized compared to DFs.
Various operations in Spark cause shuffling after an Action is invoked:
reduceByKey will cause less shuffling, using hashing for final aggregations and local partition aggregation, which is more efficient
repartition as well, which uses randomness
partitionBy(new HashPartitioner(n)), etc., which you do not allude to
reduceByKey(aggr. function, N partitions), which oddly enough appears to be more efficient than a repartition first
Your latter comment alludes to data skewness, typically: too many entries hash to the same "bucket"/partition for the reduceByKey. Alleviate it by:
In general, try with a larger number of partitions up front (when reading in), using suitable hashing - but I cannot see your transforms and methods here, so we leave this as general advice.
Or, in some cases, "salt" the key by adding a suffix, reduceByKey, then strip the salt and reduceByKey again to get back to the original key (see the sketch below). Whether this is worth it depends on the extra time taken vs. leaving things as they are or performing the other options.
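A rough sketch of that salting trick using the same Java RDD API as the question (assuming a JavaPairRDD<String, Double> and a simple additive reduce; class and variable names are placeholders, not your code):

import java.util.concurrent.ThreadLocalRandom;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// Two-step "salt / unsalt" aggregation for a skewed key distribution.
public class SaltedReduceSketch {

    static JavaPairRDD<String, Double> reduceWithSalt(JavaPairRDD<String, Double> rdd,
                                                      int saltBuckets) {
        return rdd
            // 1. Salt: spread every hot key over `saltBuckets` sub-keys.
            .mapToPair((PairFunction<Tuple2<String, Double>, String, Double>) kv ->
                new Tuple2<>(kv._1() + "#" + ThreadLocalRandom.current().nextInt(saltBuckets),
                             kv._2()))
            // 2. First reduce: partial sums per salted key, so no single partition
            //    has to aggregate all the values of a hot key.
            .reduceByKey((Function2<Double, Double, Double>) Double::sum)
            // 3. Unsalt: strip the suffix to recover the original key.
            .mapToPair((PairFunction<Tuple2<String, Double>, String, Double>) kv ->
                new Tuple2<>(kv._1().substring(0, kv._1().lastIndexOf('#')), kv._2()))
            // 4. Second reduce: combine the few partial sums per original key.
            .reduceByKey((Function2<Double, Double, Double>) Double::sum);
    }
}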
repartition(n) applies random ordering, so you shuffle and then need to shuffle again. Unnecessary, IMO. As another post shows (see the comments on your question), it looks like unnecessary work, but these are old-style RDDs.
All easier to do with dataframes BTW.
As we are not privy to your complete coding, hope this helps.

Hadoop MapReduce - Reducer with small number of keys and many values per key

Hadoop is naturally built to work with big data. But what happens if your output from the Mappers is also big - too big to fit into the Reducers' memory?
Let's say we're considering some large amount of data that we want to cluster. We use a partitioning algorithm that will find a specified number of "groups" of elements (clusters), such that elements in one cluster are similar, but elements that belong to different clusters are dissimilar. The number of clusters often needs to be specified.
If I try to implement K-means, the best-known clustering algorithm, one iteration would look like this:
Map phase - assign objects to closest centroids
Reduce phase - calculate new centroids based on all objects in a cluster
But what happens if we have only two clusters?
In that case, the large dataset will be divided into two parts, there would be only two keys, and the values for each key would contain half of the large dataset.
What I don't understand is - what if the Reducer gets many values for one key? How can it fit them in its RAM? Isn't this one of the things Hadoop was created for?
I gave just an example of an algorithm, but this is a general question.
That is precisely the reason why in the Reducer you never get a List of the values for a particular key - you only get an Iterator over the values. If the number of values for a particular key is too large, they are not all stored in memory; values are read off the local disk.
Links: Reducer
Also please see Secondary Sort which is a very useful design pattern when you have scenario where there are too many values.
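For illustration, a minimal reducer sketch with the newer org.apache.hadoop.mapreduce API (1-D points and a hypothetical cluster-id key, so not a full k-means job): the values arrive as an Iterable backed by the merged, disk-spilled map output, so the reducer can stream through millions of values per key without holding them all in RAM.

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Reducer;

// Recompute a (1-D) centroid as the mean of all points assigned to a cluster.
public class CentroidReducer
        extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {

    @Override
    protected void reduce(IntWritable clusterId, Iterable<DoubleWritable> points, Context context)
            throws IOException, InterruptedException {
        double sum = 0;
        long count = 0;
        for (DoubleWritable p : points) {   // streamed in one pass, never materialized as a List
            sum += p.get();                 // the framework reuses this Writable instance,
            count++;                        // so copy it if you ever need to keep a value
        }
        context.write(clusterId, new DoubleWritable(sum / count));
    }
}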

HyperLogLog correctness on mapreduce

Something that has been bugging me about the HyperLogLog algorithm is its reliance on the hash of the keys. The issue I have is that the paper seems to assume that we have a totally random distribution of data on each partition; however, in the contexts where it is often used (MapReduce-style jobs), things are often distributed by their hash values, so all duplicated keys will be on the same partition. To me this means that we should actually be adding the cardinalities generated by HyperLogLog rather than using some sort of averaging technique (in the case where we are partitioned by hashing the same thing that HyperLogLog hashes).
So my question is: is this a real issue with HyperLogLog, or have I not read the paper in enough detail?
This is a real issue if you use non-independent hash functions for both tasks.
Let's say the partitioner decides the node by the first b bits of the hashed values. If you use the same hash function for both partitioning and HyperLogLog, the algorithm will still work properly, but the precision will be sacrificed. In practice, it'll be equivalent to using m/2^b buckets (log2 m' = log2 m - b), because the first b bits will always be the same within a node, so only log2 m - b bits will be used to choose the HLL bucket.
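A toy illustration of that overlap, assuming the common convention of a 64-bit hash whose leading p bits select the HLL register, with the cluster routing records by the leading b bits of the very same hash:

// Hypothetical numbers: m = 2^14 registers (p = 14), 2^4 = 16 nodes (b = 4).
public class HllPartitionOverlap {
    public static void main(String[] args) {
        int p = 14, b = 4;
        long hash = 0xDEADBEEFCAFEBABEL;          // stand-in for hash(key)

        int node     = (int) (hash >>> (64 - b)); // partitioner: leading b bits
        int register = (int) (hash >>> (64 - p)); // HLL: leading p bits

        // On any given node the leading b bits are constant, so only p - b bits of
        // the register index actually vary: effectively m / 2^b usable registers,
        // i.e. the node-local estimate behaves like an HLL with p' = p - b.
        System.out.printf("node=%d register=%d (%d of %d registers reachable on this node)%n",
                node, register, 1 << (p - b), 1 << p);
    }
}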

How can I uniformly distribute data to reducers using a MapReduce mapper?

I have only a high-level understanding of MapReduce but a specific question about what is allowed in implementations.
I want to know whether it's easy (or possible) for a Mapper to uniformly distribute the given key-value pairs among the reducers. It might be something like
(k,v) -> (proc_id, (k,v))
where proc_id is the unique identifier for a processor (assume that every key k is unique).
The central question is: if the number of reducers is not fixed (it is determined dynamically depending on the size of the input - is this even how it's done in practice?), then how can a mapper produce sensible ids? One way could be for the mapper to know the total number of key-value pairs. Does MapReduce allow mappers to have this information? Another way would be to perform some small number of extra rounds of computation.
What is the appropriate way to do this?
The distribution of keys to reducers is done by a Partitioner. If you don't specify otherwise, the default partitioner uses a simple hashCode-based partitioning algorithm, which tends to distribute the keys very uniformly when every key is unique.
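The default behaviour boils down to something like this (a from-memory sketch of Hadoop's hashCode-based partitioning, written here as a custom Partitioner over Text keys):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Mask off the sign bit and take the key's hash modulo the number of reduce tasks.
// With unique keys and a reasonable hashCode this spreads records very evenly.
public class HashLikePartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}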
I'm assuming that what you actually want is to process random groups of records in parallel, and that the keys k have nothing to do with how the records should be grouped. That suggests that you should focus on doing the work on the map side instead. Hadoop is pretty good at cleanly splitting up the input into parallel chunks for processing by the mappers, so unless you are doing some kind of arbitrary aggregation I see no reason to reduce at all.
Often the procId technique you mention is used to take otherwise heavily-skewed groups and un-skew them (for example, when performing a join operation). In your case the key is all but meaningless.

Hadoop. Reducing result to the single value

I started learning Hadoop, and am a bit confused by MapReduce. For tasks where the result is naturally a list of key-value pairs, everything seems clear. But I don't understand how I should solve tasks where the result is a single value (say, the sum of squared input decimals, or the centre of mass of the input points).
On the one hand I could send all mapper results to the same key. But as far as I understand, in that case the single reducer will have to process the whole set of data (calculate the sum, or the mean coordinates). That doesn't look like a good solution.
Another approach I can imagine is to group the mapper results. Say, the mapper that processed examples 0-999 will produce key 0, the one for 1000-1999 will produce key 1, and so on. Since there will still be multiple reducer results, it will be necessary to build a chain of reducers (reducing is repeated until only one result remains). It looks much more computationally effective, but a bit complicated.
I still hope that Hadoop has an off-the-shelf tool that chains reducers to maximise the efficiency of reducing the whole dataset to a single value, although I have failed to find one.
What is the best practice for solving tasks where the result is a single value?
If you are able to reformulate your task in terms of a commutative reduce, you should look at Combiners. Either way you should take a look at them; they can significantly reduce the amount of data to shuffle.
From my point of view, you are tackling the problem from the wrong angle.
Take the problem where you need to sum the squares of your input, and let's assume you have many large text input files consisting of one number per line.
Then ideally you want to parallelize your sums in the mapper and then just sum up the sums in the reducer.
e.g.:
map: (input x, running sum s) -> s += x*x
At the end of map, you would emit that temporary sum of every mapper with a global key.
In the reduce stage, you basically get all the sums from your mappers and sum them up; note that this is fairly small (n times a single number, where n is the number of mappers) in relation to your huge input files, and therefore a single reducer is really not a scalability bottleneck.
You want to cut down the communication cost between the mapper and the reducer, not proxy all your data to a single reducer and read through it there, that would not parallelize anything.
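A minimal sketch of that layout (class names, types and the plain-text input format are assumptions): each mapper accumulates a local sum of squares and emits it exactly once in cleanup() under a single shared key, so the lone reducer only adds up one small number per map task.

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class SumOfSquares {

    // Reads one number per line and keeps a running local sum of squares.
    public static class SquareMapper
            extends Mapper<LongWritable, Text, NullWritable, DoubleWritable> {
        private double localSum = 0;

        @Override
        protected void map(LongWritable offset, Text line, Context context) {
            String s = line.toString().trim();
            if (!s.isEmpty()) {
                double x = Double.parseDouble(s);
                localSum += x * x;
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // One record per mapper, all under the same (null) key.
            context.write(NullWritable.get(), new DoubleWritable(localSum));
        }
    }

    // Receives n partial sums (n = number of map tasks) and adds them up.
    public static class SumReducer
            extends Reducer<NullWritable, DoubleWritable, NullWritable, DoubleWritable> {
        @Override
        protected void reduce(NullWritable key, Iterable<DoubleWritable> partialSums,
                              Context context) throws IOException, InterruptedException {
            double total = 0;
            for (DoubleWritable s : partialSums) {
                total += s.get();
            }
            context.write(key, new DoubleWritable(total));
        }
    }
}

Since this reduce is plain addition (commutative and associative), the same SumReducer could also be registered as a combiner via job.setCombinerClass(SumReducer.class), which is what the Combiner suggestion in the previous answer amounts to.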
I think your analysis of the specific use cases you bring up is spot on. These use cases still fall into a rather inclusive scope of what you can do with Hadoop, and there are certainly other things that Hadoop just wasn't designed to handle. If I had to solve the same problem, I would follow your first approach unless I knew the data was too big; then I'd follow your two-step approach.
