Hadoop combiner sort phase

When running a MapReduce job with a specified combiner, is the combiner run during the sort phase? I understand that the combiner is run on mapper output for each spill, but it seems like it would also be beneficial to run during intermediate steps when merge sorting. I'm assuming here that in some stages of the sort, mapper output for some equivalent keys is held in memory at some point.
If this doesn't currently happen, is there a particular reason, or just something which hasn't been implemented?
Thanks in advance!

Combiners are there to save network bandwidth.
The map output gets sorted directly:
sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter);
This happens right after the real mapping is done. While iterating through the buffer, the framework checks whether a combiner has been set; if so, it combines the records, and if not, it spills directly to disk.
The important parts are in the MapTask class, if you'd like to see it for yourself:
sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter);
// ... some fields ...
for (int i = 0; i < partitions; ++i) {
    // check if a combiner is configured
    if (combinerRunner == null) {
        // no combiner: spill directly
    } else {
        // combine the sorted records for this partition before spilling
        combinerRunner.combine(kvIter, combineCollector);
    }
}
This is the right stage to save disk space and network bandwidth, because it is very likely that the output will have to be transferred.
During the merge/shuffle/sort phase it is not beneficial, because you would then have to crunch a larger amount of data than a combiner run at map-finish time has to.
Note that the sort phase shown in the web interface is misleading; it is pure merging.

There are two opportunities for running the Combiner, both on the map side of processing. (A very good online reference is from Tom White's "Hadoop: The Definitive Guide" - https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-6/shuffle-and-sort )
The first opportunity comes on the map side after completing the in-memory sort by key of each partition, and before writing those sorted data to disk. The motivation for running the Combiner at this point is to reduce the amount of data ultimately written to local storage. By running the Combiner here, we also reduce the amount of data that will need to be merged and sorted in the next step. So to the original question posted, yes, the Combiner is already being applied at this early step.
The second opportunity comes right after merging and sorting the spill files. In this case, the motivation for running the Combiner is to reduce the amount of data ultimately sent over the network to the reducers. This stage benefits from the earlier application of the Combiner, which may have already reduced the amount of data to be processed by this step.
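For completeness, wiring in a combiner is a one-line job configuration; the framework may then invoke it at both of the points described above. A minimal driver sketch (the mapper and reducer class names are placeholders, not something from the original question):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CombinerDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(CombinerDriver.class);
        job.setMapperClass(TokenizerMapper.class);  // placeholder mapper class
        job.setCombinerClass(IntSumReducer.class);  // combiner: often the reducer class itself
        job.setReducerClass(IntSumReducer.class);   // placeholder reducer class
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Because the framework treats the combiner as an optimisation it may apply zero, one, or several times, the combine function must be commutative and associative for the result to stay correct.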

The combiner only runs in the way you already understand it.
I suspect the reason the combiner works only this way is that it reduces the amount of data being sent to the reducers, which is a huge gain in many situations. Meanwhile, on the reducer, the data is already there, and whether you combine it in the sort/merge or in your reduce logic does not really matter computationally (it's either done now or later).
So, I guess my point is: you may get gains by combining in the merge as you describe, but they won't be as large as those from the map-side combiner.

I haven't gone through the code, but in reference to Hadoop: The Definitive Guide by Tom White, 3rd edition, it does mention that if a combiner is specified it will run during the merge phase on the reducer side. The following is an excerpt from the text:
" The map outputs are copied to the reduce task JVM’s memory if they are small enough
(the buffer’s size is controlled by mapred.job.shuffle.input.buffer.percent, which
specifies the proportion of the heap to use for this purpose); otherwise, they are copied
to disk. When the in-memory buffer reaches a threshold size (controlled by
mapred.job.shuffle.merge.percent), or reaches a threshold number of map outputs
(mapred.inmem.merge.threshold), it is merged and spilled to disk. If a combiner is specified it will be run during the merge to reduce the amount of data written to disk.
"

What is the exact MapReduce workflow?

A summary from the book Hadoop: The Definitive Guide by Tom White is:
All the logic between the user's map function and the user's reduce function is called the shuffle. The shuffle therefore spans both map and reduce. After the user's map() function, the output sits in an in-memory circular buffer. When the buffer is 80% full, a background thread starts to run; it writes the buffer's contents into a spill file. This spill file is partitioned by key, and within each partition the key-value pairs are sorted by key. After sorting, if a combiner function is enabled, it is called. All spill files are then merged into one MapOutputFile, and every map task's MapOutputFile is collected over the network by the reduce task. The reduce task does another sort, and then the user's reduce function is called.
So the questions are:
1.) According to the above summary, this is the flow:
Mapper--Partitioner--Sort--Combiner--Shuffle--Sort--Reducer--Output
1a.) Is this the flow, or is it something else?
1b.) Can you explain the above flow with an example, say the word count example? (The ones I found online weren't that elaborate.)
2.) So the map phase's output is one big file (MapOutputFile)? And it is this one big file that is broken up, with the key-value pairs passed on to the respective reducers?
3.) Why does sorting happen a second time, when the data is already sorted and combined when passed on to the respective reducers?
4.) Say mapper1 runs on Datanode1; is it then necessary for reducer1 to run on Datanode1? Or can it run on any Datanode?
Answering this question fully would mean rewriting the whole history of the framework. A lot of your doubts have to do with operating system concepts, not MapReduce.
The mapper's data is written to the local file system. The data is partitioned based on the number of reducers, and within each partition there can be multiple files, based on the number of times spills have happened.
Each small file in a given partition is sorted, because an in-memory sort is done before the file is written.
Why does the data need to be sorted on the mapper side?
a. The data is sorted and merged on the mapper side to decrease the number of files.
b. The files are sorted because it would otherwise become impossible for the reducer to gather all the values for a given key.
After gathering data on the reducer, the number of files on the system first needs to be decreased (remember that ulimit puts a fixed cap on open files for every user, in this case the hdfs user).
The reducer just maintains file pointers into a small set of sorted files and does a merge of them.
For more interesting ideas, please refer to:
http://bytepadding.com/big-data/map-reduce/understanding-map-reduce-the-missing-guide/
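To make question 1b concrete, here is a minimal WordCount sketch annotated with where each step of the flow happens. This is a hedged sketch in the spirit of the standard example, not the exact code shipped with Hadoop:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: emit (word, 1) for every token. These pairs land in the in-memory
    // circular buffer, where they are partitioned by key, sorted, optionally
    // combined, and then spilled to disk.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce (also usable as the combiner): sum the counts for each key.
    // Registered as the combiner, it turns (big,1),(big,1),(big,1) into
    // (big,3) on the map side; as the reducer, it produces the final totals
    // after the shuffle and the second (merge) sort.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}

For the input "she lived in a big house with a big garage", the map side emits (big,1) twice; with IntSumReducer set as the combiner, the spill files already contain (big,2), and the reducer only has to merge pre-aggregated partial counts.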

Does Apache Spark read and process at the same time, or does it first read the entire file into memory and then start transformations?

I am curious whether Spark first reads the entire file into memory and only then starts processing it (applying transformations and actions), or whether it reads the first chunk of a file, applies transformations to it, reads the second chunk, and so on.
Is there any difference between Spark and Hadoop in this regard? I read that Spark keeps the entire file in memory most of the time, while Hadoop does not. But what about the initial step, when we read it for the first time and map the keys?
Thanks
I think a fair characterisation would be this:
Both Hadoop (or more accurately MapReduce) and Spark use the same underlying filesystem HDFS to begin with.
During the Mapping phase both will read all data and actually write the map result to disk so that it can be sorted and distributed between nodes via the Shuffle logic.
Both of them do in fact try and cache the data just mapped in memory in addition to spilling it to disk for the Shuffle to do its work.
The difference here though is that Spark is a lot more efficient in this process, trying to optimally align the node chosen for a specific computation with the data already cached on a certain node.
Since Spark also does something called lazy evaluation, its memory use is very different from Hadoop's, as a result of planning computation and caching simultaneously.
In the steps of a word-count job, Hadoop does this:
Map all the words to 1.
Write all those mapped pairs of (word, 1) to a single file in HDFS (single file could still span multiple nodes on the distributed HDFS) (this is the shuffle phase)
Sort the rows of (word, 1) in that shared file (this is the sorting phase)
Have the reducers read sections (partitions) from that shared file that now contains all the words sorted and sum up all those 1s for every word.
Spark, on the other hand, will go the other way around:
It figures that, like in Hadoop, it is probably most efficient to have all those words summed up via separate reducer runs, so it decides, according to some factors, that it wants to split the job into x parts and then merge them into the final result.
So it knows that the words will have to be sorted, which will require at least part of them to be in memory at a given time.
After that it evaluates that such a sorted list will require all words mapped to (word, 1) pairs to start the calculation.
It works through steps 3, then 2, then 1.
Now the trick relative to Hadoop is that it knows in step 3 which in-memory cached items it will need in step 2, and in step 2 it already knows how these parts (mostly K-V pairs) will be needed in the final step 1.
This allows Spark to plan the execution of jobs very efficiently by caching data it knows will be needed in later stages. Hadoop, working from the beginning (mapping) to the end without explicitly looking ahead to the following stages, simply cannot use memory this efficiently, and hence doesn't waste resources keeping the large chunks in memory that Spark would keep. Unlike Spark, it just doesn't know whether all the pairs in a map phase will be needed in the next step.
The fact that Spark appears to keep the whole dataset in memory is therefore not something Spark actively does, but rather a result of the way Spark is able to plan the execution of a job.
On the other hand, Spark may actually keep fewer things in memory in a different kind of job. Counting the number of distinct words is a good example here, in my opinion.
Here Spark would have planned ahead and would immediately drop a repeated word from the cache/memory when encountering it during the mapping, while Hadoop would go ahead and waste memory on shuffling the repeated words too. (I acknowledge there are a million ways to also make Hadoop do this, but it's not out of the box; there are also ways of writing your Spark job in unfortunate ways that break these optimisations, but it's not so easy to fool Spark here :).)
Hope this helps in understanding that the memory use is just a natural consequence of the way Spark works, not something actively aimed at, and also not something strictly required by Spark. Spark is perfectly capable of repeatedly spilling data back to disk between steps of the execution when memory becomes an issue.
For more insight into this, I recommend learning about Spark's DAG scheduler to see how this is actually done in code.
You'll see that it always follows the pattern of working out where the data is and will be cached before figuring out what to calculate where.
Spark uses lazy iterators to process data and can spill data to disk if necessary; it doesn't read all the data into memory.
The difference compared to Hadoop is that Spark can chain multiple operations together.
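To illustrate the lazy, chained evaluation both answers describe, here is a small sketch against Spark's Java API (the HDFS paths are placeholders):

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class LazyWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("lazy-word-count"); // master supplied by spark-submit
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Nothing is read yet: textFile, flatMap and mapToPair only build
            // the lineage graph. Data later streams through lazy iterators,
            // partition by partition.
            JavaPairRDD<String, Integer> counts = sc
                .textFile("hdfs:///input/books.txt")  // placeholder path
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);           // shuffle boundary

            // The action below triggers the whole pipeline: Spark plans the
            // stages first, then runs them, spilling to disk where needed.
            counts.saveAsTextFile("hdfs:///output/counts"); // placeholder path
        }
    }
}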

What's the difference between shuffle phase and combiner phase?

I'm pretty confused about the MapReduce framework; I keep getting confused reading about it from different sources. By the way, this is my idea of a MapReduce job:
1. Map() --> emits <key, value>
2. Partitioner (OPTIONAL) --> divides the intermediate output from the mapper and assigns it to different reducers
3. Shuffle phase, used to make: <key, list of values>
4. Combiner, a component used like a mini-reducer which performs some operations on the data and then passes the data on to the reducer. The combiner works locally, not on HDFS, saving space and time.
5. Reducer, which gets the data from the combiner, performs further operations (possibly the same as the combiner) and then releases the output.
6. We will have n output parts, where n is the number of reducers
Is this basically right? I mean, I found some sources stating that the combiner is the shuffle phase and that it basically groups each record by key...
Combiner is NOT at all similar to the shuffling phase. What you describe as shuffling is wrong, which is the root of your confusion.
Shuffling is just copying the map output over to the reduce side; it has nothing to do with key generation. It is the first phase of a reducer, with the other two being sorting and then reducing.
Combining is like executing a reducer locally, for the output of each mapper. It basically acts like a reducer (it also extends the Reducer class), which means that, like a reducer, it groups the local values that the mapper has emitted for the same key.
Partitioning is, indeed, assigning the map output keys to specific reduce tasks, but it is not optional. Overriding the default HashPartitioner with an implementation of your own is optional.
I tried to keep this answer minimal, but you can find more information in the book Hadoop: The Definitive Guide by Tom White, as Azim suggests, and some related things in this post.
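To make the partitioning point concrete, here is a hedged sketch of a custom partitioner; the routing logic is purely illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative only: send words starting with a..m to the first reduce task
// and everything else to the last one.
public class AlphabetPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        char first = k.isEmpty() ? 'a' : Character.toLowerCase(k.charAt(0));
        return first <= 'm' ? 0 : numPartitions - 1;
    }
}

It would be registered with job.setPartitionerClass(AlphabetPartitioner.class); leaving this out keeps the default HashPartitioner, which is what most jobs want.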
Think of combiner as a mini-reducer phase that only works on the output of map task within each node before it emits it to the actual reducer.
Taking the classical WordCount example, the map phase output would be (word, 1) for each word the map task processes. Let's assume the input to be processed is
"She lived in a big house with a big garage on the outskirts of a big city in India"
Without a combiner, the map phase would emit (big,1) three times, (a,1) three times, and (in,1) two times. But when a combiner is used, the map phase would emit (big,3), (a,3) and (in,2). Note that the individual occurrences of each of these words are aggregated locally within the map phase before the output is emitted to the reduce phase. In use cases where a combiner is used, this local aggregation minimises the network traffic from map to reduce.
During the shuffle phase, the output from the various map phases is redirected to the correct reduce phase. This is handled internally by the framework. If a partitioner is used, it determines which reducer each record is shuffled to.
I don't think that combiner is a part of Shuffle and Sort phase.
Combiner, itself is one of the phases(optional) of the job lifecycle.
The pipelining of these phases could be like:
Map --> Partition --> Combiner(optional) --> Shuffle and Sort --> Reduce
Out of these phases, Map, Partition and Combiner operate on the same node.
Hadoop dynamically selects nodes to run the Reduce phase depending upon the availability and accessibility of resources, in the best possible way.
Shuffle and Sort, an important middle level phase works across the Map and Reduce nodes.
When a client submits a job, Map Phase starts working on input file which is stored across nodes in the form of blocks.
Mappers process each line of the file one by one and put the generated results into a memory buffer of 100MB (local memory for each mapper). When this buffer fills to a certain threshold, by default 80%, the buffer is sorted and then stored to disk (as a file). Each mapper can generate multiple such intermediate sorted splits or files. When the mapper is done with all the lines of the block, all such splits are merged together (to form a single file) and sorted (on the basis of the key), and then the Combiner phase starts working on this single file. Note that if there is no Partition phase, only one intermediate file will be produced, but with partitioning multiple files get generated, depending upon the developer's logic. The image in O'Reilly's Hadoop: The Definitive Guide may help you understand this concept in more detail.
Later, Hadoop copies the merged file from each of the mapper nodes to the reducer nodes, depending upon the key value. That is, all the records with the same key will be copied to the same reducer node.
I think you may already know in depth how the Shuffle-and-Sort and Reduce phases work, so I am not going into more detail on those topics.
Also, for more information, I would suggest you read O'Reilly's Hadoop: The Definitive Guide. It's an awesome book for Hadoop.
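As a side note, the 100MB buffer and the 80% threshold described above are configurable. A sketch using what I believe are the Hadoop 2.x property names (older releases used io.sort.mb and io.sort.spill.percent):

import org.apache.hadoop.conf.Configuration;

public class MapBufferTuning {
    // Sketch: the map-side sort buffer and the fill level that triggers a
    // spill, as described in the answer above. Values shown are the usual
    // defaults, not recommendations.
    public static Configuration configure() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.mb", 100);            // in-memory sort buffer, in MB
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); // fill level that starts a spill
        return conf;
    }
}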

Shuffle and sort for mapreduce

I read through the Definitive Guide and some other links on the web.
My question is:
Where exactly do shuffling and sorting happen?
As per my understanding, they happen on both mappers and reducers. But some links mention that shuffling happens on mappers and sorting on reducers.
Can someone confirm whether my understanding is correct? If not, can they point me to additional documentation I can go through?
Shuffle:
MapReduce makes the guarantee that the input to every reducer is sorted by key. The process by which the system performs the sort and transfers map outputs to the reducers as inputs is known as the shuffle.
Sort:
Sorting happens in various stages of a MapReduce program, so it can exist in both the Map and Reduce phases.
Please have a look at the shuffle-and-sort diagram in Hadoop: The Definitive Guide.
Adding more description of the Map and Reduce phases to go with that diagram:
The Map Side:
When the map function starts producing output, it is not simply written to disk. Before the map output is written to disk, the thread first divides the data into partitions corresponding to the reducers that the data will ultimately be sent to. Within each partition, the background thread performs an in-memory sort by key.
The Reduce Side:
When all the map outputs have been copied, the reduce task moves into the sort phase (which should properly be called the merge phase, as the sorting was carried out on the map side), which merges the map outputs, maintaining their sort ordering. This will be done in rounds.
Source : Hadoop Definitive Guide.
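Regarding the rounds mentioned above: the number of map outputs merged per round is governed by the merge factor. A small sketch (the property name is, to my knowledge, the Hadoop 2.x one; older releases called it io.sort.factor):

import org.apache.hadoop.conf.Configuration;

public class MergeFactorTuning {
    // Sketch: with a merge factor of 10 (the default) and, say, 50 sorted
    // map outputs, the merge proceeds in rounds of up to 10 streams each
    // until a final round feeds the reduce function.
    public static Configuration configure() {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.task.io.sort.factor", 10); // max streams merged per round
        return conf;
    }
}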

Same-machine-as-data processing on reduce side of map reduce

One of the big benefits of Hadoop MapReduce is the fact that map processes take place on the same machine where the data they operate upon resides (to the extent possible). But can this be, or is this perhaps already, true of the reduce side? For example, in the extreme case of a map-only job, all of the output data ends up on the same machine as the corresponding input data (right?). But in an intermediate case, in which the output is somewhat correlated with the input, it seems reasonable to partition the output and, to the extent possible, keep it on the same machine it started on.
Is this possible? Does this already happen?
Inputs to the reducers can reside on any node (local or remote), not necessarily on the same machine where they are running. As mappers complete, their output gets written to the local FS of the machine where they are running. Once this is done, the intermediate output is needed by the machines that are about to run the reduce task. One thing to note here is that all the values corresponding to a particular key go to the same reducer. So it's not always possible for the input to the reducers to be local, since different sets of key/value pairs are processed by different mappers running on different machines.
Now, before the mapper output is sent to the reducers for further processing, the data is partitioned based on keys; each partition goes to a reducer, and all the key/value pairs in that partition get processed by that reducer. During this process a lot of data shuffling takes place, so it's not possible to maintain data locality in the case of reducers.
Hope this answers the question.
If you know that the data for a particular reducer is already on the right node after the map phase, and the algorithm allows for it (see this blog post about it) you should insert your reducer as a combiner. Combiners are like miniature reducers that only get to see co-located data. Often you can dramatically improve performance because the combiner output can be orders of magnitude smaller than the map output, so what's left to shuffle is trivial.
Of course, if indeed the map phase leaves your data already correctly partitioned, why use a reducer at all? Why not create a second map job that simulates a reducer?
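For that second suggestion, a map-only job just needs the reduce task count set to zero; a minimal sketch of the relevant driver lines (the mapper class name is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyDriver {
    public static Job create() throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only pass");
        job.setJarByClass(MapOnlyDriver.class);
        job.setMapperClass(SimulatedReduceMapper.class); // hypothetical mapper doing the "reduce" work
        job.setNumReduceTasks(0); // no reduce phase: map output goes straight to HDFS, no shuffle or sort
        return job;
    }
}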
