Where to add a specific function in the Map/Reduce framework - Hadoop

I have a general question about the MapReduce framework.
I have a task that can be separated into several partitions. For each partition, I need to run a computation-intensive algorithm.
According to the MapReduce framework, it then seems that I have two choices:
1. Run the algorithm in the Map stage, so that in the Reduce stage there is no work to be done except collecting the results of each partition from the Map stage and summarizing them.
2. In the Map stage, just divide the data and send the partitions to the Reduce stage. In the Reduce stage, run the algorithm first, and then collect and summarize the results from each partition.
Correct me if I misunderstand.
I am a beginner. I may not understand MapReduce very well; I only have basic parallel computing concepts.

You're actually rather confused. In a broad and general sense, the map portion takes the task and divides it among some number of nodes, n. Each of those n nodes receives a fraction of the whole task and does something with its piece. When the nodes finish computing on their data, the reduce operation reassembles the results.
The REAL power of map-reduce is how scalable it is.
Given a dataset D running on a map-reduce cluster m with n nodes under it, each node is mapped roughly 1/n of D. The cluster m then reduces those pieces into a single result. Now, take one of those nodes, q, to itself be a cluster with p nodes under it. If m assigns q its share of D, q can in turn map that share into p smaller pieces for its own nodes. q's nodes then reduce the data back to q, and q supplies its result upward to its neighbors in m.
Make sense?

In MapReduce, you have a Mapper and a Reducer. You also have a Partitioner and a Combiner.
Hadoop uses a distributed file system that partitions (or splits, you might say) a file into blocks of BLOCK SIZE. These blocks are placed on different nodes. When a job is submitted to the MapReduce framework, it is divided such that there is a Mapper for every input split (for now, let's say an input split is one such partitioned block). Since these blocks are distributed across different nodes, the Mappers also run on different nodes.
In the Map stage:
1. The file is divided into records by the RecordReader; the definition of a record is controlled by the InputFormat we choose. Every record is a key-value pair.
2. The map() of our Mapper is run for every such record. The output of this step is again key-value pairs.
3. The output of our Mapper is partitioned using the Partitioner that we provide, or the default HashPartitioner. Here, by partitioning, I mean deciding which key and its corresponding values go to which Reducer (if there is only one Reducer, it is of no use anyway).
4. Optionally, you can also combine/minimize the output that is being sent to the Reducer. You can use a Combiner to do that. Note that the framework does not guarantee the number of times a Combiner will be called; it is only an optimization.
The map() is where your algorithm on the data is usually written. Since these tasks run in parallel, the Map stage is a good candidate for computation-intensive work.
After all the Mappers complete running on all nodes, the intermediate data, i.e. the data at the end of the Map stage, is copied to the corresponding Reducers.
In the Reduce stage, the reduce() of our Reducer is run on each record of data from the Mappers. Here, a record consists of a key and all of its corresponding values, not necessarily just one value. This is where you generally run your summarization/aggregation logic.
When you write your MapReduce job you usually think about what can be done on each record of data in both the Mapper and the Reducer. A MapReduce program can just contain a Mapper with map() implemented and a Reducer with reduce() implemented. This way you can focus more on what you want to do with the data and not bother about parallelizing. You don't have to worry about how the job is split; the framework does that for you. However, you will have to learn about it sooner or later.
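For illustration, here is a minimal sketch of such a Mapper/Reducer pair, modeled on the classic word count example (the class names and the whitespace tokenization are my own choices, not anything the framework mandates):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Runs once per input record (here, a line of text); emits (word, 1) pairs.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Runs once per key with all of that key's values; sums the counts.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}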
I would suggest you go through Apache's MapReduce tutorial or Yahoo's Hadoop tutorial for a good overview. I personally like Yahoo's explanation of Hadoop, but Apache's details are good and their explanation using the word count program is very nice and intuitive.
Also, regarding:
"I have a task, which can be separated into several partitions. For each partition, I need to run a computation-intensive algorithm."
The Hadoop distributed file system has the data split across multiple nodes, and the MapReduce framework assigns a map task to every split. So, in Hadoop, the processing goes to and executes where the data resides. You cannot define the number of map tasks to run; the data does. You can, however, specify/control the number of reduce tasks.
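To make that last point concrete, here is a hedged sketch of a driver using the hypothetical classes from the sketch above: the number of map tasks follows from the input splits, while the reducer count can be set explicitly (input/output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // You cannot set the number of map tasks directly; it is derived from
        // the input splits. The number of reduce tasks, however, is under your control:
        job.setNumReduceTasks(4);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}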
I hope I have comprehensively answered your question.

Related

What's the difference between shuffle phase and combiner phase?

I'm pretty confused about the MapReduce framework; I keep getting confused reading different sources about it. Anyway, this is my idea of a MapReduce job:
1. Map() --> emit <key, value>
2. Partitioner (OPTIONAL) --> divide the intermediate output from the mapper and assign it to different reducers
3. Shuffle phase, used to make: <key, list of values>
4. Combiner, a component used like a mini-reducer which performs some operations on the data and then passes that data on to the reducer. The Combiner works locally, not on HDFS, saving space and time.
5. Reducer, gets the data from the combiner, performs further operations (possibly the same as the combiner), then releases the output.
6. We will have n output parts, where n is the number of reducers.
Is this basically right? I mean, I found some sources stating that the combiner is the shuffle phase and that it basically groups each record by key...
The Combiner is NOT at all similar to the shuffling phase. What you describe as shuffling is wrong, and that is the root of your confusion.
Shuffling is just copying keys from map to reduce; it has nothing to do with key generation. It is the first phase of a Reducer, with the other two being sorting and then reducing.
Combining is like executing a reducer locally, for the output of each mapper. It basically acts like a reducer (it also extends the Reducer class), which means that, like a reducer, it groups the local values that the mapper has emitted for the same key.
Partitioning is, indeed, assigning the map output keys to specific reduce tasks, but it is not optional. Overriding the default HashPartitioner with an implementation of your own is optional.
I tried to keep this answer minimal, but you can find more information in the book Hadoop: The Definitive Guide by Tom White, as Azim suggests, and some related things in this post.
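As a rough illustration of the optional Partitioner override mentioned above, here is a sketch that routes keys to reducers by their first letter; the keying logic and class name are made up purely for illustration:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical example: assign keys to reduce partitions based on the key's
// first letter instead of the default hash of the whole key.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String s = key.toString();
        char first = s.isEmpty() ? 'a' : Character.toLowerCase(s.charAt(0));
        return first % numReduceTasks;
    }
}

It would be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class).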
Think of the combiner as a mini-reducer phase that works only on the output of the map task within each node, before that output is emitted to the actual reducer.
Taking the classical WordCount example, the map phase output would be (word, 1) for each word the map task processes. Let's assume the input to be processed is:
"She lived in a big house with a big garage on the outskirts of a big city in India"
Without a combiner, the map phase would emit (big,1) three times, (a,1) three times, and (in,1) two times. But when a combiner is used, the map phase would emit (big,3), (a,3) and (in,2). Note that the individual occurrences of each of these words are aggregated locally within the map phase before the output is emitted to the reduce phase. In use cases where a Combiner is appropriate, this local aggregation minimises the network traffic from map to reduce.
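In code, enabling this local aggregation is usually a single extra driver call, reusing the reducer as the combiner. A sketch, assuming WordCount-style Mapper/Reducer classes (the names here are placeholders) whose output types match their input types, which is what makes the reuse safe:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class CombinerSetup {
    // Assumes WordCountMapper/WordCountReducer exist, as in the earlier sketch.
    static void configure(Job job) {
        job.setMapperClass(WordCountMapper.class);
        // Local aggregation: (big,1),(big,1),(big,1) becomes (big,3) on the map side,
        // so less data crosses the network. The framework may call it 0, 1 or more times.
        job.setCombinerClass(WordCountReducer.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    }
}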
During the shuffle phase, the outputs from the various map tasks are redirected to the correct reducers. This is handled internally by the framework. If a partitioner is used, it determines how the input is shuffled to the reducers accordingly.
I don't think that the Combiner is part of the Shuffle and Sort phase.
The Combiner is itself one of the (optional) phases of the job lifecycle.
The pipeline of these phases looks like:
Map --> Partition --> Combiner (optional) --> Shuffle and Sort --> Reduce
Out of these phases, Map, Partition and Combiner operate on the same node.
Hadoop dynamically selects the nodes that run the Reduce phase, depending upon the availability and accessibility of resources, in the best possible way.
Shuffle and Sort, an important intermediate phase, works across the Map and Reduce nodes.
When a client submits a job, the Map phase starts working on the input file, which is stored across nodes in the form of blocks.
Mappers process each line of the file one by one and put the generated results into a memory buffer of 100 MB (memory local to each mapper). When this buffer fills to a certain threshold, 80% by default, it is sorted and then spilled to disk (as a file). Each Mapper can generate multiple such intermediate sorted spill files. When the Mapper is done with all the lines of its block, all such spills are merged together (to form a single file) and sorted (on the basis of the key), and then the Combiner phase starts working on this single file. Note that if there is no Partition phase, only one intermediate file is produced, but with partitioning multiple files are generated depending upon the developer's logic. The image from O'Reilly's Hadoop: The Definitive Guide may help you understand this concept in more detail.
Later, Hadoop copies the merged file from each of the Mapper nodes to the Reducer nodes, depending upon the key values. That is, all the records with the same key will be copied to the same Reducer node.
I think you may already know in depth how the Shuffle/Sort and Reduce phases work, so I am not going into more detail on those topics.
Also, for more information, I would suggest you read O'Reilly's Hadoop: The Definitive Guide. It's an awesome book for Hadoop.

What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?

In MapReduce programming, the reduce phase has shuffling, sorting and reduce as its sub-parts. Sorting is a costly affair.
What is the purpose of shuffling and sorting phase in the reducer in Map Reduce Programming?
First of all, shuffling is the process of transferring data from the mappers to the reducers, so I think it is obvious that it is necessary for the reducers, since otherwise they wouldn't be able to have any input (or input from every mapper). Shuffling can start even before the map phase has finished, to save some time. That's why you can see a reduce status greater than 0% (but less than 33%) when the map status is not yet 100%.
Sorting saves time for the reducer, helping it easily distinguish when a new reduce task should start. To put it simply, it starts a new reduce task when the next key in the sorted input data is different from the previous one. Each reduce task takes a list of key-value pairs, but it has to call the reduce() method, which takes a key and a list of values as input, so it has to group values by key. That's easy to do if the input data is pre-sorted (locally) in the map phase and simply merge-sorted in the reduce phase (since the reducers get data from many mappers).
Partitioning, that you mentioned in one of the answers, is a different process. It determines in which reducer a (key, value) pair, output of the map phase, will be sent. The default Partitioner uses a hashing on the keys to distribute them to the reduce tasks, but you can override it and use your own custom Partitioner.
A great source of information for these steps is this Yahoo tutorial (archived).
A nice graphical representation of this is the following (shuffle is called "copy" in this figure):
Note that shuffling and sorting are not performed at all if you specify zero reducers (setNumReduceTasks(0)). Then, the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting (so even the map phase is faster).
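A minimal sketch of such a map-only configuration (the helper wrapping and the job name are my own; the Job calls are the standard API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyJob {
    // Zero reducers: the job stops after the map phase, so there is no shuffle
    // and no sort, and each map task writes its output straight to HDFS.
    static Job create(Class<?> driverClass) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only job");
        job.setJarByClass(driverClass);
        job.setNumReduceTasks(0);
        return job;
    }
}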
UPDATE: Since you are looking for something more official, you can also read Tom White's book "Hadoop: The Definitive Guide". Here is the interesting part for your question.
Tom White has been an Apache Hadoop committer since February 2007, and is a member of the Apache Software Foundation, so I guess it is pretty credible and official...
Let's revisit the key phases of a MapReduce program.
The map phase is done by mappers. Mappers run on unsorted input key/value pairs. Each mapper emits zero, one, or multiple output key/value pairs for each input key/value pair.
The combine phase is done by combiners. The combiner should combine key/value pairs with the same key. Each combiner may run zero, one, or multiple times.
The shuffle and sort phase is done by the framework. Data from all mappers are grouped by the key, split among reducers, and sorted by the key. Each reducer obtains all values associated with the same key. The programmer may supply custom compare functions for sorting and a partitioner for the data split.
The partitioner decides which reducer will get a particular key/value pair.
The reducer obtains sorted key/[values list] pairs, sorted by the key. The value list contains all values with the same key produced by the mappers. Each reducer emits zero, one, or multiple output key/value pairs for each input key/value pair.
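In the Hadoop Java API, those pluggable pieces correspond to driver-side hooks roughly like this (a sketch; the wrapper method and its parameters are my own, the Job setters are the real API):

import org.apache.hadoop.io.RawComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

public class ShuffleTuning {
    // Wires the three pluggable pieces mentioned above into a job.
    static void configure(Job job,
                          Class<? extends Partitioner> partitioner,
                          Class<? extends RawComparator> sortComparator,
                          Class<? extends RawComparator> groupingComparator) {
        // Which reducer gets each key (the data split):
        job.setPartitionerClass(partitioner);
        // How keys are ordered within each reducer's input (the sort):
        job.setSortComparatorClass(sortComparator);
        // Which keys count as equal when grouping values for one reduce() call:
        job.setGroupingComparatorClass(groupingComparator);
    }
}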
Have a look at this javacodegeeks article by Maria Jurcovicova and the mssqltips article by Datta for a better understanding.
The image below is from the safaribooksonline article.
I thought of just adding some points missing in the above answers. This diagram, taken from here, clearly shows what's really going on.
To restate the real purpose of each step:
Split: Improves the parallel processing by distributing the processing load across different nodes (Mappers), which saves overall processing time.
Combine: Shrinks the output of each Mapper. It saves the time spent moving data from one node to another.
Sort (Shuffle & Sort): Makes it easy for the runtime to schedule (spawn/start) new reducers: while going through the sorted item list, whenever the current key is different from the previous one, it can spawn a new reducer.
Some data processing requirements don't need sorting at all. Syncsort has made the sorting in Hadoop pluggable; here is a nice blog post from them on sorting. The process of moving the data from the mappers to the reducers is called shuffling; check this article for more information on it.
I've always assumed this was necessary, as the output from the mapper is the input for the reducer, so it is sorted based on the keyspace and then split into buckets for each reducer input. You want to ensure that all the values of a given key end up in the same bucket going to the reducer, so they are reduced together. There is no point sending (K1,V2) and (K1,V4) to different reducers, as they need to be together in order to be reduced.
I'll try to explain it as simply as possible.
Shuffling is the process by which intermediate data from the mappers is transferred to 0, 1, or more reducers. Each reducer receives 1 or more keys and their associated values, depending on the number of reducers (for a balanced load). Furthermore, the values associated with each key are locally sorted.
Because of its size, a distributed dataset is usually stored in partitions, with each partition holding a group of rows. This also improves parallelism for operations like a map or filter. A shuffle is any operation over a dataset that requires redistributing data across its partitions. Examples include sorting and grouping by key.
A common method for shuffling a large dataset is to split the execution into a map and a reduce phase. The data is then shuffled between the map and reduce tasks. For example, suppose we want to sort a dataset with 4 partitions, where each partition is a group of 4 blocks. The goal is to produce another dataset with 4 partitions, but this time sorted by key.
In a sort operation, for example, each square is a sorted subpartition with keys in a distinct range. Each reduce task then merge-sorts subpartitions of the same shade.
The diagram shows this process. Initially, the unsorted dataset is grouped by color (blue, purple, green, orange). The goal of the shuffle is to regroup the blocks by shade (light to dark). This regrouping requires an all-to-all communication: each map task (a colored circle) produces one intermediate output (a square) for each shade, and these intermediate outputs are shuffled to their respective reduce tasks (gray circles).
The text and image were largely taken from here.
There are only two things that MapReduce does NATIVELY: sort and (implemented by sort) scalable GroupBy.
Most applications and design patterns over MapReduce are built on these two operations, which are provided by shuffle and sort.
This is a good read. Hope it helps. As for the sorting you are concerned about, I think it is for the merge operation in the last step of the Map side. When the map operation is done and the result needs to be written to local disk, a multi-way merge is performed on the spills generated from the buffer. And for a merge operation, sorting each partition in advance is helpful.
Well, in MapReduce there are two important phases called Mapper and Reducer. Both are important, but the Mapper is mandatory while in some programs the Reducer is optional. Now to your question.
Shuffling and sorting are two important operations in MapReduce. First, the Hadoop framework takes structured/unstructured data and separates it into key/value pairs.
The Mapper program then separates and arranges the data into keys and values to be processed, generating (key2, value2) pairs. These values must be processed and rearranged in the proper order to get the desired result. This shuffle and sort is done locally (the framework takes care of it), and after processing the framework cleans up the intermediate data on the local system.
A Combiner and a Partitioner can also be used to optimize this shuffle and sort process. After proper arrangement, those key/value pairs are passed to the Reducer to produce the desired client output.
(K1, V1) -> (K2, V2) (the Mapper we write) -> (K2, list(V2)) (here the framework shuffles and sorts the data) -> (K3, V3) (the Reducer generates the output).
Please note that all these steps are logical operations only; they do not change the original data.
Your question: what is the purpose of the shuffling and sorting phase in the reducer in MapReduce programming?
Short answer: to process the data to get the desired output. Shuffling aggregates the data; reducing produces the expected output.

Same-machine-as-data processing on reduce side of map reduce

One of the big benefits of Hadoop MapReduce is the fact that Map processes take place on the same machine that the data they operate upon resides on (to the extent possible). But can this be, or is this perhaps already, true of the Reduce side? For example, in the extreme case of a Map-only job, all of the output data ends up on the same machine as the corresponding input data (right?). But in an intermediate case in which the output is somewhat correlated with the input, it seems reasonable to partition the output and, to the extent possible, keep it on the same machine it started on.
Is this possible? Does this already happen?
Inputs to the Reducers can reside on any node (local or remote) and not necessarily on the same machine where they are running. As Mappers complete, their output gets written to the local FS of the machine where they are running. Once this is done, the intermediate output is needed by the machines that are about to run the reduce tasks. One thing to note here is that all the values corresponding to a particular key go to the same reducer. So it's not always possible for the input to the Reducers to be local, since different sets of key/value pairs are processed by different Mappers running on different machines.
Now, before the Mapper output is sent to Reducers for further processing, the data is partitioned based on keys, each partition goes to a Reducer, and all the key/value pairs in that partition get processed by that Reducer. During this process a lot of data shuffling takes place, so it's not generally possible to maintain data locality for the Reducers.
Hope this answers the question.
If you know that the data for a particular reducer is already on the right node after the map phase, and the algorithm allows for it (see this blog post about it) you should insert your reducer as a combiner. Combiners are like miniature reducers that only get to see co-located data. Often you can dramatically improve performance because the combiner output can be orders of magnitude smaller than the map output, so what's left to shuffle is trivial.
Of course, if indeed the map phase leaves your data already correctly partitioned, why use a reducer at all? Why not create a second map job that simulates a reducer?

How does partitioning in MapReduce exactly work?

I think I have a fair understanding of the MapReduce programming model in general, but even after reading the original paper and some other sources many details are unclear to me, especially regarding the partitioning of the intermediate results.
I will quickly summarize my understanding of MapReduce so far: We have a potentially very large input data set, which is automatically split up into M different pieces by the MR framework. For each piece, the framework schedules one map task, which is executed by one of the available processors/machines in my cluster. Each of the M map tasks outputs a set of key-value pairs, which is stored locally on the same machine that executed the map task. Each machine divides its disk into R partitions and distributes its computed intermediate key-value pairs among the partitions based on the intermediate keys. Then, for each distinct intermediate key, the framework starts one reduce task, which is again executed by any of the available machines.
Now my questions are:
1. In some tutorials it sounds like there could be map and reduce tasks executed in parallel. Is this right? How could that be, assuming that for each distinct intermediate key only one reduce task is started? Do we not have to wait until the last map task is finished before we can start the first reduce task?
2. As we have one reduce task per distinct intermediate key, is it right that each reduce task requires the executing machine to load the corresponding partition from every other machine? Potentially, every machine can have a key-value pair with the desired intermediate key, so for each reduce task we potentially have to query all other machines. Is that really efficient?
3. The original paper says that the number of partitions (R) is specified by the user. But isn't a partition the input for a reduce task? Or more exactly: isn't the union of all partitions with the same number among all machines the input of one reduce task? That would mean that R depends on the number of distinct intermediate keys, which the user usually doesn't know.
Conceptually it is clear what the inputs and outputs of the map and reduce functions/tasks are. But I think I haven't yet understood MapReduce on the technical level. Could somebody please help me understand?
You can start the reducer tasks while the map tasks are still running (using a feature known as slowstart), but the reducers can only run the copy phase (acquiring the completed results from the completed map tasks). They will need to wait for all the mappers to complete before they can actually perform the final sort and reduce.
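For reference, slowstart is controlled by a configuration property; a sketch of setting it (the 0.80 threshold here is just an example value, and the helper wrapping is my own):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SlowstartConfig {
    static Job create() throws Exception {
        Configuration conf = new Configuration();
        // Fraction of map tasks that must complete before reducers are scheduled
        // to start their copy phase; raising it delays the reduce/copy phase.
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.80f);
        return Job.getInstance(conf, "slowstart example");
    }
}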
A reduce task actually processes zero, one or more keys (rather than a discrete task for each key). Each reducer will need to acquire the map output from each map task that relates to its partition before these intermediate outputs are sorted and then reduced one key set at a time.
Back to the note in 2 - a reducer task (one for each partition) runs on zero, one or more keys rather than being a single task for each discrete key.
It's also important to understand the spread and variation of your intermediate keys, as they are hashed and modulo'd (if using the default HashPartitioner) to determine which reduce partition should process each key. Say you had an even number of reducer tasks (10), and output keys that always hashed to an even number - then the modulo of these hash values and 10 will always be an even number, meaning that the odd-numbered reducers would never process any data.
Addendum to what Chris said,
Basically, a partitioner class in Hadoop (e.g. the default HashPartitioner) has to implement this function:
int getPartition(K key, V value, int numReduceTasks)
This function is responsible for returning the partition number; the numReduceTasks variable gives you the number of reducers fixed when starting the job, as seen in the HashPartitioner.
Based on the integer the above function returns, Hadoop selects the node where the reduce task for a particular key should run.
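For reference, the default HashPartitioner's getPartition essentially boils down to a hash-and-modulo, roughly like this sketch:

import org.apache.hadoop.mapreduce.Partitioner;

// Roughly what Hadoop's default HashPartitioner does: mask off the sign bit
// so the hash is non-negative, then take the remainder by the reducer count.
public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}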
Hope this helps.

Data sharing in Hadoop Map Reduce chaining

Is it possible to share a value between a reducer and the mapper of the following job?
Or is it possible to store the output of the first reducer in memory so that the second mapper can access it from memory?
The problem is:
I have written a chained MapReduce job like Map1 --> Reducer1 --> Map2 --> Reducer2.
Map1 and Map2 read the same input file.
Reducer1 derives a value, say 'X', as its output.
I need 'X' and the input file for Map2.
How can we do this without reading the output file of Reducer1?
Is it possible to store 'X' in memory so it can be accessed by Mapper2?
Each job is independent of the others, so without storing the output in an intermediate location it's not possible to share data across jobs.
FYI, in the MapReduce model the map tasks don't talk to each other, and the same is the case for reduce tasks. Apache Giraph, which runs on Hadoop, uses communication between the mappers in the same job for iterative algorithms, which in plain MapReduce would require the same job to be run again and again since there is no communication between the mappers.
Not sure about the algorithm being implemented and why MR, but every MR algorithm can also be implemented in BSP. Here is a paper comparing BSP with MR. Some algorithms perform better in BSP than in MR. Apache Hama is an implementation of the BSP model, the way Apache Hadoop is an implementation of MR.
If the number of distinct rows produced by Reducer1 is small (say you have 10,000 (id, price) tuples), this two-stage processing is preferred. You can load the results of the first map/reduce into memory in each Map2 mapper and filter the input data there. That way, no unneeded data is transferred over the network and all data is processed locally. With the use of combiners, the amount of data can be even smaller.
In the case of a huge number of distinct rows, it looks like you need to read the data twice.
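One common way to hand the small Reducer1 output to every Map2 task is the distributed cache: ship the file with the second job (job.addCacheFile(...)) and load it into memory in the mapper's setup(). A rough sketch with made-up class names, assuming tab-separated Reducer1 output and the usual behaviour that cached files are symlinked into the task's working directory under their base name:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Map2 extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> sideData = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Files added in the driver with job.addCacheFile(new URI(".../reducer1/part-r-00000"))
        // are localized next to the task; read them once into memory here.
        URI[] cacheFiles = context.getCacheFiles();
        if (cacheFiles != null && cacheFiles.length > 0) {
            String localName = new Path(cacheFiles[0].getPath()).getName();
            try (BufferedReader reader = new BufferedReader(new FileReader(localName))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] parts = line.split("\t", 2); // assumed tab-separated Reducer1 output
                    if (parts.length == 2) {
                        sideData.put(parts[0], parts[1]);
                    }
                }
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Use the in-memory side data (e.g. the value 'X' from Reducer1) to filter
        // or enrich each input record before emitting it.
        // ... your Map2 logic here ...
    }
}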
