I have gone through a few Hadoop books and papers.
A slot is a map/reduce computation unit at a node; it may be a map slot or a reduce slot.
As far as I know, a split is a group of blocks of files in HDFS which has some length and the locations of the nodes where they are stored.
Mapper is a class, but when the code is instantiated it is called a map task.
Am I right?
I am not clear on the difference and relationship between map tasks, data splits and Mapper.
Regarding scheduling, I understand that when a map slot of a node is free, a map task is chosen from the non-running map tasks and launched if the data to be processed by that map task is on the node.
Can anyone explain it clearly in terms of the above concepts: slots, Mapper, map task etc.?
Thanks,
Arun
As far as I know, a split is a group of blocks of files in HDFS which have the same length and the locations of the nodes where they are stored.
An InputSplit is the unit of data that a particular mapper will process. It need not be just a group of HDFS blocks; it can be a single line, 100 rows from a DB, a 50 MB file, etc.
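For illustration only (none of this is from the original post), the input format controls that granularity: with the stock FileInputFormat you can cap the split size, and NLineInputFormat turns a fixed number of lines into one split, which is how a split can be much smaller than an HDFS block. A minimal sketch:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");

        // Cap each split at roughly 50 MB, regardless of the HDFS block size.
        FileInputFormat.setMaxInputSplitSize(job, 50L * 1024 * 1024);

        // Alternatively, make every 100 input lines one split (and thus one map task).
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 100);
    }
}
```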
I am not clear about the difference and relationship between map tasks, data splits and Mapper.
An InputSplit is processed by a map task, and a map task is a running instance of Mapper.
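To make that concrete, here is a minimal word-count-style Mapper sketch (illustrative, not the asker's code). The framework creates one map task per InputSplit, and each map task runs an instance of a class like this over the records of its split:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One map task = one running instance of this class, fed one InputSplit.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // emit (word, 1) for every word in the record
        }
    }
}
```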
As I understand:
first, the data is split in HDFS across the data nodes;
then, when there is a new job, the JobTracker divides the job into map and reduce tasks;
then, the JobTracker assigns each map task to a node which already has the split of data related to that map task, so the data is local to the node, there is no cost for moving data, and the execution time is as short as possible;
but sometimes a task has to be assigned to a node which does not have the data on it, so that node has to fetch the data over the network and then process it.
An input split is not the data itself; it is a reference to a particular amount of data that a map task processes. Usually it is the same as the block size, because if the two sizes are not the same and some of the data is on a different node, then we need to transfer that data.
MAPPER: Mapper is a class.
MAPPER PHASE: the mapper phase is the code that converts the input into (key, value) pairs.
MAPPER SLOT: a slot is where the mapper or reducer code executes.
As per my understanding, the mapper runs first, followed by the partitioner (if any), followed by the reducer. But if we use a Partitioner class, I am not sure when the sorting and shuffling phases run?
A CLOSER LOOK
The diagram below explains the complete details.
From this diagram, you can see where the mapper and reducer components of the Word Count application fit in, and how it achieves its objective. We will now examine this system in a bit closer detail.
[diagram: mapreduce-flow]
The Shuffle and Sort phase will always execute (across the mapper and reducer nodes).
The hierarchy of the different phases in MapReduce is as below:
Map --> Partition --> Combiner(optional) --> Shuffle and Sort --> Reduce.
The short answer is: data sorting runs on the reducers; shuffling/sorting runs before the reducer (always) and after the map, the combiner (if any) and the partitioner (if any).
The long answer is that in a MapReduce job there are 4 main players:
Mapper, Combiner, Partitioner, Reducer. All of these are classes you can actually implement on your own.
Let's take the famous word count program, and let's assume the split we are working on contains:
pippo, pluto, pippo, pippo, paperino, pluto, paperone, paperino, paperino
and each word is a record.
Mapper
Each mapper runs over a subset of your file; its task is to read each record from the split and assign a key to each record, which it will then output.
The mapper stores its intermediate results on disk (the local disk).
The intermediate output from this stage will be:
pippo,1
pluto,1
pippo,1
pippo,1
paperino,1
pluto,1
paperone,1
paperino,1
paperino,1
All of this will be stored on the local disk of the node which runs the mapper.
Combiner
It's a mini-reducer and can aggregate data. It can also perform joins, so-called map-side joins. This object helps to save bandwidth in the cluster because it aggregates data on the local node.
The output from the combiner, which is still part of the mapper phase, will be:
pippo,3
pluto,2
paperino,3
paperone,1
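As a rough sketch (the class name is illustrative, and in word count the very same summing class is usually registered both as the combiner and as the reducer), the local aggregation can look like this:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts for one key; usable both as the combiner and as the reducer.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();          // e.g. pippo: 1 + 1 + 1 = 3
        }
        total.set(sum);
        context.write(word, total);      // emit (pippo, 3)
    }
}
```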
Of course these are the data from ONE node. Now we have to send the data to the reducers in order to get the global result. Which reducer will process a record depends on the partitioner.
Partitioner
Its task is to spread the data across all the available reducers. This object reads the output from the combiner and selects the reducer which will process each key.
In this example we have two reducers and we use the following rule:
all the pippo goes to reducer 1
all the pluto goes to reducer 2
all the paperino goes to reducer 2
all the paperone goes to reducer 1
so all the nodes will send records which have the key pippo to the same reducer (1), all the nodes will send records which have the key pluto to the same reducer (2), and so on...
This is where the data gets shuffled/sorted and, since the combiner has already reduced the data locally, this node has to send only 4 records instead of 9.
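A hypothetical Partitioner implementing exactly that rule could look like the sketch below (partition numbers are 0-based, so "reducer 1" and "reducer 2" from the example become partitions 0 and 1):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative rule from the example: pippo and paperone go to reducer 1 (partition 0),
// pluto and paperino go to reducer 2 (partition 1).
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        String word = key.toString();
        if (word.equals("pippo") || word.equals("paperone")) {
            return 0;
        }
        return 1 % numReduceTasks;   // guard against a job configured with a single reducer
    }
}
```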
Reducer
This object aggregates the data arriving from every node, and the framework delivers its input already merged and sorted by key.
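Putting the four players together, a minimal driver sketch might look like the following; the class names refer to the illustrative sketches above, not to any real code from the original posts:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);       // Mapper
        job.setCombinerClass(WordCountReducer.class);    // Combiner (optional)
        job.setPartitionerClass(WordPartitioner.class);  // Partitioner (optional)
        job.setReducerClass(WordCountReducer.class);     // Reducer
        job.setNumReduceTasks(2);                        // two reducers, as in the example

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```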
If there is a job that has only a map and no reduce, and if all the data values to be processed are mapped to a single key, will the job only be processed on a single node?
No.
Basically, the number of nodes is determined by the number of mappers: 1 mapper runs on 1 node, N mappers on N nodes, one node per mapper.
The number of mappers needed for your job is set by Hadoop, depending on the amount of data and on the size of the blocks your data is split into. Each block of data is processed by 1 mapper.
So if, for instance, you have an amount of data that is split into N blocks, you will need N mappers to process it.
Directly from Hadoop: The Definitive Guide, Chapter 6, "Anatomy of a MapReduce Job Run":
"To create the list of tasks to run, the job scheduler first retrieves
the input splits computed by the client from the shared filesystem. It
then creates one map task for each split. The number of reduce tasks
to create is determined by the mapred.reduce.tasks property in the
Job, which is set by the setNumReduceTasks() method, and the scheduler
simply creates this number of reduce tasks to be run. Tasks are given
IDs at this point."
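For reference, a small sketch of the two equivalent knobs the quote mentions; the number of map tasks, by contrast, simply follows from the number of input splits:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReduceCountDemo {
    public static void main(String[] args) throws Exception {
        // The old-style property named in the quote...
        Configuration conf = new Configuration();
        conf.setInt("mapred.reduce.tasks", 4);

        // ...which is what the setNumReduceTasks() method sets for the job.
        Job job = Job.getInstance(conf, "reduce-count-demo");
        job.setNumReduceTasks(4);
    }
}
```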
I understand that the Resource Manager sends the MapReduce program to each Node Manager so that MapReduce gets executed on each node.
But after seeing this image, I am getting confused about where the actual map & reduce jobs are executed and how the shuffling happens between data nodes.
Isn't it a time-consuming process to sort and shuffle/send data across different data nodes to perform the reduce job? Please explain.
Also let me know what the Map Node and Reduce Node are in this diagram.
Image Src: http://gppd-wiki.inf.ufrgs.br/index.php/MapReduce
The input split is a logical chunk of the file stored on HDFS; by default an input split represents a block of a file, and the blocks of the file might be stored on many data nodes in the cluster.
A container is a task execution template allocated by the Resource Manager on any of the data nodes in order to execute the map/reduce tasks.
First the map tasks get executed by containers on data nodes; the Resource Manager allocates those containers as near as possible to the input split's location, adhering to the rack-awareness policy (node local / rack local / DC local).
The reduce tasks will be executed by containers on any data nodes, and each reducer copies its relevant data from every mapper through the shuffle/sort process.
The mappers prepare their results in such a way that the results are internally partitioned, within each partition the records are sorted by key, and the partitioner determines which reducer should fetch each partition.
Through shuffle and sort, the reducers copy their relevant partitions from every mapper's output over HTTP; eventually every reducer merge-sorts the copied partitions and prepares the final single sorted file before the reduce() method is invoked.
The image below may give more clarification.
[Image src: http://www.ibm.com/developerworks/cloud/library/cl-openstack-deployhadoop/]
I am confused about one part of Hadoop's implementation.
I notice that when I run my Hadoop MapReduce job with multiple mappers and reducers, I get many part-xxxxx files. Meanwhile, it is true that a key only appears in one of them.
Thus, I am wondering how MapReduce works such that a key only goes to one output file.
Thanks in advance.
The shuffle step in the MapReduce process is responsible for ensuring that all records with the same key end up in the same reduce task. See this Yahoo tutorial for a description of the MapReduce data flow. The section called Partition & Shuffle states that
Each map task may emit (key, value) pairs to any partition; all values for the same key are always reduced together regardless of which mapper is its origin.
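The mechanism behind that guarantee is the partition function: every map task applies the same function to a key, so a given key always lands in the same partition, is fetched by the same reducer, and ends up in that reducer's single part-xxxxx file. The default behaviour is hash partitioning, roughly as in this sketch (modelled on Hadoop's HashPartitioner):

```java
import org.apache.hadoop.mapreduce.Partitioner;

// A sketch of default hash partitioning: the same key always hashes to the
// same partition, no matter which map task emitted it.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```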
Shuffle
Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP.
Sort
The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage.
The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.
I got this from here:
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Have a look at it; I hope this will be helpful.
I searched a lot and learned that in every map task, when the content of the buffer reaches a threshold, a thread partitions the data according to the number of reducers. What is the role of the number of reducers here? Why does partitioning happen on the map side? How does it help the map phase? After sorting, the thread spills the content to disk.
How does that happen? I can't understand the meaning of "spilling" here...
Thanks.
The map side needs to partition the data because the reducers poll and pull, from each mapper, only the data that is relevant to that reducer.
If you imagine it the other way around, with each reducer pulling all of the output from every map, you would be sending all the data output from each mapper to each reducer, which would be hugely inefficient.
So by partitioning in the mapper, each reducer is able to query and pull back only the data it needs to reduce from each mapper.
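For completeness, a hedged sketch of the map-side spill knobs the question touches on; the property names below are the Hadoop 2.x names as far as I recall (older releases used io.sort.mb and io.sort.spill.percent), so check your version's mapred-default.xml:

```java
import org.apache.hadoop.conf.Configuration;

public class SpillTuningSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Size (in MB) of the in-memory buffer a map task fills with its output
        // before spilling to local disk.
        conf.setInt("mapreduce.task.io.sort.mb", 256);

        // Fraction of that buffer which, once reached, makes a background thread
        // partition, sort and spill the buffered records to disk.
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f);
    }
}
```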