Hadoop map process - hadoop

If there is a job that has only a map and no reduce, and if all data values that are to be processed are mapped to a single key, will the job be processed on only a single node?

No.
Basically, the number of nodes used is determined by the number of mappers: one mapper runs on one node, N mappers run on N nodes, one node per mapper.
The number of mappers needed for your job is set by Hadoop, depending on the amount of data and on the size of the blocks your data is split into. Each block of data is processed by one mapper.
So if, for instance, your data is split into N blocks, you will need N mappers to process it.
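To make the split-to-mapper relationship concrete, here is a small standalone sketch (the file size and block size are made-up example values, not from the question). The split-size formula mirrors the max(minSize, min(maxSize, blockSize)) rule that FileInputFormat applies, and one map task is created per split.

public class SplitCountSketch {
    // Mirrors the split-size rule used by FileInputFormat:
    // max(minSize, min(maxSize, blockSize)).
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long fileSize  = 1024L * 1024 * 1024;  // 1 GB of input data (example value)
        long blockSize = 128L * 1024 * 1024;   // 128 MB HDFS block size (example value)
        long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);

        // One mapper per split, so the mapper count is ceil(fileSize / splitSize).
        long numMappers = (fileSize + splitSize - 1) / splitSize;
        System.out.println("splits = mappers = " + numMappers);  // prints 8
    }
}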

Directly from Hadoop: The Definitive Guide, Chapter 6, "Anatomy of a MapReduce Job Run":
"To create the list of tasks to run, the job scheduler first retrieves the input splits computed by the client from the shared filesystem. It then creates one map task for each split. The number of reduce tasks to create is determined by the mapred.reduce.tasks property in the Job, which is set by the setNumReduceTasks() method, and the scheduler simply creates this number of reduce tasks to be run. Tasks are given IDs at this point."
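Tying this back to the original question: a map-only job is what you get when the reduce task count is set to zero. Below is a minimal, hypothetical driver sketch (class name, key/value types and paths are placeholders, not taken from any of the posts here); with zero reducers there is no shuffle or sort, and each mapper writes its output straight to HDFS.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {

    // A trivial mapper that just passes each input line through.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(line, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only example");
        job.setJarByClass(MapOnlyDriver.class);
        job.setMapperClass(PassThroughMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        // Zero reduce tasks: each mapper's output is written straight to HDFS;
        // there is no shuffle, no sort, and no reducer involved.
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}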

Related

number of mapper and reducer tasks in MapReduce

If I set the number of reduce tasks to something like 100 and, when I run the job, the actual number of reduce tasks exceeds that (as per my understanding, the number of reduce tasks depends on the key-values we get from the mapper; suppose I set (1,abc) and (2,bcd) as key-values in the mapper, then the number of reduce tasks will be 2), how will MapReduce handle it?
as per my understanding the number of reduce tasks depends on the key-value we get from the mapper
Your understanding seems to be wrong. The number of reduce tasks does not depend on the key-value pairs we get from the mapper.
In a MapReduce job, the number of reducers is configurable on a per-job basis and is set in the driver class.
For example, if we need 2 reducers for our job, then we set it in the driver class of our MapReduce job as below:
job.setNumReduceTasks(2);
In the Hadoop: The Definitive Guide book, Tom White states that choosing the reducer count is more of an art than a science.
So we have to decide how many reducers we need for our job. For your example, if the intermediate mapper output is (1,abc) and (2,bcd) and you have not set the number of reducers in the driver class, then MapReduce by default runs only 1 reducer, both key-value pairs are processed by that single reducer, and you get a single output file in the specified output directory.
The default number of reducers in MapReduce is 1, irrespective of the number of (key,value) pairs.
If you set the number of reducers for a MapReduce job, then the number of reducers will never exceed the defined value, irrespective of the number of distinct (key,value) pairs.
Once the mapper tasks are completed, their output is processed by the partitioner, which divides the data among the reducers. The default partitioner in Hadoop is HashPartitioner, which partitions the data based on the hash value of the keys. It has a method called getPartition, which takes key.hashCode() & Integer.MAX_VALUE and finds the modulus using the number of reduce tasks.
So the number of reducers will never exceed what you have defined in the driver class.
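To make the partitioning step concrete, here is a sketch of the HashPartitioner logic described above (illustrative, not the library source verbatim). The partition index, and therefore the reducer a key is sent to, is always bounded by the configured reduce task count.

import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative re-statement of the default HashPartitioner behaviour:
// mask off the sign bit of the key's hash, then take the modulus by the
// number of reduce tasks, so the result is always in [0, numReduceTasks - 1].
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

So in the 100-reducer example from the question, only the partitions that actually receive keys get any data; the other reduce tasks still run but simply produce empty output files.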

Does execution of the Map and Reduce phases happen inside each DataNode via the Node Manager?

I understand that the Resource Manager sends the MapReduce program to each Node Manager so that MapReduce gets executed on each node.
But after seeing this image, I am getting confused about where the actual Map and Reduce tasks are executed and how shuffling happens between data nodes.
Isn't it a time-consuming process to sort and shuffle/send data across different data nodes to perform the reduce job? Please explain.
Also, let me know what the Map Node and Reduce Node are in this diagram.
Image Src: http://gppd-wiki.inf.ufrgs.br/index.php/MapReduce
An input split is a logical chunk of a file stored on HDFS; by default an input split represents one block of the file, and the blocks of the file may be stored on many data nodes in the cluster.
A container is a task execution template allocated by the Resource Manager on any of the data nodes in order to execute the Map/Reduce tasks.
First, the Map tasks are executed by containers on the data nodes where the Resource Manager allocated them, as near as possible to the input split's location, adhering to the rack-awareness policy (node-local / rack-local / DC-local).
The Reduce tasks are executed in containers on any data nodes, and the reducers copy their relevant data from every mapper via the shuffle/sort process.
The mappers prepare their results in such a way that the output is internally partitioned, within each partition the records are sorted by key, and the partitioner determines which reducer should fetch each partition.
During shuffle and sort, the reducers copy their relevant partitions from every mapper's output over HTTP; eventually each reducer merges and sorts the copied partitions and prepares a final single sorted file before the reduce() method is invoked.
The image below may give more clarification.
[Image src: http://www.ibm.com/developerworks/cloud/library/cl-openstack-deployhadoop/]
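To make the shuffle/sort result concrete: by the time reduce() is invoked, the framework has already merged each key's values into a single group, in key-sorted order. A minimal reducer sketch (key/value types are assumed for illustration) might look like this:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {   // all values for this key, already merged and sorted by key
            sum += v.get();
        }
        ctx.write(key, new IntWritable(sum));
    }
}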

Hadoop Map/Reduce Job distribution

I have 4 nodes and I am running a MapReduce sample project to see if the job is being distributed among all 4 nodes. I ran the project multiple times and noticed that the mapper tasks are being split among all 4 nodes, but the reducer task is only being done by one node. Is this how it is supposed to be, or is the reducer task supposed to be split among all 4 nodes as well?
Thank you
Distribution of mappers depends on which block of data a mapper will operate on. The framework by default tries to assign the task to a node which has that block of data stored locally. This prevents network transfer of the data.
For reducers, again, it depends on the number of reducers your job requires. If your job uses only one reducer, it may be assigned to any of the nodes.
Also impacting this is speculative execution. If it is on, multiple instances of a map task / reduce task may start on different nodes, and the JobTracker, based on percentage completion, decides which one will go through while the other instances are killed.
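If you want to rule speculative execution out as a factor, it can be switched off per job. A hedged sketch is below; the property names assume a Hadoop 2.x-style configuration (older releases use mapred.map.tasks.speculative.execution / mapred.reduce.tasks.speculative.execution), and the rest of the driver is elided.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class NoSpeculationDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Disable speculative duplicates of map and reduce tasks for this job.
        conf.setBoolean("mapreduce.map.speculative", false);
        conf.setBoolean("mapreduce.reduce.speculative", false);

        Job job = Job.getInstance(conf, "no-speculation example");
        // ... set mapper/reducer classes, input and output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}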
Let us say you have a 224 MB file. When you add that file into HDFS, based on the default block size of 64 MB, the file is split into 4 blocks [blk1=64M, blk2=64M, blk3=64M, blk4=32M]. Let us assume blk1 is on node1, represented as blk1::node1, blk2::node2, blk3::node3, blk4::node4. Now when you run the MR job, the map tasks need to access the input file, so the MR framework creates 4 mappers, executed one on each node. Now comes the reducer: as Venkat said, it depends on the number of reducers configured for your job. The reducers can be configured using the Hadoop org.apache.hadoop.mapreduce.Job setNumReduceTasks(int tasks) API.

Hadoop streaming api - limit number of mappers on a per job basis

I have a job running on a small Hadoop cluster, and I want to limit the number of mappers it spawns per datanode. When I use -Dmapred.map.tasks=12, it still spawns 17 mappers for some reason. I've figured out a way to limit it globally, but I want to do it on a per-job basis.
In MapReduce, the total number of mappers spawned depends on the input splits created from your data.
One mapper task is spawned per input split, so you cannot directly decrease the mapper count in MapReduce.
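If the goal is simply fewer mappers, one indirect approach is to make the input splits larger so that each mapper processes more data. The sketch below is a hypothetical new-API driver fragment, not taken from the question; for a streaming job the same effect should come from the corresponding -D property (mapreduce.input.fileinputformat.split.minsize in newer releases, mapred.min.split.size in older ones).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class FewerMappersDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "fewer-mappers example");
        job.setJarByClass(FewerMappersDriver.class);

        // Ask for splits of at least 256 MB: with a 128 MB block size, a 1 GB
        // input then yields about 4 mappers instead of 8 (example numbers).
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
        FileInputFormat.addInputPath(job, new Path(args[0]));

        // ... mapper class, output key/value types and output path as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}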

Difference and relationship between slots, map tasks, data splits, Mapper

I have gone through a few Hadoop books and papers.
A slot is a map/reduce computation unit at a node; it may be a map or a reduce slot.
As far as I know, a split is a group of blocks of files in HDFS, which has some length and the locations of the nodes where they are stored.
Mapper is a class, but when the code is instantiated it is called a map task.
Am I right?
I am not clear about the difference and relationship between map tasks, data splits and Mapper.
Regarding scheduling, I understand that when a map slot on a node is free, a map task is chosen from the non-running map tasks and launched if the data to be processed by that map task is on the node.
Can anyone explain this clearly in terms of the above concepts: slots, Mapper, map task, etc.?
Thanks,
Arun
As far as I know, a split is a group of blocks of files in HDFS, which has some length and the locations of the nodes where they are stored.
An InputSplit is a unit of data which a particular mapper will process. It need not be just a group of HDFS blocks; it can be a single line, 100 rows from a DB, a 50 MB file, etc.
I am not clear about the difference and relationship between map tasks, data splits and Mapper.
An InputSplit is processed by a map task, and a running instance of the Mapper class is a map task.
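The relationship is visible in the Mapper contract itself: the framework creates one Mapper instance per map task, i.e. per InputSplit, and calls map() once for every record in that split. A minimal sketch (key/value types assumed for illustration):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One instance of this class is created for each map task, and each map task
// handles exactly one InputSplit of the input data.
public class TokenCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // Called once per record of this task's InputSplit.
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                ctx.write(word, ONE);
            }
        }
    }
}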
As I understand:
first, the data is split in HDFS across the data nodes;
then, when there is a new job, the JobTracker divides this job into map and reduce tasks;
and then the JobTracker assigns each map task to a node which already has the split of data related to that map task, so the data is local to the node, there is no cost for moving data, and the execution time is as low as possible;
but sometimes we have to assign a task to a node which does not have the data on it, so that node has to fetch the data over the network and then process it.
An input split is not the data itself; it is a reference to the particular amount of data that a map task processes. Usually it is the same as the block size, because if the two sizes are not the same and some of the data is on a different node, then we would need to transfer that data.
MAPPER: Mapper is a class.
MAPPER PHASE: the mapper phase is the code that converts the input into key-value pairs (key, value).
MAPPER SLOT: a slot on a node in which the mapper or reducer code is executed.
