Can anyone explain to me how Hadoop decides to pass jobs to map and reduce? Hadoop jobs are passed on to map and reduce, but I am not able to figure out how it's done.
Thanks in advance.
Please refer to Hadoop: The Definitive Guide, Chapter 6, the "Anatomy of a MapReduce Job Run" topic. Happy learning.
From the Apache MapReduce tutorial:
Job Configuration:
Job represents a MapReduce job configuration.
Job is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution. The framework tries to faithfully execute the job as described by Job
Task Execution & Environment
The MRAppMaster executes the Mapper/Reducer task as a child process in a separate jvm.
Mapper
Mapper maps input key/value pairs to a set of intermediate key/value pairs.
How Many Maps?
The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
Reducer
Reducer reduces a set of intermediate values which share a key to a smaller set of values.
The number of reduces for the job is set by the user via Job.setNumReduceTasks(int).
How Many Reduces?
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
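As a worked example of that rule of thumb, here is a minimal sketch. The class and method names are hypothetical, the cluster size (4 nodes, 8 containers per node) is made up for illustration, and rounding down to a whole number is my assumption, since the tutorial does not say how to round.

```java
final class ReducerCountSketch {
    // Hypothetical helper: factor * (nodes * max containers per node),
    // rounded down (the rounding choice is an assumption).
    static int suggestedReduces(double factor, int nodes, int maxContainersPerNode) {
        return (int) Math.floor(factor * nodes * maxContainersPerNode);
    }

    public static void main(String[] args) {
        int nodes = 4;             // illustrative cluster size
        int containersPerNode = 8; // illustrative per-node container limit

        // The two factors quoted from the tutorial.
        System.out.println("0.95 factor: " + suggestedReduces(0.95, nodes, containersPerNode));
        System.out.println("1.75 factor: " + suggestedReduces(1.75, nodes, containersPerNode));
    }
}
```

The resulting number is what you would then pass to Job.setNumReduceTasks(int).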
Job Submission and Monitoring
The job submission process involves:
Checking the input and output specifications of the job.
Computing the InputSplit values for the job.
Setting up the requisite accounting information for the DistributedCache of the job, if necessary.
Copying the job’s jar and configuration to the MapReduce system directory on the FileSystem.
Submitting the job to the ResourceManager and optionally monitoring its status.
Job Input
InputFormat describes the input-specification for a MapReduce job. InputSplit represents the data to be processed by an individual Mapper.
Job Output
OutputFormat describes the output-specification for a MapReduce job.
Go through that tutorial for further understanding of complete workflow.
From the AnatomyMapReduceJob article at http://ercoppa.github.io/:
[Figure: MapReduce job workflow diagram from that article]
Related
I configured Hadoop in pseudo-distributed mode as a single node. I want to know how exactly it will process a job, and how many mappers and reducers will take part in completing it.
The number of mappers depends on your input splits, and the number of reducers depends on what you set with job.setNumReduceTasks(); if you set nothing, the default is 1. Read Hadoop: The Definitive Guide for more info.
I have 4 nodes and I am running a MapReduce sample project to see if the job is being distributed between all 4 nodes. I ran the project multiple times and noticed that the mapper task is being split among all 4 nodes, but the reducer task is only being done by one node. Is this how it is supposed to be, or is the reducer task supposed to be split among all 4 nodes as well?
Thank you
The distribution of mappers depends on which block of data each mapper will operate on. By default, the framework tries to assign the task to a node that stores that block of data. This avoids transferring the data over the network.
For reducers, it again depends on the number of reducers your job requires. If your job uses only one reducer, it may be assigned to any of the nodes.
Speculative execution also affects this. If it is turned on, multiple instances of a map or reduce task may be started on different nodes; the job tracker decides, based on % completion, which instance goes through, and the other instances are killed.
Let us say you have a 224 MB file. When you add that file to HDFS with a block size of 64 MB (the default assumed here), the file is split into 4 blocks [blk1=64M, blk2=64M, blk3=64M, blk4=32M]. Let us assume blk1 is on node1, represented as blk1::node1, and likewise blk2::node2, blk3::node3, blk4::node4. Now when you run the MR job, the map phase needs to access the input file, so the MR framework creates 4 mappers, one per block, and each is executed on the node holding its block. Now comes the reducer: as Venkat said, it depends on the number of reducers configured for your job. The number of reducers can be configured using the Hadoop org.apache.hadoop.mapreduce.Job setNumReduceTasks(int tasks) API.
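The block arithmetic in that example can be checked with a short sketch. The class and method names are hypothetical, and it assumes one mapper per block, as the answer does; the real split computation in Hadoop has additional rules that are not modelled here.

```java
final class BlockCountSketch {
    static final long MB = 1024L * 1024L;

    // Returns the size in bytes of each block a file of the given size
    // occupies. The last block holds only the remainder.
    static long[] blockSizes(long fileSizeBytes, long blockSizeBytes) {
        // Ceiling division: a partial final block still counts as a block.
        int count = (int) ((fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes);
        long[] sizes = new long[count];
        long remaining = fileSizeBytes;
        for (int i = 0; i < count; i++) {
            sizes[i] = Math.min(blockSizeBytes, remaining);
            remaining -= sizes[i];
        }
        return sizes;
    }

    public static void main(String[] args) {
        long[] sizes = blockSizes(224 * MB, 64 * MB);
        System.out.println("blocks (and mappers, one per block): " + sizes.length);
        for (int i = 0; i < sizes.length; i++) {
            System.out.println("blk" + (i + 1) + " = " + (sizes[i] / MB) + "M");
        }
    }
}
```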
What is the difference between a mapper and a map task?
Similarly, a reducer and a reduce task?
Also, how are the numbers of mappers, map tasks, reducers, and reduce tasks determined during the execution of a MapReduce job?
Give the interrelationships between them, if there are any.
Simply put, a map task is a running instance of a Mapper. Mapper and Reducer are the classes you implement in a MapReduce job; map tasks and reduce tasks are what the framework launches to run them.
When we run a MapReduce job, the number of map tasks spawned depends on the number of input splits, which by default follows the number of blocks in the input. The number of reduce tasks, however, can be specified in the MapReduce driver code: either set the property mapred.reduce.tasks in the job configuration object, or call the org.apache.hadoop.mapreduce.Job#setNumReduceTasks(int reducerCount) method.
The old JobConf API had a setNumMapTasks() method. It was removed in the new API, org.apache.hadoop.mapreduce.Job, with the intention that the number of mappers be calculated from the input splits.
I'm new to Hadoop development. I read about the Hadoop cluster structure and understood that there is one namenode, a jobtracker, tasktrackers, and multiple datanodes.
When we write map-reduce programs we implement a mapper and a reducer. I also understand the logic of these classes. But I don't understand how they are executed in the Hadoop cluster.
Is the mapper executed on the namenode only?
Is the reducer executed separately on the datanodes?
I need to do a lot of parallel computations and don't want to use HDFS. How can I be sure that each output collection (from the mapper) executes separately on all datanodes?
Please explain to me the connection between the Hadoop cluster and the map/reduce logic.
Thanks a lot!
MapReduce jobs are executed by the JobTracker and the TaskTrackers.
The JobTracker starts the job by dividing the input file or files into splits. The TaskTrackers are given these splits and run map tasks on them (one map task per split). The mappers then emit their output, which is passed on to the reducers according to the map output keys: records with the same key are sent to the same reducer. There can be more than one reducer, depending on your configuration. The reducer processes also run on the TaskTracker nodes.
You can see the stats of the job on the JobTracker UI, which by default runs on port 50030.
You can also visit my website for example topics on big data technologies. You can post your questions there too, and I will try to answer.
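To illustrate how keys end up at reducers, here is a minimal sketch. It uses the same formula as Hadoop's default HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks, but the class and method names are hypothetical, and plain Java String keys stand in for Hadoop key types, so the actual partition numbers on a cluster would differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class PartitionSketch {
    // Picks the reducer index for a key. Masking with Integer.MAX_VALUE
    // clears the sign bit so the result is never negative.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Groups map output keys by the reducer that would receive them.
    static Map<Integer, List<String>> route(List<String> keys, int numReduceTasks) {
        Map<Integer, List<String>> byReducer = new TreeMap<>();
        for (String key : keys) {
            int reducer = partitionFor(key, numReduceTasks);
            byReducer.computeIfAbsent(reducer, r -> new ArrayList<>()).add(key);
        }
        return byReducer;
    }

    public static void main(String[] args) {
        // Illustrative map output keys; repeated keys land on the same reducer.
        List<String> keys = Arrays.asList("apple", "banana", "apple", "cherry", "banana");
        System.out.println("2 reducers: " + route(keys, 2));
        System.out.println("1 reducer:  " + route(keys, 1));
    }
}
```

With a single reducer every key maps to partition 0, which is why a job left at the default of one reducer runs its reduce phase on a single node.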
http://souravgulati.webs.com/apps/forums/show/14108248-bigdata-learnings-hadoop-hbase-hive-and-other-bigdata-technologies-
My question is: should I take care of multiprocessing in my mapper myself (read tasks from stdin, distribute them over worker processes, combine the results in a master process and output to stdout), or will Hadoop take care of it automatically?
I haven't found the answer in either the Hadoop Streaming documentation or the Amazon Elastic MapReduce FAQ.
Hadoop has a notion of "slots". A slot is a place where a mapper process will run. You configure the number of slots per tasktracker node. It is the theoretical maximum number of map processes that will run in parallel per node. The actual number can be lower if there are not enough separate portions of the input data (called FileSplits).
Elastic MapReduce has its own estimate of how many slots to allocate, depending on the instance capabilities.
At the same time, I can imagine a scenario where your processing is more efficient when one data stream is processed by many cores. If your mapper has built-in multicore usage, you can reduce the number of slots. But that is not usually the case in typical Hadoop tasks.
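For the usual case, a streaming mapper can stay single-threaded and leave the parallelism to Hadoop, which runs one copy per occupied slot. The following is only a sketch of that shape: the class and method names are hypothetical, the word-count logic is a stand-in for your own task, and the sample input in main replaces stdin so the example runs on its own.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintStream;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

final class StreamingMapperSketch {
    // Turns one input line into zero or more "key<TAB>value" records,
    // the tab-separated form Hadoop Streaming reads from a mapper's stdout.
    static List<String> mapLine(String line) {
        List<String> records = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                records.add(word + "\t1");
            }
        }
        return records;
    }

    // Plain sequential loop: no worker processes, no master process.
    static void run(BufferedReader in, PrintStream out) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            for (String record : mapLine(line)) {
                out.println(record);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Sample input for a standalone run. Under Hadoop Streaming you would
        // wrap System.in in the reader instead.
        String sample = "one split of input\nprocessed by one mapper\n";
        run(new BufferedReader(new StringReader(sample)), System.out);
    }
}
```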
See the EMR doco [1] for the number of map/reduce tasks per instance type.
In addition to David's answers you can also have Hadoop run multiple threads per map slot by setting...
conf.setMapRunnerClass(MultithreadedMapRunner.class);
The default is 10 threads but it's tunable with
-D mapred.map.multithreadedrunner.threads=5
I often find this useful for custom high IO stuff.
[1] http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/HadoopMemoryDefault_AMI2.html
Once a Hadoop cluster has been set up, the minimum required to submit a job is:
Input format and location
Output format and location
Map and Reduce functions for processing the data
Location of the NameNode and the JobTracker
Hadoop will take care of distributing the job to different nodes, monitoring them, reading the data from the input, and writing the data to the output. If the user had to do all those tasks, there would be no point in using Hadoop.
I suggest going through the Hadoop documentation and a couple of tutorials.