Hadoop cluster and MapReduce logic

I'm new to Hadoop development. I have read about the Hadoop cluster structure and understood that there is one NameNode and one JobTracker, plus TaskTrackers and multiple DataNodes.
When we write MapReduce programs we implement a Mapper and a Reducer. I also understood the logic of these classes, but I don't understand how they are executed on the Hadoop cluster.
Is the Mapper executed on the NameNode only?
Is the Reducer executed separately on the DataNodes?
I need to run a lot of parallel computations and don't want to use HDFS; how can I be sure that each output collection (from the Mapper) is processed separately on all the DataNodes?
Please explain the connection between the Hadoop cluster and the map/reduce logic.
Thanks a lot!

MapReduce jobs are executed by the JobTracker and the TaskTrackers.
The JobTracker initiates the job by dividing the input file(s) into splits. The TaskTrackers are given these splits and run one map task per split. After the mappers emit their output, it is passed on to the reducers according to the map output keys: records with the same key are sent to the same reducer. There can be more than one reducer, depending on your configuration, and each reducer process also runs on one of the TaskTracker nodes.
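To make the "records with the same key are sent to the same reducer" part concrete, here is a small sketch that mirrors the behaviour of Hadoop's default HashPartitioner (the class name and key/value types are just illustrative):

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Every record with the same key produces the same partition number,
// so all values for that key end up at the same reducer.
public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Clear the sign bit, then bucket the key by the number of reducers.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```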
You can see the stats of the job in the JobTracker UI, which by default runs on port 50030.

Related

Physical Process Tree of MapReduce Jobs in Hadoop (over the cluster nodes)

I have read a lot of references, book chapters and articles, but I'm still trying to glue everything together.
I understand the logical MapReduce chain fairly well, but I would specifically like to know which processes are launched on which physical node over time.
I guess mappers are executed "on site" on the DataNode machines, but what about the other processes, specifically the reducers, which need to access data from multiple DataNodes?
Also, if I understand correctly, the map and reduce programs are launched on the master node where the command is executed, and result in new threads being launched in new JVMs all over the cluster. Is that right?
I recommend visiting http://bytepadding.com/map-reduce/ for an overview.
The MapReduce client can be launched locally or on a DataNode (e.g. by an Oozie launcher).
Based on the InputFormat, the file locations are fetched from the NameNode by the MapReduce driver (Application Master).
Based on the file split policy, mappers are launched, and the framework tries to spawn each mapper as close as possible to its file blocks.
Mappers are spawned on DataNodes.
After the mappers have finished, reducers are launched on DataNodes and the map outputs are copied to those specific machines.
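To see that locality information yourself, a small sketch like the one below (the input path is hypothetical) prints, for each split, the hosts that hold its blocks; the scheduler tries to launch the corresponding map task on one of those hosts:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitLocations {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-locations");
        // Hypothetical HDFS input directory; replace with a path on your cluster.
        FileInputFormat.addInputPath(job, new Path("/data/input"));

        // These are the splits the framework would hand out, one per map task.
        for (InputSplit split : new TextInputFormat().getSplits(job)) {
            // getLocations() lists the DataNodes storing the split's blocks.
            System.out.println(split + " -> " + String.join(",", split.getLocations()));
        }
    }
}
```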

How many mappers and reducers will act for completion of a Hadoop job?

I configured Hadoop in pseudo-distributed mode as a single node. I want to know how exactly it will process a job and how many mappers and reducers will take part in completing the job.
The number of mappers depends on your input splits, and the number of reducers depends on what you set with job.setNumReduceTasks(); if it is not set, the default is 1. Read The Definitive Guide for more info.
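As a minimal sketch of those two rules (class and job names are placeholders): the reducer count is whatever you request, while the mapper count is never set directly.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reducer-count-demo");

        // Reducers: explicitly requested; if this call is omitted, the default is 1.
        job.setNumReduceTasks(4);

        // Mappers: there is no setNumMapTasks() on the new-API Job class --
        // their number is derived from the input splits computed by the
        // InputFormat at job submission time.
        System.out.println("Reduce tasks requested: " + job.getNumReduceTasks());
    }
}
```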

How does Hadoop distribute jobs to map and reduce

Can anyone explain to me how Hadoop decides to pass jobs to map and reduce? Hadoop jobs are passed on to map and reduce, but I am not able to figure out how this is done.
Thanks in advance.
Please refer to Hadoop: The Definitive Guide, Chapter 6, the "Anatomy of a MapReduce Job Run" section. Happy learning.
From the Apache MapReduce tutorial:
Job Configuration:
Job represents a MapReduce job configuration.
Job is the primary interface for a user to describe a MapReduce job to the Hadoop framework for execution. The framework tries to faithfully execute the job as described by Job.
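For illustration, a minimal driver sketch of "describing a job via Job" (paths come from the command line; no custom Mapper or Reducer is set, so the framework falls back to the identity Mapper and Reducer):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MinimalJobDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "minimal-job");
        job.setJarByClass(MinimalJobDriver.class);

        // With no setMapperClass/setReducerClass, the identity Mapper and Reducer
        // are used: with the default TextInputFormat they pass (byte offset, line)
        // pairs straight through to the output.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));   // existing HDFS input
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet

        // Submit the job as described and wait for it, printing progress.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```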
Task Execution & Environment
The MRAppMaster executes the Mapper/Reducer task as a child process in a separate JVM.
Mapper
Mapper maps input key/value pairs to a set of intermediate key/value pairs.
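For illustration (not part of the tutorial text), a typical word-count style mapper that turns each input line into intermediate (word, 1) pairs:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Input: (byte offset, line of text). Output: one (word, 1) pair per token.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}
```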
How Many Maps?
The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.
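As an illustrative calculation (numbers invented for the example): with a 128 MB block size, a 10 GB input is stored as roughly 80 blocks, so the job would normally launch roughly 80 map tasks.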
Reducer
Reducer reduces a set of intermediate values which share a key to a smaller set of values.
The number of reduces for the job is set by the user via Job.setNumReduceTasks(int).
How Many Reduces?
The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * <no. of maximum containers per node>).
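As an illustrative calculation (the cluster size is invented): on 10 nodes with 8 containers each, 0.95 * (10 * 8) ≈ 76 reduces lets every reduce start immediately in a single wave, while 1.75 * (10 * 8) = 140 launches a second, smaller wave so that faster nodes pick up extra work and the load balances better.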
Job Submission and Monitoring
The job submission process involves:
Checking the input and output specifications of the job.
Computing the InputSplit values for the job.
Setting up the requisite accounting information for the DistributedCache of the job, if necessary.
Copying the job’s jar and configuration to the MapReduce system directory on the FileSystem.
Submitting the job to the ResourceManager and optionally monitoring its status.
Job Input
InputFormat describes the input-specification for a MapReduce job. InputSplit represents the data to be processed by an individual Mapper.
Job Output
OutputFormat describes the output-specification for a MapReduce job.
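As a sketch of where these two classes plug in (the format choices below are just the common text-based ones), both are configured on the Job:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class FormatConfigDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "format-demo");

        // TextInputFormat is the default: it splits files into byte ranges and
        // hands each split, as lines of text, to one map task.
        job.setInputFormatClass(TextInputFormat.class);

        // TextOutputFormat writes one "key <TAB> value" line per record,
        // one part-r-NNNNN file per reducer.
        job.setOutputFormatClass(TextOutputFormat.class);
    }
}
```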
Go through that tutorial for further understanding of complete workflow.
From the "Anatomy of a MapReduce Job" article at http://ercoppa.github.io/:
The article pictures the complete workflow as a diagram.

Is it possible to specify which tasktrackers to use in a MapReduce job?

We have two types of jobs in our Hadoop cluster. One type uses MapReduce HBase scanning; the other is pure manipulation of raw files in HDFS. Within our HDFS cluster, some of the DataNodes are also HBase RegionServers, but others aren't. We would like to run the HBase scans only on the RegionServers (to take advantage of data locality) and run the other type of job on all the DataNodes. Is this possible at all? Can we specify which TaskTrackers to use in the MapReduce job configuration?
Any help is appreciated.

Understanding the map/reduce process. Have few questions [closed]

Below are the steps in order, with questions in between. Please correct me if I am wrong and elaborate a little.
The client/user submits the request to the JobTracker. The JobTracker is software that resides on a name node.
The JobTracker divides the job into small sub-problems and gives them to the TaskTrackers. The TaskTracker is software that resides on a data node. A TaskTracker may do this again, leading to a multi-level tree structure.
Does the mapping step happen only in the TaskTracker, not in the JobTracker?
Shuffle and sort take place. Does this step take place in the mapper step or the reducer step?
Is the output of shuffle and sort fed into the reducer step?
Does the reducer step happen only in the JobTracker, not in the TaskTracker?
The reducer step, i.e. the JobTracker not the TaskTracker, combines the data and gives the output to the client/user.
Is only one reducer used for combining the result?
Thanks
The client/user submits the request to the JobTracker. The JobTracker is software that resides on a name node.
The JobTracker is a daemon that can reside on a separate machine from the NameNode.
The JobTracker divides the job into small sub-problems and gives them to the TaskTrackers.
The JobTracker farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least nodes in the same rack.
The TaskTracker is software that resides on a data node. A TaskTracker may do this again, leading to a multi-level tree structure.
Usually yes. A TaskTracker can run on its own, but it definitely needs a DataNode somewhere to work with.
Does the mapping step happen only in the TaskTracker, not in the JobTracker?
Map tasks are launched by the TaskTrackers.
Shuffle and sort take place. Does this step take place in the mapper step or the reducer step?
The shuffle and sort process actually sits between the map phase and the reduce phase, but it is relevant only to the reduce phase: without a reduce phase, shuffle and sort do not take place. So we can say that the reducer has three primary phases: shuffle, sort and reduce.
Is the output of shuffle and sort fed into the reducer step?
In shuffle and sort, the framework fetches the relevant partition of the output of all the mappers via HTTP. The input to the reducer is the sorted output of the mappers.
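To make "the input to the reducer is the sorted output of the mappers" concrete, here is a sketch of a sum-style reducer (the key/value types are illustrative): the framework calls reduce() once per key, with every value the mappers emitted for that key already grouped together.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // By the time this method runs, shuffle and sort have already happened:
        // every value emitted by any mapper for this key is in `values`.
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```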
Does the reducer step happen only in the JobTracker, not in the TaskTracker?
Reduce tasks are launched by the TaskTrackers.
The reducer step, i.e. the JobTracker not the TaskTracker, combines the data and gives the output to the client/user.
Reduce tasks are supposed to run in parallel on several nodes and emit their results to HDFS. You can read the final data sets produced by the different reducers and combine them in your MapReduce driver if you like.
Is only one reducer used for combining the result?
It depends on what you want to do, but a single reduce task will surely bring down performance due to the lack of parallelism if you have a lot of data to process in that one task.
Indeed you need this: Hadoop: The Definitive Guide, 3rd Edition. It is the most useful guide on this topic.
Some notes:
Hadoop is mainly a combination of two things: HDFS as the "storage" and the MapReduce framework as the "CPU".
The NameNode belongs to HDFS and the JobTracker belongs to MapReduce. MapReduce uses the HDFS service, but the JobTracker and the NameNode are completely different services and do not have to be located on the same node.
Likewise, the DataNode is an HDFS entity while the TaskTracker is a MapReduce component, and they are independent. In practice they are often located on the same node, but that is not fixed.
The job steps themselves are performed by the TaskTrackers; the JobTracker acts like a scheduler. This applies to both the map and reduce steps. Don't forget about the combiner.
No, you can use more than one reducer, and you control this; you can also use up to one combiner per mapper, since the combiner runs right after the mapper (see the sketch after these notes).
The shuffle process works on the mapper (or combiner) output, so logically it is closer to the mapper than to the reducer, but you should not rely on this; your freedom is to take the next record and process it. Also, if zero reducers are configured, there is no shuffle at all.
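A small sketch of the combiner and zero-reducer points above (the job name is a placeholder; IntSumReducer is Hadoop's stock summing reducer):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class CombinerConfigDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combiner-demo");

        // A combiner is a map-side "mini reduce" applied to each mapper's output;
        // it must be safe to apply repeatedly (sums and counts are, averages are not).
        job.setCombinerClass(IntSumReducer.class);

        // More than one reducer is perfectly normal; this sets the reduce-phase parallelism.
        job.setNumReduceTasks(4);

        // Using the line below instead would make the job map-only:
        // no shuffle and no sort, mapper output is written straight to HDFS.
        // job.setNumReduceTasks(0);
    }
}
```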
Don't try to replace real knowledge with advice from Q&A sites like this one. It doesn't work :-).
Hope this helps.
