I have started reading about Big Data and Hadoop, so this question may sound very stupid to you.
This is what I know.
Each mapper processes a small amount of data and produces an intermediate output.
After this, we have the step of shuffle and sort.
Now, Shuffle = Moving intermediate output over to respective Reducers each dealing with a particular key/keys.
So, can one DataNode have both the Mapper and Reducer code running on it, or do we have different DNs for each?
Terminology: Datanodes are for HDFS (storage). Mappers and Reducers (compute) run on nodes that have the TaskTracker daemon on them.
The number of map and reduce tasks per TaskTracker is controlled by these configs:
mapred.tasktracker.map.tasks.maximum
mapred.tasktracker.reduce.tasks.maximum
Subject to other limits in other configs, theoretically, as long as the tasktracker doesn't have the maximum number of map or reduce tasks, it may get assigned more map or reduce tasks by the jobtracker. Typically the jobtracker will try to assign tasks to reduce the amount of data movement.
So, yes, you can have mappers and reducers running on the same node at the same time.
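For reference, here is a minimal sketch (not part of the original answer) of checking those effective MR1 slot limits from Java, assuming the cluster's mapred-site.xml is on the classpath; both properties default to 2 when unset:

```java
import org.apache.hadoop.mapred.JobConf;

// Minimal sketch: read the MR1 per-TaskTracker slot limits that the JobTracker
// respects when assigning tasks. JobConf pulls in mapred-site.xml if it is on
// the classpath; otherwise the defaults (2 and 2) are reported.
public class SlotLimits {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        int mapSlots = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
        int reduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
        System.out.println("Map slots per TaskTracker:    " + mapSlots);
        System.out.println("Reduce slots per TaskTracker: " + reduceSlots);
    }
}
```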
You can have both mappers and reducers running on the same node. As an example, consider a single-node Hadoop cluster: the entire HDFS layer (DataNode, NameNode) as well as the JobTracker and the TaskTracker all run on the same node.
In this case both the mappers and the reducers run on the same node.
Related
I am running two word count jobs on the same cluster (I run Hadoop 2.65 locally with a multi-node cluster), where my code runs the two jobs one after the other.
Both jobs share the same mapper, reducer, and so on, but each one of them has a different Partitioner.
Why is there a different allocation of the reduce tasks to the nodes for the second job? I identify the reduce task's node by the node's IP (Java getting my IP address).
I know that the keys will go to different reduce tasks, but I want their destination, i.e. the node each partition runs on, to stay unchanged across the two jobs.
For example, I have five different keys and four reduce tasks.
The allocation for Job 1 is:
partition_1 ->NODE_1
partition_2 ->NODE_1
partition_3 ->NODE_2
partition_4 ->NODE_3
The allocation for Job 2 is:
partition_1 ->NODE_2
partition_2 ->NODE_3
partition_3 ->NODE_1
partition_4 ->NODE_3
In Hadoop there is no data locality for reducers, so YARN selects nodes for the reducers based on available resources. There is no way to force Hadoop to run each reducer on the same node in two jobs.
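To make that concrete, here is a hedged sketch of a partitioner (illustrative only, not the asker's code): the key-to-partition-number mapping below is deterministic and identical in both jobs, but which node ends up hosting a given partition is chosen by the scheduler at run time.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative partitioner: the same key always lands in the same partition
// number (for a fixed number of reducers), yet the node running that partition
// can differ from job to job, which is the behaviour described above.
public class FixedKeyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Same formula as the default HashPartitioner: stable per key.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```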
As per my understanding, files stored in HDFS are divided into blocks, and each block is replicated to multiple nodes, 3 by default. How does the Hadoop framework choose the node on which to run a map task, out of all the nodes on which a particular block is replicated?
As far as I know, there will be as many map tasks as there are blocks.
See manual here.
Usually, the framework chooses a node close to the input block to reduce network bandwidth for the map task.
That's all I know.
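As a small, hedged illustration of the point that one map task runs per input split (and a split normally corresponds to one block), the split size can be tuned on the job; the 128 MB figure below is just a hypothetical value on a 64 MB-block cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Sketch only: raising the minimum split size makes splits span multiple
// blocks, so fewer map tasks are created for the same input file.
public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split-size-demo");
        FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024); // hypothetical 128 MB
        // ... set mapper, reducer, input and output paths before submitting the job
    }
}
```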
In MapReduce 1 it depends on how many map tasks are already running on the datanode that hosts a replica, because the number of map slots per node is fixed in MR1. In MR2 there are no fixed slots, so it depends on the number of tasks already running on that node.
I have 4 nodes and I am running a MapReduce sample project to see whether the job is being distributed among all 4 nodes. I ran the project multiple times and noticed that the mapper task is split among all 4 nodes, but the reducer task is only being done by one node. Is this how it is supposed to be, or is the reducer task supposed to be split among all 4 nodes as well?
Thank you
Distribution of mappers depends on which block of data a mapper will operate on. By default the framework tries to assign the task to a node which has that block of data stored locally; this prevents network transfer of the data.
For reducers it again depends on the number of reducers your job requires. If your job uses only one reducer, it may be assigned to any of the nodes.
Speculative execution also has an impact. If it is enabled, multiple instances of a map or reduce task may start on different nodes, and the JobTracker, based on percentage completion, decides which one goes through; the other instances are killed.
Let us say you have a 224 MB file. When you add that file to HDFS with the default block size of 64 MB, it is split into 4 blocks [blk1=64M, blk2=64M, blk3=64M, blk4=32M]. Let us assume blk1 is on node1 (blk1::node1), blk2::node2, blk3::node3, blk4::node4. Now when you run the MR job, the map tasks need to access the input file, so the MR framework creates 4 mappers, one executed on each node. Now comes the reducer; as Venkat said, it depends on the number of reducers configured for your job. The reducers can be configured using the Hadoop org.apache.hadoop.mapreduce.Job setNumReduceTasks(int tasks) API.
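A minimal sketch of that API call (the class and job names are placeholders, not from the original post):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: fix the number of reduce tasks for a job via the Job API.
public class ReducerCountDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "reducer-count-demo");
        job.setNumReduceTasks(4); // e.g. four reducers for the 4-block file above
        // ... configure mapper, reducer, input and output paths, then submit
    }
}
```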
Below are the steps in order, with questions in between. Please correct me if I am wrong and elaborate a little bit.
Client/user submits the request to the JobTracker. The JobTracker is software that resides on the name node.
The JobTracker divides the job into small sub-problems and gives them to the TaskTrackers. The TaskTracker is the software that resides on a data node. A TaskTracker may do this again, leading to a multi-level tree structure.
Does the mapping step happen only in the TaskTracker, not in the JobTracker?
Shuffle and sort take place. Does this step happen in the Mapper step or the Reducer step?
Is the output of shuffle and sort fed into the Reducer step?
Does the reducer step happen only in the JobTracker, not in the TaskTracker?
The Reducer step, i.e. the JobTracker (not the TaskTracker), combines the data and gives the output to the client/user.
Is only one reducer used for combining the result?
Thanks
Client/user submits the request to the JobTracker. The JobTracker is software that resides on the name node.
The JobTracker is a daemon that can reside on a machine separate from the namenode.
The JobTracker divides the job into small sub-problems and gives them to the TaskTrackers.
The JobTracker farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack.
The TaskTracker is the software that resides on a data node. A TaskTracker may do this again, leading to a multi-level tree structure.
Usually yes. TaskTracker can run alone but it definitely needs a datanode to work with, somewhere.
Does the mapping step happen only in the TaskTracker, not in the JobTracker?
Map tasks are launched by the TaskTracker.
Shuffle and sort take place. Does this step happen in the Mapper step or the Reducer step?
The shuffle and sort process actually happens between the map phase and the reduce phase, but it is relevant only for the reduce phase; without a reduce phase, shuffle and sort will not take place. So we can say the Reducer has 3 primary phases: shuffle, sort, and reduce.
Is the output of shuffle and sort fed into the Reducer step?
In shuffle and sort, the framework fetches the relevant partition of the output of all the mappers, via HTTP. Input to the Reducer is the sorted output of the mappers.
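As an illustration, here is a minimal word-count-style reducer skeleton (not from the original question); by the time reduce() is called, shuffle and sort have already grouped the values and ordered the keys:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Each call to reduce() receives one key together with all of its values,
// exactly as delivered by the shuffle-and-sort phase.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```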
Does the reducer step happen only in the JobTracker, not in the TaskTracker?
Reduce tasks are launched by the TaskTracker.
The Reducer step, i.e. the JobTracker (not the TaskTracker), combines the data and gives the output to the client/user.
Reduce tasks are supposed to run in parallel on several nodes and emit their results to HDFS. You can read the final data sets output by the different reducers and combine them in the MapReduce driver if you like.
Is only one reducer used for combining the result?
It depends on what you want to do, but having a single reduce task will surely bring down performance due to the lack of parallelism if you have a large amount of data to process.
Indeed you need this: Hadoop: The Definitive Guide, 3rd Edition. Most useful guide on this topic.
Some notes:
Hadoop is mainly a combination of two things: HDFS as "storage" and the MapReduce framework as "CPU".
The NameNode is part of HDFS; the JobTracker is part of MapReduce. MapReduce uses the HDFS service, but the JobTracker and the NameNode are completely different services and do not have to be located on the same node.
Likewise, the DataNode is an HDFS entity while the TaskTracker is a MapReduce component, and they are independent. In practice they are often located on the same node, but that is not fixed.
The job steps themselves are performed by the TaskTrackers; the JobTracker acts as a scheduler. This applies to both the Map and Reduce steps. Don't forget about the Combiner.
No, you can use more than one Reducer, and you can control this; you can also use up to one combiner per mapper, since the combiner runs right after the mapper.
The shuffle process works on the mapper (or combiner) output, so logically it is closer to the mapper than to the reducer, but you should not rely on this; your freedom is to take the next record and process it. In addition, if 0 reducers are configured, there is no shuffle at all.
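As a small, hedged sketch of those last two notes (placeholder names throughout), a combiner is wired into a job like this, and setting zero reducers removes the shuffle entirely:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Sketch: the combiner must be a Reducer whose output types match its input
// types; the SumReducer skeleton shown earlier in this thread is the classic case.
public class CombinerWiringDemo {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combiner-demo");
        job.setCombinerClass(SumReducer.class); // runs on the map side, right after the mapper
        // Alternatively, job.setNumReduceTasks(0) makes a map-only job: no shuffle, no reduce.
        // ... remaining job setup (mapper, reducer, input/output paths) omitted
    }
}
```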
Don't try to replace real knowledge with advice from Q&A sites alone. It doesn't work :-).
Hope this helps.
I'm new to Hadoop development. I read about the Hadoop cluster structure and understood that there is one namenode, a jobtracker, a tasktracker, and multiple datanodes.
When we write map-reduce programs we implement a mapper and a reducer. I also understood the logic of these classes, but I don't understand how they are executed in the Hadoop cluster.
Is the mapper executed on the namenode only?
Is the reducer executed separately on the datanodes?
I need to do a lot of parallel computations and don't want to use HDFS; how can I be sure that each output collection (from the mapper) is processed separately on all of the datanodes?
Please explain to me the connection between the Hadoop cluster and the map/reduce logic.
Thanks a lot!
Map Reduce Jobs are executed by Job Tracker and Task Trackers.
The JobTracker initiates the job by dividing the input file(s) into splits. TaskTrackers are given these splits and run map tasks on them (one map task per split). After the mappers emit their output, that output is passed on to the reducers depending on the map output keys: equal keys are sent to the same reducer. There can be more than one reducer, depending on your configuration. The reducer processes also run only on TaskTracker nodes.
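To tie these pieces together, here is a hypothetical, self-contained word-count driver (not the poster's code) showing where the mapper, the reducer, and the number of reduce tasks are wired up: the input is divided into splits, one map task runs per split, the shuffle groups equal keys onto the same reducer, and each reducer writes its own part-r-NNNNN file to the output path.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {

    // One map task runs per input split and emits (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // The shuffle delivers all counts for a word to one reducer, which sums them.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(2); // more than one reducer is allowed
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```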
You can see the job's statistics on the JobTracker UI, which by default runs on port 50030.
You can also visit my website for example topics on Bigdata technologies. You can post your questions there as well, and I will try to answer.
http://souravgulati.webs.com/apps/forums/show/14108248-bigdata-learnings-hadoop-hbase-hive-and-other-bigdata-technologies-