Understanding the map/reduce process. Have a few questions [closed] - hadoop

Below are the steps in order, with questions in between. Please correct me if I am wrong and elaborate a little.
The client/user submits the request to the JobTracker. The JobTracker is software that resides on the name node.
The JobTracker divides the job into small sub-problems and gives them to the TaskTrackers. The TaskTracker is the software that resides on a data node. A TaskTracker may do it again, leading to a multi-level tree structure.
Does the mapping step happen only in the TaskTracker, not in the JobTracker?
Shuffle and sort take place. Does this step take place in the Mapper step or the Reducer step?
Is the output of shuffle and sort fed into the Reducer step?
Does the reducer step happen only in the JobTracker, not in the TaskTracker?
The reducer step, i.e. the JobTracker (not the TaskTracker), combines the data and gives the output to the client/user.
Is only one reducer used for combining the result?
Thanks

The client/user submits the request to the JobTracker. The JobTracker is software that resides on the name node.
The JobTracker is a daemon that can reside on a separate machine from the namenode.
The JobTracker divides the job into small sub-problems and gives them to the TaskTrackers.
The JobTracker farms out MapReduce tasks to specific nodes in the cluster, ideally the nodes that have the data, or at least are in the same rack.
The TaskTracker is the software that resides on a data node. A TaskTracker may do it again, leading to a multi-level tree structure.
Usually, yes. A TaskTracker can run alone, but it definitely needs a datanode to work with, somewhere.
Does the mapping step happen only in the TaskTracker, not in the JobTracker?
Map tasks are launched by the TaskTracker.
Shuffle and sort take place. Does this step take place in the Mapper step or the Reducer step?
The shuffle and sort process actually happens between the map phase and the reduce phase, but it is relevant only for the reduce phase. Without a reduce phase, shuffle and sort will not take place. So we can say the Reducer has 3 primary phases: shuffle, sort and reduce.
Is the output of shuffle and sort fed into the Reducer step?
In shuffle and sort, the framework fetches the relevant partition of the output of all the mappers, via HTTP. Input to the Reducer is the sorted output of the mappers.
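To make that concrete, here is a minimal sketch of a reducer written against the newer org.apache.hadoop.mapreduce API; the Text/IntWritable key and value types are just an assumption for the example, not something from the question.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// For each key, the framework hands reduce() every value emitted for that key,
// already fetched from the mappers over HTTP and sorted/grouped by key.
public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}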
Does the reducer step happen only in the JobTracker, not in the TaskTracker?
Reduce tasks are launched by the TaskTracker.
The reducer step, i.e. the JobTracker (not the TaskTracker), combines the data and gives the output to the client/user.
Reduce tasks are supposed to run in parallel on several nodes and emit their results to HDFS. You can read the final output data sets from the different reducers and combine them in the MapReduce driver if you like.
Is only one reducer used for combining the result?
It depends on what you want to do, but having a single reduce task will surely bring down performance due to lack of parallelism if you have a large amount of data to process.
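For reference, the reducer count is a per-job setting; a small sketch, assuming 'job' is an org.apache.hadoop.mapreduce.Job being configured in the driver (the value 4 is arbitrary):
job.setNumReduceTasks(4);   // four reduce tasks run in parallel, each writing its own part file
// job.setNumReduceTasks(0); // map-only job: no shuffle, no reduce phase at all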

Indeed, you need this: Hadoop: The Definitive Guide, 3rd Edition. It is the most useful guide on this topic.
Some notes:
Hadoop is mainly a combination of 2 things: HDFS as the "storage" and the MapReduce framework as the "CPU".
The NameNode is related to HDFS, the JobTracker to MapReduce. MapReduce uses the HDFS service, but the JobTracker and the NameNode are completely different services and don't have to be located on the same node.
Again, the DataNode is an HDFS entity while the TaskTracker is a component of MapReduce, and they are independent. In practice they are often located on the same node, but that is not fixed.
The job steps themselves are performed by the TaskTracker; the JobTracker acts like a scheduler. This applies to both the Map and Reduce steps. Don't forget about the Combiner.
No, you can use more than 1 Reducer, and you control this. You can also use at most 1 combiner per mapper, since the combiner runs right after the mapper (see the sketch after these notes).
The shuffle process works on the mapper (or combiner) output, so logically it is closer to the mapper than to the reducer, but you should not rely on this; your freedom is to take the next record and process it. In addition, if 0 reducers are configured, there is no shuffle at all.
Don't try to replace real knowledge with Q&A site advice. It doesn't work :-).
Hope this helps.
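As promised in the note about the combiner above, here is a minimal sketch of wiring one into a job; it assumes 'job' is an org.apache.hadoop.mapreduce.Job being configured and reuses the stock TokenCounterMapper and IntSumReducer helpers that ship with Hadoop, since a reducer can double as a combiner only when its function is commutative and associative.
// The combiner runs on the map side, right after each mapper, to pre-aggregate
// output before the shuffle; at most one combiner class can be set per job.
job.setMapperClass(org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper.class);
job.setCombinerClass(org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer.class);
job.setReducerClass(org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer.class);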

Related

How does the Hadoop framework decide the node to run a Map job

As per my understanding, files stored in HDFS are divided into blocks and each block is replicated to multiple nodes, 3 by default. How does the Hadoop framework choose the node to run a map job, out of all the nodes on which a particular block is replicated?
As far as I know, there will be the same number of map tasks as blocks.
See manual here.
Usually, the framework chooses nodes close to the input block, to reduce the network bandwidth used by the map task.
That's all I know.
In MapReduce 1 it depends on how many map tasks are already running on the datanodes that host a replica, because the number of map slots per node is fixed in MR1. In MR2 there are no fixed slots, so it depends on the number of tasks already running on that node.
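As a rough way to see the "one map task per split" rule in action, a sketch like the following asks the input format how many splits it would produce; the Hadoop 2.x classes are assumed to be on the classpath and the input path is hypothetical.
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SplitCount {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration());
    FileInputFormat.addInputPath(job, new Path("/data/input")); // hypothetical path
    // One map task is launched per split returned here.
    List<InputSplit> splits = new TextInputFormat().getSplits(job);
    System.out.println("Map tasks that would be launched: " + splits.size());
  }
}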

Can reducers and mappers be on the same data node?

I have started reading about Big Data and Hadoop, so this question may sound very stupid to you.
This is what I know.
Each mapper processes a small amount of data and produces an intermediate output.
After this, we have the step of shuffle and sort.
Now, Shuffle = Moving intermediate output over to respective Reducers each dealing with a particular key/keys.
So, can one Data Node have the Mapper and Reducer code running on it, or do we have different DNs for each?
Terminology: Datanodes are for HDFS (storage). Mappers and Reducers (compute) run on nodes that have the TaskTracker daemon on them.
The number of mappers and reducers per tasktracker is controlled by the configs:
mapred.tasktracker.map.tasks.maximum
and
mapred.tasktracker.reduce.tasks.maximum
Subject to other limits in other configs, theoretically, as long as the tasktracker doesn't have the maximum number of map or reduce tasks, it may get assigned more map or reduce tasks by the jobtracker. Typically the jobtracker will try to assign tasks to reduce the amount of data movement.
So, yes, you can have mappers and reducers running on the same node at the same time.
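A small sketch of reading those two MR1 limits back from the cluster configuration; the stock default for both is 2 slots per tasktracker, and mapred-site.xml is assumed to be available on the classpath.
import org.apache.hadoop.conf.Configuration;

public class SlotLimits {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.addResource("mapred-site.xml"); // picked up from the classpath, if present
    int mapSlots = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
    int reduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
    System.out.println("Map slots per tasktracker: " + mapSlots);
    System.out.println("Reduce slots per tasktracker: " + reduceSlots);
  }
}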
You can have both mappers and reducers running on the same node. As an example, consider a single-node hadoop cluster: the entire HDFS storage (data node, name node), the job tracker and the task tracker all run on the same node.
In this case both the mappers and the reducers run on that node.

Hadoop cluster and MapReduce logic

I'm new to hadoop development. I read about the hadoop cluster structure and understood that there is one namenode, a jobtracker, tasktrackers and multiple datanodes.
When we write map-reduce programs we implement a mapper and a reducer. I also understood the logic of these classes, but I don't understand how they are executed in the hadoop cluster.
Is the mapper executed on the namenode only?
Is the reducer executed separately on the datanodes?
I need to do a lot of parallel computation and don't want to use HDFS; how can I be sure that each output collection (from the mapper) is processed separately on all datanodes?
Please explain the connection between the hadoop cluster and the map/reduce logic.
Thanks a lot!
MapReduce jobs are executed by the JobTracker and the TaskTrackers.
The JobTracker initiates the job by dividing the input file(s) into splits. TaskTrackers are given these splits and run map tasks on them (one map task per split). After the mappers emit their output, it is passed on to the reducers according to the map output keys: similar keys are sent to the same reducer. There can be more than one reducer, depending on your configuration. Reducer processes also run only on TaskTracker nodes.
You can see the stats of the job on the JobTracker UI, which by default runs on port 50030.
You can also visit my website for example topics on Bigdata technologies. You can post your questions there and I will try to answer.
http://souravgulati.webs.com/apps/forums/show/14108248-bigdata-learnings-hadoop-hbase-hive-and-other-bigdata-technologies-
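To make the "similar keys are sent to one reducer" point concrete: the map output key picks the reduce task through the job's partitioner. The default behaves essentially like the simplified sketch below (the Text/IntWritable types are just an assumption for the example).
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Every occurrence of the same key hashes to the same partition number,
// so all values for that key end up in the same reduce task.
public class SimpleHashPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}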

Hadoop - "Code moves near data for computation"

I just want to clarify this quote "Code moves near data for computation",
does this mean all Java MR code written by the developer is deployed to all servers in the cluster?
If 1 is true: if someone changes an MR program, how is it distributed to all the servers?
Thanks
Hadoop puts the MR job's jar into HDFS, its distributed file system. The task trackers which need it will take it from there. So the jar is distributed to some nodes and then loaded on demand by the nodes which actually need it; usually this need means the node is going to process local data.
The Hadoop cluster is "stateless" in relation to jobs. Each time, a job is viewed as something new, and "side effects" of the previous job are not used.
Indeed, when a small number of files (or splits, to be precise) is to be processed on a large cluster, an optimization that sends the jar only to the few hosts where the data actually resides might somewhat reduce job latency. I do not know if such an optimization is planned.
In a hadoop cluster you use the same nodes for data and computation. That means your HDFS datanodes are set up on the same cluster used by the task trackers for computation. So when you execute MR jobs, the job tracker looks at where your data is stored. In other computation models, the data is not stored in the same cluster, and you may have to move it while doing your computation on some compute node.
After you start a job, all the map functions get splits of your input file. These map functions are executed so that the split of the input file is close to them, or in other words in the same rack. This is what we mean by computation being done closer to the data.
So, to clarify your question: every time you run an MR job, its code is copied to all the nodes. So if the code changes, the new code is copied to all the nodes the next time you run the job.
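One concrete piece of this: the driver tells Hadoop which jar holds the user code, and the framework ships that jar to the nodes that run the tasks. A minimal fragment, assuming 'WordCountDriver' is a hypothetical driver class being compiled into the job jar:
Job job = Job.getInstance(new Configuration(), "word count"); // job name is arbitrary
job.setJarByClass(WordCountDriver.class); // the jar containing this class is shipped with the job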

Does Amazon Elastic Map Reduce run one or several mapper processes per instance?

My question is: should I care about multiprocessing in my mapper myself (read tasks from stdin, distribute them over worker processes, combine results in a master process and output to stdout), or will Hadoop take care of it automatically?
I haven't found the answer in either the Hadoop Streaming documentation or the Amazon Elastic MapReduce FAQ.
Hadoop has a notion of "slots". A slot is a place where a mapper process will run. You configure the number of slots per tasktracker node; it is the theoretical maximum number of map processes that will run in parallel per node. It can be less if there are not enough separate portions of the input data (called FileSplits).
Elastic MapReduce has its own estimation of how many slots to allocate depending on the instance capabilities.
At the same time, I can imagine a scenario where your processing will be more efficient when one data stream is processed by many cores. If your mapper has built-in multicore usage, you can reduce the number of slots. But this is not usually the case for typical Hadoop tasks.
See the EMR doco [1] for the number of map/reduce tasks per instance type.
In addition to David's answers you can also have Hadoop run multiple threads per map slot by setting...
conf.setMapRunnerClass(MultithreadedMapRunner.class);
The default is 10 threads but it's tunable with
-D mapred.map.multithreadedrunner.threads=5
I often find this useful for custom high IO stuff.
[1] http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/HadoopMemoryDefault_AMI2.html
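For context, a sketch of how that setting might look wired into a whole job using the old "mapred" API the answer refers to; IdentityMapper stands in for a real thread-safe mapper, and the paths are hypothetical.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

public class MultithreadedJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MultithreadedJob.class);
    conf.setJobName("multithreaded-map-sketch");
    conf.setMapperClass(IdentityMapper.class);                 // stand-in for a real thread-safe mapper
    conf.setMapRunnerClass(MultithreadedMapRunner.class);      // run map() calls on a thread pool
    conf.setInt("mapred.map.multithreadedrunner.threads", 5);  // default is 10
    conf.setNumReduceTasks(0);                                 // map-only, to keep the sketch short
    FileInputFormat.setInputPaths(conf, new Path("/in"));      // hypothetical paths
    FileOutputFormat.setOutputPath(conf, new Path("/out"));
    JobClient.runJob(conf);
  }
}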
My question is: should I care about multiprocessing in my mapper myself (read tasks from stdin, distribute them over worker processes, combine results in a master process and output to stdout), or will Hadoop take care of it automatically?
Once a Hadoop cluster has been set up, the minimum required to submit a job is:
Input format and location
Output format and location
Map and Reduce functions for processing the data
Location of the NameNode and the JobTracker
Hadoop will take care of distributing the job to different nodes, monitoring them, reading the data from the input and writing the data to the output. If the user had to do all those tasks, there would be no point in using Hadoop. A minimal driver covering those four items is sketched below.
I suggest going through the Hadoop documentation and a couple of tutorials.
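As referenced above, here is a minimal driver covering those four items, sketched with the newer org.apache.hadoop.mapreduce API; it uses the stock TokenCounterMapper and IntSumReducer helpers, so the only assumptions are the input and output paths (the namenode and jobtracker addresses come from the *-site.xml files on the classpath).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class MinimalDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "minimal-driver-sketch");
    job.setJarByClass(MinimalDriver.class);                      // jar shipped to the task nodes

    // 1 & 2: input/output format and location (paths are hypothetical)
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

    // 3: map and reduce functions (stock word-count helpers)
    job.setMapperClass(TokenCounterMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}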
