I am new to Hadoop. Since the data transfer between map nodes and reduce nodes can reduce the efficiency of MapReduce, why aren't the map task and the reduce task placed on the same node?
Actually, you can run map and reduce in the same JVM if the data is sufficiently 'small'. This is possible in Hadoop 2.0 (aka YARN), where it is called an uber task.
From the great "Hadoop: The Definitive Guide" book:
If the job is small, the application master may choose to run the tasks in the same JVM as itself. This happens when it judges the overhead of allocating and running tasks in new containers outweighs the gain to be had in running them in parallel, compared to running them sequentially on one node. (This is different from MapReduce 1, where small jobs are never run on a single tasktracker.) Such a job is said to be uberized, or run as an uber task.
The amount of data to be processed is usually too large; that is why we run map and reduce on separate nodes. If the amount of data to be processed is small, then you can definitely use map and reduce on the same node.
Hadoop is usually used when the amount of data is very large. In that case, for high availability and concurrency, separate nodes are needed for the map and reduce operations.
Hope this will clear your doubt.
An uber job occurs when multiple mappers and reducers are combined to execute inside the Application Master itself.
So, assuming the job to be executed has at most 9 mappers and at most 1 reducer, the Resource Manager (RM) creates an Application Master, and the job executes entirely within the Application Master, using its own JVM.
SET mapreduce.job.ubertask.enable=TRUE;
The advantage of an uberized job is that the round-trip overhead is eliminated: the Application Master no longer has to request containers for the job from the Resource Manager (RM) and wait for the RM to allocate them.
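If you are submitting the job from Java rather than via a SET statement, a minimal sketch of the same setting looks like this (the threshold values shown are Hadoop's documented defaults; the job name is just an example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class UberJobExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Allow the Application Master to run small jobs in its own JVM.
            conf.setBoolean("mapreduce.job.ubertask.enable", true);
            // Thresholds below which a job counts as "small"; 9 maps and
            // 1 reducer are the defaults, matching the limits quoted above.
            conf.setInt("mapreduce.job.ubertask.maxmaps", 9);
            conf.setInt("mapreduce.job.ubertask.maxreduces", 1);
            Job job = Job.getInstance(conf, "uber-candidate");
            // ... set mapper, reducer, and input/output paths as usual ...
        }
    }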
This question is not about the specifics of Hadoop or Spark.
When I was reading MapReduce: Simplified Data Processing on Large Clusters, I was confused about
The master picks idle workers and assigns each one a map task or a reduce task.
So how does the master decide whether a worker should get a map task or a reduce task?
If we only assign reduce tasks first, will the job never finish? (Because no map task will have been completed.)
Reduce is required to run only after the data it needs from the map and shuffle stages is complete.
In the context of the Hadoop implementation of MapReduce, map tasks are assigned based on data locality; otherwise, any open resources, as decided by YARN, are chosen.
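In the paper's model the same ordering applies: an idle worker is given a map task while any remain, and reduce tasks only become runnable once the map output they depend on exists. A minimal sketch of that rule (the Task class and field names here are hypothetical, not taken from the paper or Hadoop):

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Queue;

    class MasterSketch {
        static class Task {
            String preferredHost;
            Task(String host) { this.preferredHost = host; }
        }

        List<Task> pendingMaps = new ArrayList<>();
        Queue<Task> pendingReduces = new ArrayDeque<>();

        // Hand an idle worker a map task while any remain (preferring a
        // data-local one); hand out reduces only once the maps are done.
        Task assign(String workerHost, boolean allMapsDone) {
            for (int i = 0; i < pendingMaps.size(); i++)
                if (pendingMaps.get(i).preferredHost.equals(workerHost))
                    return pendingMaps.remove(i);   // data-local map first
            if (!pendingMaps.isEmpty())
                return pendingMaps.remove(0);       // any remaining map
            if (allMapsDone)
                return pendingReduces.poll();       // reduce only after maps
            return null;                            // otherwise the worker waits
        }
    }

(In real implementations reducers can start fetching map output before every map has finished, but the reduce function itself still waits for all of it; either way, maps are never starved by reduces.)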
As per my understanding, files stored in HDFS are divided into blocks, and each block is replicated to multiple nodes, 3 by default. How does the Hadoop framework choose the node on which to run a map task, out of all the nodes holding a replica of the particular block?
As far as I know, there will be the same number of map tasks as there are blocks.
See manual here.
Usually, the framework chooses a node close to the input block, to reduce the network bandwidth used by the map task.
That's all I know.
In MapReduce 1 it depends on how many map tasks are already running on the datanode that hosts a replica, because the number of map task slots per node is fixed in MR1. In MR2 there are no fixed slots, so it depends on the number of tasks already running on that node.
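If you want to see the candidate hosts the scheduler gets to choose from, the replica locations of each block are visible through the standard FileSystem API. A small sketch (the input path is just an example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockHosts {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/input.txt"));
            // One BlockLocation per block; getHosts() lists the replica nodes,
            // i.e. the candidates for a data-local map task on that block.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset " + b.getOffset() + " -> "
                    + String.join(", ", b.getHosts()));
            }
        }
    }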
I have read many blogs/web pages that state:
the running time of a mapper should be more than X minutes
I understand there are overheads involved in setting up a mapper, but how exactly is this calculated? Why is the overhead justified only after X minutes? And when we talk about overheads, what exactly are the Hadoop overheads?
It's not a hard-coded rule, but it makes sense. In the background, many small steps are carried out before a mapper even starts. Its initialization and other bookkeeping, apart from the real processing, can itself take 10-15 seconds. So, to reduce the number of splits, which in turn reduces the mapper count, the maximum split size can be set to some higher value; that is what the blog conveys (see the sketch after this list). If we fail to do that, the MR framework has to handle the following overheads while creating a mapper:
Calculating splits for that mapper.
The job scheduler in the jobtracker has to create a separate map task, which increases latency a bit.
When it comes to assignment, the jobtracker has to look for a tasktracker based on data locality. This again involves creating local temp directories in the tasktracker, which are used by the setup and cleanup tasks for that mapper; for example, in setup we might read from the distributed cache and load it into a hashmap, or initialize and later clean up something in the mapper. And if there are already enough map and reduce tasks running on that tasktracker, this puts an extra load on it.
In the worst case, if all of the node's fixed map slots are full, the JT has to look for a different TT, which leads to a remote read.
Also, a TT sends its heartbeat to the JT only once every 3 seconds; this causes a delay in job initialization, because the TT has to contact the JT both to run a task and to report its completed status.
Unfortunately, if your mapper fails, the task will be retried 3 times before it finally fails.
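As for raising the split size: for plain FileInputFormat the split size is computed as max(minSize, min(maxSize, blockSize)), so the usual knob for getting fewer, larger splits is the minimum split size (the maximum matters for input formats that combine small files). A minimal sketch; the 256 MB figure is just an example value, not a recommendation:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class BiggerSplits {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance();
            // Raise the minimum split size so each split (and hence each
            // mapper) covers more data, producing fewer, longer-running mappers.
            FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024); // 256 MB
            // ... configure mapper, reducer, and paths as usual ...
        }
    }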
I know that Hadoop divides the work into independent chunks. But imagine one mapper finishes handling its tasks before the other mappers do. Can the master program give this mapper work (i.e., some tasks) that was already assigned to another mapper? If yes, how?
Read up on speculative execution in the Yahoo Tutorial:
One problem with the Hadoop system is that by dividing the tasks across many nodes, it is possible for a few slow nodes to rate-limit the rest of the program. For example if one node has a slow disk controller, then it may be reading its input at only 10% the speed of all the other nodes. So when 99 map tasks are already complete, the system is still waiting for the final map task to check in, which takes much longer than all the other nodes.
By forcing tasks to run in isolation from one another, individual tasks do not know where their inputs come from. Tasks trust the Hadoop platform to just deliver the appropriate input. Therefore, the same input can be processed multiple times in parallel, to exploit differences in machine capabilities. As most of the tasks in a job are coming to a close, the Hadoop platform will schedule redundant copies of the remaining tasks across several nodes which do not have other work to perform. This process is known as speculative execution. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully, first.
Speculative execution is enabled by default. You can disable speculative execution for the mappers and reducers by setting the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false, respectively.
The Yahoo Tutorial information, which only covers MapReduce v1, is a little out of date, though the concepts are the same. The new options for MR v2 are now:
mapreduce.map.speculative
mapreduce.reduce.speculative
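For instance, a minimal sketch of disabling both with the MR v2 names (the equivalent of setting the old JobConf options mentioned in the tutorial to false):

    import org.apache.hadoop.conf.Configuration;

    public class DisableSpeculation {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Turn speculative execution off for mappers and reducers;
            // both default to true.
            conf.setBoolean("mapreduce.map.speculative", false);
            conf.setBoolean("mapreduce.reduce.speculative", false);
        }
    }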
I want to know how many MapReduce jobs can be submitted/run simultaneously in a single-node Hadoop environment. Is there any limit?
From a configuration standpoint, there's no limit I'm aware of. You can set the number of map and reduce slots to whatever you want. Practically, though, each slot has to spin up a JVM capable of running some hadoop code, which requires some amount of memory, so eventually you would run out of memory on your machine. You might also have to configure job queues cleverly in order to run a ton at the same time.
Now, what is possible is a very different question than what is a good idea...
You can submit as many jobs as you want; they will be queued up, and the scheduler will run them based on FIFO (by default) and the available resources. The number of jobs executed concurrently by Hadoop will depend on the factors described by John above.
The number of Reducer slots is set when the cluster is configured. This will limit the number of MapReduce jobs based on the number of Reducers each job requests. Mappers are generally more limited by number of DataNodes and # of processors per node.
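To see where those slot limits come from in MR1, the per-TaskTracker slot counts are plain configuration keys, normally set in mapred-site.xml. The snippet below only reads them; the fallback value of 2 is Hadoop's own default for these keys:

    import org.apache.hadoop.conf.Configuration;

    public class SlotLimits {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Maximum concurrent map/reduce tasks a single TaskTracker will run.
            int mapSlots = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
            int reduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);
            System.out.println(mapSlots + " map slots, " + reduceSlots + " reduce slots");
        }
    }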