I want to change the cluster's reduce slot capacity on a per-job basis. That is to say,
originally I have 8 reduce slots configured per tasktracker, so for a job with 100 reduce tasks, (8 * number of datanodes) reduce tasks will run at the same time. But for one specific job I want to cut this number in half, so I did:
conf.set("mapred.tasktracker.reduce.tasks.maximum", "4");
...
Job job = new Job(conf, ...)
And in the web UI I can see that for this job the max reduce tasks is exactly 4, as I set it. However, Hadoop still launches 8 reducers per datanode for this job... It seems that I can't alter the reduce capacity like this.
I asked on the Hadoop mailing list, and some suggested that I can do it with the Capacity Scheduler. How can I do that?
I'm using hadoop 1.0.2.
Thanks.
The Capacity Scheduler allows you to specify resource limits for your MapReduce jobs. Basically, you define queues to which your jobs are scheduled, and each queue can have a different configuration.
As far as your issue is concerned, when using the Capacity Scheduler you can specify RAM-per-task limits in order to limit how many slots a given task takes. According to the documentation, memory-based scheduling is currently only supported on Linux.
For further information about this topic, see: http://wiki.apache.org/hadoop/LimitingTaskSlotUsage and http://hadoop.apache.org/docs/stable/capacity_scheduler.html.
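As a rough sketch, a Hadoop 1.x job could then be submitted to such a queue with memory limits attached. The queue name "research" and the memory values below are only assumptions; the queue has to be defined by the cluster admin in the Capacity Scheduler configuration first:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Send the job to a hypothetical Capacity Scheduler queue named "research"
conf.set("mapred.job.queue.name", "research");
// Per-task virtual memory requests (Hadoop 1.x property names); the scheduler uses
// these to decide how many slots a task occupies, and they only take effect when
// memory-based scheduling is enabled cluster-wide
conf.set("mapred.job.map.memory.mb", "2048");
conf.set("mapred.job.reduce.memory.mb", "4096");
Job job = new Job(conf, "memory-limited job");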
Related
I have heard the term AM limit a couple of times in the context of running jobs in a YARN Big Data cluster.
It's also mentioned here:
https://issues.apache.org/jira/browse/YARN-6428
What does it mean?
It's a setting that guarantees you don't livelock your cluster. A MapReduce job has an AM (ApplicationMaster), and the AM spawns the mappers and reducers. If your queue is filled with nothing but AMs, then no mappers or reducers can run, which means none of your AMs will ever complete and you cannot do any meaningful work. You're in a livelock scenario.
Both the Capacity Scheduler and the Fair Scheduler have a way to limit the share of resources that can be held by AMs. In the Capacity Scheduler, look for yarn.scheduler.capacity.maximum-am-resource-percent. In the Fair Scheduler, look for maxAMShare.
I want to run many jobs at the same time on a Hadoop cluster, but I want to prevent some jobs from starting their reduce phase (making reduce slots busy or reserved) before all of that job's map tasks are complete.
Is there any per-job configuration to impose a limit like this?
Thanks.
Reduce slow start
By default, schedulers wait until 5% of the map tasks in a job have completed before scheduling reduce tasks for the same job. For large jobs this can cause problems with cluster utilization, since they take up reduce slots while waiting for the map tasks to complete. Setting mapred.reduce.slowstart.completed.maps to a higher value, such as 0.80 (80%), can help improve throughput.
Reference: Hadoop: The Definitive Guide, 3rd edition, Chapter 9: Setting Up a Hadoop Cluster, page 316
You can get the default values for Apache Hadoop here. mapred.reduce.slowstart.completed.maps has the default value 0.05, which is described as:
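If you only want this behaviour for particular jobs, the same property can also be set per job in the driver. A minimal sketch, where 0.80 is just the example value from the quote above:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Don't schedule this job's reducers until 80% of its map tasks have finished
// (newer releases name this property mapreduce.job.reduce.slowstart.completedmaps)
conf.set("mapred.reduce.slowstart.completed.maps", "0.80");
Job job = new Job(conf, "slow-start job");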
Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job.
I want to submit a research job to a production cluster. Since I don't need this job to finish quickly, and I don't want it to delay production jobs, I want to limit the number of tasks executing for that job at any given time. Can I do that in Hadoop 2?
For limiting Hadoop MapReduce resources (map/reduce slots), the Fair Scheduler can be used. Create a new Fair Scheduler pool with the desired maximum number of mappers and reducers, and submit the job to that newly created pool.
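For example, once such a pool has been defined with maxMaps/maxReduces caps in the Fair Scheduler allocation file, the research job only needs to be submitted to it. A minimal sketch, where "research" is a hypothetical pool name:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Submit to a hypothetical Fair Scheduler pool/queue named "research";
// its maxMaps/maxReduces limits live in the scheduler's allocation file, not here
conf.set("mapreduce.job.queuename", "research");
Job job = Job.getInstance(conf, "research job");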
You can also do the following in the job driver:
// mapred.map.tasks is only a hint; the actual number of map tasks is driven by the input splits
job.getConfiguration().setInt("mapred.map.tasks", 1);
// Run the job with a single reduce task
job.setNumReduceTasks(1);
// Lower the job's priority so production jobs are scheduled first
job.setPriority(JobPriority.VERY_LOW);
I am new to Hadoop. Since the data transfer between map nodes and reduce nodes may reduce the efficiency of MapReduce, why aren't map tasks and reduce tasks put together on the same node?
Actually, you can run map and reduce in the same JVM if the data is 'small' enough. This is possible in Hadoop 2.0 (aka YARN) and is now called an uber task.
From the great "Hadoop: The Definitive Guide" book:
If the job is small, the application master may choose to run the tasks in the same JVM as itself. This happens when it judges the overhead of allocating and running tasks in new containers outweighs the gain to be had in running them in parallel, compared to running them sequentially on one node. (This is different from MapReduce 1, where small jobs are never run on a single tasktracker.) Such a job is said to be uberized, or run as an uber task.
The amount of data to be processed is usually too large; that's why map and reduce run on separate nodes. If the amount of data to be processed is small, then you can definitely run Map and Reduce on the same node.
Hadoop is usually used when the amount of data is very large, and in that case, for high availability and concurrency, separate nodes are needed for the map and reduce operations.
Hope this clears your doubt.
An uber job occurs when multiple mappers and reducers are combined so that they execute inside the ApplicationMaster.
So, assuming the job to be executed has MAX mappers <= 9 and MAX reducers <= 1, the ResourceManager (RM) creates an ApplicationMaster and the job executes entirely within the ApplicationMaster, using its own JVM.
SET mapreduce.job.ubertask.enable=TRUE;
So the advantage of using an uberised job is that the round-trip overhead of the ApplicationMaster requesting containers for the job from the ResourceManager (RM), and the RM allocating those containers back to the ApplicationMaster, is eliminated.
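As a sketch, the same switch (together with the size thresholds mentioned above) can also be set from a MapReduce driver; the threshold values shown are the usual defaults:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// Allow the ApplicationMaster to run a small job inside its own JVM ("uber" mode)
conf.setBoolean("mapreduce.job.ubertask.enable", true);
// The job is only uberised if it stays within these thresholds (9 maps / 1 reduce by default)
conf.setInt("mapreduce.job.ubertask.maxmaps", 9);
conf.setInt("mapreduce.job.ubertask.maxreduces", 1);
Job job = Job.getInstance(conf, "small job");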
I want to know how many MapReduce jobs can be submitted/run simultaneously in a single-node Hadoop environment. Is there any limit?
From a configuration standpoint, there's no limit I'm aware of. You can set the number of map and reduce slots to whatever you want. Practically, though, each slot has to spin up a JVM capable of running some hadoop code, which requires some amount of memory, so eventually you would run out of memory on your machine. You might also have to configure job queues cleverly in order to run a ton at the same time.
Now, what is possible is a very different question than what is a good idea...
You can submit as many jobs as you want; they will be queued up, and the scheduler will run them based on FIFO (by default) and available resources. The number of jobs executed concurrently by Hadoop will depend on the factors described by John above.
The number of reducer slots is set when the cluster is configured. This will limit the number of MapReduce jobs based on the number of reducers each job requests. Mappers are generally more limited by the number of DataNodes and the number of processors per node.