Hadoop: limit number of concurrent map / reduce tasks per job

I want to submit a research job into a production cluster. As I don't need this job to finish quickly, and I don't want to delay production jobs, I want to limit the number of tasks that are executing for that job at any given time. Can I do that in Hadoop 2?

To limit the MapReduce resources (map/reduce slots) a job can use, the Fair Scheduler can be used. Create a new Fair Scheduler pool with the desired maximum number of mappers and reducers, and submit the job to that newly created pool.
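A rough sketch of such a pool, assuming the classic (MR1) Fair Scheduler allocation file, where pools support maxMaps and maxReduces; the pool name "research" and the numbers are only examples (on YARN's Fair Scheduler the closest equivalent is capping a queue's maxResources):

<allocations>
  <pool name="research">
    <!-- cap on concurrently running map / reduce tasks for jobs in this pool -->
    <maxMaps>10</maxMaps>
    <maxReduces>5</maxReduces>
  </pool>
</allocations>

The job is then pointed at that pool at submission time, typically by setting mapred.fairscheduler.pool=research.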

You can also do the following:
// Assumes an org.apache.hadoop.mapreduce.Job instance named job.
job.getConfiguration().setInt("mapred.map.tasks", 1); // hint the framework to use a single map task (a hint, not a hard limit)
job.setNumReduceTasks(1);                             // run with a single reduce task
job.setPriority(JobPriority.VERY_LOW);                // lower the job's scheduling priority

Related

What is AM limit in yarn?

I have heard the term AM limit a couple of times in the context of running jobs in a YARN big data cluster.
It's also mentioned here:
https://issues.apache.org/jira/browse/YARN-6428
What does it mean?
It's a setting that guarantees you don't livelock your cluster. A MapReduce job has an AM (ApplicationMaster), and the AM spawns the mappers and reducers. If your queue holds only AM containers, then no mappers or reducers can run, which means none of your AMs will ever complete and no meaningful work gets done. You're in a live-lock scenario.
Both the Capacity Scheduler and the Fair Scheduler have a way to limit the percentage of resources that can be held by AMs. In the Capacity Scheduler, look for yarn.scheduler.capacity.maximum-am-resource-percent. In the Fair Scheduler, look for maxAMShare.
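A minimal sketch of both settings (the 0.2 values are purely illustrative):

In capacity-scheduler.xml:
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <!-- at most 20% of cluster resources may be used to run ApplicationMasters -->
  <value>0.2</value>
</property>

In the Fair Scheduler allocation file, per queue:
<queue name="default">
  <!-- AMs may use at most 20% of this queue's fair share -->
  <maxAMShare>0.2</maxAMShare>
</queue>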

How many Mapreduce Jobs can be run simultaneously

I want to know how many MapReduce jobs can be submitted/run simultaneously in a single-node Hadoop environment. Is there any limit?
From a configuration standpoint, there's no limit I'm aware of. You can set the number of map and reduce slots to whatever you want. Practically, though, each slot has to spin up a JVM capable of running some Hadoop code, which requires some amount of memory, so eventually you would run out of memory on your machine. You might also have to configure job queues cleverly in order to run a ton at the same time.
Now, what is possible is a very different question than what is a good idea...
You can submit as many jobs as you want; they will be queued up, and the scheduler will run them in FIFO order (by default) as resources become available. The number of jobs Hadoop executes concurrently depends on the factors described by John above.
The number of reducer slots is set when the cluster is configured. This will limit the number of MapReduce jobs based on the number of reducers each job requests. Mappers are generally more limited by the number of DataNodes and the number of processors per node.
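For reference, the per-TaskTracker slot counts mentioned above are configured in mapred-site.xml on MR1; the values here are only examples:

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>

Each slot spins up its own JVM, so the per-task heap (e.g. mapred.child.java.opts) is what ultimately bounds how high these numbers can go on a single node.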

Set reducer capacity for a specific M/R job

I want to change the cluster's capacity of reduce slots on a per-job basis. That is to say, originally I have 8 reduce slots configured for a TaskTracker, so for a job with 100 reduce tasks, there will be (8 * datanode number) reduce tasks running at the same time. But for a specific job, I want to reduce this number to half, so I did:
conf.set("mapred.tasktracker.reduce.tasks.maximum", "4");
...
Job job = new Job(conf, ...)
And in the web UI I can see that for this job, the max reduce tasks is exactly 4, as I set. However, Hadoop still launches 8 reducers per datanode for this job... It seems that I can't alter the reduce capacity like this.
I asked on the Hadoop mailing list, and some suggested that I can do it with the Capacity Scheduler. How could I do that?
I'm using Hadoop 1.0.2.
Thanks.
The Capacity Scheduler allows you to specify resource limits for your MapReduce jobs. Basically you have to define queues, to which your jobs are scheduled. Each queue can have a different configuration.
As far as your issue is concerned, when using the Capacity Scheduler you can specify RAM-per-task limits in order to limit how many slots a given task takes. According to the documentation, memory-based scheduling is currently only supported on Linux.
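A rough sketch of that queue setup on Hadoop 1.x, where the queue name "research" and the capacity value are hypothetical: declare the queue in mapred-site.xml, give it a share of the cluster in capacity-scheduler.xml, and submit the job to it.

In mapred-site.xml:
<property>
  <name>mapred.queue.names</name>
  <value>default,research</value>
</property>

In capacity-scheduler.xml:
<property>
  <name>mapred.capacity-scheduler.queue.research.capacity</name>
  <!-- this queue is guaranteed 20% of the cluster's slots -->
  <value>20</value>
</property>

In the job:
conf.set("mapred.job.queue.name", "research");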
For further information about this topic, see: http://wiki.apache.org/hadoop/LimitingTaskSlotUsage and http://hadoop.apache.org/docs/stable/capacity_scheduler.html.

Hadoop streaming api - limit number of mappers on a per job basis

I have a job running on a small Hadoop cluster, and I want to limit the number of mappers it spawns per datanode. When I use -Dmapred.map.tasks=12, it still spawns 17 mappers for some reason. I've figured out a way to limit it globally, but I want to do it on a per-job basis.
In MapReduce, the total number of mappers spawned depends on the input splits created from your data.
One mapper task is spawned per input split, so you cannot directly decrease the mapper count in MapReduce.
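For example, assuming the default block-based splitting of FileInputFormat, a 2 GB input stored in 128 MB blocks produces 16 splits and therefore 16 map tasks, regardless of what -Dmapred.map.tasks suggests (that property is only a hint to the framework).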

Running jobs in parallel in Hadoop

I am new to Hadoop.
I have set up a 2-node cluster.
How do I run 2 jobs in parallel in Hadoop?
When I submit jobs, they run one by one in FIFO order. I need to run the jobs in parallel. How do I achieve that?
Thanks
MRK
Hadoop can be configured with a number of schedulers and the default is the FIFO scheduler.
The FIFO Scheduler behaves like this.
Scenario 1: If the cluster has a capacity of 10 map tasks and job1 needs 15 map tasks, then job1 alone runs across the complete cluster. As job1 makes progress and slots become free that job1 no longer needs, job2 runs on the cluster.
Scenario 2: If the cluster has a capacity of 10 map tasks and job1 needs 6 map tasks, then job1 takes 6 slots and job2 takes 4 slots. job1 and job2 run in parallel.
To run jobs in parallel from the start, you can either configure a Fair Scheduler or a Capacity Scheduler based on your requirements. The mapreduce.jobtracker.taskscheduler and the specific scheduler parameters have to be set for this to take effect in the mapred-site.xml.
Edit: Updated the answer based on the comment from MRK.
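For instance, switching classic MapReduce to the Fair Scheduler might look roughly like this in mapred-site.xml (the allocation file path is just a placeholder; for the Capacity Scheduler the class would be org.apache.hadoop.mapred.CapacityTaskScheduler):

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
<property>
  <name>mapred.fairscheduler.allocation.file</name>
  <value>/path/to/fair-scheduler.xml</value>
</property>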
You have "Map Task Capacity" and "Reduce Task Capacity". Whenever those are free they would pick the job in FIFO order. Your submitted jobs contains mapper and optionally reducer. If your jobs mapper (and/or reducer) count is smaller then the cluster's capacity it would take the next jobs mapper (and/or reducer).
If you don't like FIFO, you can always give priority to your submitted jobs.
Edit:
Sorry about the slight misinformation; Praveen's answer is the right one.
In addition to his answer, you can check the HOD scheduler as well.
With the default scheduler, only one job per user runs at a time. You can launch different jobs from different user IDs; they will run in parallel. Of course, as mentioned by others, you need to have enough slot capacity.
