Running jobs parallely in hadoop - hadoop

I am new to hadoop.
I have set up a 2 node cluster.
How to run 2 jobs parallely in hadoop.
When i submit jobs, they are running one by one in FIFO order. I have to run the jobs parallely. How to acheive that.
Thanks
MRK

Hadoop can be configured with a number of schedulers and the default is the FIFO scheduler.
FIFO Schedule behaves like this.
Scenario 1: If the cluster has 10 Map Task capacity and job1 needs 15 Map Task, then running job1 takes the complete cluster. As job1 makes progress and there are free slots available which are not used by job1 then job2 runs on the cluster.
Scenario 2: If the cluster has 10 Map Task capacity and job1 needs 6 Map Task, then job1 takes 6 slots and job2 takes 4 slots. job1 and job2 run in parallel.
To run jobs in parallel from the start, you can either configure a Fair Scheduler or a Capacity Scheduler based on your requirements. The mapreduce.jobtracker.taskscheduler and the specific scheduler parameters have to be set for this to take effect in the mapred-site.xml.
Edit: Updated the answer based on the comment from MRK.

You have "Map Task Capacity" and "Reduce Task Capacity". Whenever those are free they would pick the job in FIFO order. Your submitted jobs contains mapper and optionally reducer. If your jobs mapper (and/or reducer) count is smaller then the cluster's capacity it would take the next jobs mapper (and/or reducer).
If you don't like FIFO, you can always give priority to your submitted jobs.
Edit:
Sorry about slight missinformation, Praveen's answer is the right one.
in adition to his answer you can check HOD scheduler aswell.

With the default scheduler only one job per user at a time. You can launch different jobs from different user ids. They will run in parallel, of course, as mentioned by others you need to have enough slot capacity.

Related

What is AM limit in yarn?

I have heard the term AM limit a couple of times in the context of running jobs in a yarn Big Data cluster.
Its also mentioned here:
https://issues.apache.org/jira/browse/YARN-6428
What does it mean?
It's a setting to guarantee you don't livelock your cluster. A Map-Reduce job has an AM and that spawns mappers and reducers. If your queue only has AM tasks then you cannot run any mappers or reducers which means none of your AMs will complete and you cannot do any meaningful work. You're in a live-lock scenario.
Both Capacity Scheduler and Fair Scheduler have a way to limit the percentage of tasks that can be held by AMs. In Capacity Scheduler look for yarn.scheduler.capacity.maximum-am-resource-percent. In Fair Scheduler look for maxAMShare.

Hadoop Map/Reduce Job distribution

I have 4 nodes and I am running a mapreduce sample project to see if job is being distrubuted between all 4 nodes. I ran the project mulitple times and have noticed that, the mapper task is being splitted among all 4 nodes but the reducer task is only being done by one node. Is this how it is suppose to be or is reducer task suppose to be split among all 4 nodes as well.
Thank you
Distribution of Mappers depends on which block of data the mapper will operate on. Framework by default tries to assign the task to a node which has the block of data stored. This will prevent network transfer of data.
For reducers again it depends on no. of reducers which your job requires. If your job uses only one reducer it may be assigned to any pf the nodes.
Also impacting this is speculative execution. If on then this results in multiple instances of map task/ reduce task to start on different nodes and the job tracker based on % completion decides which one will go through and other instances will be killed.
Let us say you 224 MB file. When you add that file into HDFS based on the default block size of 64 MB, the files are split into 4 blocks [blk1=64M,blk2=64M,blk3=64M,blk4=32M]. Let us assume blk1 in on node1 represented as blk1::node1, blk2::node2, blk3:node3, blk4:node4. Now when you run the MR, the Map needs to access the input file. So MR FWK creates 4 mappers and are executed on each node. Now comes the Reducer, as Venkat said it depends on no.of reducers configured for your job. The reducers can be configured using the Hadoop org.apache.hadoop.mapreduce.Job setNumReduceTasks(int tasks) API.

hadoop: limit number of concurrent map / reduce tasks per job

I want to submit a research job into a production cluster. As I don't need this job to finish quickly, and I don't want to delay production jobs, I want to limit the number of tasks that are executing for that job at any given time. Can I do that in Hadoop 2?
For limiting the Hadoop mapreduce resources (map/reduce slots) Fair scheduler can be used. You better create a new fairscheduler pool by setting up desired number of maximum mappers and maximum reducers and job can be submitted to that newly created fairscheduler pool.
you can also do the following
job.getConfiguration().setInt("mapred.map.tasks", 1);
job.setNumReduceTasks(1);
job.setPriority(JobPriority.VERY_LOW);

How many Mapreduce Jobs can be run simultaneously

I want to know How many Mapreduce Jobs can be submit/run simultaneously in a single node hadoop envirnment.Is there any limit?
From a configuration standpoint, there's no limit I'm aware of. You can set the number of map and reduce slots to whatever you want. Practically, though, each slot has to spin up a JVM capable of running some hadoop code, which requires some amount of memory, so eventually you would run out of memory on your machine. You might also have to configure job queues cleverly in order to run a ton at the same time.
Now, what is possible is a very different question than what is a good idea...
You can submit as many jobs you want, they will be queued up and scheduler will run them based on FIFO(by default) and available resources.The number of jobs being executed by hadoop will depend as described by John above.
The number of Reducer slots is set when the cluster is configured. This will limit the number of MapReduce jobs based on the number of Reducers each job requests. Mappers are generally more limited by number of DataNodes and # of processors per node.

Set reducer capacity for a specific M/R job

I want to change the cluster's capacity of reduce slots on a per job basis. That is to say,
originally I have 8 reduce slots configured for a tasktracker, so for a job with 100 reduce tasks, there will be (8 * datanode number) reduce tasks running in the same time. But for a specific job, I want to reduce this number to its half, so I did:
conf.set("mapred.tasktracker.reduce.tasks.maximum", "4");
...
Job job = new Job(conf, ...)
And in the web UI I can see that for this job, the max reduce tasks is exactly at 4, like I set. However hadoop still launches 8 reducer per datanode for this job... It seems that I can't alter the reduce capacity like this.
I asked on the Hadoop mail list, some suggests that I can make it with capacity scheduler, how could I do?
I'm using hadoop 1.0.2.
Thanks.
The Capacity Scheduler allows you to specify resource limits for your MapReduce jobs. Basically you have to define queues, to which your job are being scheduled. Each queue can have different configuration.
As far as your issue is concerned, when using the capacity scheduler one can specify RAM-per-task limits in order to limit how many slots a given task takes. According to the documentation, currently the memory based scheduling is only supported in Linux platform.
For further information about this topic, see: http://wiki.apache.org/hadoop/LimitingTaskSlotUsage and http://hadoop.apache.org/docs/stable/capacity_scheduler.html.

Resources