What is the AM limit in YARN?

I have heard the term "AM limit" a couple of times in the context of running jobs on a YARN big data cluster.
It's also mentioned here:
https://issues.apache.org/jira/browse/YARN-6428
What does it mean?

It's a setting to guarantee you don't livelock your cluster. A MapReduce job has an ApplicationMaster (AM), which in turn spawns the mappers and reducers. If your queue's resources are entirely occupied by AMs, then no mappers or reducers can run, which means none of your AMs will ever complete and no meaningful work gets done. You're in a livelock scenario.
Both the Capacity Scheduler and the Fair Scheduler have a way to limit the fraction of resources that can be held by AMs. In the Capacity Scheduler, look for yarn.scheduler.capacity.maximum-am-resource-percent. In the Fair Scheduler, look for maxAMShare.
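As a rough illustration, both limits live in the scheduler configuration files; the values below are placeholders, not recommendations.
Capacity Scheduler (capacity-scheduler.xml):
<property>
  <!-- at most 20% of resources may be used to run ApplicationMasters -->
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.2</value>
</property>
Fair Scheduler (fair-scheduler.xml, per queue):
<queue name="default">
  <!-- at most 30% of this queue's fair share may be occupied by ApplicationMasters -->
  <maxAMShare>0.3</maxAMShare>
</queue>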

Related

How many reducers can run simultaneously?

I'm learning big data at university and I'm a bit confused about MapReduce. I was wondering how many reducers can run simultaneously. For example, let's say we had 864 reducers; how many could run at the same time?
All of them can run simultaneously, depending on the state of the cluster (its health, i.e. no rogue/bad nodes), its capacity, and how free it is. If other MR jobs are running on the same cluster, then only a few of your 864 reducers will be in the running state at first; once capacity frees up, the next set of reducers will start running.
There is also a case that sometimes happens where your reducers and mappers keep preempting each other and consume all of the memory; the job fails in most such cases. To avoid this, we generally configure a smaller number of reducers.
The one-line answer is: all of them can run simultaneously, since each reducer performs an independent unit of work in the MapReduce framework.
How many actually run in parallel, or more precisely when each of them is scheduled to run, depends on many factors, including but not limited to resource availability, the scheduling mechanism, and cluster configuration.
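As a purely hypothetical worked example: on a cluster with 20 nodes and 8 reduce slots (or containers) per node, at most 20 × 8 = 160 of the 864 reducers run at once; the remaining 704 sit in the queue and are scheduled as slots free up.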

hadoop: limit number of concurrent map / reduce tasks per job

I want to submit a research job to a production cluster. Since I don't need this job to finish quickly, and I don't want it to delay production jobs, I want to limit the number of tasks executing for that job at any given time. Can I do that in Hadoop 2?
The Fair Scheduler can be used to limit MapReduce resources (map/reduce slots). Create a new Fair Scheduler pool with the desired maximum number of mappers and reducers, and submit the job to that newly created pool.
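A minimal sketch of what such a pool might look like in the YARN Fair Scheduler allocation file (fair-scheduler.xml); the queue name and limits are placeholders, and on YARN the cap is expressed as resource and running-app limits on the queue rather than explicit mapper/reducer counts:
<?xml version="1.0"?>
<allocations>
  <!-- hypothetical low-impact pool for research jobs -->
  <queue name="research">
    <maxResources>8192 mb, 4 vcores</maxResources>  <!-- caps the containers the pool can hold -->
    <maxRunningApps>1</maxRunningApps>
    <weight>0.25</weight>
  </queue>
</allocations>
Then submit the job to that pool by setting mapreduce.job.queuename=research.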
You can also do the following (note that this shrinks the job's total number of tasks and lowers its priority, rather than capping how many of its tasks run concurrently):
job.getConfiguration().setInt("mapred.map.tasks", 1); // hint for the number of map tasks ("mapreduce.job.maps" is the Hadoop 2 name)
job.setNumReduceTasks(1);                             // total number of reduce tasks for the job
job.setPriority(JobPriority.VERY_LOW);                // lower the job's scheduling priority

How many Mapreduce Jobs can be run simultaneously

I want to know how many MapReduce jobs can be submitted/run simultaneously in a single-node Hadoop environment. Is there any limit?
From a configuration standpoint, there's no limit I'm aware of. You can set the number of map and reduce slots to whatever you want. Practically, though, each slot has to spin up a JVM capable of running some Hadoop code, which requires some amount of memory, so eventually you would run out of memory on your machine. You might also have to configure job queues cleverly in order to run a ton at the same time.
Now, what is possible is a very different question than what is a good idea...
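As a hypothetical illustration of the memory argument above: a single-node box with 8 GB of RAM and roughly 1 GB per task JVM can only sustain on the order of 6-8 concurrent map/reduce slots, so while you can submit dozens of jobs, only the tasks that fit in that memory budget actually run at any moment.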
You can submit as many jobs as you want; they will be queued up, and the scheduler will run them based on FIFO ordering (by default) and available resources. The number of jobs Hadoop actually executes depends on the factors described by John above.
The number of reducer slots is set when the cluster is configured. This limits the number of MapReduce jobs according to the number of reducers each job requests. Mappers are generally limited more by the number of DataNodes and the number of processors per node.

Set reducer capacity for a specific M/R job

I want to change the cluster's reduce-slot capacity on a per-job basis. That is to say,
I originally have 8 reduce slots configured per TaskTracker, so for a job with 100 reduce tasks there will be (8 * number of datanodes) reduce tasks running at the same time. But for one specific job I want to cut this number in half, so I did:
conf.set("mapred.tasktracker.reduce.tasks.maximum", "4"); // trying to cap reduce slots for this job only
...
Job job = new Job(conf, ...)
And in the web UI I can see that for this job the maximum reduce tasks is exactly 4, as I set. However, Hadoop still launches 8 reducers per datanode for this job... It seems that I can't alter the reduce capacity like this.
I asked on the Hadoop mailing list, and some suggested that I could achieve this with the Capacity Scheduler. How can I do that?
I'm using Hadoop 1.0.2.
Thanks.
The Capacity Scheduler allows you to specify resource limits for your MapReduce jobs. Basically, you define queues to which your jobs are scheduled; each queue can have a different configuration.
As far as your issue is concerned, when using the Capacity Scheduler you can specify per-task RAM limits in order to control how many slots a given task occupies. According to the documentation, memory-based scheduling is currently only supported on Linux.
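A minimal sketch of what the job-side settings might look like, assuming a Capacity Scheduler queue named "limited" has been defined and memory-based scheduling is enabled on the cluster; the property names are the Hadoop 1.x ones and should be verified against your version:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
// submit to a hypothetical queue defined in the capacity scheduler configuration
conf.set("mapred.job.queue.name", "limited");
// request more memory per task, so each task occupies more of the queue's slot budget
// and fewer tasks run concurrently per TaskTracker
conf.set("mapred.job.map.memory.mb", "2048");
conf.set("mapred.job.reduce.memory.mb", "2048");
Job job = new Job(conf, "reduced-concurrency job");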
For further information about this topic, see: http://wiki.apache.org/hadoop/LimitingTaskSlotUsage and http://hadoop.apache.org/docs/stable/capacity_scheduler.html.

Can we use both Fair scheduler and Capacity Scheduler in the same hadoop cluster

Can we use both the Fair Scheduler and the Capacity Scheduler in the same Hadoop cluster? Which scheduler is good and effective? Can anyone help me?
I do not think both can be used at the same time; the cluster is configured with a single scheduler, and it would not make much sense anyway. Why would you want both types of scheduling in the same cluster? Each scheduling algorithm arose from specific use cases.
Fair scheduling is a method of assigning resources to jobs such that
all jobs get, on average, an equal share of resources over time. When
there is a single job running, that job uses the entire cluster. When
other jobs are submitted, task slots that free up are assigned to the
new jobs, so that each job gets roughly the same amount of CPU time.
Unlike the default Hadoop scheduler, which forms a queue of jobs, this
lets short jobs finish in reasonable time while not starving long
jobs. It is also a reasonable way to share a cluster between a number
of users. Finally, fair sharing can also work with job priorities -
the priorities are used as weights to determine the fraction of total
compute time that each job should get.
The Fair Scheduler arose out of Facebook’s need to share its data warehouse between multiple users. Facebook started using Hadoop to manage the large amounts of content and log data it accumulated every day. Initially, there were only a few jobs that needed to run on the data each day to build reports. However, as other groups within Facebook started to use Hadoop, the number of production jobs increased. In addition, analysts started using the data warehouse for ad-hoc queries through Hive (Facebook’s SQL-like query language for Hadoop), and more large batch jobs were submitted as developers experimented with the data set. Facebook’s data team considered building a separate cluster for the production jobs, but saw that this would be extremely expensive, as data would have to be replicated and the utilization on both clusters would be low. Instead, Facebook built the Fair Scheduler, which allocates resources evenly between multiple jobs and also supports capacity guarantees for production jobs. The Fair Scheduler is based on three concepts:
1. Jobs are placed into named “pools” based on a configurable attribute such as user name, Unix group, or specifically tagging a job as being in a particular pool through its jobconf.
2. Each pool can have a “guaranteed capacity” that is specified through a config file, which gives a minimum number of map slots and reduce slots to allocate to the pool. When there are pending jobs in the pool, it gets at least this many slots, but if it has no jobs, the slots can be used by other pools.
3. Excess capacity that is not going toward a pool’s minimum is allocated between jobs using fair sharing. Fair sharing ensures that over time, each job receives roughly the same amount of resources. This means that shorter jobs will finish quickly, while longer jobs are guaranteed not to get starved.
The scheduler also includes a number of features for ease of administration, including the ability to reload the config file at runtime to change pool settings without restarting the cluster, limits on running jobs per user and per pool, and use of priorities to weigh the shares of different jobs.
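A minimal sketch of what a pool definition might look like in the classic (MR1) Fair Scheduler allocation file; the pool name and numbers are purely illustrative:
<?xml version="1.0"?>
<allocations>
  <!-- illustrative pool with a guaranteed minimum number of slots -->
  <pool name="production">
    <minMaps>20</minMaps>
    <minReduces>10</minReduces>
    <weight>2.0</weight>
  </pool>
  <!-- cap on concurrently running jobs per user -->
  <userMaxJobsDefault>5</userMaxJobsDefault>
</allocations>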
The CapacityScheduler is designed to allow sharing a large cluster
while giving each organization a minimum capacity guarantee. The
central idea is that the available resources in the Hadoop Map-Reduce
cluster are partitioned among multiple organizations who collectively
fund the cluster based on computing needs. There is an added benefit
that an organization can access any excess capacity not being used by
others. This provides elasticity for the organizations in a
cost-effective manner.
The Capacity Scheduler from Yahoo offers similar functionality to the Fair Scheduler but takes a somewhat different philosophy. In the Capacity Scheduler, you define a number of named queues. Each queue has a configurable number of map and reduce slots. The scheduler gives each queue its capacity when it contains jobs, and shares any unused capacity between the queues. However, within each queue, FIFO scheduling with priorities is used, except for one aspect – you can place a limit on percent of running tasks per user, so that users share a cluster equally. In other words, the capacity scheduler tries to simulate a separate FIFO/priority cluster for each user and each organization, rather than performing fair sharing between all jobs. The Capacity Scheduler also supports configuring a wait time on each queue after which it is allowed to preempt other queues’ tasks if it is below its fair share.
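In today's YARN terms, a minimal sketch of defining two such queues in capacity-scheduler.xml might look like the following; the queue names and percentages are placeholders:
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>prod,research</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.prod.capacity</name>
  <value>70</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.research.capacity</name>
  <value>30</value>
</property>
<property>
  <!-- allow the research queue to expand into unused capacity, up to this percentage -->
  <name>yarn.scheduler.capacity.root.research.maximum-capacity</name>
  <value>50</value>
</property>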
Hence it boils down to your needs and setup when deciding which scheduler to go with.
Apache Hadoop now supports both of these scheduling types. More detailed info can be found at the following links:
Capacity Scheduler
Fair Scheduler
