Capacity Scheduler - hadoop

The Capacity Scheduler allows sharing a Hadoop cluster along organizational lines, whereby each organization is allocated a certain capacity of the overall cluster.
What I want to know is: if a large amount of data comes in, will the capacity allocated to a certain queue change automatically?

In the Capacity Scheduler configuration we define yarn.scheduler.capacity.root.<queue name>.capacity and yarn.scheduler.capacity.root.<queue name>.maximum-capacity.
yarn.scheduler.capacity.root.<queue name>.capacity is the guaranteed capacity of the queue, while yarn.scheduler.capacity.root.<queue name>.maximum-capacity is the maximum share of cluster resources that all jobs/users in that queue can take.
As for the question - will the capacity allocated to a certain queue change automatically when a large amount of data comes in?
No, queue capacity is fixed and doesn't change automatically according to input data volume.
You can change it manually in capacity-scheduler.xml and then refresh the queues with yarn rmadmin -refreshQueues.
You could write a script that updates (and refreshes) the queue capacities according to input data volume, but I don't think that is recommended.
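For illustration, here is a minimal capacity-scheduler.xml sketch with two hypothetical queues, prod and dev (the queue names and percentages are examples only):

    <!-- child queues under root -->
    <property>
      <name>yarn.scheduler.capacity.root.queues</name>
      <value>prod,dev</value>
    </property>
    <!-- prod is guaranteed 70% of the cluster and may grow to 100% when dev is idle -->
    <property>
      <name>yarn.scheduler.capacity.root.prod.capacity</name>
      <value>70</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.prod.maximum-capacity</name>
      <value>100</value>
    </property>
    <!-- dev is guaranteed 30% but capped at 50% even when the rest of the cluster is idle -->
    <property>
      <name>yarn.scheduler.capacity.root.dev.capacity</name>
      <value>30</value>
    </property>
    <property>
      <name>yarn.scheduler.capacity.root.dev.maximum-capacity</name>
      <value>50</value>
    </property>

The capacity values of sibling queues must add up to 100. After editing the file, apply the change without restarting the ResourceManager:

    yarn rmadmin -refreshQueues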

Related

In YARN, how is the container size determined?

In a YARN application, how does the ApplicationMaster decide on the size of the container? I understand there are parameters controlling the minimum memory allocation, vcore ratio, etc. But how does the ApplicationMaster know how much memory and how many CPUs it needs for a particular job - whether MapReduce or Spark?
First let me explain in a couple of lines how YARN allocates container memory, then we'll go through the questions.
Let's assume we have 100 GB of total YARN cluster memory and yarn.scheduler.minimum-allocation-mb is 1 GB; then we can run at most 100 containers. If we set the minimum allocation to 4 GB, we can run at most 25 containers.
Each application gets the memory it asks for, rounded up to the next multiple of the minimum allocation. So if the minimum is 4 GB and you ask for 4.5 GB, you will get 8 GB.
If a task's memory requirement is bigger than the allocated container size, YARN will kill the container.
Now, coming back to your original question: how does the YARN ApplicationMaster decide how much memory and CPU a particular job needs?
The YARN ResourceManager (RM) allocates resources to the application through logical queues, which include memory, CPU, and disk resources.
By default, the RM will allow up to 8192 MB (yarn.scheduler.maximum-allocation-mb) per ApplicationMaster (AM) container allocation request.
The default minimum allocation is 1024 MB (yarn.scheduler.minimum-allocation-mb).
The AM can only request resources from the RM in increments of yarn.scheduler.minimum-allocation-mb, and it cannot exceed yarn.scheduler.maximum-allocation-mb.
The AM is responsible for rounding mapreduce.map.memory.mb and mapreduce.reduce.memory.mb up to a value divisible by yarn.scheduler.minimum-allocation-mb.
The RM will deny an allocation greater than 8192 MB or a value not divisible by 1024 MB.
The following YARN and MapReduce parameters need to be set to change the default memory configuration:
For YARN:
yarn.scheduler.minimum-allocation-mb
yarn.scheduler.maximum-allocation-mb
yarn.nodemanager.vmem-pmem-ratio
yarn.nodemanager.resource.memory-mb
For MapReduce:
mapreduce.map.java.opts
mapreduce.map.memory.mb
mapreduce.reduce.java.opts
mapreduce.reduce.memory.mb
So the conclusion is that the ApplicationMaster doesn't use any logic to calculate the resource (memory/CPU) requirements for a particular job; it simply uses the values of the parameters mentioned above.
If a task doesn't fit into the given container size (including virtual memory), the NodeManager simply kills the container.
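To make this concrete, here is a hedged sketch of the relevant settings (the sizes are illustrative, not recommendations) for a cluster where each NodeManager offers 16 GB to containers and each map task runs in a 4 GB container:

    <!-- yarn-site.xml -->
    <property>
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>1024</value>
    </property>
    <property>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>8192</value>
    </property>
    <property>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>16384</value>
    </property>

    <!-- mapred-site.xml -->
    <property>
      <name>mapreduce.map.memory.mb</name>
      <value>4096</value>
    </property>
    <property>
      <name>mapreduce.map.java.opts</name>
      <value>-Xmx3276m</value>
    </property>

The JVM heap (-Xmx) is kept roughly 20% below mapreduce.map.memory.mb so that heap plus native overhead stays inside the container and the NodeManager does not kill the task.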

What is "Memory Reserved" on YARN?

I managed to launch a Spark application on YARN. However, memory usage looks odd, as you can see below:
http://imgur.com/1k6VvSI
What does "Memory Reserved" mean? How can I manage to use all the available memory efficiently?
Thanks in advance.
Check out this blog from Cloudera that explains the new memory management in YARN.
Here are the pertinent bits:
... An implementation detail of this change that prevents applications from starving under this new flexibility is the notion of reserved containers. Imagine two jobs are running that each have enough tasks to saturate more than the entire cluster. One job wants each of its mappers to get 1GB, and another job wants its mappers to get 2GB. Suppose the first job starts and fills up the entire cluster. Whenever one of its tasks finishes, it will leave open a 1GB slot. Even though the second job deserves the space, a naive policy will give it to the first one because it’s the only job with tasks that fit. This could cause the second job to be starved indefinitely.
To prevent this unfortunate situation, when space on a node is offered to an application, if the application cannot immediately use it, it reserves it, and no other application can be allocated a container on that node until the reservation is fulfilled. Each node may have only one reserved container. The total reserved memory amount is reported in the ResourceManager UI. A high number means that it may take longer for new jobs to get space. ...
In short, a container goes into the reserved state when it is assigned to a NodeManager node that does not currently have enough free resources (CPU or memory) for it.

What is the difference between the fair and capacity schedulers?

I am new to the world of Hadoop and want to know the difference between the Fair and Capacity Schedulers. Also, when are we supposed to use each one? Please answer in simple terms, because I have read many things on the Internet but didn't get much out of them.
Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, tasks slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time. Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also a reasonable way to share a cluster between a number of users. Finally, fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that each job should get.
The CapacityScheduler is designed to allow sharing a large cluster while giving each organization a minimum capacity guarantee. The central idea is that the available resources in the Hadoop Map-Reduce cluster are partitioned among multiple organizations who collectively fund the cluster based on computing needs. There is an added benefit that an organization can access any excess capacity not being used by others. This provides elasticity for the organizations in a cost-effective manner.
Below is a feature-wise comparison of the two schedulers.
Fair Scheduler: allocates resources to pools (by weights), with fair sharing within each pool.
Capacity Scheduler: allocates resources to queues, with FIFO scheduling within each queue.
In short, the Capacity Scheduler is designed to allow sharing a large cluster while giving each organization capacity guarantees, with the possibility to use excess capacity not being used by others.

Flexible heap space allocation to Hadoop MapReduce Mapper tasks

I'm having trouble figuring out the best way to configure my Hadoop cluster (CDH4), running MapReduce1. I'm in a situation where I need to run mappers that require such a large amount of Java heap space that I couldn't possibly run more than one mapper per node - but at the same time I want to be able to run jobs that can benefit from many mappers per node.
I'm configuring the cluster through the Cloudera management UI, and the Max Map Tasks and mapred.map.child.java.opts appear to be quite static settings.
What I would like to have is something like a heap space pool with X GB available, that would accommodate both kinds of jobs without having to reconfigure the MapReduce service each time. If I run 1 mapper, it should assign X GB heap - if I run 8 mappers, it should assign X/8 GB heap.
I have considered both the Maximum Virtual Memory and the Cgroup Memory Soft/Hard limits, but neither will get me exactly what I want. Maximum Virtual Memory is not effective, since it is still a per-task setting. The Cgroup setting is problematic because it does not seem to actually restrict the individual tasks to a lower amount of heap if there are more of them, but rather allows a task to use too much memory and then kills the process when it does.
Can the behavior I want to achieve be configured?
(PS you should use the newer name of this property with Hadoop 2 / CDH4: mapreduce.map.java.opts. But both should still be recognized.)
The value you configure in your cluster is merely a default. It can be overridden on a per-job basis. You should leave the default value from CDH, or configure it to something reasonable for normal mappers.
For your high-memory job only, in your client code, set mapreduce.map.java.opts in your Configuration object for the Job before you submit it.
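As a rough sketch of that per-job override in the driver code (the class name and the values are hypothetical, not taken from the question):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class HighMemoryJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Large heap for this job only; the cluster-wide default stays untouched.
            conf.set("mapreduce.map.java.opts", "-Xmx6g");
            // On MR2/YARN, also request a container large enough to hold that heap
            // (see the note on mapreduce.map.memory.mb just below).
            conf.set("mapreduce.map.memory.mb", "8192");
            Job job = Job.getInstance(conf, "high-memory-job");
            // ... set mapper/reducer classes and input/output paths here ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }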
The answer gets more complex if you are running MR2/YARN since it no longer schedules by 'slots' but by container memory. So memory enters the picture in a new, different way with new, different properties. (It confuses me, and I'm even at Cloudera.)
In a way it would be better, because you express your resource requirement in terms of memory, which is what matters here. You would also set mapreduce.map.memory.mb to a size about 30% larger than your JVM heap size, since this is the memory allowed to the whole process. You would set it higher for high-memory jobs in the same way. Then Hadoop can decide how many mappers to run and where to put the workers for you, and use as much of the cluster as possible per your configuration. No fussing with your own imaginary resource pool.
In MR1, this is harder to get right. Conceptually you want to set the maximum number of mappers per worker to 1 via mapreduce.tasktracker.map.tasks.maximum, along with your heap setting, but just for the high-memory job. I don't know if the client can request or set this though on a per-job basis. I doubt it as it wouldn't quite make sense. You can't really approach this by controlling the number of mappers just because you have to hack around to even find out, let alone control, the number of mappers it will run.
I don't think OS-level settings will help. In a way these resemble more how MR2 / YARN thinks about resource scheduling. Your best bet may be to (move to MR2 and) use MR2's resource controls and let it figure the rest out.

Can we use both the Fair Scheduler and the Capacity Scheduler in the same Hadoop cluster?

Can we use both the Fair Scheduler and the Capacity Scheduler in the same Hadoop cluster? Which scheduler is good and effective? Can anyone help me?
I do not think both can be used at the same time, and it wouldn't make sense either. Why would you want to use both types of scheduling in the same cluster? Each scheduling algorithm came about for specific use cases. (The ResourceManager is configured with exactly one scheduler class; see the yarn-site.xml sketch below.)
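For context: the ResourceManager loads a single scheduler implementation, chosen by yarn.resourcemanager.scheduler.class in yarn-site.xml (which one is the default depends on your distribution). A minimal sketch:

    <!-- yarn-site.xml: pick exactly one scheduler implementation -->
    <property>
      <name>yarn.resourcemanager.scheduler.class</name>
      <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
    </property>

    <!-- or, for the Capacity Scheduler:
         org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler -->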
Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, tasks slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time. Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also a reasonable way to share a cluster between a number of users. Finally, fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that each job should get.
The Fair Scheduler arose out of Facebook’s need to share its data warehouse between multiple users. Facebook started using Hadoop to manage the large amounts of content and log data it accumulated every day. Initially, there were only a few jobs that needed to run on the data each day to build reports. However, as other groups within Facebook started to use Hadoop, the number of production jobs increased. In addition, analysts started using the data warehouse for ad-hoc queries through Hive (Facebook’s SQL-like query language for Hadoop), and more large batch jobs were submitted as developers experimented with the data set. Facebook’s data team considered building a separate cluster for the production jobs, but saw that this would be extremely expensive, as data would have to be replicated and the utilization on both clusters would be low. Instead, Facebook built the Fair Scheduler, which allocates resources evenly between multiple jobs and also supports capacity guarantees for production jobs. The Fair Scheduler is based on three concepts:
Jobs are placed into named “pools” based on a configurable attribute such as user name, Unix group, or specifically tagging a job as being in a particular pool through its jobconf.
Each pool can have a “guaranteed capacity” that is specified through a config file, which gives a minimum number of map slots and reduce slots to allocate to the pool. When there are pending jobs in the pool, it gets at least this many slots, but if it has no jobs, the slots can be used by other pools.
Excess capacity that is not going toward a pool’s minimum is allocated between jobs using fair sharing. Fair sharing ensures that over time, each job receives roughly the same amount of resources. This means that shorter jobs will finish quickly, while longer jobs are guaranteed not to get starved.
The scheduler also includes a number of features for ease of administration, including the ability to reload the config file at runtime to change pool settings without restarting the cluster, limits on running jobs per user and per pool, and use of priorities to weigh the shares of different jobs.
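In today's YARN Fair Scheduler the same ideas survive, with queues in place of MR1 pools and memory/vcores in place of slots. A minimal allocation-file sketch (queue names and numbers are illustrative; the file location is given by yarn.scheduler.fair.allocation.file):

    <?xml version="1.0"?>
    <allocations>
      <!-- production gets a guaranteed minimum and double weight in fair sharing -->
      <queue name="production">
        <minResources>10000 mb,10 vcores</minResources>
        <weight>2.0</weight>
      </queue>
      <!-- ad-hoc work shares what is left, capped at 20 concurrent applications -->
      <queue name="adhoc">
        <weight>1.0</weight>
        <maxRunningApps>20</maxRunningApps>
      </queue>
      <!-- per-user cap, analogous to the per-user limits mentioned above -->
      <userMaxAppsDefault>5</userMaxAppsDefault>
    </allocations>

Like the original Fair Scheduler config file, this file is re-read at runtime, so queue settings can be changed without restarting the cluster.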
The CapacityScheduler is designed to allow sharing a large cluster while giving each organization a minimum capacity guarantee. The central idea is that the available resources in the Hadoop Map-Reduce cluster are partitioned among multiple organizations who collectively fund the cluster based on computing needs. There is an added benefit that an organization can access any excess capacity not being used by others. This provides elasticity for the organizations in a cost-effective manner.
The Capacity Scheduler from Yahoo offers similar functionality to the Fair Scheduler but takes a somewhat different philosophy. In the Capacity Scheduler, you define a number of named queues. Each queue has a configurable number of map and reduce slots. The scheduler gives each queue its capacity when it contains jobs, and shares any unused capacity between the queues. However, within each queue, FIFO scheduling with priorities is used, except for one aspect – you can place a limit on percent of running tasks per user, so that users share a cluster equally. In other words, the capacity scheduler tries to simulate a separate FIFO/priority cluster for each user and each organization, rather than performing fair sharing between all jobs. The Capacity Scheduler also supports configuring a wait time on each queue after which it is allowed to preempt other queues’ tasks if it is below its fair share.
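The per-user limit described above corresponds to the user-limit settings of the YARN Capacity Scheduler. A hedged sketch for a hypothetical queue named prod in capacity-scheduler.xml:

    <!-- each active user in prod is guaranteed at least 25% of the queue's resources -->
    <property>
      <name>yarn.scheduler.capacity.root.prod.minimum-user-limit-percent</name>
      <value>25</value>
    </property>
    <!-- a single user may use at most 1x the queue's configured capacity -->
    <property>
      <name>yarn.scheduler.capacity.root.prod.user-limit-factor</name>
      <value>1</value>
    </property>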
Hence it boils down to your needs and setup when deciding which scheduler to go with.
Apache Hadoop now supports both types of scheduling. More detailed info can be found at the following links:
Capacity Scheduler
Fair Scheduler
