I am new to the world of Hadoop and want to know the difference between fair and capacity schedulers. Also when are we supposed to use each one? Please answer in a simple way because I read many things on the Internet but I don't get much from them.
Fair scheduling is a method of assigning resources to jobs such that all jobs get, on average, an equal share of resources over time. When there is a single job running, that job uses the entire cluster. When other jobs are submitted, task slots that free up are assigned to the new jobs, so that each job gets roughly the same amount of CPU time. Unlike the default Hadoop scheduler, which forms a queue of jobs, this lets short jobs finish in reasonable time while not starving long jobs. It is also a reasonable way to share a cluster between a number of users. Finally, fair sharing can also work with job priorities - the priorities are used as weights to determine the fraction of total compute time that each job should get.
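To make the weighting concrete, here is a minimal Python sketch. It is not Hadoop code; `fair_shares` and the job names are made up for illustration, and it only computes the instantaneous target share of slots for each running job.

```python
def fair_shares(total_slots, job_weights):
    """Split total_slots among running jobs in proportion to their weights.

    job_weights maps a (hypothetical) job name to its priority weight.
    """
    total_weight = sum(job_weights.values())
    return {job: total_slots * weight / total_weight
            for job, weight in job_weights.items()}


# A single job gets the whole cluster.
single = fair_shares(100, {"job_a": 1})
# Two jobs with equal weight split the cluster evenly.
equal = fair_shares(100, {"job_a": 1, "job_b": 1})
# A job with three times the weight gets three times the share.
weighted = fair_shares(100, {"job_a": 1, "job_b": 3})
```

With 100 slots, the weighted case gives `job_a` 25 slots and `job_b` 75 slots.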
The CapacityScheduler is designed to allow sharing a large cluster while giving each organization a minimum capacity guarantee. The central idea is that the available resources in the Hadoop Map-Reduce cluster are partitioned among multiple organizations who collectively fund the cluster based on computing needs. There is an added benefit that an organization can access any excess capacity not being used by others. This provides elasticity for the organizations in a cost-effective manner.
Below is a feature-wise comparison of the two schedulers:
Fair Scheduler: allocates resources to pools (by weights), with fair sharing within each pool.
Capacity Scheduler: allocates resources to pools, with FIFO scheduling within each pool.
The Capacity Scheduler is designed to allow sharing a large cluster while giving each organization capacity guarantees, with the possibility of using excess capacity not being used by others.
TL;DR: Is there any way to get SGE to round-robin between servers when scheduling jobs, instead of allocating all jobs to the same server whenever it can?
Details:
I have a large compute process that consists of many smaller jobs. I'm using SGE to distribute the work across multiple servers in a cluster.
The process requires a varying number of tasks at different points in time (technically, it is a DAG of jobs). Sometimes the number of parallel jobs is very large (~1 per CPU in the cluster), sometimes it is much smaller (~1 per server). The DAG is dynamic and not uniform, so it isn't easy to tell how many parallel jobs there are or will be at any given point.
The jobs use a lot of CPU but also do a non-trivial amount of I/O (especially at job startup and shutdown). They access a shared NFS server connected to all the compute servers. Each compute server has a narrower connection (10Gb/s), but the NFS server has several wide connections (40Gb/s) into the communication switch. I'm not sure what the bandwidth of the switch backbone is, but it is a monster, so it should be high.
For optimal performance, jobs should be scheduled across different servers when possible. That is, if I have 20 servers, each with 20 processors, submitting 20 jobs should run one job on each. Submitting 40 jobs should run 2 on each, etc. Submitting 400 jobs would saturate the whole cluster.
However, SGE is perversely intent on minimizing my I/O performance. Submitting 20 jobs would schedule all of them on a single server. So they all fight for a single measly 10Gb/s network connection while 19 other machines with a combined bandwidth of 190Gb/s sit idle.
I can force SGE to execute each job on a different server in several ways (using resources, using special queues, using my parallel environment and specifying '-t 1-', etc.). However, this means I will only be able to run one job per server, period. When the DAG opens up and spawns many jobs, the jobs will stall waiting for a completely free server while 19 out of the 20 processors of each machine will stay idle.
What I need is a way to tell SGE to assign each job to the next server that has an available slot, in round-robin order. A better way would be to assign the job to the least loaded server (maximal number of unused slots, or maximal fraction of unused slots, or minimal number of used slots, etc.). But a dead simple round-robin would do the trick.
This seems like a much more sensible strategy in general, compared to SGE's policy of running each job on the same server as the previous job, which is just about the worst possible strategy for my case.
I looked over SGE's configuration options but I couldn't find any way to modify the scheduling strategy. That said, SGE's documentation isn't exactly easy to navigate, so I could have easily missed something.
Does anyone know of any way to get SGE to change its scheduling strategy to round-robin or least-loaded or anything along these lines?
Thanks!
Simply change allocation_rule to $round_robin for the SGE parallel environment (sge_pe file):
allocation_rule
The allocation rule is interpreted by the scheduler thread
and helps the scheduler to decide how to distribute parallel
processes among the available machines. If, for instance, a
parallel environment is built for shared memory applications
only, all parallel processes have to be assigned to a single
machine, no matter how much suitable machines are available.
If, however, the parallel environment follows the
distributed memory paradigm, an even distribution of processes
among machines may be favorable.
The current version of the scheduler only understands the
following allocation rules:
<int>: An integer number fixing the number of processes
per host. If the number is 1, all processes have
to reside on different hosts. If the special
denominator $pe_slots is used, the full range of
processes as specified with the qsub(1) -pe switch
has to be allocated on a single host (no matter
which value belonging to the range is finally
chosen for the job to be allocated).
$fill_up: Starting from the best suitable host/queue, all
available slots are allocated. Further hosts and
queues are "filled up" as long as a job still
requires slots for parallel tasks.
$round_robin:
From all suitable hosts a single slot is allocated
until all tasks requested by the parallel job are
dispatched. If more tasks are requested than suitable
hosts are found, allocation starts again from
the first host. The allocation scheme walks
through suitable hosts in a best-suitable-first
order.
Source: http://gridscheduler.sourceforge.net/htmlman/htmlman5/sge_pe.html
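To see how the two allocation rules differ for the 20-server example from the question, here is a small Python simulation of the behavior the man page describes. This is only a sketch, not SGE code: `allocate` and the host names are hypothetical, and the hosts are assumed to be listed already in best-suitable-first order.

```python
def allocate(slots_free, tasks, rule):
    """Return {host: tasks assigned} for one parallel job needing `tasks` slots.

    slots_free maps host -> free slots, in best-suitable-first order.
    rule is "fill_up" or "round_robin", mimicking the man page descriptions.
    """
    if tasks > sum(slots_free.values()):
        raise ValueError("not enough free slots for this job")
    assigned = {host: 0 for host in slots_free}
    remaining = tasks
    if rule == "fill_up":
        # Take every free slot on a host before moving to the next host.
        for host, free in slots_free.items():
            take = min(free, remaining)
            assigned[host] = take
            remaining -= take
            if remaining == 0:
                break
    elif rule == "round_robin":
        # Take one slot per host per pass, starting again from the first host.
        while remaining > 0:
            for host, free in slots_free.items():
                if remaining == 0:
                    break
                if assigned[host] < free:
                    assigned[host] += 1
                    remaining -= 1
    else:
        raise ValueError("unknown rule: " + rule)
    return assigned


# 20 hosts with 20 free slots each, as in the question.
hosts = {f"node{i:02d}": 20 for i in range(20)}
fill = allocate(hosts, 20, "fill_up")       # all 20 tasks land on node00
rr = allocate(hosts, 20, "round_robin")     # one task per host
rr40 = allocate(hosts, 40, "round_robin")   # two tasks per host
```

Note that the allocation rule applies to the slots of a single parallel job submitted with `-pe`.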
While watching this video about the benefits of YARN, I heard that YARN improves cluster utilization because the scheduler optimizes it. The scheduler bases the optimization on certain criteria: i) capacity guarantees, ii) fairness, iii) SLAs. I am confused: what are SLAs, and how do they factor into scheduling optimization?
YARN's capacity scheduler is designed to allow sharing of a large cluster across many organizations. Cluster utilization is optimized taking into account the capacity guarantees, fairness, and SLAs of the organizations. The scheduler provides a stringent set of limits to ensure that a single application or user cannot consume a disproportionate amount of resources in the cluster.
An SLA (service-level agreement) is basically the deadline by which a particular organization's jobs should be completed.
I ran a hadoop job and when I look in some map tasks I see they are not running where the file's blocks are. E.g., the map task runs on slave1, but the file blocks (all of them) are in slave2. The files are all gzip.
Why is that happening, and how can I resolve it?
UPDATE: note there are many pending tasks, so this is not a case of a node being idle and therefore hosting tasks that read from other nodes.
Hadoop's default (FIFO) scheduler works like this: when a node has spare capacity, it contacts the master and asks for more work. The master tries to assign a data-local task, or a rack-local task, but if it can't, it will assign any task in the queue (of waiting tasks) to that node. However, while this node was being assigned this non-local task (we'll call it task X), it is possible that another node also had spare capacity and contacted the master asking for work. Even if this second node actually had a local copy of the data required by X, it will not be assigned that task, because the first node acquired the lock on the master slightly sooner. This results in poor data locality, but FAST task assignment.
In contrast, the Fair Scheduler uses a technique called delay scheduling that achieves higher locality by holding back non-local task assignment for a "little bit" (configurable). The price is a small delay for some tasks.
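The idea of delaying non-local assignment can be sketched in a few lines of Python. This is a simplified, single-job illustration, not the Fair Scheduler's actual code: `offer_slot`, the `state` dict, and `max_skips` are hypothetical names, and the delay is counted in declined slot offers rather than in time.

```python
def offer_slot(node, pending, state, max_skips):
    """Handle one free-slot offer from `node`.

    pending: list of (task_id, set of hosts holding the task's data)
    state:   dict with key "skips", the number of offers declined in a row
    Returns (task_id, is_local), or None if the job prefers to wait.
    """
    if not pending:
        return None
    # Prefer a task whose data lives on the offering node.
    for i, (task_id, hosts) in enumerate(pending):
        if node in hosts:
            del pending[i]
            state["skips"] = 0
            return task_id, True
    # No local task: give up on locality only after enough declined offers.
    if state["skips"] >= max_skips:
        task_id, _ = pending.pop(0)
        return task_id, False
    state["skips"] += 1
    return None


# Both tasks have their blocks on slave2 only.
pending = [("t1", {"slave2"}), ("t2", {"slave2"})]
state = {"skips": 0}
offers = ["slave1", "slave2", "slave1", "slave1", "slave1"]
results = [offer_slot(n, pending, state, max_skips=2) for n in offers]
```

Here the first offer from slave1 is declined, t1 then runs locally on slave2, and t2 is launched non-locally on slave1 only after two more declined offers.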
Other people are working on better schedulers, and this will likely be improved in the future. For now, you can choose to use the Fair Scheduler if you wish to achieve higher data locality.
I disagree with #donald-miner's conclusion that "With a default replication factor of 3, you don't see very many tasks that are not data local." He is correct in noting that more replicas will improve your locality percentage, but the percentage of data-local tasks may still be very low. I've also run experiments myself and saw very low data locality with the FIFO scheduler. You can achieve high locality if your job is large (has many tasks), but the more common, smaller jobs suffer from a problem called "head-of-line scheduling". Quoting from this paper:
The first locality problem occurs in small jobs (jobs that
have small input files and hence have a small number of data
blocks to read). The problem is that whenever a job reaches
the head of the sorted list [...] (i.e. has the fewest
running tasks), one of its tasks is launched on the next slot
that becomes free, no matter which node this slot is on. If
the head-of-line job is small, it is unlikely to have data on
the node that is given to it. For example, a job with data on
10% of nodes will only achieve 10% locality.
That paper goes on to cite numbers from a production cluster at Facebook, where they reported observing just 5% data locality in a large production environment.
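The "data on 10% of nodes gives 10% locality" figure in the quoted passage is just probability: if the next free slot can be on any node with equal chance, a task is local only when that node happens to hold its data. A quick Python simulation under that assumption (the function name, cluster size, and seed are arbitrary choices for illustration):

```python
import random


def simulated_locality(num_nodes, nodes_with_data, launches, seed=0):
    """Fraction of launches that land on a node holding the job's data,
    assuming each free slot appears on a uniformly random node."""
    rng = random.Random(seed)
    data_nodes = set(range(nodes_with_data))
    local = sum(1 for _ in range(launches)
                if rng.randrange(num_nodes) in data_nodes)
    return local / launches


# 100-node cluster, job input stored on 10 of the nodes.
locality = simulated_locality(num_nodes=100, nodes_with_data=10, launches=20000)
```

The result comes out close to 0.10, matching the estimate in the quote.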
Final note: Should you care if you have low data locality? Not too much. The running time of your jobs may be dominated by the stragglers (tasks that take longer to complete) and the shuffle phase, so improving data locality would only give a very modest improvement in running time (if any at all).
Unfortunately, the default scheduler isn't that smart. I'm not sure exactly what's going on, but I think it's using some sort of greedy-style scheduling where it tries to schedule what it can now for the next task, and then moves on. There could definitely be improvements made to the Hadoop scheduler, and there have been a few academic attempts at making Hadoop scheduling more optimal.
This research paper shows that the default hadoop scheduler is not optimal. In the results, they show that increasing the replication factor to three improves data locality significantly, with diminishing returns after that.
So, why hasn't the default scheduler been improved? Here is my opinion/theory: With a default replication factor of 3, you don't see very many tasks that are not data local. By having more replicas, you give the scheduler more flexibility to fit tasks in the right spots. Basically, it's a coincidence that you have 3 replicas, and the default scheduler takes advantage of that by being implemented in a lazy manner. Since you typically have 3 replicas for redundancy's sake already... there isn't much motivation to help scheduler performance for people with a replication of 1.
If you have the space, I suggest just upping the replication factor to two or three. There really isn't much downside.
Can we use both the Fair Scheduler and the Capacity Scheduler in the same Hadoop cluster? Which scheduler is better and more effective? Can anyone help me?
I do not think both can be used at the same time. It doesn't make sense either. Why would you want to use both types of scheduling in the same cluster? Both scheduling algorithms arose from specific use cases.
Fair scheduling is a method of assigning resources to jobs such that
all jobs get, on average, an equal share of resources over time. When
there is a single job running, that job uses the entire cluster. When
other jobs are submitted, task slots that free up are assigned to the
new jobs, so that each job gets roughly the same amount of CPU time.
Unlike the default Hadoop scheduler, which forms a queue of jobs, this
lets short jobs finish in reasonable time while not starving long
jobs. It is also a reasonable way to share a cluster between a number
of users. Finally, fair sharing can also work with job priorities -
the priorities are used as weights to determine the fraction of total
compute time that each job should get.
The Fair Scheduler arose out of Facebook’s need to share its data warehouse between multiple users. Facebook started using Hadoop to manage the large amounts of content and log data it accumulated every day. Initially, there were only a few jobs that needed to run on the data each day to build reports. However, as other groups within Facebook started to use Hadoop, the number of production jobs increased. In addition, analysts started using the data warehouse for ad-hoc queries through Hive (Facebook’s SQL-like query language for Hadoop), and more large batch jobs were submitted as developers experimented with the data set. Facebook’s data team considered building a separate cluster for the production jobs, but saw that this would be extremely expensive, as data would have to be replicated and the utilization on both clusters would be low. Instead, Facebook built the Fair Scheduler, which allocates resources evenly between multiple jobs and also supports capacity guarantees for production jobs. The Fair Scheduler is based on three concepts:
Jobs are placed into named “pools” based on a configurable attribute
such as user name, Unix group, or specifically tagging a job as being
in a particular pool through its jobconf.
Each pool can have a “guaranteed capacity” that is specified through
a config file, which gives a minimum number of map slots and reduce
slots to allocate to the pool. When there are pending jobs in the
pool, it gets at least this many slots, but if it has no jobs, the
slots can be used by other pools.
Excess capacity that is not going toward a pool’s minimum is
allocated between jobs using fair sharing. Fair sharing ensures that
over time, each job receives roughly the same amount of resources.
This means that shorter jobs will finish quickly, while longer jobs
are guaranteed not to get starved.
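The three concepts above can be sketched in Python. This is a deliberately simplified reading of the description, not the Fair Scheduler's real algorithm: `allocate_pools` and the pool names are hypothetical, each active pool first receives its configured minimum, and the excess is then split equally per job. It assumes the minimums of the active pools fit within the cluster.

```python
def allocate_pools(total_slots, pools):
    """pools maps pool name -> {"min": guaranteed slots, "jobs": pending jobs}.

    Returns pool name -> slots. Pools without jobs get nothing, so their
    guaranteed slots become part of the excess shared by the other pools.
    """
    alloc = {name: 0 for name in pools}
    active = {name: p for name, p in pools.items() if p["jobs"] > 0}
    if not active:
        return alloc
    # Step 1: every pool with pending jobs gets its guaranteed minimum.
    for name, p in active.items():
        alloc[name] = p["min"]
    # Step 2: the excess is shared fairly, i.e. equally per job.
    excess = total_slots - sum(alloc.values())
    total_jobs = sum(p["jobs"] for p in active.values())
    for name, p in active.items():
        alloc[name] += excess * p["jobs"] / total_jobs
    return alloc


pools = {
    "production": {"min": 40, "jobs": 1},
    "adhoc": {"min": 0, "jobs": 3},
    "idle_team": {"min": 20, "jobs": 0},  # no jobs, so its minimum is released
}
alloc = allocate_pools(100, pools)
```

In this example the production pool ends up with 55 slots (40 guaranteed plus a 15-slot fair share for its one job) and the three ad-hoc jobs share the remaining 45.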
The scheduler also includes a number of features for ease of administration, including the ability to reload the config file at runtime to change pool settings without restarting the cluster, limits on running jobs per user and per pool, and use of priorities to weigh the shares of different jobs.
The CapacityScheduler is designed to allow sharing a large cluster
while giving each organization a minimum capacity guarantee. The
central idea is that the available resources in the Hadoop Map-Reduce
cluster are partitioned among multiple organizations who collectively
fund the cluster based on computing needs. There is an added benefit
that an organization can access any excess capacity not being used by
others. This provides elasticity for the organizations in a
cost-effective manner.
The Capacity Scheduler from Yahoo offers similar functionality to the Fair Scheduler but follows a somewhat different philosophy. In the Capacity Scheduler, you define a number of named queues. Each queue has a configurable number of map and reduce slots. The scheduler gives each queue its capacity when it contains jobs, and shares any unused capacity between the queues. However, within each queue, FIFO scheduling with priorities is used, except for one aspect: you can place a limit on the percentage of running tasks per user, so that users share a cluster equally. In other words, the Capacity Scheduler tries to simulate a separate FIFO/priority cluster for each user and each organization, rather than performing fair sharing between all jobs. The Capacity Scheduler also supports configuring a wait time on each queue after which it is allowed to preempt other queues' tasks if it is below its fair share.
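For contrast with fair sharing, here is a rough Python sketch of the queue behavior just described: each queue serves its jobs in FIFO order up to its capacity, and slots that a queue does not need are lent to jobs that are still waiting. It is an illustration only, with hypothetical names, and it leaves out priorities, per-user limits, and preemption.

```python
def capacity_schedule(queues):
    """queues maps name -> {"capacity": slots, "jobs": [(job_id, slots_wanted)]}
    with jobs listed in submission (FIFO) order.

    Returns job_id -> slots granted.
    """
    granted = {}
    spare = 0
    backlog = []  # (job_id, unmet demand), kept in queue and FIFO order
    for q in queues.values():
        free = q["capacity"]
        # FIFO inside the queue: earlier jobs are satisfied first.
        for job_id, wanted in q["jobs"]:
            take = min(free, wanted)
            granted[job_id] = take
            free -= take
            if wanted > take:
                backlog.append((job_id, wanted - take))
        spare += free  # capacity this queue is not using right now
    # Unused capacity is lent to jobs that are still short of slots.
    for job_id, unmet in backlog:
        take = min(spare, unmet)
        granted[job_id] += take
        spare -= take
    return granted


queues = {
    "research": {"capacity": 60, "jobs": [("r1", 50), ("r2", 50)]},
    "ops": {"capacity": 40, "jobs": [("o1", 10)]},
}
granted = capacity_schedule(queues)
```

Job r1 is fully served before r2 gets anything, and r2 then picks up the 30 slots that the ops queue is not using.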
So the choice of scheduler boils down to your needs and your setup.
Apache Hadoop now supports both of these types of scheduling. More detailed info can be found at the following links:
Capacity Scheduler
Fair Scheduler
I have an intuition that increasing/decreasing the
number of nodes interactively on a running job can speed up map-heavy
jobs, but won't help with reduce-heavy jobs, where most of the work is done
by the reducers.
There's an FAQ about this, but it doesn't really explain it very well:
http://aws.amazon.com/elasticmapreduce/faqs/#cluster-18
This question was answered by Christopher Smith, who gave me permission to post here.
As always... "it depends". One thing you can pretty much always count
on: adding nodes later on is not going to help you as much as having
the nodes from the get go.
When you create a Hadoop job, it gets split up in to tasks. These
tasks are effectively "atoms of work". Hadoop lets you tweak the # of
mapper and # of reducer tasks during job creation, but once the job is
created, it is static. Tasks are assigned to "slots". Traditionally,
each node is configured to have a certain number of slots for map
tasks, and a certain number of slots for reduce tasks, but you can
tweak that. Some newer versions of Hadoop don't require you to
designate the slots as being for map or reduce tasks. Anyway, the
JobTracker periodically assigns tasks to slots. Because this is done
dynamically, new nodes coming online can speed up the processing of a
job by providing more slots to execute the tasks.
This sets the stage for understanding the reality of adding new nodes.
There's obviously an Amdahl's law issue where having more slots than
pending tasks accomplishes little (if you have speculative execution
enabled, it does help somewhat, as Hadoop will schedule the same task
to run on many different nodes, so that a slow node's tasks can be
completed by faster nodes if there are spare resources). So, if you
didn't define your job with many map or reduce tasks, adding more
nodes isn't going to help much. Of course, each task imposes some
overhead, so you don't want to go crazy high either. That's why I
suggest a guideline for task size should be "something which takes
~2-5 minutes to execute".
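The point about having more slots than pending tasks can be put into numbers with a back-of-the-envelope model. This is only a rough sketch with a hypothetical helper: it assumes all tasks take the same time and run in whole "waves", and it ignores speculative execution, data locality, and scheduling overhead.

```python
import math


def phase_minutes(num_tasks, slots, minutes_per_task):
    """Estimated duration of a map (or reduce) phase: tasks run in waves
    of at most `slots` tasks, and each wave takes one task duration."""
    waves = math.ceil(num_tasks / slots)
    return waves * minutes_per_task


# 200 tasks of about 3 minutes each.
with_100_slots = phase_minutes(200, 100, 3)  # 2 waves
with_200_slots = phase_minutes(200, 200, 3)  # 1 wave
with_400_slots = phase_minutes(200, 400, 3)  # still 1 wave, half the slots idle
```

Going from 100 to 200 slots halves the estimate from 6 to 3 minutes, but going from 200 to 400 slots changes nothing, because there are no more tasks to fill the extra slots.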
Of course, when you add nodes dynamically, they have one other
disadvantage: they don't have any data local. Obviously, if you are at
the start of an EMR pipeline, none of the nodes have data in them, so
it doesn't matter, but if you have an EMR pipeline made of many jobs,
with earlier jobs persisting their results to HDFS, you get a huge
performance boost because the JobTracker will favour shaping and
assigning tasks so nodes have that lovely locality of data (this is a
core trick of the whole MapReduce design to maximize performance). On
the reducer side, data is coming from other map tasks, so dynamically
added nodes are really at no disadvantage as compared to other nodes.
So, in principle, dynamically adding new nodes is actually less likely
to help with IO bound map tasks that are reading from HDFS.
Except...
Hadoop has a variety of cheats under the covers to optimize
performance. One is that it starts transmitting map output data to
the reducers before the map task completes/the reducer starts. This
obviously is a critical optimization for jobs where the mappers
generate a lot of data. You can tweak when Hadoop starts to kick off
the transfers. Anyway, this means that a newly spun up node might be
at a disadvantage, because the existing nodes might already have such
a huge data advantage. Obviously, the more output that the mappers
have transmitted, the larger the disadvantage.
That's how it all really works. In practice though, a lot of Hadoop
jobs have mappers processing tons of data in a CPU intensive fashion,
but outputting comparatively little data to the reducers (or they
might send a lot of data to the reducers, but the reducers are still
very simple, so not CPU bound at all). Often jobs will have few
(sometimes even 0) reducer tasks, so even though extra nodes could help, if
you already have a reduce slot available for every outstanding reduce
task, new nodes can't help. New nodes also disproportionately help out
with CPU bound work, for obvious reasons, so because that tends to
be map tasks more than reduce tasks, that's where people typically see
the win. If your mappers are I/O bound and pulling data from the
network, adding new nodes obviously increases the aggregate bandwidth
of the cluster, so it helps there, but if your map tasks are I/O bound
reading HDFS, the best thing is to have more initial nodes, with data
already spread over HDFS. It's not unusual to see reducers get I/O
bound because of poorly structured jobs, in which case adding more
nodes can help a lot, because it splits up the bandwidth again.
There's a caveat there too of course: with a really small cluster,
reducers get to read a lot of their data from the mappers running on
the local node, and adding more nodes shifts more of the data to being
pulled over the much slower network. You can also have cases where
reducers spend most of their time just multiplexing data processing
from all the mappers sending them data (although that is tunable as
well).
If you are asking questions like this, I'd highly recommend profiling
your job using something like Amazon's offering of KarmaSphere. It
will give you a better picture of where your bottlenecks are and what
are your best strategies for improving performance.