Consider there are 3 top-level queues, q1, q2, and q3, in the Capacity Scheduler.
Users of q1 and q2 submit their jobs to their respective queues, and they are guaranteed to get their allocated resources. Now the resources that are not utilized by q3 have to be utilized by q1 and q2. What factors does YARN consider while dividing the extra resources? Which queue (q1 or q2) gets preference?
Every queue in the Capacity Scheduler has 2 important properties (which are defined in terms of percentage of total resources available), which determine the scheduling:
Guaranteed capacity of the queue (determined by configuration "yarn.scheduler.capacity.{queue-path}.capacity")
Maximum capacity to which the queue can grow (determined by the configuration "yarn.scheduler.capacity.{queue-path}.maximum-capacity"). This puts an upper limit on the resource utilization by a queue. The queue cannot grow beyond this limit.
The Capacity Scheduler organizes the queues in a hierarchical fashion.
Queues are of 2 types: "parent" and "leaf" queues. Jobs can only be submitted to the leaf queues.
"ROOT" queue is the parent of all the other queues.
Each parent queue sorts its child queues based on demand (what is the current used capacity of the queue? Is it under-served or over-served?).
For each queue, the ratio (Used Capacity / Total Cluster Capacity) gives an indication of the utilization of the queue. The parent queue always gives priority to the most under-served child queue.
When the free resources are given to a parent queue, the resources are recursively distributed to the child queues, depending on the current used capacity of the queue.
Within a leaf queue, the distribution of capacity can happen based on certain user limits (e.g. the configuration parameter yarn.scheduler.capacity.{queue-path}.minimum-user-limit-percent determines the minimum queue capacity that each user is guaranteed to have).
In your example, for the sake of simplicity, let's assume that the queues q1, q2 and q3 are directly present under "ROOT". As mentioned earlier, the parent queue keeps the queues sorted based on their utilization.
Since q3 is not utilized at all, the parent can distribute the unutilized resources of q3 between q1 and q2.
The available resources are distributed based on the following factors:
If both q1 and q2 have enough resources to continue scheduling their jobs, then there is no need to distribute the available resources from q3.
If both q1 and q2 have hit their maximum capacity (the "yarn.scheduler.capacity.{queue-path}.maximum-capacity" configuration limits the elasticity of the queues; a queue cannot demand more than the percentage configured by this parameter), then the free resources are not allotted.
If only one of the queues q1 or q2 is under-served, then the free resources are allotted to the under-served queue.
If both q1 and q2 are under-served, then the most under-served queue is given the top priority.
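The rules above can be sketched as a small simulation. This is only an illustrative model, not YARN's actual implementation: the Queue class, its fields, and the distribute function are hypothetical names, and "most under-served" is assumed here to mean the lowest ratio of used capacity to guaranteed capacity. Free resources are handed out one unit at a time to the most under-served queue that still has pending demand and has not reached its maximum capacity.

```python
from dataclasses import dataclass


@dataclass
class Queue:
    # All values are percentages of the cluster (hypothetical model, not a YARN class).
    name: str
    guaranteed: int  # models ...{queue-path}.capacity
    maximum: int     # models ...{queue-path}.maximum-capacity
    used: int        # capacity currently in use
    pending: int     # additional capacity the queue is still asking for


def distribute(queues, free):
    """Hand out `free` units, one at a time, to the most under-served eligible queue."""
    while free > 0:
        # A queue is eligible only if it wants more and is below its maximum capacity.
        eligible = [q for q in queues if q.pending > 0 and q.used < q.maximum]
        if not eligible:
            break  # nobody needs (or may take) more: the free resources stay unallotted
        # Assumption: "most under-served" = lowest used/guaranteed ratio.
        target = min(eligible, key=lambda q: q.used / q.guaranteed)
        target.used += 1
        target.pending -= 1
        free -= 1
    return {q.name: q.used for q in queues}


queues = [
    Queue("q1", guaranteed=30, maximum=60, used=30, pending=40),
    Queue("q2", guaranteed=30, maximum=45, used=30, pending=40),
    Queue("q3", guaranteed=40, maximum=100, used=0, pending=0),
]
# q3 is idle, so its 40% is free to be shared between q1 and q2.
print(distribute(queues, free=40))  # {'q1': 55, 'q2': 45, 'q3': 0}
```

In this run q1 and q2 alternate until q2 reaches its maximum capacity of 45, after which the remaining free capacity goes to q1.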
We have a 3-node Kafka cluster with around 32 topics and 400+ partitions
spread across these servers. We have the load evenly distributed amongst
these partitions; however, we are observing that 2 broker servers are running
at >60% CPU whereas the third one is running at just about 10%. How do we
ensure that all servers are running smoothly? Do I need to reassign the
partitions (kafka-reassign-partitions cmd)?
PS: The partitions are evenly distributed across all the broker servers.
In some cases, this is a result of the way that individual consumer groups determine which partition to use within the __consumer_offsets topic.
On a high level, each consumer group updates only one partition within this topic. This often results in a __consumer_offsets topic with a highly uneven distribution of message rates.
It may be the case that:
You have a couple of very heavy consumer groups, meaning they need to update the __consumer_offsets topic frequently. One of these groups uses a partition that has the 2nd broker as its leader. The other uses a partition that has the 3rd broker as its leader.
This would result in a significant amount of the CPU being utilized for updating this topic, and would only occur on the 2nd and 3rd brokers (as seen in your screenshot).
A detailed blog post is found here
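To check which __consumer_offsets partition a given group lands on, the mapping can be reproduced outside Kafka. The sketch below assumes, to the best of my understanding, that the broker takes the Java String.hashCode() of the group id, makes it non-negative, and applies modulo the partition count of the topic (offsets.topic.num.partitions, default 50). The function names are made up for this example; verify the result against your own cluster before relying on it.

```python
def java_string_hash(s: str) -> int:
    """Reproduce Java's String.hashCode() (32-bit signed, over UTF-16 code units)."""
    data = s.encode("utf-16-be")
    h = 0
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "big")
        h = (31 * h + unit) & 0xFFFFFFFF  # keep 32 bits, like Java int overflow
    # Convert the unsigned 32-bit value to a signed Java int.
    return h - 2**32 if h >= 2**31 else h


def offsets_partition(group_id: str, num_partitions: int = 50) -> int:
    """Assumed mapping: abs(hash(group id)) % partition count of __consumer_offsets."""
    h = java_string_hash(group_id)
    non_negative = 0 if h == -2**31 else abs(h)  # Integer.MIN_VALUE has no positive abs
    return non_negative % num_partitions


print(offsets_partition("abc"))  # 4
```

Once you know the partition, the leader of that partition (as shown by the topic description tooling) is the broker doing the offset-commit work for that group.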
I am reading up on Apache Storm to evaluate if it is suited for our real time processing needs.
One thing that I couldn't figure out until now is: where does it store the tuples while the next node is not available to process them? For example, let's say spout A is producing at a rate of 1000 tuples per second, but the next level of bolts (which process spout A's output) can only collectively consume at a rate of 500 tuples per second. What happens to the other tuples? Does it have a disk-based buffer (or something else) to account for this?
Storm uses internal in-memory message queues. Thus, if a bolt cannot keep up with processing, the messages are buffered there.
Before Storm 1.0.0, those queues could grow without bound (i.e., you get an out-of-memory exception and your worker dies). To protect against data loss, you need to make sure that the spout can re-read the data (see https://storm.apache.org/releases/1.0.0/Guaranteeing-message-processing.html)
You could use the "topology.max.spout.pending" parameter to limit the number of tuples in flight and so tackle this problem.
As of Storm 1.0.0, backpressure is supported (see https://storm.apache.org/2016/04/12/storm100-released.html). This allows a bolt to notify its upstream producers to "slow down" if a queue grows too large (and to speed up again once the queue empties). In your spout-bolt example, the spout would slow down its emission of messages in this case.
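To make the numbers from the question concrete, here is a toy simulation of the buffer between the spout and the bolts. It is not how Storm is implemented internally; the simulate function and its parameters are hypothetical, and max_pending only loosely models the max spout pending setting mentioned above as a cap on tuples in flight.

```python
def simulate(spout_rate, bolt_rate, seconds, max_pending=None):
    """Return (backlog at the end, peak backlog) for a one-second-tick toy model."""
    pending = 0  # tuples emitted by the spout but not yet processed by the bolts
    peak = 0
    for _ in range(seconds):
        emitted = spout_rate
        if max_pending is not None:
            # The spout may only emit while the number of in-flight tuples is below the cap.
            emitted = min(spout_rate, max_pending - pending)
        pending += emitted
        peak = max(peak, pending)
        pending -= min(pending, bolt_rate)  # the bolts drain what they can this second
    return pending, peak


# Spout at 1000 tuples/s, bolts at 500 tuples/s, as in the question.
print(simulate(1000, 500, seconds=10))                    # (5000, 5500)
print(simulate(1000, 500, seconds=10, max_pending=2000))  # (1500, 2000)
```

Without a cap the in-memory backlog grows by 500 tuples every second, which is what eventually leads to the out-of-memory failure. With a cap the spout is throttled to the rate the bolts can sustain and the backlog stays bounded.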
Typically, Storm spouts read off of some persistent store and track the completion of tuples to determine when it's safe to remove or ack a message in that store. Vanilla Storm itself does not persist tuples. Tuples are replayed from the source in the event of a failure.
I have to agree with others that you should check out Heron. Stream processing frameworks have advanced significantly since the inception of Storm.
How does Apache Storm divide the tasks amongst its workers? I read that Storm does it by itself, and that it's a function of parallelism, but what I don't know is how to figure out which node does what and how many nodes would do which task, basically so that I can calculate the optimal number of nodes required.
Assuming that the hardware configuration of all nodes is not the same.
By default, Storm uses "round robin" scheduling, i.e., it loops over all supervisors with available slots and assigns the parallel instances of spouts/bolts. If no more free slots are available, single workers are assigned multiple spout/bolt instances.
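The round-robin idea can be illustrated with a few lines of code. This is a simplified model of the behaviour described above, not Storm's scheduler code; the function name, the executor labels, and the slot labels are invented for the example.

```python
def assign_round_robin(executors, slots):
    """Assign each executor to the next slot in turn, wrapping around when slots run out."""
    assignment = {slot: [] for slot in slots}
    for i, executor in enumerate(executors):
        # Once every slot has one executor, slots start receiving additional ones.
        assignment[slots[i % len(slots)]].append(executor)
    return assignment


executors = ["spout-1", "bolt-1", "bolt-2", "bolt-3", "bolt-4"]
slots = ["supervisor1:6700", "supervisor2:6700", "supervisor3:6700"]
for slot, assigned in assign_round_robin(executors, slots).items():
    print(slot, assigned)
```

Note that this model treats every slot as equal. With nodes of differing hardware, as in the question, an even spread of executors does not imply an even spread of load.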
You need to have a look at the Storm UI. The metrics complete latency, capacity, execute latency, process latency, and failed tuples will give you "hints" on how many executors and tasks you should allocate for each bolt.
The Capacity Scheduler allows sharing of Hadoop cluster along organizational lines, whereby each organization is allocated a certain capacity of the overall cluster.
I want to know: if a large amount of data arrives, will the capacity allocated to a certain queue change automatically?
In the Capacity Scheduler config we define yarn.scheduler.capacity.root.<queue name>.capacity and yarn.scheduler.capacity.root.<queue name>.maximum-capacity.
yarn.scheduler.capacity.root.<queue name>.capacity is the capacity of the queue, while yarn.scheduler.capacity.root.<queue name>.maximum-capacity is the maximum amount of resources that all jobs/users in that queue can take.
If a large amount of data arrives, will the capacity allocated to a certain queue change automatically?
No, the queue size is fixed and doesn't change automatically according to input data volume.
You can manually change it in capacity-scheduler.xml and then refresh the queues with yarn rmadmin -refreshQueues.
You can write a script that updates (and refreshes) the queue capacities according to input data volume, but I don't think that is recommended.
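For completeness, such a script could look roughly like the sketch below. It only rewrites the capacity value in the text of a capacity-scheduler.xml file and does not run anything against a cluster. The function name and the example queue names are hypothetical, and the sketch assumes the usual Hadoop property layout of name/value pairs. Keep in mind that, as far as I know, the capacities of sibling queues must still add up to 100, so in practice more than one queue has to be changed at a time.

```python
import xml.etree.ElementTree as ET


def set_queue_capacity(config_xml: str, queue_path: str, capacity: int) -> str:
    """Return a copy of the config text with <queue_path>.capacity set to `capacity`."""
    root = ET.fromstring(config_xml)
    wanted = f"yarn.scheduler.capacity.{queue_path}.capacity"
    for prop in root.findall("property"):
        if prop.findtext("name") == wanted:
            prop.find("value").text = str(capacity)
            break
    else:
        raise KeyError(f"property {wanted} not found")
    return ET.tostring(root, encoding="unicode")


# After writing the new file, the queues would be refreshed with this command
# (for example via subprocess.run); it is listed here but deliberately not executed.
REFRESH_COMMAND = ["yarn", "rmadmin", "-refreshQueues"]

example = """<configuration>
  <property><name>yarn.scheduler.capacity.root.etl.capacity</name><value>30</value></property>
  <property><name>yarn.scheduler.capacity.root.adhoc.capacity</name><value>70</value></property>
</configuration>"""

updated = set_queue_capacity(example, "root.etl", 40)
updated = set_queue_capacity(updated, "root.adhoc", 60)
print(updated)
```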
Can we use both the Fair Scheduler and the Capacity Scheduler in the same Hadoop cluster? Which scheduler is good and effective? Can anyone help me?
I do not think both can be used at the same time. It doesn't make sense either. Why would you want to use both types of scheduling in the same cluster? Both scheduling algorithms came about due to specific use cases.
Fair scheduling is a method of assigning resources to jobs such that
all jobs get, on average, an equal share of resources over time. When
there is a single job running, that job uses the entire cluster. When
other jobs are submitted, task slots that free up are assigned to the
new jobs, so that each job gets roughly the same amount of CPU time.
Unlike the default Hadoop scheduler, which forms a queue of jobs, this
lets short jobs finish in reasonable time while not starving long
jobs. It is also a reasonable way to share a cluster between a number
of users. Finally, fair sharing can also work with job priorities -
the priorities are used as weights to determine the fraction of total
compute time that each job should get.
The Fair Scheduler arose out of Facebook’s need to share its data warehouse between multiple users. Facebook started using Hadoop to manage the large amounts of content and log data it accumulated every day. Initially, there were only a few jobs that needed to run on the data each day to build reports. However, as other groups within Facebook started to use Hadoop, the number of production jobs increased. In addition, analysts started using the data warehouse for ad-hoc queries through Hive (Facebook’s SQL-like query language for Hadoop), and more large batch jobs were submitted as developers experimented with the data set. Facebook’s data team considered building a separate cluster for the production jobs, but saw that this would be extremely expensive, as data would have to be replicated and the utilization on both clusters would be low. Instead, Facebook built the Fair Scheduler, which allocates resources evenly between multiple jobs and also supports capacity guarantees for production jobs. The Fair Scheduler is based on three concepts:
Jobs are placed into named “pools” based on a configurable attribute
such as user name, Unix group, or specifically tagging a job as being
in a particular pool through its jobconf.
Each pool can have a “guaranteed capacity” that is specified through
a config file, which gives a minimum number of map slots and reduce
slots to allocate to the pool. When there are pending jobs in the
pool, it gets at least this many slots, but if it has no jobs, the
slots can be used by other pools.
Excess capacity that is not going toward a pool’s minimum is
allocated between jobs using fair sharing. Fair sharing ensures that
over time, each job receives roughly the same amount of resources.
This means that shorter jobs will finish quickly, while longer jobs
are guaranteed not to get starved.
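The three concepts can be sketched in a few lines. This is a deliberately simplified model rather than the Fair Scheduler's real algorithm: the function name and pool names are invented, slots are treated as a single resource instead of separate map and reduce slots, the excess is split equally per job, and the minimums are assumed to fit within the cluster.

```python
def fair_allocation(total_slots, pools):
    """pools maps pool name -> (guaranteed minimum slots, number of pending jobs)."""
    # A pool receives its guaranteed minimum only while it has pending jobs.
    guaranteed = {
        name: (min_slots if jobs > 0 else 0)
        for name, (min_slots, jobs) in pools.items()
    }
    # Whatever is not needed for minimums is shared equally between all jobs.
    excess = total_slots - sum(guaranteed.values())
    total_jobs = sum(jobs for _, jobs in pools.values())
    per_job = excess / total_jobs if total_jobs else 0.0
    return {
        name: guaranteed[name] + per_job * jobs
        for name, (_, jobs) in pools.items()
    }


pools = {
    "production": (40, 1),  # guaranteed 40 slots, 1 pending job
    "adhoc": (0, 3),        # no guarantee, 3 pending jobs
    "idle": (20, 0),        # guarantee unused because the pool has no jobs
}
print(fair_allocation(100, pools))  # {'production': 55.0, 'adhoc': 45.0, 'idle': 0.0}
```

The idle pool's 20 guaranteed slots flow back into the excess, which is then divided across the four running jobs.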
The scheduler also includes a number of features for ease of administration, including the ability to reload the config file at runtime to change pool settings without restarting the cluster, limits on running jobs per user and per pool, and use of priorities to weigh the shares of different jobs.
The CapacityScheduler is designed to allow sharing a large cluster
while giving each organization a minimum capacity guarantee. The
central idea is that the available resources in the Hadoop Map-Reduce
cluster are partitioned among multiple organizations who collectively
fund the cluster based on computing needs. There is an added benefit
that an organization can access any excess capacity not being used by
others. This provides elasticity for the organizations in a
cost-effective manner.
The Capacity Scheduler from Yahoo offers similar functionality to the Fair Scheduler but takes a somewhat different philosophy. In the Capacity Scheduler, you define a number of named queues. Each queue has a configurable number of map and reduce slots. The scheduler gives each queue its capacity when it contains jobs, and shares any unused capacity between the queues. However, within each queue, FIFO scheduling with priorities is used, except for one aspect – you can place a limit on percent of running tasks per user, so that users share a cluster equally. In other words, the capacity scheduler tries to simulate a separate FIFO/priority cluster for each user and each organization, rather than performing fair sharing between all jobs. The Capacity Scheduler also supports configuring a wait time on each queue after which it is allowed to preempt other queues’ tasks if it is below its fair share.
Hence, which scheduler you should go with boils down to your needs and setup.
Apache Hadoop now supports both of these types of scheduling. More detailed info can be found at the following links:
Capacity Scheduler
Fair Scheduler