
Choosing a YARN Scheduler

The Apache documentation on the YARN schedulers is good, but it covers how to configure them, not how to choose one or the other. Here’s the background on why the schedulers are designed the way they are and how to choose the right one.

Way Back When: FIFO

In the early days, Hadoop was primarily used for batch MapReduce (MR) and for frameworks such as Pig and Hive that used MR as the compute engine. It used a simple FIFO-based strategy for allocating resources: incoming jobs were taken from a FIFO queue and given as much of the available resources as they asked for, until all cluster capacity was spoken for. Subsequent jobs had to wait in line until running tasks completed, freeing up more resources.

One of several drawbacks of this strategy is that under FIFO scheduling the average elapsed time for a job can be highly variable even when the cluster is fully utilized.

Assume, for example, that you have ten small jobs, each of which requires 10% of the cluster, so they could all run simultaneously and finish in ten minutes. You also have one big job that uses the entire cluster and runs for an hour. If the ten small jobs start first, they will all be done in 10 minutes, and the big job will complete 60 minutes after that, for a total of 70 minutes of cluster time. If they start in the other order, the big job will complete in 60 minutes, and the small jobs will complete after 10 more minutes, again for a total of 70 minutes. But despite 100% utilization and identical throughput in both runs, the average job finished in about 16 minutes in the first run and in about 69 minutes in the second.
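To make the arithmetic concrete, here is a minimal sketch in plain Python (nothing YARN-specific, just the numbers from the example above) that computes the total and average completion times for the two orderings:

    # Ten 10-minute jobs that each need 10% of the cluster, plus one
    # 60-minute job that needs all of it, run in the two possible orders.
    big_job = 60

    # Small jobs first: they all finish at minute 10, the big job at minute 70.
    finish_small_first = [10] * 10 + [10 + big_job]
    # Big job first: it finishes at minute 60, the small jobs at minute 70.
    finish_big_first = [big_job] + [big_job + 10] * 10

    for name, finishes in [("small first", finish_small_first),
                           ("big first", finish_big_first)]:
        print(name, "total:", max(finishes),
              "average:", round(sum(finishes) / len(finishes), 1))
    # small first -> total 70, average 15.5 (minutes)
    # big first   -> total 70, average 69.1 (minutes)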

The Unfair Scheduler

Modern YARN (Hadoop 2.x) offers two schedulers called the FairScheduler and the CapacityScheduler. IMO, the names are a problem.

The trouble is, the name FairScheduler sounds so dang good. People hear that name and all they can think is, “Heck, yeah! No way am I going to use some crappy unfair scheduler.” FairScheduler sounds like it’s on your side, while CapacityScheduler sounds like it works for the man. Hortonworks users, for instance, are often at pains to replace the standard, supported CapacityScheduler with FairScheduler, despite having zero evidence that one is better than the other, or even understanding how they differ. It’s all about that lovely name.

You really should get past that. The FairScheduler is no more ethical or nicer than the CapacityScheduler, and “fair” probably doesn’t even mean what you think it means.

Let’s take a look at how scheduling works in Hadoop to see what the real differences between the schedulers are as they are used on real clusters.

FairScheduler

The FairScheduler was devised with the goal of allocating resources more sensibly.  Like the FIFO scheduler, the FairScheduler reckoned resources solely in terms of their memory requirements, which for several reasons is the easiest and most foolproof way to determine how many processing containers can effectively run on a compute node. (More about this later.)

As with FIFO scheduling, if there is a single job on the FairScheduler’s queue, it gets all the resources it asks for.  If there are more jobs ready to run, just as with the FIFO scheduler, each will get whatever it wants, so long as free resources remain. However, if the available resources are less than the aggregate requirements of the running jobs and the jobs that are ready to run, the FairScheduler will go ahead and start the new jobs anyway. Then, as time goes by, it will assign freed-up resources to them according to which job is the least well-served at the moment.

The scheduler accomplishes this by always allocating the available capacity to the job that is at the head of a list of jobs that is sorted in ascending order of how well served they are.  This strategy dynamically balances the resources among the current set of jobs.  A few big jobs might utilize the entire cluster at some given moment, but as additional jobs come along the new jobs will preferentially get any freed-up resources until all the jobs are roughly on par.
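Here is a minimal sketch of that selection rule, assuming for illustration that how well served a job is can be measured as the fraction of its fair share it currently holds (the job names and numbers are made up; this is not the actual YARN code):

    # When a container frees up, hand it to whichever job is furthest below
    # its fair share, i.e., the head of the list sorted by served-ness.
    jobs = {"X": {"fair_share": 50, "allocated": 50},
            "Y": {"fair_share": 50, "allocated": 20}}

    def least_served(jobs):
        return min(jobs, key=lambda j: jobs[j]["allocated"] / jobs[j]["fair_share"])

    def on_container_freed(jobs, container_mem=1):
        winner = least_served(jobs)            # the least well-served job wins
        jobs[winner]["allocated"] += container_mem
        return winner

    print(on_container_freed(jobs))   # -> "Y", the job furthest below its fair share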

That describes allocation at the job level, but the FairScheduler also supports the abstraction of hierarchical queues. Allocation among queues is similar to what we just described for jobs. If the root queue has N child queues, over time each will get approximately 1/N of the resources, passing its share down in equal amounts to any child queues, or to individual jobs if the queue is a leaf.

For example, let’s say we have two queues A and B, but only one active job, X, which happens to be big enough that it could use the entire cluster.  When X starts on A, it will get the entire capacity of the cluster, not just the entirety of queue A’s 50% entitlement, because no other jobs are active. Now suppose job Y comes along on queue B. Initially, there are no available resources, but as some of X’s containers complete, they are assigned to queue B because it is the least served, with 0%.  Queue B’s only job, Y, gets all the freed up resources until it has 100% of B’s allocation, i.e., 50% of the cluster. At that point, A and B each have half the resources of the cluster, and X and Y each have 100% of their respective queue’s allocation.  Now, a third job, Z, comes along on queue A.  Z will accumulate a slowly increasing proportion of A’s resources until X and Z have equal shares, i.e., 50% of queue A’s resources, which is 25% each of the total cluster resources. Queue B’s 50% allocation remains unimpaired.

To complicate matters only a little, FairScheduler has the notion of “weights.”  Queues can be assigned different weights, allowing some queues to be treated more fairly than others. For instance, if queue A has twice the weight of queue B, the scheduler will consider them equally served when queue A has twice the resources of B.
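A sketch of how weights change the comparison, under the simple assumption that served-ness is usage divided by weight (illustrative names and numbers only):

    # Queue A has twice the weight of queue B, so it is considered equally
    # well served only when it holds twice the resources.
    queues = {"A": {"weight": 2.0, "used": 40},
              "B": {"weight": 1.0, "used": 20}}

    def servedness(name):
        return queues[name]["used"] / queues[name]["weight"]

    print(servedness("A") == servedness("B"))   # True: the scheduler treats them as tied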

Fair scheduling has the right behavior if you want to hold down the longest time that jobs have to wait to get served. Everyone runs, and over time gets a reasonable proportion of the available resources—nobody starves, and jobs tend to run in time proportional to their size. This strategy isn’t optimized for maximum throughput, but it does a reasonable job in that respect because it tends to keep lots of jobs running, which increases the average amount of container re-use significantly compared to the FIFO strategy.

CapacityScheduler 

The CapacityScheduler was specifically designed for clusters processing a diverse load, with a variety of user groups submitting jobs that may vary by:

  • Processing style, e.g., batch, interactive, and long-running jobs.
  • Size.
  • Categories of organizational ownership, urgency, or other special entitlement.

The primary motivation for the CapacityScheduler was to provide a way to assign varying proportions of computing capacity to these diverse classes of jobs while allowing a controllable degree of flexibility. The goal was to provide a middle path between excessively rigid reservation of resources and unpredictable resource availability.

Note that it’s a basic consequence of the CapacityScheduler’s design that in its default configuration (a single queue) it reduces to the equivalent of the original FIFO scheduler. It is therefore important not to use the defaults during a product evaluation, as doing so will affect both performance and user experience.

Like the FairScheduler, the CapacityScheduler was initially designed to reckon resource requirements in terms of memory alone.

Also like the FairScheduler, the CapacityScheduler supports hierarchical queues, and each queue is assigned a pre-determined proportion of the resources assigned to its parent queue. With the CapacityScheduler, the capacities of the sub-queues of any given queue must add up to 100%, whereas with the FairScheduler the weights can be arbitrary values; otherwise the queue systems are quite similar. The queues are kept sorted in ascending order of the proportion of their allotted capacity that they are actually using, and allocations are made to the queue at the head of the list.
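A small sketch of that ordering rule, with hypothetical queue names and numbers (not the actual YARN code):

    # Queues sorted ascending by the fraction of their allotted capacity they
    # are actually using; the next allocation goes to the head of the list.
    queues = {"etl":        {"capacity": 60, "used": 55},
              "adhoc":      {"capacity": 30, "used": 10},
              "dashboards": {"capacity": 10, "used": 9}}

    def next_queue(queues):
        return min(queues, key=lambda q: queues[q]["used"] / queues[q]["capacity"])

    print(next_queue(queues))   # -> "adhoc": it is using the smallest share of its capacity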

A queue may or may not confer on a job the right to use resources beyond its allocation that would otherwise go idle, and there are numerous configuration parameters that control how, when, and how much such borrowing will take place, how the resources will be recovered, and so on.  Queues can also be configured so that their capacity is reserved purely for their own jobs. In that case, the queue’s assigned proportion of the cluster’s resources will simply remain idle if there are no jobs assigned to the queue.

You may wonder why one would ever deliberately let resources sit idle when there are jobs that could use the capacity. The answer is responsiveness.  The original fair-scheduling algorithm described above grinds inexorably towards fair sharing of the resources, but fairness in that sense isn’t always what you want.  The cost of maximizing utilization is that resources are rarely ready to go instantly on demand. Nimble responsiveness requires idle resources ready to run.

This is one of a number of features of the CapacityScheduler that were designed with diverse cluster utilization in mind.  If the FairScheduler algorithm strives to ensure that nobody starves, the CapacityScheduler might be thought of as also having fast food as a goal.

Another feature that is aimed particularly at situations in which a cluster is shared among organizations is that the hierarchical queues have resource “affinity.” This means that idle resources will be shared preferentially among subtrees of the same organization.

The CapacityScheduler also has the ability to reserve memory for containers that require more RAM than is immediately available. Rather than simply moving down the queue until it finds a container request small enough for the memory in hand, the scheduler makes a reservation on that memory for the job at the head of the queue. No freed-up memory will be applied to any other needs until the request for which the memory is reserved has been satisfied.
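A toy illustration of the difference between reserving and skipping down the queue, with made-up job names and sizes (a simplified model, not the actual scheduler logic):

    # The head-of-queue request needs 16 GB, but containers free up only 4 GB
    # at a time. With reservation, each freed chunk is held for the head
    # request instead of being handed to the smaller request behind it.
    pending = [("big-query", 16), ("small-task", 2)]   # (name, GB), head of queue first
    reserved = 0

    for freed in [4, 4, 4, 4]:           # containers finishing, 4 GB each
        name, need = pending[0]
        reserved += freed                # all freed memory is reserved for the head request
        if reserved >= need:
            print(name, "gets its", need, "GB")
            reserved -= need
            pending.pop(0)

    print(pending)   # -> [('small-task', 2)]: the small request waited rather than jumping ahead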

The full details of how the CapacityScheduler handles contention for resources are numerous and only hinted at here.

Dominant Resource Fairness (DRF)

Hadoop has long used memory as the resource by which it apportions capacity, but computing jobs can be dominated by resources other than memory, such as CPU, disk usage, and network I/O. Clearly, an allocation algorithm that considers only a single resource will be less effective if the workload includes jobs that are dominated by other resources.

The Dominant Resource Fairness (DRF) algorithm rectifies this shortcoming by allowing allocation based on multiple resource types. Memory and CPU are the currently supported resource types, but more may be added in the future. Note that DRF is not a third scheduler class, but rather a variant algorithm used within the schedulers to determine which queue is most deserving of an available resource. Either scheduler can use DRF, but they support it slightly differently.

CPU is a natural choice for a second resource to consider, not only because it is inherently important, but because Linux offers kernel-level support for controlling CPU allocation (control groups, or cgroups).
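The core idea can be sketched in a few lines. In this illustration (made-up queue names and numbers, not the YARN implementation), each queue’s dominant share is its largest fractional use across the resource types, and the next allocation goes to the queue whose dominant share is smallest:

    cluster = {"memory_gb": 100, "vcores": 50}

    usage = {"analytics": {"memory_gb": 40, "vcores": 5},    # memory-heavy
             "streaming": {"memory_gb": 10, "vcores": 15}}   # CPU-heavy

    def dominant_share(queue):
        # A queue's dominant resource is whichever one it uses the largest fraction of.
        return max(usage[queue][r] / cluster[r] for r in cluster)

    def next_queue():
        # Allocate to the queue with the smallest dominant share.
        return min(usage, key=dominant_share)

    print({q: dominant_share(q) for q in usage})   # analytics: 0.4, streaming: 0.3
    print(next_queue())                            # -> "streaming"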

DRF is only cursorily documented in the Apache documentation, which refers the reader to the dense academic literature on multiple resource scheduling. The literature is worth looking at because the goals of DRF are subtle and extend beyond simply taking into account multiple resources. Among other properties, DRF:

  • Divides capacity among multiple queues that represent classes of jobs.
  • Maximizes the minimum amount of resources assigned to the queues.
  • Equalizes the total resources that each queue gets over time.
  • Considers multiple resources.
  • Prevents users from gaming the system.

The first three of these sound a lot like the FairScheduler in the way they implement fairness, and like both of the schedulers in considering fairness from the obvious point of view, that of the scheduler.

The fifth bullet point is about a different kind of fairness—preventing users from being unfair to each other by gaming the system. This aspect of fairness is embodied in four properties of the DRF algorithm.

  • The first three, sharing incentive, strategy-proofness, and envy-freeness, respectively ensure that users fare no worse than they would under an equal division of capacity, protect against users gaming the system by misrepresenting their intentions, and ensure that no user has reason to believe that others are getting a better allocation.  These properties encourage and enforce cooperative behavior among user groups.
  • The fourth property, Pareto efficiency, says that no alternative allocation could make a user better off unless it subtracts capacity from another user with an equal or lesser share. This means, among other things, that the scheduler is not letting resources go to waste. The intuition here is that even-handedness alone is not sufficient for scheduling, because even-handedness by itself is consistent with giving out no resources at all.  Pareto efficiency forbids this.

In its approach to achieving fairness, DRF is similar to fair scheduling, but it is important to keep in mind that it applies primarily to the allocation of resources among queues, an activity that is already dominated by queue weights. Thus, the most important thing about DRF is that it considers multiple resources, rather than that it attempts to provide equal resource allocation.

Resource allocation within the queues is controlled separately. Within a queue:

  • FairScheduler can apply any of FifoPolicy, FairPolicy, or DominantResourceFairnessPolicy.
  • CapacityScheduler can apply either FifoPolicy or FairPolicy.

The original paper describing Dominant Resource Fairness is here and the Apache embodiment of it in the FairScheduler is documented here.  A quick rundown of FairShare scheduling can be found here.

Node Labeling

The CapacityScheduler also supports another dimension in resource allocation: the ability to segregate subsets of the cluster into what amount to sub-clusters. To accomplish this, nodes to be grouped together are tagged with a label. For instance, if a subset of the servers is equipped with special CPUs or GPUs, you might label them with CALC.

The general process is that labels are defined, and then particular nodes are associated with each label, partitioning the cluster. (Nodes with no label are in their own partition.) The labels are then associated with scheduler queues, which allows jobs to be sent to particular labeled sub-clusters of the cluster. A node can have at most one label (i.e., the labels partition the cluster), but a queue can be associated with multiple labels.

The labels may be exclusive or shareable. If they are exclusive, jobs submitted to a queue with that label will have exclusive access to those nodes. If the label is shareable, access to those nodes will be available for queues with other labels or no labels, but the use of those resources is subject to preemption on demand.
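Here is a toy model of that partitioning, in Python rather than YARN configuration syntax, with hypothetical node, queue, and label names:

    # Each node carries at most one label; a queue may be associated with
    # several labels; a job submitted to a queue can only land on nodes
    # whose label that queue can reach. None stands for the unlabeled partition.
    node_label = {"node1": "CALC", "node2": "CALC", "node3": None, "node4": None}

    queue_labels = {"science": {"CALC", None},   # may use the special nodes and the default partition
                    "etl":     {None}}           # restricted to unlabeled nodes

    def eligible_nodes(queue):
        return [n for n, label in node_label.items() if label in queue_labels[queue]]

    print(eligible_nodes("science"))   # ['node1', 'node2', 'node3', 'node4']
    print(eligible_nodes("etl"))       # ['node3', 'node4']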

Note that YARN and the CapacityScheduler don’t “know” what the labels mean—labeling is a way for administrators to segregate workloads, and it’s up to the administrators to set up the labels and queues correctly. The labeling criterion can be anything: a machine type, the name of an organizational group, etc.

There is a good description of the overall process and how to configure it for a Hortonworks cluster here.

Convergence

The differences in capability between the FIFO scheduler and the CapacityScheduler, or between the FIFO scheduler and the FairScheduler, were large from the start. The differences between the capabilities of the CapacityScheduler and the FairScheduler were never so large and have shrunk over time, in that there is little you can achieve with one that you cannot achieve with the other.  One of the last significant differences was that prior to YARN 2.7.x (which is in HDP 2.3) the policy for resource allocation within a leaf queue was limited to FIFO for the CapacityScheduler. Fair allocation can now be specified for queues in the CapacityScheduler if that behavior is deemed preferable.

One caution about using the fair option for scheduling within a queue: it may not give you the results you expect.  If you are in a multi-tenant environment, your queue may have only a small slice of the cluster. With a large number of jobs and few resources, fair scheduling can divide up the capacity available to you to an excessive degree, not only hurting throughput but changing the user experience in unexpected ways.

The problem with all solutions to scheduling problems is that they exist primarily to make humans happy. You can prove all sorts of things about their behavior mathematically, but the effect on the user experience is often difficult to predict. Therefore, while what can be achieved with the two schedulers is pretty similar, the differences in their conceptual models are significant.  As a matter of personal preference, I find the CapacityScheduler’s model more natural and intuitive for data-lake or multi-tenant situations, and the FairScheduler model more natural for a single-use cluster, particularly one that emphasizes batch or longer-running analytics.

Like any distribution of Apache Hadoop, the Hortonworks HDP distribution can run either scheduler, but HDP actively supports only the CapacityScheduler. With the convergence of capabilities, there is little reason for a distribution to support both from the feature-set point of view (although it is possible that there are some unusual use-cases where one or the other has a slight advantage).

The choice of schedulers in 2015 should usually be whichever your distribution favors, because that will be the one with the easiest configuration interface and the one that will get the most attention from the experts on your particular distribution should you have a problem.
