The Apache documentation on the YARN schedulers is good, but it covers how to configure them, not how to choose between them. Here’s the background on why the schedulers are designed the way they are and how to choose the right one.
HDFS storage is cheap: about 1% of the cost of storage on a data appliance such as Teradata. It takes some doing to use up all the disk space on even a small cluster of, say, 30 nodes. Such a cluster may have anywhere from 12 to 24 TB per node, so a cluster of that size has from 360 to 720 TB of storage space. If there’s no shortage of space, why bother wasting cycles on compression and decompression?
With Hadoop, that’s the wrong way to look at it, because saving space is not the main reason to use compression in Hadoop clusters; minimizing disk and network I/O is usually more important. In a fully utilized cluster, MapReduce and Tez, which do most of the work, tend to saturate disk I/O capacity, while jobs that transform bulk data, such as ETL or sorting, can easily consume all available network I/O.
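To make the I/O argument concrete, here is a rough sketch in Python that measures how many fewer bytes a compressed block would put on disk or on the wire. It uses the standard-library gzip module as a stand-in, not an actual Hadoop codec, and the log record and the `compression_ratio` helper are hypothetical names made up for illustration. A single repeated record compresses far better than real data would, so treat the ratio as an upper-end illustration rather than a benchmark.

```python
import gzip


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return the original size divided by the gzip-compressed size."""
    compressed = gzip.compress(data, compresslevel=level)
    return len(data) / len(compressed)


# Hypothetical log-style record, repeated to mimic a block of similar rows.
record = b"2015-06-01 12:00:00 INFO request served status=200 bytes=5120\n"
sample = record * 10_000

ratio = compression_ratio(sample)
print(f"raw bytes:        {len(sample)}")
print(f"compressed bytes: {len(gzip.compress(sample))}")
print(f"ratio:            {ratio:.1f}x")
```

Every byte that is not written is a byte that does not have to be read back from disk or shuffled across the network later, which is the trade the paragraph above describes: some CPU cycles spent in exchange for less I/O.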
YARN—the data operating system for Hadoop. Bored yet? They should call it YAWN, right?
Not really. YARN is turning out to be the biggest thing to hit big data since Hadoop itself, despite the fact that it runs down in the plumbing somewhere, and even some Hadoop users aren’t 100% clear on exactly what it does. In some ways, the technical improvements it enables aren’t even the most important part. YARN is changing the very economics of Hadoop.