Hadoop

What’s So Important About Compression?

HDFS storage is cheap—about 1% of the cost of storage on a data appliance such as Teradata.  It takes some doing to use up all the disk space on even a small cluster of say, 30 nodes. Such a cluster may have anywhere from 12 to 24 TB per node, so a cluster of that size has from 720 to 1440 TB of storage space.  If there’s no shortage of space, why bother wasting cycles on compression and decompression?

clampWith Hadoop, that’s the wrong way to look at it because saving space is not the main reason to use compression in Hadoop clusters—minimizing disk and network I/O is usually more important. In a fully-used cluster, MapReduce and Tez, which do most of the work, tend to saturate disk-I/O capacity, while jobs that transform bulk data, such as ETL or sorting, can easily  consume all available network I/O.

Continue reading

Standard
Hadoop, YARN

The YARN Revolution

YARN—the data operating system for Hadoop.  Bored yet? They should call it YAWN, right?

152bb5c293e1e1b091141c2c1ad9ebda2

Not really—YARN is turning out be the biggest thing to hit big-data since Hadoop itself, despite the fact that it runs down in the plumbing of somewhere, and even some Hadoop users aren’t 100% clear on exactly what it does. In some ways, the technical improvements it enables aren’t even the most important part. YARN is changing the very economics of Hadoop.

Continue reading

Standard