What’s So Important About Compression?

HDFS storage is cheap: about 1% of the cost of storage on a data appliance such as Teradata. It takes some doing to use up all the disk space on even a small cluster of, say, 30 nodes. Such a cluster may have anywhere from 12 to 24 TB per node, so a cluster of that size has from 360 to 720 TB of storage space. If there’s no shortage of space, why bother wasting cycles on compression and decompression?

With Hadoop, that’s the wrong way to look at it. Saving space is not the main reason to use compression in Hadoop clusters; minimizing disk and network I/O is usually more important. In a fully utilized cluster, MapReduce and Tez, which do most of the work, tend to saturate disk I/O capacity, while jobs that transform bulk data, such as ETL or sorting, can easily consume all available network I/O.
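To make the I/O argument concrete, here is a minimal Python sketch that compares how many bytes would have to be written or shipped with and without compression. It is not Hadoop code: the `bytes_to_move` helper and the log-style records are hypothetical, the data is synthetic and deliberately repetitive, and gzip is used only because it ships with the Python standard library. Real ratios depend heavily on the data and the codec.

```python
import gzip


def bytes_to_move(records, compress):
    """Return the number of bytes that would hit disk or the network."""
    payload = "\n".join(records).encode("utf-8")
    if compress:
        # Compression costs CPU cycles but shrinks the payload to transfer.
        return len(gzip.compress(payload))
    return len(payload)


# Synthetic, highly repetitive log-style records (illustrative only).
records = [
    f"2015-01-01 00:00:{i % 60:02d} INFO user={i % 100} action=view"
    for i in range(10_000)
]

raw = bytes_to_move(records, compress=False)
packed = bytes_to_move(records, compress=True)
print(f"raw={raw} bytes, gzip={packed} bytes, ratio={raw / packed:.1f}x")
```

The trade is CPU time for fewer bytes through the disk and network, which is usually a good deal when I/O is the saturated resource.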
