Uncategorized

Erasure Code in Hadoop

What is Erasure Code?

Hadoop 3.0 isn’t out yet, but it’s scheduled to include something called “erasure code.” What the heck is that, you ask? Here’s a quick preview.


The short answer is that erasure code, as Hadoop uses the term, means Reed-Solomon error-correcting codes, which will be used in Hadoop 3.0 as an alternative to brute-force triple replication. This new feature is intended to provide high data availability while using much less disk space.
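To put a rough number on "much less disk space," here is a minimal sketch that compares raw storage per byte of user data. It assumes a Reed-Solomon layout with 6 data units and 3 parity units, which is a common choice but an assumption here, not something this post specifies; the function name `storage_overhead` is made up for illustration.

```python
def storage_overhead(data_units: int, parity_units: int) -> float:
    """Raw bytes stored on disk per byte of user data."""
    return (data_units + parity_units) / data_units


# Triple replication: one original block plus two extra copies.
replication = storage_overhead(data_units=1, parity_units=2)

# Reed-Solomon with 6 data units and 3 parity units (assumed layout).
rs_6_3 = storage_overhead(data_units=6, parity_units=3)

print(f"3x replication: {replication:.1f}x raw storage, survives 2 lost copies")
print(f"RS(6,3):        {rs_6_3:.1f}x raw storage, survives 3 lost units")
```

Under that assumption, the erasure-coded layout needs half the raw disk of triple replication while tolerating one more simultaneous loss.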

The longer answer follows.


Hadoop, Hadoop Hive

Lipwig for Hive Is The Greatest!

OK, this is the coolest thing this Hive user has seen all day.

As you probably know, if you prepend the word EXPLAIN to your SQL query and then run it, Hive prints out a text description of the query plan. This lets you explore the effects of such variations as code changes, the use of ANALYZE, turning the cost-based optimizer (CBO) on or off, and so on. It’s an essential tool for optimizing Hive.

The output of EXPLAIN is far from pretty, but fortunately, a simple pipeline of Linux commands can give you a slick graphical rendition like the one below.


Hadoop

What’s So Important About Compression?

HDFS storage is cheap: about 1% of the cost of storage on a data appliance such as Teradata. It takes some doing to use up all the disk space on even a small cluster of, say, 30 nodes. Such a cluster may have anywhere from 12 to 24 TB per node, so a cluster of that size has from 360 to 720 TB of storage space. If there’s no shortage of space, why bother wasting cycles on compression and decompression?

With Hadoop, that’s the wrong way to look at it, because saving space is not the main reason to use compression in Hadoop clusters; minimizing disk and network I/O is usually more important. In a fully used cluster, MapReduce and Tez, which do most of the work, tend to saturate disk I/O capacity, while jobs that transform bulk data, such as ETL or sorting, can easily consume all available network I/O.
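As a rough illustration of the I/O argument, the sketch below compresses a made-up, highly repetitive log-style payload with Python's standard `zlib` module and compares how many bytes would have to cross the disk or the network. The sample data is hypothetical and flatters the ratio; real ratios depend on the data and on the codec the cluster actually uses.

```python
import zlib

# Hypothetical sample: repetitive, log-like text (real data compresses less well).
line = b"2015-01-01 12:00:00 INFO request served status=200 bytes=5120\n"
raw = line * 10_000

# Compress at zlib's middle-of-the-road level.
compressed = zlib.compress(raw, level=6)

# Round-trip check: decompression must give back the original bytes.
assert zlib.decompress(compressed) == raw

ratio = len(raw) / len(compressed)
print(f"raw bytes to move:        {len(raw):,}")
print(f"compressed bytes to move: {len(compressed):,}")
print(f"reduction factor:         {ratio:.0f}x")
```

Every byte that is not written, read, or shuffled is I/O capacity left for other jobs, which is the trade being made for the extra CPU cycles.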


Hadoop, YARN

The YARN Revolution

YARN—the data operating system for Hadoop.  Bored yet? They should call it YAWN, right?


Not really. YARN is turning out to be the biggest thing to hit big data since Hadoop itself, despite the fact that it runs down in the plumbing somewhere, and even some Hadoop users aren’t 100% clear on exactly what it does. In some ways, the technical improvements it enables aren’t even the most important part. YARN is changing the very economics of Hadoop.

