What is Erasure Code?
Hadoop 3.0 isn’t out yet, but it’s scheduled to include something called “erasure code.” What the heck is that, you ask? Here’s a quick preview.
The short answer is that erasure code is another name for Reed-Solomon error-correcting codes, which will be used in Hadoop 3.0 as an alternative to brute-force triple replication. This new feature is intended to provide high data availability while using much less disk space.
The longer answer follows.
Hadoop and HDFS originally had a simple philosophy about data availability and reliability: replicate everything, at least three times. This keeps data highly available while increasing Hadoop’s options for locating computation with the data.
The disadvantage, of course, is that triple replication takes three times as much disk space. In the case of Hadoop, this cost is not a total loss, because more disks and nodes mean more CPUs available to process the data. Nevertheless, a factor of three is a big number when you’re storing petabytes.
Hadoop has introduced a number of options since the early days to make data storage cheaper, but until now they were either alternatives to HDFS (S3, Isilon) or they saved money but didn’t actually reduce the number of bytes stored (tiered storage using large, cheap disks).
Claude Shannon proved in 1948 that data encodings exist that can make the probability of data loss during transmission as small as you desire, so long as you are willing to add redundancy, i.e., extra bits, to the data.
Shannon’s theory put bounds on how much redundancy is required for a given level of guarantee against loss. While the theory didn’t provide the actual encoding and decoding algorithms, it proved that such encodings must exist, which inspired a great deal of work by numerous researchers on the problems of how to detect and correct data corruption or loss with minimum redundancy.
The most successful approach to the problem is the family of encodings known as forward error correction (FEC) codes. With FEC codes, redundancy can be added to a stream of symbols—usually bits—in such a way that the original set of symbols can be recovered even if some of the data stream is missing or corrupted. The encodings are called “forward” because they do not work by detecting corruption and asking for a re-transmission but actually correct the error using the damaged augmented data.
A bewildering variety of such codes have been invented, but the most widely used family of FEC algorithms today is known as Reed-Solomon error correction codes (RS). These codes were designed with data transmission in mind, but there is no logical difference between stored data and transmitted data: you can think of storage media simply as a transmission channel that holds the data for an indeterminate amount of time. Thus, the same technique can be used both for transmission applications, such as data radioed from interplanetary space missions or satellites, and to increase the reliability of storage media such as CDs, RAID-6 disk arrays, Isilon NAS storage, and soon, HDFS.
In the context of HDFS, the term “erasure code” is often used instead of Reed-Solomon. This is because Reed-Solomon handles both erroneous symbols and missing symbols, and in HDFS, RS is used to insure against the latter. (Bit errors are handled by a different mechanism.) The name comes from the mathematical model of digital communication known as a binary erasure channel, which is used to reason about situations in which some of the bits are missing, i.e., have been erased.
The details of how RS works are beyond the scope of these notes, but the gist is that the encoder treats each unit of input as the coefficients of a polynomial over a finite field, then evaluates that polynomial to produce an additional set of parity symbols.
Only certain outputs are possible for a given encoding, and collectively these possible outputs have the property that each is as different from every other as it is jointly possible for them all to be. In particular, if you use m parity symbols, the outputs are guaranteed to be different enough from one another that you can lose any m of the symbols before what remains becomes ambiguous (or, equivalently, the code can detect and correct changes to m/2 of the symbols).
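Full Reed-Solomon requires finite-field arithmetic, but the core “recover from erasures” idea can be shown with the simplest erasure code of all: a single XOR parity symbol, i.e., the m=1 case. This sketch is illustrative only; it is not what HDFS actually runs, and the function names are invented for the example.

```python
# Toy erasure code: one XOR parity symbol (the m=1 case).
# Any single missing piece can be rebuilt from the survivors.

def encode(data_pieces):
    """Append one parity piece: the XOR of all data pieces."""
    parity = 0
    for p in data_pieces:
        parity ^= p
    return data_pieces + [parity]

def recover(pieces, lost_index):
    """Reconstruct the piece at lost_index by XORing all the others."""
    value = 0
    for i, p in enumerate(pieces):
        if i != lost_index:
            value ^= p
    return value

stored = encode([0x1A, 0x2B, 0x3C])   # k=3 data pieces, m=1 parity
assert recover(stored, 1) == 0x2B     # "erase" piece 1, then rebuild it
```

Reed-Solomon generalizes this trick: instead of one parity symbol tolerating one loss, m parity symbols tolerate any m losses.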
To make this concrete, let’s think in terms of disk drives. The level of protection provided by RS is described by the equation n=k+m, where k is the original amount of data, m is the amount of redundancy added, and n is the sum of the two. You may see this written as RS(k,m). The values k=10 and m=4 would be suitable for a distributed file system, making n=14. This means the information in 10 input units is spread across 14 output units in the transformed encoding. If each of the 14 values is written to a different disk drive, then so long as you can read any k, i.e., 10, of the drives, you can reconstruct the original data.
The thing to note here is that RS achieves this resilience with much less redundancy than multiple copies would require. Triple-replicated HDFS can lose data if just three drives fail, but if data is encoded with RS(10,4) as in the above example, you can lose as many as four disk drives while increasing the data size by only 1.4X. There is no need to limit m to four: you can make the probability of data loss as low as you want by increasing the redundancy, so long as you are willing to pay the storage, computational, and network cost of encoding, storing, and decoding the data.
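The trade-off can be captured in a one-line overhead function. RS(10,4) is the example from the text; RS(6,3) below is an additional, assumed example policy:

```python
# Bytes stored per raw byte: 3x replication versus RS(k, m) erasure coding.

def rs_overhead(k: int, m: int) -> float:
    """Storage expansion factor for RS(k, m): n/k, where n = k + m."""
    return (k + m) / k

# 3x replication: 3.0x space, survives the loss of any 2 of the 3 copies.
# RS(10, 4):      1.4x space, survives the loss of any 4 of the 14 blocks.
assert rs_overhead(10, 4) == 1.4
assert rs_overhead(6, 3) == 1.5   # an assumed alternative policy
```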
Reed-Solomon coding has been bread-and-butter engineering since long before the earliest days of Hadoop, so the Hadoop engineers certainly knew all about it, but back in 2006, fancy information-theoretic approaches fit poorly with the design philosophy of Hadoop. Most obviously, Hadoop was all about getting huge amounts of data from disks to CPUs as quickly as possible, and RS puts non-trivial computation in that path for both reading and writing, but particularly for reading. (This is less of a problem for RAID-6 and other special-purpose devices because they have hardware support for the computations.) Secondly, saving disk space was a lesser concern, because more disks mean more nodes to apply to the task. Perhaps most importantly, the efficiency and scalability of Hadoop depended upon data locality, i.e., the convention of moving the mapper code to the machine hosting the data. This was critical not only because moving data over a network is slower than moving it over a bus, but because the network is a globally shared, limited resource: even a single node with 22 data disks reading and/or writing at 50MB/sec could in principle generate enough I/O to swamp a 10Gb network.
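That back-of-the-envelope claim checks out; the figures below are the ones quoted above (22 disks at 50MB/sec against a 10Gb network):

```python
# Can a single dense node saturate a 10Gb/sec network link?
disk_bw_mb_s = 22 * 50        # 22 disks x 50 MB/sec = 1100 MB/sec
net_bw_mb_s = 10_000 / 8      # 10 Gb/sec is roughly 1250 MB/sec
# One node alone generates almost 90% of the link's capacity.
assert disk_bw_mb_s / net_bw_mb_s > 0.85
```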
The hardware/software/data balance has changed in many ways since Hadoop was young. Both CPU and network capability have increased faster than disk controller capability, even as data growth has outpaced all three. Moreover, the ability to save data cheaply begets the desire to save even more data, resulting in a continually increasing pressure on storage.
Two Ways to Store Data
There are two basic strategies to store a distributed data set, regardless of whether or not EC is used:
- Contiguously: When stored in this way, logical blocks of data are large and data is written one byte after another in a single corresponding physical block. HDFS stores data this way in blocks of 64 to 256MB, and MapReduce is designed to stream entire blocks directly from disk.
- Striped: When stored in stripes, a logical block is broken into smaller cells, which are stored round-robin among a group of physical blocks. To read a logical block from striped storage, the locations of the disks holding the cells must be obtained, the cells read (presumably in parallel), and the contents reassembled into the logical block.
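Round-robin striping can be sketched in a few lines of Python. The cell size and stripe width here are tiny illustrative values, not HDFS’s actual parameters:

```python
# Sketch of round-robin ("striped") placement: a logical block is cut
# into fixed-size cells, dealt out across a group of physical blocks,
# then reassembled in the same order on read.

def stripe(data: bytes, width: int, cell_size: int) -> list:
    """Deal cells of `data` round-robin over `width` physical blocks."""
    blocks = [bytearray() for _ in range(width)]
    for i in range(0, len(data), cell_size):
        blocks[(i // cell_size) % width] += data[i:i + cell_size]
    return blocks

def unstripe(blocks: list, cell_size: int) -> bytes:
    """Reassemble the logical block by reading cells round-robin."""
    out = bytearray()
    offsets = [0] * len(blocks)
    total = sum(len(b) for b in blocks)
    cell = 0
    while len(out) < total:
        b = cell % len(blocks)
        out += blocks[b][offsets[b]:offsets[b] + cell_size]
        offsets[b] += cell_size
        cell += 1
    return bytes(out)

data = bytes(range(20))                      # a tiny "logical block"
striped = stripe(data, width=4, cell_size=3)
assert unstripe(striped, cell_size=3) == data
```

Note that every read in `unstripe` touches every physical block in the group, which is exactly why striping is inherently non-local.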
Striped storage, with or without EC, has two problems:
- It is inherently non-local because all or most of a given unit of data must be moved from remote machines.
- All data access involves the network, which is a shared limited resource.
It is often claimed that the use of multiple spindles in striped storage can speed up access. This is certainly true with RAID, but it’s a dubious claim with HDFS, because, in a well-balanced cluster, by definition each drive will have sufficient bandwidth to supply a mapper instance. (That’s part of what defines the term well-balanced.)
EC can be used with either striped or contiguous storage (there is no logical difference to the algorithm).
The problem with contiguous storage in HDFS is that EC is only efficient if a file is big enough to fill all k data blocks. Assume, for instance, RS(10,4) encoding as in the example above. EC must build the 4 parity blocks whether or not the 10 data blocks are filled. If you encode a file that is only big enough to fill one of the 10 available blocks, the redundancy factor will be 4:1, i.e., worse than plain-vanilla HDFS replication. Simply using smaller contiguous blocks doesn’t help, because the relative cost of container overhead in processing rises in inverse proportion to block size. Note also that the inefficiencies of undersized files apply not only to storage space but to the amount of data that must be read and moved per byte decoded.
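The small-file penalty is easy to quantify. The helper below is a hypothetical illustration using the RS(10,4) figures from the text:

```python
# Total blocks written per data block actually filled, for contiguous
# EC with RS(k, m): the m parity blocks are written no matter what.

def contiguous_ec_factor(blocks_filled: int, k: int = 10, m: int = 4) -> float:
    """Storage expansion when only some of the k data blocks hold data."""
    assert 1 <= blocks_filled <= k
    return (blocks_filled + m) / blocks_filled

assert contiguous_ec_factor(1) == 5.0    # 4:1 redundancy, worse than 3x replication
assert contiguous_ec_factor(10) == 1.4   # a full stripe gets the ideal 1.4x
```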
The advantage of striped storage is that when a logical block is broken into many small cells, only the final cell can be partially filled, so little space is wasted.
The problem with striped storage is that the component chunks of a logical block are scattered on many different machines, which means that there is almost no data locality even if a copy of the block is decoded in advance.
The advantage of contiguous storage is that you can have data locality if you decode a single replica. That replica will exist on a single physical drive, and the tasks can be assigned accordingly.
The first release of Hadoop EC will use striped storage for EC-encoded files. Subsequent releases will offer contiguous storage as well.
The advantages of EC storage are:
- High levels of availability.
- Reduced storage volume.

The disadvantages are:
- Encoding and decoding are computationally expensive.
- Encoded data cannot be read without moving all or most of it over the network.
- Local reads of encoded data are impossible.
- Involving the network in every block can cause a bottleneck even in fairly small computations.
- Recovery from failure is more complex, takes longer, and uses a lot of network bandwidth.
- EC requires significant support within the NameNode and elsewhere.
How EC Will be Used
EC’s guarantee of high availability is unquestionably a huge advantage. Data can be stored with extremely high guarantees against loss at relatively low cost in storage space.
The question of how much space is saved is trickier. High availability depends on spreading the data across many machines, so every read involves the network, which would quickly become a bottleneck for large-scale processing.
For this reason, EC is primarily intended for archival storage, i.e., data that will be accessed relatively seldom. It’s worth noting that in most environments, the great majority of stored data is archival, or at least cool, after a year or two. In fact, in many environments, a large proportion of data is literally never accessed for much of its lifetime, because it is stored primarily to meet regulatory requirements.
The design assumes that hot data will be kept with conventional HDFS triple replication and would migrate to EC encoding as it cools. For warm data, it may be economical to keep a single un-encoded copy plus an EC-encoded copy for availability. The space saving for warm data would be significantly lower than for cold data, but it would have a higher guarantee against loss. Moreover, if contiguous storage is used, the unencoded replica could be processed locally.
The design also provides that if such a replica is missing for any reason, it will be automatically reconstituted from the EC version. The data consumer will not have to wait for the entire logical block to be decoded; data will be available for processing as soon as the first bytes are decoded.
Encoding of HDFS data will be done both online and offline. It will be possible to add data subject to immediate encoding, or to add data to be stored under conventional replication and encoded later when/if designated conditions are met. Encoding of blocks is primarily controlled by policy, much as tiered storage is currently handled.
Bottom line: it is important to realize that while EC requires only about 1.4X the raw data size to achieve high availability (which is great), the greatest benefits are for data that is cold-archived, with no decoded replica available. For data that is not ice-cold, you will usually want a decoded replica as well, which reduces the space savings for warm data to about 2.4X the raw size, albeit with a higher guarantee against loss.
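Assuming RS(10,4) as in the earlier examples, the tier-by-tier storage factors work out as follows (a sketch of the arithmetic, not of any HDFS API):

```python
# Storage factor per temperature tier, relative to raw data size.
hot = 3.0               # conventional triple replication
cold = (10 + 4) / 10    # RS(10, 4) only, no decoded replica: 1.4x
warm = 1.0 + cold       # one un-encoded copy plus the EC copy: ~2.4x
assert cold == 1.4
assert warm < hot       # warm still beats triple replication on space
```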
One feature which is not in the design, but which can be expected, is that if temporary unencoded versions of contiguous blocks are prepared, they would be cached in unused disk space for a limited amount of time to take advantage of temporal locality of reference.
The implementation takes advantage of Intel’s Intelligent Storage Acceleration Library (ISA-L), a set of routines that optimize many low-level operations relating to compression, cryptography, hashing, data integrity, and data protection, and that can greatly accelerate EC. The manufacturer claims a 10X speedup; in practice, a factor of four or five seems more likely.