Developers might not want to read all the background on Unicode included in this earlier blog entry. Here is a quick distillation of how Unicode and the UTF encodings are relevant to a Hadoop user—just the facts and the warnings.
Key points about Unicode vs. the UTFs:
- Unicode proper is an abstraction
- It maps each character in every language and symbol set to a unique integer “code point” (a few examples appear in the short sketch after this list).
- The code points are the same everywhere, and for all time.
- To store or use Unicode, the characters must be encoded in some concrete format. Standard encodings include:
- UTF-8 (variable length, can represent any code point)
- UTF-16 (variable length, can represent any code point)
- UTF-32 (fixed length, can represent any code point)
- UTF-16LE (same as UTF-16, but specifically little-endian)
- UTF-16BE (same as UTF-16, but specifically big-endian)
- UTF-32LE (same as UTF-32, but specifically little-endian)
- UTF-32BE (same as UTF-32, but specifically big-endian)
- UTF-2 (obsolete, fixed length, for pre-1996 16-bit Unicode)
- Both Unicode itself and the UTFs are referred to as “encodings,” but when programmers say “encoding” they usually mean a UTF.
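To make the code-point idea concrete, here is a tiny Java sketch (Java, because it comes up again later in these notes) that prints the code point of a few arbitrary sample characters; the class name is just illustrative:

```java
public class CodePoints {
    public static void main(String[] args) {
        // "A", "é" (U+00E9), "中" (U+4E2D), and an emoji outside the BMP (U+1F600).
        // Escapes are used so the encoding of this source file doesn't matter.
        String[] samples = {"A", "\u00E9", "\u4E2D", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.printf("%s -> U+%04X%n", s, s.codePointAt(0));
        }
    }
}
```

The same code point comes back no matter how the text is later encoded; the encoding only decides which bytes represent it.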
If you’ve forgotten what endian-ness is, look here under “The Endian Problem.”
90% of misunderstandings about Unicode trace back to one of these:
- UTF-8 is not an 8-bit encoding; it can encode all of 21-bit Unicode.
- UTF-16 is not a 16-bit encoding; it can encode all of 21-bit Unicode (the sketch after this list shows both encodings handling a code point that needs more than 16 bits).
- Unicode itself is not limited to 16 bits.
- In Granddad’s day, Unicode was 16 bits and could represent only about 65K distinct characters.
- It was expanded to 21 bits in 1996 and can now handle up to 1,112,064 distinct characters.
- The numbers 8 and 16 in UTF-8 and UTF-16:
- Do not refer to the number of bits in the code-points that the encoding can express.
- Do refer to the size of the code unit, i.e., the number of bits that are read and written together.
- UTF-8 takes bytes one at a time.
- UTF-16 takes bytes two at a time.
- It would have made more sense to call them UTF-1 and UTF-2, but when UTF-16 was named, the name UTF-2 was already taken.
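Here is a minimal Java sketch of that point: both UTF-8 and UTF-16 can encode a code point that needs more than 16 bits; what varies is how many bytes each character takes. The sample characters are arbitrary, and UTF_16BE is used instead of UTF_16 only because Java’s UTF_16 encoder prepends a BOM, which would distort the byte counts:

```java
import java.nio.charset.StandardCharsets;

public class UtfLengths {
    public static void main(String[] args) {
        // One sample from each UTF-8 length class: "A" (1 byte), "é" (2), "中" (3),
        // and U+1F600 (4), which is outside the BMP and also needs 4 bytes in UTF-16.
        String[] samples = {"A", "\u00E9", "\u4E2D", "\uD83D\uDE00"};
        for (String s : samples) {
            System.out.printf("U+%05X  UTF-8: %d byte(s)  UTF-16: %d byte(s)%n",
                    s.codePointAt(0),
                    s.getBytes(StandardCharsets.UTF_8).length,
                    s.getBytes(StandardCharsets.UTF_16BE).length);
        }
    }
}
```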
The key points to remember about encodings are:
- The most widely used Unicode encoding today, by far, is UTF-8, but UTF-16 is not dead yet.
- Sometimes you are forced by circumstance to ingest UTF-16, but the only reason to write any format other than UTF-8 is to accommodate legacy processing.
- Occasionally, other formats, e.g., UTF-32, are used for special purposes internally to some program. If you need to know about this, then you are beyond needing this primer.
- UTF-8 and UTF-16 are both “variable length” encodings, i.e., not every character is expressed with the same number of bytes.
- ASCII by definition is 7-bit.
- Range is 0 through 7F hex, which includes 7F+1 values or, in decimal, 128.
- If the high order bit is set, it’s not ASCII.
- UTF-16 and UTF-8 represent the ASCII characters with the same numeric values used in ASCII, but they encode them differently.
- UTF-16 always uses either two bytes or four bytes.
- ASCII characters will have an extra all-zero byte in addition to the byte with the ASCII value.
- Whether the all-zero byte is leading or trailing depends on the endian-ness of the representation.
- The BMP characters all fit into 16 bits.
- UTF-8 uses:
- 1 byte for code points < 8 bits (the ASCII characters: the unaccented Latin alphabet, digits, and basic punctuation)
- 2 bytes for all code points that require from 8 to 11 bits
- 3 bytes for all code points that require from 12 to 16 bits
- 4 bytes for all code points that require from 17 to 21 bits
- Note that this implies that much of the BMP requires three bytes, not two.
- UTF-16 uses two bytes for BMP characters and four bytes for everything else.
- Needing code points outside the BMP (i.e., beyond the low-order 16 bits) is fairly unusual unless you’re working with Chinese, and usually not even then.
- ASCII (plain text) is UTF-8
- A file of pure ASCII is a valid UTF-8 file.
- The reverse is not necessarily true.
- Any file containing a byte with the high-bit set is not ASCII.
- UTF-16, because it deals with bytes two at a time, is actually two physical encodings—little endian and big endian.
- For UTF-16, an optional BOM character may be used as the first character in a file to distinguish the little-endian and big-endian encodings (the sketch after this list shows a simple check for it).
- The BOM is guaranteed never to be valid as anything else.
- If the first character of a UTF-16 file is read as U+FEFF, the file and the program reading it agree on byte order.
- If the first character of a UTF-16 file is read as U+FFFE, the program must reverse the endian-ness.
- This doesn’t tell a program the absolute byte order of the file, only whether it matches or is the reverse of the byte order the program itself assumes.
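Here is a minimal sketch, in Java, of the two checks described above: look for a UTF-16 BOM in the first two bytes, and otherwise look for any byte with the high bit set (if there is none, the file is pure ASCII and therefore also valid UTF-8). It reads the whole file into memory, so it is only an illustration for small files, not a robust charset detector; the class name and behavior on edge cases are illustrative only:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class EncodingSniffer {
    public static String sniff(String path) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(path)); // fine for small files only
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF, b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE (BOM FE FF)";
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE (BOM FF FE)";
        }
        // No UTF-16 BOM: if no byte has the high bit set, the file is pure ASCII,
        // which also makes it a valid UTF-8 file.
        for (byte b : data) {
            if ((b & 0x80) != 0) return "not ASCII (possibly UTF-8, but verify)";
        }
        return "pure ASCII (also valid UTF-8)";
    }

    public static void main(String[] args) throws IOException {
        System.out.println(sniff(args[0]));
    }
}
```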
Advantages of UTF-8
- ASCII is the most common format for data and ASCII is UTF-8
- For ASCII, UTF-8 takes only half as much space as UTF-16.
- No conversion needed for ASCII
- If you jump to a random point in a UTF-8 file, you can find the start of the next complete character by inspecting at most three bytes (just one if you land on ASCII); see the sketch after this list.
- One disadvantage of UTF-8 is that it takes about 50% more space than UTF-16 when encoding East Asian and South Asian languages (3 bytes vs. 2 bytes per character).
- UTF-8 is not subject to endian problems, while all multi-byte encodings, including UTF-16, are.
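The resynchronization trick works because every UTF-8 continuation byte matches the bit pattern 10xxxxxx, so from an arbitrary offset you only have to skip continuation bytes (at most three) to reach the start of the next character. A small sketch, with an arbitrary sample string:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Resync {
    // Returns the offset of the next character boundary at or after pos.
    // UTF-8 continuation bytes always look like 10xxxxxx, so at most three are skipped.
    static int nextBoundary(byte[] utf8, int pos) {
        while (pos < utf8.length && (utf8[pos] & 0xC0) == 0x80) {
            pos++;
        }
        return pos;
    }

    public static void main(String[] args) {
        // "a" (1 byte), "é" (2 bytes), "中" (3 bytes) -> 6 bytes of UTF-8 in total.
        byte[] bytes = "a\u00E9\u4E2D".getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < bytes.length; i++) {
            System.out.printf("offset %d -> next boundary at %d%n", i, nextBoundary(bytes, i));
        }
    }
}
```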
Java and Unicode
Java (unlike C and C++) was originally designed (before 1995) to use 16-bit Unicode, and later moved to 21-bit Unicode when the standard changed. The encoding used internally is UTF-16, but the Java specification requires it to handle a variety of encodings. Two critical points:
- Unlike C/C++, Java defines Strings in terms of characters, not bytes. This blog on Java and Unicode details it pretty well.
- Java is not limited to 16-bit code points (see the sketch below).
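A small sketch of both points: String length is measured in UTF-16 code units (Java chars), not bytes, and a code point outside the BMP is still a single character, stored as two chars (a surrogate pair):

```java
public class JavaStrings {
    public static void main(String[] args) {
        String s = "\uD83D\uDE00"; // U+1F600, a single character outside the BMP
        System.out.println(s.length());                       // 2 -> counts UTF-16 code units (chars)
        System.out.println(s.codePointCount(0, s.length()));  // 1 -> counts actual characters
        System.out.printf("U+%X%n", s.codePointAt(0));        // U+1F600
    }
}
```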
Hadoop and UTF Formats
In theory, Hadoop and Hive should work with either UTF-16 or UTF-8, but there are a couple of known Hive bugs that limit the use of UTF-16 to the characters in the BMP, and may cause problems even then. See this Apache bug report for details.
Even if Hadoop did work correctly with UTF-16, there would still be significant drawbacks:
- UTF-16 doubles the space required for Latin-alphabet text (English and European languages) in an environment that already triples storage size.
- Applications running over Map-Reduce and Tez (e.g. Hive) usually do a lot of sorting (in the shuffle-sort)
- Lexical sorts of UTF-8 are significantly more efficient than UTF-16 sorts.
- The reasons are beyond the scope of this primer, but see these notes for more details; the sketch after this list illustrates one of them.
- BOM markers are lost when files are split. See this page for details.
- ORC requires UTF-8. If your project uses tabular data, you should almost always be using ORC.
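One of those reasons, sketched below: UTF-8 byte sequences can be compared byte-for-byte as unsigned values and the result agrees with code-point order, so a shuffle-sort can compare raw bytes without decoding anything; UTF-16 loses that property as soon as surrogate pairs are involved. (Arrays.compareUnsigned needs Java 9 or later; the sample characters are chosen only to expose the difference.)

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SortOrderDemo {
    public static void main(String[] args) {
        String a = "\uFFFD";        // U+FFFD, inside the BMP
        String b = "\uD800\uDC00";  // U+10000, outside the BMP

        // Code-point order: U+FFFD < U+10000.
        // UTF-8: comparing raw bytes (unsigned) gives the same answer, no decoding needed.
        int utf8 = Arrays.compareUnsigned(
                a.getBytes(StandardCharsets.UTF_8), b.getBytes(StandardCharsets.UTF_8));
        // UTF-16BE: the surrogate pair for U+10000 starts with 0xD8, which sorts
        // before 0xFF, so raw-byte order disagrees with code-point order.
        int utf16 = Arrays.compareUnsigned(
                a.getBytes(StandardCharsets.UTF_16BE), b.getBytes(StandardCharsets.UTF_16BE));

        System.out.println("UTF-8 byte compare:  " + utf8);   // negative: matches code-point order
        System.out.println("UTF-16 byte compare: " + utf16);  // positive: does not match
    }
}
```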
The fact that Hadoop does not work well with UTF-16 is less of a problem than you’d think for two reasons:
- The majority of data ingested by Hadoop is ASCII, and ASCII is automatically UTF-8.
- Most data that is not specifically ASCII is UTF-8, because UTF-8 dominates the Web.
What to do if you are stuck with UTF-16 data?
- Don’t monkey around with trying to get UTF-16 to work in Hadoop—convert it directly to UTF-8 or specifically to ASCII if none of the code points are greater than 127.
- If it’s a reasonable amount of data, e.g., periodic ingestion of a few Gigs, you may be able to do it on the way in, e.g., with bash, or as part of the Oozie process.
- The Linux iconv utility can be invoked from within a bash script.
- iconv has been known to fail for very large files (15GB), but these can be chopped into smaller pieces with the Linux split utility.
- You can do larger amounts of data with a simple MapReduce job.
- Conversion is straightforward in Java (see the sketch after this list).
- MR is fast for this because it’s map-side only.
- You can find a clue to the Java code here.
- ICU provides Java libraries for doing conversions and many other operations on Unicode: http://userguide.icu-project.org/conversion/converters
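For the single-file case, a conversion in plain Java can be as simple as the sketch below; the file names are placeholders taken from the command line, and a MapReduce version would put the same decode/re-encode logic in a mapper. Java’s UTF_16 charset honors a leading BOM when decoding and assumes big-endian if there is none:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf16ToUtf8 {
    public static void main(String[] args) throws IOException {
        // args[0]: UTF-16 input file, args[1]: UTF-8 output file (placeholders)
        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_16);
             BufferedWriter out = Files.newBufferedWriter(Paths.get(args[1]), StandardCharsets.UTF_8)) {
            char[] buf = new char[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);  // characters are re-encoded as UTF-8 on the way out
            }
        }
    }
}
```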
If you want a little more depth on Unicode, endian-ness, representations, etc., be sure to check out Not Even Hadoop: All about Unicode.