No Fluff Unicode Summary for Hadoop

clovers6

Developers might not want to read all the background on Unicode included in this earlier blog entry. Here is a quick distillation of how Unicode and the UTF encodings are relevant to a Hadoop user—just the facts and the warnings.

Key points about Unicode vs. the UTFs:

Unicode proper is an abstraction
- It maps each character in every language and symbol-set to a unique integer “code point.”
- The code points are the same everywhere, and for all time.
To store or use Unicode, the characters must be encoded in some concrete format. Standard encodings include:
- UTF-8 (variable length, can represent any code point)
- UTF-16 (variable length, can represent any code point)
- UTF-32 (fixed length, can represent any code point)
- UTF-16LE (same as UTF-16, but specifically little-endian)
- UTF-16BE (same as UTF-16, but specifically big-endian)
- UTF-32LE (same as UTF-32, but specifically little-endian)
- UTF-32BE (same as UTF-32, but specifically big-endian)
- UCS-2 (obsolete, fixed length, for pre-1996 16-bit Unicode)
Both Unicode itself and the UTF’s are referred to as “encodings,” but when programmers say “encoding” they usually mean the UTF.

If you’ve forgotten what endian-ness is, look here under “The Endian Problem.”

90% of misunderstandings about Unicode trace back to one of these:

UTF-8 is not an 8-bit encoding: it can encode all of 21-bit Unicode.
UTF-16 is not a 16-bit encoding: it can encode all of 21-bit Unicode.
Unicode itself is not limited to 16 bits.
- In Granddad’s day, Unicode was 16 bits and could represent at most 65,536 distinct characters.
- It was changed to 21 bits in 1996 and now can handle up to 1,112,064 distinct characters.
The numbers 8 and 16 in UTF-8 and UTF-16:
- Do not refer to the number of bits in the code-points that the encoding can express.
- Do refer to the number of bits/bytes that are logically processed together.
  - UTF-8 takes bytes one at a time.
  - UTF-16 takes bytes two at a time.
- It would have made more sense to call them UTF-1 and UTF-2, but when UTF-16 was named, the name UTF-2 was already taken.
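The point about 8 and 16 being unit sizes rather than limits is easy to check in Java. What follows is a minimal sketch, not anything from a library: the class name `CodeUnitDemo`, its two helper methods, and the choice of U+1F600 as a sample code point that needs 17 bits are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

class CodeUnitDemo {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies as UTF-16 (big-endian, no BOM).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // U+1F600 needs 17 bits, more than either "8" or "16" would suggest.
        String s = new String(Character.toChars(0x1F600));
        System.out.println("UTF-8 bytes:  " + utf8Bytes(s));  // four 1-byte units
        System.out.println("UTF-16 bytes: " + utf16Bytes(s)); // two 2-byte units
    }
}
```

Both encodings represent the code point without loss. They differ only in the size of the unit they process at a time.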

The key points to remember about encodings are:

The most widely used Unicode encoding today, by far, is UTF-8, but UTF-16 is not dead yet.
Sometimes you are forced by circumstance to ingest UTF-16, but the only reason to write any format other than UTF-8 is to accommodate legacy processing.
Occasionally, other formats, e.g., UTF-32, are used for special purposes internally to some program. If you need to know about this, then you are beyond needing this primer.
UTF-8 and UTF-16 are both “variable length” encodings, i.e., not every character is expressed with the same number of bytes.
ASCII by definition is 7-bit.
- Range is 0 through 7F hex, which includes 7F+1 values or, in decimal, 128.
- If the high order bit is set, it’s not ASCII.
UTF-16 and UTF-8 represent the ASCII characters with the same numeric values used in ASCII, but they encode them differently.
- UTF-16 always uses either two bytes or four bytes.
  - ASCII characters will have an extra all-zero byte in addition to the byte with the ASCII value.
  - Whether the all-zero byte is leading or trailing depends on the endian-ness of the representation.
  - The BMP characters all fit into 16 bits.
- UTF-8 uses:
  - 1 byte for code points that fit in 7 bits (the ASCII characters, which include the basic Latin alphabet)
  - 2 bytes for all code points that require from 8 to 11 bits
  - 3 bytes for all code points that require from 12 to 16 bits
  - 4 bytes for all code points that require from 17 to 21 bits
  - Note that this implies that much of the BMP requires three bytes, not two.
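The UTF-8 size rule above can be written as a few comparisons on the number of bits a code point needs. This is a sketch under stated assumptions: `Utf8Length`, `utf8Length`, and `encodedLength` are hypothetical names, and the sample code points are arbitrary picks. The second method asks the JDK's own encoder for the length so the two can be compared.

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {
    // Bytes UTF-8 needs for one code point, derived from its bit width.
    static int utf8Length(int codePoint) {
        int bits = 32 - Integer.numberOfLeadingZeros(codePoint); // 0 for U+0000
        if (bits <= 7)  return 1;  // ASCII
        if (bits <= 11) return 2;
        if (bits <= 16) return 3;  // rest of the BMP
        return 4;                  // 17 to 21 bits
    }

    // Bytes the JDK encoder actually produces (do not pass surrogate values).
    static int encodedLength(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int[] samples = {0x41, 0xE9, 0x20AC, 0x4E2D, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X: rule=%d, encoder=%d%n",
                    cp, utf8Length(cp), encodedLength(cp));
        }
    }
}
```

Note that U+20AC and U+4E2D are both in the BMP and both take three bytes, which is the point of the last bullet.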

The need for code points outside the BMP (i.e., beyond the low-order 16 bits) is fairly unusual unless you’re using Chinese, and usually not even then.
ASCII (plain text) is UTF-8
- A file of pure ASCII is a valid UTF-8 file.
- The reverse is not necessarily true.
- Any file containing a byte with the high-bit set is not ASCII.
UTF-16, because it deals with bytes two at a time, is actually two physical encodings—little endian and big endian.
- For UTF-16, the optional BOM (byte order mark, U+FEFF) can be, but need not be, used as the first character in a file to distinguish little-endian from big-endian encodings.
- The byte-swapped value, U+FFFE, is guaranteed never to be a valid character, so a swapped BOM cannot be mistaken for anything else.
- If the first character of a UTF-16 file is read as U+FEFF, the file and the program reading the file are in agreement.
- If the first character of a UTF-16 file is read as U+FFFE, then the program must reverse the endian-ness.
- By itself, this doesn’t tell a program which encoding the file uses, only whether the file’s byte order is the same as, or the opposite of, the one the program is using.
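If you look at the raw bytes instead of reading 16-bit units in the machine's native order, the BOM does give an absolute answer. The sketch below assumes you already know the file is UTF-16 and just want its byte order. `BomSniffer`, `Utf16Order`, and `detectUtf16Bom` are made-up names for illustration, not part of any library.

```java
class BomSniffer {
    enum Utf16Order { BIG_ENDIAN, LITTLE_ENDIAN, UNKNOWN }

    // Inspect the first two raw bytes of data that is assumed to be UTF-16.
    static Utf16Order detectUtf16Bom(byte[] data) {
        if (data.length < 2) {
            return Utf16Order.UNKNOWN;
        }
        int b0 = data[0] & 0xFF; // mask to avoid sign extension
        int b1 = data[1] & 0xFF;
        if (b0 == 0xFE && b1 == 0xFF) {
            return Utf16Order.BIG_ENDIAN;    // U+FEFF stored high byte first
        }
        if (b0 == 0xFF && b1 == 0xFE) {
            return Utf16Order.LITTLE_ENDIAN; // U+FEFF stored low byte first
        }
        return Utf16Order.UNKNOWN;           // no BOM; the BOM is optional
    }

    public static void main(String[] args) {
        byte[] le = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00}; // BOM + 'A'
        System.out.println(detectUtf16Bom(le));
    }
}
```

One caveat: a UTF-32LE file also begins with FF FE (followed by 00 00), so this check is only meaningful once UTF-32 has been ruled out.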

Advantages of UTF-8

ASCII is the most common format for data and ASCII is UTF-8
- For ASCII, UTF-8 takes only half as much space as UTF-16.
- No conversion needed for ASCII
If you jump to a random point in a UTF-8 file, you can synchronize to the next complete character in at most three bytes—one byte if it’s ASCII.
One disadvantage of UTF-8 is that it takes about 50% more space than UTF-16 when encoding East Asian and South Asian languages (3 bytes vs. 2 bytes).
UTF-8 is not subject to endian problems, while every encoding whose basic unit is wider than one byte, including UTF-16 and UTF-32, is.
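The resynchronization property comes from the bit layout: every continuation byte in UTF-8 looks like 10xxxxxx, and no lead byte does. Here is a minimal sketch of the idea, assuming the input is valid UTF-8. The names `Utf8Resync`, `isContinuation`, and `nextCharStart` are invented for the example.

```java
import java.nio.charset.StandardCharsets;

class Utf8Resync {
    // Continuation bytes have the bit pattern 10xxxxxx.
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    // Index of the first byte at or after pos that begins a character,
    // or data.length if there is none. Skips at most 3 bytes on valid input.
    static int nextCharStart(byte[] data, int pos) {
        int i = pos;
        while (i < data.length && isContinuation(data[i])) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        // 'a' (1 byte), U+20AC (3 bytes), U+1F600 (4 bytes), 'b' (1 byte)
        byte[] data = "a\u20AC\uD83D\uDE00b".getBytes(StandardCharsets.UTF_8);
        // Offset 5 lands inside the 4-byte sequence that starts at offset 4.
        System.out.println(nextCharStart(data, 5)); // 8, the offset of 'b'
    }
}
```

This is the same property that lets a file be cut at an arbitrary byte offset and still be read cleanly from the next character on.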

Java and Unicode

Java (unlike C and C++) was originally designed (before 1995) to use 16-bit Unicode, and later moved to 21-bit Unicode when the standard changed. The encoding used internally is UTF-16, but the Java specification requires it to handle a variety of encodings. Two critical points:

Unlike C/C++, Java defines Strings in terms of characters, not bytes. This blog on Java and Unicode details it pretty well.
Java is not limited to 16-bit code points.
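One practical wrinkle behind these two points: a Java `char` is a 16-bit UTF-16 code unit, so a code point outside the BMP occupies two `char`s in a `String`. The sketch below shows the difference between counting units and counting code points. `JavaStringUnits` and its two methods are hypothetical names, while `length()` and `codePointCount()` are standard `String` methods.

```java
class JavaStringUnits {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'A' followed by U+1F600, written as its surrogate pair.
        String s = "A\uD83D\uDE00";
        System.out.println("code units:  " + codeUnits(s));  // 3
        System.out.println("code points: " + codePoints(s)); // 2
    }
}
```

Code that indexes strings with `charAt` is fine for BMP-only data, but anything that might see supplementary characters should use the code point methods.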

Hadoop and UTF Formats

In theory, Hadoop and Hive should work with either UTF-16 or UTF-8, but there are a couple of known Hive bugs that limit the use of UTF-16 to the characters in the BMP, and may cause problems even then. See this Apache bug report for details.

Even if Hadoop did work correctly with UTF-16, there would still be significant drawbacks:

UTF-16 doubles the space required for Latin-alphabet text (English and European languages) in an environment that already triples storage size.
Applications running over Map-Reduce and Tez (e.g. Hive) usually do a lot of sorting (in the shuffle-sort)
- Lexical sorts of UTF-8 are significantly more efficient than UTF-16 sorts.
- The reasons are beyond the scope of this summary, but see these notes for more details.
BOM markers are lost when files are split. See this page for details.
ORC requires UTF-8. If your project uses tabular data, you should almost always be using ORC.

The fact that Hadoop does not work well with UTF-16 is less of a problem than you’d think for two reasons:

The majority of data ingested by Hadoop is ASCII, and ASCII is automatically UTF-8.
Most data that is not specifically ASCII is UTF-8, because UTF-8 dominates the Web.
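To check whether a chunk of data really is plain ASCII (and therefore already valid UTF-8), it is enough to look for any byte with the high bit set. A minimal sketch, with `AsciiCheck` and `isPureAscii` as made-up names:

```java
import java.nio.charset.StandardCharsets;

class AsciiCheck {
    // True if no byte has its high-order bit set, i.e. every byte is 0..127.
    static boolean isPureAscii(byte[] data) {
        for (byte b : data) {
            if ((b & 0x80) != 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] ascii = "hello, hadoop".getBytes(StandardCharsets.US_ASCII);
        // The same bytes decode to the same text under either charset.
        String asAscii = new String(ascii, StandardCharsets.US_ASCII);
        String asUtf8 = new String(ascii, StandardCharsets.UTF_8);
        System.out.println(isPureAscii(ascii) + " " + asAscii.equals(asUtf8));

        byte[] notAscii = "caf\u00E9".getBytes(StandardCharsets.UTF_8);
        System.out.println(isPureAscii(notAscii));
    }
}
```

A file that passes this check needs no conversion at all before it goes into Hadoop.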

What to do if you are stuck with UTF-16 data?

Don’t monkey around with trying to get UTF-16 to work in Hadoop—convert it directly to UTF-8 or specifically to ASCII if none of the code points are greater than 127.
If it’s a reasonable amount of data, e.g., periodic ingestion of a few Gigs, you may be able to do it on the way in, e.g., with bash, or as part of the Oozie process.
The Linux iconv utility can be invoked from within a bash script.
iconv has been known to fail for very large files (15GB), but these can be chopped into smaller pieces with the Linux split utility.
You can do larger amounts of data with a simple MapReduce job.
- Conversion is straightforward in Java
- MR is fast for this because it’s map-side only.
- You can find a clue to the Java code here.
ICU provides Java libraries for doing conversions and many other operations on Unicode: http://userguide.icu-project.org/conversion/converters
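For a sense of how little Java the conversion itself takes, here is a sketch using only the standard library. It is not the code from the link above, and it is not a MapReduce job. `Utf16ToUtf8` and both `transcode` methods are hypothetical names. It assumes the JDK's `UTF-16` charset, which honors a BOM if one is present and otherwise treats the input as big-endian.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

class Utf16ToUtf8 {
    // Stream UTF-16 bytes from in and write the same text to out as UTF-8.
    static void transcode(InputStream in, OutputStream out) throws IOException {
        Reader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_16));
        Writer writer = new BufferedWriter(
                new OutputStreamWriter(out, StandardCharsets.UTF_8));
        char[] buf = new char[8192];
        int n;
        while ((n = reader.read(buf)) != -1) {
            writer.write(buf, 0, n);
        }
        writer.flush(); // the caller owns and closes the streams
    }

    // Convenience wrapper for in-memory data.
    static byte[] transcode(byte[] utf16) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            transcode(new ByteArrayInputStream(utf16), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "h\u00E9llo \uD83D\uDE00";
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16); // BOM + big-endian
        byte[] utf8 = transcode(utf16);
        System.out.println(utf16.length + " UTF-16 bytes -> " + utf8.length + " UTF-8 bytes");
    }
}
```

Be aware that `InputStreamReader` silently substitutes U+FFFD for malformed input. If you need the job to fail on bad data instead, a `CharsetDecoder` configured with `CodingErrorAction.REPORT` is the usual route.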

If you want a little more depth on Unicode, endian-ness, representations, etc., be sure to check out Not Even Hadoop: All about Unicode.
