No Fluff Unicode Summary for Hadoop

clovers6

Developers might not want to read all the background on Unicode included in this earlier blog entry. Here is a quick distillation of how Unicode and the UTF encodings are relevant to a Hadoop user—just the facts and the warnings.

Key points about Unicode vs. the UTFs:

  • Unicode proper is an abstraction
    • It maps each character in every language and symbol-set to a unique integer “code point.”
    • The code points are the same everywhere, and for all time.
  • To store or use Unicode, the characters must be encoded in some concrete format. Standard encodings include:
    •  UTF-8 (variable length, can represent any code point)
    •  UTF-16 (variable length, can represent any code point)
    • UTF-32 (fixed length, can represent any code point)
    • UTF-16LE (same as UTF-16, but specifically little-endian)
    • UTF-16BE (same as UTF-16, but specifically big-endian)
    • UTF-32LE (same as UTF-32, but specifically little-endian)
    • UTF-32BE (same as UTF-32, but specifically big-endian)
    • UCS-2 (obsolete, fixed length, for pre-1996 16-bit Unicode)
  • Both Unicode itself and the UTFs are referred to as “encodings,” but when programmers say “encoding” they usually mean a UTF.
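
To make the split between code point and encoding concrete, here is a minimal Python sketch. The sample character (é, U+00E9) and the variable names are arbitrary choices for illustration, not anything from the list above.

```python
# One abstract code point, several concrete byte encodings.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# The code point is just an integer; it is the same everywhere.
code_point = ord(ch)

# The bytes depend on which UTF (and which byte order) is used.
encodings = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]
encoded = {name: ch.encode(name) for name in encodings}

print(f"code point: U+{code_point:04X}")
for name, raw in encoded.items():
    print(f"{name:10s} {raw.hex()}")
```

Every entry decodes back to the same single character; only the bytes on disk differ.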

If you’ve forgotten what endian-ness is, look here under “The Endian Problem.”

90% of misunderstandings about Unicode trace back to one of these:

  • UTF-8 is not an 8-bit encoding: it can encode all of 21-bit Unicode.
  • UTF-16 is not a 16-bit encoding: it can encode all of 21-bit Unicode.
  • Unicode itself is not limited to 16 bits.
    • In Granddad’s day, Unicode was 16 bits and could represent at most 65,536 distinct characters.
    • It was changed to 21 bits in 1996 and now can handle up to 1,112,064 distinct characters.
  • The numbers 8 and 16 in UTF-8 and UTF-16:
    • Do not refer to the number of bits in the code-points that the encoding can express.
    • Do refer to the size of the code unit, i.e., the number of bits that are logically processed together.
      • UTF-8 takes bytes one at a time.
      • UTF-16 takes bytes two at a time.
    • It would have made more sense to call them UTF-1 and UTF-2, but when UTF-16 was named, the name UTF-2 was already taken.
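
A quick way to see that the 8 and 16 describe the code unit rather than the range is to count code units for a few characters. This is a sketch only; the helper name `code_units` and the three sample characters are made up for the example.

```python
def code_units(text: str, encoding: str, unit_bytes: int) -> int:
    """Count how many code units of the given size the encoded text occupies."""
    return len(text.encode(encoding)) // unit_bytes


samples = {
    "A (U+0041)": "A",                   # ASCII
    "euro sign (U+20AC)": "\u20ac",      # inside the BMP
    "G clef (U+1D11E)": "\U0001D11E",    # outside the BMP
}

for label, ch in samples.items():
    utf8_units = code_units(ch, "utf-8", 1)       # 8-bit code units
    utf16_units = code_units(ch, "utf-16-le", 2)  # 16-bit code units
    print(f"{label}: {utf8_units} x 8 bits in UTF-8, {utf16_units} x 16 bits in UTF-16")
```

The character outside the BMP needs four 8-bit units in UTF-8 and two 16-bit units in UTF-16, so neither encoding is capped at 8 or 16 bits.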

The key points to remember about encodings are:

  • The most widely used Unicode encoding today, by far, is UTF-8, but UTF-16 is not dead yet.
  • Sometimes you are forced by circumstance to ingest UTF-16, but the only reason to write any format other than UTF-8 is to accommodate legacy processing.
  • Occasionally, other formats, e.g., UTF-32, are used for special purposes internally to some program. If you need to know about this, then you are beyond needing this primer.
  • UTF-8 and UTF-16 are both “variable length” encodings, i.e., not every character is expressed with the same number of bytes.
  • ASCII by definition is 7-bit.
    • Range is 0 through 7F hex, which includes 7F+1 values or, in decimal, 128.
    • If the high order bit is set, it’s not ASCII.
  • UTF-16 and UTF-8 represent the ASCII characters with the same numeric values used in ASCII, but they encode them differently.
    • UTF-16 always uses either two bytes or four bytes.
      • ASCII characters will have an extra all-zero byte in addition to the byte with the ASCII value.
      • Whether the all-zero byte is leading or trailing depends on the endian-ness of the representation.
      • The BMP characters all fit into 16 bits.
    • UTF-8 uses:
      • 1 byte for code points that fit in 7 bits (the ASCII characters, which include the unaccented Latin alphabet)
      • 2 bytes for all code points that require from 8 to 11 bits
      • 3 bytes for all code points that require from 12 to 16 bits
      • 4 bytes for all code points that require from 17 to 21 bits
      • Note that this implies that much of the BMP requires three bytes, not two.
  • The need for code points outside the BMP, i.e., beyond the low-order 16 bits, is fairly unusual unless you’re using Chinese, and usually not even then.
  • ASCII (plain text) is UTF-8
    • A file of pure ASCII is a valid UTF-8 file.
    • The reverse is not necessarily true.
    • Any file containing a byte with the high-bit set is not ASCII.
  • UTF-16, because it deals with bytes two at a time, is actually two physical encodings—little endian and big endian.
    • For UTF-16, the optional BOM character can be, but need not be, used as the first character in a file to distinguish little-endian and big-endian encodings.
    • The BOM is guaranteed not to ever be valid for anything else.
    • If the first character of a UTF-16 file is read as U+FEFF, the file and the program reading the file are in agreement about byte order.
    • If the first character of a UTF-16 file is read as U+FFFE, then the program must reverse the endian-ness.
    • By itself this does not tell a program which byte order the file uses, only whether it matches or is the reverse of the byte order the program assumed.
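
Two of the points above are easy to check in code: the UTF-8 byte counts and the BOM test. The following is a rough Python sketch, not a complete encoding detector; the function names `utf8_length` and `sniff_utf16_byte_order` are invented for this example.

```python
import codecs


def utf8_length(code_point: int) -> int:
    """Bytes UTF-8 needs for one code point, going by its bit length."""
    bits = code_point.bit_length()
    if bits <= 7:
        return 1
    if bits <= 11:
        return 2
    if bits <= 16:
        return 3
    return 4  # 17 to 21 bits


def sniff_utf16_byte_order(first_bytes: bytes):
    """Return the byte order implied by a leading UTF-16 BOM, or None.

    Assumes the caller already knows the data is UTF-16. A UTF-32-LE BOM
    (FF FE 00 00) also starts with FF FE, so this check alone cannot tell
    those two apart.
    """
    if first_bytes.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    if first_bytes.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    return None  # no BOM: the byte order has to come from somewhere else
```

Looking at the raw bytes, as done here, gives an absolute answer (little-endian or big-endian) instead of the relative "same as me or reversed" answer a program gets from reading the first 16-bit value.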

Advantages of UTF-8

  • ASCII is the most common format for data and ASCII is UTF-8
    • For ASCII, UTF-8 takes only half as much space as UTF-16.
    • No conversion needed for ASCII
  • If you jump to a random point in a UTF-8 file, you can synchronize to the next complete character in at most three bytes—one byte if it’s ASCII.
  • One disadvantage of UTF-8 is that it takes about 50% more space than UTF-16 when encoding East Asian and South Asian languages (3 bytes vs. 2 bytes).
  • UTF-8 is not subject to endian problems, while all encodings with multi-byte code units, including UTF-16, are.
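
The resynchronization property mentioned above follows from the bit pattern of UTF-8 continuation bytes, which always look like 10xxxxxx. A minimal sketch, assuming the input is valid UTF-8 (the function name `next_char_start` is hypothetical):

```python
def next_char_start(data: bytes, pos: int) -> int:
    """Index of the first byte at or after pos that begins a character.

    Continuation bytes have the top two bits set to 10, so they are
    skipped. In valid UTF-8 at most three bytes are skipped.
    """
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


data = "a\u20acb".encode("utf-8")  # bytes: 61 | e2 82 ac | 62
print(next_char_start(data, 2))    # lands on the index of "b"
```

This is the same property that lets a reader pick up cleanly at an arbitrary byte offset in a UTF-8 file.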

Java and Unicode

Java (unlike C and C++) was originally designed (before 1995) to use 16-bit Unicode, and later moved to 21-bit Unicode when the standard changed. The encoding used internally is UTF-16, but the Java specification requires it to handle a variety of encodings. Two critical points:

  • Unlike C/C++, Java defines Strings in terms of characters, not bytes. This blog on Java and Unicode details it pretty well.
  • Java is not limited to 16-bit code points.

Hadoop and UTF Formats

In theory, Hadoop and Hive should work with either UTF-16 or UTF-8, but there are a couple of known Hive bugs that limit the use of UTF-16 to the characters in the BMP, and may cause problems even then. See this Apache bug report for details.

Even if Hadoop did work correctly with UTF-16, there would still be significant drawbacks:

  • UTF-16 doubles the space required for Latin-alphabet text (English and European languages) in an environment that already triples storage size.
  • Applications running over Map-Reduce and Tez (e.g. Hive) usually do a lot of sorting (in the shuffle-sort)
    • Lexical sorts of UTF-8 are significantly more efficient than UTF-16 sorts.
    • The reasons are beyond the scope of these notes, but see these notes for more details.
  • BOM markers are lost when files are split. See this page for details.
  • ORC requires UTF-8. If your project uses tabular data, you should almost always be using ORC.

The fact that Hadoop does not work well with UTF-16 is less of a problem than you’d think for two reasons:

  • The majority of data ingested by Hadoop is ASCII, and ASCII is automatically UTF-8.
  • Most data that is not specifically ASCII is UTF-8, because UTF-8 dominates the Web.

What to do if you are stuck with UTF-16 data?

  • Don’t monkey around with trying to get UTF-16 to work in Hadoop—convert it directly to UTF-8 or specifically to ASCII if none of the code points are greater than 127.
  • If it’s a reasonable amount of data, e.g., periodic ingestion of a few Gigs, you may be able to do it on the way in, e.g., with bash, or as part of the Oozie process.
  • The Linux iconv utility can be invoked from within a bash script.
  • iconv has been known to fail for very large files (15GB), but these can be chopped into smaller pieces with the Linux split utility.
  • You can do larger amounts of data with a simple MapReduce job.
  • ICU provides Java libraries for doing conversions and many other operations on Unicode: http://userguide.icu-project.org/conversion/converters
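
If neither iconv nor a MapReduce job is convenient, the same conversion can be sketched in a few lines of Python. This is an alternative illustration, not the method described above; the function name `transcode_utf16_to_utf8`, its parameters, and the chunk size are all assumptions made for the example.

```python
def transcode_utf16_to_utf8(src_path, dst_path,
                            src_encoding="utf-16", chunk_chars=1 << 20):
    """Copy a UTF-16 text file to UTF-8 one chunk at a time.

    With src_encoding="utf-16", Python uses a leading BOM to pick the
    byte order and drops it; pass "utf-16-le" or "utf-16-be" explicitly
    for files without a BOM. newline="" leaves line endings untouched.
    """
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)  # counted in characters, not bytes
            if not chunk:
                break
            dst.write(chunk)
```

Reading in chunks keeps memory use flat regardless of file size, and the text-mode reader does not split a character across chunk boundaries.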

If you want a little more depth on Unicode, endian-ness, representations, etc., be sure to check out Not Even Hadoop: All about Unicode.
