Unicode is a subject that trips up even experienced programmers. It’s one of those places where computer science and engineering bump hard into human diversity.
Of the planet’s seven thousand living languages, more than two thousand are written, and together they use approximately 30 major alphabets, including some used only for religious purposes. That’s 30 alphabets for bona fide living languages—there are also alphabets for ancient languages that haven’t been spoken for centuries, as well as alphabets for made-up languages, and other exotica such as Braille. Altogether, Unicode currently includes about 130 alphabets, plus many other symbol sets for other purposes.
OK, that’s a lot of alphabets, but what’s the big deal? You just assign every letter a number, and you’re done. What’s so hard about that? More than you might think, actually. First, a quick overview of some key ideas.
What Is An Alphabet?
Linguists overload the term alphabet, using it in a broad sense to talk about any symbol set used for writing, or in a narrow sense to talk about symbol sets used to represent the individual sounds (phonemes) that make up words. (Computer scientists really abuse the term, using it for any abstract symbol set processed by a machine.)
English and hundreds of other languages use the Latin alphabet, which is an alphabet in the narrow sense, having symbols for both consonants and vowels from which words can be constructed by lining up the symbols for sounds. Some alphabets only have symbols for consonants, leaving the vowels as an exercise for the reader. Others lack specific symbols for vowels, but indicate vowel sounds by decorating the consonants with a variety of symbols known collectively as diacritics.
Notice that handwritten alphabetic text isn’t strictly a one-dimensional line of characters. Diacritics go above and below the alphabetic characters. When represented as a file of symbols in a computer, the characters are in a one-dimensional row, and there must be some provision for grouping the diacritics with the appropriate alphabetic character.
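Unicode handles this grouping with combining characters, and most programming languages expose the mechanics directly. Here is a small Python sketch (standard library only) showing a letter with a diacritic stored two ways: as a base character followed by a combining mark, and as a single precomposed character.

```python
import unicodedata

# 'é' written as a base letter plus a combining acute accent (two code points)
decomposed = "e\u0301"
# The same letter as a single precomposed code point
composed = "\u00e9"

print(len(decomposed))  # 2 code points
print(len(composed))    # 1 code point

# NFC normalization groups the diacritic with its base character
assert unicodedata.normalize("NFC", decomposed) == composed
```

Both spellings display identically, which is exactly the "grouping" problem: the one-dimensional file must record which diacritic belongs to which base character.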
There are other languages, such as Japanese, that use slightly higher-level symbols that represent syllables instead of individual phonemes. Sets of symbols for such languages are called syllabaries, or in hybrid systems, alphasyllabaries.
Chinese is something else again. Spoken Chinese isn’t one language, but many, which are approximately as similar and mutually intelligible in their spoken forms as the Romance languages are. Yet, though mutually unintelligible, those many languages are often described as dialects. Why is that?—Spanish and Italian aren’t considered “dialects” of Latin.
The explanation isn’t some vestigial Western cultural chauvinism left over from the 19th Century. The reason is that the written forms of these many languages are all based on Chinese writing, which varies much less than spoken Chinese. This is possible because Chinese is written with symbols that represent units of meaning rather than units of sound. These symbols are often referred to as ideograms, but like “alphabet,” the term “ideogram” is overloaded. Strictly speaking, ideograms are symbols that represent meaning at the level of ideas. A more precise term for the symbols of Chinese is logograms, a grouping that includes both ideograms and morphemes, which are the lowest-level units of meaning in a language.
Words in Chinese, which are by definition free-standing, consist of one or more morphemes. Because written Chinese represents language as a sequence of logograms rather than a sequence of phonemes, it is not tightly coupled to the spoken language, either geographically or across time. This provides a powerful unifying principle to cultures which might otherwise be much further separated by their spoken languages. The downside is that it requires a vastly larger set of characters than the few dozen that are usually necessary to write phonetically. Han Chinese was represented in the original Unicode by about 21,000 characters, a number which as of 2015 has grown to more than 70,000, with more on the way.
There’s No Such Thing As Neutral
This diversity of character sets is just the beginning of the complexities that the designers of Unicode had to contend with. For instance, languages can be written left to right (English) or right to left (Arabic and Hebrew) or even in both directions within the same slab of text, e.g. a line from the Talmud (Hebrew) quoted in an English language sentence. Unicode allows this to be expressed.
Another design complexity is in defining what “the same” means when talking about alphabetic characters. For example, alphabets sometimes have multiple versions of the same character, and the choice of which to use is situational. The English long s, which extends both above and below the line, is one example. Though rarely used today, it was common until the 19th Century (e.g., in the US Constitution), and both its use and its appearance have evolved over time. In some documents it looks like a lower case ‘f’, but with only one half of the crossbar. Other times it has a curl at the bottom as well as at the top. Should an encoding treat these as separate characters, or are the differences simply a matter of font, or the choice of boldface? Character sets also include ligatures, which are pairs of characters that are written differently when they appear side by side, although not necessarily in every context. Where do you draw the line between true characters and a mere change in the typeface?
Chinese and the other East Asian languages multiply this kind of problem. These languages are ancient, and it can be extremely difficult to decide whether two versions of the same character (perhaps from documents written thousands of years apart) are genuinely different, or whether they are merely different glyphs representing the same Platonic ideal of a character. Because ideograms encode ideas, and not merely sounds, the merging of characters is viewed by many as modifying the underlying ideas as well—two nearly identical characters from different eras may represent significantly different ideas.
The Unicode specification addresses all these, and a myriad of other issues.
A Quick Refresher on Bits and Bytes
If you have not forgotten all about bits, bytes, machine words, big endian vs little endian, hexadecimal notation, etc., by all means, feel free to skip this part, but in the Unicode world, the low level details matter, because many of the quirks of Unicode trace back to the lowest levels of how computers represent data.
In the digital world, a ‘bit’ is the smallest unit of memory. Think of a bit as 0/1, Y/N, true/false, on/off, or any other discrete pair of values. There are no gray areas, and no concept of ‘neither’—a bit is always one value or the other. You can represent bits with any device or marking that can have exactly two states or values: electrical toggle switches made of transistors (CPUs and main memory), capacitors that either hold or do not hold an electric charge (DRAM memory), magnetized or non-magnetized spots on a metal disk (disk drives), properties of an electrical pulse on a wire (computer busses and networks), or even tiny dots of ink (paper for smart pens). In the 1970s and before, bits were often stored as holes in paper tape or punch cards.
General purpose computers, however, do not directly address individual bits. For many reasons, it is more convenient to group bits together at the hardware level into octets, or bunches of eight, known as “bytes.” (Eight is by far the most common number, but larger and smaller groupings have been used.) At the conceptual level of computer programs, a byte is the smallest unit of memory that has its own address. Down at the hardware level this may or may not be true. Most machines move around small groups of bytes called words, and the size of the word (usually four or eight bytes), as well as the location of word boundaries, are deeply woven into the hardware design. Depending on the particular hardware and exactly what you mean by “address,” the smallest unit with a physically distinct machine address can sometimes be the word rather than the byte, but that detail is largely hidden by higher level languages. Within the context of a programming language that allows pointers, we can usually think of main memory as a single array of bytes numbered from zero to the size of the memory, minus one.
An aside on hex: By convention, when talking about bits and bytes, hexadecimal notation is usually used instead of the more familiar decimal. Hex uses 16 digits—the same 0-9 used in decimal, plus A-F for the digits above nine that we don’t have decimal numerals for. Why confuse everyone with base sixteen? Because sixteen is two to the fourth power, which means that each hex digit corresponds to exactly four bits. If you want to picture the bits, you simply expand each hex digit (0-F) to the corresponding binary number between 0000 and 1111, which is a small enough set to either remember or figure out on the fly.
Spotting what is special about 00000000 00000001 00000000 00000000 is painful in binary format, but it’s a snap in hex. Grouping the bits by fours gives 00010000 in hex. You see right off that it’s eight digits, so it’s four bytes. And you see that the first two bytes are zero, and that the only set bit is the low-order bit of the third byte. Therefore, this is a very significant number: one more than the biggest value you can express in two bytes. That would be much harder to infer from the decimal equivalent, 65536, unless you just happened to remember that 65536 is two to the sixteenth power. A useful special case to remember is that with all the bits set, a byte is FF, and there are FF+1 distinct values, just as the biggest two digit decimal number is 99, but two digits can express 100 values, including 0.
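If you want to experiment, Python’s format specifiers make the hex/binary correspondence easy to see. This little sketch prints the number above in both notations:

```python
n = 65536

# Hex shows the byte structure at a glance: four bytes, one bit set
print(f"{n:08X}")   # 00010000
# The same value in binary is much harder to scan
print(f"{n:032b}")  # 00000000000000010000000000000000

# Each hex digit expands to exactly four bits
assert int("00010000", 16) == int("0" * 15 + "1" + "0" * 16, 2)
```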
A byte can represent many things: a number, part of a number, more than one number, a single text character, part of a text character, a machine-language instruction, or many other things. What a byte “means” is mostly a matter of circumstance. For instance, a byte set to 01000001 is, in one context, the decimal number 65; in Intel 8080 machine language it is an instruction telling the CPU to copy the contents of one register to another; and in the familiar ASCII text encoding, it is the capital letter ‘A’.
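A quick illustration in Python: the same byte value read three ways (the machine-instruction interpretation can’t be shown here, but the number/character duality can):

```python
b = 0b01000001  # one byte, bits 01000001

print(b)           # as a number: 65
print(chr(b))      # as an ASCII/Unicode character: A
print(bytes([b]))  # as raw data: b'A'
```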
Computers are famous for their number crunching ability, but in terms of sheer number of bytes consumed, strings, i.e., text, outweigh everything else. One of the main things that high-level computer languages do is abstract this issue away so that programmers don’t have to think much about how their program’s data is represented at the level of bits and bytes. Still, as we will see below, there remain quite a few places where the gory details of how data is encoded and stored are unavoidable, and a lot of the most common issues center around text.
Back In The Day
The 256 distinct ways you can set the eight bits in a byte are enough to let us assign a different value to every letter in the alphabet (both upper and lower case), plus punctuation marks, numbers, and assorted whitespace and control characters. In fact, even seven bits (128 distinct values) are plenty if you are talking about the English language alphabet, or that of most modern European languages.
The ASCII (American Standard Code for Information Interchange) system was established in the early 1960s as a standard for which of the 128 values of the lower 7 bits would map to which control characters, numerals, and characters in the alphabet used for English text. For a number of reasons, engineers back then didn’t worry too much about dealing with other languages. For one thing, virtually all computers were manufactured in the United States, which was also overwhelmingly the biggest market. Even if non-English speaking markets had been bigger, computers were still too expensive, and storage too limited, for most text applications. Although disk and tape drives had existed for several years, they were expensive and of very limited capacity. Most data and programs were still stored and transported on paper—either punch cards or paper tape.
ASCII uses the first 32 values that can be represented in a byte (00000000 to 00011111) for non-printing control characters that can be embedded in text to do things like indicate the end of a file, or signal a new-line or carriage return. After that comes a block of punctuation and other non-alphanumeric characters, followed by the digits 0-9 (in order), more non-alphanumerics, capitals A-Z (in order), more non-alphanumerics, lowercase a-z (in order), a few more non-alphanumerics, and finally, the delete character (the only non-printing character that sits off by itself). All of ASCII fits into the values 00000000 to 01111111, so with the highest order bit unused, the 128 values from 10000000 up could be used for special purposes.
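These ordering choices are easy to verify, and they have a practical payoff: the upper and lower case blocks sit exactly 32 values apart, so classic ASCII case conversion is a single bit flip. A quick check in Python:

```python
# Digits, capitals, and lowercase letters each form a contiguous, ordered block
assert ord("0") == 48 and ord("9") == 57
assert ord("A") == 65 and ord("Z") == 90
assert ord("a") == 97 and ord("z") == 122

# Upper and lower case differ by exactly 32 (0b00100000), so case
# conversion amounts to setting or clearing one bit
print(chr(ord("A") | 0b00100000))   # a
print(chr(ord("z") & ~0b00100000))  # Z
```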
Actually, ASCII was only the most successful of many encodings. Other countries came up with their own, and even as ASCII was being developed, IBM came out with a scheme called EBCDIC (actually a family of eight-bit encodings) for its then-dominant mainframe products.
There is simply not enough space in eight bits to encode all the characters and diacritics used in the other European languages, let alone the non-European languages, but the 128 unused values give sufficient space for the characters used in almost any one language (so long as it isn’t Chinese). Therefore, the standard approach for computing systems in non-English language environments used what came to be called “codepages.” A codepage started with the 128 character ASCII set, and used the other 128 values to encode the characters unique to the language in question. Because the control characters, numbers, and most punctuation were already covered, there was plenty of room for most languages. A very few non-Chinese writing systems, such as Vietnamese, couldn’t quite fit into the extra 128 values, and various hacks were required to wedge them in, such as taking over rarely-used values that were already committed to the original ASCII part of the table.
Eight-bit encodings were never an ideal system, but even with the complexity of scores of codepages, the system was workable when networked computers were unusual, and computers were mostly for business and scientific use. But with the rise of networks that connected computers all over the world, and the incredible diversification of applications for computers, the system became untenable. By the late 1970s a standard way to represent all of the world’s alphabets was clearly becoming necessary.
The driving idea behind Unicode, which was conceived in the late 1980s, is that every character in every language in the world would be uniquely identified by an unchanging value called a code point, which is written for humans as U+X, where X is a hexadecimal number in the allowed range. For instance, ‘A’ is U+0041, which is 65 in decimal—the same as in ASCII. The proliferation of codepages for the world’s many languages would be replaced by a single vast codepoint-to-character mapping that would be the same everywhere and for all time.
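In Python, code points are directly visible: `ord` gives the code point of a character, `\u` escapes go the other way, and every code point carries a permanent standardized name. A small sketch:

```python
import unicodedata

# A code point is just a number; 'A' is U+0041, i.e., 65 in decimal
assert ord("A") == 0x41
assert "\u0041" == "A"

# Every assigned code point also has an unchanging, standardized name
print(unicodedata.name("A"))  # LATIN CAPITAL LETTER A
print(hex(ord("\u05d0")))     # 0x5d0 (HEBREW LETTER ALEF)
```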
With codepages, the same number can mean any number of different characters, and the same character can be represented by different numbers, depending upon what language your computer assumed you were using. Only the character set used in English was constant (provided you were using one of the ANSI codepages and not EBCDIC or some other special purpose encoding).
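This ambiguity is easy to demonstrate. The sketch below decodes the single byte E9 under three different codepages (Python ships decoders for the common ones) and gets three different characters:

```python
# The single byte 0xE9 means different characters under different codepages
b = bytes([0xE9])

print(b.decode("latin-1"))  # é (Western European)
print(b.decode("cp1251"))   # й (Cyrillic)
print(b.decode("cp1253"))   # ι (Greek)
```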
A single all-encompassing alphabet for all the world’s languages and symbols is a great idea, but giving every character its own unique value implies that you need bigger numbers, which require multiple bytes, which brings up a thicket of problems. First of all, how many bytes do you need? The Unicode designers originally decided on two bytes for every character, which allows for a maximum of 65,536 distinct values. It seemed sufficient at the time, but it certainly wasn’t overly generous, given the size of the Chinese-derived character sets. So why didn’t they choose, say, four bytes, which would have room for billions—more characters than any planet could ever need? The main reason is that most alphabets fit into a single byte, which implies that for users of those alphabets, a two byte encoding already wastes half the space used for text. A four byte format would increase the proportion of meaningless zeros to about three out of four.
Unicode didn’t actually use every possible value to encode characters. A small proportion were reserved for other purposes, but the more than 60k available values seemed sufficient without imposing the bloat of a larger range that would rarely be used.
The scheme chosen to write the original Unicode was a now mostly obsolete encoding called UCS-2, short for Universal Coded Character Set, two (bytes). True to its name, it always uses exactly two bytes to write each of the code points, and this brings up a problem that does not arise with single-byte encodings.
The Endian Problem
In ASCII, the value 65 (hex 41) means ‘A’, period, on every machine. Even when using codepages, within the context of a codepage, the values above 127 are likewise unambiguous as long as the computers exchanging data agree on what codepage they are using.
With multi-byte encodings this isn’t true. Remember how we said that most machines move bytes around in groups called words? Well, machines can order the bytes within a word in two different ways. Think of how you would type out a number in hexadecimal, with the most significant digit first, and the least significant at the end. It’s so automatic that you don’t even think of it. But raw machine words are usually stored and read in the opposite direction—from smallest address to largest. A human writing down a list of hexadecimal values as stored in computer memory would write the words in ascending order, and the bytes within each number in descending order, and think nothing of it.
As it turns out, some machines arrange the bytes in the word the way humans would, with the most significant byte at the lowest address, and others reverse the order, putting the least significant byte at the lowest address. These are called, respectively, the big-endian and little-endian conventions, and most machines are one or the other, although ARM processors, for instance, can switch-hit. (The majority of processors in use today are little-endian.) Individual bits within a byte don’t have machine addresses, so endian-ness doesn’t apply, but we customarily picture them as big-endian, because that’s the way we write numbers.
Note that the endian-ness of some kinds of data files, e.g., JPG and GIF, is standardized regardless of hardware. Either way, within a given architecture there is no confusion—data is always read and written using the same convention, and the endian-ness of the computer is not an issue. However, when data is moved among machines, the possibility of byte-order reversal has to be explicitly allowed for. (This isn’t a problem unique to Unicode, or even to text in general—all multi-byte data is subject to it.)
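The two layouts are easy to see with Python’s `struct` module, which lets you request either convention explicitly (the `>` and `<` format prefixes):

```python
import struct
import sys

value = 0xFEFF

# The same 16-bit value laid out under each byte-order convention
big = struct.pack(">H", value)     # most significant byte first
little = struct.pack("<H", value)  # least significant byte first

print(big.hex())     # feff
print(little.hex())  # fffe

# sys.byteorder reports which convention the host machine uses
print(sys.byteorder)
```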
Unicode with UCS-2 wasted a significant amount of space, because every character required two bytes, but it had some virtues too. For one thing, as with ASCII, you could do arithmetic on character positions because characters were all the same width. However, even in the early 1990s it was already becoming evident that a two-byte encoding was not going to be adequate, so in 1996 the specification was expanded from 16 bits to 21 bits. 21 − 16 = 5, so 21 bits gives 2^5, or 32, times as many values for engineers to play with. For reasons that we will see below, the range of code points is smaller than the 2,097,152 distinct values in the 21-bit range. Engineering decisions made to facilitate efficient encoding while maintaining backwards compatibility result in a total of 1,112,064 usable code points. Of those, some are specifically defined as non-characters, and others are defined as being permanently available for private use. Of the remainder, around a million, only a small fraction are currently assigned, leaving far more than enough unused values for the predictable future.
The original 16-bit space had already encompassed most of the world’s living languages, as well as a lot of other odds and ends, and those mappings remained unchanged, but the enhanced scheme gave plenty of room for almost any number of ancient languages and special-purpose character sets. The original 16-bit range would thereafter be called the Basic Multilingual Plane, aka the BMP, aka Plane 0, and covers the values 0000-FFFF, i.e., all code points up to 65,535. Each of the supplementary planes covers a similar range, i.e., 10000-1FFFF, 20000-2FFFF, … , F0000-10FFFF (note that’s 10 hex, or 16 in decimal), and those are sometimes jokingly referred to as astral planes.
As the original Unicode had been backwards compatible with ASCII, the 21-bit Unicode is backwards compatible with the original code points. UCS-2 cannot express code points greater than FFFF, but a new encoding called UTF-16 was devised to express the full 21-bit Unicode. Unlike UCS-2, UTF-16 is a variable-length encoding that uses 16 bits for values in the BMP range, but uses 32 bits to handle the relatively unusual case of values outside it.
This is possible because the values D800 to DFFF, which are all within the BMP range, were set aside and do not represent valid BMP characters. A UTF-16 encoder can therefore write all of the legal BMP code points directly as 16-bit numbers, and use the 2048 values between hex D800 and DFFF to encode the values from the higher planes. This is done by dividing the 2048 values into two equal-sized sets of pseudo-characters called the high and low surrogates. These 16-bit values must always appear in pairs, one from the high set and one from the low set, for a total of 32 bits, or four bytes. As each of the two sets of surrogates has 1024 distinct values, there are a total of 1,048,576 combinations that can be used to encode the code points from the higher planes. This, by the way, explains why the maximum number of code points in Unicode is the peculiar, non-power-of-two value mentioned above: 1,112,064. Of the 65,536 distinct 16-bit values, 2048 are reserved for the surrogates, leaving 63,488. This number, added to the 1,048,576 code points that can be represented by the surrogates, gives 1,112,064.
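The surrogate arithmetic is simple enough to do by hand. This sketch splits one supplementary-plane code point into its high and low surrogates and checks the result against Python’s built-in UTF-16 encoder:

```python
import struct

# Split a supplementary-plane code point (here U+1F600) into a surrogate pair
cp = 0x1F600
offset = cp - 0x10000              # 20-bit offset into the higher planes
high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate

print(hex(high), hex(low))  # 0xd83d 0xde00

# The built-in UTF-16 encoder produces the same pair of 16-bit units
assert struct.pack(">HH", high, low) == "\U0001F600".encode("utf-16-be")
```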
A simpler but bulkier encoding, UTF-32, was also defined. UTF-32 is a fixed-length encoding, always using 32 bits, much as UCS-2 had always used 16 bits. This preserves the virtue of constant-width characters at the expense of wasting a great deal of space in typical use.
Both encodings faced the same byte-order problem that UCS-2 had. To solve it, Unicode provides a character, U+FEFF, called the Byte Order Mark (BOM), which can be placed at the beginning of a Unicode string to tell the consuming program how to interpret each pair of bytes in the subsequent stream. If the endian-ness of the file matches that of the decoder, the first two bytes of the file will be read as FEFF, regardless of whether it is big-endian or little-endian. But if the two bytes are read as FFFE, the decoder knows it must reverse the byte order as it reads, because there is no such character in Unicode as U+FFFE. Note that this works even if your code doesn’t know the byte order of the machine it’s running on. Two other closely related encodings, UTF-16LE and UTF-16BE, solve the byte-order identification problem without the BOM, by giving the byte order explicitly in the name of the encoding.
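Python’s codecs illustrate both approaches: the generic `utf-16` codec writes a BOM in the machine’s native order, while the `utf-16-le` and `utf-16-be` variants rely on the name instead and write no BOM:

```python
# Encoding with the generic 'utf-16' codec prepends a BOM in native byte order
data = "A".encode("utf-16")
print(data[:2].hex())  # fffe on a little-endian machine, feff on big-endian

# The LE/BE variants carry the byte order in the name, so no BOM is written
assert "A".encode("utf-16-le") == b"A\x00"
assert "A".encode("utf-16-be") == b"\x00A"

# A decoder can recognize either BOM, because U+FFFE is not a valid character
assert data[:2] in (b"\xff\xfe", b"\xfe\xff")
```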
The extra five bits had greatly expanded the number of available codepoints, and the new UTF-16 encodings did not make the wasted space problem much worse, but they didn’t make it any better, either.
ASCII, UCS-2, and UTF-32 write every character in the character set that they represent as its literal code point value. UTF-16 goes almost that far: it writes all of the BMP characters, including ASCII, as their literal code point values, reserving special encodings for the exotica in the higher planes.
The next encoding would be very different in spirit. UTF-8, which is now rapidly squeezing out all others, goes much further in recognizing the distinction between the alphabet and the encoding.
Like UTF-16, UTF-8 is a variable-length encoding. It writes the ASCII characters, which are the most common, as their literal one-byte values. The next most common group of characters is written with two bytes, and so on, up to a total of four bytes. The scheme is very space-efficient because, while the majority of code points would require three bytes each to express as raw numbers, the great majority of code points actually used are encoded in no more bytes than would be required simply to write them as numbers. Moreover, because UTF-8 deals with bytes one at a time, the big-endian/little-endian problem does not arise. Here is how it works:
The first byte of every character tells the decoder how many bytes to expect and which bits among those bytes will contain the code point value.
- Code points that require no more than seven significant bits to encode, i.e., those for the first 128 characters, are written as single bytes exactly equal to their ASCII values. Therefore, if all characters in a UTF-8 file are in the ASCII range of 0-127, the UTF-8 and ASCII encodings are identical. These, and only these, bytes start with a zero bit.
- Code points that require from 8 to 11 bits are encoded with two bytes. The first byte starts with 110, and the second byte starts with 10, which leaves a total of 11 usable bits.
- Code points that require from 12 to 16 bits are encoded with three bytes. The first byte starts with 1110, and the second and third each start with 10, leaving a total of 16 usable bits.
- Code points that require from 17 to 21 bits are encoded in four bytes. The first starts with 11110, and each of the others starts with 10, leaving 21 usable bits.
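The rules above fit in a few lines of code. Here is a minimal Python encoder for a single code point, checked against the built-in UTF-8 encoder with one sample character from each length class:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point following the four UTF-8 length classes."""
    if cp < 0x80:                      # up to 7 bits: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 8-11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
    if cp < 0x10000:                   # 12-16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
    # 17-21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                  0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

# Matches Python's built-in encoder for one character from each class
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```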
Because a starting byte cannot be confused with a continuation byte, a UTF-8 stream is self-synchronizing: a decoder starting from an arbitrary point in the stream need never move more than the width of the largest character to recognize that it is at the beginning of a character.
The standard also requires that no more bytes than necessary be used to encode a code point. This means that every code point can be encoded in only one way, even though in principle there would otherwise often be several.
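Such "overlong" encodings are treated as errors by conforming decoders. For instance, the two-byte sequence C0 81 would decode to U+0001 if it were allowed, and Python’s UTF-8 decoder rejects it:

```python
# b'\xc0\x81' is an overlong two-byte encoding of U+0001; decoders must reject it
raised = False
try:
    b"\xc0\x81".decode("utf-8")
except UnicodeDecodeError:
    raised = True
assert raised

# The one legal encoding of U+0001 is the single byte 0x01
assert "\u0001".encode("utf-8") == b"\x01"
```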
The economy of this encoding is rapidly making the other Unicode encodings obsolete. Since 2008, UTF-8 has been the most common encoding specified on the internet, and is usually used internally on new systems built for Unicode.
So how do you know what encoding is being used in a file?
That’s a good question. Unlike variables in a program, files don’t really have types, except by convention—the meaning of the data in a file is whatever programs assign to it. Some standard data file formats, such as JPEG, GIF, and PDF, are specified to start with a characteristic string of bytes, without which the programs that use them won’t recognize them, but that says something about the expectations of the programs that consume the file, not about the properties of the file itself. For instance, all JPEG files must start with hexadecimal FFD8, but a non-JPEG data file can easily just happen to start with the byte FF (decimal 255) followed by the byte D8 (decimal 216). Other formats, such as UTF-16, define sequences of bytes whose use is optional. A UTF-16 file can start with either hexadecimal FEFF or FFFE, for big-endian and little-endian respectively, or it can just start right in with the text. (Unicode specifies that the BOM character has no effect on the text and will never be used for anything else.)
Historically, probably the most common way to deal with the encoding question has been to punt: usually, a programmer could assume that any text data was ASCII, and if it wasn’t, it was some special case that you knew about, such as data that always arrives in EBCDIC and has to be converted. That was a pretty effective approach for many years, but the internationalization of programs, and the routine movement of data around the world, are rapidly making hiding your head in the sand less of a solution.
The real answer to the question is that there are basically four ways to find out what the encoding is:
- Someone tells you, as in: “We will be landing UTF-8 CSV data in such and such LZ directory.”
- The filename is descriptive. A very simple example is the convention of labeling ASCII text with “.txt”.
- The file is self-describing. The BOM used in UTF-16 is an example of the encoding information being carried inside the file. HTML pages are a still more complex example. A properly written HTML page has a meta tag that gives the content type (e.g., “text/html; charset=UTF-8”). This is interesting because the markup characters in HTML are all ASCII, so even if the file is, say, UTF-8, the meta tag and its contents will be readable regardless.
- You make a reasonable guess and see if the results make sense. This is actually surprisingly common. Take the example above. Not every HTML page actually has a meta tag, yet most browsers will display the pages correctly anyway, because they can usually make a fairly reliable guess based on heuristics.
There is no 100% reliable way to tell what encoding is being used, or even whether the data in a file represents text at all, particularly if the file is short. However, word-frequency analysis can often identify particular code page encodings, and there are a number of useful tricks that can answer particular questions accurately. For instance:
- If a file is presumed to be Unicode, the presence of a BOM in the first position tells you that it is UTF-16 and how to deal with the endian convention. Unfortunately, the absence of a BOM is also consistent with UTF-16.
- If a scan of the file indicates that no byte has the high-order bit set, then it’s probably ASCII. Conversely, if any bytes do have the high-order bit set, then it’s definitely not valid ASCII.
- Most UTF-8 that is not also ASCII can be recognized as follows. The first byte of every character tells you how many continuation bytes must follow: 0, 1, 2, or 3. Therefore, the sequence must conform to the repeating pattern: valid first byte, correct number of continuation bytes. Very few non-UTF-8, non-ASCII strings of significant length will just happen to match this pattern, though it is not impossible, particularly for short files.
- Browsers often use word usage statistics to determine the codepage used to encode a document.
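Putting several of these checks together gives a serviceable, though fallible, sniffer. This is a sketch, not production code; the function name and its return labels are made up for illustration:

```python
def sniff_text_encoding(data: bytes) -> str:
    """A rough guess at a text encoding, using the checks described above."""
    # A BOM in the first position marks UTF-16 and reveals the byte order
    if data[:2] in (b"\xff\xfe", b"\xfe\xff"):
        return "utf-16 (BOM present)"
    # No high bit set anywhere: probably plain ASCII
    if all(b < 0x80 for b in data):
        return "ascii (no high bit set anywhere)"
    # High bits present: see whether the bytes follow the UTF-8 pattern
    try:
        data.decode("utf-8")
        return "probably utf-8 (valid multi-byte pattern)"
    except UnicodeDecodeError:
        return "some codepage (needs frequency analysis)"

print(sniff_text_encoding(b"plain text"))
print(sniff_text_encoding("caf\u00e9".encode("utf-8")))
print(sniff_text_encoding(bytes([0xE9, 0x20, 0xE0])))
```

Note the built-in limits: a BOM-less UTF-16 file will be misclassified, and a short codepage file can accidentally pass the UTF-8 check, which is exactly why the heuristics above are guesses rather than proofs.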
The relationship of Unicode to fonts is complicated. Fonts are collections of the actual printable representations of characters, symbols, glyphs, etc., on the page or on the screen. A given alphabet (Unicode or not) may be represented by any number of fonts, and some, like the Latin alphabet, are represented by hundreds, but the fonts are not part of Unicode. Unicode specifies the mapping of code points to each of the symbols. Each font for a given alphabet may come in many sizes, plus bold, italic, etc., but the code point underlying a given letter is the same whether it’s 12 pt. Helvetica or 8 pt. Times New Roman.
Because Unicode specifies the mapping of code points to symbols, it is important to be clear exactly what symbol a given code point maps to. Therefore, for each symbol set, Unicode defines a “Code Chart” which, for each symbol in the set, gives the symbol’s name(s); if it is printable, shows a neutral version of what it looks like; describes its meanings (there may be many); and gives the encodings of related symbols. For instance, the “!” character in the ASCII symbol set is called the exclamation mark, the factorial sign, and the bang character, and is related to four other symbols, including the interrobang and the inverted exclamation mark. The code charts for all of Unicode’s many languages and symbol sets may be browsed at http://unicode.org/charts/