About This Blog

Hi, I'm Ben Pryor. This blog contains my thoughts about general software engineering topics, and occasionally specifics that I find interesting. If you see something here that sparks your interest, please feel free to comment on a post or send me an email at ben at benpryor.com.

5 August 2007 - 18:32

A Layered Conceptual Model for Character Encoding

In today’s global software industry, character encoding issues are frequently encountered on almost all software projects. One factor that causes trouble when discussing and solving character encoding issues is terminology. In order for developers to share problems and successes with each other, a generally agreed-upon set of terms is needed. Another problem is that different character encoding standards work differently. It can often be useful to abstract away the details and talk in general about how character encoding standards work. By doing this, it can be easier to understand the mechanics of a specific character encoding by fitting it into a more general framework.

You might be familiar with the OSI seven layer model for describing network protocols. Many computer science and software engineering education programs teach this model (e.g., in an introductory networking class). This model is useful because it abstracts away the specific details of particular network protocols and provides a generalized stack of layers that all protocols can be thought of as having.

A similar abstract layered model can be used to define and discuss character encoding standards. In this article, I’ll define and explain a layered model for character encoding. By learning this model, you’ll be able to more easily diagnose and solve character encoding issues, as well as gain the ability to easily understand new encodings by fitting them into an existing mental model. The model I describe here is defined by the Unicode standard in the Unicode Technical Report #17, where it is known simply as the “Character Encoding Model”. Even though this model is defined by the Unicode standard, it is a general purpose conceptual model and can be used for character encoding standards other than Unicode.

Before discussing the individual layers of the model, it’s useful to remember what character encoding is, in a very basic way. At the risk of overstating the obvious, the point of character encoding is to go from a sequence of characters to a sequence of bits. The bits could be used in memory, persisted on a disk, or transferred across a network. At some point in the future the process is reversed, and the bits are decoded back into characters. By breaking this mechanism down into abstract layers, it is easier to understand all of the different transforms involved.
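As a rough illustration of that round trip, here is a minimal Python sketch. The sample text and the choice of UTF-16BE as the encoding are my own assumptions for the example; any encoding that covers the characters would do.

```python
# Characters -> bytes -> characters, using Python's built-in codecs.
text = "Character encoding"

# Encode: the sequence of characters becomes a sequence of bytes.
data = text.encode("utf-16-be")

# Decode: at some later point the bytes are turned back into characters.
restored = data.decode("utf-16-be")

print(type(data).__name__, len(data))
print(restored == text)
```

The layers described below break the single `encode` call into separate conceptual steps.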

The first and most basic layer is an Abstract Character Repertoire. This defines a collection of characters that are described and given names. For instance, phrases like “the XYZ alphabet” and “the script used by XYZ” could be said to define character repertoires. An important aspect of character repertoires is that in no way do they define representations of characters. A repertoire simply defines member characters by naming them.

One point to note - a character repertoire assumes a definition of what a “character” is. For the purposes of this article, I’m going to hand-wave over the entire concept of character identity and what a character truly is. That is a topic that could easily make up an article all by itself. For now, whatever definition of “character” you have in your head is good enough to understand this layered model.

The second layer is a Coded Character Set. This is the first level at which actual encoding takes place: each character in the character repertoire is given a unique integer number to represent it, called a code point. Therefore, the coded character set level defines the first representation of each character, which is simply an integer. In other words, a coded character set defines a mapping from characters in a character repertoire to code points.
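To make the mapping concrete, here is a small Python sketch. The toy repertoire and the helper names (`TOY_REPERTOIRE`, `to_code_point`, `from_code_point`) are hypothetical and exist only for this example; the built-in `ord()` and `chr()` play the same role for Unicode's coded character set.

```python
# A toy coded character set: the code point of each character is simply
# its position in the repertoire list. (Hypothetical example data.)
TOY_REPERTOIRE = ["A", "B", "C", "é"]

def to_code_point(ch):
    """Map a character in the toy repertoire to its code point."""
    return TOY_REPERTOIRE.index(ch)

def from_code_point(cp):
    """Map a toy code point back to its character."""
    return TOY_REPERTOIRE[cp]

toy_points = [to_code_point(ch) for ch in "CAB"]

# Python's ord() and chr() do the same for Unicode's coded character set.
unicode_points = [ord(ch) for ch in "CAB"]

print(toy_points)      # [2, 0, 1]
print(unicode_points)  # [67, 65, 66]
```

Note that the same character gets a different code point in each coded character set, and that nothing here says how many bits either integer occupies.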

A coded character set, by its nature, defines a code space. The code space is the domain of the code points, and defines the minimum and maximum code point values. For large repertoires, it can be helpful to break the code space up into smaller sub-sections and give those sub-sections names.

Representing a sequence of characters as a sequence of numbers gets us a little closer to our ultimate goal of a sequence of bits, but it’s not a huge difference from characters. When compared to a sequence of bits, a number sequence is still a pretty abstract concept. The code points in a coded character set may have very different magnitudes (e.g., 10 vs. 10000), and a coded character set says nothing about how to represent these abstract integers as bits.

The third layer is a Character Encoding Form. A character encoding form transforms a sequence of code points into a sequence of equal-sized integers called code units. This is the first level at which bits are introduced into the encoding - code units are called equal-sized because the code unit size is expressed as a fixed number of bits. The size of code units may vary from character encoding to character encoding, but for a particular character encoding form the size is fixed. So when you hear the term n-bit character encoding, it refers to a character encoding form in which the code units are n bits long.

It’s easy to confuse the concepts of code points and code units for a few reasons. For one, they are both integers. For another, many character encodings use an identity mapping as a character encoding form, in which each code point value is equal to the code unit value. In such character encodings, the character encoding form is said to be one-to-one (i.e., one code point maps to one code unit). To remember the difference between the two concepts, keep a few things in mind. A code point is an abstract integer (e.g., 17), or just a point on some number line. A code unit is a fixed-size integer (e.g., 17 expressed as an 8-bit value or 0x11). Even though many encodings map one code point to one code unit, such a one-to-one mapping is not the case for all encoding standards.
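The difference can be seen by counting code units per code point. The sketch below is only an illustration: the helper `code_unit_count` is hypothetical, and UTF-32 is brought in here purely as an example of a one-to-one encoding form.

```python
def code_unit_count(ch, encoding, unit_size_bytes):
    """Number of code units used for one character under the given encoding."""
    return len(ch.encode(encoding)) // unit_size_bytes

# The abstract integer 17 written out as an 8-bit code unit.
eight_bit_unit = format(17, "08b")

latin_a = "A"                 # U+0041
attic_quarter = "\U00010140"  # U+10140

# UTF-32 uses 32-bit code units and is one-to-one.
utf32_counts = [code_unit_count(c, "utf-32-be", 4) for c in (latin_a, attic_quarter)]

# UTF-16 uses 16-bit code units and is not one-to-one.
utf16_counts = [code_unit_count(c, "utf-16-be", 2) for c in (latin_a, attic_quarter)]

print(eight_bit_unit)  # 00010001
print(utf32_counts)    # [1, 1]
print(utf16_counts)    # [1, 2]
```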

The fourth layer is a Character Encoding Scheme, which maps individual code unit values to specific sequences of bits. For encoding standards in which the code units are of length 8 bits or less, the character encoding scheme layer typically does nothing. For encoding standards in which the code units are longer than 8 bits, the encoding scheme must map the code unit values into a sequence of bytes. This is where endianness issues arise - in this case a character encoding scheme specifies the ordering of the sequence of bytes for a code unit value.
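A minimal sketch of this serialization step, assuming 16-bit code units; the function name `serialize_code_units` is hypothetical. The only thing that changes between the two calls is the byte order.

```python
def serialize_code_units(code_units, unit_size_bytes, byteorder):
    """Serialize fixed-size code units into bytes in the given byte order."""
    return b"".join(u.to_bytes(unit_size_bytes, byteorder) for u in code_units)

# Two 16-bit code units (the UTF-16 code units for "A" and "B").
units = [0x0041, 0x0042]

big = serialize_code_units(units, 2, "big")
little = serialize_code_units(units, 2, "little")

print(big.hex())     # 00410042
print(little.hex())  # 41004200
```

These byte sequences match what Python's own `utf-16-be` and `utf-16-le` codecs produce for the string "AB".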

The UTR #17 model also defines an optional fifth layer called a Transfer Encoding Syntax. This layer is different than the previous four layers. A transfer encoding syntax is almost always separate and orthogonal to the other four layers, and is often not specified as part of a character encoding standard but is used in addition to a defined standard. The most common use of a transfer encoding syntax is to apply some sort of post-processing to the sequence of bytes produced by the other four layers. For example, the sequence of bytes may be compressed to save space (e.g., according to an algorithm such as LZW). Or, the sequence of bytes may be further encoded by an algorithm so that it can be more easily transmitted over certain media (e.g., an algorithm like Base64).

It’s most useful to think of a transfer encoding syntax as a completely optional and separate fifth layer that can be added on to a stack of the other four layers.
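Here is a small sketch of a transfer encoding syntax applied on top of already-encoded bytes. It uses Python's standard `base64` and `zlib` modules; note that `zlib` implements DEFLATE rather than LZW, so it stands in only as an example of compression in general. The sample text is an assumption for the example.

```python
import base64
import zlib

# Bytes produced by the first four layers (here, UTF-16BE).
encoded = "AAAA AAAA AAAA AAAA".encode("utf-16-be")

# Transfer encoding 1: Base64, so the bytes survive text-only channels.
as_base64 = base64.b64encode(encoded)

# Transfer encoding 2: compression, to save space.
compressed = zlib.compress(encoded)

# Both are reversible and know nothing about characters or code points.
print(base64.b64decode(as_base64) == encoded)
print(zlib.decompress(compressed) == encoded)
```

In both cases the input and output are plain byte sequences, which is why this layer can be stacked on any character encoding.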

To summarize the layers, an abstract character repertoire defines a set of named characters. A coded character set encodes a sequence of those characters as a sequence of abstract integer code points. A character encoding form represents the character sequence as a sequence of fixed-length integer code units. Finally, a character encoding scheme represents the character sequence as a sequence of bytes.

Now that the levels have been defined, it is possible to give a slightly more precise definition of a character encoding standard. A character encoding standard specifies a stack of these four layers that when combined ultimately maps from a sequence of abstract characters in a repertoire to a sequence of bytes.

To further explain these layers, here are a few examples using several character encoding standards that many software professionals will be familiar with.

First, consider a character encoding known as windows-1252. This encoding defines a character repertoire of up to 256 characters (a handful of the 256 positions are left unassigned) that are in the Latin alphabet and used in languages such as English (primarily), French, German, etc. This encoding defines a coded character set that maps each character in the repertoire to an integer value between 0 and 255. Further, since each code point value is between 0 and 255, a very straightforward character encoding form is used in which each code point value maps to an 8-bit code unit having the same value, which is obtained by 0-padding out each integer code point value to 8 bits. The character encoding scheme layer does nothing since the code units are only 8 bits in length.

As you can see, in a very simple character encoding standard such as windows-1252, some of the layers blur together or appear to be unused. This is a reflection of the fact that more complicated character encodings exist in which those layers are more distinct.

As a second example, consider a standard in which all of the layers are easily seen - the UTF-16 standard as defined by Unicode. The character repertoire defined by Unicode is huge. Unlike most other character encoding standards, Unicode is an attempt to include all useful characters in its repertoire - spanning languages, cultures, and even history. Unicode defines a coded character set in which each Unicode character is given a code point in the range from 0 to 0x10FFFF. Unicode code point values are often written in the form “U+hexadecimal code point value” (e.g., U+0041). UTF-16 defines a character encoding form that uses 16-bit code units. Each code point maps to either one or two code units. Finally, the encoding scheme specifies how the sequences of code units should be serialized as sequences of bytes. UTF-16 is actually a family of encoding standards in which the individual standards in the family differ only in the encoding scheme (in other words, they differ in byte ordering). For example, UTF-16BE uses an encoding scheme in which code units are serialized in big-endian form.

It is also useful to observe the transformation of a character as the representation moves through the layers. Consider the character “A” as encoded by windows-1252. “A” is included in windows-1252’s repertoire, and is given the code point 65 (coded character set layer). The character encoding form layer maps code point 65 to the code unit 0x41. The character encoding scheme layer does nothing since the code unit 0x41 serializes as a single byte (01000001 in binary).
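The same walk-through can be checked with Python's `cp1252` codec, which is its name for windows-1252. This is only a sketch: the codec performs all of the layers in one call, so the intermediate values are recovered from the resulting byte rather than observed step by step.

```python
ch = "A"

# One call performs coded character set, encoding form and encoding scheme.
encoded = ch.encode("cp1252")

# With 8-bit code units the single byte is also the code unit value,
# and for this encoding the code unit equals the code point.
code_unit = encoded[0]
bits = format(code_unit, "08b")

print(code_unit)       # 65
print(hex(code_unit))  # 0x41
print(bits)            # 01000001
```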

Now consider the character U+10140 (greek acrophonic attic one quarter) as encoded by UTF-16BE. This character is an ancient Greek number character that has only historical significance. I picked it at random since I wanted a character that would map to more than one UTF-16 code unit. This character is included in Unicode’s character repertoire and given the code point U+10140 (coded character set layer). UTF-16 maps the code point U+10140 to the two code units 0xD800 0xDD40 (character encoding form layer). The UTF-16BE encoding maps the two code units to the byte sequence 0xD8 0x00 0xDD 0x40 (character encoding scheme layer).
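The two layers in this example can be written out by hand. The sketch below follows the standard UTF-16 surrogate pair arithmetic; the function names `utf16_code_units` and `utf16be_bytes` are hypothetical and chosen only to mirror the layer names.

```python
from typing import List

def utf16_code_units(code_point: int) -> List[int]:
    """Character encoding form: one code point -> one or two 16-bit code units."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return [high, low]

def utf16be_bytes(code_units: List[int]) -> bytes:
    """Character encoding scheme: 16-bit code units -> big-endian bytes."""
    return b"".join(u.to_bytes(2, "big") for u in code_units)

units = utf16_code_units(0x10140)
data = utf16be_bytes(units)

print([hex(u) for u in units])  # ['0xd800', '0xdd40']
print(data.hex())               # d800dd40
```

The result agrees with Python's built-in `utf-16-be` codec for the same character.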

By learning this layered model for character encodings, you will gain both an understanding of how character encodings work and a mental model you can apply when things go wrong. It will also be easier to discuss character encoding issues with other software professionals since you can share a common set of terms. Finally, when learning new character encodings you can easily fit them into an existing framework, comparing and contrasting them with encodings you are familiar with.
