About This Blog

Hi, I'm Ben Pryor. This blog contains my thoughts about general software engineering topics, and occasionally specifics that I find interesting. If you see something here that sparks your interest, please feel free to comment on a post or send me an email at ben at benpryor.com.

15 August 2007 - 20:49Simple Concurrency Guidelines for Designing APIs

When designing an API, one of the considerations you usually have to address is concurrency. In other words, for every class in the API, what is the class’s threading policy? At the very minimum, an API should document how it behaves with regard to concurrent access, and even better is an API designed with concurrent access in mind.

Writing a good API that is concurrency-friendly is hard. More than anything, it requires lots of reasoning about how all of the moving pieces will work together under concurrent access. In this article I’m going to discuss a few simple guidelines you can follow when designing APIs and reasoning about concurrency. The specifics of what I’ll discuss apply to Java, although the general concepts probably apply equally well to .NET and other similar environments.

The goal is, in general, to choose the design that is the easist to reason about and still has acceptable concurrent performance and API usability. Here are four guidelines that can be used as a starting point when thinking about what the threading policy of a class should be:

1) For non-collection-oriented classes, default to immutable
2) For collection-oriented classes, default to mutable and thread safe
3) Prefer collections of immutable elements
4) For collections of mutable elements, make copies and treat the contained elements as immutable internally

These guidelines will help you to design classes that make it easier for the consumer of your API to reason about concurrency policy. This reasoning is made possible by clear documentation of each class’s design for concurrent access. Failure to consider the threading policy for a class leads to under-documented APIs, unexpected behavior at runtime, and obscure bugs that are hard to reproduce. Documenting the concurrency of an API is essential in order to guide the API’s consumers to correctly using the API.

Note that documenting an API as “not thread safe” is a valid design choice. It may not be the best choice (depending on the API). However, it is better to document that a class isn’t thread safe than to make the API consumer guess or rely on undocumented behavior that could change.

In the discussion below, I make a distinction between collection-oriented classes and non-collection-oriented classes. Collection-oriented classes contain multiple similar child objects (elements). They are usually easy to recognize since they often contain methods to add elements, remove elements, find elements, and perform similar operations across the contained elements. Collection-oriented classes shouldn’t be confused with composite classes that are made by composing together dissimilar classes.

Now I’ll go over each guideline. Remember that ultimately, we are looking at a class and trying to determine what a reasonable threading policy for that class might be.

For non-collection-oriented classes, default to immutable

Immutable classes are one of the best design choices you can make when designing an API. For classes that aren’t collection-oriented, default to using an immutable design.

Only design mutable classes if it seems that API consumers would be greatly inconvenienced by the immutable classes. Usually though, immutable non-collection-oriented classes aren’t a big hassle for users.

Immutable classes have the inherent property of being safe for simultaneous access from multiple threads. You don’t have to do any internal or external locking. This is the first big advantage of immutable classes - thread safety for free and no performance hit under concurrent access.

Another big advantage of immutable classes is that when you use them, you can more easily build a mental model of the system you’re designing. Since immutable classes are very easy to reason about, it takes less mental RAM to think about how instances will behave at runtime - they can only ever be in a single state (post construction).

Immutable classes are also a great way to set API consumer expectations. One of the most frustrating things about an API is when some class, say a service of some kind, takes a mutable object as initialization data. What happens after you hand the mutable configuration data off to the service? Are you still allowed to modify the configuration data? Does the service care? Will it break? Using an immutable class design for the configuration data instead solves these problems.

Designing immutable classes is extremely easy in Java. Briefly, here are the requirements you should satisfy when designing a class in order to call it immutable:

1) All fields in the class should be declared final
2) All fields should be either immutable objects or mutable objects that are not mutated outside of a constructor
3) The this reference does not escape a constructor
4) No mutable objects passed to a constructor are retained by the instance or any of its components
5) No mutable objects escape the instance

Satisfying these requirements is not the only way to design a thread safe immutable object, but other techniques are much harder to explain and require deep knowledge of Java’s memory model.

For collection-oriented classes, default to mutable and thread safe

When designing a collection-oriented class, the default choice should be to write a mutable container that can be safely accessed by multiple threads concurrently. Most collection-oriented classes should be mutable since that’s what the API consumer will expect. When using a collection-oriented class, the most common reason is because you want to modify the collection (add or remove elements) or modify the contained elements themselves. An immutable collection-oriented object makes consumers jump through hoops to do this.

Often, collection-oriented classes in an API will be used only from a single thread for many scenarios. It can be tempting not perform the necessary locking to make these classes thread safe, since that locking will be unnecessary for the majority of usage scenarios. In this case, one valid design decision would be to skip the locking and document the collection as not being thread safe. However, doing this penalizes the users of the class who are in concurrent-access scenarios, since the burden of implementing thread safety is now on them.

I recommend going ahead and doing the internal locking to make these types of classes thread safe. On modern Java runtimes, the cost of uncontended synchronization is extremely low (JCIP talks about this in detail). Even when the common case is single-thread use, it’s better to design a class to be thread safe if there are potential concurrent scenarios. If a profile reveals that the locking is a hotspot when using the class, then you have a good reason to avoid it. Otherwise, assume that locks are essentially free when uncontended.

Although it’s not the first choice I’d make, for some APIs immutable collection-oriented classes may not be a bad decision. You get all of the advantages mentioned above for immutable classes. If API consumers will not often need to mutate the collection or the collection’s elements, doing this may make sense. For example, some collection-oriented objects are often just passed around to other parts of the API, and rarely changed. If you decide to design an immutable collection-oriented class, be sure to document this very explicitly. Also, avoid method names that imply that the receiver is being changed as they can be confusing to a casual user of the API. An immutable collection-oriented class should not have a “public void add(Foo f)” method since the signature of that method implies that it alters the receiver.

Prefer collections of immutable elements

Whenever possible, collection-oriented classes should contain immutable elements. This helps reduce confusion about the API, since it is clear that the state of the elements can’t be changed while the collection contains them. I’ve found that designs that use small, immutable objects as building blocks and contain them in mutable containers tend to be very robust and easy to understand.

The biggest reason this is important is that most non-trivial collection-oriented classes have one or more invariants that they must enforce. For instance, a collection-oriented class may store elements that each have an identifier of some sort, and the collection may guarantee that contained elements will have unique identifiers. This is just an example - often the constraints can be much more complicated. If the contained elements are externally mutable, it will be very hard or even impossible for the collection to enforce those invariants.

Mutable collections of immutable objects are also very fast to make copies of. Since the contained elements are immutable, copying the collection only involves making a shallow copy. Supporting copies of collections is important for many APIs, so anything that makes this easier and faster is a win.

For collections of mutable elements, make copies and treat the contained elements as immutable internally

Unfortunately, it’s not always possible to design using only collections of immutable elements. Often, for one reason or another, the collection must contain mutable elements. One case of this is a collection-oriented class where the elements themselves are collection-oriented. No matter the specifics, there is a design pattern you can follow in this case.

The collection should make copies of the mutable elements as they are added or retrieved, ensuring that no external clients have a reference to the actual contained instances. In other words, every time a mutable element is added to the collection, the collection makes a copy and adds the copy instead. Every time a mutable element would be obtained, a copy is obtained instead. Internally, the collection should treat the contained elements as though they were immutable and should not call any method on the elements that could change them. By doing this, the collection will contain “effectively immutable” elements. The collection is then free to enforce constraints on the contained elements and know that no external client can break the constraints.

I can attest that this kind of design works well, but it does require some careful API documentation. Without proper documentation, clients may expect that they can obtain a contained element, mutate it, and have those changes automatically show up in the collection. Instead, this kind of design facilitates a more transactional usage. Clients obtain an instance, perform some changes to it, and then must add that instance back in to the collection. This usually isn’t a huge burden on the clients as long as expectations are set correctly.

That covers the 4 guidelines. Note that these guidelines are really meant to only cover simple cases - however, the simple cases make up the bulk of most APIs. There are certainly complex classes that don’t fall easily into one of the categories above, and they will need to be designed with more thought.

No Comments | Tags: Uncategorized

5 August 2007 - 18:32A Layered Conceptual Model for Character Encoding

In today’s global software industry, character encoding issues are frequently encountered on almost all software projects. One factor that causes trouble when discussing and solving character encoding issues is terminology. In order for developers to share problems and successes with each other, a generally agreed-upon set of terms is needed. Another problem is that different character encoding standards work differently. It can often be useful to abstract away the details and talk in general about how character encoding standards work. By doing this, it can be easier to understand the mechanics of a specific character encoding by fitting it into a more general framework.

You might be familiar with the OSI seven layer model for describing network protocols. Many computer science and software engineering education programs teach this model (e.g., in an introductory networking class). This model is useful because it abstracts away the specific details of particular network protocols and provides a generalized stack of layers that all protocols can be thought of as having.

A similar abstract layered model can be used to define and discuss character encoding standards. In this article, I’ll define and explain a layered model for character encoding. By learning this model, you’ll be able to more easily diagnose and solve character encoding issues, as well as gain the ability to easily understand new encodings by fitting them into an existing mental model. The model I describe here is defined by the Unicode standard in the Unicode Technical Report #17, where it is known simply as the “Character Encoding Model”. Even though this model is defined by the Unicode standard, it is a general purpose conceptual model and can be used for character encoding standards other than Unicode.

Before discussing the individual layers of the model, it’s useful to remember what character encoding is, in a very basic way. At the risk of overstating the obvious, the point of character encoding is to go from a sequence of characters to a sequence of bits. The bits could be used in memory, persisted on a disk, or transferred across a network. At some point in the future the process is reversed, and the bits are decoded back into characters. By breaking this mechanism down into abstract layers, it is easier to understand all of the different transforms involved.

The first and most basic layer is an Abstract Character Repertoire. This defines a collection of characters that are described and given names. For instance, phrases like “the XYZ alphabet” and “the script used by XYZ” could be said to define character repertoires. An important aspect of character repertoires is that in no way do they define representations of characters. A repertoire simply defines member characters by naming them.

One point to note - a character repertoire assumes a definition of what a “character” is. For the purposes of this article, I’m going to hand-wave over the entire concept of character identity and what a character truly is. That is a topic that could easily make up an article all by itself. For now, whatever definition of “character” you have in your head is good enough to understand this layered model.

The second layer is a Coded Character Set. This is the first level at which actual encoding takes place: each character in the character repertoire is given a unique integer number to represent it, called a code point. Therefore, the coded character set level defines the first representation of each character, which is simply an integer. In other words, a coded character set defines a mapping from characters in a character repertoire to code points.

A coded character set, by its nature, defines a code space. The code space is the domain of the code points, and defines the minimum and maximum code point values. For large repertoires, it can be helpful to break the code space up into smaller sub-sections and give those sub-sections names.

Representing a sequence of characters as a sequence of numbers gets us a little closer to our ultimate goal of a sequence of bits, but it’s not a huge difference from characters. When compared to a sequence of bits, a number sequence is still a pretty abstract concept. The code points in a coded character set may have very different magnitudes (e.g., 10 vs. 10000), and a coded character set says nothing about how to represent these abstract integers as bits.

The third layer is a Character Encoding Form. A character encoding form transforms a sequence of code points in a sequence of equal-sized integers called code units. This is the first level at which bits are introduced into the encoding - code units are called equal size because each the code unit size is expressed as a number of bits. The size of code units may vary from character encoding to character encoding, but for a particular character encoding form the size is fixed. So when you hear the term n-bit character encoding, it refers to a character encoding form in which the code units are n bits long.

It’s easy to confuse the concepts of code points and code units for a few reasons. For one, they are both integers. For another reason, many character encodings use an identity mapping as a character encoding form, in which each code point value is equal to the code unit value. In such character encodings, the character encoding form is said to be one-to-one (i.e., one code point maps to one code unit). To remember the difference between the two concepts, keep a few things in mind. A code point is an abstract integer (e.g., 17), or just a point on some number line. A code unit is a fixed-size integer (e.g., 17 expressed as an 8-bit value or 0×11). Even though many encodings map one code point to one code unit, such a one-to-one mapping is not the case for all encoding standards.

The fourth layer is a Character Encoding Scheme, which maps individual code unit values to specific sequences of bits. For encoding standards in which the code units are of length 8 bits or less, the character encoding scheme layer typically does nothing. For encoding standards in which the code units are longer than 8 bits, the encoding scheme must map the code unit values into a sequence of bytes. This is where endianness issues arise - in this case a character encoding scheme specifies the ordering of the sequence of bytes for a code unit value.

The UTR #17 model also defines an optional fifth layer called a Transfer Encoding Syntax. This layer is different than the previous four layers. A transfer encoding syntax is almost always separate and orthogonal to the other four layers, and is often not specified as part of a character encoding standard but is used in addition to a defined standard. The most common use of a transfer encoding syntax is to apply some sort of post-processing to the sequence of bytes produced by the other four layers. For example, the sequence of bytes may be compressed to save space (e.g., according to an algorithm such as LZW). Or, the sequence of bytes may be further encoded by an algorithm so that it can be more easily transmitted over certain media (e.g., an algorithm like Base64).

It’s most useful to think of a transfer encoding syntax as a completely optional and separate fifth layer that can be added on to a stack of the other four layers.

To summarize the layers, an abstract character repertoire defines a set of named characters. A coded character set encodes a sequence of those characters as a sequence of abstract integer code points. A character encoding form represents the character sequence as a sequence of fixed-length integer code units. Finally, a character encoding scheme then represents the character sequence as a sequence of bytes.

Now that the levels have been defined, it is possible to give a slightly more precise definition of a character encoding standard. A character encoding standard specifies a stack of these four layers that when combined ultimately maps from a sequence of abstract characters in a repertoire to a sequence of bytes.

To further explain these layers, here’s a few examples using several character encoding standards that many software professionals will be familiar with.

First, consider a character encoding known as windows-1252. This encoding defines a character repertoire of 256 characters that are in the Latin alphabet and used in languages such as English (primarily), French, German, etc. This encoding defines a coded character set that maps each of the 256 characters in the repertoire to an integer value between 0 and 255. Further, since each code point value is between 0 and 255, a very straightforward character encoding form is used in which each code point value maps to an 8 bit code unit having the same value, which is obtained by 0-padding out each integer code point value to 8 bits. The character encoding scheme layer does nothing since the code units are only 8 bits in length.

As you can see, in a very simple character encoding standard such as windows-1252, some of the layers blur together or appear to be unused. This is a reflection of the fact that more complicated character encodings exist in which those layers are more distinct.

As a second example, consider a standard in which all of the layers are easily seen - the UTF-16 standard as defined by Unicode. The character repertoire defined by Unicode is huge. Unlike all other character encoding standards, Unicode is an attempt to include all useful characters in its repertoire - spanning languages, cultures, and even history. Unicode defines a coded character set in which each Unicode character is given a code point in the range from 0 to 0×10FFFF. Unicode code point values are often written in the form “U+hexadecimal code point value” (e.g., U+0041). UTF-16 defines a character encoding form that uses 16 bit code units. Each code point maps to either one or two code units. Finally, the encoding scheme specifies how the sequences of code units should be serialized as sequences of bytes. UTF-16 is actually a family of encoding standards in which the individual standards in the family differ only in the encoding scheme (in other words, they differ in byte ordering). For example, UTF-16BE uses an encoding scheme in which code units are serialized in big-endian form.

It is also useful to observe the transformation of a character as the representation moves through the layers. Consider the character “A” as encoded by windows-1252. “A” is included in windows-1252’s repertoire, and is given the code point 65 (coded character set layer). The character encoding form layer maps code point 65 to the code unit 0×41. The character encoding scheme layer does nothing since the code unit 0×41 serializes a single byte (1000001 in binary).

Now consider the character U+10140 (greek acrophonic attic one quarter) as encoded by UTF-16BE. This character is an ancient Greek number character that has only historical significance. I picked it at random since I wanted a character that would map to more than one UTF-16 code unit. This character is included in Unicode’s character repertoire and given the code point U+10140 (coded character set layer). UTF-16 maps the code point U+10140 to the two code units 0xD800 0xDD40 (character encoding form layer). The UTF-16BE encoding maps the two code units to the byte sequence 0xD8 0×00 0xDD 0×40 (character encoding scheme layer).

By learning this layered model for character encodings, you will gain both an understanding of how character encodings work and a mental model you can apply when things go wrong. It will also be easier to discuss character encoding issues with other software professionals since you can share a common set of terms. Finally, when learning new character encodings you can easily fit them into an existing framework, comparing and contrasting them with encodings you are familiar with.

No Comments | Tags: Uncategorized