Monday, October 17, 2005

UTF-8 Encoding

UTF-8 is a pretty nifty way to encode text as bytes: it keeps the compact 1-byte-per-character representation of the 7-bit ASCII characters, yet still covers the full range of Unicode characters (needed for most other languages and symbols).

Here's how the encoding works:

Unicode         UTF-8 Binary
--------------  ---------------------------------------
0000h-007Fh     0xxx xxxx
0080h-07FFh     110x xxxx 10xx xxxx
0800h-FFFFh     1110 xxxx 10xx xxxx 10xx xxxx
10000h-10FFFFh  1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
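The table translates into code fairly directly. Here is a minimal Python sketch of an encoder for a single code point; the function name utf8_encode is made up for illustration, and it skips the validation a real encoder does (for example, rejecting surrogates in D800h-DFFFh). The last branch, for code points above FFFFh, uses the four-byte pattern from the UTF-8 standard (1111 0xxx followed by three continuation bytes).

```python
def utf8_encode(code_point):
    """Encode one Unicode code point to UTF-8 bytes (illustrative sketch)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:
        # 0xxx xxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110x xxxx 10xx xxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 1110 xxxx 10xx xxxx 10xx xxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 1111 0xxx 10xx xxxx 10xx xxxx 10xx xxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# The result should match Python's built-in encoder.
print(utf8_encode(0x20AC) == "\u20ac".encode("utf-8"))
```

In practice you would just call str.encode("utf-8"); the point of the sketch is only to show where the x bits come from.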

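So, as the decoder is processing a stream of bytes, it first looks at the most significant bit of the next byte. If it is 0, then that byte contains the next character (ASCII range 0-127). If it is 1, then the next character spans multiple bytes: the number of leading 1 bits in the first byte gives the total byte count, every following byte starts with 10, and the x bits from the patterns shown above are concatenated to form the character.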
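To make the decoding step concrete, here is a rough Python sketch that decodes just the first character of a byte string. The name decode_first_char is hypothetical, and the sketch is deliberately lenient: it checks the continuation bytes but does not reject overlong or otherwise invalid sequences the way a production decoder must.

```python
def decode_first_char(data):
    """Return (character, bytes consumed) for the first UTF-8 character."""
    lead = data[0]
    if lead & 0x80 == 0x00:
        # 0xxx xxxx: the byte is the character (ASCII)
        return chr(lead), 1
    if lead & 0xE0 == 0xC0:
        # 110x xxxx: two bytes, 5 payload bits in the lead byte
        length, value = 2, lead & 0x1F
    elif lead & 0xF0 == 0xE0:
        # 1110 xxxx: three bytes, 4 payload bits in the lead byte
        length, value = 3, lead & 0x0F
    elif lead & 0xF8 == 0xF0:
        # 1111 0xxx: four bytes (code points above FFFFh)
        length, value = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    for byte in data[1:length]:
        if byte & 0xC0 != 0x80:
            raise ValueError("expected a continuation byte (10xx xxxx)")
        # Each continuation byte contributes 6 more bits.
        value = (value << 6) | (byte & 0x3F)
    return chr(value), length


char, used = decode_first_char(b"\xc2\xb0F")
print(hex(ord(char)), used)
```

Calling it repeatedly on the remaining bytes (data[used:]) would walk through the whole stream one character at a time.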

That's all well enough, but have you ever opened a web page that contains weird characters, like this:

"The temperature is 68Â°F"

This is indicative of an encoding problem. The content of the web page was most likely encoded using UTF-8, but the web server did not correctly inform your web browser of this fact. The result: each byte of a multi-byte UTF-8 character was displayed as a separate ANSI (Windows-1252) character.

In the case of the degree symbol in this example, the Unicode character is 00B0h. This is encoded in UTF-8 as C2B0h (1100 0010 1011 0000). But when those bytes are interpreted as ANSI characters, the result is two characters: C2h (194 = Â) and B0h (176 = °).
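You can reproduce the problem in a few lines of Python. This is only a sketch using the standard codecs: "cp1252" is Python's name for Windows-1252, and the variable names are mine.

```python
text = "The temperature is 68\u00b0F"  # \u00b0 is the degree sign

# Encode correctly as UTF-8: the degree sign becomes the bytes C2 B0.
utf8_bytes = text.encode("utf-8")
degree_hex = "\u00b0".encode("utf-8").hex()
print(degree_hex)

# Decode with the wrong encoding, as a misinformed browser would.
misread = utf8_bytes.decode("cp1252")
print(misread)

# For this string the damage is reversible: undo the wrong decode,
# then decode the original bytes as UTF-8.
repaired = misread.encode("cp1252").decode("utf-8")
print(repaired == text)
```

The round trip at the end works here because both bytes happen to map to Windows-1252 characters; that is not guaranteed for every byte value.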

Question: I have a file [or a stream] that is encoded as UTF-8. How does {insert application name here} automatically know this when the document is opened?

Have you ever opened a file with a hex editor, and found some strange bytes at the beginning that you know were not part of the content? These are the "Byte Order Mark" (BOM) bytes, and are primarily used to handle endianness issues between different CPUs. However, they also give away the encoding that is used.

For UTF-8, you will find the following Byte Order Mark:

EFh BBh BFh
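As a rough illustration of how an application might sniff the encoding from those first bytes, here is a Python sketch. The function name sniff_bom is hypothetical; the BOM constants come from Python's standard codecs module (codecs.BOM_UTF8 is the three bytes EF BB BF). A file without a BOM simply returns None, and a real application would then fall back to other hints.

```python
import codecs


def sniff_bom(data):
    """Guess a codec name from a leading Byte Order Mark, or return None."""
    # UTF-32 is checked before UTF-16 because the UTF-32 little-endian
    # BOM (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).
    candidates = [
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    ]
    for bom, codec_name in candidates:
        if data.startswith(bom):
            return codec_name
    return None


raw = codecs.BOM_UTF8 + "68\u00b0F".encode("utf-8")
codec_name = sniff_bom(raw)
print(codec_name, raw.decode(codec_name) == "68\u00b0F")
```

The "utf-8-sig", "utf-16" and "utf-32" codecs consume the BOM while decoding, so the decoded text does not start with a stray character.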


UPDATE (7/31/2006): Scott Hanselman did a Hanselminutes podcast this week on Globalization, and discussed UTF-8: