Thursday, June 14, 2012

A little something on the basics of UTF-8

Some years ago, in 2009, I wrote a couple of blog posts on the subject of Character sets and UNICODE and how that works. In a not unusual move, I included a bit on the history of character sets and stuff like that. And another recurring theme was that I promised a third part on collations, something that somehow didn't really happen. You can read these posts here and here.

Now, 3 years later, you would expect things to have moved on. But not really. Except that UTF-8 is much more pervasive these days, you really cannot escape it, don't even try! For example, not so long ago I wrote about JSON, and JSON text is UTF-8 by default and, in practice, almost always UTF-8. I will follow up on my previous posts with something on COLLATION, but not right now. Instead, I'll do a slightly deeper dive into the UTF-8 waters and explain how it works in some more detail, along with some of its neat side effects. It is pretty smart, but as it was invented by Ken Thompson (together with Rob Pike, at a restaurant), this is expected.

When you look at the UTF-8 encoding of UNICODE, there are some useful things you should know about it. This encoding is pretty smart in that characters 0 - 127 are the same as 7-bit ASCII, which makes UTF-8 easy for us oldtimers. But there is more to it than that. The first bits of the first byte of a character tell you the length of that character in bytes. If the first bit is 0, you know that this is a "single-byte" character that is the same as in 7-bit ASCII. But then comes the smart part.

UTF-8 representation explained in short

For any character which requires more than 7 bits to represent in UNICODE, the first bits in the first byte of the UTF-8 encoded character tell you the number of bytes required in the UTF-8 encoding (i.e. not the number of UNICODE bytes, but how many bytes the character makes up when represented as UTF-8). If the first two bits are 1, you know you need as many bytes as there are leading 1's. The net effect of this is that the first byte either has the first bit set to 0, in which case this is 7-bit ASCII, or has the first two bits set to 11. Or to put it differently:
  • The first byte has the pattern 0xxxxxxx - single byte UTF-8, i.e. 7-bit ASCII.
  • The first byte has the pattern 110xxxxx - Two byte UTF-8.
  • The first byte has the pattern 1110xxxx - Three byte UTF-8.
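  • The first byte has the pattern 11110xxx - Four byte UTF-8, which is as long as it gets under the current standard.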
Now, in the case where there are 2 or more bytes in the UTF-8 representation, every byte except the first has the pattern 10xxxxxx. The x bits carry the actual UNICODE value, most significant bits first. If you look at this way of representing UNICODE, you probably think that this is hardly the most compact way of representing UNICODE, and it's not. But instead it is incredibly useful, in particular when you look at the side effects (surely most of them intentional) of this representation. Let's have a look at some of them.
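To make the bit patterns above a bit more concrete, here is a minimal sketch in C. The function names utf8_seq_len() and utf8_decode() are my own, made up for illustration, they are not from any standard library. The sketch assumes reasonably well-formed input and does not reject things like overlong encodings, so treat it as an illustration rather than a proper validator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes in the UTF-8 character that starts
   with `lead`, or 0 if `lead` cannot be the first byte of a character. */
size_t utf8_seq_len(unsigned char lead)
{
    if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx: 7-bit ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* 10xxxxxx or not valid as a first byte */
}

/* Hypothetical helper: decode the single character starting at `s`.
   Returns the UNICODE value, or -1 if the bytes do not follow the scheme.
   Assumption: no check for overlong encodings or surrogates. */
long utf8_decode(const unsigned char *s)
{
    /* Payload bits of the first byte, indexed by sequence length. */
    static const unsigned char lead_mask[5] = { 0x00, 0x7F, 0x1F, 0x0F, 0x07 };
    size_t n, i;
    long cp;

    assert(s != NULL);
    n = utf8_seq_len(s[0]);
    if (n == 0) return -1;

    cp = s[0] & lead_mask[n];
    for (i = 1; i < n; i++) {
        /* Every following byte must look like 10xxxxxx. */
        if ((s[i] & 0xC0) != 0x80) return -1;
        cp = (cp << 6) | (s[i] & 0x3F);   /* append 6 payload bits */
    }
    return cp;
}
```

As an example, the two bytes 0xC3 0xA9 decode to 0xE9 (é), and the three bytes 0xE2 0x82 0xAC decode to 0x20AC (the euro sign).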

7-bit ASCII

As we have already seen, UTF-8 is fully compatible with 7-bit ASCII, which means that any character that can be represented as 7-bit ASCII has exactly the same representation in UTF-8. I have said this many times now, so maybe you are getting bored, but there is actually more to it than meets the eye.
If you look at the scheme above, you see that the first bit of ANY byte, not only the first, is NEVER 0, except when this is a single byte 7-bit ASCII character. Which means what? It means that if you pick any byte anywhere in a UTF-8 encoded string, and that byte has the first bit 0 (or in other words, a byte with a value in the range 0 - 127) you know this is a 7-bit ASCII character! It cannot be the first byte of a multi-byte UTF-8 character, NOR can it be the second or later byte in a multi-byte character (as these always have the first bit set and hence a value of 128 or higher).
The control characters we IT folks play around with, like carriage return, line-feed or the classic C-style end of string 0, are all in the 7-bit ASCII character range. Which in turn means that if you want to, say, exchange all cr/lf (or \n) for a null (or \0) in a string, the way you would do that is NO DIFFERENT with UTF-8 than with 7-bit ASCII. Sure, the bytes that are not these characters may have a different representation, but we can ignore that, it will work anyway!
This is the reason that, say, strlen(), strcat() etc still work with UTF-8. As long as you understand that strlen() will return the length of the string in bytes, not characters, it works as usual. And strcat() works exactly as before!
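Here is a small sketch in C of what this means in practice. Both function names, replace_ascii() and utf8_count_chars(), are made up for this example. The first one replaces a 7-bit ASCII byte without knowing anything about UTF-8, and the second counts characters (UNICODE code points, to be exact) by skipping every byte that looks like 10xxxxxx. I assume the string is well-formed UTF-8 and null-terminated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: replace every byte `from` with `to`, in place.
   Safe on UTF-8 as long as both are 7-bit ASCII (0 - 127), because such
   bytes never occur inside a multi-byte character. */
void replace_ascii(char *s, char from, char to)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (*s == from) *s = to;
    }
}

/* Hypothetical helper: count characters rather than bytes.
   Every character has exactly one byte that is NOT of the form 10xxxxxx
   (its first byte), so we simply count those. */
size_t utf8_count_chars(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80) n++;
    }
    return n;
}
```

With the string "sm\xC3\xB6rg\xC3\xA5s\n" (smörgås followed by a line-feed), strlen() returns 10 while utf8_count_chars() returns 8, and replace_ascii() can swap the line-feed for a space without touching the ö or the å.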

Navigating a UTF-8 string

Another thing you might want to do is navigate among the bytes in a UTF-8 string. In many schemes of variable length item compaction, you would need to start from the beginning of the string to know where you are, i.e. get the first byte, figure out the length of the item, get to the next item etc. Not so in UTF-8! The first byte in a UTF-8 character EITHER has the first bit set to 0 OR the first TWO bits set to 1! Or to put it differently, any byte that is NOT the first byte of a character has its first two bits set to 10!
So for any byte inside a string, you can always find what character it is or is part of by looking at the first bits:
  • If the first bit is 0, you are done: This is a simple 7-bit ASCII character.
  • If the first two bits are 10, this byte is part of a multi-byte UTF-8 character, so to figure out what character this REALLY represents, move "back" in the string until you find a byte which is the first byte of this character, i.e. it has the two highest bits in the byte set to 11.
  • If the first two bits are 11, then this is the first byte in a multi-byte UTF-8 character.
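The three cases above boil down to very little code. This is a minimal sketch in C, with a function name, utf8_char_start(), that I made up for the example. It assumes that i is a valid index into a well-formed UTF-8 string.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given any byte position `i` in the UTF-8 string `s`,
   return the position of the first byte of the character that byte
   belongs to. */
size_t utf8_char_start(const char *s, size_t i)
{
    assert(s != NULL);
    /* Bytes of the form 10xxxxxx are never first bytes, so move back
       until we see 0xxxxxxx or 11xxxxxx (or hit the start of the string). */
    while (i > 0 && ((unsigned char)s[i] & 0xC0) == 0x80) {
        i--;
    }
    return i;
}
```

In a well-formed string the loop never moves back more than 3 bytes, so there is no need to scan from the beginning of the string. For the string "a\xE2\x82\xAC" "b" (an a, a euro sign and a b), positions 1, 2 and 3 all give 1, which is where the euro sign starts.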

In conclusion

Complex? Well, yes, but much less so than if we had to convert all 7-bit ASCII to full UNICODE! None of the old functions would work. Our old code would break completely! The issue with UTF-8 is that it is a bit too smart for its own good: we IT folks get a bit lazy and care less about the cases where we really DO need to do special processing to support UTF-8, as it is so compatible and mostly works like 7-bit ASCII, at least for the intents and purposes of your average programming project. But maybe you know a bit more now.

Cheers
/Karlsson
