Ken Thompson |
When you look at UTF-8 encoding of UNICODE, there are some useful things you should know about it. This encoding is pretty smart in that characters 0 - 127 are the same as 7-bit ASCII, which makes UTF-8 easy for us oldtimers. But there is more to it than that. The first bits in every byte tells you the length of the character. If the first bit is 0, you know that this is a "single-byte" character that is the same as in 7-bit ASCII then. But then comes the smart part.
UTF-8 representation explained in short
For any character which requires more than 7 bits to represent in UNICODE, the first bits in the first byte of a UTF-8 encoded string tells you the number of bytes required in the UTF-8 encoding (i.e. not the number of UNICODE bytes, but how many bytes the character makes up when represented as UTF-8). If the two bit are 1, you know you need as many bytes as there are leading 1's. The net effect of this is that the first byte either has the first bit set to 0, in which case this is 7-bit ASCII, or that the two first bits are 11. Or to put it differently:- The first byte has the pattern 0xxxxxxx - single byte UTF-8, i.e. 7-bit ASCII.
- The first byte has the pattern 110xxxxx - Two byte UTF-8.
- The first byte has the pattern 1110xxxx - Three byte UTF-8.
- etc.
7-bit ASCII
As we have already seen, UTF-8 is fully compatible with 7-bit ASCII, which means that any character that can be represented as 7-bit ASCII has exactly the same representation in UTF-8. I have said this many times now, so maybe you are getting bored, but it is actually more to it than meets the eye.If you look at the scheme above, you see that the first bit of ANY byte, not only the first, is NEVER 0, unless when this is a single byte 7-bit ASCII character. Which means what? It means that if you pick any byte anywhere in a UTF-8 encoded string, and that byte has the first bit 0 (or in other words, a byte with a value in the range 0 - 127) you know this is a 7-bit ASCII character! It cannot be the first byte of a multi-byte UTF-8 character and NOR can it be the second or later byte in a multi-byte character (as these always has the first bit set hence has a value of 128 or higher).
The control characters we IT folks play around with, like carriage return, line-feed or the classic C-style end of string 0, are all in the 7-bit ASCII character range. Which in turns means that you you want to, say, exchange all cr/lf (or \n) for a null (or \0) in a string, the way you would do that is NO DIFFERENT with UTF-8 than with 7-bit ASCII. Sure, any bytes excluding these characters have different representation, but we can ignore that, it will work anyway!
This is the reason that, say, strlen(), strcat() etc still work with UTF-8. As long as you understand that strlen() will return the length of the string in bytes, not characters, it works as usual. And strcat() works exactly as before!
Navigating a UTF-8 string
Another thing you might want to do is navigate among the bytes in a UTF-8 string. In many schemes of variable length item compaction, you would need to start from the beginning of the string to know where you are, i.e. get the first byte, figure out the length of the item, get to the next item etc. Not so in UTF-8! The first byte in UTF-8 character EITHER has the first bit set to 0 OR the first TWO bits set to 1! Or to put it differently, any byte that is NOT the first byte of a character has the two first bits set to 11!So for any byte inside a string, you can always find what character it is or is part of by looking at the first bits:
- If the first bit is 0, you are done: This is a simple 7-bit ASCII character.
- If the first two bits are 10, this byte is part of a multi-byte UTF-8 character, so to figure out what character this REALLY represents, move "back" in the string until you find a byte which is the first byte of this character, i.e. it has the two highest bits in the byte set to 11.
- If the first two bits are 11, then this is the first byte in a multi-byte UTF-8 character.
In conclusion
Complex? Well, yes, but much less so that if we had to convert all ASCII-7 to full UNICODE! None of the old functions would work. Our old code would break completely! The issue with UTF-8 is that it is a bit too smart for it's own good, us IT folks gets a bit lazy and care less about the cases where we really DO need to do special processing to support UTF-8, as it is so compatible and mostly works, like 7-bit ASCII, at least for the intents and purposes for your average programming project. But many you know a bit more now.Cheers
/Karlsson
No comments:
Post a Comment