Friday, September 27, 2013

How MariaDB and MySQL makes life with UTF-8 a bit too easy. And how to fix it...

UNICODE is getting more and more traction and most new applications, at least web applications, support UNICODE. I have written about UNICODE and related stuff before in Character sets, Collations, UTF-8 and all that but before I go into some more specific and some issues, and fixes, let me tell you about UNICODE, UTF-8 and how MySQL interprets it. See the blogpost linked to above for more information on the subject, surprisingly even more boring, on Collations.

So, let's begin with UNICODE. UNICODE is a character set that is very complete, you should be able to make yourself understood in any language using the characters from this vast character set. This is not to say that all characters from all languages are in UNICODE, some are missing here and there and sometimes new characters make their way into languages, like different currency signs and stuff like that. In general though, you can use all the more than 100.000 characters in UNICODE to express text in any language.

So, then we know UNICODE; what is UTF-8, UTF-16 and all that? Those are ways to represent the bytes that make up UNICODE. The smartest, and most common, of these, by far, is UTF-8. To represent the highest byte-value of a UNICODE character, you would need 3 bytes for every character, but this is wasting space. Also, this would mean that all old program code that assumes that a character is 1 byte only (with 7 bits representing the good old US ASCII characters, but there are local variations of this, or interpreting up to 8 bits as latin1). Ken Thompson fixed this for us though with the UTF-8 representation.

UTF-8 is variable length, and the way this works is that if the first (highest) bit is 0, then this is a character in the range 0-127. If you stick with US ASCII, you can say this is actually UNICODE represented as UTF-8, as the 128 "first" caharacters in UNICODE are the same as 7-bit US ASCII. No issues there. When there are more than 1 byte to the character, then at least the first bit is 1 and the number of leading 1's, up to the first 0, tells how many following bytes there are. So the first byte either starts with 0, in which case there is 1 byte only, or with 110 (10 would mend 1 byte, but that starts with a 0). This in turn means that the first byte never starts with the bit pattern 10, so 10 is used as the leading bit pattern for any of following bytes. Using this scheme you can't represent as many bytes as you want, as you run out of 1's in the first byte, but enough bytes for UNICODE at least! And it's US ASCII compatible.

Potentially you should be able to have 8 bytes in total (all bits set to one in the first byte and then 7 following bytes). As the following bytes all start with 10, only 6 bits are significant, and as we have used up all the significant bits in the first byte, this means that we can now represent 7 * 6 = 42 bits of data or 5 bytes + 2 "extra" bits. But the scheme was never intended to go this far, originally it was envisioned that UNICODE might use all up to 6 bytes in the UTF-8 encoding (this doesn't mean that we would have that many characters, not all code-points are used as characters are organized in separate codepages). Eventually it was agreed that UNICODE would only use up to 21 bits or slightly less than 3 bytes, which encoded as UTF-8 requires up to 4 bytes.

So, now back to good old MySQL. MySQL does support UNICODE and it supports UTF-8, right, so we are all set then? Nope, I'm afraid not. You might have seen that MySQL makes space 3 * in a CHAR or VARCHAR column, to make sure that a 3-byte UTF-8 character can fit. In a VARCHAR that might not be significant, as it is variable length anyway? Right? Nope, depending on the MySQL Storage Engine, VARCHAR is actually allocated to the full potential length. OK, but you use InnoDB anyway, which doesn't behave like this, so you are in the clear? Again, I have to disappoint you, unless you never sort on that column, you need consider this as sorting involves sorting the potential length of data! Also, if you use the NDB Stoarge Engine and use Disk Data, you have the same issue (see: http://dev.mysql.com/doc/refman/5.6/en/mysql-cluster-disk-data-storage-requirements.htm).

But there is another thing that might have caught your attention in the above section. I write that MySQL allocated 3 bytes per UTF-8 encoded UNICODE character, but I also say that UNICODE, when encoded as UTF-8, might occupy up to 4 bytes? What is going on is that the MySQL UTF-8 encoding is limited to 3 bytes, which means that MySQL can, when using UTF-8, not represent the whole UNICODE character set, specifically, it is limited to 4 + 6 + 6 = 16 bits or 2 bytes. This might not be a big deal, but it depends on your situation. I was reading up on this on Wikipedia for example, and it says "Three bytes are needed for characters in the rest of the Basic Multilingual Plane (which contains virtually all characters in common use[11]). Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters and various historic scripts and mathematical symbols." but this isn't altogether true, but someone who implemented the original UTF-8 probably read something like this.

What one has to understand is that UNICODE isn't only about languages, it's also about symbols, such as currency signs etc. But again, most of this also lives in UTF-8 3-byte space (even new ones do, as there is unused space in UNICODE). But then Facebook and Twitter comes around and screws it all up and nothing works again :-( suddenly the use of Smileys exploded. And this is where trouble starts. The simple smileys, like :-) all fit within 2 bytes of UNICODE, but the more advanced ones do not! Oh my!

There is help on the way though, as MySQL also supports a full UTF-8 encoding called (in MySQL lingo then) utf8mb4. Now you know what that is used for!

But I'm going to close here, and follow up real soon with a blog specifically on using utf8mb4. Fun, isn't it!
/Karlsson

No comments: