Friday, September 27, 2013

Using 4-byte UTF-8 (aka 3-byte UNICODE) in MariaDB and MySQL

As I wrote in a previous post, MariaDB / MySQL has some issues with the standard UTF-8 encoding there. This UTF-8 encoding limits us to 3 UTF-8 bytes or 2 UNICODE bytes if you want to look at it that way. This is slightly limiting, but for languages it is usually pretty much OK, although there are some little used languages in the 3 byte UNICODE range. But in addition to languages, you will be missing symbols, such as smileys!

Help is on the way though, in the utf8mb4 character set that is part of both MariaDB and MySQL. This is a character set that is just like the one just called utf8, except this one accepts all the UNICODE characters with up to 3 UNICODE bytes, or 4 bytes using the UTF-8 encoding.

This means that there are more limits to how long a column might be when using utf8mb4 compared to latin1 and even when compared to utf8. Let's try it:

First, I don't want MySQL to substitute any data types for me (why I would EVER want that is beyond me):
mysql> set sql_mode='strict_all_tables';
Query OK, 0 rows affected (0.00 sec)

Then we can create a table to hold a message using the latin1 character set:
mysql> CREATE TABLE msg(msgid INT NOT NULL PRIMARY KEY AUTO_INCREMENT, msgtext VARCHAR(65528) CHARACTER SET latin1);
Now, we know this isn't terribly international, so lets go with UTF-8 instead:
mysql> DROP TABLE msg;
Query OK, 0 rows affected (0.12 sec)

mysql> CREATE TABLE msg(msgid INT NOT NULL PRIMARY KEY AUTO_INCREMENT, msgtext VARCHAR(65528) CHARACTER SET utf8);
ERROR 1074 (42000): Column length too big for column 'msgtext' (max = 21845); use BLOB or TEXT instead

OK, that didn't work, so we'll reduce the size of the msg column and try again:
mysql> DROP TABLE msg;
Query OK, 0 rows affected (0.05 sec)

mysql> CREATE TABLE msg(msgid INT NOT NULL PRIMARY KEY AUTO_INCREMENT, msgtext VARCHAR(21842) CHARACTER SET utf8);
Query OK, 0 rows affected (0.29 sec)

OK, so now we have a table with a UTF-8 column in it. Let's insert some data into it:
mysql> INSERT INTO msg VALUES(NULL, 'Some test message');
Query OK, 1 row affected (0.04 sec)

mysql> INSERT INTO msg VALUES(NULL, 'Some test message with utf8 åäö');
Query OK, 1 row affected (0.01 sec)

Right. That was OK. But what happens if we insert a 3-byte UTF-8 character into this table, say a "smiling face with horns" which is UNICODE 01F608 or as UTF-8 F09F9888. We do this using the MySQL UNHEX function:
mysql> INSERT INTO msg VALUES(NULL, CONCAT('Some test message with 3-buyte UNICODE ', UNHEX('F09F9888')));
ERROR 1366 (22007): Incorrect string value: '\xF0\x9F\x98\x88' for column 'msgtext' at row 1


Hmm, too bad, back to the drawing board and let's use the utf8mb4 character set:
mysql> DROP TABLE msg;
Query OK, 0 rows affected (0.08 sec)

mysql> CREATE TABLE msg(msgid INT NOT NULL PRIMARY KEY AUTO_INCREMENT, msgtext VARCHAR(21842) CHARACTER SET utf8mb4);
ERROR 1074 (42000): Column length too big for column 'msgtext' (max = 16383); use BLOB or TEXT instead

Close, but no cigar I guess. We need to further reduce the size of the msg column, as each utf8mb4 character potentially takes up 1 more byte than when using utf8. So try this instead:
mysql> CREATE TABLE msg(msgid INT NOT NULL PRIMARY KEY AUTO_INCREMENT, msgtext VARCHAR(16382) CHARACTER SET utf8mb4);
Query OK, 0 rows affected (0.29 sec)

OK, this worked better, now let's try that insert again:
mysql> INSERT INTO msg VALUES(NULL, CONCAT('Some test message with 3-byte UNICODE ', UNHEX('F09F9888')));
Query OK, 1 row affected (0.00 sec)

Cool, now we have some data in there, let's then just check if this really is a 4-byte UTF-8 encoded character in there.
mysql> SELECT LENGTH(msgtext), CHAR_LENGTH(msgtext) FROM msg;
+-----------------+----------------------+
| LENGTH(msgtext) | CHAR_LENGTH(msgtext) |
+-----------------+----------------------+
|              43 |                   40 |
+-----------------+----------------------+
1 row in set (0.00 sec)

As can be seen, length(which returns the length of the column in bytes) returns a value 3 longer than the number of characters, which is just right!

So in conclusion, as utf8mb4 really is a superset of utf8, should I always use utf8mb4? No, you shouldn't, as in most cases you don't need those 3-byte UNICODE characters in strings for names, places, companies etc. But in this case, when it comes to some free format text that the user enters, then it might be useful. Also, as I have pointed out before, even though this is VARCHAR, which is variable length, there are situations where the full possible length is used, such as when the table data is sorted. And finally, as for collations, these are slightly different than those for utf8, but they are all there, like utf8mb4_bin, utf8mb4_general_ci, utf8mb4_unicode_ci etc.

I'm really bored with character sets and UNICODE and stuff like that now, so my next blog will NOT be about the fascinating subject of Byte Order Marks (BOM).

/Karlsson

No comments: