Friday, September 27, 2013

Using 4-byte UTF-8 (aka 3-byte UNICODE) in MariaDB and MySQL

As I wrote in a previous post, MariaDB / MySQL has some issues with the standard UTF-8 encoding there. This UTF-8 encoding limits us to 3 UTF-8 bytes or 2 UNICODE bytes if you want to look at it that way. This is slightly limiting, but for languages it is usually pretty much OK, although there are some little used languages in the 3 byte UNICODE range. But in addition to languages, you will be missing symbols, such as smileys!

Help is on the way though, in the utf8mb4 character set that is part of both MariaDB and MySQL. This is a character set that is just like the one just called utf8, except this one accepts all the UNICODE characters with up to 3 UNICODE bytes, or 4 bytes using the UTF-8 encoding.

This means that there are more limits to how long a column might be when using utf8mb4 compared to latin1 and even when compared to utf8. Let's try it:

First, I don't want MySQL to substitute any data types for me (why I would EVER want that is beyond me):
mysql> set sql_mode='strict_all_tables';
Query OK, 0 rows affected (0.00 sec)

Then we can create a table to hold a message using the latin1 character set:
mysql> CREATE TABLE msg(msgid INT NOT NULL PRIMARY KEY AUTO_INCREMENT, msgtext VARCHAR(65528) CHARACTER SET latin1);
Now, we know this isn't terribly international, so lets go with UTF-8 instead:
mysql> DROP TABLE msg;
Query OK, 0 rows affected (0.12 sec)

mysql> CREATE TABLE msg(msgid INT NOT NULL PRIMARY KEY AUTO_INCREMENT, msgtext VARCHAR(65528) CHARACTER SET utf8);
ERROR 1074 (42000): Column length too big for column 'msgtext' (max = 21845); use BLOB or TEXT instead

OK, that didn't work, so we'll reduce the size of the msg column and try again:
mysql> DROP TABLE msg;
Query OK, 0 rows affected (0.05 sec)

mysql> CREATE TABLE msg(msgid INT NOT NULL PRIMARY KEY AUTO_INCREMENT, msgtext VARCHAR(21842) CHARACTER SET utf8);
Query OK, 0 rows affected (0.29 sec)

OK, so now we have a table with a UTF-8 column in it. Let's insert some data into it:
mysql> INSERT INTO msg VALUES(NULL, 'Some test message');
Query OK, 1 row affected (0.04 sec)

mysql> INSERT INTO msg VALUES(NULL, 'Some test message with utf8 åäö');
Query OK, 1 row affected (0.01 sec)

Right. That was OK. But what happens if we insert a 3-byte UTF-8 character into this table, say a "smiling face with horns" which is UNICODE 01F608 or as UTF-8 F09F9888. We do this using the MySQL UNHEX function:
mysql> INSERT INTO msg VALUES(NULL, CONCAT('Some test message with 3-buyte UNICODE ', UNHEX('F09F9888')));
ERROR 1366 (22007): Incorrect string value: '\xF0\x9F\x98\x88' for column 'msgtext' at row 1


Hmm, too bad, back to the drawing board and let's use the utf8mb4 character set:
mysql> DROP TABLE msg;
Query OK, 0 rows affected (0.08 sec)

mysql> CREATE TABLE msg(msgid INT NOT NULL PRIMARY KEY AUTO_INCREMENT, msgtext VARCHAR(21842) CHARACTER SET utf8mb4);
ERROR 1074 (42000): Column length too big for column 'msgtext' (max = 16383); use BLOB or TEXT instead

Close, but no cigar I guess. We need to further reduce the size of the msg column, as each utf8mb4 character potentially takes up 1 more byte than when using utf8. So try this instead:
mysql> CREATE TABLE msg(msgid INT NOT NULL PRIMARY KEY AUTO_INCREMENT, msgtext VARCHAR(16382) CHARACTER SET utf8mb4);
Query OK, 0 rows affected (0.29 sec)

OK, this worked better, now let's try that insert again:
mysql> INSERT INTO msg VALUES(NULL, CONCAT('Some test message with 3-byte UNICODE ', UNHEX('F09F9888')));
Query OK, 1 row affected (0.00 sec)

Cool, now we have some data in there, let's then just check if this really is a 4-byte UTF-8 encoded character in there.
mysql> SELECT LENGTH(msgtext), CHAR_LENGTH(msgtext) FROM msg;
+-----------------+----------------------+
| LENGTH(msgtext) | CHAR_LENGTH(msgtext) |
+-----------------+----------------------+
|              43 |                   40 |
+-----------------+----------------------+
1 row in set (0.00 sec)

As can be seen, length(which returns the length of the column in bytes) returns a value 3 longer than the number of characters, which is just right!

So in conclusion, as utf8mb4 really is a superset of utf8, should I always use utf8mb4? No, you shouldn't, as in most cases you don't need those 3-byte UNICODE characters in strings for names, places, companies etc. But in this case, when it comes to some free format text that the user enters, then it might be useful. Also, as I have pointed out before, even though this is VARCHAR, which is variable length, there are situations where the full possible length is used, such as when the table data is sorted. And finally, as for collations, these are slightly different than those for utf8, but they are all there, like utf8mb4_bin, utf8mb4_general_ci, utf8mb4_unicode_ci etc.

I'm really bored with character sets and UNICODE and stuff like that now, so my next blog will NOT be about the fascinating subject of Byte Order Marks (BOM).

/Karlsson

4 comments:

Peter Laursen said...

I have an old bug report here:
http://bugs.mysql.com/bug.php?id=62458

The database backend of bugs.mysql.com obviously stores strings in utf8 and not utf8mb4 columns.

So you cannot reporrt issues with utf8mb4 in MySQL to bugs.mysql.com (unless attaching text files)!

This deserves a smiley :0}

jhon said...

What is a Payday/Cash Beforehand Loan? It is aloof what it sounds like. A actuality can go to a accommodation provider and booty a accommodation out today adjoin approaching earnings/wages. Sounds appealing acceptable doesn't it? Don't be so quick to agree. It is not as simple as it sounds. Area does this money appear from? My parents acclimated to consistently say "Money doesn't abound on trees.", aback I asked for commodity not in our budget.
Check Cashing Corona

jhon said...

The administration of the accommodation is about abbreviate and the amounts provided will additionally not be actual high. But, usually the amount of absorption is about college in allegory to abounding added loans. For this reason, the serviceman or woman should anxiously counterbalance all the options that are accessible to them and to analysis on the pros and cons of anniversary one of them afore finalizing on the ideal accommodation to accommodated their requirements yet not be a accountability on their pockets.
Payday Loans

Shan Jonson said...

If your gas or electric bill is due and you don't accept a paycheck advancing until abutting week, you may demand to accede a payday accommodation to pay your account bill and accomplish abiding that you don't lose your account account because of non-payment. This will additionally accumulate your acclaim address chargeless of backward acquittal marks.
check cashing places