Thursday, October 8, 2009

Of UNICODE, UTF-8, Character sets part 1

Why would you care about UNICODE? Come on now, most people can read English, and English can be written using only 7-bit ASCII, so who needs more? Well, I think it's safe to say that the Internet (remember that? Netscape, WWW, .com booms, pet food on the net etc) changed all that. Now applications can be found and run everywhere by anyone, more or less, so even if the application speaks English, and even if the user does, you may end up with users inputting data using some other character set.

For someone like myself, having grown up in a "beyond A-Z" part of the world (Sweden, which is one of the easy cases), I can tell you how annoying it is when I input my address on some webpage (this happens even on Swedish websites) using some Swedish characters (I've got 2 of the 3 beyond A-Z characters in the name of the street where I live), and it comes out looking like someone just smashed a fly in a prominent place in the name of my street.

For a developer, this is difficult. Having someone test it is bad enough. And then we have things like localized keyboards (I've got one of them), printers, OCR software etc. With that in mind, I plan to write a few blog posts on character sets, UNICODE and stuff like that.

Before I start this first post though, let me tell you that although I am pretty familiar with the subject, I'm far from an expert, so you may well catch an error or two. Please help me correct them if you find any!

So, that said, where shall we start? Well, let's begin with some basics, way back in the Reagan administration. The 7-bit ASCII character set was what was used all over the place when I started in this industry. The only competition was EBCDIC, but that was IBM mainframe only. This was in the early 1980s, but even then, we needed to use Swedish characters sometimes (I am a "Swedish character" myself, I guess), and in the 7-bit ASCII world, this was handled by replacing some lesser used punctuation marks with the 6 Swedish characters (å, ä, ö and the upper case versions Å, Ä and Ö). This was an issue for a C developer, as I was back then, as the punctuation marks that got replaced were the pipe, the backslash, and the curly and square brackets! Yikes! Usually, you could put your VT220 terminal in US ASCII mode, but when you printed, the printer was often shared with office workers, meaning that the printouts often looked like:
main(int argc, char *argvÄÅ)
printf("Hello WorldÖn");
Well, you see what I mean, quite unreadable, looks even worse than a Python script. I was about to write that I might have gotten the above slightly wrong, as it was a long time since I last used this, but then I decided to look it up, and it turned out I actually had it all right, which goes to show that this was something you really had to learn if you were writing code in C here in Sweden back then in the Stone Age.
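If you want to see the damage for yourself, here is a minimal Python sketch of that substitution. The table and the function name to_swedish_7bit are made up for illustration, and the sketch covers only the six positions mentioned above, so treat it as a toy under those assumptions and not as a full definition of the Swedish 7-bit variant.

```python
# Hypothetical illustration: the six code positions where the Swedish
# 7-bit variant shows national letters instead of US-ASCII punctuation.
SWEDISH_7BIT = str.maketrans({
    "[": "Ä",
    "\\": "Ö",
    "]": "Å",
    "{": "ä",
    "|": "ö",
    "}": "å",
})


def to_swedish_7bit(us_ascii_text: str) -> str:
    """Show how US-ASCII text looks on a device set to the Swedish variant."""
    return us_ascii_text.translate(SWEDISH_7BIT)


if __name__ == "__main__":
    # The two lines from the printout above, as they were actually typed.
    print(to_swedish_7bit("main(int argc, char *argv[])"))
    print(to_swedish_7bit('printf("Hello World\\n");'))
```

Nothing is converted here in the real sense: the bytes stay the same, and only the glyph the terminal or printer draws for them changes. That is exactly why a shared printer could ruin a perfectly good C source file.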

Now, time went on (well, actually, it didn't. 7-bit Swedish ASCII is still in use out there, quite a bit in homebrew ERP systems and stuff like that), and the next step was support for all (or most) of the western world characters in one character set. And so the 8-bit "extended ASCII" set was born. This was pretty good, actually, and was pioneered mostly by the DEC VT220 terminal and then spread. There were still some variations of the 8-bit character set, but they were much fewer. The most common, by far, is the ISO 8859-1 character set, which contains most characters used in major western world common languages.

Why do I use such weird language here, you ask? Why "major western world common languages", why do I not just say "western world languages"? Because that would be incorrect, that's why. Take my native Sweden for example. I think most Swedes will agree that 8859-1 contains all characters used in the official Swedish language, and that there is just one such language. And this just isn't true, I'm afraid. Neither 8859-1 nor any of the other commonly used 8859 variations cover all of the special characters in the 4 (I think there are 4, where 3 are sort-of common and used) Sami languages / dialects.

8859-1 has a few variations (I know, I know, this is getting boring. ALL these character sets have variations). One such is 8859-15, which, among other things, contains the Euro symbol. 8859-1 also has another name, which should be well known to you MySQLers: latin-1! And what about Windows? Windows uses codepages (cp), and cp1252 is the one used by non-UNICODE Windows variations in most of the western world. And cp1252 is the same as 8859-1, right? Nope, it's not, but for most practical purposes, it can be treated as being so.

So what is the difference between cp1252 and ISO-8859-1, you ask? The difference lies in something that hardly anyone uses anymore, namely the control characters. CP1252 contains only the non-printable characters used in 7-bit ASCII in the range 0-31, whereas 8859-1 and -15 also have some control characters in the range 128-159. In the latter range, CP1252 has some real characters.
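To make that range concrete, here is a small Python sketch that decodes every byte value with both character sets and lists the ones that disagree. It relies on the "cp1252" and "latin-1" codecs that ship with Python; the function name differing_bytes is just made up for this example. One assumption worth knowing about: Python's cp1252 codec treats a handful of positions as unassigned and raises an error for them, and the sketch counts those as differences too.

```python
def differing_bytes() -> list[int]:
    """Return every byte value that CP1252 and ISO 8859-1 decode differently."""
    result = []
    for value in range(256):
        raw = bytes([value])
        latin1_char = raw.decode("latin-1")  # defined for all 256 byte values
        try:
            cp1252_char = raw.decode("cp1252")
        except UnicodeDecodeError:
            # A few positions are unassigned in Python's cp1252 codec.
            cp1252_char = None
        if cp1252_char != latin1_char:
            result.append(value)
    return result


if __name__ == "__main__":
    diff = differing_bytes()
    print(f"{len(diff)} byte values differ, "
          f"from {diff[0]:#04x} to {diff[-1]:#04x}")
    # Byte 0x80: a printable character in CP1252, a control character in 8859-1.
    print(repr(b"\x80".decode("cp1252")), repr(b"\x80".decode("latin-1")))
```

Everything outside 128-159 comes out identical, which is why treating the two as the same works so often.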

This difference is due to ISO 8859-1 being so much older, from the days when we actually used control characters (do you youngsters reading this even know what these are? If not, ask your granddaddy). But besides this, they are the same. Web pages typically use 7-bit ASCII (very old pages do), 8859-1 or UTF-8 (other variations DO exist, but these are the most common ones). This means that a page using 7-bit ASCII or 8859-1 can be displayed on Windows using CP1252, as 1252 just adds characters in a control character range, and control characters aren't used on a web page (except the basic LF, CR/LF, LF/CR and ... NO, don't get me started on THAT, for god's sake!).

So along comes 8859-15, which builds on 8859-1, but adds the Euro sign, among a few other things. And as CP1252 was already in wide use, and as 8859-1 was largely compatible with CP1252 for all practical uses, and because no one in their right mind uses control characters much anymore, the committee defining 8859-15 was smart enough to put the additional characters in the same place as the existing ones in CP1252 (the Euro sign is a good example, CP1252 contains the Euro sign in the upper control character range). HA HA HA Got you there. This is ISO, a bunch of smart people, of course they would not put the Euro sign in 8859-15 in the same place as it was in CP1252! The effect is, I think, that most people who think they use 8859 actually use CP1252 (as the Euro sign is used more and more, and the 1252 encoding of it is probably better known).
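A quick Python sketch of where the Euro sign ends up in each of the three character sets, again using only the codecs that ship with Python. The helper name euro_byte is made up for this example.

```python
EURO = "\u20ac"  # the Euro sign


def euro_byte(encoding: str):
    """Return the byte used for the Euro sign, or None if the set lacks it."""
    try:
        return EURO.encode(encoding)
    except UnicodeEncodeError:
        return None


if __name__ == "__main__":
    for name in ("latin-1", "iso8859-15", "cp1252"):
        print(name, euro_byte(name))
    # Read the 8859-15 Euro byte back as CP1252 or 8859-1 and you get
    # the generic currency sign instead.
    print(repr(euro_byte("iso8859-15").decode("cp1252")))
```

So data written as 8859-15 and read as CP1252 (or the other way around) will not fail loudly. The Euro amounts just quietly pick up the wrong symbol.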

OK, so this is a mess. You understand that by now, I think. It's not just me who is a mess, the whole character set thing is. Luckily UNICODE will fix that, so more on that in the next post on this subject (and if you believe that UNICODE will fix this and stop the controversy, let me tell you about a New York bridge that I can get you a real good deal on). And also something on collations. What are those? And why? And what happened to the squirrel? We'll be right back, so don't touch that dial!

AKA The Swedish character


rpbouman said...

Hi Anders,

great post. And, death to trolls like the dude before me....

anyway, just a remark:

"Neither 8859-1 or any of the other 8859 variations cover any of the special characters in the 4 (I think there are 4, where 3 are sort-of common and used) sami languages / dialects."

be careful with the references. You probably meant to write: *ISO-*8859-etc.

I mean, really, if there would be 8859 variations (as in, the amount, not the number as a name), it would not have become so popular :)

Karlsson said...

You are right, of course. I think the reason I wrote as I did was to try to "lighten up" a boring subject by trying to use some synonyms, which is different anytime you write about standards. I'll just try not to spice up the language by using synonyms for standards, and that is probably good advice.