Karlsson on databases and stuff

Monday, January 21, 2013

Talking at the SkySQL Roadshow in Stockholm

The SkySQL Roadshow is coming to Stockholm on Feb 7, so come by and meet us. I'll be ending the day with a talk on Big Data, which will be a fairly generic talk with some MySQL relevance, but with the focus on Big Data in general.

I haven't been blogging much recently, but there are reasons for that. Since Dec 1 I am the proud father of twins, a little boy and a little girl. I have yet to teach them to write proper SQL, they have particular issues with subqueries, but we'll get there. In order to create the usual mess of things and to make sure things are at the brink of running out of control, we decided to renovate our flat in the middle of all this. But I'll get there, and once we have a new kitchen installed, I'll do some more blogging, I have some things piled up to write about.

/Karlsson

Friday, January 4, 2013

MySQL JSON import / export tools updated

A user of mysqljsonimport, Josh Baird, reminded me of a feature which I should have added from the start, but which was forgotten about. The deal is that when you put a bunch of JSON objects in a file, you have a couple of options on how to do this.

The most obvious is maybe to export as a JSON array of objects, like this:
[
{"id":1, "name": "Geraint Watkins"},
{"id":2, "name": "Kim Wilson"}
]
But this is not what mysqljsonimport supported, and it is not how, say, MongoDB exports JSON by default. The reason is that for large amounts of data this is cumbersome, as what is in the file is actually one big JSON array containing all the data. This is difficult to parse and requires that a lot of data is read and that the whole array is kept in memory, unless some clever processing is done. And even if we are clever, this is still not efficient. Rather, what mysqljsonimport supported, and how MongoDB exports to JSON, is multiple objects without separators, i.e. you read an object, process it, and then you read some optional whitespace until you reach another object, like this:

{"id":1, "name": "Geraint Watkins"}
{"id":2, "name": "Kim Wilson"}

The latter is more efficient, but the former is often used too. So mysqljsonimport now supports both formats, and mysqljsonexport can optionally export as a single JSON array of objects in a file.
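
To make the difference between the two file layouts concrete, here is a minimal sketch in Python of a reader that accepts both. It is not how mysqljsonimport itself is written; the function name iter_json_objects is made up for this illustration, and for simplicity it assumes the whole file has already been read into one string. Note that in the array case the whole array is decoded in one go, which is exactly the memory cost described above, while in the other case one object at a time is decoded.

```python
import json


def iter_json_objects(text):
    """Yield each object from either a JSON array of objects or a
    sequence of objects separated only by optional whitespace."""
    decoder = json.JSONDecoder()
    pos = 0
    end = len(text)
    while pos < end:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < end and text[pos].isspace():
            pos += 1
        if pos >= end:
            break
        # Decode exactly one JSON value and get the index just past it.
        value, pos = decoder.raw_decode(text, pos)
        if isinstance(value, list):
            # Array layout: the whole array is in memory by now.
            yield from value
        else:
            # Object-per-record layout: one object at a time.
            yield value
```

Calling list(iter_json_objects(text)) on either of the two example files above should give the same two objects.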

Download the most recent version from SourceForge: mysqljsonimport 1.5 and mysqljsonexport 1.2

Cheers
/Karlsson

Wednesday, January 2, 2013

Amazon AWS for MySQL folks - Speaking at Percona Live 2013

I'll be speaking at Percona Live Conference and Expo in Santa Clara (April 22-25 2013) and this time I'll do a different talk from what I usually do. The plan here is to be low-level dirty practical, showing stuff using the Amazon AWS API, writing scripts using it and showing how to use these together with MySQL. I have said it before and I'll say it again: to get the most from your cloud, you have to understand and use the unique features of the cloud environment you use.

Can you create an elastic MySQL setup on Amazon? What about HA? How can you add slaves seamlessly? And automatically? I'll try to cover and show as much of this as possible, but the presentation is far from ready so I am happy to accept suggestions on specifics to cover here. See some more details on my talk here.

Hope to see you in Santa Clara in April!

/Karlsson

Friday, December 21, 2012

Galera features beyond just HA

Galera from Codership has been getting a lot of attention recently. Galera provides a nice High Availability solution for MySQL, with synchronous replication and conflict detection using the classic InnoDB Storage Engine. No more playing about with special storage engines or DRBD failover, just continue to use InnoDB and add Galera as the secret sauce for High Availability.

Some of the neat features of Galera are, but are not limited to, multi-master replication, a lightweight implementation of replication and zero failover times due to the multi-master ability. This is not a complete HA solution though, just a component of it. We still need to add some monitoring and failover mechanisms, but as Galera is multi-master this is greatly simplified and can in many cases be handled by the driver or the application with little overhead.

Now, the replication in Galera is synchronous, so that should slow things down a bit, right? Well, yes, but on the other hand Galera can use multiple threads to apply data on the slave, so that should compensate for that somewhat. And how does it compare to MySQL Semi-synchronous replication, which on paper shouldn't be that much different?

So I was curious about the multi-threaded apply on the slave that Galera supports. Could this be the multi-threaded apply that MySQL has been waiting for all this time? (No, the schema parallel implementation in MySQL 5.6 doesn't count in my mind). So I set out to try this, and this is my thinking:

The parallel nature of this should be best exposed when you have many small transactions, so each INSERT is a single row, autocommit transaction.
For the sake of the test, remove as much InnoDB overhead as possible and run on Ramdisk (tmpfs)
The schema should be simple
Simple INSERTs are to be tested, nothing else
Multiple INSERT threads.
Multi-master operation, but no conflicts.

This is admittedly a simple testcase, but it should tell us something. The schema looks like this:
CREATE TABLE `tab1` (
`c1` int(11) NOT NULL,
`c2` char(100) DEFAULT NULL,
PRIMARY KEY (`c1`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
And the data to be inserted is

column c1 - Unique sequential integer.
column c2 - A random string of characters of 5 to 100 characters length.
1.000.000 rows are inserted using 400 threads (200 on each MySQL server).
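
As an illustration of the data and the split across threads, here is a small Python sketch. The test description above doesn't say how the c1 values were divided between the threads, so the contiguous range per thread used here is an assumption, as are the helper names (make_row, rows_for_thread, insert_statement) and the choice of plain ASCII letters for c2. The point is just that every thread gets its own set of keys, so there are no conflicts between the two masters.

```python
import random
import string


def make_row(c1, rng):
    """Return (c1, c2) where c2 is a random string of 5 to 100 characters."""
    length = rng.randint(5, 100)
    c2 = "".join(rng.choices(string.ascii_letters, k=length))
    return c1, c2


def rows_for_thread(thread_no, num_threads, total_rows, rng):
    """Yield the rows that one insert thread is responsible for.

    Each thread gets its own contiguous range of c1 values (an assumption
    made for this sketch), so primary keys never collide across threads."""
    per_thread = total_rows // num_threads
    start = thread_no * per_thread + 1
    for c1 in range(start, start + per_thread):
        yield make_row(c1, rng)


def insert_statement(row):
    """Format one single-row INSERT, to be run as its own autocommit transaction."""
    # c2 only contains ASCII letters here, so no quoting or escaping is needed.
    return "INSERT INTO tab1 (c1, c2) VALUES (%d, '%s');" % row
```

With 1.000.000 rows and 400 threads this works out to 2500 single-row INSERTs per thread.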

The hardware I am using for this test is my usual homebrew 8 core AMD box with 16G RAM, nothing exciting but useful.
InnoDB used a standard configuration here, nothing special, and Galera was using 16 apply threads on the slave, which is probably excessive for this use case. Both the MySQL and the Galera tests were using two MySQL servers on the same box.

MySQL with semi-synchronous replication achieved some 4.830 INSERTs per second.
Galera achieved some 12.987 INSERTs per second, about 2.7 times the performance!

This test wasn't something scientific, but then most applications aren't terribly scientific either. To me, it seems like Galera is the Replication system MySQL should have had from the beginning! This rocks! And this also proves to me what I was thinking from the start, that Galera has more to give than a plain HA solution!

/Karlsson

Monday, November 26, 2012

In my mind: Why the ORDBMS idea failed

Some 15 years ago, the idea of an ORDBMS (Object-Relational Database Management System) was red hot, and I was very close to the flaming hot center of that. I worked for Informix at the time, and Informix bought Illustra, which was the hottest and coolest of the databases of its time, hey, it was an ORDBMS.

This was not a bad idea per se, and I got entangled with it, was really enthusiastic about it and spent a lot of time evangelizing this technology. For Informix, this was as much market positioning as a technical change: Informix went from being the cheap redneck cousin to becoming the Gordon Gekko of databases. Before this, Oracle was the Big Market Leader, Sybase was the technology leader and Informix was the price leader (no, I'm not talking technical realities, there was a lot of good technical stuff to all of these, this is about how the world at large perceived these guys). But Illustra and another Informix project, XPS (aimed at the data warehouse market), were going to take Informix to places it had never been before. Oh, the Billboard wars, the day when Informix went past Sybase, those were fun days.

From a financial POV, Informix lost it, we already know that (read "The Real Story of Informix Software and Phil White" by "Steve W. Martin" ISBN: 978-0-09721822-2-5), but that's not, in my mind, the whole story. Even though I think there are many good prospects for an ORDBMS system, it's not really as generic as I figured back then (OK, I was wrong, I admit it, it does happen).

From a technical standpoint what went wrong was (this is my take on it, by the way) that the cool ORDBMS features shoehorned into an aging Informix RDBMS design ended up being largely the worst of both worlds. That has been fixed, to an extent, in more recent Informix releases, but now it's too late :-(

And from a conceptual view, this is also what I think is wrong with the whole ORDBMS thinking. I know and love the traditional RDBMS model, with a fixed number of columns and a variable number of rows. Even if this is a simple model, it works real well for data. It makes plain data easy to visualize and understand, and it is also a well researched and understood model for data. OO has also been thoroughly researched, but the implementations and functionality differ a lot. Also, OO has a developer-focused way of looking at data: for an application, Objects are a natural way of looking at things and make things easy, from an application POV. But representing data as an Object is a different thing. Not a bad or good way, but different. The Relational model also lends itself to building control structures for data as it, assuming that the RDBMS is used in some kind of normalized form, represents data at a very low level, lower than the level at which most applications or end-users view data. And Objects are a way of combining all this data into something that is more application centric.

So the ORDBMS systems turned out non-OO enough to not attract the OO people, and at the same time the OO features were non-Relational enough to make the SQL experts ignore them. (Like: "Why would I want a result set with a variable number of columns?").

And before I close this: Yes, I know there are many ORDBMS applications out there that work well and where the application utilizes all the cool ORDBMS features. Also, in Oracle, and in particular in Postgres and others, there are ORDBMS features that are being developed. And inside Postgres, the ORDBMS features are a building block for more than one generic RDBMS feature. But for database people in general, ORDBMS is something we don't see much of.

/Karlsson

Thursday, November 22, 2012

Character sets, Collations, UTF-8 and all that

Yesterday at the first real meeting of the Swedish MySQL User Group here in Stockholm, I presented a talk on character sets, collations and stuff like that. If you read this blog, you know that I have written about this before, but the presentation I did yesterday was a fair bit more detailed. You can view the full presentation on SlideShare.

One thing I talked a lot about was collations and how they affect matters, and this has more of an impact than you think, in particular when using UTF-8. You would think that by using UTF-8 most character set problems are solved (at least when using 4-byte UTF-8), but no. Collations still come on top of this, there are many of them, and the effect of choosing the wrong one can be real bad.

Let me take an example. You would think think that using a UNIQUE or PRIMARY KEY on a text-based column (using something like a VARCHAR or CHAR type) in a table would ensure that any two strings are unique, but that two strings values that are different may coexist in two different rows. Think again.

A collation defines how characters in a character set are sorted and compared. And most localized collations have some weird attributes to them. There are things that linguists think are reasonable for a particular language, and that are hence present in the Unicode standard, but that might not be widely accepted by the community at large. So back to my original example. Let's say we are in Sweden, then 4 (yes, four) different collations may be applicable:

utf8_bin - This is a plain binary collation, comparisons are done on the binary value of the characters.
utf8_unicode_ci - This is a pretty reasonable collation based on some generic compromise in Unicode on how things are, and are not, sorted across the globe. Sort of.
utf8_general_ci - This is a simplified, faster general variation compared to utf8_unicode_ci.
utf8_swedish_ci - This is a collation that is specific to Sweden with some interesting Swedish specifics.

So let's see how this works in practice. Let's try a table that looks like this:
CREATE TABLE `utf8_table` (
`swedishname` char(10) CHARACTER SET utf8
COLLATE utf8_general_ci NOT NULL,
PRIMARY KEY (`swedishname`)
) ENGINE=InnoDB;
What happens with this data:
INSERT INTO utf8_table VALUES('A');
INSERT INTO utf8_table VALUES('Ä');
In Sweden, these two are unique (the second A has an umlaut). In the rest of the world, these two are the same, so the above will not work: a duplicate key error on the PRIMARY KEY will happen on the second row, despite the characters being different! So we try this instead:
CREATE TABLE `utf8_table` (
`swedishname` char(10) CHARACTER SET utf8
COLLATE utf8_swedish_ci NOT NULL,
PRIMARY KEY (`swedishname`)
) ENGINE=InnoDB;
And with the same data:
INSERT INTO utf8_table VALUES('A');
INSERT INTO utf8_table VALUES('Ä');
And this works as it should work, both rows are inserted!
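
To show the principle outside of MySQL, here is a rough Python sketch of what a collation does to a unique key. This is NOT MySQL's actual collation logic, which is table driven and has many more rules; the two key functions are simplified stand-ins made up for this illustration. The first one drops accents and case, roughly in the spirit of a general case-insensitive collation, and the second one does the same except that it keeps the Swedish letters å, ä and ö as distinct letters. A unique index then only sees the key, not the original string.

```python
import unicodedata

# The Swedish letters å, ä, ö in lower and upper case, written as escapes
# so the example does not depend on the encoding of the source file.
SWEDISH_LETTERS = set("\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6")


def general_key(s):
    """Simplified stand-in for an accent- and case-insensitive collation:
    decompose, drop combining marks, then fold case."""
    decomposed = unicodedata.normalize("NFD", s)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def swedish_key(s):
    """Same folding as general_key, except that the Swedish letters
    are kept as letters of their own (still case-insensitive)."""
    out = []
    for ch in unicodedata.normalize("NFC", s):
        if ch in SWEDISH_LETTERS:
            out.append(ch.casefold())
        else:
            out.append(general_key(ch))
    return "".join(out)


def would_collide(values, key):
    """True if a unique index comparing with `key` would reject a value."""
    keys = [key(v) for v in values]
    return len(set(keys)) != len(keys)
```

With the two values from the INSERT statements above, would_collide reports a collision when general_key is used and no collision when swedish_key is used, which mirrors the two outcomes described in this post.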

I will write another blog post on this soon, with some more examples eventually, but for now:
Cheers
/Karlsson
PS. I apologize if you have problems reading the above, it probably comes from the fact that there are embedded Swedish characters in the text :-(

Tuesday, November 13, 2012

This sucks! Well, maybe it does and maybe it doesn't...

Imagine that Microsoft and Apple got into a big fight for the market some 15 years ago and that Apple lost. Big time. Apple went down completely and there was nothing left. And as an IT expert, you were called in to look at what remained, what could be salvaged and what was just a waste of everyone's time and money.

If you had seen the iPhone back then what would you have said? (I'm not so sure myself, chances are I would have been terribly negative). Note that there would have been no AppStore, no HTML5 sites, none of that neat stuff.

Or to make a different analogy: Was VHS better than BetaMax? Well, that depends on who you ask: The end consumer wanting to rent a movie or the techie looking at the specifications of the technology in question.

Just after the Second World War, in a Germany in shambles, the allies went in and had that look at Germany, and with them they brought some smart dudes, to look at what was useful, what was not and what was rubbish. Reginald Rootes, who together with his brother Billy ran the Rootes Group, one of the big 5 producers of cars in Britain at the time, came along to, among other places, Wolfsburg to have a look at the VW plant. Despite being advised that the VW was a viable product and seeing it himself, Reggie wasn't interested. Now, some 60+ years later, all the remaining Rootes brands and factories are long gone (the last one, producing Peugeots, closed in 2007). And VW is fighting with Toyota for the title of the world's largest car maker.

All in all, stuff might have serious issues just because it is still early in its development, and you need to look further down the road to see the potential. And don't make the mistake of thinking that the good or bad implementation of an idea says much about the real potential of that idea.

Take Virtualization. Running a database in a virtualized environment was a big no-no just a few years ago. Now things have developed, performance is much better and many of us can use a virtualized environment for many, if not most, of our database needs. Be it Oracle, MySQL or Postgres or whatever.

Go back 20 years and ask yourself how you would have reacted if someone told you that in 20 years, many large enterprises would have large parts of their infrastructure run on an operating system developed by a Finnish student in his spare time, in an outsourced environment run by an Internet bookstore company? Nah, don't think so.

So what is the next big thing then? I try to spend some time on it, and when I get to test or try something, I really try to separate the idea, be it a new operating system, a new type of access method or whatever, from the actual implementation. The latter says less about the former than you think.

Also, technology isn't everything. Far from it. The best technology doesn't always win. And as for the new technologies you look at, the usefulness and applicability of those isn't always what you think. Did the web turn out to do what we were expecting? What did you expect to be able to do with a cellphone some 15 years ago, besides making phone calls and sending text messages? I believe there is a synergy between the potential of a technology and the applications for it, and that this is what drives things forward. And don't be so fast to click that "This sucks" button.

/Karlsson
Sorry for this post not being that MySQL focused, but I think it is still applicable. Even if the implementation sucks.