Monday, November 28, 2011

Former Swedish Pirate Party leader and MySQL UC speaker on the list of the world's top 100 thinkers

For those of you who were at the MySQL User Conference in 2008, you might remember that the then leader of the Swedish Pirate Party, Rick Falkvinge, was one of the keynote speakers, giving a talk on Copyright Regime vs. Civil Liberties. The MySQL User Conference has a history of inviting really interesting speakers, often slightly off what is expected at such a tech-focused conference. Falkvinge did a great talk there, and at least impressed me and got me thinking even more on the issues of free and open speech, copyright, software licensing etc.

Now Falkvinge has stepped down as leader of the Pirate Party, but he is still very much out there promoting openness. He has managed to achieve such a reputation that he is now number 98 on the list of Top Global Thinkers published by Foreign Policy magazine. I'm not sure how much the MySQL UC keynote contributed to getting him on that list, but it is impressive, and the MySQL UC organizers deserve a pat on the back!

/Karlsson

Friday, November 25, 2011

Cloud Tech Day in Stockholm Tue Nov 29

I'll be doing the keynote at Cloud Tech Day here in Stockholm on Tuesday. I'll be speaking a bit about what Recorded Future is up to, about Clouds at Amazon and what they are like, about databases in the Clouds, like MySQL and MongoDB, and about Big Data in the Cloud! Really big data in Mongo, in MySQL and the lot.

As usual, I will express my opinions in no uncertain terms. What works? What doesn't work? What really should work, but doesn't! What is considered new and waay cool but is really some old technology that never used to work and has little chance of working now. And stuff like that, you know what it's like, and maybe you even know what I am like :-)

Hope to see you on Tuesday!
/Karlsson

Wednesday, November 23, 2011

Nov 23: At Cloud Camp Stockholm

I am at Cloud Camp in Stockholm today. Some interesting ideas are being bounced around, pretty cool stuff.

One thing hit me today though: the lack of innovation, in IT as a whole and in databases in particular, is stunning. I have thus decided to write a few blogposts on things I think should, and probably eventually will have to, change, even though noone wants to change them, and few even see them as a problem.

That said, I still got a few interesting ideas today, and I will test some products I saw here and write a few blogposts on some of them.

I think the good usecases for clouds are also getting clearer, and that is a good thing. Unlike the current IT trends, the IT press and many high-profile bloggers as well as IT influencers, I do not think that cloud computing will help resolve the conflict in the Middle East. Nor do I believe that the introduction of cloud computing, in difference to what many IT security folks seem to think, will cause all the credit card info, all the personal data and everything else to suddenly be available to everyone on the net. Taking my own stand as usual, and in this case it really is a different view, I believe that cloud computing is great for some, but not for all. And I also think (you are sitting down now, right? This is revolutionary, ground-breaking thinking) that there is no such thing as a silver bullet. Tough!

/Karlsson

Tuesday, November 15, 2011

MyQuery 3.4.3 GA Released

I have had MyQuery 3.4.2 as beta for quite a while now. During this time, a few minor bugs have appeared, and they have now been fixed. This means I can proudly announce MyQuery 3.4.3 as GA! Download it from Sourceforge!

Apart from a few bugfixes, the only major change is that the NDB Monitor plugin is no longer part of the prebuilt binaries. There are three reasons for this:
  • The plugin relies on the NDB API, which is still not available from Oracle in binary form, and if they insist on not shipping these binaries, although the source exists, I have to assume they do not want me to use them.
  • I just can't be bothered to build MySQL Cluster from source, just to get these binaries.
  • The way Windows binaries are built with NDB is not properly documented, so I had to guess and use some trial-and-error. This was not something I wanted to put in a GA release.
If you don't know what MyQuery is, it is a Windows-only MySQL query tool that concentrates on flexibility, extensibility and SQL scripting. It has a colour-coded text editor, using Scintilla, and includes user-defined keyword lists if you want them. There are several means of extending the tool, from just running simple SQL statements to using a C API. The tool comes with complete documentation and API samples.

/Karlsson

Monday, November 14, 2011

My take on the "warning" against using MongoDB...

We have seen the "warning" against using MongoDB a few times now, and I have to say that it reminds me of other such warnings we have seen over the years.
In a sense, most of them were right. If you had, in the 1920's, asked the movie-going public if they wanted "talkies", chances are most of them would have said no. If you had told my mom and dad in the late 1970's that within 20 - 30 years, everyone would have a computer at home, with some resemblance to what their nerdy near-20 year old boy was tinkering with in the basement of their house, they probably would have laughed, at best.

But that's not the thing here. True innovation moves things forward. It introduces new things and new ways of doing things that we have not heard of before, and that the rest of the world has not got a good view of yet. Look at virtual computing. This was considered so slow that it was close to useless some 10 years ago or so, but today it just cannot be ignored and is put to good use all over the place (I am disregarding the fact that the technology is way older than that; I am talking virtual computing in the field where I spend most of my time).

That something which provides new and unique features, and new ways of doing things, is still slow when compared to traditional means of achieving similar results isn't strange:
  • New means not fully developed. What you want to demonstrate with something completely new isn't that it performs as well as existing technologies (if that was all, why would anyone change?). No, you want to show the new features and demonstrate how unique this new thing is.
  • The way we measure performance, or whatever else we measure existing technologies by, is usually tied to measuring just that: the performance or whatever of existing technologies, not that of a new technology and a new way of doing things.
In the early 20th century, steam cars were much faster and more reliable than internal combustion powered cars; the Stanley Steamer outperformed most competitors in both speed and reliability.

Getting back to MongoDB then: My main gripe with it is not that it's not mature in all aspects (it's not; face it, if you use MongoDB you are using leading edge stuff. It will break, live with it!). Nor do I have issues with many of the other attributes of MongoDB, or with the fact that it really isn't even innovative (it's not, live with it). No, my main gripe is this: MongoDB and NoSQL aren't really that new, and this means it is probably a stop-gap solution. In the 1980's, running a SQL database on a PC was possible, but slow (I was working for Oracle at the time, so I know); dBase was, in the case of a single PC, easier to use, faster and more developer friendly. And the way you used Ashton-Tate dBase, by the way, wasn't that much different from how MongoDB is used today.

But the SQL based relational databases, like Oracle, Informix etc., had more features and were more flexible and standardized, and once the PCs got more powerful, dBase was history.

What really wins then, in my mind, is features, flexibility, scalability and a broad spectrum of usecases. SQL based relational databases have this, to an extent, but what most of them lack is scalability across servers in a cloud. They have some of this scalability, but they don't scale nearly as much or as easily as, say, MongoDB or the other NoSQL databases (yes, I hate that term. Find a better one for me that is broadly accepted and I'll start using it).

So what am I saying here? Let me summarize it:
  • No. MongoDB doesn't suck, no way, but in terms of maturity it has some way to go.
  • Yes, an RDBMS usually has more flexibility and a broader set of usecases than a NoSQL solution.
  • And yes: There are places where the RDBMS software industry got caught with their pants down: Cloud environment scalability, for example. Also licensing: if Oracle or MySQL or whoever could figure out a proper means of pricing cloud services, I'd be happy (instead of using the pricing models for software that were introduced with Auto-Flow in the early 1960s. This is insane. The world has changed since then, guys!)
  • Would I rather use MySQL than MongoDB here at Recorded Future? Yes, probably, but I won't insist, as it just wouldn't work: MySQL will not scale in the way you can scale a MongoDB solution, far from it.
  • Will MongoDB or the other NoSQL solutions mean that the era of SQL based databases is reaching an end? Nope, no way, José. Eventually some RDBMS vendor will get it, understand the issues and build a viable solution (like ScaleDB or NuoDB or something like that).
  • What is my main issue with MongoDB? That it sacrifices features and functionality for performance, and that it cannot add the features and flexibility of an RDBMS without sacrificing that performance (look at this: access method: B+-Tree. Come on, how innovative is THAT?). To be honest, running MongoDB without sharding seems like a useless exercise to me. If you don't need the scalability that this setup can provide you with, you have better options (but this is just me talking here).
  • Will MongoDB hang around for long? Yes, probably, but it will not be that hot for as long as SQL based databases have been. The reason is that compared to a traditional RDBMS it provides just one big advantage (a big advantage, yes, but just one): Performance. Which is why we use it. Remember the Stanley Steamer? Old hat today, but it was the hottest thing you could drive some 100 years ago or so. And this is what cloud computing is all about: a constant change of technology to get the best value for money right now, and a constant lookout for new technologies that drive features and performance.
/Karlsson

Wednesday, November 2, 2011

Clouds in Stockholm

I'll be at Cloud Camp here in Stockholm on November 23. Some familiar faces will be there, beyond yours truly that is. I will discuss and present some real-life Database Cloud experiences, but as this is an unconference, don't expect slides; rather, I will talk from my heart and give you some annoying and upsetting views on how things really are. Really!

I hope to see you there, pop by and say hello!
/Karlsson

MongoDB for MySQL folks part 3 - More on queries and indexes

Last time I wrote about MongoDB for MySQL DBAs I described some of the basics of MongoDB querying, and this time I'll follow that up with some more on querying.

As we saw last time, the basic format of a MongoDB query is:
db.<collection>.find(<query>, <attributes>)
Note that you do NOT replace db with the name of the database you want to query here; you just make the database you want to use the current one and issue the query, such as:
> use test
> db.mycoll.find()
The example above will find all objects in the mycoll collection, and will include all the object attributes and also the key (_id), like this:
{ "_id" : ObjectId("4eb0634807b16556bf46b214"), "c1" : 1 }
{ "_id" : ObjectId("4eb0634a07b16556bf46b215"), "c2" : 1 }
{ "_id" : ObjectId("4eb0635607b16556bf46b216"), "c1" : 2, "c2" : 2 }
{ "_id" : ObjectId("4eb0635e07b16556bf46b217"), "c3" : 3 }
The object id is generated by MongoDB itself here, although you can set it yourself if you want to, as long as it's unique. The insert method is used to insert data:
> db.mycoll.insert({c3: 4, c4: 'some string'})
> db.mycoll.find()
results in:
{ "_id" : ObjectId("4eb0634807b16556bf46b214"), "c1" : 1 }
{ "_id" : ObjectId("4eb0634a07b16556bf46b215"), "c2" : 1 }
{ "_id" : ObjectId("4eb0635607b16556bf46b216"), "c1" : 2, "c2" : 2 }
{ "_id" : ObjectId("4eb0635e07b16556bf46b217"), "c3" : 3 }
{ "_id" : ObjectId("4eb063d307b16556bf46b218"), "c3" : 4, "c4" : "some string" }
And as you can see, typing is automatic, or you can look at it as being type agnostic. Now, this wasn't much more than we saw last time. What we want is to select some specific objects, and possibly get some specific columns from them; this is done by specifying one or two arguments to the find() method. For example, if I only want to get back the object that I inserted last above, I'd do this:
> db.mycoll.find({c3: 4})
{ "_id" : ObjectId("4eb063d307b16556bf46b218"), "c3" : 4, "c4" : "some string" }
And this wasn't really complicated, right? The condition is passed in JavaScript object notation, and that is fairly uncomplicated. But what happens for something slightly more complex than this really simple example, like a range search? To get all objects where the c3 member is 4 or higher (which returns the same object as above, by the way), you would write something like this:
> db.mycoll.find({c3: {$gt: 3}})
{ "_id" : ObjectId("4eb063d307b16556bf46b218"), "c3" : 4, "c4" : "some string" }
I will show some more $-operators beyond $gt in a later post; for now, just accept that they exist and are documented here: Advanced Queries
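As a small taste, though, here are two of the standard operators, $in and $lt, used against the same collection (the output follows from the data inserted above):
> db.mycoll.find({c3: {$in: [3, 4]}})
{ "_id" : ObjectId("4eb0635e07b16556bf46b217"), "c3" : 3 }
{ "_id" : ObjectId("4eb063d307b16556bf46b218"), "c3" : 4, "c4" : "some string" }
> db.mycoll.find({c1: {$lt: 2}})
{ "_id" : ObjectId("4eb0634807b16556bf46b214"), "c1" : 1 }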

The _id column is just annoying here right now, but it is always shown by default, as are all the other attributes of the object. To get rid of it for now, this will do the trick:
> db.mycoll.find({c3: {$gt: 3}}, {_id: 0})
{ "c3" : 4, "c4" : "some string" }
Not too bad, right, and kinda easy to understand. The flags you pass for each field in the second argument may have one of three values:
  • 1 - Include this field. This is the default.
  • 0 - Do not include this field.
  • -1 - Include no fields except this one and the ObjectId. You may have more of these, in which case all the -1 flagged fields will be included.
Let's try a more advanced version. If I want to get the c1 and c2 attributes, and nothing else, then I do this:
> db.mycoll.find({},{c1: -1, c2: -1, _id: 0})
{ "c1" : 1 }
{ "c2" : 1 }
{ "c1" : 2, "c2" : 2 }
{ }
{ }
As you can see, I have to explicitly exclude the _id field.

Online help
The mongo commandline tool for once has decent online help. Typing just help will show the options. For help on database specific operations, type db.help(), and for collection specific operations, type db.<collection>.help(), such as db.mycoll.help(). In JavaScript, a function is just another object, and adding parentheses to the name will execute the function, but maybe you want to see how the function is implemented? Then just type the name of the function, without the parentheses, like this:
> db.mycoll.find
function (query, fields, limit, skip) {
    return new DBQuery(this._mongo, this._db, this, this._fullName, this._massageObject(query), fields, limit, skip);
}

DBA Work - Indexing data and explain
What would a mongo DBA want to do? Let's try creating an index. Say we want an index on the c1 attribute in the mycoll collection as above; then we use the ensureIndex() method on the collection in question, telling it what columns to index, like this:
> db.mycoll.ensureIndex({c1: 1})
And that's it. Let's try to query that collection again, this time using the c1 column as an argument, and hopefully the index will be used:
{ "_id" : ObjectId("4eb0634807b16556bf46b214"), "c1" : 1 }
Right. But is the index used? I want to know for a fact that it is, or if it isn't, so I have something to complain to my developers about. In MySQL, you would use the EXPLAIN command to figure out what indexes are being used, but with mongo? Easy. Use the explain() method, like this:
> db.mycoll.find({c1: 1}).explain()
{
    "cursor" : "BtreeCursor c1_1",
    "nscanned" : 1,
    "nscannedObjects" : 1,
    "n" : 1,
    "millis" : 0,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "c1" : [
            [
                1,
                1
            ]
        ]
    }
}
Hey, that's pretty cool, right! The index is a standard B-tree index (the only index type available in MongoDB). An index can also be unique, like this:
> db.mycoll.ensureIndex({c2: 1}, {unique: true})
Which will create a unique index on the c2 attribute, but in our case it will not work:
E11000 duplicate key error index: test.mycoll.$c2_1 dup key: { : null }
What's going on here? Well, the c2 attribute isn't included in all objects, but the index will include all objects, and MongoDB considers NULL a duplicate here (unlike an SQL NULL, which is never considered a duplicate of another NULL). So the real question here is: what do you want? As MongoDB is schema-free, and you can have any kind of attributes, and also looking at the data above, what I would probably want is an index on the c2 attribute that makes sure that c2 is unique WHEN INCLUDED; if the c2 attribute isn't part of the object, then please, mr. Indexer, ignore it. This is called a sparse index in MongoDB, and what it means is an index that only indexes the objects where the attribute is included.

Note that this may not always be what you want with non-unique indexes, but often it is, and it makes searching and inserting faster (as the index is smaller). In the case where you have an attribute that is only rarely part of the object, and you want to find the objects where it IS included, this is just what you want.

In our case, the index is created like this:
> db.mycoll.ensureIndex({c2: 1}, {unique: true, sparse: true})
And this time we had no errors. Let's see how it works, first get some data:
> db.mycoll.find({}, {c2: 1, _id:0})
{ }
{ "c2" : 1 }
{ "c2" : 2 }
{ }
{ }
Now, let's see if the unique index on c2 will guarantee uniqueness by inserting a new row with an existing value for c2:
> db.mycoll.insert({c2: 1})
E11000 duplicate key error index: test.mycoll.$c2_1 dup key: { : 1.0 }
Yo! That worked as expected! As does this (which gives no errors):
> db.mycoll.insert({c2: 3})
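To double-check what indexes we now have on the collection, including the unique and sparse flags, there is the getIndexes() method. A sketch of the output (abbreviated, and the exact fields vary between MongoDB versions):
> db.mycoll.getIndexes()
[
    { "name" : "_id_", "key" : { "_id" : 1 }, ... },
    { "name" : "c1_1", "key" : { "c1" : 1 }, ... },
    { "name" : "c2_1", "key" : { "c2" : 1 }, "unique" : true, "sparse" : true, ... }
]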

That's it for now, I'll be back soon with some more MongoDB DBA stuff: Sharding!
/Karlsson

Wednesday, October 26, 2011

Databases in the Clouds - My Slides from Percona Live!

Due to some bad planning on my side, I failed to realize that I only had 30 minutes for my presentation at Percona Live in London on October 25, instead of the originally planned 45 minutes. My bad! So I had to rush through the presentation, and I still had a bunch of slides I wanted to show. To compensate for this, and for your reference, here are all my slides on Cloud Computing from the conference:


/Karlsson

Tuesday, October 25, 2011

Percona Live is now over!

And this was a great event! I'm at Heathrow catching a severely delayed flight back home to Stockholm now. Before I board the flight, I'm just letting you know that my slides will be on-line in a very short time.

It was great to meet you all there, and Percona does a great job of organizing these conferences and keeping the eco-system alive and kicking!

Thanx Peter Z and the whole Percona crew!
/Karlsson

Monday, October 24, 2011

At Percona Live in London now!




I'm now in London for the Percona Live conference! Come see me at 1:30 tomorrow, Tuesday October 25, and hear about running databases, BIG databases, in a cloud environment. MongoDB, Sphinx, MySQL, what have you, all running in an Amazon EC2 environment. Lots of data, lots of resources, lots of disks! And lots of fun!

/Karlsson

MongoDB for MySQL folks - Part 2

In this second part of a series of blogs on using a NoSQL database, in this case MongoDB, aimed at MySQL users, I will describe querying a bit. And it's probably not what you expect if you haven't tried a NoSQL database, or MongoDB specifically, before. The first blogpost on this subject is here.

At the heart of accessing MongoDB is JavaScript; more specifically, it uses the Mozilla SpiderMonkey JavaScript engine. As I wrote in the previous blog on this subject, JavaScript is all over the place in MongoDB; largely, you can look at MongoDB as JavaScript with HUGE space for variables, and although many would consider this a gross oversimplification, it works as a simple description of what we have here. JavaScript for querying, inserting, administration, scripting and "stored procedures" (they aren't called that in MongoDB, but you know what I mean). And JSON (JavaScript Object Notation) for data: when inserting an object, you construct a JSON object; when you query, what you get back is a JSON object, etc. etc. Internally, Mongo uses a binary representation of JSON called BSON.

So we have A LOT of JavaScript and JSON, and if you are no fan of JavaScript, then MongoDB probably isn't for you! In a MongoDB database there are collections, which are similar to tables, but in a collection there isn't, like in an RDBMS, a bunch of rows all with the same columns (attributes); instead there are multiple objects, all using the same "primary key" (called _id), and the object itself is a JSON object with any attributes. This is why MongoDB is "schemaless": there is no set schema for the objects (i.e. "rows") except the _id then.

The schemaless design sometimes gives the impression that beyond the unique access key there is no other way to access the data, and that you cannot have secondary indexes, but you sure can! But now I am getting ahead of myself.

Let's give MongoDB a shot. After installing MongoDB (which is dead simple) and starting the server, access the command line using the "mongo" command line program. In the examples I show here, I use a sharded setup, so the prompt is "mongos"; it might be different in your case, depending on your configuration and version of MongoDB. First, let's create a database to play with:
mongos> use mytest
Hey, you ask, what is going on here? You never created a database, you just accessed it? Yes, that is the deal with MongoDB: you just access a database or a collection, and if it doesn't already exist, it is created for you. Which is the reason many MongoDB installations, even in production, have a bunch of wrongly spelled database and collection names.
Now, let's see if the database we accessed really exists.
mongos> show dbs
admin (empty)
config 0.1875GB
test (empty)
and as you can see, I lied to you, the mytest database isn't there at all! Well, the reason is that there is no collection there yet, so let's create one:
mongos> db.foo.insert({attr1: 1})
Again I am accessing something that doesn't exist in order to create it, in this case a collection called foo, where the first object has one attribute, attr1, with the value 1. Now the mytest database should exist, as there is data in it; let's have a look:
mongos> show dbs
admin (empty)
config 0.1875GB
mytest 0.203125GB
And there you go, the mytest database is created and so is the collection foo:
mongos> show collections
foo
system.indexes
OK, there you go, a database and a collection for you. This is getting hotter! Let's insert some more data:
mongos> db.foo.insert({attr1: 2})
mongos> db.foo.insert({attr2: 3})
mongos> db.foo.insert({attr2: 4, attr3: 5})
All this is valid data; every object may have different attributes, and a different number of them. In fact, every object can have any attributes you would want. Which doesn't necessarily make it a good idea, but it's good to have that feeling of complete freedom, right?
Some more inserts:
mongos> db.foo.insert({attr2: 4, attr4: {attr5: 6, attr6: 7}})
mongos> db.foo.insert({attr2: 4, attr7: [10, 11, 12]})
Here I first insert an object where one attribute in itself is an object, and then an object where one attribute is an array. Hey, cool!
Now, let's do some querying. First, just get all the data:
mongos> db.foo.find()
{ "_id" : ObjectId("4ea528fb35a81331e048baa9"), "attr1" : 1 }
{ "_id" : ObjectId("4ea52a4835a81331e048baaa"), "attr1" : 2 }
{ "_id" : ObjectId("4ea52a5035a81331e048baab"), "attr2" : 3 }
{ "_id" : ObjectId("4ea52a5f35a81331e048baac"), "attr2" : 4, "attr3" : 5 }
{ "_id" : ObjectId("4ea52ae635a81331e048baad"), "attr2" : 4, "attr4" : { "attr5" : 6, "attr6" : 7 } }
{ "_id" : ObjectId("4ea52afc35a81331e048baae"), "attr2" : 4, "attr7" : [ 10, 11, 12 ] }
The find() function is a built-in MongoDB function that retrieves data for you, and it can take 2 optional arguments, each of which is an object in itself (surprise!):
  1. A query-specification.
  2. The columns (attributes) to return.
For example:
mongos> db.foo.find({ attr2: 3})
{ "_id" : ObjectId("4ea52a5035a81331e048baab"), "attr2" : 3 }

Here I am querying for all objects where attr2 is 3. The attr2 attribute doesn't have an index on it, but I can still query on it. To use the enforced _id key, just do this (in your case, you will see a different object id of course):
mongos> db.foo.find({ _id: ObjectId("4ea52a5035a81331e048baab")})
{ "_id" : ObjectId("4ea52a5035a81331e048baab"), "attr2" : 3 }
Before I close for now, let me show you how you can tell that the _id query above was more effective than the attr2 one, by using explain() (you didn't expect that, did you):
mongos> db.foo.find({ attr2: 3}).explain()
{
    "cursor" : "BasicCursor",
    "nscanned" : 6,
    "nscannedObjects" : 6,
    "n" : 1,
    "millis" : 0,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {

    }
}
The result from an explain is, as you would expect by now, a JavaScript object. Look at the nscanned attribute here. There is no index on the attr2 attribute, so all objects are scanned. Compare to this:
mongos> db.foo.find({ _id: ObjectId("4ea52a5035a81331e048baab")}).explain()
{
    "cursor" : "BtreeCursor _id_",
    "nscanned" : 1,
    "nscannedObjects" : 1,
    "n" : 1,
    "millis" : 22,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "_id" : [
            [
                ObjectId("4ea52a5035a81331e048baab"),
                ObjectId("4ea52a5035a81331e048baab")
            ]
        ]
    }
}
Here, just one object is looked at to get the data, and the _id index is used to find it (look at the indexBounds attribute). So far, querying isn't all that different from what you see in an RDBMS, except for all those objects and JavaScript.

That's it for now, I'll be back with some more MongoDB lessons real soon! The next post will be on more advanced querying!

/Karlsson

Tuesday, October 4, 2011

NoSQL for us RDBMS folks - MongoDB 101

As you probably know, I have been doing RDBMS work for many years, some 25+ years by now. At Recorded Future I am the database architect, and although an RDBMS is used extensively, MySQL in this case, we are looking at options, and are currently doing more and more work using a NoSQL solution, probably the most popular one by now, namely MongoDB.

And before you complain: NoSQL is not a good term, but someone with a longer NoSQL background should find something better, not yours truly. And for all intents and purposes, you know what I am talking about, right?

I plan to post a few MongoDB posts here, on how it looks from an RDBMS dude's (like myself) POV. This is the first installment, but there will be more. I should also point out that I am no expert in NoSQL technologies in general, nor in MongoDB specifically, but I am trying, and our MongoDB setup is large and complex enough to serve as a decent example: We have a couple of Tb of data in Mongo, we use replication and we use sharding / clustering / whatever you want to call it. In addition, we run this puppy on Amazon EC2.

So, why did we move to MongoDB, and what are the results? I think the reason is two-fold:
  1. Better performance. Not better performance per se, but by using sharding and large in-memory stores, we can achieve it.
  2. Better and easier to set up scalability. Again, this doesn't come by itself, nor is it without problems. But the built-in sharding in MongoDB actually does work reasonably well, and is largely transparent to the application.
This sounds great, doesn't it? Everyone wants better performance and better scalability, right? And why can't we have MySQL scale in the same way as MongoDB? Well, there are drawbacks to all this, but in our case, I think we can live with them.

An important thing to note here, I think, is this: Scalability is transparent and easy to use and implement because the basic idea of a NoSQL system (a key-value store in this case) is so simple: One unique key points to one value. That's it. This is easy to shard. Dead easy. A multi-table relational schema is much less so. You could simplify the schema in MySQL to make it just as easy to shard, but then what is the point of an RDBMS at all? And MongoDB scalability is an incredibly important part of MongoDB performance (in addition to its in-memory nature).

In our case, using Amazon EC2, scalability is really important. A single Amazon instance is limited in terms of memory and CPU power. The limits are high (some 70G of RAM for example), but there is still a limit. If you have, say, 500 Gb or a Tb or so of data, 70Gb might not be that much, so you have to shard or cluster or something. And this is where MongoDB rocks: it is easy to shard, the simple data structures in a NoSQL database mean this can be done largely transparently, and all this means that MongoDB rocks for us.

Do we have other options? You bet! MySQL Cluster, NimbusDB (aka NuoDB), Oracle, Cassandra, etc etc, but we ended up with MongoDB for now.

So from a DBA POV, what is mongo then? To begin with, there is no SQL interface and no "query language" per se; rather, to the user, MongoDB looks a lot like a JavaScript environment with a lot of space for variables, sort of. MongoDB has databases, and in databases there are collections, which are sort of like tables in an RDBMS. A collection is a key-value store; you can define a key for the collection, or MongoDB will generate one for you. MongoDB is schemaless, which means that a collection doesn't have any predefined columns or anything. Instead, if an object that you store has an attribute foo, then you assign a value to it in your object and insert the object. That's it. And you can have an index on foo if you want to. An index in this case is a traditional B-Tree index, nothing more exciting than that.

An object, as above, in Mongo is a JSON structure, which internally in MongoDB is represented as BSON. To access the collections you use JavaScript, as I said before, like this to insert an object into the collection bar with the attribute foo:
db.bar.insert({foo: "foobar"});
In SQL this means: INSERT INTO bar(foo) VALUES('foobar')
Used this way, Mongo will assign the object a unique id, called _id, which is similar to the PRIMARY KEY in an SQL RDBMS. To retrieve the row we just inserted, use the command:
db.bar.find()
Which in SQL means: SELECT * FROM bar
And you get something like this back:
{ "_id" : ObjectId("4e8c9b2e2fde67676c2381ae"), "foo" : "anka" }

MongoDB has a big bunch of built-in JavaScript methods; what I showed above was just a very basic sample. To use Mongo, knowing JavaScript and JSON in general is more or less a requirement, and if you don't know them and start using MongoDB, you will know them soon.

I will write some more blog posts on Mongo eventually, but before I close this post, let me tell you one more thing. I never created the collection (bar) I used above. The reason is that you do not have to; it is created for you as you access it. The same goes for databases. This is the reason that you often end up with a bunch of misspelled databases and collections in mongo, which you can of course drop after you have made a mistake, but as we are all lazy, you mostly end up with a bunch of unused databases and collections.

/Karlsson

Thursday, September 22, 2011

MySQL Installer, once again

First an apology. I have not been blogging much lately, but I do have a reason, and I think a valid one for once (i.e. not along the lines of "I was out drinking last night" or "I was busy boiling potatoes" or something): I have been moving house. And my new office is now as tidy as ever (and if you believe that, let me tell you about a bridge in New York that I have for sale at a really good price). No, the place is a mess, but my beautiful wife puts up with me, and I am slowly getting things together.

That said, let's get to the point: MySQL Installer! If you read this blog, you know I have complained about this puppy several times before (First attempts with MySQL Installer and MySQL Installer part 2), and I have been in touch with the good developers at MySQL / Sun / Oracle / Google (no wait, that last one wasn't right. At least not yet) and for all intents and purposes, they are doing their best (I already knew that) and are listening to us.

So now MySQL Installer is GA. Is it any better than when I complained last time? It sure is!

Now it does what it set out to do: it installs most of what you need as a MySQL developer on Windows; it installs the MySQL Server, the Workbench and the drivers. This time around, I removed all MySQL products first and then did a clean install, but some old settings still seemed to be around. This could potentially be annoying, but the machine I was testing on really has had a lot of testing done on it, so it might be that it is in a weird state. Also, this might be what you want. Before doing a clean install, I tried an upgrade. I had MySQL 5.5.14 on my box, and the installer came with MySQL 5.5.15, so I figured it would offer to upgrade if I chose Add/Modify Products and Features. On the other hand, after a clean install, if I choose Add/Modify Products and Features again, the installer insists that there are "changes pending" to the MySQL Connector/.NET. If I execute that, it executes OK, but the Installer still insists that there are "changes pending" to Connector/.NET.

This may sound like a lot of complaints, but it really is not that bad. If I were a Windows developer coming new to MySQL, this is not a bad starting point at all. But if I were Oracle (which I am not), I'd add a few things:
  • Links - I'd offer in the installer to add a few links to useful places on the net for MySQL stuff. This includes Planet MySQL, MySQL Forge, dev.mysql.com, www.mysql.com, and the bug database (how active that is these days I don't know either).
  • Third party software - There is some third party software that could be interesting to include, such as drivers (PHP etc.), tools, plugins etc. There is a whole bunch of these available, and if I were new to MySQL, I would have a hard time selecting which to include and what to do with them. I understand the licensing issues here, but as this is the Community edition of MySQL, that shouldn't be too difficult.
  • Tips and tricks - How do you do this or that? What questions might a Windows developer have when approaching MySQL for the first time?
  • Integration - I would really like to see more integration here, with Visual Studio and with many other Windows developer tools. I admit that this might be there, but that the integration is hidden from me as I don't have those tools.
These are just some ideas for moving forward with MySQL Installer. The current product is a bit limited I think, but still useful, and is now pretty stable. Why the MySQL Installer is not available on a Swedish mirror is beyond me, but that is a minor issue I think, unless you live in Sweden (like I do).

For reference, I used version 1.0.17 of MySQL Installer, something which is not that easy to find out, as the download page states that the version is 5.5.15 (which is the MySQL Server version). The installer itself doesn't say what version it is, at least not as far as I can find out. It does say in the VERSION resource of the MySQLInstaller.exe file though, so looking at the properties with Windows Explorer reveals the version.

/Karlsson

Saturday, August 27, 2011

Percona Live getting closer and some odd notes

Percona Live MySQL Conference, London, Oct 24th and 25th, 2011
Percona Live in London is getting closer, and Oct 24 will soon be with us. Before that I hope to finish my delayed honeymoon in St. Petersburg, move house and travel to New York (I am glad I am NOT doing that today, Hurricane Irene doesn't seem like a nice lady).

I have done some more work on my presentation, and I hope it will end up being pretty interesting. At the same time, Recorded Future is in the middle of some major revamping of the backend, so things are a bit hectic, to say the least.

The other week we had an issue with one of our slaves falling behind and having problems catching up. I did not have my accelerator working on that machine, so I started it, and immediately the slave was catching up way faster than before, so again I have proven to myself that this puppy really helps. You can read more on the accelerator here: Slavereadahead 1.3 released.

This fall I am testing a bunch of alternative database technologies, some based on MySQL, some complementary to MySQL and some not MySQL-related at all, and I'll be talking a bit on some of them in London.

Special offer: Use the discount code RecFPLUK when registering to get a special discount.

See you in London
/Karlsson

Wednesday, August 17, 2011

Speaking at Percona Live in London October 24-25

I'll be speaking about MySQL in the Cloud at Percona Live in London October 24-25, and this will be a talk centered on our experiences of running a database setup, based on MySQL but with several complementary technologies, in Amazon EC2.

I will talk about our experiences so far. We have a lot of data at Recorded Future and we are also write intensive, which is a slight difference from what you may be used to from other Web-based systems. The amount of data we have right now is in the multi-Tb range, and we are growing fast. Is a cloud environment really useful for something like this? What does work and what doesn't?

Also, I will talk a bit on what I think the technological issues are when using traditional servers such as MySQL and MongoDB, and how I would like it to work instead. Finally, I'll take a shot at looking at some alternatives that I have been testing recently. I like MySQL, and I have used it for many years and I support it now, but I am also a pragmatist: Are there options, and why are they interesting?

See you in London!
/Karlsson

Friday, August 12, 2011

MySQL Global status difference using MySQL procedures / functions

As a MySQL DBA, I guess you use the SHOW GLOBAL STATUS command, or the corresponding INFORMATION_SCHEMA.GLOBAL_STATUS table, to show current MySQL status quite often. And many of us use different tools to collect that data and view it.

But sometimes we use the same command from the SQL prompt, because we have to, because it's the only option, or just because that is the most convenient way. And often you kick off two such commands in a row, a few seconds apart, just to see how things are moving:
SHOW GLOBAL STATUS LIKE 'innodb%rows%';
... DBA picks his or her nose for a few seconds ...
SHOW GLOBAL STATUS LIKE 'innodb%rows%';

And then you can see how things are advancing. But how much? To figure out how much, you have to calculate the difference between the values returned by those two statements. And then there is another issue: how much is the difference per second? To know what the difference is per second, we have to include the UPTIME_SINCE_FLUSH_STATUS status value in those statements. So we do this then:

SHOW GLOBAL STATUS WHERE variable_name like '%rows%' OR variable_name = 'Uptime_since_flush_status';
... Yet another chance for the DBA to pick the nose ...
SHOW GLOBAL STATUS WHERE variable_name like '%rows%' OR variable_name = 'Uptime_since_flush_status';

Now we have the data we need to work with, but we still need to do some math. And math is fine of course, in and of itself, but wouldn't it be nice to have some kind of contraption to do that for us? Like a machine? Let's call it a computer! Whao!
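For the record, the math in question is trivial. With hypothetical numbers, where Innodb_rows_inserted went from 1000 to 2163 while Uptime_since_flush_status went from 100 to 102 seconds, it is just:
SELECT (2163 - 1000) / (102 - 100) AS rows_inserted_per_sec;
which gives 581.5000, i.e. some 581 rows inserted per second over that 2 second window. The procedure below does exactly this for every matching status variable.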

No, joking aside, this is a perfect job for a stored procedure or something. What we need is a stored procedure to do the job for us, and somewhere to store the previous values in between status runs, and here I will use a MySQL user variable for that. Before I show you the code, let me show you what running it looks like:

MySQL> call getstat('%rows%', FALSE);
+--------------------------+-------------+
| variable_name            | current     |
+--------------------------+-------------+
| INNODB_ROWS_DELETED      | 171085420   |
| INNODB_ROWS_INSERTED     | 4940247881  |
| INNODB_ROWS_READ         | 16041373517 |
| INNODB_ROWS_UPDATED      | 90534033    |
| NOT_FLUSHED_DELAYED_ROWS | 0           |
| SORT_ROWS                | 266732753   |
+--------------------------+-------------+
6 rows in set (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

MySQL> call getstat('%rows%', FALSE);
+--------------------------+-------------+-------------+------------+--------------------+
| variable_name            | previous    | current     | difference | difference_per_sec |
+--------------------------+-------------+-------------+------------+--------------------+
| INNODB_ROWS_DELETED      | 171085420   | 171085420   | 0          | 0.00 (2 s)         |
| INNODB_ROWS_INSERTED     | 4940247881  | 4940249044  | 1163       | 581.50 (2 s)       |
| INNODB_ROWS_READ         | 16041373517 | 16041373597 | 80         | 40.00 (2 s)        |
| INNODB_ROWS_UPDATED      | 90534033    | 90534044    | 11         | 5.50 (2 s)         |
| NOT_FLUSHED_DELAYED_ROWS | 0           | 0           | 0          | 0.00 (2 s)         |
| SORT_ROWS                | 266732753   | 266732753   | 0          | 0.00 (2 s)         |
+--------------------------+-------------+-------------+------------+--------------------+
6 rows in set (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

The first parameter to the getstat procedure is the LIKE-pattern you want to use; the second is a boolean telling the procedure to save a new baseline of values. In other words, the first time the procedure is run, it collects the relevant data but only shows current values, as that is all it has. On the second and subsequent runs, it shows the difference from the saved baseline. If you set the second parameter to TRUE, the procedure shows the differences since the previous baseline and then saves a new set of values, so subsequent runs are compared against that new baseline.
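To make that concrete, a short usage sketch (the pattern is anything that SHOW GLOBAL STATUS LIKE would accept):
MySQL> call getstat('innodb%rows%', FALSE); -- show differences since the baseline
MySQL> call getstat('innodb%rows%', TRUE);  -- show differences, then save a new baseline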

The nice thing with this procedure is that it uses no temp tables, which means no disk IO or anything, just access to the INFORMATION_SCHEMA GLOBAL_STATUS table, that's it.

So, now you can't wait to get the procedure and use it yourself? OK, here we go, enjoy! (There is actually both the procedure and a simple supporting function.)

/Karlsson

DROP PROCEDURE IF EXISTS getstat;
delimiter //
CREATE PROCEDURE getstat(p_like VARCHAR(64), p_gennew BOOL)
BEGIN
    DECLARE v_stat TEXT;
    DECLARE v_name VARCHAR(64);
    DECLARE v_value VARCHAR(1024);
    DECLARE v_nodata INTEGER;
    DECLARE v_uptime_current INTEGER;
    DECLARE v_uptime_last INTEGER;
    DECLARE c1 CURSOR FOR SELECT variable_name, variable_value
        FROM information_schema.global_status;
    DECLARE CONTINUE HANDLER FOR NOT FOUND SET v_nodata = 1;

    SET v_nodata = 0;
    -- Make sure we always generate new stats the first time we run.
    IF @savedstat IS NULL THEN
        SET p_gennew = TRUE;
    END IF;

    IF p_gennew THEN
        SET v_stat = '';
        -- Get the current status now, and loop through it.
        OPEN c1;
        l1: LOOP
            FETCH c1 INTO v_name, v_value;
            IF v_nodata THEN
                LEAVE l1;
            END IF;

            -- Add name and value to saved statuses.
            SET v_stat = CONCAT(IF(v_stat = '', '', CONCAT(v_stat, ',')), v_name, ':', v_value);

            -- Get current uptime.
            IF v_name = 'UPTIME_SINCE_FLUSH_STATUS' THEN
                SET v_uptime_current = v_value;
            END IF;
        END LOOP;
        CLOSE c1;
    ELSE
        SELECT variable_value INTO v_uptime_current
            FROM information_schema.global_status
            WHERE variable_name = 'UPTIME_SINCE_FLUSH_STATUS';
    END IF;

    SET v_uptime_last = getstatvalue('UPTIME_SINCE_FLUSH_STATUS');
    -- If there was a previous value to show, then show difference.
    IF v_uptime_last != '' THEN
        SELECT variable_name, getstatvalue(variable_name) AS previous,
                variable_value AS current,
                IF(getstatvalue(variable_name) = '', '',
                   variable_value - getstatvalue(variable_name)) AS difference,
                IF(getstatvalue(variable_name) = '', '',
                   CONCAT(TRUNCATE((variable_value - getstatvalue(variable_name))
                       / (v_uptime_current - v_uptime_last), 2), ' (',
                       v_uptime_current - v_uptime_last, ' s)')) AS difference_per_sec
            FROM information_schema.global_status
            WHERE variable_name LIKE p_like;
    ELSE
        SELECT variable_name, variable_value AS current
            FROM information_schema.global_status
            WHERE variable_name LIKE p_like;
    END IF;

    IF p_gennew THEN
        SET @savedstat = v_stat;
    END IF;
END//
delimiter ;

DROP FUNCTION IF EXISTS getstatvalue;
delimiter //
CREATE FUNCTION getstatvalue(p_name VARCHAR(64))
    RETURNS VARCHAR(1025)
    NO SQL
BEGIN
    DECLARE v_pos INTEGER;
    DECLARE v_end INTEGER;

    -- If no saved status exists, then return nothing.
    IF @savedstat = '' THEN
        RETURN '';
    END IF;

    -- Find the value.
    SET v_pos = INSTR(@savedstat, CONCAT(',', p_name, ':'));

    -- Try to find the value name at the beginning of the string.
    IF v_pos = 0 THEN
        IF LEFT(@savedstat, LENGTH(p_name) + 1) = CONCAT(p_name, ':') THEN
            SET v_pos = 0;
        ELSE
            RETURN '';
        END IF;
    END IF;

    SET v_pos = v_pos + LENGTH(p_name) + 2;
    -- Now find the value in the string, following the name and a colon.
    SET v_end = LOCATE(',', @savedstat, v_pos);
    IF v_end = 0 THEN
        SET v_end = LENGTH(@savedstat) + 1;
    END IF;

    -- Extract the value and return it.
    RETURN SUBSTR(@savedstat, v_pos, v_end - v_pos);
END//
delimiter ;

Sunday, July 31, 2011

Elastic cloud computing? Sure, but don't forget the database.

I was interviewed at the Xeround booth at the MySQL Users Conference and Expo in April, talking about databases in the cloud, and although this was a vendor interview, the issues were pretty much cloud generic. I think you know my views on clouds and elasticity and stuff like that by now, but if not, then you might want to read my blogpost The E is for Elastic. And if you don't want to read my rant, Xeround has published the interview on YouTube, so you can see it there:
Challenges for databases in the cloud



On Elasticity and Cloud Databases



/Karlsson

Friday, July 29, 2011

Yikes.. Backing up a sharded MongoDB is no fun!

Backing up databases has never been fun, not as fun as having a cool English ale on the balcony on a hot summer day anyway, but MongoDB takes this one step further when it comes to annoyances.

In general, I often feel that many Open Source projects start with good intentions for what the project should do and how, but then more stuff is added as it is required, and suddenly what started out as a simple and fast application for a narrow usecase turns into a bit of a mess. And the issue might well be that building fast, compact software for a specialized usecase, as they start out, is not the same as writing generic software with a wide range of use cases, code that can easily be maintained and enhanced as we go along. And why should it not be like that? In many cases this is just fine: the limited usecase is just what the project sets out to support, and it does it well. But sometimes this turns into something really annoying, and at the same time useful.

I think MySQL is partly like this. There are many things in MySQL that work really well, so many things where a small amount of code achieves so much useful stuff. And then there are things in MySQL that are just outright wrong, and yet another bunch of things that are just plainly annoying. Largely, MySQL is much more developer focused than DBA focused, although this is improving (and this is also a personal opinion of mine).

And then we have MongoDB, one of the big contenders on the NoSQL side in the NoSQL vs. SQL battle (which is a silly battle, but let's ignore that for now). Now, MongoDB is supposed to be a database. One that is faster, more compact and more targeted towards general database needs than MySQL. A database that can scale and replicate and shard automatically! Brilliant. And then this boring old DBA comes around with his bitterness and boredom and ignorance of the "new" database system. And he says things like: "Can you do a backup"... Yikes! Never thought of that. And what a boooring guy that DBA is!

Yes, with MongoDB, backups are clearly an afterthought. And although this is again my personal opinion, I base it on something, namely this: Backing up a sharded Cluster. This is just plain silly. The way these steps are performed is in no way consistent; some of the operations are asynchronous, which means you have to wait for them (write code? To do a friggin' backup? Who came up with THAT daft idea?). You have to back up the config servers cold, i.e. shut down. Dead. Who came up with that? And in the end, you don't even get a consistent backup. And yes, if you use replicas, you have to physically back up the replicas too! What?

Whoever figured out how to shard and replicate with MongoDB did a reasonable job; it actually works OK. But the person in question apparently forgot that databases need to be backed up. And before you ask: No, I am NOT going to take a mongodump of 1.5 Tb of data!

This said, most aspects of MongoDB are OK, but backups are a mess. Read the page I linked to above, and you also realize that it is not well documented, to say the least, how to back up, and what happens if the steps aren't followed? What happens if I do not back up the stopped config server? Can I do a mongodump of the config server instead? And why in heavens name can't I:
  • Flush and lock the config servers?
  • Flush all the dataservers in one go? No, you have to do it one dataserver at a time.
  • And yes, why can't you flush and lock the config servers?
Also, locking is weird here. You can lock several times, and then you have to unlock as many times. I really do not know why. Also, I have yet to figure out how to know if the server is locked without at the same time unlocking it, but it might be that I missed something in the docs (which aren't very good I'm afraid, but this is not the first Open Source project with lacking documentation).
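For reference, this is roughly what the locking dance looks like from the mongo shell (a sketch; the exact helpers and output fields vary between MongoDB versions):
> use admin
> db.runCommand({fsync: 1, lock: 1}) // flush to disk and block writes; locks nest
> db.currentOp()                     // look for the fsyncLock field to see if the lock is held
> db.$cmd.sys.unlock.findOne()       // one unlock per lock taken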

And having written all this, I have now created a script that will do all this for us, in an unattended fashion. Yes, backups are supposed to be able to run unattended! No, I do not want any manual checking in the midst of a backup process! I do not want to be up at 3AM!

And yes! I want a way to verify my backups, with some ease! No, I'm not going to set up a cluster of 8 nodes with 4 Tb of disk just to verify a backup! And this is usually much less of a problem with MySQL, as it is not sharded / distributed. But for anything that IS sharded / distributed, for heaven's sake, make sure there are tools to support this. In particular backup tools!

/Karlsson

Wednesday, July 20, 2011

MySQL Installer part 2

OK, I was having too much fun in my last post on this subject, so I'll try again. Just to be safe, I uninstall MySQL Installer, reboot my machine, and download the Installer again. Installing it this time didn't seem to make anything happen? No dialogs, no nothing? But it is back there in the Start menu at least... OK, let's try that one...
Think think think.. Whammo:
Yes, I am aware this is running on Swedish Windows, but let's try it on a US Windows 7 installation....

I downloaded and tried it on this machine, and now the installer runs at least. The "configuration" step doesn't actually do much. The Connector/.NET installation still insists on not finding a download location at first, but then suddenly it works. The Installer also finds old versions of MySQL, which is nice, but it doesn't seem to know how to handle them. I still get into a state which the Installer doesn't know how to get out of, and it just hangs (on "Validating installation").

As there is no nice configuration screen, and the VS integration plugin is not included, I don't really find this installer such an "easy to use" thingy. To begin with, it will not let me choose WHERE to install things (no, I do not want MySQL on C:, even though many other programs are there. In particular, I do not want C: to be the drive where the database is kept, as C: is an SSD which is there to speed up booting, not to have MySQL fuss around with it).

No, this wasn't really easy to use at all. No, it didn't help much. Yes, the interface is very nice. No, I do not think this is RC quality software.

/Karlsson

More on OR-conditions considered bad... And an apology..

In my recent post on OR-conditions I made a mistake, and I apologize for that. I made the statement that MySQL will only use 1 index per statement, whatever you do.

This is no longer true, as a matter of fact, and that has been the case since MySQL 5.0, so I should have checked. MySQL is actually able to use index_merge. An explanation why I didn't look for this more carefully, yes an explanation, not an excuse, is that the optimizer doesn't seem to want to use this very often. Which is too bad.

So, with this in mind, and using the same table as in the previous post, let's look at index_merge in action. Or possibly, not so much in action. Let's recap what the table looks like:
CREATE TABLE `product` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`brand_id` int(11) NOT NULL,
`quantity` int(11) NOT NULL,
`weight` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `ix_brand` (`brand_id`),
KEY `ix_weight_brand` (`weight`,`brand_id`),
KEY `ix_quantity_brand` (`quantity`,`brand_id`)
) ENGINE=InnoDB AUTO_INCREMENT=2321268 DEFAULT CHARSET=utf8
OK, fair enough, one single table with a bunch of indexes. Looking at the index_merge documentation, we can see that if we have an OR condition with both sides appropriately indexed, then this algorithm will execute each path and then do a sort-merge of the results. Let's try a simple example, using a similar query to the one used last time, except that we ignore the brand_id column this time:
EXPLAIN SELECT id FROM product WHERE weight = 41 OR quantity = 78;
and we get this:
+----+-------------+---------+-------------+-----------------------------------+-----------------------------------+---------+------+-------+------------------------------------------------------------------+
| id | select_type | table   | type        | possible_keys                     | key                               | key_len | ref  | rows  | Extra                                                            |
+----+-------------+---------+-------------+-----------------------------------+-----------------------------------+---------+------+-------+------------------------------------------------------------------+
|  1 | SIMPLE      | product | index_merge | ix_weight_brand,ix_quantity_brand | ix_weight_brand,ix_quantity_brand | 4,4     | NULL | 25729 | Using sort_union(ix_weight_brand,ix_quantity_brand); Using where |
+----+-------------+---------+-------------+-----------------------------------+-----------------------------------+---------+------+-------+------------------------------------------------------------------+
That is cool! We are seeing index_merge in action here! Coolness! So, knowing that, let's see if we can get index_merge to work for us in the case we looked at last time, where we also had a brand_id column in the query. There are indexes on brand_id combined with both quantity and weight, the exact same two indexes used above actually, so adding brand_id should produce the same nice execution plan, but of course reduce the number of rows returned. Let's try it:
EXPLAIN SELECT sql_no_cache id FROM product WHERE (brand_id = 6 AND weight = 41) OR (brand_id = 6 AND quantity = 78)
And we get this:
+----+-------------+---------+------+--------------------------------------------+----------+---------+-------+------+-------------+
| id | select_type | table   | type | possible_keys                              | key      | key_len | ref   | rows | Extra       |
+----+-------------+---------+------+--------------------------------------------+----------+---------+-------+------+-------------+
|  1 | SIMPLE      | product | ref  | ix_brand,ix_weight_brand,ix_quantity_brand | ix_brand | 4       | const | 4291 | Using where |
+----+-------------+---------+------+--------------------------------------------+----------+---------+-------+------+-------------+
No luck there. For some reason, the optimizer seems to dislike the index_merge access method, except in the most obvious of cases. But hey, we don't give up that easily, do we? We can use FORCE INDEX, right? Like this:
EXPLAIN SELECT sql_no_cache id FROM product FORCE INDEX (ix_weight_brand,ix_quantity_brand) WHERE (brand_id = 6 AND weight = 41) OR (brand_id = 6 AND quantity = 78);
And the result is this:
+----+-------------+---------+-------------+-----------------------------------+-----------------------------------+---------+------+------+-------------------------------------------------------------+
| id | select_type | table   | type        | possible_keys                     | key                               | key_len | ref  | rows | Extra                                                       |
+----+-------------+---------+-------------+-----------------------------------+-----------------------------------+---------+------+------+-------------------------------------------------------------+
|  1 | SIMPLE      | product | index_merge | ix_weight_brand,ix_quantity_brand | ix_weight_brand,ix_quantity_brand | 8,8     | NULL |   31 | Using union(ix_weight_brand,ix_quantity_brand); Using where |
+----+-------------+---------+-------------+-----------------------------------+-----------------------------------+---------+------+------+-------------------------------------------------------------+
What is annoying here is that this query, using FORCE INDEX, actually hits only 31 rows according to the statistics, whereas the one not using FORCE INDEX potentially hits 4291. Why the optimizer determines the latter to be faster I do not know, but it doesn't seem right to me.
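If you want to check for yourself which variant is actually faster on your data, the query profiler is handy. A minimal sketch, assuming profiling is available in your build (it is in 5.5), reusing the two statements from above:
SET profiling = 1;
SELECT SQL_NO_CACHE id FROM product WHERE (brand_id = 6 AND weight = 41) OR (brand_id = 6 AND quantity = 78);
SELECT SQL_NO_CACHE id FROM product FORCE INDEX (ix_weight_brand,ix_quantity_brand) WHERE (brand_id = 6 AND weight = 41) OR (brand_id = 6 AND quantity = 78);
SHOW PROFILES; -- lists the duration of each statement above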

In the example given here, I was using a brand_id of 6. That particular brand_id has fewer entries than the other ones, so let's give it a shot using brand_id 4, which takes a bit longer. The SELECT using a UNION then looks like this:
SELECT SQL_NO_CACHE id FROM product WHERE brand_id = 4 AND weight = 41 UNION SELECT id FROM product WHERE brand_id = 4 AND quantity = 78;
and the one using FORCE INDEX looks like this:
SELECT SQL_NO_CACHE id FROM product FORCE INDEX (ix_weight_brand,ix_quantity_brand) WHERE (brand_id = 4 AND weight = 41) OR (brand_id = 4 AND quantity = 78);
Both of these use the same access path: a merge sort using the indexes ix_weight_brand and ix_quantity_brand. Which one do I prefer, then? My personal opinion (but it is just that: an opinion that is personal) is to use the UNION, based on four facts:
  • When running these two statements side by side on the same data and using SQL_NO_CACHE (i.e. not using the query cache), the UNION is consistently faster. Not that the index_merge is much slower or anything, in particular not compared to using the ix_brand index that the optimizer prefers unless I tell it not to, but the UNION is still faster.
  • Sometimes I could see the optimizer still not doing its job correctly, even with the FORCE INDEX in place. In some cases only one of the indexes I forced would be used. Don't ask me why, and I cannot reproduce it now, so maybe it was my eyeglasses having fun with me.
  • The UNION construct means that I can use ANSI SQL; FORCE INDEX, not so. This is important to me, as I want to keep my options open when it comes to databases. Which doesn't mean I always use ANSI SQL, but if I have the choice between ANSI and non-ANSI SQL for two statements that are otherwise similar, I choose ANSI SQL.
  • I have a feeling that in my case, the UNION will be more flexible. If more indexes and conditions are added, the FORCE INDEX part will be difficult to maintain, whereas with the UNION this will be easier (see the sketch after this list). Which doesn't mean that I particularly enjoy using a specific SQL declarative construct to optimize performance; I would much rather have the optimizer deal with this for me. But it doesn't.
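To illustrate that last point, here is a sketch of how the UNION grows with a third condition. Note that the length column and the ix_length_brand index are hypothetical; they do not exist in the table above:
-- Hypothetical: assumes a length column and a KEY ix_length_brand (length, brand_id).
SELECT id FROM product WHERE brand_id = 4 AND weight = 41
UNION
SELECT id FROM product WHERE brand_id = 4 AND quantity = 78
UNION
SELECT id FROM product WHERE brand_id = 4 AND length = 130;
-- The FORCE INDEX variant would instead need its index list kept in sync by hand:
-- FORCE INDEX (ix_weight_brand, ix_quantity_brand, ix_length_brand)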
In conclusion, yes, I was wrong, I admit it, MySQL sure can use two indexes and do an index merge. But I was right in the sense that this seems to happen rarely, and that the optimizer isn't really doing its job properly here anyway. But I am glad there are some openings for fixing this, as the access method exists and the optimizer knows about it.
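One related knob worth knowing about, for those who want to experiment: since 5.1 the index_merge strategies can be toggled per session via optimizer_switch. This won't make the optimizer pick index_merge more often in cases like the one above, but it is useful when testing:
-- Toggle the index_merge strategies for the current session (5.1 and up):
SET SESSION optimizer_switch = 'index_merge=on,index_merge_union=on,index_merge_sort_union=on';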

/Karlsson
Who was wrong! I admit it!

Tuesday, July 19, 2011

First attempts with MySQL Installer

This was a sad day for me. Back when I was at MySQL, I was a big fan of a better installer for MySQL on Windows. Something that would install everything you wanted on Windows, in a way appropriate for Windows and integrating with the relevant Windows products. So installing MySQL would not just install MySQL, it would also install the Visual Studio plugin for MySQL, for example, if you so requested (and the installer might even be so smart that it could check if VS was installed, and then politely ask you if you wanted the plugin). The same goes for ODBC drivers, .NET drivers and what have you.

Fact is, I had this idea a long time ago and was promoting it inside MySQL with the usual frenzy I get when I think I have a good idea, although I wasn't the first to think of it.

So now, after all these years, and I have even left MySQL since, this installer is nearly ready for prime time: MySQL Installer is at its last RC before GA! So I decided to download it and see what they have done with this smart idea. And boy, was I disappointed.

To begin with, it is announced as "RC" but the download page still calls it "Beta". Hmmm, whatever. Also, it is 32-bit only; there is no 64-bit download. But thinking about it, I realize this might be the installer itself (32-bit, that is); it might well install 64-bit software. And it does, thank you, but this should be made clearer, and why we even bother with 32-bit builds anymore is beyond me. Why not go for 16-bit while we are at it?

And without further ado, here is my verdict: it would be quite OK, assuming it was a pre-alpha release! No, this is very far from RC quality! I am sorry, MySQL and all the developers who have put lots of effort into this thingy, but it is not good enough. Far from it, actually! Maybe it wasn't tested enough, maybe it wasn't tested at all? But frankly, from an RC product I don't expect "Unhandled exception" after just a few minutes of perfectly normal use.

One thing it does is download the stuff on an as-needed basis. In my case, it failed to find a download location for Connector/.NET (I am in the middle of nowhere, I know that, I mean Sweden, what kind of weird place is THAT? And Stockholm?). After that failure, nothing worked; retrying was to no avail. Not finding a download location for Connector/.NET (I have heard that .NET is pretty popular on Windows, but what do I know) stopped the whole installation process with an Unhandled exception. No, that is not RC, that is what we in the rest of the world call pre-alpha.

For your own good, and for the sake of MySQL credibility on Windows (which I am a fan of, and I am sure we can get this installer going): do not release MySQL Installer in this shape!

Calling it quits for the day, thinking "good idea, bad execution" and feeling a bit sad,
/Karlsson
And if you ask: Yes, I did report a few bugs on MySQL Installer today

OR conditions considered bad... Or? And a workaround.

Some things are known to be just bad. GOTOs used to be one such thing (something I still use, but only where appropriate, which isn't that many places). Maybe it is just so that some things are useful, but not for everything, so maybe the issue is that they are used inappropriately. Or?

The OR condition is one such thing in MySQL circles! Oh, you have an OR condition! That is going to be so slow! Sort of. And the reason an OR is "slow" is that, as MySQL will use only one index for each statement, only one "side" of the OR condition can use an index. Or sometimes even worse, MySQL will consider using an index that is common to the two "sides" or is outside the OR condition, despite the fact that there are perfectly fine, highly selective indexes on both sides of the OR condition.

If you ask me, this is not a fault of the OR condition but rather a problem with the MySQL optimizer. Why in heaven's name can't a single statement use two indexes, if that is what it takes? And let me let you in on a little secret: MySQL can use multiple indexes for one statement! But that depends on what you mean by a statement. And MySQL means something slightly different than many of us do!

Without further ado, let's have a look at an example. We work at a retail store, and a package from us has been stuck at the post office. We want to check what product this is, but we don't know the product id. What the guy who called us from the post office gave us was something that looked like a brand name, which I can map to a brand ID, the number of units in the package and the weight. But to be honest, the last two weren't terribly reliable. OK, let's find the product in the product table, which looks like this:
CREATE TABLE `product` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`brand_id` int(11) NOT NULL,
`quantity` int(11) NOT NULL,
`weight` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `ix_brand` (`brand_id`),
KEY `ix_weight_brand` (`weight`,`brand_id`),
KEY `ix_quantity_brand` (`quantity`,`brand_id`)
) ENGINE=InnoDB AUTO_INCREMENT=2321295 DEFAULT CHARSET=utf8

I know for certain that brand_id is 6, I already looked that up. But there are millions of products in the product table! Luckily, looking for appropriate products using brand_id and either quantity or weight should be easy, right? We know that the weight is 41 and the quantity is 78. And we have appropriate indexes, so this should not be a big deal, right:
SELECT id FROM product WHERE brand_id = 6 AND (weight = 41 OR quantity = 78)

Well, although this works, it is a bit sluggish, really slow actually. Let's look at what MySQL does with this statement:
EXPLAIN SELECT id FROM product WHERE brand_id = 6 AND (weight = 41 OR quantity = 78)
And what we get is this:
+----+-------------+---------+------+--------------------------------------------+----------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+---------+------+--------------------------------------------+----------+---------+-------+------+-------------+
| 1 | SIMPLE | product | ref | ix_brand,ix_weight_brand,ix_quantity_brand | ix_brand | 4 | const | 4291 | Using where |
+----+-------------+---------+------+--------------------------------------------+----------+---------+-------+------+-------------+
That wasn't so good. Let's try a different way:
EXPLAIN SELECT id FROM product WHERE (brand_id = 6 AND weight = 41) OR (brand_id = 6 AND quantity = 78);

And that will result in the same query plan. Only one index can be used, and there is one index that fits both paths, the one on brand_id, so MySQL picks that. Using FORCE INDEX will work, but still only one index will be used, and the result may well be even worse, as FORCE INDEX on, say, the ix_weight_brand index will make the other path, on quantity, dead slow! What you would like MySQL to do, which doesn't seem so complicated, is to realize that there are two distinct paths here which can each be looked up using an index real easy, execute them both and merge the results. But no, MySQL will not DO that! Only one index per statement, that's it. Or?
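To see what I mean about forcing a single index, here is a sketch; the exact plan and row estimates will of course depend on your data:
-- Forcing just one of the two indexes: the weight branch can use it,
-- but the quantity branch cannot use any index at all.
EXPLAIN SELECT id FROM product FORCE INDEX (ix_weight_brand)
WHERE (brand_id = 6 AND weight = 41) OR (brand_id = 6 AND quantity = 78);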

Well, when you understand that MySQL will only use one index per statement, consider what MySQL means by statement here. For a SELECT it is the individual SELECT statement that is the statement, which sounds reasonable until you consider a UNION! Each and every SELECT in a UNION is considered a separate statement (in this particular case, that is, but it is a bit messy; UNIONs in MySQL are a bit of a kludge, really)! So if we rewrite the statement above as a UNION, which is easily done for many queries involving OR conditions, you get something like this:
SELECT id FROM product
WHERE brand_id = 6 AND weight = 41
UNION
SELECT id FROM product
WHERE brand_id = 6 AND quantity = 78;

What are we saying here? We are telling MySQL that these are actually two separate paths, which is what we did with the OR condition too, but in this case MySQL can use two indexes and will nicely merge the results, so an EXPLAIN looks like this:
+------+--------------+------------+------+----------------------------+-------------------+---------+-------------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+--------------+------------+------+----------------------------+-------------------+---------+-------------+------+-------------+
| 1 | PRIMARY | product | ref | ix_brand,ix_weight_brand | ix_weight_brand | 8 | const,const | 31 | Using index |
| 2 | UNION | product | ref | ix_brand,ix_quantity_brand | ix_quantity_brand | 8 | const,const | 1 | Using index |
| NULL | UNION RESULT | <union1,2> | ALL | NULL | NULL | NULL | NULL | NULL | |
+------+--------------+------------+------+----------------------------+-------------------+---------+-------------+------+-------------+
This latter query is often so much faster than the alternatives, and we have tricked MySQL into using two indexes and merging the results. But for some reason, MySQL is unable to figure this one out for itself. Is this an important tip I am giving you here? Is this a neat optimization trick that I am handing out? Short-term, the answer is yes.
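One detail on that rewrite, by the way: a plain UNION removes duplicates, which we need here, since a row could match both branches. If you exclude the overlap explicitly, UNION ALL can skip the duplicate-elimination pass; a sketch:
-- UNION ALL skips the duplicate-elimination pass; the extra weight <> 41
-- condition keeps a row matching both branches from appearing twice.
SELECT id FROM product WHERE brand_id = 6 AND weight = 41
UNION ALL
SELECT id FROM product WHERE brand_id = 6 AND quantity = 78 AND weight <> 41;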

But am I saying with this that you should steer clear of OR conditions? Absolutely not, no way. What I am presenting here is an awkward way of circumventing some obvious flaws in the MySQL optimizer, and those should be fixed! What I AM saying is this: if you currently have big performance problems with MySQL SELECTs involving OR conditions, you might consider rewriting those statements as UNIONs; sometimes that helps. But do not do this with ALL your OR conditions, only where you have to and it makes sense. Let's meanwhile wait for the MySQL developers to fix this. (No, I'm not good enough at the optimizer code, or most other parts of the MySQL kernel, to fix this myself. I'm happy to build things around MySQL, but I do not have the time to get more involved with the kernel.)

And before I leave you for now: this was tested with MySQL 5.5.7 on Linux. I have NOT checked for fixes or updates to this, but I do hope it has NOT been fixed. Why? Why do I not want it fixed?? Have I gone bonkers? Yes, I am bonkers, but that's not the issue here; the issue is that such rather involved fixes to the optimizer are NOT something I want introduced in the middle of a GA release! But I'd be really glad to have it fixed in 5.6 or whatever that release is to be called! And yes, I am aware this is not exactly about the optimizer itself, but more about query execution, but for now I have decided to call it the optimizer anyway, as the sun is shining and the weather is nice and all that; sometime around Christmas I might consider changing my mind.

Cheers for now
/Karlsson