Karlsson on databases and stuff: February 2011

Saturday, February 26, 2011

SHOW or INFORMATION_SCHEMA tables? Can MySQL developers PLEASE make up their minds!

Which one do you prefer?
SHOW TABLES
or
SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE table_schema = DATABASE()
I know what I prefer these days! I have, I admit it, gotten used to the SHOW commands, and I find them useful in many cases. For example in the case above, if I don't have that many tables in the database and I only want to see what they are. a simple SHOW TABLES command does the trick nicely. And this is said by someone (myself) who really disliked the SHOW commands initially. And to be honest, I still dislike them in many cases, cases where the ability to filter, join, aggregate annd get an instance view (sic!) on things where only a proper SQL TABLE, such as the INFORMATION_SCHEMA.TABLES table, would do the trick. So all in all, they both have uses, the way I look at it is that SHOW commands are useful shortcuts for the INFORMATION_SCHEMA tables.

Regrettable, that is not how it works in real life. Rather, the INFORMATION_SCHEMA tables are largely based on what the SHOW commands return, which is a shame. Most it is a shame as many useful things with INFORMATION_SCHEMA tables, such as joins and aggregates, turns to being painfully slow even with a modest size instance.

And not only that. It was once the intention, IIRC, to make all SHOW commands available as INFORMATION_SCHEMA tables. But in recent MySQL versions, this is going in another, confusing, direction. MySQL 5.5 adds the SHOW RELAYLOG EVENTS command, for example. This returns data that would be really useful to do at least some filtering on, and possibly a join or two, but no, this is ONLY available as a SHOW command. You cannot open a cursor on a SHOW command either, which means that you cannot, in plain SQL, take data from it and store it in some other table, for example, so you cannot use MySQL, otherwise really useful, EVENTS here.

On the other hand, MySQL 5.5 introduces the PARAMETERS INFORMATION_SCHEMA table. This has no corresponding SHOW command? So if uou are to interactively call a procedure and want to look at the parameters, then you are back at the INFORMATION_SCHEMA tables again. Com one now, MySQL developers, make up your minds where you want stuff to be. And if you ask me, what I would want. No, correct that, require, at least in the long run:

All useful shortcuts to metadata available as SHOW commands, including some kind of filtering (the SHOW RELAYLOG EVENTS command lacks even the simple filtering along the lines of the SHOW TABLES and SHOW STATUScommands).
All available metadata available in INFORMATION_SCHEMA tables. Yes, that is ALL metadata.
And for the INFORMATION_SCHEMA tables, I don't really want a performance improvement, as currently there is no performance at all to improve. Can we please get SOME kind of performance on these tables, please.

In addition, a metadata repository would be nice. And if you ask me, I'd prefer to have that in a central repositiry, such as INFORMATION_SCHEMA. The whole idea of .frm files is hopelessly outdated I'm afraid. And yes, I understand the implications in terms of portability across versions and stuff, but you are smart guys, so I am sure you can fix that.

Cheerio
/Karlsson

Tuesday, February 15, 2011

VIEWS on INFORMATION_SCHEMA tables - Useful stuff

Views in MySQL really has quite a bad reputation, for bad performance mainly, but also there were some stability issues at some point. Now they are pretty stable, but I don't see them used that much. One place where I like to use them myself is in combination with INFORMATION_SCHEMA tables. The I_S tables are really useful and contain a lot of information, and by using VIEWs we can massage the data a bit.

I often have a database specifcally for DBA needs, so that is what we will use here.
CREATE DATABASE IF NOT EXISTS dba;
USE dba;

OK, now we have a database to play with. Lets solve a minor problem first. Being able to use the TABLES table in INFORMATION_SCHEMA is great, as it allows standard SQL filtering and processing, in difference to the output from SHOW TABLES (largely at least, some filtering is available in the SHOW commands also of course). But the TABLES table contains the tables in ALL databases in the MySQL instance, not just the current database. Also, although the INFORMATION_SCHEMA tables are neat, typing INFORMATION_SCHEMA to access these tables all the time is tiresome. This is something that is easily fixed with a view. In the dba database, create a VIEW like this:
CREATE OR REPLACE VIEW dba.tables AS
SELECT * FROM information_schema.tables
WHERE table_schema = database();

Note that this is "dynamic", i.e. the database() function is evaluated when a query runs against the view, not when the view is created. So if we change to the test database and issues the command:
USE test;
SELECT * FROM dba.tables;

You will get the tables in the test database. Pretty neat and useful. Another issue with the I_S tables is that there is a whole lot of data in them, the data is often at a low level and sometimes they are not organized in a way to make using them easy. Case in point is the GLOBAL_STATUS table. This has a LOT of values, 300+ in the version of MySQL I use, and often you are only interested in some values, which is solved by filtering. But some values are missing, like cache hit ratios. Which is not to say that cache hit ratios cannot be computed from values in the GLOBAL_STATUS table. But we can make things easier with a view:
CREATE OR REPLACE VIEW dba.global_status(
variable_name, variable_value)
AS SELECT gs1.variable_name, gs1.variable_value
FROM information_schema.global_status AS gs1
UNION SELECT 'INNODB_CACHE_HIT_RATIO', ROUND((1 - (gs1.variable_value
/ gs2.variable_value)) * 100, 0)
FROM information_schema.global_status AS gs1
STRAIGHT_JOIN information_schema.global_status AS gs2
WHERE gs1.variable_name = 'INNODB_BUFFER_POOL_READS'
AND gs2.variable_name = 'INNODB_BUFFER_POOL_READ_REQUESTS'
UNION SELECT 'MYISAM_CACHE_HIT_RATIO', ROUND((1 - (gs1.variable_value
/ gs2.variable_value)) * 100, 0)
FROM information_schema.global_status AS gs1
STRAIGHT_JOIN information_schema.global_status AS gs2
WHERE gs1.variable_name = 'KEY_READS'
AND gs2.variable_name = 'KEY_READ_REQUESTS'
UNION SELECT 'QCACHE_CACHE_HIT_RATIO', ROUND((gs1.variable_value
/ gs2.variable_value) * 100, 0)
FROM information_schema.global_status AS gs1
STRAIGHT_JOIN information_schema.global_status AS gs2
WHERE gs1.variable_name = 'QCACHE_HITS'
AND gs2.variable_name = 'COM_SELECT'
ORDER BY variable_name;

Now, we have cache hit ratio values added to the global status. All the existing status values are still there. Also, note an interesting MySQL extension to VIEWs here: You can add an ORDER BY to them!

/Karlsson

Monday, February 14, 2011

Yet another MySQL fork - Hurra! Not...

Announcing the KarlssonSQL MySQL fork versioon 5.2.7 based on MariaDB with patches from some guy at the local bar that seemed pretty cool, and also including the Facebook patch version 1.8.3 and some neat InnoDB multi-threading patches that I bought from a guy on eBay. The performance is great when it doesn't crash, and when it crashes, it crashes real fast, so performance is good even then!

Jokes aside, how many forks do we need, really? And what are the differences, really? With really here I mean in terms of real-world usefulness for real users? Also, it seems that different forks are focusing of different areas, based on their expertize, but on the other hand I find noone collecting the bits and pieces into a common Fork. The official Oracle MySQL distribution really is what I will stick to for now.

Which is not to say that there is anything wrong with the forks per se, I mean if someone outside Oracle has some brilliant idea of increasing the performance of InnoDB, and implements that and makes it available to us all, that is of course a good thing. But that is just the actions of that or those guys. If we assume it will also be maintained by those guys, then we have an enormous overlap here, where everyone is maintaining their own fork of MySQL, and this can hardly be what we wanted from Open Source: Source available to all, and everyone has their own version of the source. Again, I stick to Oracle for now. Probably not the hottest of "forks", but stable and reasonably reliable, in particular if you stay a version or two behind the latest and greatest release.

The same goes for Storage Engines. This is another thing that is inherently a good thing, but if it ends up with just too many or, as it might look right now, to specialized storage engines, what is the use? It's like we are back in the 1950's where everyone was running custom software. The difference is that then you had you, as there was no other option, whereas the reason these days seems to be that you can. Which is not a good reason, really, except that it is fun.

Fun is not the same as Common Good though.

/Karlsson
Who is possibly a bit grumpy again today...

Thursday, February 10, 2011

Which is "The company we love to hate" today??

WARNING! Sarcasm, irony and a few damn lies coming up!

Yeah, we all feel that at times. The big evil company that just acts like a bulldozer in the marketplace, or in your front yard. Boring, evil, moneymaking and bureaucratic. For a long time, that company was IBM. In those days, which lasted up until the late 1980's or so, before IBM posted the 1991 results, and we all laughed. Problem was then, we meeded some other company to hate with affection. And along came Microsoft, Bill Gates fit the Bill (!) perfectly here, a geek, a nerd and insanely rich. What an obvious target THAT was! And Bill wanted Windows on every desktop, hey that is just plain evil!

Today though, we have a problem. The thing is that to love to hate someone, there must, for some reason, be some kind of affection (admittedly, there was little to love with IBM also, except the alternatives were even worse). And although the stereotypical rich nerd that Gates is was something you could both hate and feel affection for, it is hard to feel affection for "Big Bull" Ballmer! Also, we are probably getting tired of the love-and-hate relationship with Microsoft by now, after close to 20 years with that.

So, which is the next company to take the leadership is this contest? I see three alternatives:

Apple - This is obvious. A charming Steve Jobs, that takes on the world of mobile phones and devices, and at the same time is also a rich dude and a founding father of Apple. And again insist on censoring your morning newspaper when feed through your iPad. Apple has grown big enough to take the helm here, and Steve is Charming enough to create the suspicison that maybe, just maybe, he and Apple is evil.
Google - This is an interesting contender. They get head to head with the former company-we-love-to-hate leader, Microsoft, which is sure way to into the contest. What Google lacks in a public likeable spokesperson, it compensates for by being the ubiquitous search engine. We like Google, maybe we love it, but maybe there is something with the results they give me? Any what are you doing,, censoring stuff in China?
Oracle - You have to give Larry credit for not being the most charming person on the planet, really, at least not in the way he is portrayed in the press. And Oracle as a company has a lot of something that we who love to hate really love-to-hate, lawyers. And those lawyers aren't just getting after anything, they are after the users and the user communities. There sure is evil enough within Oracle to qualify for the company-we-love-to-hate.

Of these three, only one remains, and here is my reasoning: Oracle, well they are evil enough, but they are not really known that well outside the world of IT professionals, and also, there should be something at least seen from the outside likeable, something that of the surface looks good that you can get affected to, but can also suspect that it is part of a big evil plan (like Windows when we first started loving to hate Microsoft). And Oracle lacks that, it is, in a way, just pure evil (not that anyone working there is. And not that Oracle hasn't been good to me when I worked there either.). But by far the biggest problem with Oracle is not that it is evil, but that it lacks good, and that it is lacks a wide public image. Sorry Larry.

As for Google, they are in much better shape than Oracle. The mix of good and evil is much better. Censorship of content is one evil thing, whereas the free search engine and apps is good. The feeling that you actually get too much paid for content when you search is just that feeling of an uncertain conspiracy that could get Google the #1 spot here. The problem is just that Google is too good. Noone really thinks that Google really is bad, and when Google acts bad, everyone seems to assume that Google was forced to. So where Oracle may be too evil, in the public eye, Google is too good.

This leaves us with Apple. And Apple sure fits the Bill. Steve Jobs is a great public spokesperson for a company that is after the company-we-love-to-hate Oscar, being both charming, unstable at times and sometime aggressive and very sure of himself. Brilliant! And Apple is trying real hard to win this prize, censoring content and applications on AppStore Apps, not allowing publications that are not "Apple friendly" and not allowing emulators or portability libraries. Yes, these are exactly the kind of things a winner of the company-we-love-to-hate prize show have. And Steve Jobs is a great spokesperson.

Congratulations Steve Jobs and Apple. And to Larry and Steve Ballmer: welcome back next year!

/Karlsson
PS. This is not serious. This I wrote for fun. This not the opinions of my employer or anyone else, actually, it is hardly my own opinion. I do not really think companies are evil. But I DO believe that we need a company to hate at times. Also, I don't have an iPhone!

Wednesday, February 9, 2011

EC2 - The E is for Elastic

So, you are thinking about Cloud Computing? Is it a fad, along the lines of SOA, OOP, NoSQL, ORDBMS or is it a new paradigm when it comes to infrastructure? (not that a fad is bad, it's just that a fad, in my mind, is something that is grossly overblown in proportion. OOP is a good thing, but tell you what, OO-talibans out there, despite what you may think, OOP will not create peace in the middle east (if it did, I'd embrace it right now)).

But all that aside, what is in the Cloud, really? And from a technical standpoint, it seems simple enough: Your servers running across a number of virtual machines, with virtual disks and what have you not, where you pay for resource use and you share the environment with a bunch of other users. And that really is not that complicated. And from a pure technical view, that is it, sort of, but there is more to it than that, because when you come to run your stuff in a cloud, you realize that things aren't as simple, and that running in a VM in a cloud really is different from just running in a VM Ware / Zen / Zones environment, or something like that.

This is an easy mistake to do. When I started working here at Recorded Future, where we use Amazon EC2 for everything, that is what I thought. We run Ubuntu in a virtual environment+ Big deal. There are some EC2 integration tools and some GUI to administer the whole shebang, but except for that, this is no different than your server-room Linux box, but at a lower cost. And yes, I fully admit it, I was wrong.

What this is about, more than running in a Virtual environment, is the side effects of an environment shared with other, and how you set up your system to support that. Two important things I have learnt so far in some 5 months with Recorded Future:

Scalability is key! It really is. And I know, everyone want scalability, but in an environment such as EC2, with many different configuration options, but still a shared environment, you must be able to scale. And Scale horizontally. Even the very largest virtual machines at Amazon has the power for demanding applications in terms of disk I/O performance, Network throughput and latency and CPU performance.
It's not cheap. No, it's not, you are wrong. And that is not to say that it's bad. But it is different! If you manage that difference, you can run a very cost-efficient operation with EC2. But if you expect vertical scaling or have monolithic setups that runs on a single machine that has to scale, in a single instance, with your needs, then don't think this will save you and headaces or money.
But if you build your infrastructure in such a way that it scales nicely acress servers, and ensure that performance requirements on a single server are modest, and can be distributed if needed, then EC2 is for you.

So what about the Elastic aspect then? Well, elasticity in EC2 works both ways, on one side, the performance you get, in particular in terms of network latency, will vary over time. It just will, so get over it, accept the facts and create a system that can sustain it. On the other hand then, you can add resources as you need them. Sounds great, doesn't it? Well, yes, but there is a limit to WHAT resources you can add, and how.

EC2 allows you to add disks to your system as you please. There is a GUI (which is not very good) and there are command-line programs (not particularly good either, to be frank, but I am not, I'm Anders) to do this. But they do the job reasonably well. What you can NOT buy is more disk I/O throughput, just like that. You can get more disks and stripe them for sure, but that's about it. The same goes for CPU, you can get more of them (to a limit), but they are only so powerful. It's not like "Gimme a gazzillion of Petaflops" just like that, I'm afraid.

Above all, the network is only so fast, and regrettably that is not terribly fast. Also, the way EC2 manages DNS looks is just plain weird, I have to assume it is so for a reason, but the reason just has to be something you smoke, but not necessarily something you inhale.

So where are we now, then? EC2 has servers with limited performance, with limited disk I/O capacity and interconnected by what sometimes seems like 2400 baud modems. How can all this be useful? And by the way, at times both Disk I/O and Network performance is quite acceptable, but your milage may vary. Tell you what makes this stuff rock: There are MANY servers and many disks. How many as you want. And you can move them around, mount one disk on one machine, and then on another (like in a SAN, for example) just like that.

All in all then, the perfect infrastructure for EC2 then is something that scale with the number of servers. Scale I/O, Scale network performance, scale User connections, whatever the bottleneck is in your system, with the number of servers, each having as much disk as is necessary.

And having said that, you may have figured out where I am going with this, as in the Letterman Show, we are playing Will it scale?. Yes, that is a valid question, and the answer is, maybe. The less state a system holds, the easier it will scale, to make it really simple. A webserver that is just serving stuff off a disk will scale real nice, for example. Also, the less you persist, one way or the other, the better it will scale. The most common way of persisting data is of course writing it to disk, that is clear, but any kind of persisting data that is shared will cause some limit to scalability.

A simple Web-server scales easily, as we have already said. I more complex Web application, say a PHP application, may not scale as well, but still a lot, as long as the state is limited to a single PHP session or similar. The same goes for App-servers I guess. The one thing in most systems that is difficult to scale is the database, and the reason is of course that the database has a lot of state. A stateless database is not much of a database, to be honest.

MySQL has a means of scaling the database that has worked quite well for a number of years in Web-style applications. Scale-out is the way to go. And scale-out is simple enough: Asynchronously Replicated Slaves being fed from a Master. The Asynchronous nature of this Replication means that database writes (which all go to the Master) are not held up by replication to the Slaves. And you can have, in theory, any number of Slaves. But there is a drawback to all this: Only Reads may scale, not writes. Master-Master may help, to an extent, but not much. And a massive replication setup, with N slaves all Replicating from a different point in the Master. Another issue is that for the Slave to do it's job properly, the operations on the Slave are serialized (if this wasn't the case, think about what Foreign Key relationships might cause you), which means the Slaves are slower than the Master when it comes to Writes, which in turn means that unless you want the Slaves to become more and more behind, you better keep the write-load average at the level that the Slave can sustain, not the Master.

And no, by the way, I don't think MySQL Replication is something bad and awful. But I do think I know what it is good at and what it is NOT good at. And it will not help you scale your writes. Nope.

So what would a database that could scale in an Amazon EC2 Cloud look like? Above all, it would need:

Flexible configuration - Adding a node should be just that, adding a node. No restart, no optimizations, no reorg of data, no downtime, just Here is a node: Use it. And removal of a node should be the same. And management of data in the database as well. No more monolithic database configurations, please!
Scalable performance - And once you add those nodes, they really should be able to increase the performance of things, and not just to a small extent.
Data distribution - Yes, I want my data distributed. Replicated to where it is used. Distributed and persisted to where it is best persisted.
Transparent - Yes, all this should be transparent to the application.
SQL based RDBMS - Yes, I know, I am an old fashioned guy, but this what I want. SQL because it is ubiquitous, not because I think it's the best query language on the planet. And I want an RDBMS because I firmly believe that that is best for my data. Which is not to say that some application might prefer some other means of storage (if you want my full view on these matters, read this post).

So, is this just a wet dream? I hope not. One technology I am looking at and which I am eager to try later this year is NimbusDB. This is created by Jim Starkey, and if you think that the MySQL Falcon debacle was caused by him, and that NimbusDB is therefore not something to look seriously, think again. Having looked at a number of different technologies in the past 5 months or so, I have to say that NimbuxDB is the only one where they have at least understood what problems to solve in a Cloud environment, and that is not a bad start. And tell you what? I do think that to support such a setup as NimbusDB sets out to do, you really need to start from Scratch, I do not think that Oracle 13g or MySQL Clouse Storage Engine or something like that will be around to fulfill the requirements to run and scale properly in a Cloud. But you never know, there are interesting times ahead.

/Karlsson

Thursday, February 3, 2011

Objections to Objects?

A long time ago, there were no objects. There was clean, cold code, and the closer to the hardware the faster, not only because it is faster that way, but also because not just anyone would write code in those days. Then came programming languages in the 1950's and structured programming in the 1960's (the first commercial software offering was Auto-Flow, a program that would generate flowchart prints from existing programs written in Cobol, Fortran etc. Auto-Flow dates from the early 1960's).

The 1970's came around, and besides awful clothing styles and horrible interior decoration colour schemes, and things such as ABBA, we got Personal Computers, sort of, and the Relational Database (Oracle was first with an true software offering in this arena, in 1977).

In the 1980's Object Orientation, which had been rather esoteric and academic in the 1970's, turned into reality. (Also, Oracle got BIG). And if you ask me, this was a good thing, if it wasn't for what happened in the 1990's.

The 1990's saw, among other things, my (still) favorite hangout Akkurat open in Stockholm serving English Cask Conditioned Real Ale and Belgian Beers, among other things. One of the worst things that I think happened in the 1990's though was that Object Orientation, which had so far been a real cool way of writing organized code, which in itself also documented, in the code, relationships and interactions between objects, into a Religion. Oh my, among the most awful things was getting into a customer where an OO consultant had taken control. You couldn't even go to teh bathroom without having a crapper object with number one and number two methods. Awful stuff.

And in the midst of that came... Yes, you know what I am talking about, the ORDBMS and OODBMS, the Object relational and the Object Oriented Databases. I was out there, trying to convince people to buy the ORDBMS I was promoting at the time, something I was doing with some, but still limited, success. I admit it, I was wrong, this was, in the general case, not a good idea!

Looking back, I think I know what I think the problem is with the OO database things. At least I have an opinion (you expected that I guess) and here it is: OO is fine, it's a great way or organizing data in your code, but just because can make code good (it doesn't HAVE to, just as little as C automatically makes your code "portable"), it doesn't necessarily make your data any good.

One thing here is that there is lots of code that use your data, and there are many way to look at data at. So your code makes do with a view of data that fits whatever task your code is set out to do. But the same data may well be applicable to another use, using some other, completely different, structure, relationships and methods. The mistake, in my view, is to say that just because I can process data in an OO way, my data should be organized in an OO manner. I do not think so.

Another thing is that, when looking at data, the Relational model is brilliant. It allows you to view data from any angle, independent of how it will later be processed, which is the key here: Data is data, and processing of that data is processing of that data. And yes, these two things are linked, but that doesn't mean they have to have the same structure.

A third things is that OO from the start was a developer thing. It allowed you to write black-box style code, it allowed smarter code, and it encouraged structure to the code and made it easy to provide and enforce strict interfaces between different aspects of the code. But it was still about code. I think nearly every highly paid OO evangelist of the 1990 was looking at OO from the POV of code, and not data.

My view is a bit like this, when it comes to relational databases and different kinds of programming methods and whathaveyounot. The RDBMS is like a container, you can fill it with whatever you want, but it is just contents, it will not move or do anything, it will just stay there in the container, and the bookshelf you put in there is there the next time you open the container (hopefully). The transport mechanisms, the trucks, ships, air carriers, satellites and what have you not, is the processing we do. You pick the best processing for the job. There may be better ways of transporting your data than a container, but because all containers look the same, this creates so much of an advantage as the disadvantage of not filling up every square inch of the ship is minor.

And then you have the standard things that allows your container to be loaded, moved, fastened, unfastened, lifted and loaded again etc in a standard way. This is your query language. Of which SQL might well not be the best (I don't think it is), but that is beside the point, that everybody uses it is what the advantage is. It really is.

And when I pack my container to move it, I don't have to worry about if the container will fly, go by truck or be shipped. I really do not care, I just fill it with stuff. Just as I just provide my database tables.

And another things, looking at data in the shape of fixed named columns and variable number of rows is something that everyone, more or less, can understand. It's a universal language that goes WAY beyond the RDBMS.

To conclude, no, I don't think OO sucks, I think it rocks! But for code, not for data. And no, I don't think that your data is good just because it's in an RDBMS. And no, I do not even think an RDBMS is good for everything! There are cases, many of them, where all you need is an organized persistent storage for your programs data: No-SQL is great for that. But if data is data, and not just an extension of the RAM occupied by your program, then an SQL based RDBMS is often (but not always) your best bet!

/Karlsson