Karlsson on databases and stuff: February 2012

Saturday, February 25, 2012

Book review: "The computer boys take over"

I guess that most folk who read this blog are IT people in one way or the other. I do not expect my mother to read it, or even my wife. This is, after all, mostly about rather geeky technical computer things. But what do I do, when I am at work? I have been asked this more times than I care to remember, and I think this goes for most of you. Your significant other asks, your mum, some realtive, a guy you meet on a train or in a bar.

So what do we do, really? 50 years ago, the kind of jobs we do didn't even exist, and now most young people today can't figure out what a world without computers would be like. A large organization with 10s of thousands of employees managing all those people and all their production completely without computers? Not even thinkable these days, but way into the 1960s, computers was not something for everyone or even for every corporation, and many didn't even know what to do with them, did they have one.

So there has been an explosion not only in the number of computers, but also in the number of people, like ourselves, working with computers. And this type of rapid growth of a profession usually cause things to happen also on the sociological side of things. How do we work together? How does the rest of the work commuicate with us? How does a high-ranking manager of a large corporation feel when his skills are questioned by some dude in a t-shirt, a long beard a wearing sandals? And there is nothing he an do about it. He has to communicate, not the other way around (which might be just as well, we computer dudes mostly aren't good communicators anyway). And how does this change society as a whole?

These were some questions and issues I wanted to have some ideas of answers on, or at least some additional insight in, when I picked up "The computer boys take over" by Nathan Ensmenger (ISBN 978-0-262-05093-7). The book didn't actually turn out to be just what I expected though. As for a historic book, I usually prefer them to be largely cronological, and if the subject is broken up in smaller ones, then at least those I perfer to have these sections in some cronology. In this case, this wasn't so. Rather, this book seems like a book for students to read sections of. The different sections on the different aspects of the computerization of society and the status of the IT workes largely overlap, and the same quotes are used to support different things. Also, the book really gives little insight on the process of things, rather, certain events are exlained, and the impacts those events have, but then they are largely lost, until they pop up in another section of the book and some more is mentioned on the subject.

Don't take me wrong here, this is not a bad book and I did get lot of interesting insights, but the book was really boring to read in the sense of the organization of things, the gems were hidden here and there all over the book, with the rest filled with a lot of repetition of arguments and quotes.

I don't regret buying this book or reading it, but I feel it is not really intended to be read like I read it, front to back. Rather, it seems like this is more of a book for classes or academics, where a section of the book is read, which is not a bad things per se. The reasearch behind the book has also been extensive and ambitious is seems, with a huge section at the send of the book with footnotes, references and a comprehensive index. The index itself makes this book useful to me as a reference work. But I feel that the book could have been better with some more organization of the subject, a better cronology and some more editing.

One thing that is missing throughout the book was the international persepctive. The book is very focused on the US environment, and there are very few mentions of the effects of society and how the computer professional profession grew in other parts of the world. There are a few quotes from Maurice Wilkes, but those are so common and well known that they just could not be left out in a book like this. Like this 1946 quote from Wilkes (which proved to be more true than anyone knew at the time I think): "It had not occured to me that there was going to be any difficulty about getting programs working [...] And it was with somewhat of a shock that I realized that for the reat of my life I was going to spend a good deal of my time finding mistakes I had made in my programs."

And again. If you want to read something about the sociological influence on society and on the growth of the "computer buys" as a profession, there isn' really that much to read. So maybe this book is for you, maybe on a long flight or something. But be prepared to be a bit bored in between in interesting sections.

/Karlsson

Tuesday, February 21, 2012

Amazon RDS for MySQL - Is it any good?

We have been testiing the MySQL RDS Service from Amazon here at RF, and so far it looks good for what we use MySQL for. We don't have that much data left in MySQL though, but some of the services are still mission critical, and we have had a couple running on RDS for a week or so.

I have to say I am pretty happy so far, but there are a couple of things you should know before you get to use the services. To begin with, you don't get a proper MySQL root account, i.e. no SUPER privileges. No big deal I guess, but this means there are a few things you cannot do. I was planning to migrate off the EC2 based server to RDS by setting up the RDS service as a slave, and then switching to the RDS service, but as SUPER privilege is required for CHANGE MASTER TO, I could not do this. And before you ask: Amazon DOES support slaves among the RDS servers, but this is set up using their own methods. Easy to use, yes, but not as flexible as the real thing.

As for parameters, you can modify a bunch of them, but not all. I have to benchmark performance here eventually, but it has been sufficient for our modest MySQL needs so far, so I might not even do that. I also have to check what parameters I am allowed to modify and which I can't in some more detail also, to figure out if there is anything significant missing.

So far though, I think the conclusion is that this is a great service for people with low / medium requirements in terms of MySQL. Backup and snapshots are a breeze and are automated, cross availability zone use is also an option (not tested by me though). The server is accessed just like any other service, so you can set up Route 53 with a more reasonable name to point to it, if you wish. Which nodes can access the server is also controlled by you, both the service itself controlling what can access it, and by the usual MySQL GRANTs.

When we get more servers running on RDS, I'll let you know more.

Cheers
/Karlsson

Friday, February 17, 2012

Do I really want to get rid of SQL_MODE?

I recently wrote about SQL_MODE, and Ronald Bradford was quick to reply with his view on things. Although my post was a bit of a rant, and I think it was a necessary rant, it seems that Ronald and I largely agree on the overall matter of things.

Now today WebYog implies that I and Ronald want to get rid of it and replace it with more strict, rigid checks on data entering MySQL. Tell you want, however much I would like that, I am not stupid, I realize that is not practical. Rather, I say sure, keep SQL_MODE, but make it possible to enforce it and check it, and do not allow just about anyone with access to the database to compromise the data in it by a single command.

WebYog also implies that this would break a lot of applications. I say that is not the case as I say you can still set it at anything you want, just that it is enforced!

Even better, as WebYog says that applications share a MySQL instance, they may have, probably do have, different requirements. OK. Read what I wrote: I suggest you set it and enforce it on the database level. Right?

WebYou then pops up this example. With a table like this:
CREATE TABLE `test`.`strict`(
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`col1` VARCHAR(10) NOT NULL,
`col2` VARCHAR(10),
PRIMARY KEY (`id`));
An insert like this would fail:
INSERT INTO `test`.`strict`(`col2`) VALUES (‘string’);
Yes it would. And that INSERT is plain silly. Why am I assuming a value without specifying it anywhere, nowhere in the INSERT statement, not in the table schema, I just assume that MySQL will figure out an appropriate value for me? And if you DO assume something, like the value of the column c2 should be set to an empty string when not specified, then why not set the default value for it in the create table statement? Like this:
CREATE TABLE `test`.`strict`(
`id` INT UNSIGNED NOT NULL AUTO_INCREMENT,
`col1` VARCHAR(10) NOT NULL,
`col2` VARCHAR(10) DEFAULT '',
PRIMARY KEY (`id`));

Or? And note that in this case, having done things right, you don't get a warning! (Yes, the INSERT on the original table actually issues a warning. Why would I want to have code that issues warnings. And if the user can leave the column out without specifying it, why is it NOT NULL).

Actually, this behavior shows another MySQL weirdo behavior. If I change the table definition above slightly, and allow NULL values in the col2 column, what do I get from an INSERT if i leave col2 out? The same INSERT as above. The column col2 has the same database. The INSERT still specifies no value for it. Now col2 will be set to NULL. And this is supposed to be coherent, consistent and easy to use?

Now, when the column col2 allows NULLS, MySQL does something with my schema also, it adds a DEFAULT NULL clause to col2. So that makes a bit more sense, if it wasn't for the fact that it doesn't add a DEFAULT '' when I specify NOT NULL.

Enough of weird MySQL behavior complaints, all database systems have these, to an extent. But it would be nice if MySQL was more in line with the rest of them, which SQL_MODE is on its way to do, but then MySQL should also be able to enforce it, if I wanted it to.

/Karlsson
PS Comments are welcome but before you fire off your tomatoes, read what I wrote.

Thursday, February 16, 2012

More on database consistency

I've written a few times about database consistency before, mainly in conjunction with NoSQL and the concept of Eventual consistency. Now, I'm about to do an update on the subject, as I have come to realize a few things.

From an oldtimer like myself, having been an SQL guy for 25 years, I remember Punk-rock and even The Beatles and I having hair growing out of my ears, what can be contributed? Well, let me beging with stating what I mean when I say Database consistency. What I mean is Consistency as the C in ACID (no, we aren't talking drugs here, we are talking databases). Let's see what the online authorative reference work on just about anything on this planet, from the size of J-Lo's feet to the number of Atoms in the universe (those two numbers are quite far apart by the way), Wikipedia: "The consistency property ensures that any transaction will bring the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including but not limited to constraints, cascades, triggers, and any combination thereof." In other words, consistency means that the databas is always in a consistent state, the different data items in it (rows, if you wish) are "in sync" with eachother. I think most of you agree with this notion.

Now, when it comes to NoSQL databases, like MongoDB, this terminology is different. These guys introduced Eventual consistency, which means that the database will eventually reach a consistent state with regards to a specific transaction that changes that "state" of the database. But there are multiple transactions at the same time, and they aren't necessarily, in an Eventual consistentcy model, consistent with eachother as they aren't on the same node. But the theory goes that some time, eventually, they will. If the system never stops, and transactions keep coming, then eventual consistens is determined to happen within 100 ms or less from the point in time when pigs fly. But if you stop all state changing transactions, then the state of the database will reach consistency. Eventually.

Now in NoSQL circles there is a thing called a Consistent read. If my database was consistent, then any read is consistent, right? And in the case of use SQL RDBMS folks, consistency is about the state of the database when I write to it? Well, if you have an eventual consistency model, where you have data distributed all over the place, things are different. To begin with, the basic thing that you have to make sure, and the NoSQL databases do this, is to ensure that the writes to the databases are all in order (we know this from MySQL also, and it is part of the issue with the MySQL slaves, and the NoSQL guys aren't fixing this particular bottleneck). And here we mean they are in order in each and every node. Across nodes, we don't care, which is where I get my abilility to scale out writes from!

A consistent read is a read where the data I am reading is in a consistent state, or sometime that my data is the most recent data. These two aren't always the same, but the second (reading most recent data) typically implies the former, although I assume this is not always the case. This is VERY different from the meaning of Database Consistency as we RDBMS folks look at it. All the same, the concept sure is useful, and as the NoSQL distributed systems doesn't need to keep the data consistent on a global level, a lot of shortcuts can be taken. But having Read Consistency has litte to do with Database Consistency. Your NoSQL fans will complain here and try to tell you that these achieve the same thing, but they don't. Achieving global Database Consistency costs an arm and a leg or two in performance, but the database is ALWAYS consistent.

So two different things, both with advantages and disadvantages, but they are STILL different! And the NoSQL folks will confuse things by allowing you not to have even Read Consistency, somewhat implying that turning it on means you get Database Consistency and that the Read Consistency model (which is very very simple by the way) means you get the effect of Database Consistency using Eventual Consistency. Nope. You don't. Which doesn't make it bad, but IT IS NOT THE SAME THING!

/Karlsson

Wednesday, February 15, 2012

Why I think SQL_MODE is useless...

I think the SQL_MODE setting is a sort-of good idea, but that the implementation of it is useless, more or less. What I would want from a feature like this is to allow me to control the quality of the data in the database, but that is not what SQL_MODE does, regrettably.

Rather, what SQL_MODE does is define a default quality checking level for an individual session. This is NOT the same thing as guaranteeing that data has certain qualities in the database.

To begin with, a specific session can change it. If you want to insert an invalid data, just set SQL_MODE=ALLOW_INVALID_DATES and go ahead and insert.
Secondly, setting this to a non-default value requires no specific privileges at all!
Thirdly, as anyone can bypass these checks, you would assume that there was a way to check afterwards that data follows a specific SQL_MODE, or at least a function that I could apply to a table or column to check it, but no, no such thing exists.
The different consistency checks provided are somewhat ambigious and difficult to understand. Some applicatioon that found certaian aspects of MySQL 4.0 to be incompatible with their applications sets SQL_MODE=MYSQL40, which implies a lot of things.
And last but not least: This variable controls data quality. If some user decides to set this to some value to get around the database quality checks, you expect it to be logged somewhere, but no. Well, you can see it if you look carefully in the binlog, after the fact (and there you don't see the actual SQL_MODE, only the numeric representation of it).

And having said all this, I know what many of you think: Why let the database chack your data, that is the job of the application anyway? Right? Well, sort of. In my case, when I write applications of build websites, SQL_MODE isn't often an issue (except that Joomla insists on setting SQLMODE=MYSQL40, which means that my nice FK constrained database gets screwed up). But in larger team, like at work, developers are in one part of the office, and I am playing with the database. My responsibility: Make sure data is there is is right. The developers responsibility: Build their applications and make them run.

Here at Recorded Future, this is a very, very small problem. We have a good team, which isn't that big, we communicate and we also use less and less of MySQL these days. But I have been in places where this is a very real problem, and if delevopers use stuff like Hibernate it gets even worse, as the developers are now even more distanced from the real data.

And note here: This is not to say that developers lack competence and knowledge of databases, or that they write bad database code something, it's just that their focus is different from someone like myself, who work with databases on a regular basis.

And look, if you were a consultant or a developer, wouldn't it be nice to look at a database and just KNOW what data goes in there. Will I find a date that is the 31st of April in a datetime field or not?

And yes, there are applications that just will not run with MySQL 5.x. Fine, allow an option to be set for this, but don't make it changeable on a session level! On a global level then? With multiple applications using different databases, this is also a bad idea. I'd say make it a per database attribute!

And then we come to the situation where you really NEED to disable constraints, SQL_MODE etc. OK, I can allow this, assuming:

A higher level privilege than normal, or even better, a special privilege, a special privilege, is needed.
There has to be a way to check it, like an ALTER DATABASE SQL_MODE=xxx, ALTER TABLE SQL_MODE=xxx etc. And also some means of column level checking (can you write a script for this checking? Sure. Will it be faster or more convenient than running it as part of the database process? Nope, far from it. And convenience is important here as we want to run this often, to make sure that our data is what it should be).

And by the way, it is OK to change SQL_MODE on a session level, but only change it UP (where this is applicable). I.e. if the global SQL_MODE is ALLOW_INVALID_DATES, then a session can turn this off, but not ON (and by the way, again, this is a weird name. Id's rather have all SQL_MODE options be names as a FEATURE that is a good things: STRICT_DATE_CHECKS with the default being ON would be a better option, in my mind).

That's enough of complaints for today, take care
/Karlsson

Monday, February 6, 2012

On datatypes, domains and why I think it's time we reconsidered

What's in a datatype then? A MySQL SMALLINT? A C int? An Oracle BLOB? One thing is for sure, they are not very well standardized, not even within the same environment, and much less so across them. And what does it means, really? When should I use a BLOB, when a BINARY VARCHAR and when to use a long long?

A datatype defines many attributes:

What data I can store in it: Only numbers? Printable alaphanumeric characters? Unicode? Binary data? An object?
What I can do with a value of that particular type and how does it behave? Can I concatenate two values? Add them (that is NOT the same as concatenate! The same operator (+) may be used, but it's not the same thing!)? Save it to a file?
How is it stored and represented in technical terms. And this can be argued, a C int doesn't define how it is stored on disk, but a MySQL INT does! And a BLOB is a type that is largely there just because it defines how to handle things on disk!

To link this up with MySQL, let's look at the built-in MySQL datatypes. Some of them are rarely used, and for all intents and purposes, many of them are overlapping and in some cases they are there JUST BECAUSE they have different semantics and behaviour! I am, for example, not a big fan of all the different INT types that MySQL supports; tinyint, smallint mediuming, bigint all in signed and unsigned shape. Why is this useful? Yes, I know a smallint is smaller on disk so there is 2 bytes less to read from disk compared to an int. Yeah right, big deal: Look now, if that is a big deal for your application, then you have bigger problems. Which doesn't mean that using a smallint isn't a means of, in some cases, improve performance a small bit. And that this is sometimes useful. But really, this should be hidden somewhere, we just have too many datatypes. The same goes for all the different BLOB datatypes! The BLOB datatypes has another issue by the way, the names of them a a bit silly: "TINYBLOB" is a contradiction, as BLOB stands for Binary LARGE Object! A Large object that is Tiny? And a something that is at the most 255 bytes isn't really "Large" anymore, that was way back in the 1970's or so.

No, I'm not a big fan of determining which datatype to use by considering storage requirements to optimize performance. That is something you might do as a performance enhancing measure after the job is done, possibly. In C, which is a language I have used longer than I care to remember, int seems to serve the role of being the default numeric integer datatype, which is not a bad thing, in particular not in C as C is a language used to write low-level stuff in, stuff that is performance sensitive, and a C int is linked to hardware specific integers.

FLOAT is another issue altogether. In very few business applications is there a need for FLOAT or DOUBLE in terms of IEEE floating point values. Really! This is a scientific number format, that is also supported by the Floating point unit in most computers of today, so operations on them are pretty fast these days. But I have seen too many business applications where the developer uses a FLOAT in the database for any kind of numeric value that is not an integer, which is a distinctly bad idea! It works as long as the numbers aren't too big, and when the operations on them are reasonably simple. This is not to say a FLOAT or DOUBLE isn’t useful, just that they are more targeted for scientific programming (Hey, they are after all defined by IEEE!). What most people really should use is DECIMAL (I'm using MySQL datatypes for reference, but you know what I mean). This is a datatype that accepts (hey, big surprise coming up here) decimal numbers, fixed point decimal numbers! Operations on DECIMAL are slower though, as they aren't really supported directly by either C (which is the language that MySQL is largely written in) or by the CPU / FPU. But they are better for business applications (for scientific applications, use FLOAT or DOUBLE. And continue to write your code in Fortran by all means!)

But all these datatype are increasingly used also to define the domain of the data in them! The SQL DOMAIN feature is in the SQL Standard but is largely forgotten. It is there in PostgreSQL for you to try. The advantage of this, in my mind, isn't just that you can define in more exact terms what data should go into a database column, without writing code to do that, and to make it apparent in the database schema what data is expected, it also means that we can get away from the wide array of different built-in datatypes in, say, MySQL. Instead we can stick with a few optimized ones, let them have a simple inherent domain and then we can define the application specific domains in application code. Anyone for a IP_ADDRESS datatype? A ZIPCODE datatype? An URL type. All those a reasonable requests from the application POV, but it really shouldn't (in my mind) be defined by the database server (As: What are the semantics of a ZIP-code? They are different in different countries, and the post-office can change their minds (although that is something that takes a long time I guess)). Why? As this is application specific domain! And application specifics should be in application code! Simple as that!

So what datatypes are useful then, the way I look at it:

RAW Data - This should be a basic type. As long as can be, or as short. Any data can be stored in it, and nearly as much as you would like.
Integers - Although an integer can be considered to be a special case of a more generic number, the integer domain is so generic, and the integer is such a common and basic type that is deserves a place here.
Decimal – Fixed point decimal numbers.
Float - Now we are getting close to the application domain here, but these guys deserve a place here anyway, as they are so common and hooked up to the programming languages and hardware.
DATETIME - Only one of them is really needed. One of DATETIME, DATE, YEAR, MONTH, TIMESTAMP, TIME etc etc. that is. Just one, very generic base type, where the specific domain is, you've heard it before, application specific!
Text - Reluctantly I add this to the list. Although this is just a RAW with some limitations (usually only UTF8 or ISO8859-15 or something characters), there are some operations and attributes that are so tied up with text strings that a TEXT type is reasonable to include.

And that's about it! Yes! We could add a few standard domains of top of all this, like FLOAT and DOUBLE, SMALLINT, TINYINT etc etc. But as for the basic types, this is what I would like to see. As for the weirdo TIMESTAMP semantics in MySQL, add them as triggers on your table, in application code or, best, as a specific domain (I do know that not all weirdiness of TIMESTAMP can be represented by a standard SQL DOMAIN, which is both a sign of just how weird TIMESTAMP is, and of some of the limitation of the SQL DOMAIN).

I have now mentioned SQL DOMAINs a few times, so lets spend some quality time with them right now before closing up, here is a simple session using PostgreSQL 8.4:

First we create a domain:
CREATE DOMAIN yesno AS CHAR(1)
DEFAULT 'N'
NOT NULL CHECK (VALUE = 'Y' OR VALUE = 'N');
Then we create a table that uses that domain:
CREATE TABLE user_active(username CHAR(10), is_active YESNO);
And then we check if it works using a few SQL statements:
INSERT INTO user_active VALUES('foo', NULL);
ERROR: domain yesno does not allow null values
INSERT INTO user_active VALUES('foo', 'Y');
INSERT 0 1
INSERT INTO user_active VALUES('foo', 'N');
INSERT 0 1
INSERT INTO user_active VALUES('foo', 'A');
ERROR: value for domain yesno violates check constraint "yesno_check"

Before I wrap this up, I want to say one more thing on the issue of SQL Domains: There is an alternative n PostgreSQL, which is user defined datatypes using a CREATE TYPE SQL command. To begin with, I like domains much better as they are declarative, which means I don’t have to read a whole bunch of (usually uncommented and undocumented) code to understand what they type does. Secondly, this little warning in the PostgreSQL 9.1 manual isn’t too encouraging (page 1289): “To create a new base type, you must be a superuser. (This restriction is made because an erroneous type definition could confuse or even crash the server.)”

I have more to say on types, and I will follow up with a new post on this subject eventually, but this is enough for now.

/Karlsson

Wednesday, February 1, 2012

More on using the commandline vs. Visual editors

Shlomi Noach has made a couple of posts (here and here) on Visual Editors vs. commandline for MySQL working with SQL statements. All in all, I tend to agree with him in many respects, but I am surprised that he recommands MySQL Query Browser, as this is the editor that breaks one of the things that seems to annoy him (and me) most: Every SQL statement that is issed uses it's own connection. Which is not to say others may feel differently about it.

As for myself, I started developing my own MyQuery tool because there are things you just cannot do from the commandline, and there are things that are really complex and non-inituitive to do from the commandline, and you need a better tool than the commandline. Again: This is not to say that there is something wrong with the MySQL commandline, it really should not be full featured, but it should run all SQL commands with ease.

So, what was I missing in the commandline MySQL that led to the development of MyQuery, and project that I am still working on, on and off? Mainly, using the commandline makes it REAL hard to develop and debug scripts. That was the main reason! At the time that I released MyQueru 1.0, I was involved in a project that required A LOT of SQL scripting. And if you have a big, big SQL script, then you know what happens: The scripts runs for a few 100 lines, and then there is an error or a typy, and the commandline just keeps processing until we get an error, and then it stops (alternatively, you use --force, in which case you usually get truckloads of errors, and even if youjust get one error,it gets lost among all the other errors).

OK, so the script failed with an error, then you just fix the error and run it again? Right? Well, that approach assumes that your script is idempotent, which isn't always possible and is at least difficult in many cases. Wouldn't it be neat to have the script running stop when there is an error, allow you to fix the error, and then keep running from where the error occured? That is exactly what you can do with MyQuery.

Alternatively, here is another script running scenario:
You have a script that takes forever to run. You might have the odd error in there, but you can fix them afterwards, just run the script please! But also tell me about any errors in there, so I can fix them! OK, MyQuery does this too. You can run with force in MyQuery also, but the errors are should in a special errors tab. After the script has finished running, you can see the errors in the error output tab, and that is not intermixed with any status printouts or anything. And yes, click on an error text and you get to that position in the script.

Using SQL to generate SQL? Yeah, you can do that with the MySQL commandline, but MyQuery makes it far easier.

MyQuery has a bunch of other neat features, like the ability to save and store blob data to disk (doing that from the commandline is somewhat possible, but only with many limitations. INTO DUMPFILE only works for single row for example), extend the interface with custom tools, using SQL statements, or the MyQuery API and many other things (One supplied tool will nicely break up a long SQL statement into several indented lines, simple but useful when working with big SQL statements). That MyQuery has a colour-coded editor with a very flexible setup is also a bonus that makes it easier to work with.

Actually, the reason I started developing MyQuery is probably still the best reason to use it: To run scriripts.

/Karlsson