Monday, February 6, 2012

On datatypes, domains and why I think it's time we reconsidered

What's in a datatype then? A MySQL SMALLINT? A C int? An Oracle BLOB? One thing is for sure, they are not very well standardized, not even within the same environment, and much less so across them. And what does it means, really? When should I use a BLOB, when a BINARY VARCHAR and when to use a long long?

A datatype defines many attributes:
  • What data I can store in it: Only numbers? Printable alaphanumeric characters? Unicode? Binary data? An object?
  • What I can do with a value of that particular type and how does it behave? Can I concatenate two values? Add them (that is NOT the same as concatenate! The same operator (+) may be used, but it's not the same thing!)? Save it to a file?
  • How is it stored and represented in technical terms. And this can be argued, a C int doesn't define how it is stored on disk, but a MySQL INT does! And a BLOB is a type that is largely there just because it defines how to handle things on disk!
To link this up with MySQL, let's look at the built-in MySQL datatypes. Some of them are rarely used, and for all intents and purposes, many of them are overlapping and in some cases they are there JUST BECAUSE they have different semantics and behaviour! I am, for example, not a big fan of all the different INT types that MySQL supports; tinyint, smallint mediuming, bigint all in signed and unsigned shape. Why is this useful? Yes, I know a smallint is smaller on disk so there is 2 bytes less to read from disk compared to an int. Yeah right, big deal: Look now, if that is a big deal for your application, then you have bigger problems. Which doesn't mean that using a smallint isn't a means of, in some cases, improve performance a small bit. And that this is sometimes useful. But really, this should be hidden somewhere, we just have too many datatypes. The same goes for all the different BLOB datatypes! The BLOB datatypes has another issue by the way, the names of them a a bit silly: "TINYBLOB" is a contradiction, as BLOB stands for Binary LARGE Object! A Large object that is Tiny? And a something that is at the most 255 bytes isn't really "Large" anymore, that was way back in the 1970's or so.

No, I'm not a big fan of determining which datatype to use by considering storage requirements to optimize performance. That is something you might do as a performance enhancing measure after the job is done, possibly. In C, which is a language I have used longer than I care to remember, int seems to serve the role of being the default numeric integer datatype, which is not a bad thing, in particular not in C as C is a language used to write low-level stuff in, stuff that is performance sensitive, and a C int is linked to hardware specific integers.

FLOAT is another issue altogether. In very few business applications is there a need for FLOAT or DOUBLE in terms of IEEE floating point values. Really! This is a scientific number format, that is also supported by the Floating point unit in most computers of today, so operations on them are pretty fast these days. But I have seen too many business applications where the developer uses a FLOAT in the database for any kind of numeric value that is not an integer, which is a distinctly bad idea! It works as long as the numbers aren't too big, and when the operations on them are reasonably simple. This is not to say a FLOAT or DOUBLE isn’t useful, just that they are more targeted for scientific programming (Hey, they are after all defined by IEEE!). What most people really should use is DECIMAL (I'm using MySQL datatypes for reference, but you know what I mean). This is a datatype that accepts (hey, big surprise coming up here) decimal numbers, fixed point decimal numbers! Operations on DECIMAL are slower though, as they aren't really supported directly by either C (which is the language that MySQL is largely written in) or by the CPU / FPU. But they are better for business applications (for scientific applications, use FLOAT or DOUBLE. And continue to write your code in Fortran by all means!)

But all these datatype are increasingly used also to define the domain of the data in them! The SQL DOMAIN feature is in the SQL Standard but is largely forgotten. It is there in PostgreSQL for you to try. The advantage of this, in my mind, isn't just that you can define in more exact terms what data should go into a database column, without writing code to do that, and to make it apparent in the database schema what data is expected, it also means that we can get away from the wide array of different built-in datatypes in, say, MySQL. Instead we can stick with a few optimized ones, let them have a simple inherent domain and then we can define the application specific domains in application code. Anyone for a IP_ADDRESS datatype? A ZIPCODE datatype? An URL type. All those a reasonable requests from the application POV, but it really shouldn't (in my mind) be defined by the database server (As: What are the semantics of a ZIP-code? They are different in different countries, and the post-office can change their minds (although that is something that takes a long time I guess)). Why? As this is application specific domain! And application specifics should be in application code! Simple as that!

So what datatypes are useful then, the way I look at it:
  • RAW Data - This should be a basic type. As long as can be, or as short. Any data can be stored in it, and nearly as much as you would like.
  • Integers - Although an integer can be considered to be a special case of a more generic number, the integer domain is so generic, and the integer is such a common and basic type that is deserves a place here.
  • Decimal – Fixed point decimal numbers.
  • Float - Now we are getting close to the application domain here, but these guys deserve a place here anyway, as they are so common and hooked up to the programming languages and hardware.
  • DATETIME - Only one of them is really needed. One of DATETIME, DATE, YEAR, MONTH, TIMESTAMP, TIME etc etc. that is. Just one, very generic base type, where the specific domain is, you've heard it before, application specific!
  • Text - Reluctantly I add this to the list. Although this is just a RAW with some limitations (usually only UTF8 or ISO8859-15 or something characters), there are some operations and attributes that are so tied up with text strings that a TEXT type is reasonable to include.
And that's about it! Yes! We could add a few standard domains of top of all this, like FLOAT and DOUBLE, SMALLINT, TINYINT etc etc. But as for the basic types, this is what I would like to see. As for the weirdo TIMESTAMP semantics in MySQL, add them as triggers on your table, in application code or, best, as a specific domain (I do know that not all weirdiness of TIMESTAMP can be represented by a standard SQL DOMAIN, which is both a sign of just how weird TIMESTAMP is, and of some of the limitation of the SQL DOMAIN).

I have now mentioned SQL DOMAINs a few times, so lets spend some quality time with them right now before closing up, here is a simple session using PostgreSQL 8.4:
  • First we create a domain:
    CREATE DOMAIN yesno AS CHAR(1)
    DEFAULT 'N'
    NOT NULL CHECK (VALUE = 'Y' OR VALUE = 'N');
  • Then we create a table that uses that domain:
    CREATE TABLE user_active(username CHAR(10), is_active YESNO);
  • And then we check if it works using a few SQL statements:
    INSERT INTO user_active VALUES('foo', NULL);
    ERROR: domain yesno does not allow null values
    INSERT INTO user_active VALUES('foo', 'Y');
    INSERT 0 1
    INSERT INTO user_active VALUES('foo', 'N');
    INSERT 0 1
    INSERT INTO user_active VALUES('foo', 'A');
    ERROR: value for domain yesno violates check constraint "yesno_check"
Before I wrap this up, I want to say one more thing on the issue of SQL Domains: There is an alternative n PostgreSQL, which is user defined datatypes using a CREATE TYPE SQL command. To begin with, I like domains much better as they are declarative, which means I don’t have to read a whole bunch of (usually uncommented and undocumented) code to understand what they type does. Secondly, this little warning in the PostgreSQL 9.1 manual isn’t too encouraging (page 1289): “To create a new base type, you must be a superuser. (This restriction is made because an erroneous type definition could confuse or even crash the server.)

I have more to say on types, and I will follow up with a new post on this subject eventually, but this is enough for now.

/Karlsson

4 comments:

Roland Bouman said...

Hi Anders,

interesting topic! I largely agree - DOMAIN is very useful, it could add a lot of semantics to DB schemas. Pity it's badly supported. AFAIK PostgreSQL and DB2 are the only RDBMSes that support them. Modeling tools don't support them well either.

I think MySQL has so many data types because of lack of proper CHECK constraints. By choosing a particular data type you can at least restrict the range of values to some extent.

I noticed a few remarks in your post that I wanted to comment on:

"A datatype defines many attributes"

Of the 3 things you mention, IMO the 2nd one is the essence of a data type.

The 1st bullet point doesn't look really strong, as you can store whatever you like in any data type; bitfields in ints, dates in ints, etc. Like you argue yourself, the 3rd bullet point is more an implementation detail.

That leaves the 2nd bullet - the effect of operations on the data as essence of data types.

"How is it stored and represented in technical terms. And this can be argued, a C int doesn't define how it is stored on disk, but a MySQL INT does!"

MySQL INTs do *not* define how they are stored on disk. The MySQL reference manual made it look like it did though. I'm referring to: "10.5. Data Type Storage Requirements" (http://dev.mysql.com/doc/refman/5.5/en/storage-requirements.html). Earlier versions of the page suggested that data type X requires so-and-so many bytes of "storage" which sounds like "disk". Fortunately this page was updated (also see http://bugs.mysql.com/bug.php?id=36850) and now more precisely explains that there's no direct correspondence between a MySQL data type and "storage requirement". The storage engine determines how things are stored. However, MySQL always reads data from disk into its row buffer. That's where the numbers come from.

(continued in the next comment)

Roland Bouman said...

(Continued from the first comment)

"DATETIME - Only one of them is really needed. Just one, very generic base type, where the specific domain is, you've heard it before, application specific!"

I do not share your opinion that its as simple as setting for one type:

- date and time are very different things. Date has to do with revolutions of the earth around the sun; time has to do with the revolutions of the earth around its axis. We always learn that we should normalize our data and put different things in different columns. Why settle for a data type that forces us to store this data denormalized?

- dates and times are fundamentally similar and simply enumerate an amount of time passed since an arbitrary fixed point in time. Date parts like year, month, etc. are just notational devices like the decimal and thousands separator in decimal numbers. (If you feel this argument contradicts my first one, then good! They are both true and just express a different point of view.)

- the nature of time makes it impossible to settle for a fixed resolution. Currently the maximum resolution is microseconds when in memory, and for unfathomable reasons, seconds when stored on disk. It doesn't really matter what the resolution is it will never be appropriate because it depends on the domain. Second resolution may be appropriate for recording marathon finish times, it is not for a 100 metre dash. Milliseconds might just be appropriate to record network latency, it is not if you want to record half-life of quarks. Date range is another example. For many business purposes a minimum date of 1000-01-01 is fine, but it might not do for an archaeologist.

- date/datetime data types make it hard to write efficient analytical queries. A point-of-sale system might record product scans at the register in date/time. Typical queries like: "what is on average the busiest time of day per season expressed in whole hours", “what are the sales of product X for each month (regardless of year)” and "how do sales of product X in this year compare to last year" are inefficient because we have to extract a particular field from the date and aggregate or join on that. Basically this is an extension of the 1st argument, namely that not only datetime is composite (date and time) but that date and time themselves are also composites. I hear PostgreSQL has something called a GIN index (http://www.postgresql.org/docs/devel/static/gin-intro.html) which allows you to separately index these different fields. If you do have such a composite data type, then really we should also have an index that can handle it.

- In extension of the prior argument that date/time is a composite: why settle here? Aren't there many "data types" composite in nature? Email addresses, zip codes, phone numbers, urls - potentially an endless list. Why should date/time be graced with its own data type?

Daniƫl van Eeden said...

Float in MySQL can be dangerous if you expect it to behave the same as it does in an Oracle database.
Bug #57519

Besides (date)time I think an inverval type is also needed. Time is a specific time of a day and Interval could be the number of seconds of a telephone call. It doesn't make sense to multiply 20:15:18 if it means 15 minutes past 8. It does make sense if it is the time it took to drive somewhere. My blog post about this issue.

Karlsson said...

Daniel!

Am INTERVAL datatype is a good idea, yes.

Roland!

Why I think Datetime has it's own type, but that DATE and TIME aren't needed is because:
- DATE and DATETIME are basically the same thing, with different precision.
- TIME is a pretty useless type, assuming DATETIME works as expected. This is because TIME is really a pretty simple type with simple semantics.
- Why DATETIME is special and is a type of it's own is because it has difficult and strange historical semantics. Leap-year, time-zones, etc spring to mind here. A DATETIME really should know "it's timesoze" for example.

/Karlsson