Sunday, October 24, 2010

It's about time

TIMESTAMP and DATETIME is how dates and times, or temporal, data is represented ion MySQL. These types are combined with a bunch of operators on these types, such as ADDATE and TIME_FORMAT and such. So is this enough to effectively manage temporal data then? Nope, in my mind, far from it.

To begin with, time isn't such a simple type of data as we might look at it. Is a date just a count of the number of seconds since 00:00:00 on Jan 1 1970 or something like that? And we then convert it to a calendar and all is fine? Is time data really linear in that sense? No, it's not. Dec 24 2010 is Christmas Eve and when you get your Christmas gifts here in Sweden. Is this significant? Also, it's a Friday. That this data is a Friday can be calculated from the date or the seconds-since-jan-1-1970 counter, but that is a different thing. Looking at that counter, you cannot see that there is no way to look at that number and figure out that this day is special, but it is.

To manage this, we have the date and time functions in MySQL, which help us figure things like the above out, but there are issues with it still. To begin woth, the constructs quickly turn difficult. Secondly, you cannot really do an indexed search on temporal data, except pretty simple cases, such as searching for a data and time, before or later than a given data and things like that. A search for "all saturdays" will sure not use an index.

There are more issues though, like that fact that, as I have shown above, a temporal value as we know them above, contains hidden data, something that is considered bad (like the day of the week, the month, the fact that a data is a leap day etc).

And it doesn't end there, there are different calendars in use, in different parts of the world and in different inductries. If you have a mortage on your house, chances are that the interest you pay is the same each month, despite the fact that the interest in calculated over the full year and different months have different number of days. The reason is that interest uses a "360 day" calendar, where each month is assumed to have 30 days. And different financial instruments use diferent calendars (for example the one called 366 day, which is our normal calendar and the 365 day calendar. which is the normal calendar with the exception that there is no leap day). If you try to match one financial instrument to another, when they use different calendars, is not a lot of fun (I've done it, so I know).

But chances are that many of you now think that this is not a big problem for you, as you don't see people asking questions on data with complex temporal relationships. And to that I say that, there is a big chance that this is because:
  • Temporal data is difficult. More so than simple integers and strings, disregarding the simple before date X or something like that. But we just don't ask the questions like "how much of the sales per quarter happen on the last day of the quarter" and things like that. Which doesn't mean the information isn't interesting, but the questions is difficult to formalize and even more so to execute in a database. Not to mention checking for overlapping dates and things like that.
  • We are used to not being able to ask these kinds of questions.
But really we should be able to ask, query and get data with a lot more temporal information. When was the last time you asked a temporal data query in Google (not that you CAN do this). Like, how has the matching or IBM and Pressrelease changed over the last 10 years, i.e. how many Press releases did IBM post in 2001, in 2002 etc. Yes, the data is there somewhere, querying it is difficult. And using temporal data combined with data you get on the web should allow you to determine more than what you can today (i.e. IBM and Press release matches X number of documents in Google right now). Adding a temporal axis to web data should allow you to determine what might happen.

One issue with all this is that databases are hardly geared up to support this, and a second is that the data to support this isn't available in large quantities yet. This is one of the things we are looking at at Recorded Future, but there is more stuff coming. The web is there with masses of data, now we need to make sense of all that data. One aspect which is there but which hasn't yet been explored much is just the temporal aspect of data.

/Karlsson

4 comments:

Jeremy Cole said...

Anders,

I see this more as a relatively easy to solve secondary problem. For instance, learning and being able to query on the "hidden" metadata could be exposed with a two-prong approach:

1. Some function to take an arbitrary date/time and return "interesting" metadata about it, perhaps as K/V pairs. E.g., holiday=Christmas-Eve, quarter=4, days-to-end-of-quarter=7, days-from-start-of-quarter=84, etc.

2. The ability to index on functions in some useful way.

Combining those above things you could fairly easily construct queries about any "interesting" metadata your function happens to support.

Regards,

Jeremy

Anonymous said...

The problem is basically that we pretend that date and datetimes are atomic, but they really aren't :)

(Perhaps atomic is not exactly the right word, complex is perhaps better)

But regarding the "soft" metadata, I think that problem is adequately solved by keeping a date table like you do in data warehousing (a date dimension): simple have a table with one row for each day, and use columns to store metadata like name of holdiday, whether or not it is a weekend, whether this year is a leap year, the quarter according to the calendar, the fiscal quarter etc.

Depending on your requirements you can store calendar data in that table too, or create separate date tables for each distinct calendar (taking care that the key of the date tables is the same everywhere). Same goes for localization and internationalization.

Of course, you still need to fill this table periodically, but after you generated and stored that data, it is amazing how much easier queries become once you have all metadata as simple fields, ready to be used (as opposed to crafting complex date formatting and extraction functions)

There are a few maintenance problems to take care of, such as how often do I add more data for future dates, but in reality you'll find that you only need to have a supply of dates to add temporal context to events in the past or events in the relatively near future.

Karlsson said...

Jeremy: Yes, absolutely, but that query would not be easy to read, nor easy or even possible to index using a b-tree at least. Like searching for overlaping intervals and stuff like that. There is a PostgreSQL project to adress thi, have a look and you see what I mean.

Roland: Regarding atomicity, I agree. Regarding the solution to the problem you outline, well, true, that would solve the problem, sort of. But would not be terribly relational (data duplication). And what about different holidays in different cultures or regions?

sapphirepaw said...

Recently I've been dealing with data hidden a layer deeper: the business is only closed during a subset of US holidays. Even if I could get information like "Is $day a holiday?" it's not enough to answer whether we will be closed that day.

Another interesting one is that MySQL doesn't store timezone. I store everything as UTC in my own projects, but most of my employers have had systems storing local time into DATETIME columns, and then I can't tell when '2010-11-07 01:45:00' is EST or EDT.

(DATETIME can get even more exciting when you're mixing use of NOW() in mysql and date() in php, and they're working off of different tzdata or system timezone. But I don't expect anyone can solve that one in the DB alone.)