Thursday, February 3, 2011

Objections to Objects?

A long time ago, there were no objects. There was clean, cold code, and the closer to the hardware the faster, not only because it is faster that way, but also because not just anyone would write code in those days. Then came programming languages in the 1950's and structured programming in the 1960's (the first commercial software offering was Auto-Flow, a program that would generate flowchart prints from existing programs written in Cobol, Fortran etc. Auto-Flow dates from the early 1960's).

The 1970's came around, and besides awful clothing styles and horrible interior decoration colour schemes, and things such as ABBA, we got Personal Computers, sort of, and the Relational Database (Oracle was first with an true software offering in this arena, in 1977).

In the 1980's Object Orientation, which had been rather esoteric and academic in the 1970's, turned into reality. (Also, Oracle got BIG). And if you ask me, this was a good thing, if it wasn't for what happened in the 1990's.

The 1990's saw, among other things, my (still) favorite hangout Akkurat open in Stockholm serving English Cask Conditioned Real Ale and Belgian Beers, among other things. One of the worst things that I think happened in the 1990's though was that Object Orientation, which had so far been a real cool way of writing organized code, which in itself also documented, in the code, relationships and interactions between objects, into a Religion. Oh my, among the most awful things was getting into a customer where an OO consultant had taken control. You couldn't even go to teh bathroom without having a crapper object with number one and number two methods. Awful stuff.

And in the midst of that came... Yes, you know what I am talking about, the ORDBMS and OODBMS, the Object relational and the Object Oriented Databases. I was out there, trying to convince people to buy the ORDBMS I was promoting at the time, something I was doing with some, but still limited, success. I admit it, I was wrong, this was, in the general case, not a good idea!

Looking back, I think I know what I think the problem is with the OO database things. At least I have an opinion (you expected that I guess) and here it is: OO is fine, it's a great way or organizing data in your code, but just because can make code good (it doesn't HAVE to, just as little as C automatically makes your code "portable"), it doesn't necessarily make your data any good.

One thing here is that there is lots of code that use your data, and there are many way to look at data at. So your code makes do with a view of data that fits whatever task your code is set out to do. But the same data may well be applicable to another use, using some other, completely different, structure, relationships and methods. The mistake, in my view, is to say that just because I can process data in an OO way, my data should be organized in an OO manner. I do not think so.

Another thing is that, when looking at data, the Relational model is brilliant. It allows you to view data from any angle, independent of how it will later be processed, which is the key here: Data is data, and processing of that data is processing of that data. And yes, these two things are linked, but that doesn't mean they have to have the same structure.

A third things is that OO from the start was a developer thing. It allowed you to write black-box style code, it allowed smarter code, and it encouraged structure to the code and made it easy to provide and enforce strict interfaces between different aspects of the code. But it was still about code. I think nearly every highly paid OO evangelist of the 1990 was looking at OO from the POV of code, and not data.

My view is a bit like this, when it comes to relational databases and different kinds of programming methods and whathaveyounot. The RDBMS is like a container, you can fill it with whatever you want, but it is just contents, it will not move or do anything, it will just stay there in the container, and the bookshelf you put in there is there the next time you open the container (hopefully). The transport mechanisms, the trucks, ships, air carriers, satellites and what have you not, is the processing we do. You pick the best processing for the job. There may be better ways of transporting your data than a container, but because all containers look the same, this creates so much of an advantage as the disadvantage of not filling up every square inch of the ship is minor.

And then you have the standard things that allows your container to be loaded, moved, fastened, unfastened, lifted and loaded again etc in a standard way. This is your query language. Of which SQL might well not be the best (I don't think it is), but that is beside the point, that everybody uses it is what the advantage is. It really is.

And when I pack my container to move it, I don't have to worry about if the container will fly, go by truck or be shipped. I really do not care, I just fill it with stuff. Just as I just provide my database tables.

And another things, looking at data in the shape of fixed named columns and variable number of rows is something that everyone, more or less, can understand. It's a universal language that goes WAY beyond the RDBMS.

To conclude, no, I don't think OO sucks, I think it rocks! But for code, not for data. And no, I don't think that your data is good just because it's in an RDBMS. And no, I do not even think an RDBMS is good for everything! There are cases, many of them, where all you need is an organized persistent storage for your programs data: No-SQL is great for that. But if data is data, and not just an extension of the RAM occupied by your program, then an SQL based RDBMS is often (but not always) your best bet!

/Karlsson

6 comments:

Evan Sparks said...

My main problem with a traditional RDBMS is that the relational paradigm is fundamentally inefficient (read: broken) for certain types of tasks. What types of tasks? Tasks that depend tremendously on order in your data.

Why is this? Well, the relational data model (and by extension SQL) is grounded in set theory. If we remember set theory from our university classes, the key properties of sets are that 1) items in a set are unordered, and that 2) there are no duplicate items. It is these two properties that allow the math to work out well and that allow relational databases to function incredibly fast for transactional applications (O(1) inserts? Just throw it in the bag!).

However, particularly in the analytics space, order can matter a lot. As a trivial example, let's consider that we're looking at a table of quarterly sales. We want to calculate growth in the quarterly sales which has the mathematical formula 100*((S/lag(S,1) - 1). In MySQL, I'm unaware of a method for calculating that value without a JOIN.

In a multidimensional, or OLAP database system, these operations where order matters are baked right into the data model. Whether these systems perform at scale is another worry altogether. It can take some time to wrap your head around the hypercube data model if you come from a relational world, but it's powerful stuff once you grok it.

rpbouman said...

Anders,

While I don't agree on all points, I do agree that from a data maintenance point of view, the relational model is superior to many other models I know of.

But I'm a developer too, and I feel the pain in making the languages and the RDBMS play nice almost everyday.

I think freebase, and its query language MQL have found a pretty good solution to this problem. Even though freebase is a graph database, I think some of the principles apply rather nicely to the RDBMS.

As a poc, I created mql-to-sql, which implements a webservice that lets you query any RDBMS (well, any PDO RDBMS with reasonable SQL support) over HTTP in MQL.

The nice thing is that you get to keep the benefits of storing and maintaing data in a relational form, and still query it from any angle you want, but at the same time, you can project the data in non-relational ways, with multiple repeating groups, arbitrary deep nesting etc, and the client just needs one roundtrip to the app server, and no additional processing at the client side.

It would even be nicer if this was implemented directly inside a RDBMS, since I think there are opportunities where the RDBMS could be much more efficient than a middleware layer.

For some example queries, see http://code.google.com/p/mql-to-sql/wiki/SampleMQLQueries
It has a link to an online query editor so you can play with it without installing anything.

Karlsson said...

Evan!

I agree with what you say, although I do not think the relational model is broken in any sense. Question is, should we adopt the data model to the processing needs, or the other way around? And I say, both: Create a datastora that is good for your data. You say that the RDBMS model is not good for OLAP needs, which is very true, and I then say: take the RDBMS data and create a new store which is modelled after the OLAP processing needs.
In conclusion, I agree that the RDBMS model not optimal for, sat OLAP or some other processing needs. But I maintain that it is optimal for the data itself. And that it may well be adopted for other needs.
We at Recorded Future is an example of this: Store, maintain and process in an RDBMS (MySQL currently). Then query in Sphinx and look up details in MongoDB. Works like a charm.

Roland:
Yes, I would love to see an RDBMS have a more developer friendly interface. I think this may well be the key here: An agreed data structure, i.e. Relational, and a multitude of interfaces (SQL being the data centric interface). And I am not talking about things like an ORM or something, that is just an extra layer, we don't need that.

/Karlsson

rpbouman said...

Anders,

Personally, I feel a language *is* an extra layer on top of the raw storage engine bit of the DBMS.

But I agree there is a practical difference between that and an ORM, since an ORM lets the developer (or maybe I should say "requires") to model the relationships between the application specific object layer and a specific relational schema.

A language stipulates a generic data model and its semantics and lets you work with the data model of the underlying data model of the DBMS.

Main difference is that in the case of a language implementation, the data models are not application specific, but more like metamodels, and the mapping is pretty rigorous and therefor standardized regardless of any particular application.

These definitions of ORM vs language don't say anything in particular about the component that implements it - this maybe client, server or middleware.

I'm not sure if you agree with my definitions here, (if not, I'd love to hear your point of view on it, my thinking about these things is not as clear as I'd like to have it), but if you do you'll notice that mql-to-sql is currently a MQL language implementation (not an ORM) but implemented at a middleware layer. I'd love to have it built-in to the RDBMS, which would offer all kinds of opportunities to optimize the implementation, but for now I'm happy with something less fancy that is easy to implement and portable across RDBMS-es.

Karlsson said...

Roland!

MQL really sounds cool and useful. In my mind, the relational way of organizing and managing data is brilliant. As for SQL I am much less sure, actually, I am 99.99% convinced it is far from the best solution. The best aspect of SQL, by far at least in my mind, is that it is ubiquitous.
I'll have a look at MQL, it sounds fun. If you are at the MySQL UC in April, why don't we have a BoF on the subject?

/Karlsson

rpbouman said...

Anders,

yes, I will be at the UC, and I will be talking exactly about this subject: http://en.oreilly.com/mysql2011/public/schedule/detail/17134

I'm looking forward to meeting you there. I would love to attend/help organize a BoF about SQL/Database query languages