Wednesday, January 18, 2012

Database Innovation, pleeease!

I think you have heard me say it before, but in this case I think repetition is needed: we should be much more innovative in the database world. And no, I am not talking about NoSQL here, not at all. For all the good things about the NoSQL technologies and the movement itself, it's not really innovative. Rather, in my mind, NoSQL largely sacrifices functionality for performance. The schema-less design of most of these technologies is probably the one aspect I would consider innovative; the rest is just RAM-based storage, sharding, key-based lookups and good old B-Trees.

Talking about B-Trees, isn't it time we retired them soon? There should be better ways of indexing data. Look at something like Mongo. With MongoDB, you really want to have your indexes in memory, all of them; without that, performance will be awful (there are exceptions, but in general this is true). Now, a B-Tree is an index mechanism that has worked well, as its structure lends itself to good performance be it on disk or in memory. In general, though, a B-Tree is built for disk-based storage with caching; for in-memory use, there are better, more efficient indexing (or access) methods. So if an index in Mongo is supposed to be in memory, why choose a disk-oriented indexing mechanism? T-Trees are there, they are optimized for in-memory use and have been around for ages. I guess the answer is tradition.
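
To put rough numbers on why the B-Tree shape suits disk, here is a back-of-the-envelope sketch, not a benchmark. The 8 KB page and 16-byte index entry are assumed figures chosen purely for illustration, and `tree_height` is a hypothetical helper, not part of any product mentioned here.

```python
def tree_height(num_keys: int, fanout: int) -> int:
    """Return the smallest number of levels h such that fanout**h >= num_keys."""
    height = 0
    capacity = 1
    while capacity < num_keys:
        capacity *= fanout
        height += 1
    return height


# Assumed figures, for illustration only.
PAGE_BYTES = 8192    # one disk page per B-Tree node
ENTRY_BYTES = 16     # key plus child pointer
FANOUT = PAGE_BYTES // ENTRY_BYTES   # 512 entries per node

KEYS = 100_000_000

print(tree_height(KEYS, FANOUT))  # levels in a wide, page-sized tree: 3
print(tree_height(KEYS, 2))       # levels in a plain binary tree: 27
```

Under these assumptions, a wide node turns 27 levels into 3. On a spinning disk every level is potentially a seek, so that matters enormously. In memory there are no seeks to save, and the trade-off is about cache lines and comparisons per node instead, which is why a different node layout may be worth considering there.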

Tell you what, tradition is a BAD, BAD argument for anything in an industry that changes as fast as the IT industry. Would anyone suggest that Facebook base their hardware platform on Motorola 6800 CPUs? I think not. But the B-Tree predates even the 6800.

Which is not to say that the B-Tree is bad (or that the Motorola 6800 is, either). It's not. But we have much more diverse needs these days, so there should be more diverse access methods in use, and yet the B-Tree persists.

And look at SSDs. Yeah, the future, right? A largely random-access style of memory hooked onto an interface designed for electro-mechanical hard disks in the 1970s. Innovative? I think not. Apple got it right in attaching Flash directly to the motherboard, and PCI-based Flash is growing and coming down in price, so it seems things are moving there at least.

But in any case, Flash / SSD isn't an electro-mechanical disk with cylinders and sectors, despite what the SSD interface tells us. And when we say the B-Tree works well on disk, we are talking about electro-mechanical disks. Where are the access methods designed specifically to reap the benefits of direct-attached Flash?

And to be honest, the SQL-based RDBMS, something which I have spent my career with, in one shape or the other, for 25-ish years, is hopelessly outdated, but that is not why I'm no big fan of the NoSQL movement. Rather, my problem is just that the NoSQL movement really doesn't represent something new and isn't a disruptive technology in any way. Where, my friends, is the disruptive database technology? A technology built for (you are sitting down now, I hope) the 21st century. If you missed it, we are there now, and have been for 11 years actually, so start inventing.

And yes, I know about the different MySQL variations with sharding, storage engines etc. etc. But that is not terribly innovative or new. The closest we have come to a disruptive technology in the database world recently is the column-based storage databases. But these are not generic enough in my mind, and also, most of them have a SQL-based interface tacked onto them. And I understand why they want SQL: they need it to be able to sell the product, as all consumers of database products (most at least) want SQL to integrate it with some tool or infrastructure. And I understand this too, but it brings up a question. Where are the customers or end-users who are willing to give up a query language as old as Led Zeppelin to instead get the benefits of some new disruptive database technology?

But this is, I'm afraid, a bit of a chicken-and-egg situation. The customers aren't requesting innovative products as that technology hardly exists, and the products aren't developed and the research isn't done as the customers aren't there. This really has to change soon, and I am sure it will. If for no other reason, then so that I can retire in peace, knowing that my SQL skills are truly outdated and I will not have to work, because no one wants my skills!

/Karlsson
Looking forward to retirement

6 comments:

Bradley C. Kuszmaul said...

T-trees are not as good as B+ trees, even in main memory. That's because even in main memory, caching effects are important. One problem with T-trees is that they store data in internal nodes of the tree. The space at the root of the tree would be put to better use holding pivot keys instead of data. The Wikipedia page for T-trees cites several articles which studied this problem and concluded that T-trees do not beat B-trees.

If you want something innovative, consider LSM trees. If you want something really innovative, consider fractal trees, which have the insertion performance of LSM trees without the read penalty of LSM trees.

Karlsson said...

Agreed, to an extent, although despite their shortcomings, T-Trees are good for in-memory data. But not always. B-Trees are a technology that has served us very well for many, MANY years, but now is the time for change. And I do not imply that T-Trees are the answer, just that we need something better (not even I think that T-Trees are the answer, just that they are different).

Thanx for commenting
/Karlsson

Karlsson said...

And as for LSM / Fractal trees! Fine, cool! Let me have some real products to try them on!

Luca Garulli said...

Innovation? Look at the MVRB-Tree of OrientDB: it's derived from the well-known RB-Tree (super fast in memory) but stores X entries per page like a B+Tree. On memory reclaim, sub-trees are kept in memory.

Karlsson said...

Luca!

Again, cool stuff. I'll look into that. Then we need some commercial products (yes we do, to get these technologies accepted and get some money to develop them). Meanwhile I'll look into these interesting ideas.
But my take on the lack of innovation is, I think, still valid: the big database products out there, i.e. the likes of Oracle, MySQL, DB2 etc., are still, after all these years, awfully traditional. It has gotten to the point that when you talk about "INDEX" in the context of a database, everyone assumes you mean a B-Tree, and most other index types are explained in terms of how they differ from a B-Tree. Like explaining what Lady Gaga is about by talking about how she differs from Manfred Mann.

/Karlsson

CV said...

Agree with you Karlsson.
Innovation in databases is low; the reason could be their dependency on storage, processors, the OS and other factors.
I think the main areas with ample scope for innovation are processing speed, data storage etc.
I am, in my own way, trying to think of different ways of doing query processing, indexing etc.
e.g.:
Matrix databases,
sub columns
parent-child databases etc