The whole deal with eventual consistency is something that I am still opposed to; I want to know if my data is consistent. And I am not sure that you cannot have a fully consistent, distributed system either. But I guess that debate goes on. And I still want my base data to be consistent. Like in RDBMS-SQL-Foreign-keys-all-over-the-place-and-not-a-bl**dy-bit-lost-in-the-MyISAM-swamp consistent. That is what I want the base data to look like. And if there are compromises with this, which there may well be, then I want to know about those too.
So, having covered that, what am I trying to say? Well, the more you normalize your data and the stricter you are with data quality, the more troublesome management of that data is going to be, and that is something we have to live with, I guess. But what if you then want to run some hefty queries on that data? The data is organized to make it really consistent, while the queries just want the data, and the query side of things really doesn't care about normalization at all. How do you deal with that? One way, of course, may be to replicate to something more query-friendly, possibly a second MySQL server or even a bunch of such servers in a scale-out scenario. But your data structures still look really complex, having been built to support storage, update, maintenance and consistency requirements foremost.
At Recorded Future we have taken a different path in our latest release: choose the best tool for the job at hand. We use MySQL with InnoDB for our data loading and storage, and for that, MySQL works really well. So we have the data we have collected, processed and organized, structured nicely in an RDBMS.
Now, on the other side of things, where queries are made, things look different. There we want to fulfil two needs, basically:
- Fast querying for data; in our case these are instances.
- Fast retrieval of attributes of the instances that were retrieved.
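To make the split between the storage side and the query side concrete, here is a minimal sketch of the general idea, not our actual schema or code. It assumes a hypothetical two-table layout (`instance` and `attribute`), uses Python's built-in `sqlite3` purely as a stand-in for MySQL/InnoDB so that it runs anywhere, and uses a plain dictionary as a stand-in for a document store. All table, column and function names are made up for illustration.

```python
import sqlite3

# Stand-in for the normalized store. sqlite3 is used only so the sketch is
# self-contained; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE instance (id INTEGER PRIMARY KEY, kind TEXT NOT NULL);
    CREATE TABLE attribute (
        instance_id INTEGER NOT NULL REFERENCES instance(id),
        name        TEXT NOT NULL,
        value       TEXT NOT NULL,
        PRIMARY KEY (instance_id, name)
    );
    INSERT INTO instance VALUES (1, 'event'), (2, 'event');
    INSERT INTO attribute VALUES
        (1, 'source', 'news'), (1, 'language', 'en'),
        (2, 'source', 'blog');
""")


def build_documents(conn):
    """Flatten the normalized rows into one document per instance."""
    docs = {}
    rows = conn.execute(
        "SELECT i.id, i.kind, a.name, a.value "
        "FROM instance i LEFT JOIN attribute a ON a.instance_id = i.id "
        "ORDER BY i.id, a.name"
    )
    for instance_id, kind, name, value in rows:
        doc = docs.setdefault(
            instance_id, {"_id": instance_id, "kind": kind, "attributes": {}}
        )
        if name is not None:  # instances without attributes yield NULLs
            doc["attributes"][name] = value
    return docs


# Stand-in for the query-side store: documents keyed by instance id.
documents = build_documents(conn)


def fetch_attributes(ids):
    """Return the documents for the given instance ids, skipping unknown ids."""
    return [documents[i] for i in ids if i in documents]
```

The consistency rules (foreign keys, primary keys) live only on the storage side; the query side gets one ready-made document per instance, so fetching the attributes of a set of instances is a lookup by id rather than a join.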
Let me leave Sphinx aside for now (what we do with Sphinx is actually really simple) and concentrate on MongoDB, where we also do pretty simple things, but where the requirements and the scale are higher for us. MongoDB has so far performed well for us. We are running in an Amazon EC2 environment, and that has issues of its own (in particular this seemed to be the case with Sphinx, but they are on the case). As for Mongo, it is so common in EC2 environments that I guess it has been more tested there.
We are always on the lookout for new technologies, and we do try many things, but the current setup is really useful and we do get much better performance and scalability. And yes, we do get both: with the same number of servers, we get better performance and a much better distribution of load across the machines. Now we are waiting for Amazon to fix their disk IO and network issues.
Hope to see you in Santa Clara in April at the MySQL UC!