Monday, December 20, 2010

Using the right tool for the job at hand - MongoDB, MySQL and Sphinx

You may have seen my post regarding "eventual consistency" some months ago, and you may have come to the conclusion that I was insisting that a SQL-based RDBMS is the way to go for just about anything. Tell you what, that is not so. And neither am I against using, say, MongoDB, where it is appropriate.

The whole deal with eventual consistency is something that I am still opposed to: I want to know if my data is consistent. And I am not sure that you cannot have a fully consistent, distributed system either, but I guess that debate goes on. I still want my base data to be consistent. Like RDBMS-SQL-foreign-keys-all-over-the-place-and-not-a-bl**dy-bit-lost-in-the-MyISAM-swamp consistent. That is what I want the base data to look like. And if there are compromises with this, as there may well be, then I want to know about those too.

So, having covered that, what am I trying to say? Well, the more you normalize your data and the more strict you are with data quality, the more troublesome managing that data is going to be, and that is something we have to live with, I guess. But what if you then want to ask some hefty queries on that data? The data is organized to keep it consistent, but the queries just want the data; the query side of things really doesn't care about normalization at all. How do you deal with that? One way, of course, is to replicate to something more query-friendly, possibly a second MySQL server, or even a bunch of such servers in a scale-out scenario. But your data structures still look really complex, having been built to support storage, update, maintenance and consistency requirements foremost.

At Recorded Future we have taken a different path in our latest release: choose the best tool for the job at hand. We use MySQL with InnoDB for our data loading and storage, and for that, MySQL works really well. So we have the data we have collected, processed and organized, structured nicely in an RDBMS.

Now, on the other side of things, where queries are made, things look different. There we want to fulfil two needs, basically:
  • Fast querying for data; in our case these are instances.
  • Fast retrieval of the attributes of the instances that were retrieved.
The above is fulfilled with two supporting technologies: Sphinx for instance querying and MongoDB for instance attributes (answering the typical kind of key-value-store query: here is a key, gimme all the data). This is what we are running live with now, and so far we are really happy. Mind you, this didn't come that easy; both Sphinx and MongoDB have caused some problems.
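To make the two-step flow concrete, here is a minimal sketch in plain Python. The names here (search_instances, sphinx_index, attribute_store) are made-up stand-ins for illustration, not our real API: the first function plays the role of a Sphinx query returning instance ids, and the dict lookup plays the role of the MongoDB key-value fetch.

```python
def search_instances(query, index):
    """Stand-in for a Sphinx search: return the ids of matching instances."""
    return [iid for iid, text in index.items() if query in text]

# Stand-in for the Sphinx full-text index: instance id -> indexed text.
sphinx_index = {1: "Stockholm conference", 2: "Santa Clara MySQL UC"}

# Stand-in for the MongoDB side: one attribute document per instance id.
attribute_store = {
    1: {"name": "Stockholm", "type": "City", "country": "Sweden"},
    2: {"name": "Santa Clara", "type": "City", "country": "USA"},
}

# Step 1: fast querying for instances (Sphinx's job).
ids = search_instances("MySQL", sphinx_index)
# Step 2: fast retrieval of the attributes of those instances (MongoDB's job).
docs = [attribute_store[i] for i in ids]
```

The point is the division of labour: the search engine only has to produce keys, and the key-value store only has to answer "here is a key, gimme all the data".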

Leaving Sphinx aside for now (what we do with Sphinx is actually really simple) and concentrating on MongoDB, where we also do pretty simple things: the requirements and the scale in the case of MongoDB are higher for us. MongoDB has so far performed well for us. We are running in an Amazon EC2 environment, and that has issues of its own (this seemed to be the case with Sphinx in particular, but they are on the case). As for Mongo, it is so commonly run in EC2 environments that I guess it has been tested more there.

I admit to liking MongoDB. It is lightweight, easy to manage and has some cool features, such as automatic, and reasonably transparent, sharding. MongoDB at first seems to have good documentation; there are a lot of examples and samples. What is lacking, though, is a good reference, and a good administration guide is also missing; the sharding support in particular is still not well documented from an admin point of view. The JavaScript orientation in MongoDB was annoying to me at first, but increasingly I find it useful.

One thing to note about MongoDB is that it needs its memory: the more the better. For good performance with MongoDB you really want to avoid disk accesses (in particular in EC2 environments). To avoid disk accesses as much as we can, we gzip the data items. We store data in JSON format (yes, we are a JSON shop, sorry XMLers out there), so compressing data fast and having a fast JSON parser is crucial. And yes, we DID think about the case of frequently accessed data: we have the option, on an instance-by-instance basis, to store data as plain JSON or compressed. Plain JSON has advantages: as MongoDB uses JavaScript, JSON is pretty much native to MongoDB, making certain operations really easy.
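As a rough illustration of the gzipped-JSON idea, using only the Python standard library (the pack/unpack helper names are hypothetical; in a real setup the resulting bytes would go into a binary field of the MongoDB document):

```python
import gzip
import json

def pack(doc):
    """Serialize a document to JSON, then gzip it before storing."""
    return gzip.compress(json.dumps(doc).encode("utf-8"))

def unpack(blob):
    """Reverse: gunzip the stored bytes and parse the JSON back."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))

doc = {"instance": 42, "attributes": {"type": "Event", "count": 7}}
blob = pack(doc)            # compressed bytes, smaller pages in memory
restored = unpack(blob)     # round-trips back to the original document
```

The trade-off is exactly the one described above: compressed items keep more of the working set in RAM, at the cost of a decompress step on read, so keeping hot items as plain JSON on a per-instance basis makes sense.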

We are always on the lookout for new technologies, and we do try many things, but the current setup is really useful and we get much better performance and scalability. And yes, we do get both: with the same number of servers, we get better performance and a much better distributed load across the machines. Now we are waiting for Amazon to fix their disk I/O and network issues.

Yours truly,
Hope to see you in Santa Clara in April at the MySQL UC!


YoungNDangerous6 said...

Can you elaborate on what you use MySQL for and what you use MongoDB for? You said you used MySQL for "data storage" and MongoDB for "instance attributes". What is the difference? Why did you choose MongoDB instead of MySQL for instance attributes?

Karlsson said...


MySQL is used as the back end of the data. Whatever data we have, whether picked up or processed ourselves, MySQL has the truth. In MySQL we store the data in structured, normalized and verified form. In MySQL we manage, load and massage the data.
In MongoDB we have a copy of what is in MySQL, but organized in such a way as to optimize searching. For example, when loading data, updating attributes is easy with MySQL (one row per attribute, using some unique key), whereas in MongoDB that is more difficult and performance-hungry, much more so.
On the other hand, if an instance has, say, 500 attributes, fetching them will require 500 roundtrips to MySQL and exactly one to MongoDB, as we store all attributes in one JSON object accessed by a single unique key per instance.
All this is simplifying things quite a bit, but basically this is what we do.
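The row-per-attribute versus one-document contrast can be sketched roughly like this (rows_to_document and the sample rows are made up for illustration; the real schema and keys differ):

```python
# MySQL side (conceptually): one row per attribute of an instance.
rows = [
    (42, "type", "Company"),
    (42, "name", "Recorded Future"),
    (42, "country", "USA"),
]

def rows_to_document(instance_id, rows):
    """Pivot the attribute rows of one instance into a single JSON-style
    document, as stored on the MongoDB side under one unique key."""
    doc = {"_id": instance_id}
    for iid, attr, value in rows:
        if iid == instance_id:
            doc[attr] = value
    return doc

doc = rows_to_document(42, rows)
# One key lookup on the document now returns every attribute at once,
# instead of one roundtrip per attribute row.
```

This is why the read side favours MongoDB: the per-attribute rows are ideal for loading and updating, but the pivoted document answers the whole-instance query in a single fetch.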

Best regards
Anders Karlsson

Anonymous said...


Just a suggestion: have Amazon look into using Fusion-io (it may fix the disk I/O problems, which may appear as network problems!)