Tuesday, January 17, 2012

MongoDB for MySQL Folks part 4 - Sharding

Welcome part four of this series of blog-posts on MongoDB, where we previously looked at:
These were introducing some basic concept when it comes to querying MongoDB and to show some simple use cases. By now you realize that MongoDB is different from MySQL, but you probably knew that already, but why would you move from MySQL to MongoDB? Well, you know the answer to that: Performance and Scalability. If MongoDB didn't provide pretty seamless sharding, scaling out over a large number of nodes, then MongoDB wouldn't be that interesting.

With MongoDB, it's performance depends on having large portions of data in RAM, and this is no different from MySQL, but it's even more true with MongoDB. But if you were running on a single machine, the amount of RAM you can use is limited, there is a limit to how much RAM you can (c)RAM into a single box. This is will limit performance of course, and is again not much different from MySQL. What is different is that MongoDB has a solution: transparent sharding (yes, I am aware of the different transparent sharding implementation of MySQL, like Scalebase, but that is a different story, that I will get into at a later stage).

Without Sharding, we would not have gone into using MongoDB here at Recorded Future. Why anyone would use MongoDB in the versions before when sharding was introduced, is beyond me.

MongoDB sharding from above

From a high-level view, this is how sharding works in MongoDB:
  • Data is distributed across one or more mongo servers automatically. A server may be either a single MongoDB server or a replica set (more on this in a later post).
  • Data is distributed in ranges in a user-defined shard-key. The shard key is, as is obvioious, a unique key. It may well be the unique _id identifier that MongoDB assigns to each document, but that is not necessary. Each such range is called a chunk and is some 64 Mb is size by default.
  • Each MongoDB server holds a number of these chunks.
  • Balancing a sharded setup involves moving chunks between the servers and this is automatic. In a perfectly balanced MongoDB shard setup, each involved server has the same number of chunks.
So far so good. Nothing incredibly complex, right, and also useful, right? Yes,, useful and workable, but you have to know what you are doing here.

There are actually three kinds of servers involved here:
  • Mongo shard servers. These are the MongoDB servers or replicasets that holds the actual sharded data. These are the same servers as for a non-sharded MongoDB setup, and there is no stopping you connecting to it just like that, accessing the sharded data in a non sharded way. Also, there is no stopping you adding databases, collections and documents to this server. The server may well hold both sharded and non-sharded data. The latter seems like an advantage, but actually is not and should probably be avoided. The number of mongo shard servers or shard replica sets determines how much data is distributed obviously, with 2 shard servers, each server will hold about 50% of the data etc.
  • Mongo "router" mongos. The mongos process is the process that the application connects to, and it is responsible for distributing a query between to the appropriate shard servers for a particular query. This is a different program than the other mongo servers, and it also has the role of performing the automatic balancing. The interface to it, from the point-of-view of the poor old application, is the same though, so an application that works with a non-sharded MongoDB should also work with a sharded one, but it connects to mongos instead of mongod. You may have as many mongos servers as you please.
  • The Mongo "config server". This is process that runs the usual mongod server, but it has a special role. Instead of storing the actual data, the mongoc manages the metadata, in other words, the config server data tells mongos in which mongo shard server an actual document is located, based on some query. It might seem like this server is a single point of failure, but it is not, you may have 3 of them, in which case they are replicated using a special mechanism.
To summarize this: The application talks to mongos. Mongos need to know where that data that the application is located, so it asks the mongoc. Once it has this data, it goes ahead and asks the mongo shard server for the actual data. And you might think that the extra roundtrip that the mongos has to do do the config server will slow things down and that the config server(s) is a potential bottleneck, but this is not so as the mongos process caches the config data. If the mongos data gets outdated (stale in MongoDB terms), the cache is refreshed.

Setting up MongoDB sharding

There are a couple of weird issues when you configure MongoDB for sharding, and some which are not so obvious. What I have described above is how the MongoDB server processes interact, and setting that up is difficult enough, although far from as difficult as it used to be.

You might be tempted to think that once the above setup is running, all data will be sharded just like that. And that would be nice and that would be the natural way for things to work, but hey, this is MongoDB, so it was close, but no cigar! No, you still have to tell the system that a particular database is sharded, and then enable sharding and determine the key to use for each and every individual collection in that database.

And again, as often is the case with mongo, some things are controlled from the commandline, some things are in configuration files and some things are stored in MongoDB itself. Also, in many cases, even though you have to set certain things in a configuration file, they end up in the database itself anyway. Why this is so you have to ask someone else than yours truly.

So, to get thing straight, this is how you set up sharding with Mongo in not so few and not so easy steps:
  • Start your mongod daemons that stores the actual data. Start them with the --shardsvr option enabled. Apparently this is no longer needed, but I think it's a good thing to put it there anyway.
  • Start your mongoc servers, you have 1 or 3 of these. These are mongod servers running with the --configsvr option.
  • Wait until the config servers are up and running. All of them! This is not well documented, but mongos will not start unless at least 1 config server is running, and the first time around, ALL config servers must be running!
  • Start your mongos servers (routers). Make sure that they actually start, if the config servers cannot be contacted, then mongos will just fail with a message in the logfile.
  • Now, we must tell the config servers where our shards are, lets say we have two of them, "datasvr" and "datasvr2", both running on port 33010. Then enter the mongo shell, connecting to the mongos host and port (in this case mongos_host and 33011 respectively). First you must be in the admin database, and then you can add the shards:
    $ mongo --host mongos_host:33011
    mongo> use admin
    mongo> db.runCommand({addShard: "datasvr1:33010"});
    mongo> db.runCommand({addShard: "datasvr2:33010"});
  • Now we can make sure that we shard the data we have, so enable sharding for the database in question, lets call the database "mydb". Enter the mongo shell and connect with one of the mongos servers, again in the admin database:
    mongo> use admin
    mongo> db.runCommand({enablesharding: "mydb"});
  • And then we have one step left: enable sharding on the collections we want to shard, again from the mongo commandline connected to mongos. Here we specify the name of the collection and the key used for sharding. In this case, we use the always present unique _id column as the shard key and the collection is called mycoll:
    mongo> use admin
    mongo> db.runCommand({shardcollection: "mydb.mycoll", key: {_id: 1}});

Does the commands above that you run from mongos look weird to you? Like the key specification key: {_id: 1}? Yes, at first it looks awkward, even to me. But after a while you get used to it and can start to appreciate the strict Java Script / JSON syntax used all over the place in MongoDB, it is actually quite powerful and easy to use, once you get used to it.

Now, we have enabled sharding for one colletion here. If you have been through all this fuzz to create a sharded setup, I guess you would want all tables in that setup to use the powerful sharding mechanism? Like, automatically? Nope, can't do. You have to enable sharding manually for each MongoDB database and collection you create.

I plan to dig deeper into mongo in a later post, to show some monitoring features, how to use Java Script, how to set up replication and some other things. But for now, this is it!


1 comment:

hingo said...

Anders, I've really appreciated your MongoDB sequel. Interesting to learn about technologies even if I don't use them myself.

As for this one: yeah, I think when someone says "transparent sharding" that might create higher expectations than what you describe here :-)