Monday, October 24, 2011

MongoDB for MySQL folks - Part 2

In this second part of a series of blogs on using a NoSQL database, in this case MongoDB, aimed at MySQL users, I will describe querying a bit. And it's probably not what you expected if you haven't tried a NoSQL database or MongoDB more specifically, before. The first blogpost on this subject is here.

At the heart of accessing MongoDB is JavaScript, more specifically it uses the Mozilla SpiderMonkey JavaScript engine. As I wrote in the previous blog on this subject, JavScript is all over the place in MongoB, largely you can look at MonngoDB as JavaScript with HUGE space for variables, and although many would look at this as gross oversimplification, it works as simple descripton of what we have here. JavaScript for querying, inserting, administration, scripting and "stored procedures" (they aren''t called this in MongoDB, but you know what I mean). And JSON (JavaScript Object Notation) for data, when inserting an object, you construct a JSON object, when you query, what you get back is a JSON object. etc etc. Internally, Mongo uses a binary representation of JSON called BSON.

So we have A LOT of JavaScript and JSON, and if you are no fan of JavaScript, then probably MongoDB isn't for you! In the MongoDB database there are collections then, which are similar to a table, but in a collection, there isn't, like in an RDBMS, a bunch of rows all with the same columns (attributes), instead there are multiple object, all using the same "primary key" (called _id) and the object itself is a JSON Object with any attributes. This is why MongoDB is "schemaless", there is no set schema for the objects (i.e. "rows") except the _id then.

The schemaless design sometimes gives the expressionh that beyond the unique access key, there is no other way to access the data and that you cannot have secondary indexes, but you sure can! But now I am getting ahead of myself.

Let's give MongoDB a shot. After installing MongoDB (which is dead simple) and starting the server, access the command-line using the "mongo" command line program. In the examples I show here, I use a sharded setup, so the prompt is "mongos", and it might be different in your case, depending on your configuration and version of MongoDB. First, lets create a database to play with;
mongos> use mytest
Hey, you ask, what is going on here? You never created a database, you just accessed it? Yes, that is the deal with MongoDB, you just access a database or a collection, and if it doesn't already exist, it is created for you. Which is the reason many MongoDB databases, even in production, has a bunch of wrongly spelled database and collection names.
Now, let's see of the database we access really exists.
mongos> show dbs
admin (empty)
config 0.1875GB
test (empty)
and as you can see, I lied to you, the mytest database isn't there at all! Well, the reason is that there is no collection there yet, so let's create one:
mongos> db.foo.insert({attr1: 1})
Again I am accessing something that doesn't exist to create it, in this case a collection called foo, where the first object has one attribute, attr1, with the value 1. Now, the mytest database should exist, as there is data in it, let's have a look:
mongos> show dbs
admin (empty)
config 0.1875GB
mytest 0.203125GB
And there you go, the mystest database is created and so is the collection foo:
mongos> show collections
foo
system.indexes
OK, there you go, a database and a collection for you. This is getting hotter! Let's insert some more data:
mongos> db.foo.insert({attr1: 2})
mongos> db.foo.insert({attr2: 3})
mongos> db.foo.insert({attr2: 4, attr3: 5})
All this is valid data, every object may have different and different number of attributes, fact is, every object can have any attribute you would want. Which doesn't necessarily make it a good idea, but it's good to have that feeling of complete freedom, right?
Some more inserts:
mongos> db.foo.insert({attr2: 4, attr4: {attr5: 6, attr6: 7}})
mongos> db.foo.insert({attr2: 4, attr7: [10, 11, 12]})
Here I first insert an object where one attribute in itself is an object, and then an object where one attribute is an array. Hey, cool!
Now, let's do some youerying. First just get att the data:
mongos> db.foo.find()
{ "_id" : ObjectId("4ea528fb35a81331e048baa9"), "attr1" : 1 }
{ "_id" : ObjectId("4ea52a4835a81331e048baaa"), "attr1" : 2 }
{ "_id" : ObjectId("4ea52a5035a81331e048baab"), "attr2" : 3 }
{ "_id" : ObjectId("4ea52a5f35a81331e048baac"), "attr2" : 4, "attr3" : 5 }
{ "_id" : ObjectId("4ea52ae635a81331e048baad"), "attr2" : 4, "attr4" : { "attr5" : 6, "attr6" : 7 } }
{ "_id" : ObjectId("4ea52afc35a81331e048baae"), "attr2" : 4, "attr7" : [ 10, 11, 12 ] }
The find() function is a built-in MongoDB function that retries data for you, and it can take 2 optional arguments, both of which is an object in itself (surprise!):
  1. A query-specification.
  2. The columes to return.
For example:
mongos> db.foo.find({ attr2: 3})
{ "_id" : ObjectId("4ea52a5035a81331e048baab"), "attr2" : 3 }

Here I am querying for all objects where c1 is 3. The attr2 attribute doesn't have an index on it, but I can query on it. To the use enforced _id key, just do this (in your case, you will se a different object id of course):
mongos> db.foo.find({ _id: ObjectId("4ea52a5035a81331e048baab")})
{ "_id" : ObjectId("4ea52a5035a81331e048baab"), "attr2" : 3 }
Before I close for now, let me show you how you can tell that the latter of the two last queries above was more effective by using explain() (you didn't expect that, did you):
mongos> db.foo.find({ attr2: 3}).explain()
{
"cursor" : "BasicCursor",
"nscanned" : 6,
"nscannedObjects" : 6,
"n" : 1,
"millis" : 0,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {

}
}
The result from an explain is, as you would expect by now, a JavaScript object. Look at the nscanned attribute here. There is no index on the attr2 attribute, so all objects are scanned, compared to this:
mongos> db.foo.find({ _id: ObjectId("4ea52a5035a81331e048baab")}).explain()
{
"cursor" : "BtreeCursor _id_",
"nscanned" : 1,
"nscannedObjects" : 1,
"n" : 1,
"millis" : 22,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
"_id" : [
[
ObjectId("4ea52a5035a81331e048baab"),
ObjectId("4ea52a5035a81331e048baab")
]
]
}
}
Here, just one object is looked at, to get the data, and the id column is used to find it (look in the indexBounds attribute). So far, querying doesn't all that difficult to what you see in an RDBMS, except all those objects and JavaScript.

That's it for now, I'll be back with some more MongoDB lessons real soon! The next post will be on more advanced querying!

/Karlsson

1 comment:

Aki said...

So if attributes are not indexed, does that make lookups slow for large data volumes (say, above 1TB)?