Thursday, May 31, 2012

An introduction to JSON

JSON is a data format that has become very popular recently. The name JSON means "JavaScript Object Notation" and it's a bit misleading, as although the format originally was used in JavaScript, it has become very popular in many more environments.

JSON is used as a format of both queries and results in database applications, as well as for data export and import. Among the most JSON centric database system, and also the most JavaScript focus one I guess, is MongoDB. In the world of NoSQL, JSON is used extensively, as JSON is really cool for representing documents and stuff like, and for a variety of other reasons too. JSON is a format that is also textual and easy to read, and the format is pretty standardized.

Drizzle has some features for working JSON, and this is a pretty good example of something that is becoming real common: REST + JSON, so a standard HTTP call using JSON for input and output. There are a whole bunch of implementations of this, so you can use curl to query Drizzle.

The best reference for JSON is probably the website, but I give a brief introduction here.

To begin with, the basic datatypes in JSON (as well as in JavaScript) are string, number, boolean and NULL. There is nothing else. Also, string is UTF-8, no Latin-1 or stuff like that to handle, either it's UTF-8 or it's not JSON, except that you may put any character in a string if it is properly escaped. A number is just that, a string of numeric characters, optionally prefixed by a - (minus) sign and optionally with a decimal point. Also in a number, you may optionally use exponential notation.

Now, in JSON, everything is one an object or an array. Every object and array may contain embedded objects and arrays, so there may be indefinite levels here. Every attribute is named and the name is specified in double quotes, followed by a colon and then the value. The attributes are separated with a coma and one or more optional spaces. If the value is a string, it is embedded in double quotes.

An array is enclosed within brackets, and an object within burly braces. So with all this theory, let's look at an example, where I represent a set of users:

{"id": 1, "fname": "Joe", "lname": "Bloggs", "groups": ["admin", "guest"]}
{"fname": "Ann", "lname": "Bloggs", "groups": ["users"], "address": {"street": "Main Street 1", "city": "San Jose"}, "id": 2}
{"id": 3, "fname": "John", "lname": "Doe", "groups": ["admin"], "address": {"city": "New York"}, "active": false}

This wasn't too bad, right? Nothing terribly complex in there. This was 3 JSON records, and this is what a typical JSON formatted file might look like, for example a dump from a MongoDB database. There is one more thing to note here: The order if the attributes is undefined!

One thing to watch out for are numbers here, and how they work in MongoDB and JavaScript. The JavaScript number datatype, which again is standardized, is not really exact when it comes to big numbers, most specifically, a JavaScript number is a 8-byte floating point, which in turn means that an integer with more than some 15 digits will not be precise. This is an issue when we use integers as identifiers when we have large amounts of data or this integer is some kind of bitmap. Now, JSON doesn't have this issue, but when transferring data from say, MySQL to MongoDB, we need to sonsider it. So how does this work with MongoDB then, where we have large integers as identifiers? Well, MongoDB uses the open source SpiderMonkey JavaScript engine from Mozilla, the one you find in products such as Firefox. To work around the issue with large integers though, MongoDB adds a datatype of it's own, and some function to manage it (this is why you see NumberLong("some long number") is MongoDB sometimes). When we have a number in MySQL that we want to import into MongoDB, and that number is an integer with 16 digits or more, then we must ensure we use the MongoDB specific datatype. A MongoDB integer and a JavaScript native integer may be operated on as equals in many cases, but with large integers, the result might not be what you expect:
db.mytable.find({id: 15617798765916278})
Might not work as expected, but this might:
db.mytable.find({id: NumberLong(15617798765916278)})

I will present the first version of my JSON (MongoDB focused) to MySQL loader real soon, so don't touch that dial!