Tuesday, April 24, 2012

More on DynamoDB - The good part!

In a previous post on DynamoDB, I told you we were in the process of migrating to DDB and from MongoDB for our largest datastore. Now, we have moved a bit further on this and we, including myself, has pretty positive view on DDB, it really is a viable MongoDB alternative if you can live with the limitations. That said, there are many limitations, but I would like to put this differently. I would say this is an opportunity to simplify and get real good performance from your base level data, and combine it with other technologies where appropriate.

I wouldn't think that any serious application that use a database could live with DynamoDB only, unless the application developers were prepared to do just about everything database related, beyond the most simple, themselves. For example, you might need a secondary index, DDB doesn't provide you with them, so what you could do is use another DDB table as an index into the main data. Which is fine, but you have to implement it yourself, no more CREATE INDEX statement, no more ensureIndex() command and no more "the index is there so the optimizer will use it" rather "I have now an index on that previously unindexed attribute, so I rewrite my code to take advantage of it".

That said, how I would like to see DDB, and this is how we use it here at Recorded Future, is as a store for low level objects, like BLOB, pieces of text, pieces of XML, collections of keywords, you name it. Then you reference that data with an id that is looked up in some kind of supporting technology, like a free-text search engine or MySQL or both.

What we are looking at doing here at Recorded Future is to use DDB for just this kind of stuff. The supporting technologies are, in our case, MongoDB (yes, MongoDB, we have data in MongoDB today that will not work well in DDB, data that has secondary indexes on it, data that has uses more features in MongoDB etc) and Sphinx. But this may change. The database we are moving from MongoDB to DDB is just so simple and straightforward as is required to make it a good for for DDB.

And despite the limited functionality, DDB has several advantages:
  • It performs well, and I can pay for only the throughput I need. Actually, pricing is one of the intriguing aspects of DDB, that you pay for throughput, basically, not for storage, number of servers, number of users or something as arcane as that.
  • It is managed by Amazon, and Amazon seems to do a good job here.
  • DDB currently lacks any kind of backup mechanism, and as DDB isn't exposed outside the managed Amazon DDB environment, there isn't much I can do about it, so I just ignore it and tell my managers that Amazon will not allow me to back up our data (yes, I am kidding now).
  • There are several reasonable well working APIs, Ruby, Java etc that is integrated in the same Amazon APIs as the other Amazon services (the REST based API that these are built on leaves a fair bit to be desired, as does the documentation).
We are not live with DDB yet, we need to figure out a way to perform backups (as we can't get backups out of DDB, we have to find a way in out application to catch data before it enters DDB) for example, and we have coding to do, but my initial reservations regarding DDB are not as strong as they used to be, but one has to know the limitations, fave the facts and work with them. But that is life in a Cloud environment anyway.

As for the DDB pricing model, should we call that Cloud-based prising?


Face the facts: There is no Big Data / Big Analytics silver bullet

We have a lot more storage space available these days, and a lot more data to work with, so Big Data and Big Analytics is getting much more mainstream now. And there are conclusions and insights you can get from that data, any data more or less, but Web data in particular brings a new dimension when combined with more traditional, domain specific data. But this data is also mostly in the shape of plain text, like your blogs, twitters, news articles and other web content. And this in turn means that to combine your organized structures sales data for 20 years with Web data, the Web data first needs to be analyzed.

Web data also brings in a new difficulty: the data is big, and it's not organized at it's core, so you can not easily aggregate or something like that to save space (and why would you want to do that?). It's not until after you have analyzed it that you know what data is interesting and what is not. And to be frank (but I am not, I'm Anders), not even then can you start to aggregate data or throw data away that isn't interesting. And in my mind, this is an mistake that has been done in all sorts of analytics, even with smaller amounts of data.

When it comes to analytics, in my mind "If you think you have all the right answers, you haven't asked all the right questions". This is an important point, analytics is a recurring activity, and the more questions you get answered, the more questions you should get. And with this in mind, how can you know what to aggregate? In particular when it comes to web content?

So, can we live with Web data not being aggregated and how do we do it? What database can support that? Oracle? MySQL? MongoDB? Vertica? And the answer is, in the same way as with analytics, you will not know when you start analyzing, and once you have started doing that, you will be even more in doubt! Which technology supports all the aspects you might need to look at? And the keyword is might!

So, how can we solve this? And my answer is: By using the right tool for the job at hand, and be prepared to combine different tools! Postgres and Oracle are great for temporal analysis, for GIS we have Oracle, MySQL and PostGIS. For handling large amounts of data with good scalability and keeping the cost down, you might want a key-value store like MongoDB or DynamoDB. To search data you might head for Sphinx or Lucene. Etc. etc.

As an example, I'd might want to look at a key-value store for my raw Web data, holding some key for easy lookup. An RDBMS for the attributes of this data. Sphinx for searching it. Sphinx and Lucene are much better tools than your average RDBMS, be it MySQL or Oracle or Whatever, and RDBMS search is different than a text search in web-data!

So the most important aspect to look at, if you ask me, is to choose technologies that can easily be combined and where different aspects of data can be served by different technologies as appropriate. And be prepared to add, remove and change technologies as you go along!


Monday, April 2, 2012

Speaking at Big Data in Stockholm

I will be spaking at the conference "Reality Check: Big Data" coming up on April 26 here in Stockholm. This is all about big data in in's different shapes, and the conference is run by Swedish Computer Society and is sponsored by IDG. Read more about it here: http://www.dfkompetens.se/konferenser/kurser/1112022/index.xml

If you are there and want to talk, just catch me there, I'll be there most of the day, and my talk is in the afternoon, at 15:15.