Thursday, May 24, 2012

On simplicity, awk and potatoes

Yes, things certainly change all the time for us IT folks: new classes of hardware, new cool pieces of software, cool gadgets. Stuff that usually gets better and better, and if not, it is at least on the move. Constantly.

2012's new potatoes. (c) God
If it isn't Ivy Bridge-based motherboards, it's the "new iPad" (that is a strange name, by the way; what is wrong with iPad 3? And it will not be "new" for long) or MySQL 5.6 or Ubuntu 11.10.

And then there are things that don't improve much over time, yet still stay around. Sometimes because they have some powerful backers, despite being pure evil (Adobe Flash, anyone?), but sometimes because the original is so good that you just cannot improve on it. Awk truly was right from the start; few improvements have been made since I first used it some 30 years ago, and it's still the same helpful tool (sure, some things have changed, but the basic concept is the same, and most of the common syntax too).


And then we have stuff that, unlike awk, isn't developing at all. No new releases, no features added, no cool beta versions, just the same plain old thing, year after year, and that's how we like it to be. For example (you MySQLers reading this already know what this is about): Swedish new potatoes. I had my first of the year today. Oh my god, this is so good! With a pinch of salt and some butter, this is better than candy!

/Karlsson

A tale of a benchmark attempt - Attempt 1

Whoa, it's been a long time since I posted here! But I have been very busy with non-MySQL-related issues lately, so that's an explanation, I guess.

This week I decided to try a few MySQL things anyway; the plan was to compare MongoDB with MySQL Cluster as a key-value store. We have some data here at Recorded Future that is currently in MongoDB and that will not fit in DynamoDB (it relies on secondary indexes, for example), and I was thinking that maybe MySQL Cluster was an alternative. Besides, it had been some time since I last tried Cluster.

At Recorded Future we run everything on Amazon EC2, but I was thinking that this benchmark should be about more than just comparing MySQL Cluster with MongoDB: I also wanted to see the difference between EC2 and some real hardware.

So, I started downloading some data to my recently constructed Linux server box. This server is a home-brew machine housed in a Chieftec 4U rackmount case. There is an Asus M5A88-V EVO mobo in it, with an 8-core AMD CPU and 16 GB of RAM, so for a box at home for experimental purposes it's quite reasonable.

What is not reasonable is how Linux folks sometimes treat hardware and user requirements. I understand that the developers of Linux, to no small extent, do this in their free time. I also understand that stuff sometimes goes wrong. But hey, Ubuntu 10.10 (which we use, despite it being old) is a pretty common Linux distro. On my mobo there is a Gigabit LAN chip, as on more or less all mobos these days. One of the most common LAN chipsets is from Realtek, either the 8111 or the 8168. Seems insignificant, right? No big deal? Just install Linux and it works; Linux may have issues with very unusual hardware, but surely not with something as common as the Realtek 8111/8168? Beeep! Wrong answer! Sure, it works, but slowly. If you look carefully, you realize that network performance is dead slow, and further investigation shows that this is due to a lot of dropped packets.

Doing an lsmod, you can see that Linux (this is Ubuntu 10.10, but this is not uncommon on other Linuxes either) has determined that it wants to use the driver for the much less common Realtek 8169 Gigabit Ethernet chip. These chips are seemingly compatible, and hey, it works, but it doesn't work well. Back to the drawing board, in this case: download the module source for the 8111/8168 from Realtek, build a new Linux module, remove and blacklist the r8169 module, and then install the r8168 module instead. Yes, I can live with this. But those who are not developers or administrators and who want to use Linux will have issues with this. Look, OSS folks, you are doing a good job, but look at your packaging and how you address users.

That said, it was back to my supposedly 16 GB Linux box, which Linux thought had only 3.2 GB available. Again, this is a Linux kernel issue. And again it's not that uncommon; it affects certain AMD mobos with a certain AMD chipset. Again, the fix from AMD for this is simple, but it does require patching the Linux kernel. I would expect stuff to work better than this and to be better tested, but on the other hand, my mobo is pretty new and Ubuntu 10.10 is pretty old, so OK. But I have far fewer hardware-related issues with my Windows machines. And before you reply that those guys are paid, I understand that, but I was hoping that the power of the Linux community and the better way of developing software that OSS represents would compensate for that. That does not seem to be the case, so I guess Linux stays largely between us geeks for a while, which might be just as well, as that is how I make my money!

Oh, and what happened to the benchmark? Well, instead of benchmarking I have been busy upgrading, downgrading, rebuilding and patching Linux, so it never happened. I do now have a server where Linux can see all 16 GB of memory and where the network seems to work OK (I have to admit it, Realtek sucks. I have been trying to find an alternative, but most PCI boards also have a Realtek chip on them).

But stay tuned: once my box is again properly VPNed to the Recorded Future network, I'll install MongoDB again, reimport the data into it, then convert and load the data into MySQL Cluster, and then I am ready for some simple testing. But this is taking WAY longer than I expected!

/Karlsson


Tuesday, April 24, 2012

More on DynamoDB - The good part!

In a previous post on DynamoDB, I told you we were in the process of migrating our largest datastore from MongoDB to DDB. We have now moved a bit further on this, and we, including myself, have a pretty positive view of DDB; it really is a viable MongoDB alternative if you can live with the limitations. That said, there are many limitations, but I would like to put it differently: I would say this is an opportunity to simplify, to get really good performance for your base-level data, and to combine it with other technologies where appropriate.

I wouldn't think that any serious application that uses a database could live with DynamoDB only, unless the application developers were prepared to do just about everything database-related, beyond the most simple things, themselves. For example, you might need a secondary index. DDB doesn't provide them, so what you could do is use another DDB table as an index into the main data. Which is fine, but you have to implement it yourself: no more CREATE INDEX statement, no more ensureIndex() command, and no more "the index is there so the optimizer will use it", but rather "I now have an index on that previously unindexed attribute, so I rewrite my code to take advantage of it".
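
To make that concrete, here is a minimal sketch of such a hand-rolled index table. This is Python using the boto3 AWS SDK purely for illustration; the table and attribute names are made up, and it assumes a unique "index" on a single attribute.

    # Sketch of a hand-maintained "secondary index" in DynamoDB: the application
    # keeps a second table in sync itself, since DDB will not do it for you.
    import boto3

    dynamodb = boto3.resource("dynamodb")
    documents = dynamodb.Table("documents")            # hypothetical main table, hash key: doc_id
    by_author = dynamodb.Table("documents_by_author")  # hypothetical index table, hash key: author

    def put_document(doc_id, author, body):
        # Write the main row...
        documents.put_item(Item={"doc_id": doc_id, "author": author, "body": body})
        # ...and keep the "index" in sync ourselves.
        by_author.put_item(Item={"author": author, "doc_id": doc_id})

    def find_by_author(author):
        # Look up the id in the index table, then fetch the real row.
        idx = by_author.get_item(Key={"author": author}).get("Item")
        if idx is None:
            return None
        return documents.get_item(Key={"doc_id": idx["doc_id"]}).get("Item")

And note that nothing keeps the two tables consistent if one of the writes fails; that, too, is now the application's problem.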

That said, the way I would like to see DDB, and this is how we use it here at Recorded Future, is as a store for low-level objects: BLOBs, pieces of text, pieces of XML, collections of keywords, you name it. Then you reference that data by an id that is looked up in some kind of supporting technology, like a free-text search engine or MySQL, or both.

What we are looking at doing here at Recorded Future is to use DDB for just this kind of stuff. The supporting technologies are, in our case, MongoDB (yes, MongoDB; we have data in MongoDB today that will not work well in DDB, data that has secondary indexes on it, data that uses more MongoDB features, etc.) and Sphinx. But this may change. The database we are moving from MongoDB to DDB is just as simple and straightforward as is required to make it a good fit for DDB.

And despite the limited functionality, DDB has several advantages:
  • It performs well, and I pay only for the throughput I need. Actually, pricing is one of the intriguing aspects of DDB: you basically pay for throughput, not for number of servers, number of users or something as arcane as that.
  • It is managed by Amazon, and Amazon seems to do a good job here.
  • DDB currently lacks any kind of backup mechanism, and as DDB isn't exposed outside the managed Amazon DDB environment, there isn't much I can do about it, so I just ignore it and tell my managers that Amazon will not allow me to back up our data (yes, I am kidding now).
  • There are several reasonably well-working APIs (Ruby, Java, etc.) that are integrated into the same Amazon APIs as the other Amazon services (the REST-based API that these are built on leaves a fair bit to be desired, as does the documentation).
We are not live with DDB yet; we need to figure out a way to perform backups, for example (as we can't get backups out of DDB, we have to find a way in our application to catch data before it enters DDB), and we have coding to do. But my initial reservations regarding DDB are not as strong as they used to be; one just has to know the limitations, face the facts and work with them. But that is life in a Cloud environment anyway.

As for the DDB pricing model, should we call that Cloud-based pricing?

/Karlsson

Face the facts: There is no Big Data / Big Analytics silver bullet

We have a lot more storage space available these days, and a lot more data to work with, so Big Data and Big Analytics are getting much more mainstream now. And there are conclusions and insights you can get from that data, more or less any data, but Web data in particular brings a new dimension when combined with more traditional, domain-specific data. But this data is also mostly in the shape of plain text, like blogs, tweets, news articles and other Web content. And this in turn means that to combine your 20 years of organized, structured sales data with Web data, the Web data first needs to be analyzed.

Web data also brings in a new difficulty: the data is big, and it's not organized at its core, so you cannot easily aggregate it or do something like that to save space (and why would you want to do that?). It's not until after you have analyzed it that you know what data is interesting and what is not. And to be frank (but I am not, I'm Anders), not even then can you start to aggregate data or throw away data that isn't interesting. And in my mind, this is a mistake that has been made in all sorts of analytics, even with smaller amounts of data.

When it comes to analytics, my take is: "If you think you have all the right answers, you haven't asked all the right questions". This is an important point: analytics is a recurring activity, and the more questions you get answered, the more new questions you should get. And with this in mind, how can you know what to aggregate? In particular when it comes to Web content?

So, can we live with Web data not being aggregated, and how do we do it? What database can support that? Oracle? MySQL? MongoDB? Vertica? And the answer is: just as with analytics, you will not know when you start analyzing, and once you have started doing that, you will be even more in doubt! Which technology supports all the aspects you might need to look at? And the keyword is might!

So, how can we solve this? My answer is: by using the right tool for the job at hand, and being prepared to combine different tools! Postgres and Oracle are great for temporal analysis; for GIS we have Oracle, MySQL and PostGIS. For handling large amounts of data with good scalability while keeping the cost down, you might want a key-value store like MongoDB or DynamoDB. To search data you might head for Sphinx or Lucene. Etc., etc.

As an example, I might want to use a key-value store for my raw Web data, holding some key for easy lookup; an RDBMS for the attributes of this data; and Sphinx for searching it. Sphinx and Lucene are much better tools for that than your average RDBMS, be it MySQL or Oracle or whatever, and an RDBMS search is a different thing than a text search over Web data!

So the most important aspect to look at, if you ask me, is to choose technologies that can easily be combined and where different aspects of data can be served by different technologies as appropriate. And be prepared to add, remove and change technologies as you go along!

/Karlsson

Monday, April 2, 2012

Speaking at Big Data in Stockholm

I will be speaking at the conference "Reality Check: Big Data" coming up on April 26 here in Stockholm. This is all about big data in its different shapes, and the conference is run by the Swedish Computer Society and is sponsored by IDG. Read more about it here: http://www.dfkompetens.se/konferenser/kurser/1112022/index.xml

If you are attending and want to talk, just catch me; I'll be there most of the day, and my talk is in the afternoon, at 15:15.

Cheers
/Karlsson

Wednesday, March 21, 2012

Amazon DynamoDB ... Is it any good?

As you might have noticed, I'm getting further away from MySQL here. This is just how things are, I guess; I do much less work with MySQL these days. The first migration was from MySQL to MongoDB, some time back. That was pretty successful, though note that we still have some data in MySQL; the bulk of the data is in MongoDB right now.

Running any database on Amazon (and we run all our databases on Amazon, either on EC2 or on the Amazon RDS service) may be costly, depending on how you utilize your resources. The recently announced Amazon DynamoDB is Amazon's NoSQL service offering, but it is not just MongoDB with a twist, far from it. If you have read what I have written about MongoDB, you know I have now and then complained about the lack of functionality, but to be honest, I have learnt to live with MongoDB and its shortcomings, and have started to like many of its JavaScript features (one thing I hate about it, and about JavaScript in general though, is the numeric datatype. It's just plain silly).

That said, we are now taking a shot at migrating again, this time to DynamoDB. In comparison with MongoDB, DynamoDB is incredibly simplistic; there are very few things you can do. To begin with, there is one "index", and one index only, on each table. This index is either a unique hash key (that is what they call it) or a combination of a hash key and a range key (a unique composite key). I'll get into the gory details later. You cannot have secondary indexes, i.e. indexes on any attribute other than the "primary key", or whatever you want to call it.
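
For illustration, creating one table of each kind might look roughly like this. This is a Python sketch using the boto3 AWS SDK, purely to show the shape of the key schemas; the table names, attributes and throughput numbers are made up.

    # Sketch: one table with only a hash key, one with a hash key plus a range key.
    import boto3

    client = boto3.client("dynamodb")

    # A table keyed on a single unique hash key.
    client.create_table(
        TableName="documents",
        AttributeDefinitions=[{"AttributeName": "doc_id", "AttributeType": "S"}],
        KeySchema=[{"AttributeName": "doc_id", "KeyType": "HASH"}],
        ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 5},
    )

    # A table keyed on a hash key plus a range key (a unique composite key).
    client.create_table(
        TableName="events",
        AttributeDefinitions=[
            {"AttributeName": "doc_id", "AttributeType": "S"},
            {"AttributeName": "published", "AttributeType": "N"},
        ],
        KeySchema=[
            {"AttributeName": "doc_id", "KeyType": "HASH"},
            {"AttributeName": "published", "KeyType": "RANGE"},
        ],
        ProvisionedThroughput={"ReadCapacityUnits": 10, "WriteCapacityUnits": 5},
    )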

You can then read data in one of three ways. Simple:
  • You read a single row by unique key access. If you have a composite key, provide both the hash key and the range key; otherwise, provide just the hash key.
  • You scan the whole table.
  • If you have a composite key, you access by the hash-key part and scan (you may filter, but in essence this is still a scan) on the range key.
There is nothing else you can do, and note that unless you are doing a full table scan, you must always provide the hash key; i.e. if you do not know the exact hash key for the row you want, you have to do a full table scan. There is just no other way.
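
In code, the three access patterns look roughly like this (again a Python/boto3 sketch for illustration, reusing the invented "events" table from above, with its doc_id hash key and published range key):

    # Sketch of the three read paths against the hypothetical "events" table.
    import boto3
    from boto3.dynamodb.conditions import Key

    events = boto3.resource("dynamodb").Table("events")

    # 1. Single-row read by the full unique key (hash + range in this case).
    row = events.get_item(Key={"doc_id": "abc123", "published": 1335222000}).get("Item")

    # 2. Full table scan; the only option if you do not know the hash key.
    everything = events.scan()["Items"]

    # 3. Access by hash key with a condition on the range key (the filtered
    #    "scan on the range key" case from the list above).
    recent = events.query(
        KeyConditionExpression=Key("doc_id").eq("abc123") & Key("published").gt(1335000000)
    )["Items"]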

The supported datatypes aren't overly exciting either: Number, String, and Sets of Number and String. The string type is UTF-8 and the Number is a signed number with 38 digits of precision. Other notable limits are a maximum of 64 KB per row, and that a single scan will only cover up to 1 MB of data. Note that there is no binary datatype (we have binary data in our MongoDB setup and use base64 encoding for it in DynamoDB).
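
In practice that means an encode-on-write / decode-on-read step in the application, roughly like this (a Python sketch; the table and attribute names are made up):

    # Sketch: storing binary data as a base64 string, since there is no binary type.
    import base64
    import boto3

    blobs = boto3.resource("dynamodb").Table("blobs")  # hypothetical table, hash key: blob_id

    def put_blob(blob_id, raw_bytes):
        encoded = base64.b64encode(raw_bytes).decode("ascii")
        blobs.put_item(Item={"blob_id": blob_id, "data": encoded})

    def get_blob(blob_id):
        item = blobs.get_item(Key={"blob_id": blob_id}).get("Item")
        return base64.b64decode(item["data"]) if item else None

Worth remembering is that base64 inflates the data by roughly a third, which eats into that 64 KB per-row limit.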

Pricing is interesting. What you pay for is throughput and storage, which is pretty different from what you may be used to. Throughput may be adjusted to what you need, and it's calculated in KB of row data per second: a table with rows of up to 1 KB in size and a requirement of 10 reads per second means you need 10 units of read capacity (there is a similar throughput number for write capacity). Read more on Amazon's DynamoDB pricing page.
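
The arithmetic, as I understand the model just described, is roughly this (a small Python illustration of the example above, not an official formula):

    import math

    def read_capacity_units(row_size_kb, reads_per_second):
        # Each read of a row costs ceil(row size in KB) capacity units,
        # so 1 KB rows read 10 times per second need 10 units.
        return math.ceil(row_size_kb) * reads_per_second

    print(read_capacity_units(1, 10))    # -> 10
    print(read_capacity_units(2.5, 10))  # -> 30; bigger rows cost more units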

All in all, DynamoDB still has to prove itself, and in some situations it might turn out expensive, in others not so. I have one major gripe with DynamoDB before I close for this time: DynamoDB is not Open Source, nor is it a product you can buy, except as a service from Amazon. You want to run DynamoDB on a machine in your own datacenter? Forget it. This annoys the h*ll out of me!

We are still testing, but so far I am reasonably happy with DynamoDB, despite the issues listed above. The lack of tools (no, there are no DynamoDB tools. At all. No backup tool, no import/export, nothing) means that a certain amount of app development is necessary to access it, even for the simplest of things. Also, there is no backup, but I am sure this will be fixed soon.

/Karlsson

Wednesday, March 7, 2012

Amazon RDS take two

I wrote in a previous blog post, a few weeks ago, that we had started to migrate our MySQL databases to Amazon RDS. We have now been running this for quite a while, and so far we have not had any issues at all. The plan was to migrate even more, but this has not happened yet, as we got into another interesting migration: MongoDB to DynamoDB!

So far we have done some benchmarking, and we are pretty happy with the results. DynamoDB has some interesting features; among the most interesting ones is how they price it, where you pay for performance and the resources used. This is not too dissimilar to what I suggested way back at a MySQL sales conference, so maybe my head was screwed on correctly after all.

As usual, Amazon will not give many details on what they are doing, and how, but as DynamoDB is related to Cassandra, we have to assume it is an LSM-tree database. Amazon also claims that Flash/SSD storage is used for DynamoDB.

The main issue I have with DynamoDB isn't performance; I think that is an area where Amazon has done its homework. But DynamoDB is really limited in terms of functionality: there are only some simple operations available, and you can have only one key (which may be composite, but still). There is a very limited number of datatypes, but this isn't that much of a problem for me per se (see this blog post); rather, the types that are available aren't really that generic (a 38-digit precision number and a UTF-8 string). I ask where the generic variable-length byte stream type is (in my mind, this is the one datatype that all databases should provide; there is hardly anything that cannot be represented as such a datatype).

Anyway, when we get further along with our DynamoDB testing, I'll let you know.

Cheers
/Karlsson