Wednesday, February 9, 2011

EC2 - The E is for Elastic

So, you are thinking about Cloud Computing? Is it a fad, along the lines of SOA, OOP, NoSQL, ORDBMS, or is it a new paradigm when it comes to infrastructure? (Not that a fad is bad; it's just that a fad, in my mind, is something that is grossly blown out of proportion. OOP is a good thing, but tell you what, OO-talibans out there: despite what you may think, OOP will not create peace in the Middle East (if it did, I'd embrace it right now).)

But all that aside, what is in the Cloud, really? From a technical standpoint, it seems simple enough: your servers run on a number of virtual machines, with virtual disks and whatnot, where you pay for resource use and share the environment with a bunch of other users. That really is not that complicated. And from a purely technical view, that is it, sort of. But there is more to it than that, because when you come to run your stuff in a cloud, you realize that things aren't that simple, and that running in a VM in a cloud really is different from just running in a VMware / Xen / Zones environment, or something like that.

This is an easy mistake to make. When I started working here at Recorded Future, where we use Amazon EC2 for everything, that is what I thought: we run Ubuntu in a virtual environment. Big deal. There are some EC2 integration tools and some GUI to administer the whole shebang, but except for that, this is no different from your server-room Linux box, only at a lower cost. And yes, I fully admit it, I was wrong.

What this is about, more than running in a virtual environment, is the side effects of an environment shared with others, and how you set up your system to cope with that. Two important things I have learnt so far in some 5 months with Recorded Future:
  • Scalability is key! It really is. And I know, everyone wants scalability, but in an environment such as EC2, with many different configuration options but still a shared environment, you must be able to scale. And scale horizontally. Not even the very largest virtual machines at Amazon have the power that demanding applications need in terms of disk I/O performance, network throughput and latency, and CPU performance.
  • It's not cheap. No, it's not, you are wrong. And that is not to say that it's bad. But it is different! If you manage that difference, you can run a very cost-efficient operation with EC2. But if you expect vertical scaling, or have a monolithic setup that runs on a single machine and has to scale, in a single instance, with your needs, then don't think this will save you any headaches or money.
    But if you build your infrastructure in such a way that it scales nicely across servers, and ensure that the performance requirements on a single server are modest and can be distributed if needed, then EC2 is for you.
So what about the Elastic aspect then? Well, elasticity in EC2 works both ways. On the one hand, the performance you get, in particular in terms of network latency, will vary over time. It just will, so get over it, accept the facts and create a system that can sustain it. On the other hand, you can add resources as you need them. Sounds great, doesn't it? Well, yes, but there is a limit to WHAT resources you can add, and how.

EC2 allows you to add disks to your system as you please. There is a GUI (which is not very good) and there are command-line programs (not particularly good either, to be frank, but I am not, I'm Anders) to do this. But they do the job reasonably well. What you can NOT buy is more disk I/O throughput, just like that. You can get more disks and stripe them for sure, but that's about it. The same goes for CPUs: you can get more of them (up to a limit), but they are only so powerful. It's not like "Gimme a gazillion Petaflops", just like that, I'm afraid.
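To make the "get more disks and stripe them" remark a bit more concrete, here is a minimal sketch of how RAID-0 style striping spreads logical blocks over several disks. It is an illustration only, not how EC2 or any particular volume manager implements it; the function names and the `blocks_per_chunk` parameter are made up for this example.

```python
from collections import Counter
from typing import Tuple


def stripe_location(logical_block: int, num_disks: int,
                    blocks_per_chunk: int = 1) -> Tuple[int, int]:
    """Map a logical block to (disk index, block index on that disk).

    Chunks of `blocks_per_chunk` consecutive blocks are dealt out to the
    disks round-robin, which is the basic idea behind RAID-0 striping.
    """
    if logical_block < 0 or num_disks < 1 or blocks_per_chunk < 1:
        raise ValueError("invalid striping parameters")
    chunk = logical_block // blocks_per_chunk        # which chunk overall
    offset_in_chunk = logical_block % blocks_per_chunk
    disk = chunk % num_disks                         # round-robin over disks
    chunk_on_disk = chunk // num_disks               # how far into that disk
    return disk, chunk_on_disk * blocks_per_chunk + offset_in_chunk


def blocks_per_disk(num_blocks: int, num_disks: int,
                    blocks_per_chunk: int = 1) -> Counter:
    """Count how many of the first `num_blocks` logical blocks hit each disk."""
    return Counter(
        stripe_location(block, num_disks, blocks_per_chunk)[0]
        for block in range(num_blocks)
    )
```

A sequential read of 8 blocks over 4 striped disks touches each disk twice, so the work is shared, but each individual disk is still no faster than it was before.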

Above all, the network is only so fast, and regrettably that is not terribly fast. Also, the way EC2 manages DNS lookups is just plain weird. I have to assume it is so for a reason, but the reason just has to be something you smoke, but not necessarily something you inhale.

So where are we now, then? EC2 has servers with limited performance, with limited disk I/O capacity and interconnected by what sometimes seems like 2400 baud modems. How can all this be useful? And by the way, at times both disk I/O and network performance are quite acceptable, but your mileage may vary. Tell you what makes this stuff rock: there are MANY servers and many disks. As many as you want. And you can move them around, mount one disk on one machine, and then on another (like in a SAN, for example), just like that.

All in all then, the perfect infrastructure for EC2 is something that scales with the number of servers. Scale I/O, scale network performance, scale user connections, whatever the bottleneck is in your system, with the number of servers, each having as much disk as is necessary.

And having said that, you may have figured out where I am going with this: as in the Letterman Show, we are playing "Will it scale?". Yes, that is a valid question, and the answer is: maybe. To make it really simple, the less state a system holds, the easier it will scale. A webserver that is just serving stuff off a disk will scale real nice, for example. Also, the less you persist, one way or the other, the better it will scale. The most common way of persisting data is of course writing it to disk, that is clear, but any kind of persisted data that is shared will put some limit on scalability.
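One way to put a number on "shared state limits scalability" is Amdahl's law, which the post itself does not mention; it is brought in here purely as an illustration. Assuming some fraction of the work has to go through shared, serialized state, the sketch below computes the best-case speedup from adding servers. The function and parameter names are invented for this example.

```python
def amdahl_speedup(serial_fraction: float, servers: int) -> float:
    """Best-case speedup on `servers` machines per Amdahl's law.

    `serial_fraction` is the share of the work (0.0 to 1.0) that must go
    through shared state and therefore cannot be spread across servers.
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if servers < 1:
        raise ValueError("servers must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / servers)
```

With no shared state at all, 10 servers give a speedup of 10. If just 10% of the work is serialized on shared state, the same 10 servers give a little over 5, and no number of servers will ever get past 10.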

A simple Web-server scales easily, as we have already said. A more complex Web application, say a PHP application, may not scale as well, but still a lot, as long as the state is limited to a single PHP session or similar. The same goes for App-servers, I guess. The one thing in most systems that is difficult to scale is the database, and the reason is of course that the database has a lot of state. A stateless database is not much of a database, to be honest.

MySQL has a means of scaling the database that has worked quite well for a number of years in Web-style applications. Scale-out is the way to go. And scale-out is simple enough: asynchronously replicated Slaves being fed from a Master. The asynchronous nature of this replication means that database writes (which all go to the Master) are not held up by replication to the Slaves. And you can have, in theory, any number of Slaves. But there is a drawback to all this: only reads scale, not writes. Master-Master may help, to an extent, but not much. Then there is the massive replication setup itself, with N Slaves all replicating from a different point in the Master. Another issue is that for the Slave to do its job properly, the operations on the Slave are serialized (if this wasn't the case, think about what Foreign Key relationships might cause you), which means the Slaves are slower than the Master when it comes to writes, which in turn means that unless you want the Slaves to fall further and further behind, you had better keep the average write load at the level that the Slave can sustain, not the Master.
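The two points above, reads spread over Slaves while writes all hit the Master, and a Slave falling behind when the write load exceeds what it can apply, can be sketched in a few lines. This is a toy model, not MySQL code: the server names are plain strings, the router only looks at whether a statement starts with SELECT, and the write rates are made-up numbers.

```python
import itertools
from typing import Optional, Sequence


class ReadWriteRouter:
    """Send reads to the Slaves round-robin and everything else to the Master."""

    def __init__(self, master: str, slaves: Sequence[str]) -> None:
        self.master = master
        # cycle() yields the slaves over and over; None means "no slaves".
        self._slaves: Optional[itertools.cycle] = (
            itertools.cycle(slaves) if slaves else None
        )

    def route(self, sql: str) -> str:
        # Simplification: anything that starts with SELECT counts as a read.
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self._slaves is not None:
            return next(self._slaves)
        return self.master


def slave_backlog(master_writes_per_sec: float,
                  slave_apply_per_sec: float,
                  seconds: int) -> float:
    """Writes a Slave still has to apply after `seconds` of sustained load."""
    backlog = 0.0
    for _ in range(seconds):
        backlog += master_writes_per_sec             # new writes arrive
        backlog -= min(backlog, slave_apply_per_sec)  # Slave applies what it can
    return backlog
```

With a Master taking 1000 writes per second and a Slave that can apply 800, the backlog grows by 200 every second, so after a minute the Slave is 12000 writes behind. Adding more Slaves to the router helps the reads, but does nothing for that backlog.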

And no, by the way, I don't think MySQL Replication is something bad and awful. But I do think I know what it is good at and what it is NOT good at. And it will not help you scale your writes. Nope.

So what would a database that could scale in an Amazon EC2 Cloud look like? Above all, it would need:
  • Flexible configuration - Adding a node should be just that, adding a node. No restart, no optimizations, no reorg of data, no downtime, just "Here is a node: Use it." And removal of a node should be the same. And the same goes for management of data in the database. No more monolithic database configurations, please!
  • Scalable performance - And once you add those nodes, they really should be able to increase the performance of things, and not just to a small extent.
  • Data distribution - Yes, I want my data distributed. Replicated to where it is used. Distributed and persisted to where it is best persisted.
  • Transparent - Yes, all this should be transparent to the application.
  • SQL based RDBMS - Yes, I know, I am an old-fashioned guy, but this is what I want. SQL because it is ubiquitous, not because I think it's the best query language on the planet. And I want an RDBMS because I firmly believe that that is best for my data. Which is not to say that some applications might not prefer some other means of storage (if you want my full view on these matters, read this post).
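The "adding a node should be just that" wish in the list above is the kind of problem consistent hashing is often used for. The sketch below is only an illustration of that general technique; it is not taken from the post and says nothing about how NimbusDB or any other product places data. The class name, the node names and the `vnodes` parameter are all assumptions made for the example.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


class HashRing:
    """A small consistent-hash ring mapping keys to nodes."""

    def __init__(self, nodes: Iterable[str] = (), vnodes: int = 100) -> None:
        self._vnodes = vnodes                    # ring points per node
        self._ring: List[Tuple[int, str]] = []   # sorted (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        # Each node gets several points on the ring to even out the load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node: str) -> None:
        self._ring = [(h, n) for h, n in self._ring if n != node]

    def node_for(self, key: str) -> str:
        if not self._ring:
            raise LookupError("no nodes in ring")
        # First ring point at or after the key's hash, wrapping at the end.
        idx = bisect.bisect_left(self._ring, (self._hash(key),))
        return self._ring[idx % len(self._ring)][1]
```

The point of the exercise: when a fourth node joins a three-node ring, only the keys that now belong to the new node move, and everything else stays where it was. Removing the node again puts those keys back.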
So, is this just a wet dream? I hope not. One technology I am looking at, and which I am eager to try later this year, is NimbusDB. This is created by Jim Starkey, and if you think that the MySQL Falcon debacle was caused by him, and that NimbusDB is therefore not something to look at seriously, think again. Having looked at a number of different technologies in the past 5 months or so, I have to say that NimbusDB is the only one where they have at least understood what problems to solve in a Cloud environment, and that is not a bad start. And tell you what? I do think that to support such a setup as NimbusDB sets out to do, you really need to start from scratch. I do not think that Oracle 13g or a MySQL Cloud Storage Engine or something like that will be around to fulfill the requirements to run and scale properly in a Cloud. But you never know, there are interesting times ahead.

/Karlsson

1 comment:

Liran Zelkha said...

Hi

I totally agree with what you wrote about scaling, and especially about MySQL scaling.
If you have issues with MySQL scaling I suggest you try out ScaleBase (disclosure: I work there) - they give you a transparent sharding solution that fits the needs you describe.