Thursday, June 2, 2011

Automatically restarting MySQL slaves the easy way.

MySQL Replication is something that is used by many, many MySQL users, and here at Recorded Future we are no exception. In our case, the slaves are used for different purposes, and as we develop our system so much and so fast, so sometimes things happen which could have been avoided. Like something getting executed in the master that really should not end up in a slave at all, and which would cause all sorts of problems on the slave.

Also, some things that work fine on the master can sometimes cause things to break on the slave, a typical such issue is a big operation on the master that when executed on the slave would cause a lock timeout. And you could argue that the lock timeout should be increased, but the question is how much, and frankly, do you really want those locks hanging around? And in some cases, it's the replication thread that has a lock timeout.

I can easily identify three different actions to take based on the errors that the MySQL slave had, and which cased the slave to stop processing, and which I have so far processed manually:
  • Errors which really is no issue at all to restart the slave on. One such issue is lock timeouts, but there are more.
  • Then there are errors which you just know you can skip a record in the binlog on. In our case, error number 1317 is one such error. This is an error when a large operation has been interupted on the master, for example a large INSERT INTO .. SELECT statement. In our case, this only happens rarely when we use large temp tables, and more or less always we really should not have the temp table on the slave at all. What this means is that, in our case, these errors can be disregarded, we can stop this binlog record and go on to restart the slave. I.e. we can do a SET SQL_SLAVE_SKIP_COUNTER=1; and then a SLAVE START; just like that.
  • And then there are errors which needs manual intervention always, we just should not attempt a restart.
This seems simple enough, right? So it should be possible to have a program that runs in the background, checks the status of the slave a regular intervals, and if it's an error we can restart, or possibly skip 1 and restart, on, do just that. Look no further then, I created a small shellscript for just that, and I have tried it a bit with our systems and it seems to work OK.

To be honest, this is as far from rocket science as you can get I think, but the script works well so here it is.

It's a bash shell-script and I run it on Ubuntu, but I don't use many fancy bash features, so it should be easy enough to understand. There is a --help option, and you can configure which errors not to attempt a restart on, which errors to skip one binlog record and restart on (any errors beside these, a restart is attempted on) and a sleep-time to set the time between checking the slave status (default is 10 minutes).

As i said, nuthin' fancy, but simple and workable. And before I finish this up, no i didn't check if someone has already published a script for this, I assume someone has, but here is my attempt at the same thing then.

Cheers
/Karlsson

4 comments:

Andrew said...

Why not maatkit's mk-slave-restart? :)

Anonymous said...

on line 220 you might want to have it issue "stop slave; set global sql_slave_skip_counter=1;slave start;". Omitting the stop slave command might make the script barf when trying to issue a skip counter in some (not all) error situations. Better to be sure it's fully stopped as sometimes only the SQL or IO thread will be stopped due to an error and they both have to be OFF for the skip counter to work.

Anonymous said...

@Andrew: sometimes people like to write their own scripts. Maatkit has some good scripts but they don't always do exactly what you need for your specific environment, or you want something more portable that doesn't require perl or etc etc etc. Might as well as why we're not using Oracle instead of MySQL.

Karlsson said...

Thanx Matt. I'll fix that.

Andrew: Yes, I assumed that maatkit has something like this, but I hope I have added something to it at least.

Cheers
/Karlsson