Also, some things that work fine on the master can sometimes cause things to break on the slave, a typical such issue is a big operation on the master that when executed on the slave would cause a lock timeout. And you could argue that the lock timeout should be increased, but the question is how much, and frankly, do you really want those locks hanging around? And in some cases, it's the replication thread that has a lock timeout.
I can easily identify three different actions to take based on the errors that the MySQL slave had, and which cased the slave to stop processing, and which I have so far processed manually:
- Errors which really is no issue at all to restart the slave on. One such issue is lock timeouts, but there are more.
- Then there are errors which you just know you can skip a record in the binlog on. In our case, error number 1317 is one such error. This is an error when a large operation has been interupted on the master, for example a large INSERT INTO .. SELECT statement. In our case, this only happens rarely when we use large temp tables, and more or less always we really should not have the temp table on the slave at all. What this means is that, in our case, these errors can be disregarded, we can stop this binlog record and go on to restart the slave. I.e. we can do a SET SQL_SLAVE_SKIP_COUNTER=1; and then a SLAVE START; just like that.
- And then there are errors which needs manual intervention always, we just should not attempt a restart.
To be honest, this is as far from rocket science as you can get I think, but the script works well so here it is.
It's a bash shell-script and I run it on Ubuntu, but I don't use many fancy bash features, so it should be easy enough to understand. There is a --help option, and you can configure which errors not to attempt a restart on, which errors to skip one binlog record and restart on (any errors beside these, a restart is attempted on) and a sleep-time to set the time between checking the slave status (default is 10 minutes).
As i said, nuthin' fancy, but simple and workable. And before I finish this up, no i didn't check if someone has already published a script for this, I assume someone has, but here is my attempt at the same thing then.
Cheers
/Karlsson
4 comments:
Why not maatkit's mk-slave-restart? :)
on line 220 you might want to have it issue "stop slave; set global sql_slave_skip_counter=1;slave start;". Omitting the stop slave command might make the script barf when trying to issue a skip counter in some (not all) error situations. Better to be sure it's fully stopped as sometimes only the SQL or IO thread will be stopped due to an error and they both have to be OFF for the skip counter to work.
@Andrew: sometimes people like to write their own scripts. Maatkit has some good scripts but they don't always do exactly what you need for your specific environment, or you want something more portable that doesn't require perl or etc etc etc. Might as well as why we're not using Oracle instead of MySQL.
Thanx Matt. I'll fix that.
Andrew: Yes, I assumed that maatkit has something like this, but I hope I have added something to it at least.
Cheers
/Karlsson
Post a Comment