Wednesday, May 25, 2011

MySQL Database cleaner 1.0 released

If you ever, and I think many of use DBAs have, been in the situation where you are stuck with data in the database that isn't used and isn't accessed, data which may consist of rows that are no longer used, data rows that aren't references, because you don't use FOREIGN KEYs or they weren't applicable in this case. Or data that was once used, but no longer is.

And in many cases, this data is tucked in among your other good rows of data :-( One way of cleaning up the database in a case like this is to run standard DELETE statements, but there are a few issues with this:
  • You may be accessing a lot of data, so this may take a while.
  • You will be locking large amount of data for this.
  • The join statement to get the data that is no longer used and / or no longer referenced is complex.
  • There is no really good way to split this DELETE in smaller chunks, except using LIMIT, but if what takes a long time to execute if the complex join to find offending rows, this is not always useful.
What is needed is a cleaner. A program that can run a simple or complex SELECT statement to find identifiers of the data you want to delete, and as this is a SELECT, we will have less issues with long running transactions and database locking. And the the IDs gotten from this SELECT is used to DELETE the rows we want to delete, based on the PRIMARY KEY to make it fast and to minimize the time we lock things.
In addition, we would prefer the DELETE to run in multiple threads and for the program to be very configurable. Well, here is the fix for you: mycleaner. MyCleaner is a program that does just this. It is not very complex, but useful.

Before I point you to the download location, I have one thing that is on the TODO list but which is not yet fixed: MyCleaner only works with integer numeric keys (INT, SMALLINT, BIGINT etc). If you are find with this and have a need to clean your data, give MyCleaner a try, it 's GPL Open Source and is ready for download from sourceforge here.

Ideas for improvement, code contributes etc are welcome. And yes, there is also documentation, not the most comprehensive of documentation, but it is there.

/Karlsson

2 comments:

Anonymous said...

From reading the PDF file included with the program download, I think this is a great program. We do the cleaning in a similar way but use shell scripts for this.

Here are some features I was looking to add to my shell scripts, but maybe could be instead incorporated into this program:
1) Modify batch size and commit-interval at run time to adjust for the load on live database system on as needed basis.
2) Support not only deletes, but also updates. At times we need to process lots of updates in a similar fashion. New values for specific columns and primary key could be loaded into a staging table and the cleaner program could do the rest of the magic.
3) Down the road make this program support multiple RDBMSs.

Anonymous said...

For feature request 1, also add a sleep option in milliseconds or microseconds in between rounds and/or commits.