And in many cases, this data is tucked in among your other good rows of data :-( One way of cleaning up the database in a case like this is to run standard DELETE statements, but there are a few issues with this:
- You may be accessing a lot of data, so this may take a while.
- You will be locking large amount of data for this.
- The join statement to get the data that is no longer used and / or no longer referenced is complex.
- There is no really good way to split this DELETE in smaller chunks, except using LIMIT, but if what takes a long time to execute if the complex join to find offending rows, this is not always useful.
In addition, we would prefer the DELETE to run in multiple threads and for the program to be very configurable. Well, here is the fix for you: mycleaner. MyCleaner is a program that does just this. It is not very complex, but useful.
Before I point you to the download location, I have one thing that is on the TODO list but which is not yet fixed: MyCleaner only works with integer numeric keys (INT, SMALLINT, BIGINT etc). If you are find with this and have a need to clean your data, give MyCleaner a try, it 's GPL Open Source and is ready for download from sourceforge here.
Ideas for improvement, code contributes etc are welcome. And yes, there is also documentation, not the most comprehensive of documentation, but it is there.
/Karlsson
2 comments:
From reading the PDF file included with the program download, I think this is a great program. We do the cleaning in a similar way but use shell scripts for this.
Here are some features I was looking to add to my shell scripts, but maybe could be instead incorporated into this program:
1) Modify batch size and commit-interval at run time to adjust for the load on live database system on as needed basis.
2) Support not only deletes, but also updates. At times we need to process lots of updates in a similar fashion. New values for specific columns and primary key could be loaded into a staging table and the cleaner program could do the rest of the magic.
3) Down the road make this program support multiple RDBMSs.
For feature request 1, also add a sleep option in milliseconds or microseconds in between rounds and/or commits.
Post a Comment