Tuesday, July 19, 2011

OR conditions considered bad... Or? And a workaround.

Some things are known to be just bad. GOTOs used to be one such thing (sometimes I still use them, but only where appropriate, which isn't that many places). Maybe it is just that some things are useful, but not for everything, so the real issue is that they are used inappropriately. Or?

The OR condition is one such thing in MySQL circles! Oh, you have an OR condition! That is going to be so slow! Sort of. And the reason an OR is "slow" is that, as MySQL will use only one index for each statement, only one "side" of the OR condition can use an index. Or sometimes even worse, MySQL will consider using an index that is common to the two "sides" or is outside the OR condition, despite the fact that there are perfectly fine, highly selective indexes on both sides of the OR condition.

If you ask me, this is not a fault with the OR condition but rather a problem with the MySQL optimizer. Why in heaven's name can't a single statement use two indexes, if that is what it takes? And let me let you in on a little secret: MySQL can use multiple indexes for one statement! But that depends on what you mean by a statement. And MySQL means something slightly different than many of us do!

Without further ado, let's have a look at an example. We work at a retail store, and a package from us has been stuck at the post office. We want to check what product this is, but we don't know the product id. What the guy who called us from the post office gave us was something that looked like a brand name, which I can map to a brand ID, plus the number of units in the package and the weight. But to be honest, the last two weren't terribly reliable. OK, let's find the product in the product table, which looks like this:
CREATE TABLE `product` (
`id` int(11) NOT NULL,
`brand_id` int(11) NOT NULL,
`quantity` int(11) NOT NULL,
`weight` int(11) NOT NULL,
PRIMARY KEY (`id`),
KEY `ix_brand` (`brand_id`),
KEY `ix_weight_brand` (`weight`,`brand_id`),
KEY `ix_quantity_brand` (`quantity`,`brand_id`)
);

I know for certain that brand_id is 6, I already looked that up. But there are millions of products in the product table! Luckily, looking for appropriate products using brand_id and either quantity or weight should be easy, right? We know now that the weight is 41 and the quantity is 78. And we have appropriate indexes, so this should not be a big deal, right?
SELECT id FROM product WHERE brand_id = 6 AND (weight = 41 OR quantity = 78)

Well, although this works, it is a bit sluggish, real slow actually. Let's look at what MySQL does with this statement:
EXPLAIN SELECT id FROM product WHERE brand_id = 6 AND (weight = 41 OR quantity = 78)
And what we get is this:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | SIMPLE | product | ref | ix_brand,ix_weight_brand,ix_quantity_brand | ix_brand | 4 | const | 4291 | Using where |
That wasn't so good. Let's try a different way:
EXPLAIN SELECT id FROM product WHERE (brand_id = 6 AND weight = 41) or (brand_id = 6 AND quantity = 78);

And that will result in the same query path. Only one index can be used, and there is one index that fits both paths, the one on brand_id, so MySQL picks that. Using FORCE INDEX will work, but still only one index will be used, and the result may well be even worse, as a FORCE INDEX on, say, the ix_weight_brand index will make the other path, on quantity, dead slow! What you would like MySQL to do, which doesn't seem so complicated, is to realize that there are two distinct paths here, each of which can easily be looked up using an index, execute them both and merge the results. But no, MySQL will not DO that! Only one index per statement, that's it. Or?
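To make the FORCE INDEX point concrete, here is a sketch of such a hinted query against the product table above. The hint only changes which single index MySQL is pushed towards; it does not make a second index available. I am not showing a plan for it here, so run it through EXPLAIN on your own data before drawing conclusions.

```sql
-- Sketch: hint the (weight, brand_id) index for the whole statement.
-- The weight = 41 branch fits this index, but quantity is not part of it,
-- so the quantity = 78 branch cannot be resolved by a lookup in this index.
SELECT id
FROM product FORCE INDEX (ix_weight_brand)
WHERE brand_id = 6 AND (weight = 41 OR quantity = 78);
```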

Well, when you understand that MySQL will only use one index per statement, consider what MySQL means by a statement here. For a SELECT it is the individual SELECT statement that is the statement, which sounds reasonable until you consider a UNION! Each and every SELECT in a UNION is considered a separate statement (in this particular case, that is; it is a bit messy, and UNIONs in MySQL are a bit of a kludge, really)! So if we rewrite the statement above as a UNION, which is easily done for many queries involving OR conditions, you get something like this:
SELECT id FROM product
WHERE brand_id = 6 AND weight = 41
UNION
SELECT id FROM product
WHERE brand_id = 6 AND quantity = 78;

What are we saying here? We are telling MySQL that these are actually two separate paths, which is what we did with the OR condition, but in this case, MySQL can use two indexes, and will nicely merge the results, so an explain looks like this:
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
| 1 | PRIMARY | product | ref | ix_brand,ix_weight_brand | ix_weight_brand | 8 | const,const | 31 | Using index |
| 2 | UNION | product | ref | ix_brand,ix_quantity_brand | ix_quantity_brand | 8 | const,const | 1 | Using index |
| NULL | UNION RESULT | <union1,2> | ALL | NULL | NULL | NULL | NULL | NULL | |
This latter query is often much faster than the alternatives, and we have tricked MySQL into using two indexes and merging the results. But for some reason, MySQL is unable to figure this one out for itself. Is this an important tip I am giving you here? Is this a neat optimization trick that I am handing out? Short-term, the answer is yes.
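For anyone who wants to reproduce a plan like the one above, this is presumably all it takes: the same UNION prefixed with EXPLAIN. Note that a plain UNION removes duplicate rows, so a product matching both the weight and the quantity condition comes back once, just as with the OR version. UNION ALL would skip the duplicate removal but could return such a product twice.

```sql
-- Sketch: ask MySQL for the plan of the UNION rewrite.
-- Each SELECT is planned on its own, so each one can pick its own index.
EXPLAIN
SELECT id FROM product
WHERE brand_id = 6 AND weight = 41
UNION
SELECT id FROM product
WHERE brand_id = 6 AND quantity = 78;
```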

But am I saying with this that you should stay clear of OR conditions? Absolutely not, no way. What I am presenting here is an awkward way of circumventing some obvious flaws in the MySQL optimizer, and this should be fixed! But what I AM saying is this: If you currently have big performance problems with MySQL SELECTs involving OR conditions, you might consider rewriting those statements as UNIONs; sometimes that helps. But do not do this with ALL your OR conditions, only where you have to and it makes sense. Let's meanwhile wait for the MySQL developers to fix this. (No, I'm not good enough at the optimizer code or most other parts of the MySQL kernel to fix this myself. I'm happy to build things around MySQL, but I do not have the time to get more involved with the kernel).

And before I leave you for now: This was tested with MySQL 5.5.7 on Linux. I have NOT checked for fixes or updates to this, but I do hope it has NOT been fixed. Why? Why do I not want it fixed?? Have I gone bonkers? Yes, I am bonkers, but that's not the issue here. The issue is that such a rather involved fix to the optimizer is NOT something I want introduced in the middle of a GA release! But I'd be really glad to have it fixed in 5.6 or whatever that release is to be called! And yes, I am aware this is not exactly in the optimizer itself, but more in the query execution, but for now I have decided to call it the optimizer anyway, as the sun is shining and the weather is nice and all that. Sometime around Christmas I might consider changing my mind.

Cheers for now


Anonymous said...

"The Index Merge method is used to retrieve rows with several range scans and to merge their results into one. The merge can produce unions, intersections, or unions-of-intersections of its underlying scans."

I see this most commonly in WordPress (I took a few rows out for brevity):

select_type: SIMPLE
table: wp_comments
type: index_merge
key: comment_author_IP,comment_author_email
Extra: Using union(comment_author_IP,comment_author_email); Using where; Using filesort

It's not always the most reliable, but it does end up using two indexes and merging the result sets from both indexes.

Anonymous said...

I guess I should've posted the example query too:

SELECT comment_date_gmt
FROM wp_comments
WHERE comment_author_IP = '' OR
comment_author_email = 'example@example.com'
ORDER BY comment_date DESC;

Both columns are indexed in the table, and the EXPLAIN shows both indexes being selected as keys.

Unknown said...

Thanx Mark, I knew there was some work in this area, but I just assumed it wasn't implemented yet. But apparently the optimizer still has to figure out a way to use this method in any case other than the simplest one. I'll make some more experiments and see what happens!


Unknown said...

OK, I admit, I hadn't read the docs carefully enough, there is an index merge method since 5.0. Shame on me. But I can't remember when I saw that one used last. I'll do some experiments today and post another blogpost on that, and then I'll read the docs.
In theory, the "one index per statement" limitation should be gone in 5.0. But as I said, I have not seen the index_merge method used often.
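A sketch of how to check this on your own server, assuming MySQL 5.1 or later, where the optimizer_switch variable exists: look at the index_merge flags, then see whether EXPLAIN reports index_merge for the OR query from the post. Whether the optimizer actually picks it will depend on your data and indexes.

```sql
-- Show the optimizer flags; look for index_merge=on and index_merge_union=on.
SELECT @@optimizer_switch;

-- If index merge is chosen, the type column reads index_merge and Extra
-- shows something like "Using union(...)"; otherwise a single index is used.
EXPLAIN SELECT id FROM product
WHERE (brand_id = 6 AND weight = 41) OR (brand_id = 6 AND quantity = 78);
```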

Unknown said...

You might need to add a column to one of your indexes to get index_merge working.