Few days ago I wrote an article about using UNION to implement loose index scan.
First I should mention double IN also works same way so you do not have to use the union. So changing query to:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | mysql> SELECT sql_no_cache name FROM people WHERE age in(18,19,20) AND zip IN (12345,12346, 12347); +----------------------------------+ | name | +----------------------------------+ | ed4481336eb9adca222fd404fa15658e | | 888ba838661aff00bbbce114a2a22423 | +----------------------------------+ 2 rows in set (0.00 sec) mysql> explain SELECT sql_no_cache name FROM people WHERE age in(18,19,20) AND zip IN (12345,12346, 12347); +----+-------------+--------+-------+---------------+------+---------+------+------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------+-------+---------------+------+---------+------+------+-------------+ | 1 | SIMPLE | people | range | age | age | 4 | NULL | 9 | Using where | +----+-------------+--------+-------+---------------+------+---------+------+------+-------------+ 1 row in set (0.00 sec) |
So as you see there are really different types of ranges in MySQL. IN range allows to optimize lookups on the second key part, while BETWEEN and other ranges do not. Using same access type in EXPLAIN makes it very confusing.
I also was wrong about bug in key length in 5.0 explain. Actually I used tinyint for age and mediumint for zip which makes 4 right answer for using full key.
Be careful however with these nested IN clauses. MySQL has to internally build all possible combinations for row retrieval which ma become very slow if IN lists are large. Take 3 IN lists 1000 values each, on appropriate 3 keyparts and you may finish your lunch before query completes even if table has just couple of rows.
Let me however show how you can profile queries to see what exactly happens during query execution – very helpful for MySQL Performance optimization:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 | mysql> flush status; Query OK, 0 rows affected (0.00 sec) mysql> SELECT sql_no_cache name FROM people WHERE age BETWEEN 18 and 20 AND zip IN (12345,12346, 12347); +----------------------------------+ | name | +----------------------------------+ | ed4481336eb9adca222fd404fa15658e | | 888ba838661aff00bbbce114a2a22423 | +----------------------------------+ 2 rows in set (0.39 sec) mysql> show status like "Handler%"; +----------------------------+-------+ | Variable_name | Value | +----------------------------+-------+ | Handler_commit | 0 | | Handler_delete | 0 | | Handler_discover | 0 | | Handler_prepare | 0 | | Handler_read_first | 0 | | Handler_read_key | 1 | | Handler_read_next | 42250 | | Handler_read_prev | 0 | | Handler_read_rnd | 0 | | Handler_read_rnd_next | 0 | | Handler_rollback | 0 | | Handler_savepoint | 0 | | Handler_savepoint_rollback | 0 | | Handler_update | 0 | | Handler_write | 14 | +----------------------------+-------+ 15 rows in set (0.00 sec) |
So you can do FLUSH STATUS to reset counters run the query (assiming your system does not do anything) and run SHOW STATUS to see how counters have changed. It was quite inconvenient you could only do it on idle box so as in MySQL 5.0 you do not have to any more. SHOW STATUS now will show per session counter increments and to get global counters SHOW GLOBAL STATUS needs to be used.
Let us look at this handler statistic – we can see Handler_read_key=1 – this means one index range scan was initiated. Handler_read_next=42250 means 42250 rows were analyzed during this scan. Basically MySQL started scanning Index with age>=18 and continue scanning as soon as it met something larger than 20.
Now let’s see what UNION can handle what IN can’t:
Lets say we want to show people in appropriate age group sorting by time when they were last online. If age is fixed this works great and it is efficient, however if we have multiple ages to deal with ether as BETWEEN range or as IN filesort appears and query becomes very slow:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | mysql> explain select * from people where age=18 order by last_online desc limit 10; +----+-------------+--------+------+---------------+------+---------+-------+-------+-------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------+------+---------------+------+---------+-------+-------+-------------+ | 1 | SIMPLE | people | ref | age | age | 1 | const | 12543 | Using where | +----+-------------+--------+------+---------------+------+---------+-------+-------+-------------+ 1 row in set (0.00 sec) mysql> explain select * from people where age in(18,19,20) order by last_online desc limit 10; +----+-------------+--------+-------+---------------+------+---------+------+-------+-----------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+--------+-------+---------------+------+---------+------+-------+-----------------------------+ | 1 | SIMPLE | people | range | age | age | 1 | NULL | 37915 | Using where; Using filesort | +----+-------------+--------+-------+---------------+------+---------+------+-------+-----------------------------+ 1 row in set (0.00 sec) |
We can however use UNION to avoid filesort of full table:
1 2 3 4 5 6 7 8 9 10 | mysql> explain (select * from people where age=18 order by last_online desc limit 10) UNION ALL (select * from people where age=19 order by last_online desc limit 10) UNION ALL (select * from people where age=20 order by last_online desc limit 10) ORDER BY last_online desc limit 10; +----+--------------+--------------+------+---------------+------+---------+-------+-------+----------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+--------------+--------------+------+---------------+------+---------+-------+-------+----------------+ | 1 | PRIMARY | people | ref | age | age | 1 | const | 12543 | Using where | | 2 | UNION | people | ref | age | age | 1 | const | 12741 | Using where | | 3 | UNION | people | ref | age | age | 1 | const | 12631 | Using where | |NULL | UNION RESULT | <union1,2,3> | ALL | NULL | NULL | NULL | NULL | NULL | Using filesort | +----+--------------+--------------+------+---------------+------+---------+-------+-------+----------------+ 4 rows in set (0.01 sec) |
In this case there is also filesort but it applied only to very small table which is result of union, so it is rather fast.
Peter, I think you are making a mistake regarding the filesort – in both cases (when using “IN (18,19,20)” and when using “UNION”) mysql makes a temporary table of about the same size on which it does a filesort. Basically, in the first version the results are filtered by age and then a temporary table is created with the *filtered* records.
Mocanu,
Nope there is a serious difference. Assuming you have query which matches 50.000 rows and you want to get and display only 10 of them – a way most search applications work. In case you’re using simply IN you will get 50.000 rows stored in temporary table and sorted by filesort – going to be quite slow.
If you’re using UNION – each of the queries in the UNION will produce only 10 rows, using index for sorting not filesort. If you have 5 queries in the UNION you will have 50 rows in the temporary table which you still need to do filesort for.
Sorting 50.000 rows vs 50 rows is the difference here, now avoiding filesort completely 🙂
Great article…One quaestion : Can I use UNION, if I don’t have certain values for a field ( like age=18, age=19 ) ?? What if I have a query like :
SELECT * from table WHERE type = 1 and value >= 150.4 and value 31 order by score desc limit 20 ???
How can I avoid “using filesort”, in this case ???
Thanx
Hi Peter,
I wonder if you might be able to expand on the limitations of using “IN” vs. “BETWEEN” in terms of a point of decreasing returns? (e.g., at what point do the number of values in “IN” become large enough that there is no performance difference between using “IN” and “BETWEEN”?)
Also, does the performance increase involved in using “IN” apply to JOIN queries as well? Is there a difference in how “IN” is treated when the query in question utilizes a JOIN?
Of course, one thing I love about this site is that often times you tell us how we can measure for ourselves. I can see if one query is faster than the other, but am still not so awesome at unwrapping things to determine why.
Thanks.
Peter,
one question on UNION vers. IN : Is it a general rule that UNION is faster then IN? Another thing I experienced is a serious speed-advantage when using UNION instead of long OR-Statements. Is this correct?
Thank You!
Brian,
You would need to test large IN vs BETWEEN this likely to be rather workload dependent.
Note very large IN sets (especially cascaded) may cause MySQL to abandon full range optimization – in this case there is it should become slower than BETWEEN.
Tobias,
UNION generally has larger overhead than IN or OR. It is faster in the cases when it allows better execution plan – avoiding full table scan with OR on different columns (before 5.0) or avoiding filesort.
Hi Peter,
I use a query like you mentioned above and i can see the performance improvement as you mention above.
However, is it also good to use such a union operation for about 1000 Queries to query for about 20 millions of total records.
I mean, about 1000 times.
(SELECT …….. LIMIT 15)
UNION ALL
(SELECT ……… LIMIT 15)
.
.
ORDER BY CREATED LIMIT 15
Or do you advice me to use another way?
Peter,
Could you give the specific version of MySQL you were using here?
Onur,
I think it is not a good idea to concatinate 1k querys. With such a query you’ll achive the limits of max_allowed_packet very rapidly (if you did not make the mistake to rise the value up to 100MB or so). The UNION workaround is only good for a REALLY small set of numbers, because as you saw, a last filesort is still remaining.
I would recommend the usage of a temporary table. After that you’ll be able to join that table with a much better index usage.
Robert
Hi Peter,
thank you for the two great posts! Just as you said in practice you won’t store the age in a database but rather the birthdate. For this, your workarounds (IN or UNION SELECT) both would not be applicable… what should you do in this case? E.g. having a database with birthdates as timestamps, no other filtering option and ORDER BY last_online?
Actually, I’ve just come across and idea and on the first sight it seems to work: if you cannot decide between an index on filtering (because of a range) and an index on sorting, is there anything wrong with using a Sub-SELECT?
E.g. SELECT … FROM table_sorting WHERE EXISTS (SELECT … FROM table_filtering AS tf WHERE tf.birthdate BETWEEN … AND … ) ORDER BY last_online DESC LIMIT 20
For me it seems to work, the main SELECT uses the index for sorting the results of the inner Sub-SELECT, which uses the index on
birthdate
for filtering. The query takes on my localhost about 0.09s for 2milion rows compared to about 6s for using the index for filtering or 2-3s for using the index for sorting.Would anybody see a weakness with this solution that I have not?