We discussed in an earlier post how to design indexes for many types of queries using a single table. Here is a real-world example of the challenges you will face when trying to optimize queries: two similar queries, but one performs a full table scan while the other uses the index we created specifically for these queries. Bug or expected behavior? Read on!

Our two similar queries
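The original post shows the two queries and their EXPLAIN output at this point; they are not reproduced here. As a rough sketch of the setup being discussed – the table name, column types and values below are assumptions, not the original schema – the situation looks something like this:

-- Hypothetical table: a few million rows, a primary key on id,
-- an index on ts, and two extra columns col1 and col2
CREATE TABLE t (
  id   BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  ts   DATETIME NOT NULL,
  col1 VARCHAR(32),
  col2 VARCHAR(32),
  PRIMARY KEY (id),
  KEY idx_ts (ts)
) ENGINE=InnoDB;

-- Q1: also needs col1 and col2, which the index on ts does not contain
EXPLAIN SELECT col1, col2 FROM t WHERE ts >= '2018-01-01 00:00:00';

-- Q2: only references ts, so the index on ts covers it
EXPLAIN SELECT ts FROM t WHERE ts >= '2018-01-01 00:00:00';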

Q1 runs a full table scan while Q2 uses the index on ts, which, by the way, is covering – see Using index in the Extra field of the EXPLAIN output. Why such different execution plans?

Let’s try to understand what happens with Q1.

This is a query with a single inequality on the ts field, and we have an index on ts. The optimizer checks whether this index is usable (see the possible_keys field), which is all very logical. Now if we look at the rows field for Q1 and Q2, we can see that the index would allow us to read only 45% of the records (1.8M out of 4.1M). Granted, this is not excellent, but it should still be much better than a full table scan, right?

If you think so, read what's next carefully, because this assumption is simply not correct!

Estimating the cost of an execution plan (simplified)

First of all, the optimizer does not know whether data and indexes are in memory or need to be read from disk; it simply assumes everything is on disk. What it does know, however, is that sequential reads are much faster than random reads.

So let's execute Q1 with the index on ts. Step 1 is to perform a range scan on this index to identify the 1.8M records that match the condition: this is a sequential read, so it is quite fast. However, step 2 is to fetch the col1 and col2 fields for each matching record. The index only provides the primary key value of each matching record, so we have to run one primary key lookup per matching record.
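To make step 2 more concrete: in InnoDB, a secondary index entry stores the primary key value of the row, and fetching the missing columns means a point lookup on the clustered index. Conceptually (this is a sketch of what the storage engine does internally, not a query the server literally runs), each matching record triggers the equivalent of:

-- One clustered-index (primary key) lookup per record matched by the range scan on idx_ts
SELECT col1, col2 FROM t WHERE id = 42;  -- repeated ~1.8M times with different id values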

Here is the issue: 1.8M primary key lookups are equivalent to 1.8M random reads, so this will take a lot of time – much more time than sequentially reading the full table (which means doing a full scan of the primary key, because we are using InnoDB here).
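If you want to see this cost difference for yourself, one way is to force the index and compare the execution time with the plan the optimizer chose. A sketch, using the hypothetical names from above:

-- Let the optimizer choose the plan (a full table scan in this case)
SELECT col1, col2 FROM t WHERE ts >= '2018-01-01 00:00:00';

-- Force the range scan on idx_ts followed by ~1.8M primary key lookups
SELECT col1, col2 FROM t FORCE INDEX (idx_ts) WHERE ts >= '2018-01-01 00:00:00';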

Contrast that with how Q2 can be executed with the index on ts. Step 1 is the same: identify the 1.8M matching records. But this time there is no step 2! That's why we call this index a 'covering index': we don't need to run point queries on the primary key to get extra fields. So in this case, using the index on ts is much more efficient than reading the full table (which, again, would mean a full scan of the primary key).

Now there is one more thing to understand: a full table scan is a sequential operation from a logical point of view, but the InnoDB pages are certainly not stored sequentially on disk. So at the disk level, a full table scan is more like multiple random reads than a single large sequential read.

However, it is still much faster than a very large number of point queries, and it's easy to understand why: when you read a 16KB page during a full table scan, all of its records will be used, while when you read a 16KB page for a random read, you might use only a single record. So in the worst case, reading 1.8M records will require 1.8M random reads, while reading the full table with its 4M records will only require about 100K random reads – the full table scan is still an order of magnitude faster.
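The arithmetic behind those numbers, assuming for the sake of illustration that roughly 40 records fit in a 16KB page:

-- Back-of-the-envelope page counts for the two strategies
SELECT 4000000 / 40 AS pages_for_full_scan,   -- ~100K pages, every record in them is used
       1800000      AS pages_for_pk_lookups;  -- up to one page read per matching record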

Optimizing our query

Now that we understand why the optimizer chose a full table scan for Q1, is there a way to make the query run faster by using an index? If we can create a covering index, we will no longer need the expensive primary key lookups, and the optimizer is then very likely to choose this index over a full table scan. Creating such a covering index is easy:
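The exact statement from the original post is not reproduced here; with the hypothetical table above, it could look like this:

-- Extend the index on ts with the columns Q1 selects, so the index covers the whole query
ALTER TABLE t ADD INDEX idx_ts_col1_col2 (ts, col1, col2);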

Some of you may object that because we have an inequality on ts, the other columns of the index cannot be used. This would be true if we had conditions on col1 or col2 in the WHERE clause, but it does not apply here, since we are only adding these extra columns to make the index covering.
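To check the result, you can run EXPLAIN on Q1 again after adding the index. If the optimizer picks the new index, you should see it in the key column and Using index in the Extra column (the exact row estimates will of course depend on your data):

EXPLAIN SELECT col1, col2 FROM t WHERE ts >= '2018-01-01 00:00:00';
-- Expected: key: idx_ts_col1_col2, Extra: Using index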

Conclusion

Understanding how indexes can be used to filter, sort, or cover is paramount for optimizing queries, even simple ones. Understanding (even approximately) how a query is executed for a given execution plan is also very useful; otherwise you will sometimes be puzzled by the decisions the optimizer makes.

Also note that, starting with MySQL 5.7, the cost model can be tuned. This can help the optimizer make better decisions: for instance, random reads are far cheaper on fast storage such as SSDs than on spinning disks.
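As a sketch of what this tuning looks like: in MySQL 5.7 the cost constants live in the mysql.server_cost and mysql.engine_cost tables, and lowering the random I/O cost (appropriate for SSDs, for example) makes index plans more attractive. The value below is purely illustrative:

-- Inspect the current cost constants (NULL means the built-in default is used)
SELECT cost_name, cost_value FROM mysql.engine_cost;

-- Lower the cost of reading a block from disk, then reload the cost model
UPDATE mysql.engine_cost SET cost_value = 0.5 WHERE cost_name = 'io_block_read_cost';
FLUSH OPTIMIZER_COSTS;  -- new sessions will pick up the change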

Comments
Roland Bouman

Excellent post, very insightful!

For completeness, it would be very instructive to show the EXPLAIN after adding the new index.