(There is an updated version of the content in this post by Percona’s Stephane Combaudon available here.)
After my previous post there were questions raised about Index Merge on Multiple Indexes vs Two Column Index efficiency. I mentioned in most cases when query can use both of the ways using multiple column index would be faster but I also went ahead to do some benchmarks today.
I’m using couple of simple tables:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | CREATE TABLE `t1000idxmerge` ( `i` int(11) NOT NULL, `j` int(11) NOT NULL, `val` char(10) NOT NULL, KEY `i` (`i`), KEY `j` (`j`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1 CREATE TABLE `t1000idx2` ( `i` int(11) NOT NULL, `j` int(11) NOT NULL, `val` char(10) NOT NULL, KEY `i` (`i`,`j`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1 |
I have populated this table with random data for i and j having both of them having 1000 of distinct values, independent on each other. I also created couple of other tables with same data just with really low cardinality with i and j having just 3 values each. The table contained about 18M rows though was small enough to fit in the systems memory.
I’ve benchmarked simple queries using where clause which covers multiple columns:
1 2 3 4 5 | Q1 SELECT sum(length(val)) FROM T WHERE i=2 AND j=1 Q2 SELECT sum(length(val)) FROM T WHERE i=2 AND j BETWEEN 1 and 2 Q3 SELECT sum(length(val)) FROM T WHERE j=2 AND i BETWEEN 1 and 2 Q4 SELECT sum(length(val)) FROM T WHERE i=2 OR j=1 Q5 SELECT sum(length(val)) FROM T WHERE j=2 AND i BETWEEN 100 and 200 |
As some of them there way too fast if run once I ran them multiple times and measured time appropriately. To remove the overhead of starting MySQL etc from equation I also measured execution of “SELECT 1” query using same script and subtracted this time from result in the table.
time for ((i=0;i<100;i+=1)); do mysql test -e “SELECT sum(length(val)) FROM t1000idxmerge WHERE i=2 AND j=1”; done > /dev/null
In the result table I compute per query results and present results in milliseconds.
Query | 1000 – 2 indexes | 1000 – 2 columns | 3 – 2 indexes | 3 – 2 columns |
Q1 | 20 | 0.2 | 6940 | 2530 |
Q2 | 25 | 0.3 | 7400 | 7500 |
Q3 | 25 | 30 | 7200 | 3830 |
Q4 | 70 | 3800 | 4700 | 4700 |
Q5 | 25 | 2980 | – | – |
Note1: Q1 will not use Index Merge technique for low cardinality table but instead pick to do single index scan. I’m not aware of the optimizer hint which would allow to force index merge as you can do with index accesses in general.
Note2 Q2/Q3 can’t use Index Merge however as it is currently implemented so they would use single index range scan. The 2 indexes however benefits to Q3 because it can only use first keypart of index (j,i) to resolve BETWEEN part of the clause which may not be very selective.
Note3 You may be surprised why 2 column index is faster for Q3 in case of low cardinality even though MySQL can’t use index well. You’re right MySQL can’t and MySQL does not – Full table scan is performed and in this case turns to be faster than scanning 1/5th of the table using index. Also Full Table Scan is preferred for Q4 in all cases but in case of high cardinality multiple index configuration.
Note4 Q5 was just run on high cardinality tables to show what difference large BETWEEN can make.
Conclusion: For benchmarked queries we can see Multiple Column index beats Index Merge in all cases when such index can be used. It is also worth to watchout a MySQL may decide not to do Index merge (either intersection or union) but instead do full table scan or access table picking only one index on the pair.
Multi column indexes may be faster in some cases, however, you also need to balance the performance of write operations, and disk consumption.
As a rule of thumb, the more indexes you have, the slower write operations are. DELETEs can be especially costly, under the right circumstances. Also, I’ve seen situations where indexes use more disk space than the actual table.
Peter,
very intersting. Please post a histogram of query time versus range of BETWEEN of queries Q2 and Q3/Q5. I did my tests on a table crafted on the specifications you posted but I’d like to see your data before I post my comments. I’m not interested in the case where cardinality is 3 since it’s too low to have a reliable index usage.
Tgabi,
I did not do the timing for histogram. You have the table information and so can do the run yourself and share results 🙂
Very well. Here’s my data on the table with cardinality 1000 and 18M records – the query is SELECT SQL_NO_CACHE sum(length(val)) FROM t1000c WHERE j=2 AND i BETWEEN 1 AND x:
x time
————–
2 48.57
3 97.82
4 147.25
5 197.1
10 443.31
15 691.06
If goes downhill from there. So, very limited range of aplicability.
On the other hand, I’ve got consistent results from 2 index table, surprisingly, much faster than your results. This is why I wanted to see your data. I’m not sure why my results are so much faster:
x time
————–
2 46.08
10 46.14
20 46.24
Anyway, the 2 column index is only useful for a limited range of values. It can only be used as performance boost in same specific cases in addition to single column indexes.
I’ve used a 8 GB RAM computer dual core CPU. Tables are less than 1 GB in size. The index cache was set at 4 GB. Mysql version 5.0.67 ( i was doing testing for bug 29857).
mysql> show index from t1000;
+——-+————+———-+————–+————-+———–+————-+———-+——–+——+————+———+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+——-+————+———-+————–+————-+———–+————-+———-+——–+——+————+———+
| t1000 | 1 | i | 1 | i | A | 1001 | NULL | NULL | | BTREE | |
| t1000 | 1 | j | 1 | j | A | 1001 | NULL | NULL | | BTREE | |
+——-+————+———-+————–+————-+———–+————-+———-+——–+——+————+———+
2 rows in set (0.00 sec)
mysql> select count(*) from t1000;
+———-+
| count(*) |
+———-+
| 18000000 |
+———-+
1 row in set (0.00 sec)
(the other table, t1000c, has same data only combined index instead of simple index).
Update: for whatever reason the multi-column index is buggered by BETWEEN. See below:
mysql> explain SELECT SQL_NO_CACHE sum(length(val)) FROM t1000c WHERE j=2 AND i BETWEEN 1 AND 2;
+—-+————-+——–+——-+—————+——+———+——+——-+————-+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+—-+————-+——–+——-+—————+——+———+——+——-+————-+
| 1 | SIMPLE | t1000c | range | i | i | 8 | NULL | 14729 | Using where |
+—-+————-+——–+——-+—————+——+———+——+——-+————-+
1 row in set (0.00 sec)
mysql> explain SELECT SQL_NO_CACHE sum(length(val)) FROM t1000c WHERE j=2 AND i IN (1,2);
+—-+————-+——–+——-+—————+——+———+——+——+————-+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+—-+————-+——–+——-+—————+——+———+——+——+————-+
| 1 | SIMPLE | t1000c | range | i | i | 8 | NULL | 33 | Using where |
+—-+————-+——–+——-+—————+——+———+——+——+————-+
1 row in set (0.00 sec)
Peter, what version of mysql are you using ?
Tgabi,
The query you’re running is exactly query which can’t use multi column index completely – BETWEEN makes it to scan only first key part this is why with it the performance is bad. If you have query as you described the index should be (j,i) not (i,j)
Now you also see two more things with your explain – first Index Merge is not really used here – it is just range scan by one or other index. Second there is a huge difference between IN and BETWEEN – IN allows scanning both keyparts while BETWEEN does not.
Yet, in your data shows as quick. Look at Q3:
Q3 25 0.3
was this on a “switched” index that doesn’t show on the description ?
Regarding BETWEEN: shouldn’t the optimized transform this ?
tgabi,
Ah… This is typo it should be 30ms not 0.3ms So pretty much it works as good as BETWEEN 1 and 2 query would 🙂
BETWEEN is never optimized to IN even though it is often possible – you can do it manually though
What about the impact of multiple column index on ORDER BY speed ?
e.g. :
SELECT * FROM a WHERE b=1 AND c=2 ORDER BY d
Here, an index on (b,c,d) will avoid MySQL using filesorting.
If three indexes are put on b, c and d, will MySQL be able to use index merge on d to avoid filesorting ?
hi all,
i have a table:
CREATE TABLE
yellow_pages
(bid
int(11) NOT NULL auto_increment,bname
varchar(100) character set utf8 NOT NULL,address
varchar(50) character set utf8 default NULL,city
varchar(30) character set utf8 default NULL,state
varchar(2) default NULL,zip
int(5) unsigned NOT NULL,country
varchar(20) default NULL,website
varchar(50) default NULL,phone
varchar(10) default NULL,fax
varchar(10) default NULL,latitude
decimal(10,6) default NULL,longitude
decimal(10,6) default NULL,contact
varchar(50) default NULL,title
varchar(50) default NULL,sic
int(10) unsigned default NULL,sic_des
varchar(100) default NULL,uid
int(10) default NULL,PRIMARY KEY (
bid
),KEY
latitude
(latitude
),KEY
longitude
(longitude
),KEY
zip
(zip
),FULLTEXT KEY
bname
(bname
),FULLTEXT KEY
sic_des
(sic_des
)) ENGINE=MyISAM DEFAULT CHARSET=latin1
and the query:
select (IFNULL(ACOS(0.796641758191*COS(RADIANS(l.latitude))*(-0.525545896347*COS(RADIANS(l.longitude)) + -0.850765250132*SIN(RADIANS(l.longitude))) + 0.604451742578*SIN(RADIANS(l.latitude))), 0.00000)*6370298.85223) as dis, l.* from yellow_pages l WHERE match(sic_des) against(‘”hotel*” ‘ in boolean mode) and ( longitude BETWEEN -122.159246013 AND -121.250753987 and latitude BETWEEN 36.827530043 AND 37.551269957 or zip in(95050,95051,95052,95053,95054,95055,95056)) order by dis LIMIT 0, 10
i use php(v5.2) to connect to mysql(v5.02) but it run sooooo long
any idea??
thanks
Hi, I am pasting my query. It takes some 1 min 22 sec to execute.
TBL_RES_USER_SKILLS has some 5 lacks data. If I remove “AND SKILL_NAME=’JAVA'” condition from the below query then it gets result in 5 seconds. I am not able to figure out the problem.Please guide me.
select count(*) from (select ID from ((SELECT ID FROM TBL_RES_USR) ORDER BY SCORE DESC) a where a.ID=(SELECT DISTINCT(USER_ID) FROM TBL_RES_SCORE_DATA WHERE COMPLETENESS>=80 and USER_ID=a.ID ) AND NOT((select COUNT(*) from TBL_USR_ORGANIZATION_INFO b where TO_DATE =(SELECT max(TO_DATE) from TBL_USR_ORGANIZATION_INFO where USER_ID=a.ID AND DEL_IND =’n’) AND USER_ID=a.ID AND DEL_IND =’n’ and ORGAN_NAME In(‘SAP’))) and a.ID=(select distinct(USER_ID) from TBL_RES_USER_SKILLS where USER_ID=a.ID and SKILL_NAME=’JAVA’ )and a.ID=(select distinct(USER_ID) from TBL_RES_SUMMARY where USER_ID = a.ID and TOTAL_EXPERIENCE<=100)and (a.ID=(select distinct(USER_ID) from TBL_RES_SUMMARY where (timediff(date_add((select curdate()),INTERVAL -’22:12′ HOUR_MINUTE),date_add(LAST_LOGIN,INTERVAL -’22:12′ HOUR_MINUTE))/24)<=365 and USER_ID=a.ID)))as Counts;
+———-+
| count(*) |
+———-+
| 4333 |
+———-+
1 row in set, 8666 warnings (1 min 22.93 sec)
@praveen:
Do an “Explain ” in both the cases and check the difference
excellent!
thanks for the post.
I am wondering what is the best practice from performance point of view to index enumerations/chars
If you have a lot of flags like: has_garden, has_pool, has_whatever. Its just an example from top of my head i am not actually writing property selling site 😉
Mysql does not have bitmap indexes as far as i know and creating index with all of them does not seem like a good idea or does it? Columns order matters so if you search for house without considering first flag you are doomed.
what is your take on that situation with millions of rows?
Thanks : )
we are working on a product.it has 93 tables,so how to optimize the performance..i am a complete fresher and i know nothing apart from the basic concepts of dbms.