In this blog post, we’ll look at ways you can use MySQL 5.7 generated columns (or virtual columns) to improve query performance.
Introduction
About two years ago I published a blog post about Generated (Virtual) Columns in MySQL 5.7. Since then, it’s been one of my favorite features in the MySQL 5.7 release. The reason is simple: with the help of virtual columns, we can create fine-grained indexes that can significantly increase query performance. I’m going to show you some tricks that can potentially fix slow reporting queries with GROUP BY and ORDER BY.
The Problem
Recently I was working with a customer who was struggling with this query:
```sql
SELECT CONCAT(verb, ' - ', replace(url,'.xml','')) AS 'API Call',
       COUNT(*) as 'No. of API Calls',
       AVG(ExecutionTime) as 'Avg. Execution Time',
       COUNT(distinct AccountId) as 'No. Of Accounts',
       COUNT(distinct ParentAccountId) as 'No. Of Parents'
FROM ApiLog
WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'
GROUP BY CONCAT(verb, ' - ', replace(url,'.xml',''))
HAVING COUNT(*) >= 1;
```
The query had been running for more than an hour and had filled all available space in the tmp directory with sort files.
The table looked like this:
```sql
CREATE TABLE `ApiLog` (
  `Id` int(11) NOT NULL AUTO_INCREMENT,
  `ts` timestamp DEFAULT CURRENT_TIMESTAMP,
  `ServerName` varchar(50) NOT NULL default '',
  `ServerIP` varchar(50) NOT NULL default '',
  `ClientIP` varchar(50) NOT NULL default '',
  `ExecutionTime` int(11) NOT NULL default 0,
  `URL` varchar(3000) COLLATE utf8mb4_unicode_ci NOT NULL,
  `Verb` varchar(16) NOT NULL,
  `AccountId` int(11) NOT NULL,
  `ParentAccountId` int(11) NOT NULL,
  `QueryString` varchar(3000) NOT NULL,
  `Request` text NOT NULL,
  `RequestHeaders` varchar(2000) NOT NULL,
  `Response` text NOT NULL,
  `ResponseHeaders` varchar(2000) NOT NULL,
  `ResponseCode` varchar(4000) NOT NULL,
  -- other fields removed for simplicity
  PRIMARY KEY (`Id`),
  KEY `index_timestamp` (`ts`)
  -- other indexes removed for simplicity
) ENGINE=InnoDB;
```
We found out the query was not using an index on the timestamp field (“ts”):
```sql
mysql> explain SELECT CONCAT(verb, ' - ', replace(url,'.xml','')) AS 'API Call',
    ->   COUNT(*) as 'No. of API Calls', avg(ExecutionTime) as 'Avg. Execution Time',
    ->   count(distinct AccountId) as 'No. Of Accounts',
    ->   count(distinct ParentAccountId) as 'No. Of Parents'
    ->   FROM ApiLog
    ->   WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'
    ->   GROUP BY CONCAT(verb, ' - ', replace(url,'.xml',''))
    ->   HAVING COUNT(*) >= 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ApiLog
   partitions: NULL
         type: ALL
possible_keys: ts
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 22255292
     filtered: 50.00
        Extra: Using where; Using filesort
1 row in set, 1 warning (0.00 sec)
```
The reason for that is simple: the number of rows matching the filter condition was too large for an index scan to be efficient (or at least the optimizer thinks that):
```sql
mysql> select count(*) from ApiLog WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59';
+----------+
| count(*) |
+----------+
|  7948800 |
+----------+
1 row in set (2.68 sec)
```
Total number of rows: 21998514. The query needs to scan 36% of the total rows (7948800 / 21998514).
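As a quick sanity check on those numbers (taken straight from the counts above), the selectivity the optimizer is weighing is simple arithmetic:

```python
# Rough selectivity check: the optimizer falls back to a full table scan when
# the range predicate matches a large fraction of the table.
matching_rows = 7948800    # rows in the three-month window (from COUNT(*))
total_rows = 21998514      # total rows in ApiLog

selectivity = matching_rows / total_rows
print(f"{selectivity:.0%}")  # ~36% -- far too much for a secondary-index range scan
```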
In this case, we have a number of approaches:
- Create a combined index on timestamp column + group by fields
- Create a covered index (including fields that are selected)
- Create an index on just GROUP BY fields
- Create an index for loose index scan
However, if we look closer at the “GROUP BY” part of the query, we quickly realize that none of those solutions will work. Here is our GROUP BY part:
```sql
GROUP BY CONCAT(verb, ' - ', replace(url,'.xml',''))
```
There are two problems here:
- It uses a calculated field, so MySQL can't simply scan an index on verb + url. It first has to concatenate the two fields and then group on the concatenated string, which means the index won't be used.
- The url column is declared as "varchar(3000) COLLATE utf8mb4_unicode_ci NOT NULL" and can't be indexed in full (even with innodb_large_prefix enabled, which is the default in MySQL 5.7), as utf8mb4 pushes the worst-case key size past the InnoDB limit. We could only create a prefix index, which won't help the GROUP BY optimization.
Here, I'm trying to add a full index on the url column with innodb_large_prefix enabled:
```sql
mysql> alter table ApiLog add key verb_url(verb, url);
ERROR 1071 (42000): Specified key was too long; max key length is 3072 bytes
```
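The 3072-byte limit explains the error: under utf8mb4, InnoDB has to reserve the worst case of 4 bytes per character for the key. A back-of-the-envelope check (assuming the full 4 bytes per character):

```python
# Worst-case key size for the proposed (verb, url) index under utf8mb4.
max_bytes_per_char = 4         # utf8mb4 characters occupy 1..4 bytes
url_chars = 3000               # varchar(3000)
verb_chars = 16                # varchar(16)
innodb_max_key_bytes = 3072    # InnoDB limit with innodb_large_prefix=ON

key_bytes = (url_chars + verb_chars) * max_bytes_per_char
print(key_bytes, key_bytes > innodb_max_key_bytes)  # 12064 True
```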
Well, changing "GROUP BY CONCAT(verb, ' - ', replace(url,'.xml',''))" to "GROUP BY verb, url" could help (assuming we could somehow shrink the field definition from varchar(3000) to something smaller, which may or may not be possible). However, it would change the results, as it would not strip the .xml extension from the url field.
The Solution
The good news is that in MySQL 5.7 we have virtual columns, so we can create a virtual column on top of "CONCAT(verb, ' - ', replace(url,'.xml',''))". The best part: we do not have to perform the GROUP BY on the full string (potentially > 3000 bytes). We can use an MD5 hash (or a longer hash, e.g., SHA-1/SHA-2) for the purposes of the GROUP BY.
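The hashing trick is easy to reason about outside of MySQL: hashlib's raw digest() produces exactly the 16 binary bytes that UNHEX(MD5(...)) yields. A minimal sketch (the verb/url values are made up for illustration):

```python
import hashlib

def verb_url_hash(verb: str, url: str) -> bytes:
    """Mimic UNHEX(MD5(CONCAT(verb, ' - ', REPLACE(url, '.xml', ''))))."""
    normalized = f"{verb} - {url.replace('.xml', '')}"
    return hashlib.md5(normalized.encode("utf-8")).digest()  # 16 raw bytes

h = verb_url_hash("GET", "/accounts/list.xml")
print(len(h))  # 16 -- fits the varbinary(16) column
```

Two rows whose normalized strings match hash to the same 16 bytes, so grouping by the hash groups exactly the same rows as grouping by the (much longer) concatenated string.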
Here is the solution:
```sql
alter table ApiLog add verb_url_hash varbinary(16)
  GENERATED ALWAYS AS (unhex(md5(CONCAT(verb, ' - ', replace(url,'.xml',''))))) VIRTUAL;
alter table ApiLog add key (verb_url_hash);
```
So what we did here is:
- Declared the virtual column with type varbinary(16)
- Defined it as CONCAT(verb, ' - ', replace(url,'.xml','')), wrapped in an MD5 hash plus an UNHEX to convert the 32-character hex digest into 16 binary bytes
- Created an index on top of the virtual column
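One caveat worth stating explicitly: grouping by an MD5 hash assumes no collisions. For ~22 million distinct strings in a 128-bit space, the birthday bound makes that assumption very safe; a rough estimate:

```python
# Approximate collision probability (birthday bound): p ~= n^2 / (2 * 2^128)
n = 22_000_000          # distinct verb/url strings, roughly the table size
p = n * n / (2 * 2**128)
print(p)                # on the order of 7e-25 -- negligible
```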
Now we can change the query to GROUP BY the verb_url_hash column:
```sql
mysql> explain SELECT CONCAT(verb, ' - ', replace(url,'.xml','')) AS 'API Call',
    ->   COUNT(*) as 'No. of API Calls', avg(ExecutionTime) as 'Avg. Execution Time',
    ->   count(distinct AccountId) as 'No. Of Accounts',
    ->   count(distinct ParentAccountId) as 'No. Of Parents'
    ->   FROM ApiLog
    ->   WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'
    ->   GROUP BY verb_url_hash
    ->   HAVING COUNT(*) >= 1;
ERROR 1055 (42000): Expression #1 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'ApiLog.ApiLog.Verb' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by
```
MySQL 5.7 enables the ONLY_FULL_GROUP_BY sql_mode by default; we can disable it for this session only.
Now the explain plan looks much better:
```sql
mysql> select @@sql_mode;
+-------------------------------------------------------------------------------------------------------------------------------------------+
| @@sql_mode                                                                                                                                |
+-------------------------------------------------------------------------------------------------------------------------------------------+
| ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION |
+-------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> set sql_mode='STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION';
Query OK, 0 rows affected (0.00 sec)

mysql> explain SELECT CONCAT(verb, ' - ', replace(url,'.xml','')) AS 'API Call',
    ->   COUNT(*) as 'No. of API Calls', avg(ExecutionTime) as 'Avg. Execution Time',
    ->   count(distinct AccountId) as 'No. Of Accounts',
    ->   count(distinct ParentAccountId) as 'No. Of Parents'
    ->   FROM ApiLog
    ->   WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'
    ->   GROUP BY verb_url_hash
    ->   HAVING COUNT(*) >= 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ApiLog
   partitions: NULL
         type: index
possible_keys: ts,verb_url_hash
          key: verb_url_hash
      key_len: 19
          ref: NULL
         rows: 22008891
     filtered: 50.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)
```
MySQL now avoids the filesort, which is much faster, although it still has to scan the whole table in index order. The response time is significantly better: ~38 seconds instead of more than an hour.
Covered Index
Now we can attempt to do a covered index, which will be quite large:
```sql
mysql> alter table ApiLog add key covered_index
    ->   (`verb_url_hash`,`ts`,`ExecutionTime`,`AccountId`,`ParentAccountId`,`Verb`,`URL`);
Query OK, 0 rows affected (1 min 29.71 sec)
Records: 0  Duplicates: 0  Warnings: 0
```
We had to add verb and url to the index; to make them fit, I first removed the COLLATE utf8mb4_unicode_ci from the url column definition. Now EXPLAIN shows that we're using the index:
```sql
mysql> explain SELECT CONCAT(verb, ' - ', replace(url,'.xml','')) AS 'API Call',
    ->   COUNT(*) as 'No. of API Calls', AVG(ExecutionTime) as 'Avg. Execution Time',
    ->   COUNT(distinct AccountId) as 'No. Of Accounts',
    ->   COUNT(distinct ParentAccountId) as 'No. Of Parents'
    ->   FROM ApiLog
    ->   WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'
    ->   GROUP BY verb_url_hash
    ->   HAVING COUNT(*) >= 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ApiLog
   partitions: NULL
         type: index
possible_keys: ts,verb_url_hash,covered_index
          key: covered_index
      key_len: 3057
          ref: NULL
         rows: 22382136
     filtered: 50.00
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)
```
The response time dropped to ~12 seconds! However, the index size is significantly larger compared to just verb_url_hash (16 bytes per record).
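To see why the covering index is so much bigger, scale the per-record key sizes that EXPLAIN reports (key_len 19 vs 3057) over the row count. These are rough figures that ignore page and pointer overhead:

```python
# Back-of-the-envelope index size comparison using key_len from EXPLAIN.
rows = 22_000_000
hash_key_len = 19       # verb_url_hash index (16-byte hash plus overhead)
covered_key_len = 3057  # covered_index key_len

hash_gb = rows * hash_key_len / 1024**3
covered_gb = rows * covered_key_len / 1024**3
print(f"{hash_gb:.1f} GB vs {covered_gb:.1f} GB")  # 0.4 GB vs 62.6 GB
```

So the ~3x speedup of the covering index comes at a cost of more than two orders of magnitude in index size, which is the trade-off to weigh here.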
Conclusion
MySQL 5.7 generated columns provide a valuable way to improve query performance. If you have an interesting case, please share in the comments.