About 2 weeks ago Oracle published the MySQL 5.7.7-labs-json version which includes a very interesting feature called “Generated columns” (also know as Virtual or Computed columns). MariaDB has a similar feature as well: Virtual (Computed) Columns.
The idea is very simple: if we store a column
1 | `FlightDate` date |
in our table we may want to filter or group by year(FlightDate), month(FlightDate) or even dayofweek(FlightDate). The “brute-force” approach: use the above Date and Time MySQL functions in the query; however it will prevent MySQL from using an index (see below). Generated columns will allow you to declare a “Virtual”, non-stored column which is computed based on the existing field; you can then add index on that virtual column, so the query will use that index.
Here is the original example:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 | CREATE TABLE `ontime` ( `id` int(11) NOT NULL AUTO_INCREMENT, `FlightDate` date DEFAULT NULL, `Carrier` char(2) DEFAULT NULL, `OriginAirportID` int(11) DEFAULT NULL, `OriginCityName` varchar(100) DEFAULT NULL, `OriginState` char(2) DEFAULT NULL, `DestAirportID` int(11) DEFAULT NULL, `DestCityName` varchar(100) DEFAULT NULL, `DestState` char(2) DEFAULT NULL, `DepDelayMinutes` int(11) DEFAULT NULL, `ArrDelayMinutes` int(11) DEFAULT NULL, `Cancelled` tinyint(4) DEFAULT NULL, `CancellationCode` char(1) DEFAULT NULL, `Diverted` tinyint(4) DEFAULT NULL, PRIMARY KEY (`id`), KEY `FlightDate` (`FlightDate`) ) ENGINE=InnoDB |
Now I want to find all flights on Sundays (in 2013) and group by airline.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 | mysql> EXPLAIN SELECT carrier, count(*) FROM ontime_sm WHERE dayofweek(FlightDate) = 7 group by carrier *************************** 1. row *************************** id: 1 select_type: SIMPLE table: ontime_sm type: ALL possible_keys: NULL key: NULL key_len: NULL ref: NULL rows: 151253427 Extra: Using where; Using temporary; Using filesort Results: 32 rows in set (1 min 57.93 sec) |
The problem here is: MySQL will not be able to use index when you use a function which will “extract” something from the column. The standard approach is to “materialize” the column:
1 | ALTER TABLE ontime_sm ADD Flight_dayofweek tinyint NOT NULL; |
Then we will need to load data into that by running “UPDATE ontime_sm SET Flight_dayofweek = dayofweek(flight_date)”. After that we will also need to change the application to support that additional column or use a trigger to update the column. Here is the trigger example:
1 2 3 4 5 6 | CREATE DEFINER = CURRENT_USER TRIGGER ontime_insert BEFORE INSERT ON ontime_sm_triggers FOR EACH ROW SET NEW.Flight_dayofweek = dayofweek(NEW.FlightDate); |
One problem with the trigger is that it is slow. In my simple example it took almost 2x slower to “copy” the table using “insert into ontime_sm_copy select * from ontime_sm” when the trigger was on.
The Generated Columns from MySQL 5.7.7-labs-json version (only this version supports it on the time of writing) solves this problem. Here is the example which demonstrate its use:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 | CREATE TABLE `ontime_sm_virtual` ( `id` int(11) NOT NULL AUTO_INCREMENT, `FlightDate` date DEFAULT NULL, `Carrier` char(2) DEFAULT NULL, `OriginAirportID` int(11) DEFAULT NULL, `OriginCityName` varchar(100) DEFAULT NULL, `OriginState` char(2) DEFAULT NULL, `DestAirportID` int(11) DEFAULT NULL, `DestCityName` varchar(100) DEFAULT NULL, `DestState` char(2) DEFAULT NULL, `DepDelayMinutes` int(11) DEFAULT NULL, `ArrDelayMinutes` int(11) DEFAULT NULL, `Cancelled` tinyint(4) DEFAULT NULL, `CancellationCode` char(1) DEFAULT NULL, `Diverted` tinyint(4) DEFAULT NULL, `CRSElapsedTime` int(11) DEFAULT NULL, `ActualElapsedTime` int(11) DEFAULT NULL, `AirTime` int(11) DEFAULT NULL, `Flights` int(11) DEFAULT NULL, `Distance` int(11) DEFAULT NULL, `Flight_dayofweek` tinyint(4) GENERATED ALWAYS AS (dayofweek(FlightDate)) VIRTUAL, PRIMARY KEY (`id`), KEY `Flight_dayofweek` (`Flight_dayofweek`), ) ENGINE=InnoDB |
Here we add Flight_dayofweek
tinyint(4) GENERATED ALWAYS AS (dayofweek(FlightDate)) VIRTUAL column and index it.
Now MySQL can use this index:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | mysql> EXPLAIN SELECT carrier, count(*) FROM ontime_sm_virtual WHERE Flight_dayofweek = 7 group by carrier *************************** 1. row *************************** id: 1 select_type: SIMPLE table: ontime_sm_virtual partitions: NULL type: ref possible_keys: Flight_dayofweek key: Flight_dayofweek key_len: 2 ref: const rows: 165409 filtered: 100.00 Extra: Using where; Using temporary; Using filesort |
To further increase performance of this query we want to add a combined index on (Flight_dayofweek, carrier) so MySQL will avoid creating temporary table. However it is not currently supported:
1 2 3 | mysql> alter table ontime_sm_virtual add key comb(Flight_dayofweek, carrier); ERROR 3105 (HY000): 'Virtual generated column combines with other columns to be indexed together' is not supported for generated columns. |
We can add an index on 2 generated columns thou, which is good. So a trick here will be to create a “dummy” virtual column on “carrier” and index 2 of those columns:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | mysql> alter table ontime_sm_virtual add Carrier_virtual char(2) GENERATED ALWAYS AS (Carrier) VIRTUAL; Query OK, 0 rows affected (0.43 sec) Records: 0 Duplicates: 0 Warnings: 0 mysql> alter table ontime_sm_virtual add key comb(Flight_dayofweek, Carrier_virtual); Query OK, 999999 rows affected (36.79 sec) Records: 999999 Duplicates: 0 Warnings: 0 mysql> EXPLAIN SELECT Carrier_virtual, count(*) FROM ontime_sm_virtual WHERE Flight_dayofweek = 7 group by Carrier_virtual *************************** 1. row *************************** id: 1 select_type: SIMPLE table: ontime_sm_virtual partitions: NULL type: ref possible_keys: Flight_dayofweek,comb key: comb key_len: 2 ref: const rows: 141223 filtered: 100.00 Extra: Using where; Using index |
Now MySQL will use an index and completely avoid the filesort.
The last, but not the least: loading data to the table with generated columns is significantly faster compared to loading it into the same table with triggers:
1 2 3 4 5 6 7 | mysql> insert into ontime_sm_triggers (id, YearD, FlightDate, Carrier, OriginAirportID, OriginCityName, OriginState, DestAirportID, DestCityName, DestState, DepDelayMinutes, ArrDelayMinutes, Cancelled, CancellationCode,Diverted, CRSElapsedTime, ActualElapsedTime, AirTime, Flights, Distance) select * from ontime_sm; Query OK, 999999 rows affected (27.86 sec) Records: 999999 Duplicates: 0 Warnings: 0 mysql> insert into ontime_sm_virtual (id, YearD, FlightDate, Carrier, OriginAirportID, OriginCityName, OriginState, DestAirportID, DestCityName, DestState, DepDelayMinutes, ArrDelayMinutes, Cancelled, CancellationCode,Diverted, CRSElapsedTime, ActualElapsedTime, AirTime, Flights, Distance) select * from ontime_sm; Query OK, 999999 rows affected (16.29 sec) Records: 999999 Duplicates: 0 Warnings: 0 |
Now the big disappointment: all operations with generated columns are not online right now.
1 2 3 4 5 | mysql> alter table ontime_sm_virtual add Flight_year year GENERATED ALWAYS AS (year(FlightDate)) VIRTUAL, add key (Flight_year), lock=NONE; ERROR 1846 (0A000): LOCK=NONE is not supported. Reason: '%s' is not supported for generated columns.. Try LOCK=SHARED. mysql> alter table ontime_sm_virtual add key (Flight_year), lock=NONE; ERROR 1846 (0A000): LOCK=NONE is not supported. Reason: '%s' is not supported for generated columns.. Try LOCK=SHARED. |
I hope it will be fixed in the future releases.
Conclusion
Generated columns feature is very useful. Imagine an ability to add a column + index for any “logical” piece of data without actually duplicating the data. And this can be any function: date/time/calendar, text (extract(), reverse(), metaphone()) or anything else. I hope this feature will be available in MySQL 5.7 GA. Finally, I wish adding a generated column and index can be online (it is not right now).
More information:
- Original blog post (older implementation) by MySQL server team about Generated Columns in MySQL 5.7.5
- New implementation of the virtual column: WL#8114: Don’t store virtual generated columns in database
Alex, thanks for the nice write up!
I would like to add that one big difference for this implementation (comparing to MariaDB) is the virtual columns are really “not materialized”, so they are not present in the table data (cluster index). So when user add/drop such virtual column, it is essentially a few system table updates (no need to rebuild table) and really fast. In such, there is no need to make this operation(add or drop virtual column) online. Since even for any current “online” DDL operations, the metadata update phase are not online (locked by metadata lock).
However, it is true that adding a key on it is not online in the LAB release. Such limitation is already lifted in our current implementation.
I’m glad time has been taken to do the implementation of the “right” way with virtual columns being really virtual and not stored anywhere unless one indexes them. This is a great feature which is long overdue in MySQL and it is great MariaDB has an implementation which works for some users for a while.
I also would point out if adding the column just modifies couple of rows in the index table I would run it with ONLINE modifier as it is online from user point of view, not give the error
Number of Ops people add ONLINE to all their DDL to make sure there is never blocking DDL run in production and causing downtime, so refusing to run it with this modifier can cause confusion.
Your example contradicts the documentation provided by the MySQL Server Team regarding generated columns:
http://mysqlserverteam.com/generated-columns-in-mysql-5-7-5/
“… but only a stored Generated Column can be be a part of an index.”
I guess this was something recent fixed/improved?
Peter:
> I also would point out if adding the column just modifies couple of rows
> in the index table I would run it with ONLINE modifier as it is online from
> user point of view, not give the error
> Number of Ops people add ONLINE to all their DDL to make sure there
> is never blocking DDL run in production and causing downtime, so refusing
> to run it with this modifier can cause confusion.
This is a good point! I think this makes sense. I will propose such change (allowing “online” modifier for adding/dropping virtual columns), and if strong reason against it, we will just do it.
Matthew:
> I guess this was something recent fixed/improved?
This feature is only part of “MySQL 5.7.7-labs-json version” release, which means it is not yet into official 5.7 releases. Once it goes in, the docuemtation will change accordingly.
Jimmy, thank you!
Is there any chance that mixing indexes on stored columns and generated columns will be allowed? So we do not have to create a “dummy” generated column to create a combined index?
Ah, Jimmy answered my question, “How is this implemented differently than in Maria?”
I would be curious to try, though, given that difference, if operations are faster in the Maria version or the MySQL version.
Nice writeup, looking forward to 5.7.
This is probably going to be one of my top 3 favourite features of MySQL 5.7. Thanks for sharing some details on this, Alexander!
Mixing indexes on stored and virtual is support in Percona Server 5.7.10. Not necessary to do the “dummy” column any more.
Excellent! Exactly what I was seeking.
Using MySQL Workbench to modify column with generated expressions, how would one modify a MySQL table to include a generated calculated column?
https://i.imgur.com/u9M9X2b.png
Am using MySQL Workbench v8.0 does this not accomodate the solution you shared above,
“The Generated Columns from MySQL 5.7.7-labs-json version (only this version supports it on the time of writing) solves this problem.”
https://i.imgur.com/NuajW3X.png