Comments on: Tuning InnoDB Primary Keys https://www.percona.com/blog/tuning-innodb-primary-keys/ Thu, 11 Apr 2024 22:28:50 +0000
By: Yves Trudeau https://www.percona.com/blog/tuning-innodb-primary-keys/#comment-10970826 Fri, 24 May 2019 14:22:50 +0000 https://www.percona.com/blog/?p=51380#comment-10970826 @H, yes, it would make perfect sense to prefix the primary key with doc_account. You need an additional index on the auto_increment doc_id column, though. The secondary index is not bad at all: remember that InnoDB appends the primary key columns to every secondary index. Naively, this gives (doc_account, doc_date, doc_account, doc_id), but InnoDB is wise enough to avoid repeating columns, so you really have (doc_account, doc_date, doc_id) for the secondary key.
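Assembling the advice in this reply with the schema from the question below, the result would look roughly like the following sketch (the ENGINE clause is an assumption; the extra index on doc_id is there because InnoDB requires an AUTO_INCREMENT column to be the first column of some index):

```sql
CREATE TABLE documents (
  doc_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  doc_account MEDIUMINT UNSIGNED NOT NULL,
  doc_title VARCHAR(128) NOT NULL,
  doc_date DATE NOT NULL,
  PRIMARY KEY (doc_account, doc_id),  -- tenant-prefixed clustered key
  INDEX (doc_id),                     -- satisfies the AUTO_INCREMENT requirement
  INDEX (doc_account, doc_date)       -- stored as (doc_account, doc_date, doc_id)
) ENGINE=InnoDB;
```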

By: H https://www.percona.com/blog/tuning-innodb-primary-keys/#comment-10970821 Thu, 23 May 2019 11:53:43 +0000 https://www.percona.com/blog/?p=51380#comment-10970821 Yves, thanks for this very helpful article! I have a question regarding the concepts you explained: when storing e.g. document data in an InnoDB table in a multi-tenant database used for multiple customer accounts, would it make sense / be wise to prefix the primary key with the account ID? (The access pattern would be to retrieve e.g. the newest documents for a specific account.)

So instead of

CREATE TABLE documents (
  doc_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  doc_account MEDIUMINT UNSIGNED NOT NULL,
  doc_title VARCHAR(128) NOT NULL,
  doc_date DATE NOT NULL,

  PRIMARY KEY (doc_id),
  INDEX (doc_account, doc_date)
)

it would be better to do this

CREATE TABLE documents (
  doc_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  doc_account MEDIUMINT UNSIGNED NOT NULL,
  doc_title VARCHAR(128) NOT NULL,
  doc_date DATE NOT NULL,

  PRIMARY KEY (doc_account, doc_id),
  INDEX (doc_account, doc_date)
)

? Would I leave the secondary index just as it is? Any help is highly appreciated!

By: Yves Trudeau https://www.percona.com/blog/tuning-innodb-primary-keys/#comment-10969890 Thu, 01 Nov 2018 13:30:47 +0000 https://www.percona.com/blog/?p=51380#comment-10969890 @John

If you search mainly by timestamp, it would be logical to define the primary key as (timestamp, id), with a secondary key on (id) to support the auto_increment column. Partitioning by range of timestamp could still be useful to prune older data.
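A sketch of that layout, with hypothetical table and column names (RANGE partitioning on the timestamp is legal here because MySQL requires every unique key, including the primary key, to contain the partitioning column, and ts leads the PK):

```sql
CREATE TABLE events (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  ts TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  payload VARCHAR(255) NOT NULL,
  PRIMARY KEY (ts, id),  -- searches are mainly by timestamp
  KEY (id)               -- non-unique index supporting the AUTO_INCREMENT column
) ENGINE=InnoDB
PARTITION BY RANGE (UNIX_TIMESTAMP(ts)) (
  PARTITION p2018 VALUES LESS THAN (UNIX_TIMESTAMP('2019-01-01 00:00:00')),
  PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- Pruning older data then becomes a cheap metadata operation:
-- ALTER TABLE events DROP PARTITION p2018;
```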

By: John Lee https://www.percona.com/blog/tuning-innodb-primary-keys/#comment-10969886 Thu, 01 Nov 2018 00:24:12 +0000 https://www.percona.com/blog/?p=51380#comment-10969886 Very informative. What is the relationship between the primary key and the partition key, and what is their impact on performance? For example, I’d like to use ‘id’ as the primary key because the values are unique. id is also generated incrementally, so storing rows together generally groups items that are relatively close in incoming timestamp. However, searches are usually done by timestamp, since I do not know the id in advance. I am thinking of a primary key such as (id, timestamp). Since the tables are usually well over 100G, I would also like to partition, but on timestamp.

By: SCG82 https://www.percona.com/blog/tuning-innodb-primary-keys/#comment-10969605 Sat, 25 Aug 2018 20:05:36 +0000 https://www.percona.com/blog/?p=51380#comment-10969605 Because 93% of queries are by user_id, your schema would perform very poorly for those queries.

By: Yves Trudeau https://www.percona.com/blog/tuning-innodb-primary-keys/#comment-10969506 Wed, 01 Aug 2018 13:42:20 +0000 https://www.percona.com/blog/?p=51380#comment-10969506 Only because removing it means modifying the application: some ORMs are stubborn in their insistence on having an “id” column. If the application can be modified to use the new primary key, then yes, the id column and the idx_id index can be removed.
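For what it’s worth, if the application can be changed, both can go in a single statement; a sketch against the friends table discussed here:

```sql
-- The table already has PRIMARY KEY (user_id, friend_user_id), so the
-- surrogate id column and its idx_id index are redundant.
ALTER TABLE friends
  DROP KEY idx_id,
  DROP COLUMN id;
```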

By: Sids https://www.percona.com/blog/tuning-innodb-primary-keys/#comment-10969505 Wed, 01 Aug 2018 13:28:06 +0000 https://www.percona.com/blog/?p=51380#comment-10969505 Thanks for such an informative post.

Just a random question: is there any particular reason you’ve kept the column (id int(10) unsigned NOT NULL AUTO_INCREMENT) and the index (KEY idx_id (id))? Since you’ve already set the new PRIMARY KEY (user_id, friend_user_id), do you still need the id column and the index on it?

By: Sids https://www.percona.com/blog/tuning-innodb-primary-keys/#comment-10969492 Tue, 31 Jul 2018 03:31:33 +0000 https://www.percona.com/blog/?p=51380#comment-10969492 Great article on tuning methodology.

Just a random question: is there any specific need for the id column “id int(10) unsigned NOT NULL AUTO_INCREMENT” after you’ve shifted the primary key to “PRIMARY KEY (user_id,friend_user_id)”?

Is there any use case for keeping
the column id int(10) unsigned NOT NULL AUTO_INCREMENT and
the key idx_id (id)?

By: Yves Trudeau https://www.percona.com/blog/tuning-innodb-primary-keys/#comment-10969483 Sun, 29 Jul 2018 21:58:43 +0000 https://www.percona.com/blog/?p=51380#comment-10969483 You basically have a covering index; it will work great for searches by friend_user_id, and the InnoDB change buffer will limit the impact of the index maintenance.

I see a few issues, though:

– You cannot search by user_id.
– You don’t have uniqueness for the pair (user_id, friend_user_id).
– Query forms like “select friend_user_id from friends where user_id=1234 and active = 1;” are very bad because the index doesn’t provide “active”. The primary key b-tree must be queried.

My table works fine in all cases.

By: Jose María Zaragoza https://www.percona.com/blog/tuning-innodb-primary-keys/#comment-10969482 Sun, 29 Jul 2018 16:40:58 +0000 https://www.percona.com/blog/?p=51380#comment-10969482 Thanks for your post

Why not create the ‘friends’ table like this:

CREATE TABLE friends (
  id int(10) unsigned NOT NULL AUTO_INCREMENT,
  user_id int(10) unsigned NOT NULL,
  friend_user_id int(10) unsigned NOT NULL,
  created timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  active tinyint(4) NOT NULL DEFAULT '1',
  PRIMARY KEY (id),
  KEY idx_friend (friend_user_id, user_id)
) ENGINE=InnoDB AUTO_INCREMENT=144002 DEFAULT CHARSET=latin1

– The PK is auto-increment (so inserts are ordered; better for selecting the write page).
– There is only one secondary index. It supports searches by (friend_user_id) and by (friend_user_id, user_id).

Regards

By: Yves Trudeau https://www.percona.com/blog/tuning-innodb-primary-keys/#comment-10969479 Fri, 27 Jul 2018 18:04:04 +0000 https://www.percona.com/blog/?p=51380#comment-10969479 Varchars are slow to compare: the comparison is done one byte at a time and uses the charset and collation code. Integer types, other than mediumint, are compared by the CPU directly. MD5s and UUIDs are huge and random. All the secondary keys will be inflated by the huge PK. Maybe it is a good idea for developers, but it is bad for InnoDB. When I have a customer stuck with such PKs, we have to struggle to improve performance. I have an upcoming blog post on that topic.
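To make the size argument concrete, here is a hypothetical side-by-side (table and column names are made up): every secondary index entry carries a copy of the primary key, so the PK size is paid once per secondary index, per row.

```sql
-- 4-byte integer PK: each idx_email entry carries 4 extra bytes,
-- and inserts append to the rightmost page of the clustered index.
CREATE TABLE users_int (
  id INT UNSIGNED NOT NULL AUTO_INCREMENT,
  email VARCHAR(128) NOT NULL,
  PRIMARY KEY (id),
  KEY idx_email (email)
) ENGINE=InnoDB;

-- CHAR(36) UUID PK: each idx_email entry carries 36 extra bytes
-- (with a single-byte charset), and inserts land on random pages.
CREATE TABLE users_uuid (
  id CHAR(36) NOT NULL,  -- e.g. the output of UUID()
  email VARCHAR(128) NOT NULL,
  PRIMARY KEY (id),
  KEY idx_email (email)
) ENGINE=InnoDB;
```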

By: Stofa. https://www.percona.com/blog/tuning-innodb-primary-keys/#comment-10969478 Fri, 27 Jul 2018 16:15:59 +0000 https://www.percona.com/blog/?p=51380#comment-10969478 Hi,

This question may seem stupid.
What about the choice of using a varchar column as a PRIMARY KEY? Is there any limit on the maximum length of a varchar column used as the PRIMARY KEY of an InnoDB table?
Someone has suggested, as an alternative to int or bigint columns, using an MD5 hash or a UUID as the PRIMARY KEY for InnoDB.

Stofa.

Thanks.

By: Yves Trudeau https://www.percona.com/blog/tuning-innodb-primary-keys/#comment-10969476 Fri, 27 Jul 2018 12:27:16 +0000 https://www.percona.com/blog/?p=51380#comment-10969476 From https://dev.mysql.com/doc/refman/5.7/en/innodb-index-types.html

“The row ID is a 6-byte field that increases monotonically as new rows are inserted.”

By: 肖建平 https://www.percona.com/blog/tuning-innodb-primary-keys/#comment-10969474 Fri, 27 Jul 2018 02:04:38 +0000 https://www.percona.com/blog/?p=51380#comment-10969474 If there is no primary key for a table, InnoDB adds a hidden auto-incremented 6 bytes counter to the table and use that hidden counter as the primary key.
I remember it being 8 bytes.

By: Scott Klasing https://www.percona.com/blog/tuning-innodb-primary-keys/#comment-10969472 Thu, 26 Jul 2018 19:08:29 +0000 https://www.percona.com/blog/?p=51380#comment-10969472 Yves, great article on a subject so many consider only after they have placed a table in production. My only complaint is that you have not written anything on how to choose the best primary key, which requires knowledge of the data.

If more MySQL or database technicians were to focus on this during the data modeling phase, they would have far fewer problems. There are huge gains for those who best understand database modeling, normalization of data, and the relational model itself as invented by Ted Codd around 1970. The model is straight set theory and is easily proven mathematically to work well when adhered to.

I have to admit I am very surprised the database industry does not require you to specify a PK, rather than defaulting to an internal, hidden counter that is useless to users. Force the DBA/programmer/user to think about the best access path to the data. Ted Codd (inventor of the relational model) used to say that if you have not identified a table’s PK, then you have not identified the table’s purpose, with which I fully agree. The proof that PKs should be mandated: if you have a table without a PK, how do your batch programs identify which row should be updated? You might be able to view the data and make that determination, but a program cannot.

As most already know, a primary key must be unique and cannot contain nulls, otherwise you have not identified the row. So, to start, you first identify all possible candidate keys, i.e., those that are unique and don’t allow nulls. Second, choose the best of the candidate keys.

An example best explains this. In most MySQL shops you usually see lazy choices of auto-incremented PKs proliferating through the design, and at the same time you see common secondary indexes that should have, or could have, been the primary key. The point is that an improper choice of PK always results in additional unnecessary secondary indexes, which further slow inserts/updates/deletes. The pattern usually shows in the WHERE clauses: since the application did not know the value of the PK but did know a name or a description, it searched by the secondary key to then identify the PK. Note the similar pattern where MySQL includes the PK in all secondary indexes. Quite often the main secondary index, assuming it is unique, outweighs the auto-increment key that was not thought through.

Then question the impact of single-attribute PKs like auto-increment versus a key that better represents the business but is perhaps a two-attribute PK. If your model’s table hierarchy is only 2 to 3 levels deep, then a multi-attribute key is manageable. If it is much deeper, writing code becomes problematic, since the lower levels of the hierarchy have to add an additional attribute that represents the next level. Also question the order of the attributes in the PK, since your WHERE clauses must reference the leading attributes to even use the index. Also question the distinct cardinality of an attribute, because the higher its distinct cardinality, the better an attribute is in an index. For example, if an index’s first attribute has low cardinality (few distinct values), then the index has to read/lock more similar rows to answer the question.

I believe the above process is why DBAs, developers, and implementors get lazy: they don’t understand their data as well as they should. If they did, the above becomes a very easy process for identifying the best PK.

In summary, I always inform our teams: choose the candidate key that best represents the business, the one you will search by, and most likely order by and group by, since the PK is typically the clustered, already physically sorted index. Finally, for shops with huge write volumes, as in hundreds of thousands of queries per second and more, always, always, always UPDATE via the PK; never allow updates or deletes that are not PK based. Even if your volumes are lower, “design for millions of transactions per second” by only taking one-row, PK-based update/delete locks as opposed to locking many rows. If you don’t know the PK to update or delete by, then read in non-locking mode to obtain the list of PKs and then update or delete by the list.
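The read-then-update-by-PK pattern described above can be sketched like this (hypothetical table, column, and status values):

```sql
-- Bad: locks every row the scan touches while the UPDATE runs.
-- UPDATE orders SET status = 'archived' WHERE created < '2018-01-01';

-- Better: a non-locking read collects the primary keys first...
SELECT order_id FROM orders WHERE created < '2018-01-01';

-- ...then the application updates by PK, in small batches,
-- taking only single-row locks:
UPDATE orders SET status = 'archived' WHERE order_id IN (1001, 1002, 1003);
```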

I have used the above techniques for the last 37-plus years, and they have never failed to produce significantly improved performance results. The above logic applies to every database I have ever worked on (IDMS, IMS, DB2, Oracle, MySQL, ObjectStar, CouchDB, MongoDB, other NoSQL DBs, and several cloud DBs), not just InnoDB tables.

Loved the article, keep up the good work.
