Background on Backup Locks

I was very excited to see backup locks support in the release notes for the latest Percona XtraDB Cluster 5.6.21 release. For those who are not aware, backup locks offer an alternative to FLUSH TABLES WITH READ LOCK (FTWRL) in XtraBackup. While XtraBackup can hot-copy InnoDB data, everything else in MySQL must be locked (usually briefly) to get a consistent snapshot that lines up with InnoDB. This includes all other storage engines, but also things like table schemas (even on InnoDB) and async replication binary logs. You can skip this lock, but the resulting backup isn't generally considered safe in every case.

Until recently, XtraBackup (like most other backup tools) used FTWRL to accomplish this. This worked well, but had the unfortunate side effect of locking every single table, even the InnoDB ones. In practice, it meant that even a hot-backup tool for InnoDB had to take a (usually short) global lock to get a backup consistent with MySQL overall.
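For contrast, here is roughly the critical section an FTWRL-based backup tool issues, sketched as a client session:

    FLUSH TABLES WITH READ LOCK;  -- global read lock: writes to ALL tables stall, InnoDB included
    -- ... copy non-InnoDB data, table definitions, binlog coordinates ...
    UNLOCK TABLES;                -- releases the global read lock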

Backup locks change that by introducing a new locking command in Percona Server called LOCK TABLES FOR BACKUP. This works by blocking writes to non-transactional tables, as well as blocking DDL on all tables (including InnoDB). If XtraBackup (of a recent vintage) detects that it's backing up a Percona Server (also of recent vintage), it will automatically use LOCK TABLES FOR BACKUP instead of FLUSH TABLES WITH READ LOCK.
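The new sequence looks almost identical from a client session, but is far less intrusive; a minimal sketch, assuming a Percona Server 5.6.16+ session:

    LOCK TABLES FOR BACKUP;  -- blocks non-transactional writes and DDL on all tables;
                             -- InnoDB DML keeps flowing
    -- ... copy non-InnoDB data and table definitions ...
    UNLOCK TABLES;           -- releases the backup lock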

The TL;DR of this is that you can keep modifying your InnoDB data through the entire backup, since we don't need FTWRL any longer.
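To make that concrete, here is a hedged two-session sketch (the table names are hypothetical; the behavior follows the lock semantics described above):

    -- session 1:
    LOCK TABLES FOR BACKUP;

    -- session 2, while session 1 holds the backup lock:
    INSERT INTO test.innodb_tbl VALUES (1);        -- InnoDB DML: proceeds normally
    INSERT INTO test.myisam_tbl VALUES (1);        -- non-transactional write: blocks
    ALTER TABLE test.innodb_tbl ADD COLUMN c INT;  -- DDL, even on InnoDB: blocks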

This feature was introduced in Percona Server 5.6.16-64.0 and Percona XtraBackup 2.2.  I do not believe you will find it in any other MySQL variant, though I could be corrected.

What this means for Percona XtraDB Cluster (PXC)

The most common (and logical) SST method for Percona XtraDB Cluster is XtraBackup. This latest release of PXC includes support for backup locks, meaning that XtraBackup donor nodes no longer need to take a global lock. Practically, for PXC users, this means your Donor nodes can stay in rotation without causing client interruptions due to FTWRL.
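Nothing extra is needed to turn this on, provided the donor meets the version requirements and uses the XtraBackup SST. For reference, a minimal my.cnf sketch of the relevant setting (xtrabackup-v2 is the usual SST script name on PXC 5.6; adjust for your install):

    [mysqld]
    wsrep_sst_method = xtrabackup-v2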

Seeing it in action

To test this out, I spun up a 3-node cluster on AWS and fired up a sysbench run on the first node, then forced an SST on the node; a quick (if blunt) way to force one is sketched below.
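Removing the node's saved Galera state before a restart makes it request a full SST on startup. A minimal sketch, assuming the default /var/lib/mysql datadir and an init-script installation (note this discards the node's ability to rejoin via the faster IST, so don't do it casually):

    # stop the node, invalidate its saved state, and restart;
    # on startup it will request a full SST from a donor
    service mysql stop
    rm /var/lib/mysql/grastate.dat
    service mysql start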

In the donor's innobackup.backup.log (generated by all XtraBackup donors in Percona XtraDB Cluster), we can see the LOCK TABLES FOR BACKUP issued at 19:24:06 and released at 19:24:10. Now let's look at the Galera apply stats from this node during that window.

Initially the node keeps up with replication without trouble: the down queue (wsrep_local_recv_queue) sticks around 0, and we're applying 4-5k transactions per second ('Ops Dn' in the monitoring output). When the backup lock kicks in, we do see an increase in the queue size, but note that transactions are still being applied on this node.
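If you want to watch the same counters yourself without a monitoring tool, the underlying status variables are available directly; a minimal sketch:

    -- writesets queued for apply on this node (the 'down queue')
    SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';
    -- current node state: Synced, Donor/Desynced, Joined, etc.
    SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';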

So this node isn't locked out of InnoDB write transactions; it's just under some extra I/O load while the backup finishes copying its files. After that, the backup wraps up and the node returns to the Synced state fairly quickly.

Compared to Percona XtraDB Cluster 5.5

Backup locks are only a feature of Percona XtraDB Cluster 5.6, so if we repeat the experiment on 5.5, we can see a more severe lock.

Our lock lasts from 20:31:19 until 20:31:21, so it's fairly short; note, though, that on larger databases with more schemas and tables, it can last quite a bit longer. Let's look at the effect on the apply rate for this node.

The drop here is more severe and the apply rate hits 0 (and stays there for the duration of the FTWRL).

Implications

Obviously XtraBackup running on a PXC node will cause some load on the node itself, so there may still be good reasons to keep a Donor node out of rotation from your application. However, this is less of an issue than it was in the past, when writes would definitely stall on a Donor node and could cause intermittent stalls in the application.

Whether you allow applications to start using a Donor node automatically (or not) depends on how you have HA set up between the application and the cluster. If you use HAProxy or similar with clustercheck, you can either modify the script itself or change a command line argument; by default, clustercheck reports a node in the Donor/Desynced state as unavailable.
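As a hedged example of the command-line route: in the stock clustercheck script, the third positional argument (AVAILABLE_WHEN_DONOR) controls whether a Donor/Desynced node is reported as available; the user and password shown here are the script's conventional defaults, so substitute your own:

    # report HTTP 200 (available) even when the node is in Donor/Desynced state
    /usr/bin/clustercheck clustercheckuser 'clustercheckpassword!' 1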

For those doing their own custom health checking, you basically just need to pass nodes that have a wsrep_local_state_comment of either ‘Synced’ or ‘Donor/Desynced’.
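A minimal sketch of such a check as a shell one-liner (connection options and the exit-code convention are up to you):

    # exit 0 (healthy) if the node is Synced or a Desynced donor
    mysql -N -B -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment'" \
      | grep -Eq 'Synced|Donor/Desynced'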