$backupCursorExtend

Percona Server for MongoDB (PSMDB) has provided a ‘hot’ backup of its underlying db data directory files using WiredTiger library methods since v3.4. When you use the { createBackup: … } command it will copy them to whichever filesystem directory or object store bucket location you specify.

createBackup partially freezes the WiredTiger *.wt Btree files with an internal backup cursor while the command runs. That’s “partially” because more data may be added to the files whilst the data is copied to the backup storage. The key point is all the file blocks of the last checkpoint will be frozen until the backup cursor is closed.
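For reference, the existing command looks roughly like this when pointed at a filesystem directory (the backupDir path here is only an example destination; object store destinations take different options not shown in this sketch):

// Hot backup to a local directory with the long-standing PSMDB command
db.adminCommand({ createBackup: 1, backupDir: "/data/backups/psmdb-hot-backup" })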

As well as the Btree-storing *.wt files, some metadata files, config-ish files, and the journal/WiredTigerLog.* files will also be copied. We’ll get into those a little later though.

In PSMDB 4.4.6-8 we’ve added two new aggregation pipeline stages that make the file-by-file mechanics of this fast backup procedure more visible. They also add support for cluster-consistent backup and restore.

These aggregation pipeline stages are not real aggregation functions such as group, sort, etc. They’re functions for DBAs to use, like $collStats, $currentOp, $getDiagnosticData, etc.

New Aggregation Stage #1: $backupCursor

When you run db.aggregate([{$backupCursor: {}}]) it simply returns a list of files to copy.

See the “Full instructions for using” section below for a thorough example.

It just lists the files and their sizes without copying them anywhere, which is less convenient than the existing createBackup command, which does the copying for you. So why did we add it? Compatibility with MongoDB Enterprise edition is the main reason. But there’s another one – this aggregation stage exposes the WiredTiger library’s backup cursor API in an almost 1-to-1 structural match, which is educational for advanced users, and that’s worth something.
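In the mongo shell it looks roughly like this (a sketch from a test run; the exact metadata fields can vary between versions):

// Open the backup cursor against the admin db; no options are required
var backupCursor = db.getSiblingDB("admin").aggregate([ { $backupCursor: {} } ])
// The first document is a "metadata" doc: backupId, dbpath, oplogStart, oplogEnd, checkpointTimestamp, ...
backupCursor.next()
// Every following document names one file to copy, e.g. { filename: "<dbpath>/collection-7-xxxx.wt", fileSize: NumberLong(...) }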

Cursors of … File Names?

You may have heard that in Unix “everything is a file”. Well, in the WiredTiger library API everything is a cursor. Tables and indexes are accessed by a cursor of course, and so is the transaction log. Colgroups are too. Also statistics and metadata. Even files are accessed through a cursor. How did this happen? I guess when you’re a db engine developer … hammers, nails – you know the saying.

New Aggregation Stage #2: $backupCursorExtend

This extra function returns just a few more file paths to copy, so the output is small. The files it returns are just the WiredTiger transaction log files that have been updated or newly added since the $backupCursor was first run.

The WT log files are small, but they have an important benefit: in combination with some restart-time tricks, we can use the WT transaction ops in these files to restore to any arbitrary time between the $backupCursor and $backupCursorExtend execution times.

You must run $backupCursor first. As well as keeping that first cursor open, which is a requirement, it’ll be helpful to save a copy of the “metadata” document that is its first record. The reason is that the “backupId” value in the metadata is a required parameter when creating the $backupCursorExtend cursor – it has to know from which main backup cursor it is meant to extend.

The second parameter, the timestamp one, will block the command from returning its result until the node has reached that cluster time. This guarantees that you don’t have, for example, a lagging secondary’s snapshot that ends at a cluster time that is before the starting cluster time of the other shards’ backups.
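Putting those two parameters together, the stage document looks roughly like this (backupId and targetClusterTime are placeholder variable names for values you capture yourself):

// backupId: the UUID from the first $backupCursor metadata document on this node
// timestamp: the cluster time the extended backup must cover before the command returns
{ $backupCursorExtend: { backupId: backupId, timestamp: targetClusterTime } }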

Full Instructions for Using $backupCursor and $backupCursorExtend

N.b. this feature is experimental in PSMDB as of now (v4.4.6), so it may change in the future.

Backup

From one node in each replica set (including the config server replica set) run the following procedure.

It’s OK to use either a primary or a secondary, but using a secondary that has significant lag will oblige you to wait for that much time in between steps 1 and 2.

1. Get a list of all files to copy with $backupCursor

Getting the result response, with the metadata doc and the list of files, will take less than a second, but (important!) this cursor must be kept alive until the end of all the backup steps shown below. So don’t close it.

When the command has completed on all replica sets, save somewhere (in your script, etc.) the maximum (latest) “oplogEnd” timestamp amongst the replica sets – this will be the target, common time of the backup to restore. In this blog, I’ll call this value the commonBackupTimestamp.
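A rough sketch of this step in the mongo shell (the variable names are mine; using runCommand rather than db.aggregate() keeps the cursor id handy for the keep-alive loop in step 3):

// Run against the admin db
var adminDB = db.getSiblingDB("admin")
var res = adminDB.runCommand({ aggregate: 1, pipeline: [ { $backupCursor: {} } ], cursor: { batchSize: 10000 } })
var batch = res.cursor.firstBatch
var metadataDoc = batch[0].metadata      // save this; its backupId is needed for $backupCursorExtend
var backupId = metadataDoc.backupId
var oplogEnd = metadataDoc.oplogEnd      // the maximum of these across all replica sets = commonBackupTimestamp
var backupCursorId = res.cursor.id       // keep this cursor alive until the file copy is finished
var filesToCopy = batch.slice(1)         // [ { filename: ..., fileSize: ... }, ... ]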

(Before step 2: wait until step 1 has completed on all replica sets.)

Step 1 will only take ~1 second to complete, but make sure you finish it for all servers involved before doing the next step.

2. If backing up a cluster, also run $backupCursorExtend

Do it as soon as you like; you only need to capture a span of time that is long enough to overlap the same process on the other shards and the config server replica set.

Pass the metadata “backupId” from the $backupCursor on this node as the first parameter, and the commonBackupTimestamp (from all replica sets) as the second parameter. This could be run within seconds if your script/program is fast and is good at parallel execution across servers.

You don’t need to explicitly close the $backupCursorExtend cursor – that happens automatically. It doesn’t need to pin any files because $backupCursor is already doing that.
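Continuing the sketch from step 1, and assuming commonBackupTimestamp has already been determined as the maximum oplogEnd across all replica sets, this step might look like:

// backupId is from this node's $backupCursor metadata; commonBackupTimestamp is the agreed target cluster time
var extendCursor = adminDB.aggregate([
  { $backupCursorExtend: { backupId: backupId, timestamp: commonBackupTimestamp } }
])
// Each returned document names one more journal/WiredTigerLog.* file; add them all to the copy list
extendCursor.forEach(function (doc) { filesToCopy.push(doc) })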

3. Start a loop that runs getMore on the backup cursor more often than every 10 minutes

This will prevent the automatic close of the backup cursor. This is crucial – if the backup cursor is closed, the WiredTiger library will start allowing the backup snapshot’s file blocks to be cleaned up and overwritten.

The $backupCursor MongoDB cursor is a tailable one that will return a successful but empty result as many times as you need to call getMore on it.
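Continuing the same sketch, a simple keep-alive loop could look like this (copyStillRunning() is a hypothetical stand-in for however your script checks that the file copy of step 4 is still in progress):

// The namespace of a collection-less aggregate is "admin.$cmd.aggregate"; derive the collection part from the response
var collName = res.cursor.ns.split(".").slice(1).join(".")
while (copyStillRunning()) {
  adminDB.runCommand({ getMore: backupCursorId, collection: collName })  // returns an empty batch, keeps the cursor alive
  sleep(5 * 60 * 1000)                                                   // 5 minutes, comfortably under the 10-minute timeout
}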

4. Now start copying all the files listed to your backup storage

This is the part that takes hours … or whatever <your-db-file-size>/<transfer-rate> is for you.

5. Terminate the getMore loop

6. Close the cursor of the $backupCursor command

Or just let it time out (10mins).
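If you’d rather close it explicitly than wait for the timeout, a killCursors command against the same namespace does it (again continuing the variables from the earlier sketches):

// Explicitly release the backup cursor once all files have been copied
adminDB.runCommand({ killCursors: collName, cursors: [ backupCursorId ] })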

7. Save the commonBackupTimestamp for restore.

This is a MongoDB timestamp, e.g. Timestamp(162218619, 1). If you don’t save it in a handy place you’ll have to peek at all the restored nodes to find some common cluster timestamp they all support.

Restore

Having made backups with the current $backupCursor and $backupCursorExtend some work is required to restore a consistent cluster state. The reason is that $backupCursorExtend, in its current experimental version at least, returns WT log files that probably contain operations beyond the target cluster time.

To fix this we can use a special function of the replication subsystem’s startup process to trim the oplog being restored beyond the last checkpoint back to the desired, common cluster time. The “oplogTruncateAfterPoint” value in the local db’s replset.oplogTruncateAfterPoint collection is how you set that limit. (It’s undocumented in the MongoDB Manual, but it’s a critical part of replication that is updated all the time during normal operation.)
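Setting it looks roughly like this, assuming the single-document layout that current 4.4 versions use (run it while the node is started as a standalone; commonBackupTimestamp is the value saved at backup time):

// Upsert the single truncate-point document in the local db
db.getSiblingDB("local").replset.oplogTruncateAfterPoint.updateOne(
  { _id: "oplogTruncateAfterPoint" },
  { $set: { oplogTruncateAfterPoint: commonBackupTimestamp } },
  { upsert: true }
)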

The big hassle is that more restarts are required – the oplogTruncateAfterPoint value can only be set whilst in standalone mode because it’s a field in a system collection of the local db; enabling replication again is required to get the value to be enacted upon. Saving the final result in a new checkpoint takes another standalone mode restart, with special extra options. And, unrelated to timestamp matters, you’ll probably be fixing the replica set and shard configuration to accept the new hostnames too.

For the general process of how to restore a filesystem snapshot of a cluster please see: Restore a Sharded Cluster. Although it’s not an absolutely required part of this restore process, we also recommend setting a temporary, non-default port in the config file for all restarts, except the final one when rs.initiate() is executed. This prevents accidental connection to the original replica set nodes, wherever that environment is, and prevents interference between config and shard nodes before they’re perfectly restored.

The difference from the documented method above is that, for both configsvr and shardsvr type nodes, instead of following the two steps to start the node in standalone mode and run db.getSiblingDB("local").dropDatabase(), do these three steps instead:

  • Start mongod as standalone (i.e. replication and sharding options deleted or commented out)
    • Don’t drop the entire local db. We need to keep the oplog. Drop just the “replset.minvalid”, “replset.oplogTruncateAfterPoint”, “replset.election” and “system.replset” collections.
    • Insert a dummy replset.minvalid timestamp.
      db.getSiblingDB("local").replset.minvalid.insert({"_id": ObjectId(), "t": NumberLong(-1), "ts": Timestamp(0,1) });
    • Manually create a temporary replset configuration with a minimal config (a sketch is shown after this list).
    • Put the commonBackupTimestamp in the replset.oplogTruncateAfterPoint collection (see the update example earlier in this section).
    • Shutdown instance
  • Start as a single-node replica set to recover the oplog.
    • CSRS nodes can be started with sharding enabled (clusterRole=configsvr) but the shards’ replica set nodes should be started without clusterRole configured
    • At this start, oplog history is truncated to the point set in the oplogTruncateAfterPoint collection
    • Shutdown instance
  • Start as standalone with these parameters so the WT log and then mongodb-layer oplog updates are A) applied and B) saved into a checkpoint.
    • --setParameter recoverFromOplogAsStandalone=true
    • --setParameter takeUnstableCheckpointOnShutdown=true
    • Shutdown instance
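
The “temporary replset configuration” mentioned in the first group of steps isn’t reproduced above, so here is a hedged sketch of what a minimal one could look like, inserted into local.system.replset while the node is a standalone. “shard0rs” and “localhost:27099” are placeholders – use this replica set’s real name and the temporary host:port this node is listening on:

// Minimal single-member replset config so the next start as a replica set has something to load
db.getSiblingDB("local").system.replset.insert({
  "_id": "shard0rs",
  "version": NumberInt(1),
  "protocolVersion": NumberInt(1),
  "members": [ { "_id": NumberInt(0), "host": "localhost:27099" } ],
  "settings": { }
})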

I won’t say “… and that’s all there is to it!”, because five restarts is a big job. (Per the usual process in Restore a Sharded Cluster there are another two restarts to fix up the replica set and sharding config when moving hosts.) But this is how we can restore all nodes to exactly the same MongoDB cluster time using the currently experimental $backupCursorExtend being released in Percona Server for MongoDB v4.4 this week.

Summary

The WiredTiger API exposes ways to list all files that should be copied for backup, and it will freeze those files so you can copy them without them being deleted or necessary parts within them being overwritten. These methods of the WT storage engine are exposed at the MongoDB user level with the $backupCursor and $backupCursorExtend aggregation pipeline stages.

$backupCursorExtend in particular gives us a way to keep on collecting WT transaction log files so that we can guarantee coverage of some common time between all shards and the config server replica set of a sharded cluster.

This makes it possible to achieve cluster-consistent filesystem copy restores of a MongoDB sharded cluster. Distributed transactions, if you are using them, will be restored all-or-nothing style up to the cluster timestamp you specify during restore.

Looking ahead: $backupCursorExtend in PSMDB 4.4.6 (and as far as we can tell, in MongoDB Enterprise edition too) doesn’t limit transaction log file copies to an exact MongoDB cluster time. It only guarantees that the copied log files include everything up to and including that time – they may include later writes too. So during restore, we must use extra techniques that the MongoDB replication system provides, in particular the undocumented oplogTruncateAfterPoint trick, to make a clean cut of the oplog at the same cluster timestamp on every node.

After we’ve heard feedback from the community about these new backup cursor functions we’d like to improve them, so let us know how you use them and what a better solution would be for you.

Percona Distribution for MongoDB is a freely available MongoDB database alternative, giving you a single solution that combines the best and most important enterprise components from the open source community, designed and tested to work together.

Download Percona Distribution for MongoDB Today!
