Percona Backup for MongoDB (PBM) is a backup utility custom-built by Percona for customers and users who don’t want to pay for proprietary software like MongoDB Enterprise and Ops Manager but still want a fully supported open-source backup tool that can perform cluster-wide consistent backups in MongoDB.

Version 1.7.0 of PBM was released in April 2022 and comes with a technical preview of physical backup functionality, which enables users to benefit from significantly reduced recovery times.

With logical backups, you extract the data from the database and store it in some binary or text format, and for recovery, you write all this data back to the database. For huge data sets, this is a time-consuming operation and might take hours or even days. Physical backups instead copy the database’s files straight from disk, and recovery is just putting these files back. Such recovery is much faster, as it does not depend on the database performance at all.

In this blog post, you will learn about the architecture of the physical backups feature in PBM, see how fast it is compared to logical backups, and try it out yourself.

Tech Peek

Architecture Review

In general, physical backup means a copy of a database’s physical files. In the case of Percona Server for MongoDB (PSMDB), these are the WiredTiger *.wt B-tree, config, metadata, and journal files you can usually find in /data/db. The trick is to copy all those files without stopping the cluster or interrupting running operations, while being sure that the data in the files is consistent and that no data changes during copying. Another challenge is to achieve consistency in a sharded cluster: in other words, how to be sure that we are going to be able to restore data to the same cluster time across all shards.

PBM’s physical backups are based on the backupCursors feature of Percona Server for MongoDB (PSMDB). This implies that to use this feature, you should use Percona Server for MongoDB.

Backup

On each replica set, PBM uses $backupCursor to retrieve the list of files that need to be copied for the backup. Having that list, the next step is to ensure cluster-wide consistency. For that, each replica set posts the cluster time of the latest observed operation. The backup leader picks the most recent one. This will be the common backup timestamp (recovery timestamp), saved as the last_write_ts in the backup metadata. After agreeing on the backup time, the pbm-agent on each replica set opens a $backupCursorExtend. The cursor will return its result only after the node reaches the given timestamp. Thus the returned list of logs (journals) will contain the “common backup timestamp”. At that point, we have a list of all files that have to be in the backup. So each node copies them to the storage, saves the metadata, closes the cursors, and calls it a backup. Here is a blog post explaining Backup Cursors in great detail.
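For illustration, here is roughly what opening and extending a backup cursor looks like in the mongo shell; the backupId and timestamp values below are placeholders:

    // Open a backup cursor: the first batch returns backup metadata
    // (including the backupId), followed by the list of files to copy.
    db.getSiblingDB("admin").aggregate([{ $backupCursor: {} }])

    // Extend the cursor once the common backup timestamp is agreed on;
    // it returns the additional journal files after the node reaches that time.
    db.getSiblingDB("admin").aggregate([
      { $backupCursorExtend: { backupId: UUID("..."), timestamp: Timestamp(1650382576, 1) } }
    ])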

Of course, PBM does a lot more behind the scenes, starting from electing appropriate nodes for the backup, coordinating operations across the cluster, logging, error handling, and more. But all these subjects are for other posts.

Backup’s Recovery Timestamp

Restoring any backup, PBM returns the cluster to some particular point in time. Here we’re talking about time not in terms of wall time but MongoDB’s cluster-wide logical clock, so that point in time is consistent across all nodes and replica sets in a cluster. For both logical and physical backups, that time is reflected in the complete section of the pbm list or pbm status output. E.g.:
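(Illustrative output; snapshot names are example timestamps, and the exact format may differ slightly between PBM versions.)

    Backup snapshots:
      2022-04-18T11:02:37Z <logical> [complete: 2022-04-18T11:03:05]
      2022-04-19T15:36:14Z <physical> [complete: 2022-04-19T15:36:16]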

This time is not the time when a backup finished, but the time at which the cluster state was captured (hence the time the cluster will be returned to after the restore). In PBM’s logical backups, the recovery timestamp tends to be closer to the backup finish: to define it, PBM has to wait until the snapshot on all replica sets finishes, and then it captures the oplog from the backup start up to that time. With physical backups, PBM picks a recovery timestamp right after the backup starts. Holding the backup cursor open guarantees the checkpoint data won’t change during the backup, so PBM can define the complete time right away.

Restore

There are a few considerations for restoration.

First of all, files in the backup may contain operations beyond the target time (commonBackupTimestamp). To deal with that, PBM uses a special function of the replication subsystem’s startup process to limit the oplog being restored. It’s done by setting the oplogTruncateAfterPoint value in the local DB’s replset.oplogTruncateAfterPoint collection.
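As a sketch, this is roughly what setting that value looks like in the mongo shell against a standalone node; the timestamp is illustrative, and PBM does this internally rather than expecting you to run it:

    // Set the truncate point in the local database so that, on startup,
    // the replication subsystem discards oplog entries beyond this point.
    db.getSiblingDB("local").replset.oplogTruncateAfterPoint.updateOne(
      { _id: "oplogTruncateAfterPoint" },
      { $set: { oplogTruncateAfterPoint: Timestamp(1650382576, 1) } },
      { upsert: true }
    )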

Along with the oplogTruncateAfterPoint, the database needs some other changes and clean-up before it starts. This requires a series of restarts of PSMDB in standalone mode.

This in turn brings some hassle to the PBM operation. To communicate and coordinate its work across all agents, PBM relies on PSMDB itself. But once a cluster is taken down, PBM has to switch to communication via the storage. Also, during standalone runs, PBM is unable to store its logs in the database. Hence, at some point during restore, pbm-agent logs are available only in the agents’ stderr, and pbm logs won’t have access to them. We’re planning to solve this problem by the physical backups GA.

Also, we had to decide on the restore strategy within a replica set. One way is to restore one node, then delete all data on the rest and let PSMDB replication do the job. Although it’s a bit easier, it means the cluster will be of little use until the initial sync finishes. Besides, logical replication at this stage almost negates all the speed benefits (more on that later) the physical restore brings to the table. So we went with restoring each node in a replica set, making sure that after the cluster starts, no node will spot any difference and trigger a resync.

As with PBM’s logical backups, physical ones can currently be restored only to a cluster with the same topology, meaning the replica set names in the backup and the target cluster should match. This won’t be an issue for logical backups starting from the next PBM version, and later this feature will be extended to physical backups as well. Along with that, the number of replica sets in the cluster can be greater than in the backup, but not vice versa, meaning all data in the backup must be restored.

Performance Review

We used the following setup:

  • Cluster: 3-node replica set. Each mongod+pbm-agent on Digital Ocean droplet: 16GB, 8vCPU (CPU optimized).
  • Storage: nyc3.digitaloceanspaces.com
  • Data: randomly generated, ~1MB documents

[Chart: backup and restore times, logical vs. physical, for the setup above]

In general, a logical backup should be more beneficial on small databases (a few hundred megabytes), since at that scale the extra overhead that physical files carry on top of the data still makes a difference. Reading and writing only user data during a logical backup means less data needs to be transferred over the network. But as the database grows, the overhead of logical reads (selects) and, mostly, writes (inserts) becomes the bottleneck for logical backups. As for physical backups, the speed is almost always bounded only by the network bandwidth to/from the remote storage. In our tests, restoration time from physical backups grew linearly with the dataset size, whereas logical restoration time grew non-linearly: the more data you have, the longer it takes to replay it all and rebuild the indexes. For example, for a 600GB dataset, the physical restore took 5x less time than the logical one.

But on a small DB, the difference is negligible – a couple of minutes. So the main benefit of logical backups lies beyond performance: it’s flexibility. Logical backups allow partial backup/restore of the database (on the roadmap for PBM); you can choose particular databases and/or collections to work with. As physical backups work directly with the database’s storage-engine files, they operate in an all-or-nothing frame.

Hands-on

PBM Configuration

In order to start using PBM with PSMDB or MongoDB, install all the necessary packages according to the installation instructions. Please note that starting from version 1.7.0, the user running the pbm-agent process should also have read/write access to the PSMDB data directory for the purpose of performing operations with the data files during physical backup or restore.

Considering this design, starting from 1.7.0 the default user for pbm-agent has changed from pbm to mongod. So unless PSMDB runs under a user other than mongod, no extra actions are required. Otherwise, please carefully re-check your configuration and provide the necessary permissions to ensure proper PBM functioning.
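For example, a quick way to check which user owns the data files and which user mongod runs under (the data path below is a common default; yours may differ):

    $ ls -ld /var/lib/mongodb
    drwxr-xr-x 4 mongod mongod 4096 Apr 19 15:36 /var/lib/mongodb
    $ ps -o user= -p $(pgrep -x mongod)
    mongod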

In addition, keep in mind that to use PBM physical backups, you should run Percona Server for MongoDB version 4.2.15-16 or 4.4.6-8 and higher – this is where hotBackups and backup cursors were introduced.

Creating a Backup

With the new PBM version, you can specify which type of backup you wish to make: physical or logical. By default, when no type is selected, PBM makes a logical backup.
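For example:

    # request a physical backup
    $ pbm backup --type=physical

    # the default, logical backup; same as plain 'pbm backup'
    $ pbm backup --type=logical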

Point-in-Time Recovery

Point-in-Time Recovery is currently supported only for logical backups. This means a logical backup snapshot is required for pbm-agent to start periodically saving consecutive slices of the oplog. You can still make a physical backup while PITR is enabled; it won’t break or change the oplog-saving process.
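For reference, PITR oplog slicing is enabled via the PBM configuration:

    $ pbm config --set pitr.enabled=true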

The restoration process to a specific point in time will also use a respective logical backup snapshot and the oplog slices, which will be replayed on top of the backup.
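A point-in-time restore takes a target time instead of a backup name (the timestamp below is illustrative):

    $ pbm restore --time="2022-04-19T15:36:14"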

Checking the Logs

During a physical backup, PBM logs are available via the pbm logs command, just as for all other operations.
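For example (the backup name below is illustrative):

    # show the last 100 log entries across the cluster
    $ pbm logs --tail=100

    # filter entries related to a particular backup
    $ pbm logs --event=backup/2022-04-19T15:36:14Z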

As for restore, the pbm logs command doesn’t provide information about restores from a physical backup. This is caused by peculiarities of the restore procedure and will be improved in upcoming PBM versions. However, pbm-agent still saves logs locally, so it’s possible to check information about the restore process on each node:
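    # assuming pbm-agent runs as a systemd service (the default for package installs)
    $ journalctl -u pbm-agent -o cat | grep -i restore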

Restoring from a Backup

The restore process from a physical backup is similar to a logical one but requires several extra steps after the restore is finished by PBM.

After starting the restore process, the pbm CLI returns the leader node ID, so it’s possible to track the restore progress by checking the logs of the pbm-agent leader. In addition, the status is written to a metadata file created on the remote storage. The status file is created in the root of the storage path and has the format .pbm.restore/<restore_timestamp>.json. As an option, it’s also possible to pass the -w flag during restore, which will block the current shell session and wait for the restore to finish:
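    # backup name is illustrative
    $ pbm restore 2022-04-19T15:36:14Z -w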

After the restore is complete, it’s required to perform the following steps:

  • Restart all mongod (and mongos if present) nodes
  • Restart all pbm-agents
  • Run the following command to resync the backup list with the storage:
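    $ pbm config --force-resync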

Conclusion

MongoDB allows users to store enormous amounts of data, especially in sharded clusters, where users are not limited by the size of a single storage volume. Database administrators often have to implement various home-grown solutions to ensure timely backups and restores of such big clusters. The usual approach is a storage-level snapshot, but such solutions do not guarantee data consistency and provide false confidence that the data is safe.

Percona Backup for MongoDB with physical backup and restore capabilities enables users to back up and restore data fast and, at the same time, comes with data-consistency guarantees.

Physical backup functionality is in the Technical Preview stage. We encourage you to read more about it in our documentation and try it out. In case you face any issues, feel free to contact us on the forum or raise a JIRA issue.
