PostgreSQL logopg_basebackup is a widely used PostgreSQL backup tool that allows us to take an ONLINE and CONSISTENT file system level backup. These backups can be used for point-in-time-recovery or to set up a slave/standby. You may want to refer to our previous blog posts, PostgreSQL Backup StrategyStreaming Replication in PostgreSQL and Faster PITR in PostgreSQL where we describe how we used pg_basebackup for different purposes. In this post, I’ll demonstrate the steps to restore a backup taken using pg_basebackup when we have many tablespaces that store databases or their underlying objects.

Backup and Restore a PostgreSQL Cluster

A simple backup can be taken using the following syntax.

Using a tar and compressed format is advantageous when you wish to use less disk space to backup and store all tablespaces, data directory and WAL segments, with everything in just one directory (target directory for backup).

Whereas a plain format stores a copy of the data directory as is, in the target directory. When you have one or more non-default tablespaces, tablespaces may be stored in a separate directory. This is usually the same as the original location, unless you use --tablespace-mapping  to modify the destination for storing the tablespaces backup.

PostgreSQL supports the concept of tablespaces. In simple words, a tablespace helps us maintain multiple locations to scatter databases or their objects. In this way, we can distribute the IO and balance the load across multiple disks.

To understand what happens when we backup a PostgreSQL cluster that contains multiple tablespaces, let’s consider the following example. We’ll take these steps:

  • Create two tablespaces in an existing master-slave replication setup.
  • Take a backup and see what is inside the backup directory.
  • Restore the backup.
  • Conclude our findings

Create 2 tablespaces and take a backup (tar format) using pg_basebackup

Step 1 :

I set up a replication cluster using PostgreSQL 11.2. You can refer to our blog post Streaming Replication in PostgreSQL to reproduce the same scenario. Here are the steps used to create two tablespaces:

Step 2 :

Now, I create two databases in two different tablespaces, using pgbench to create a few tables and load some data in them.

In a master-slave setup built using streaming replication, you must ensure that the directories exist in the slave, before running a "CREATE TABLESPACE ..."  on the master. This is because the same statements used to create a tablespace are shipped/applied to the slave through WALs – this is unavoidable. The slave crashes with the following message, when these directories do not exist:

Step 3 :

Let’s now use pg_basebackup to take a backup. In this example, I use a tar format backup.

In the above log, you could see that there are three tablespaces that have been backed up: one default, and two newly created tablespaces. If we go back and check how the data in the two tablespaces are distributed to appropriate directories, we see that there are symbolic links created inside the pg_tblspc directory (within the data directory) for the oid’s of both tablespaces. These links are directed to the actual location of the tablespaces, we specified in Step 1.

Step 4 :

Here are the contents inside the backup directory, that was generated through the backup taken in Step 3.

Tar Files : 16419.tar.gz and 16420.tar.gz are created as a backup for the two tablespaces. These are created with the same names as the OIDs of their respective tablespaces.

Let’s now take a look at how we can restore this backup to completely different locations for data and tablespaces.

Restore a PostgreSQL Cluster backup with multiple tablespaces

Step 1 :

In order to proceed further with the restore, let’s first extract the base.tar.gz file. This file contains some important files that help us to proceed further.

Step 2 :

The files that we need to consider for our recovery are :

  • backup_label
  • tablespace_map

When you open the backup_label file, we see the start WAL location, backup start time, etc. These are some details that help us perform a point-in-time-recovery.

Now, let us see what is inside the tablespace_map file.

In the above log, you could see that there are two entries – one for each tablespace. This is a file that maps a tablespace (oid) to its location. When you start PostgreSQL after extracting the tablespace and WAL tar files, symbolic links are created automatically by postgres – inside the pg_tblspc directory for each tablespace – to the appropriate tablespace location, using the mapping done in this files.

Step 3 :

Now, in order to restore this backup in the same postgres server from where the backup was taken, you must remove the existing data in the original tablespace directories. This allows you to extract the tar files of each tablespaces to the appropriate tablespace locations.

The actual commands for extracting tablespaces from the backup, in this case, were the following:

In a scenario where you want to restore the backup to the same machine from where the backup was originally taken, we must use different locations while extracting the data directory and tablespaces from the backup. In order to achieve that, tar files for individual tablespaces may be extracted to different directories than the original directories specified in tablespace_map file, upon which we can modify the tablespace_map file with the new tablespace locations. The next two steps should help you to see how this works.

Step 3a :

Create two different directories and extract the tablespaces to them.

Step 3b :

Edit the tablespace_map file with the new tablespace locations. Replace the original location of each tablespace with the new location, where we have extracted the tablespaces in the previous step. Here is how it appears after the edit.

Step 4 :

Extract pg_wal.tar.gz from backup to pg_wal directory of the new data directory.

Step 5 :

Create recovery.conf to specify the time until when you wish to perform a point-in-time-recovery. Please refer to our previous blog post – Step 3, to understand recovery.conf for PITR in detail.

Step 6 :

Once all of the steps above are complete you can start PostgreSQL.
You should see the following files renamed after recovery.

To avoid the exercise of manually modifying the tablespace_map file, you can use --tablespace-mapping . This is an option that works when you use a plain format backup, but not with tar. Let’s see why you may prefer a tar format when compared to plain.

Backup of PostgreSQL cluster with tablespaces using plain format

Consider the same scenario where you have a PostgreSQL cluster with two tablespaces. You might see the following error when you do not use --tablespace-mapping .

What the above error means is that the pg_basebackup is trying to store the tablespaces in the same location as the original tablespace directory. Here /data_pgbench is the location of tablespace : data_pgbench. And, now, pg_basebackup is trying to store the tablespace backup in the same location. In order to overcome this error, you can apply tablespace mapping using the following syntax.

-T is used to specify the tablespace mapping. -T can be replaced by --tablespace-mapping.

The advantage of using -T ( --tablespace-mapping ) is that the tablespaces are stored separately in the mapping directories. In this example with plain format backup, you must extract all the following three directories in order to restore/recover the database using backup.

  • /backup/latest_backup
  • /pgdata_pgtest
  • /pgdata_pgbench

However, you do not need a tablespace_map  file in this scenario, as it is automatically taken care of by PostgreSQL.
If you take a backup in tar format, you see all the tar files for base, tablespaces and WAL segments stored in the same backup directory, and just this directory can be extracted for performing restore/recovery. However, you must manually extract the tablespaces and WAL segments to appropriate locations and edit the tablespace_map file, as discussed above.


Image based on Photos by Alan James Hendry on Unsplash   and  Tanner Boriack on Unsplash

1 Comment
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Shashank R

Hi Avinash,

I am new to Postgresql. In our project we are trying to migrate to Postgresql.

When there is a requirement to LOAD huge volume of data into a table, usual process followed in Mainframe – DB2 is as mentioned below

• Set Quiesce point or take image copy of the tablespace (this is nothing but a backup/restore point if in case there is issue with loading the data)
• Load table
• Repair tablespace – This is to check copy pending status and Check data
• Reorg and Runstats Indexes in the tablespace – This is to ensure that the indexes are rebuilt post the mass load.

If the load volume is low or confined to few hundreds of records not all of these steps are performed. The only reason for following the above process is for performance.

Kindly let me know for Postgres what is the process followed in general, I believe not all the above steps are required. Please clarify.

Thanks,

Shashank R.