PostgreSQL has a feature called data checksums, which is used to detect storage-level corruption of data pages. However, a new PostgreSQL database cluster does not have this feature enabled by default. To activate it, you must specify the --data-checksums flag when initializing a new cluster with the initdb utility.
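As a quick illustration, here is how a brand-new cluster can be initialized with checksums enabled and the setting verified (the data directory is a throwaway example; PostgreSQL binaries are assumed to be on the PATH):

```shell
# Initialize a throwaway cluster with data checksums enabled.
PGDEMO=$(mktemp -d)/data
initdb --data-checksums -D "$PGDEMO"

# pg_controldata reports a non-zero "Data page checksum version"
# when checksums are enabled at the storage level.
pg_controldata "$PGDEMO" | grep -i checksum
```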
So what about clusters that are already running? Is there any way to enable data checksums for an existing cluster? Previously, the only way was to initialize a new cluster with data checksums and move the data into it using dump-restore or logical replication. Unfortunately, these methods are slow. Starting with PostgreSQL 12, users can utilize the built-in pg_checksums utility to enable checksums in a cluster that was not initialized with this feature. However, this utility only works when the database server is offline. For a large production server, it may be difficult to schedule a significant amount of downtime. To avoid this issue, a primary-standby architecture can be used. Below are the steps to achieve this.
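For reference, the same pg_checksums utility can also verify checksums on a stopped cluster without modifying anything, using the --check option (this is its default action; the binary and data directory paths below follow the Ubuntu layout used in this post and are illustrative):

```shell
# Verify (not modify) checksums on an offline cluster.
# --check is the default action; shown explicitly for clarity.
/usr/lib/postgresql/14/bin/pg_checksums --check --pgdata /var/lib/postgresql/14/main
```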
1. Create a setup consisting of a primary and a standby server using streaming replication to ensure data redundancy and availability. To set up such a system, refer to How to Set Up Streaming Replication in PostgreSQL 12 for detailed instructions. Once you have built this setup, you can use the following statement to verify that replication is working correctly.
```
postgres=# select * from pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 234189
usesysid         | 16402
usename          | replicator
application_name | walreceiver
client_addr      | 172.31.30.60
client_hostname  |
client_port      | 42950
backend_start    | 2024-03-12 16:40:07.668783-05
backend_xmin     |
state            | streaming
sent_lsn         | 9445/8E82D2F8
write_lsn        | 9445/8E82D2F8
flush_lsn        | 9445/8E82D2F8
replay_lsn       | 9445/8E82D2F8
write_lag        | 00:00:00.000344
flush_lag        | 00:00:00.000475
replay_lag       | 00:00:00.000674
sync_priority    | 0
sync_state       | async
reply_time       | 2024-03-14 00:45:18.968746-05
```
2. To ensure high availability and minimal downtime, configure automatic failover/switchover for the primary-standby nodes with an HA tool. In this example, repmgr is used as the High Availability (HA) tool. The steps for setting up repmgr are not covered in this blog to keep it simple; you can refer to this link for more information on repmgr.
3. After repmgr is configured correctly, both nodes will be visible in the cluster output. In this example, repmgr is used to perform the switchover.
```
postgres@ip-172-31-26-9:~$ /usr/lib/postgresql/14/bin/repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 1        | host=172.31.26.9 user=postgres dbname=postgres password=postgres connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 1        | host=172.31.30.60 user=postgres dbname=postgres password=postgres connect_timeout=2
```
4. Confirm that checksums are disabled on the standby node.
```
postgres@ip-172-31-30-60:~$ psql
psql (14.10 (Ubuntu 14.10-1.pgdg22.04+1))
Type "help" for help.

postgres=# show data_checksums;
 data_checksums
----------------
 off
(1 row)
```
5. Stop the standby database cluster.
```
ubuntu@ip-172-31-30-60:~$ sudo systemctl stop postgresql@14-main
ubuntu@ip-172-31-30-60:~$
```
6. Enable checksums on the standby database cluster.
```
ubuntu@ip-172-31-30-60:~$ sudo su - postgres
postgres@ip-172-31-30-60:~$
postgres@ip-172-31-30-60:~$ /usr/lib/postgresql/14/bin/pg_checksums --pgdata /var/lib/postgresql/14/main --enable --progress
25/25 MB (100%) computed
Checksum operation completed
Files scanned:   942
Blocks scanned:  3262
pg_checksums: syncing data directory
pg_checksums: updating control file
Checksums enabled in cluster
```
7. Start the standby database cluster.
```
ubuntu@ip-172-31-30-60:~$ sudo systemctl start postgresql@14-main
```
8. Verify the checksum status on the standby node; it should now be enabled.
```
ubuntu@ip-172-31-30-60:~$ sudo su - postgres
postgres@ip-172-31-30-60:~$ psql
psql (14.10 (Ubuntu 14.10-1.pgdg22.04+1))
Type "help" for help.

postgres=# show data_checksums;
 data_checksums
----------------
 on
(1 row)
```
9. Confirm that the standby is fully synchronized with the primary and that there is no lag.
```
postgres=# select application_name, state, pg_current_wal_lsn() as current_wal, replay_lsn, replay_lag, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) as size from pg_stat_replication order by pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) desc;
 application_name |   state   |  current_wal  |  replay_lsn   |   replay_lag    |  size
------------------+-----------+---------------+---------------+-----------------+---------
 walreceiver      | streaming | 10BC/BC4B62E0 | 10BC/BC4B62E0 | 00:00:00.000574 | 0 bytes
(1 row)
```
10. Once you have confirmed there is no lag between primary and standby, perform a dry run of the switchover to confirm everything is in order.
```
postgres@ip-172-31-30-60:~/.ssh$ /usr/lib/postgresql/14/bin/repmgr -f /etc/repmgr.conf standby switchover --dry-run
NOTICE: checking switchover on node "node2" (ID: 2) in --dry-run mode
INFO: SSH connection to host "172.31.26.9" succeeded
INFO: able to execute "repmgr" on remote host "172.31.26.9"
INFO: 1 walsenders required, 10 available
INFO: demotion candidate is able to make replication connection to promotion candidate
INFO: 0 pending archive files
INFO: replication lag on this standby is 0 seconds
NOTICE: attempting to pause repmgrd on 2 nodes
NOTICE: local node "node2" (ID: 2) would be promoted to primary; current primary "node1" (ID: 1) would be demoted to standby
INFO: following shutdown command would be run on node "node1": "pg_ctl -D '/var/lib/postgresql/14/main' -W -m fast stop"
INFO: parameter "shutdown_check_timeout" is set to 60 seconds
INFO: prerequisites for executing STANDBY SWITCHOVER are met
```
11. Perform the actual switchover.
```
postgres@ip-172-31-30-60:~$ /usr/lib/postgresql/14/bin/repmgr -f /etc/repmgr.conf standby switchover
NOTICE: executing switchover on node "node2" (ID: 2)
NOTICE: attempting to pause repmgrd on 2 nodes
NOTICE: local node "node2" (ID: 2) will be promoted to primary; current primary "node1" (ID: 1) will be demoted to standby
NOTICE: stopping current primary node "node1" (ID: 1)
NOTICE: issuing CHECKPOINT on node "node1" (ID: 1)
DETAIL: executing server command "pg_ctlcluster 14 main -m f stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/D000028
NOTICE: promoting standby to primary
DETAIL: promoting server "node2" (ID: 2) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node2" (ID: 2) was successfully promoted to primary
NOTICE: node "node2" (ID: 2) promoted to primary, node "node1" (ID: 1) demoted to standby
NOTICE: switchover was successful
DETAIL: node "node2" is now primary and node "node1" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully
```
12. The old standby server now acts as the new primary node.
```
postgres@ip-172-31-26-9:~$ /usr/lib/postgresql/14/bin/repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | 100      | 1        | host=172.31.26.9 user=postgres dbname=postgres password=postgres connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 2        | host=172.31.30.60 user=postgres dbname=postgres password=postgres connect_timeout=2
```
13. Stop the new standby (the ex-primary).
```
postgres@ip-172-31-26-9:~$ pg_ctlcluster 14 main -m f stop
postgres@ip-172-31-26-9:~$
postgres@ip-172-31-26-9:~$ pg_ctlcluster 14 main -m f status
pg_ctl: no server running
```
14. Enable checksums on the new standby (ex-primary).
```
postgres@ip-172-31-26-9:~$ /usr/lib/postgresql/14/bin/pg_checksums --pgdata /var/lib/postgresql/14/main --enable
Checksum operation completed
Files scanned:   943
Blocks scanned:  3262
pg_checksums: syncing data directory
pg_checksums: updating control file
Checksums enabled in cluster
```
15. Start the new standby (ex-primary).
```
postgres@ip-172-31-26-9:~$ pg_ctlcluster 14 main -m f start
postgres@ip-172-31-26-9:~$ pg_ctlcluster 14 main -m f status
pg_ctl: server is running (PID: 27480)
/usr/lib/postgresql/14/bin/postgres "-D" "/var/lib/postgresql/14/main" "-c" "config_file=/etc/postgresql/14/main/postgresql.conf"
```
16. Verify the checksum status on the new standby (ex-primary); it should now be enabled.
```
postgres@ip-172-31-26-9:~$ psql
psql (14.10 (Ubuntu 14.10-1.pgdg22.04+1))
Type "help" for help.

postgres=# show data_checksums;
 data_checksums
----------------
 on
(1 row)
```
17. Check the current cluster status in repmgr.
```
postgres@ip-172-31-26-9:~$ /usr/lib/postgresql/14/bin/repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------------------
 1  | node1 | standby |   running | node2    | default  | 100      | 2        | host=172.31.26.9 user=postgres dbname=postgres password=postgres connect_timeout=2
 2  | node2 | primary | * running |          | default  | 100      | 2        | host=172.31.30.60 user=postgres dbname=postgres password=postgres connect_timeout=2
```
18. Confirm that the standby is fully synchronized with the new primary and that there is no lag.
```
postgres=# select application_name, state, pg_current_wal_lsn() as current_wal, replay_lsn, replay_lag, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) as size from pg_stat_replication order by pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) desc;
-[ RECORD 1 ]----+----------------
application_name | walreceiver
state            | streaming
current_wal      | 14F68/16C14000
replay_lsn       | 14F68/16C14000
replay_lag       | 00:00:00.000624
size             | 0 bytes
```
19. Once you have confirmed there is no lag, perform the switchover again to restore the original primary-standby setup.
```
postgres@ip-172-31-26-9:~$ /usr/lib/postgresql/14/bin/repmgr -f /etc/repmgr.conf standby switchover
NOTICE: executing switchover on node "node1" (ID: 1)
NOTICE: attempting to pause repmgrd on 2 nodes
NOTICE: local node "node1" (ID: 1) will be promoted to primary; current primary "node2" (ID: 2) will be demoted to standby
NOTICE: stopping current primary node "node2" (ID: 2)
NOTICE: issuing CHECKPOINT on node "node2" (ID: 2)
DETAIL: executing server command "pg_ctlcluster 14 main -m f stop"
INFO: checking for primary shutdown; 1 of 60 attempts ("shutdown_check_timeout")
NOTICE: current primary has been cleanly shut down at location 0/F000028
NOTICE: promoting standby to primary
DETAIL: promoting server "node1" (ID: 1) using pg_promote()
NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
NOTICE: STANDBY PROMOTE successful
DETAIL: server "node1" (ID: 1) was successfully promoted to primary
NOTICE: node "node1" (ID: 1) promoted to primary, node "node2" (ID: 2) demoted to standby
NOTICE: switchover was successful
DETAIL: node "node1" is now primary and node "node2" is attached as standby
NOTICE: STANDBY SWITCHOVER has completed successfully
```
20. Validate the cluster output to confirm it reflects the original setup.
```
postgres@ip-172-31-26-9:~$ /usr/lib/postgresql/14/bin/repmgr -f /etc/repmgr.conf cluster show
 ID | Name  | Role    | Status    | Upstream | Location | Priority | Timeline | Connection string
----+-------+---------+-----------+----------+----------+----------+----------+--------------------------------------------------------------------------------------
 1  | node1 | primary | * running |          | default  | 100      | 3        | host=172.31.26.9 user=postgres dbname=postgres password=postgres connect_timeout=2
 2  | node2 | standby |   running | node1    | default  | 100      | 3        | host=172.31.30.60 user=postgres dbname=postgres password=postgres connect_timeout=2
```
Before executing the pg_checksums procedure in a production environment, take the following precautions:
- If a lower environment is available, run pg_checksums there first. This helps estimate the execution time and the replication lag generated during the activity.
- If a lower environment is not available, build a clone of the production environment and test on it first to obtain the above-mentioned estimates.
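On the clone or lower environment, wrapping the offline run with time gives a rough estimate of how long the production activity will take (the service name and paths below follow the Ubuntu layout used in this post and are illustrative):

```shell
# The cluster must be stopped before pg_checksums can run.
sudo systemctl stop postgresql@14-main

# Time the enable run; --progress shows throughput as it scans.
time sudo -u postgres /usr/lib/postgresql/14/bin/pg_checksums \
    --pgdata /var/lib/postgresql/14/main --enable --progress

sudo systemctl start postgresql@14-main
```

The elapsed time scales roughly with the size of the data directory, so the measurement on a clone of comparable size is a reasonable proxy for production.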
In this blog post, we discussed the procedure for enabling data checksums on an existing, running cluster using pg_checksums with minimal downtime. I hope this article provides you with informative and helpful insights.
Percona Distribution for PostgreSQL provides the best and most critical enterprise components from the open-source community in a single distribution, designed and tested to work together. Run PostgreSQL in your production and mission-critical environments and easily deploy and orchestrate reliable PostgreSQL in Kubernetes.
Download Percona Distribution for PostgreSQL Today!