Transaction ID Wraparound occurs when the VACUUM process cannot keep up with database activity and the PostgreSQL service is forced to shut down.
In more technical terms: Transaction ID Wraparound occurs when the number of unique transaction IDs approaches its maximum, about two billion, at which point the semantics of Multi-Version Concurrency Control (MVCC) would break down.
The situation arises when the VACUUM process, whether run by the autovacuum workers or invoked manually, cannot keep up with DML operations.
Transaction ID Wraparound can be caused by a combination of one or more of the following circumstances:
- Autovacuum is turned off
- Long-lived transactions
- Database logical dumps (on a REPLICA using streaming replication)
- Many session connections with locks extending across large swaths of the data cluster
- Intense DML operations forcing the cancellation of autovacuum worker processes
Transaction WRAPAROUND can cause a spontaneous shutdown of the Postgres database server in order to protect the integrity of the data.
PostgreSQL tracks every transaction with a unique ID. Normally, before the age of the oldest unfrozen ID reaches a configured threshold (200 million transactions by default), VACUUM freezes old rows so their IDs can be reused. But if the number of unreused transaction IDs approaches the hard maximum, a situation known as TXID Wraparound, Postgres will force a shutdown in order to protect the data.
Here’s how it works:
- About 4 billion transactions (2^32) is the upper limit of the 32-bit integer Postgres uses for transaction IDs.
- About 2 billion transactions (2^31) is the upper limit that PostgreSQL permits before forcing a shutdown.
- When 10 million transactions remain before the upper limit is reached, WARNING messages consisting of a countdown are logged.
- When 1 million transactions remain before the upper limit is reached, PostgreSQL stops accepting commands that consume transaction IDs, effectively going READ-ONLY.
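The arithmetic behind these thresholds can be checked with ordinary shell arithmetic (a sketch; the exact warning and cutoff margins vary slightly across PostgreSQL versions):

```shell
#!/bin/bash
# Transaction IDs are 32-bit integers; roughly half the circular
# space can be "in the past" at any moment.
XID_SPACE=$(( 2**32 ))                 # total ID space: 4294967296
WRAP_LIMIT=$(( 2**31 ))                # ~2.1 billion: the enforced limit
WARN_AT=$(( WRAP_LIMIT - 10000000 ))   # WARNINGs begin ~10 million before the limit
STOP_AT=$(( WRAP_LIMIT - 1000000 ))    # commands refused ~1 million before the limit

echo "total XID space:   $XID_SPACE"
echo "wraparound limit:  $WRAP_LIMIT"
echo "warnings start at: $WARN_AT"
echo "commands stop at:  $STOP_AT"
```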
Warning Signs
If the autovacuum daemon is falling behind across the entire data cluster, review your monitoring solution for trends in these metrics:
- IO wait increases
- CPU load increases
- SQL performance decreases
Steps to confirm the problem include:
- Reviewing the internal Postgres monitoring metrics and confirming tables are being vacuumed.
- Reviewing the Postgres logs and looking for an overabundance of canceled autovacuum worker processes.
- Reviewing the view “pg_stat_activity” and looking for autovacuum queries annotated with “(to prevent wraparound)”. An occasional occurrence is normal, but one should not see autovacuum running solely for the purpose of mitigating WRAPAROUND.
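A query along these lines surfaces such workers (a sketch; the match string reflects the “(to prevent wraparound)” annotation that recent PostgreSQL versions append to the autovacuum query text):

```sql
-- Find autovacuum workers running specifically to prevent wraparound.
-- Seeing these constantly means autovacuum is not keeping up.
SELECT pid, datname, query, now() - xact_start AS running_for
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%wraparound%'
ORDER BY xact_start;
```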
Here are example error messages that you can find in the Postgres logs when threatened by a shutdown due to WRAPAROUND:
```
#
# When less than 10 million transactions remain before shutdown
#
WARNING:  database "mydb" must be vacuumed within 177009986 transactions
HINT:  To avoid a database shutdown, execute a database-wide VACUUM in "mydb".
```
Here’s a set of queries that will help you determine if WRAPAROUND is a risk:
```sql
--
-- Database query for transaction age per database
-- and as a percentage of the maximum permitted (2^31) transactions
--
SELECT datname,
       age(datfrozenxid),
       (age(datfrozenxid)::numeric/2147483647*100)::numeric(4,2) AS "% WRAPAROUND RISK"
FROM pg_database
ORDER BY 2 DESC;
```
```sql
--
-- Database query for transaction age per table
--
SELECT c.oid::regclass AS table_name,
       greatest(age(c.relfrozenxid), age(t.relfrozenxid)) AS "TXID age",
       (greatest(age(c.relfrozenxid), age(t.relfrozenxid))::numeric/2147483647*100)::numeric(4,2)
           AS "% WRAPAROUND RISK"
FROM pg_class c
LEFT JOIN pg_class t ON c.reltoastrelid = t.oid
WHERE c.relkind IN ('r', 'm')
ORDER BY 2 DESC;
```
```sql
--
-- Other vacuum runtime parameters of interest,
-- expressed as TXID ages
--
SELECT name, setting
FROM pg_settings
WHERE name ~ 'vacuum'
AND name ~ '_age$'
ORDER BY 1 ASC;
```
Preventing Transaction ID Wraparound
First and foremost, make certain all tables are regularly vacuumed. A correctly configured autovacuum process takes care of this without it ever becoming an issue. Otherwise, you will need to consider a manual VACUUM strategy.
The following are merely suggestions since each situation is highly subjective.
If you have the time, run the following invocation of vacuumdb. The option ‘-F’ freezes rows, ‘-z’ updates statistics, and the value for ‘-j’ can vary from a couple of jobs up to the number of CPUs on the host. The option ‘-a’ will process each database in alphabetical order.
```shell
vacuumdb -F -a -z -j 10 -v
```
Consider a bash script targeting individual databases if you see one is more urgent than another:
```shell
vacuumdb -z -j 10 -v <mydatabase>
```
Immediate Action: Approaching Wraparound at < 10 Million Transactions
The following is the set of actions to take when TRANSACTION WRAPAROUND is imminent. Remember, you are in a race against time.
You must vacuum the entire data cluster before the number of remaining available transaction IDs drops to 1 million.
Action
- The task is to vacuum the databases as quickly as possible.
- The tool of choice is the CLI “vacuumdb”.
- Use as many threads as reasonable.
- Run VACUUM in verbose mode and log the output.
- Monitor the log output for anomalous messages, e.g., vacuum failures.
- Run “vacuumdb” against individual databases and, if necessary, individual tables.
- Avoid using the option ‘-a’.
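As a sketch of the per-table approach (the database and table names below are placeholders), the vacuumdb invocation can be assembled and printed for review before anything is executed:

```shell
#!/bin/bash
# Build a per-table vacuumdb command line; echo it rather than run it,
# so the exact invocation can be inspected first.
DB="db01"                 # placeholder database name
TABLE="public.big_table"  # placeholder table name

CMD="vacuumdb --verbose --table=${TABLE} ${DB}"
echo "$CMD"
```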
Scripts
Here’s a pair of example scripts that you can use as a starting point when developing your own mitigation protocol.
Method
- Identify the database with the oldest TXID
- Generate a list of tables in order of the oldest TXID age to the youngest
- Feed this list of tables into a script that invokes vacuumdb to VACUUM one table per invocation
The secret sauce is xargs, which enables one to utilize as many CPUs as reasonably possible. The following pair of bash scripts invoke vacuumdb against a series of tables. Of course, there’s more than one way to do this.
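The xargs fan-out pattern itself can be tried safely before wiring in vacuumdb; here echo stands in for the vacuum script (the table names are placeholders):

```shell
#!/bin/bash
# Demonstrate the xargs fan-out used by the scripts below:
# one argument per invocation (-n 1), up to $CPU invocations in parallel (-P).
CPU=4
LIST=$'table_a\ntable_b\ntable_c\ntable_d'

# echo stands in for ./go2_highspeed_vacuum.sh <db> <table>
xargs -n 1 -P "$CPU" echo "vacuum:" <<< "$LIST"
```

Because the invocations run in parallel, the output order varies from run to run.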
Script one generates a list of tables in a selected database and calls script two which executes the VACUUM on each of those tables individually.
SCRIPT ONE (go1_highspeed_vacuum.sh)
```shell
#!/bin/bash
#
# INVOCATION
#   EX: ./go1_highspeed_vacuum.sh
#
########################################################
# EDIT AS REQUIRED
# (the PG* variables must be exported so psql can see them)
export CPU=4
export PAGER=less
export PGUSER=postgres
export PGPASSWORD=mypassword
export PGDATABASE=db01
export PGOPTIONS='-c statement_timeout=0'
########################################################

# Order the tables from oldest to youngest TXID age,
# accounting for any associated TOAST tables
SQL1="
with a as (
    select c.oid::regclass as table_name,
           greatest(age(c.relfrozenxid), age(t.relfrozenxid))
    from pg_class c
    left join pg_class t on c.reltoastrelid = t.oid
    where c.relkind in ('r', 'm')
    order by 2 desc)
select table_name from a
"

LIST="$(echo "$SQL1" | psql -t)"

# '-P' sets the number of invocations to run simultaneously;
# each table name becomes the second argument to script two
xargs -t -n 1 -P $CPU ./go2_highspeed_vacuum.sh $PGDATABASE <<< "$LIST"

echo "$(date): DONE"
```
SCRIPT TWO (go2_highspeed_vacuum.sh)
```shell
#!/bin/bash
########################################################
# EDIT AS REQUIRED
export PAGER=less
export PGUSER=postgres
export PGPASSWORD=mypassword
export PGOPTIONS='-c statement_timeout=0'
########################################################
DB=$1       # database name, passed by script one
TABLE=$2    # table name, one per invocation via xargs

# Vacuum a single table, appending so parallel runs share one log per database
vacuumdb --verbose --table "${TABLE}" "${DB}" >> "${DB}.log" 2>&1
```
Tips
- Be prepared to execute vacuumdb against the databases in REVERSE alphabetical order to avoid clashing with the autovacuum worker processes which vacuums in FORWARD alphabetical order.
- Query the view “pg_stat_activity”.
- Always monitor where the autovacuum processes are working.
- Avoid working on the same table that the autovacuum workers are currently processing.
- Use the autovacuum workers as an indicator of what databases remain to be processed.
- Kill active autovacuum workers when in conflict with a manual vacuum in order to speed things up.
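On PostgreSQL 9.6 and later, the view pg_stat_progress_vacuum shows exactly which tables the vacuum workers are on, which supports the tips above; a sketch:

```sql
-- Which tables are currently being vacuumed, and how far along each one is
SELECT p.pid,
       p.datname,
       p.relid::regclass AS table_name,
       p.phase,
       p.heap_blks_scanned,
       p.heap_blks_total
FROM pg_stat_progress_vacuum p
ORDER BY p.datname, table_name;
```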
Immediate Action: When the PostgreSQL Service Has Shut Down Due to Transaction Wraparound
One recovers from a forced shutdown due to transaction id wraparound by performing a cluster-wide vacuum in single-user mode:
Log in to the host and, as UNIX user “postgres”, execute something similar to the following invocation:
```shell
# it is understood that environment
# variable PGDATA points to the data cluster
#
postgres --single -D $PGDATA postgres <<< 'vacuum analyze'
```
I would suggest scripting the vacuum process because you’ll need to log in to each database to perform the VACUUM.
Generate and edit a list of all the databases:
```shell
postgres --single -D $PGDATA postgres <<< 'select datname from pg_database' |
    grep '"' | cut -d '"' -f 2 > list_db
```
Here is an example using the aforementioned list “list_db”:
```shell
#
# it is understood the database list has
# been edited before invoking this code snippet
#
for u in $(cat list_db)
do
    postgres --single -D $PGDATA $u <<< 'vacuum analyze'
done
```
TXID Wraparound is one of the scariest scenarios that can occur. Thankfully, this is an extremely rare incident and only occurs when systems are either under extremely heavy load or have been neglected.
Don’t get caught!
Remember: the best DBA is the one that’s never noticed. 🙂
Is there any reason to associate the TOAST table age to the main table? A manual vacuum of the main table will also vacuum the TOAST table but not the reverse. If just trying to vacuum the oldest tables, is there any reason to not just vacuum the ones that are the oldest even if it’s a TOAST table? Vacuuming the main table may be wasting resources if only the TOAST table has an old relfrozenxid.
I recently found myself facing the wraparound situation and grabbed a list of oldest tables and started vacuum threads. When I looked at the list I realized they were all TOAST tables. Still trying to figure out why autovacuum was leaving those alone but there didn’t seem to be any problem manually vacuuming the TOAST directly and it progressed faster.
I have not yet gotten to play with PostgreSQL 13+ and I see there are plenty of changes in this area.
Thanks for the article!
Hi,
> Is there any reason to associate the TOAST table age to the main table?
Hi, I hope I can answer in a manner that will satisfy your question.
Creating a column of type text or bytea in a regular table automatically creates the associated TOAST table; at some point the column becomes a reference pointer to values stored in the TOAST table rather than in the main table, i.e., there’s a relationship between the two.
Because vacuum cannot operate on TOAST tables alone, as they are not ordinary relations, one has no choice but to vacuum the table that the TOAST is related to. At least, that’s the current architecture of Postgres.
Alternatively, you can always vacuum the column itself in the table.
Hope this helps.
Thanks for the reply. I’m not sure that I understand the premise. Autovacuum does process TOAST tables independently as seen in pg_stat_activity and pg_stat_progress_vacuum and I can run a manual vacuum on the TOAST table as mentioned earlier. pg_class.relfrozenxid of the main table and its associated entry for the TOAST appear to be maintained separately.
Until recently, I thought of the TOAST as an adjunct to the main table where all management was through the main table. I’m trying to clarify some DO’s and DON’Ts.
Thanks