Any Galera documentation about limitations will state that tables must have primary keys. Without them, DELETEs are unsupported, and other DML statements can have unwanted side effects, such as inconsistent ordering: rows can appear in a different order on different nodes of your cluster.
If you are not actively relying on row order, this can seem acceptable. Deletes appear to work in a quick test, and they are replicated. Especially when the table schema comes from an existing solution such as Magento or Hybris, the cost of adapting the schema seems higher than the risk associated with a missing primary key.
After all, what is the risk, exactly?
The risk of lacking primary keys
As all the documentation states, the most impactful risk concerns DELETEs. On the source node, the DELETE executes without issue.
The issue is on the replication side. This is not due to any Galera internals; it is a row-based replication limitation, and there have been writings on this, like Attack No-PK Replication Lag with MySQL/Percona Server 8 Invisible Columns.
On asynchronous primary-replica setups, a DELETE without a primary key causes severe lag, which already threatens your data redundancy and high availability. Your replicas fall out of sync, so any read-write split mechanism will read obsolete data from them. Your primary stays unaffected.
On a Galera cluster, however, the risk is severely aggravated by its semi-synchronous features: flow control and certification.
On a large table, any massive DELETE can lead to a completely blocked cluster with every transaction stuck with this state:
```
wsrep: initiating replication for write set
```
Story of a production recurrent failure
A simple table was used to record “raw” customer transactions made through external processes. The table would grow every week. Batches would run, and each row was read to be properly integrated into the regular tables. Then, the whole table would be emptied.
At first, this was a minor system with thousands of rows at most, so it ran smoothly for years. One day, thanks to a great marketing campaign, thousands became millions, and that’s how the problems started. Because an ORM was used, emptying the table was actually done with:
```sql
DELETE FROM table;
```
Instead of a proper:
```sql
TRUNCATE table;
```
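When TRUNCATE is not an option (for instance, when an ORM only emits DML, or foreign keys get in the way), one mitigation is to delete in small batches so that each replicated write set stays bounded. A minimal sketch, using a hypothetical table name standing in for the staging table:

```sql
-- Sketch: repeat until 0 rows are affected, instead of one million-row
-- transaction. raw_transactions is a hypothetical table name.
DELETE FROM raw_transactions LIMIT 10000;
```

Each batch still suffers the no-primary-key lookup cost on the appliers, but the write sets stay small enough that flow control can recover between them.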
Some specifics worth noting:
- Only one node served writes.
- The flow control limit was already set at 100000 to relieve replication of some of its semi-synchronous constraints.
- Hardware was generous, with 60+ GBs of RAM and NVMe disks.
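For reference, the flow control limit mentioned above is configured through the Galera provider options; assuming a PXC/Galera node, raising it looks something like:

```sql
-- Raise the flow control limit: the replication queue length at which
-- a node asks the cluster to pause writes.
-- Only the options listed here are changed; others keep their values.
SET GLOBAL wsrep_provider_options = 'gcs.fc_limit=100000';
```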
The DELETE was launched every Saturday night, and the symptoms were:
- Painfully slow performance across the whole cluster during the weekend
- Every transaction blocked on Mondays on “wsrep: initiating replication for write set” for hours. Waiting did not help; the situation did not improve on its own.
- Trying to stop “replica” nodes would hang indefinitely.
- Killing replicas and starting one again would block transactions again.
Why such severe symptoms
Due to https://bugs.mysql.com/bug.php?id=53375, the DELETE was actually performing one full table scan for each row to delete.
With one million rows deleted in one go, to apply the transaction, the “replica” nodes essentially had to iterate over:
```
1,000,000 + 999,999 + 999,998 + 999,997 + 999,996 + ... rows
= 1,000,000 * (1,000,000 + 1) / 2 = 500,000,500,000 rows
```
This was the default behavior on MySQL 5.7, at least. On 8.0, the cost is far lower. More on that later!
Even waiting a week did not cut it: replica nodes never managed to apply the transaction with a big enough table.
Symptoms to look for
Quick reproduction of the issue
Starting from a sysbench oltp_delete workload:

```shell
$ sysbench oltp_delete --mysql-host=127.0.0.1 --mysql-port=3360 \
    --mysql-user=root --mysql-password=test1234# --threads=10 --tables=20 \
    --table-size=1000000 --rate=1000 --distinct_ranges=0 --order_ranges=0 \
    --point_selects=0 --index_updates=0 --skip_trx=on --time=0 \
    --report-interval=1 --mysql-ignore-errors=all prepare
$ sysbench oltp_delete --mysql-host=127.0.0.1 --mysql-port=3360 \
    --mysql-user=root --mysql-password=test1234# --threads=10 --tables=20 \
    --table-size=1000000 --rate=1000 --distinct_ranges=0 --order_ranges=0 \
    --point_selects=0 --index_updates=0 --skip_trx=on --time=0 \
    --report-interval=1 --mysql-ignore-errors=all run
```
With steady performance:

```
[ 314s ] thds: 10 tps: 1046.09 qps: 1046.09 (r/w/o: 0.00/427.04/619.06) lat (ms,95%): 2.07 err/s: 0.00 reconn/s: 0.00
[ 315s ] thds: 10 tps: 953.99 qps: 953.99 (r/w/o: 0.00/372.99/580.99) lat (ms,95%): 2.07 err/s: 0.00 reconn/s: 0.00
[ 316s ] thds: 10 tps: 1023.95 qps: 1023.95 (r/w/o: 0.00/414.98/608.97) lat (ms,95%): 2.18 err/s: 0.00 reconn/s: 0.00
```
We make a copy of one table; created this way, it will lack a primary key:

```sql
create table sbtest_no_pk as select * from sbtest1;
```
Now, any delete on it can bring our whole cluster down:
```
mysql> delete from sbtest.sbtest_no_pk;
Query OK, 999973 rows affected (16.81 sec)
```
Only 16 seconds, but now sysbench seems stuck:
```
[ 544s ] thds: 10 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
```
It eventually took around two and a half hours, because this was tested on the latest Percona XtraDB Cluster (PXC) 8.0; on PXC 5.7, it could have taken almost four days if none of the workarounds were applied.
Blocked commits
On the writer node receiving the queries, you would see:

```
SHOW PROCESSLIST;
(...)
| 209 | root | 172.18.0.1:56492 | sbtest | Execute | 277 | wsrep: replicating and certifying write set(-1) | DELETE FROM sbtest18 WHERE id=523304 | 276700 | 0 | 1 |
| 210 | root | 172.18.0.1:56494 | sbtest | Execute | 277 | wsrep: replicating and certifying write set(-1) | DELETE FROM sbtest19 WHERE id=508568 | 276699 | 0 | 1 |
| 211 | root | 172.18.0.1:56516 | sbtest | Execute | 277 | wsrep: replicating and certifying write set(-1) | DELETE FROM sbtest7 WHERE id=501242 | 276701 | 0 | 1 |
| 212 | root | 172.18.0.1:56532 | sbtest | Execute | 277 | wsrep: replicating and certifying write set(-1) | DELETE FROM sbtest7 WHERE id=500892 | 276700 | 0 | 1 |
| 213 | root | 172.18.0.1:56550 | sbtest | Execute | 277 | wsrep: replicating and certifying write set(-1) | DELETE FROM sbtest7 WHERE id=499751 | 276696 | 0 | 1 |
| 214 | root | 172.18.0.1:56540 | sbtest | Execute | 277 | wsrep: replicating and certifying write set(-1) | DELETE FROM sbtest17 WHERE id=503867 | 276701 | 0 | 1 |
| 215 | root | 172.18.0.1:56570 | sbtest | Execute | 277 | wsrep: replicating and certifying write set(-1) | DELETE FROM sbtest6 WHERE id=497555 | 276703 | 0 | 1 |
| 216 | root | 172.18.0.1:56562 | sbtest | Execute | 277 | wsrep: replicating and certifying write set(-1) | DELETE FROM sbtest1 WHERE id=393311 | 276700 | 0 | 1 |
| 217 | root | 172.18.0.1:56596 | sbtest | Execute | 277 | wsrep: replicating and certifying write set(-1) | DELETE FROM sbtest2 WHERE id=498986 | 276700 | 0 | 1 |
| 218 | root | 172.18.0.1:56582 | sbtest | Execute | 277 | wsrep: replicating and certifying write set(-1) | DELETE FROM sbtest14 WHERE id=606188 | 276704 | 0 | 1 |
```
It indicates another node in the cluster is struggling with its replication queues.
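To confirm which node is struggling, you can check the replication queue counters on each node; wsrep_local_recv_queue is the queue of incoming write sets waiting to be applied:

```sql
-- On each node: a persistently non-zero recv queue marks the slow applier.
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';
-- The averaged variant smooths out momentary spikes.
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue_avg';
```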
From a node struggling to replicate, the processlist would show:
```
| 2  | system user |  | NULL   | Sleep | 635 | wsrep: committed write set (191726)          | NULL | 634478 | 0 | 0 |
| 10 | system user |  | sbtest | Query | 648 | Applying batch of row changes (delete)       | NULL | 155    | 0 | 0 |
| 19 | system user |  | NULL   | Sleep | 635 | wsrep: committed write set (191731)          | NULL | 634453 | 0 | 0 |
| 20 | system user |  | NULL   | Sleep | 635 | wsrep: committed write set (191727)          | NULL | 634469 | 0 | 0 |
| 21 | system user |  | NULL   | Sleep | 635 | wsrep: preparing to commit write set(191738) | NULL | 631375 | 0 | 0 |
| 22 | system user |  | NULL   | Sleep | 635 | wsrep: preparing to commit write set(191739) | NULL | 631375 | 0 | 0 |
| 23 | system user |  | NULL   | Sleep | 635 | wsrep: preparing to commit write set(191736) | NULL | 631375 | 0 | 0 |
| 24 | system user |  | NULL   | Sleep | 635 | wsrep: preparing to commit write set(191737) | NULL | 631375 | 0 | 0 |
| 25 | system user |  | NULL   | Sleep | 635 | wsrep: committed write set (191728)          | NULL | 634458 | 0 | 0 |
```
We can see the slow wsrep applier thread, “Applying batch of row changes (delete)”, running for 648 seconds in this output, while all the others are waiting.
Flow control
Flow control becomes permanently engaged, which explains why every commit is delayed. We can monitor this by comparing two values of wsrep_flow_control_paused_ns taken one second apart:
```
mysql> select variable_value into @value1 from performance_schema.global_status where variable_name='wsrep_flow_control_paused_ns';
Query OK, 1 row affected (0.01 sec)

mysql> select sleep(1);
+----------+
| sleep(1) |
+----------+
|        0 |
+----------+
1 row in set (1.00 sec)

mysql> select variable_value into @value2 from performance_schema.global_status where variable_name='wsrep_flow_control_paused_ns';
Query OK, 1 row affected (0.00 sec)

mysql> select (@value2 - @value1) / 1000000000;
+----------------------------------+
| (@value2 - @value1) / 1000000000 |
+----------------------------------+
|                      1.001653224 |
+----------------------------------+
1 row in set (0.00 sec)
```
Locks
Using:
```sql
SHOW ENGINE INNODB STATUS\G
```
We can find out some wsrep applier information:
```
---TRANSACTION 330135, ACTIVE 239 sec
mysql tables in use 1, locked 1
2597 lock struct(s), heap size 303224, 179090 row lock(s), undo log entries 179090
MySQL thread id 10, OS thread handle 131161865012992, query id 383465 Applying batch of row changes (delete)
```
The number of row locks gives an approximation of the progress made. After waiting a bit longer, we can compare the growth:
```
---TRANSACTION 330135, ACTIVE 1359 sec
mysql tables in use 1, locked 1
6350 lock struct(s), heap size 729208, 438104 row lock(s), undo log entries 438104
MySQL thread id 10, OS thread handle 131161865012992, query id 383465 Applying batch of row changes (delete)
```
Monitoring it will show how slow it is to make progress:
```shell
$ while /bin/true; do
    docker exec -it pxc-node1 /usr/bin/mysql -uroot -ptest1234# \
      -e "show engine innodb status\G show full processlist;" >> /tmp/node1_engine_status
    sleep 1
  done &
$ tail -f /tmp/node1_engine_status | grep -oP "[0-9]{2,10} row lock"
509336 row lock
509504 row lock
509672 row lock
509840 row lock
510008 row lock
510176 row lock
510344 row lock
510512 row lock
510680 row lock
```
Around 100-200 rows are deleted per second.
Solutions and workarounds
Find the responsible table
This is the hardest part; there are no straightforward methods for it.
It can be done by:
- Analyzing binlogs content
- Listing tables without primary keys
```sql
SELECT t.table_schema, t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints tco
       ON t.table_schema = tco.table_schema
      AND t.table_name = tco.table_name
      AND (tco.constraint_type = 'PRIMARY KEY' OR tco.constraint_type = 'UNIQUE')
WHERE tco.constraint_type IS NULL
  AND t.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
  AND t.table_type = 'BASE TABLE';
```
- Or, a personal method (not scientific at all, but it has brought me good results): listing the most recently checkpointed ibd files while the cluster is blocked.
```
bash-4.4$ find /var/lib/mysql -name '*.ibd' -type f -printf '%T@ %p\n' | sort -nr | head -1
1710873754.4388385620 /var/lib/mysql/sbtest/sbtest_no_pk.ibd
```
slave_rows_search_algorithms
This variable has an immediate effect on how bad this behavior will be.
On 8.0, its default is already the best value: “INDEX_SCAN,HASH_SCAN”.
This means the applier avoids scanning the whole table for each row to delete.
On 5.7, though, the default was “TABLE_SCAN,INDEX_SCAN”. The performance difference is easy to observe, and the variable can be changed dynamically:
```
598292 row lock
598418 row lock
598586 row lock
=> 100-200 rows deleted per second on the wsrep applier thread

--> set global slave_rows_search_algorithms='TABLE_SCAN,INDEX_SCAN';

598631 row lock
598634 row lock
598637 row lock
598640 row lock
=> it slows down to 3 rows deleted per second
...
598788 row lock
598791 row lock
598794 row lock

--> set global slave_rows_search_algorithms='INDEX_SCAN,HASH_SCAN';

598880 row lock
599006 row lock
599132 row lock
599300 row lock
=> back to 100-200 rows deleted per second
```
pxc_strict_mode
If you are using Percona XtraDB Cluster, pxc_strict_mode will, by default, block any DML on tables missing a primary key.
Unfortunately, you may have to keep it permissive on some third-party software.
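Assuming PXC, the mode can be checked and, if a third-party schema forces your hand, relaxed at runtime:

```sql
-- ENFORCING (the default) rejects DML on tables without an explicit primary key.
SHOW VARIABLES LIKE 'pxc_strict_mode';
-- PERMISSIVE logs a warning instead of rejecting the statement.
SET GLOBAL pxc_strict_mode = 'PERMISSIVE';
```

Keep PERMISSIVE as a last resort: it silences the safety net for every table, not just the offending one.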
Having primary keys everywhere
The best workaround is still to have primary keys on every table, always, even invisible ones. Your DELETEs do not even need to filter on the primary key; it just needs to be defined, and MySQL will use it. sql_require_primary_key, a variable added in MySQL 8.0.13, can also be used to prevent such mistakes in the future.
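As a sketch, assuming MySQL 8.0.23+ for invisible columns and reusing the sbtest_no_pk table from the reproduction above, fixing such a table could look like:

```sql
-- Add an invisible auto-increment primary key: applications doing SELECT *
-- will not see the new column, but replication can now locate rows by key.
ALTER TABLE sbtest_no_pk
  ADD COLUMN _id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT INVISIBLE PRIMARY KEY FIRST;

-- Refuse future CREATE/ALTER statements that would leave a table without
-- a primary key (variable available since MySQL 8.0.13).
SET PERSIST sql_require_primary_key = ON;
```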