Any Galera documentation about limitations will state that tables must have primary keys. Without them, DELETEs are unsupported, and other DML statements can have unwanted side effects, such as inconsistent ordering: rows can appear in a different order on different nodes of your cluster.
If you are not actively relying on row order, this can seem acceptable. Deletes appear to work in a quick test, and they are replicated. Especially when the table schema comes from an existing solution such as Magento or Hybris, the cost of adapting the schema seems higher than the risk associated with a missing primary key.
After all, what is the risk, exactly?
The risk of lacking primary keys
As all the documentation states, the most impactful risk concerns DELETEs. On the source node, the DELETE executes without issue.
The issue is on the replication side. This is not due to any Galera internals; it is a row-based replication limitation, and there have been writings on this, like Attack No-PK Replication Lag with MySQL/Percona Server 8 Invisible Columns.
On asynchronous primary-replica setups, a DELETE without a primary key causes severe lag, which already threatens your data redundancy and high availability. Your replicas fall out of sync, so any read-write split mechanism will read obsolete data from them. Your primary stays unaffected.
On a Galera cluster, however, the risk is severely aggravated by its semi-synchronous features: flow control and certification.
On a large table, any massive DELETE can lead to a completely blocked cluster with every transaction stuck with this state:
```
wsrep: initiating replication for write set
```
Story of a production recurrent failure
A simple table was used to record “raw” customer transactions made through external processes. The table would grow every week. Batches would run, and each row was read to be properly integrated into the regular tables. Then, the whole table would be emptied.
At first, this was a minor system with thousands of rows at most, so it ran smoothly for years. One day, thanks to a great marketing campaign, thousands became millions, and that’s how the problems started. Because an ORM was used, emptying the table was actually done with:
```sql
DELETE FROM table;
```
Instead of a proper:
```sql
TRUNCATE table;
```
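When TRUNCATE is not an option (for instance, when an ORM only emits DML, or foreign keys get in the way), one mitigation is to delete in small batches so that each replicated write set stays bounded. A minimal sketch, using a hypothetical table name standing in for the staging table:

```sql
-- Sketch: repeat until 0 rows are affected, instead of one million-row
-- transaction. raw_transactions is a hypothetical table name.
DELETE FROM raw_transactions LIMIT 10000;
```

Each batch still suffers the no-primary-key lookup cost on the appliers, but the write sets stay small enough that flow control can recover between them.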
Some specifics worth noting:
- Only one node served writes.
- The flow control limit was already set at 100000 to relieve replication of some of its semi-synchronous constraints.
- Hardware was generous, with 60+ GBs of RAM and NVMe disks.
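For reference, the flow control limit mentioned above is configured through the Galera provider options; assuming a PXC/Galera node, raising it looks something like:

```sql
-- Raise the flow control limit: the replication queue length at which
-- a node asks the cluster to pause writes.
-- Only the options listed here are changed; others keep their values.
SET GLOBAL wsrep_provider_options = 'gcs.fc_limit=100000';
```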
The DELETE was launched every Saturday night, and the symptoms were:
- Painfully slow performance across the whole cluster during the weekend
- Every transaction blocked on Mondays on “wsrep: initiating replication for write set” for hours. Waiting did not help; the situation did not improve on its own.
- Trying to stop “replica” nodes would hang indefinitely.
- Killing replicas and starting one again would block transactions again.
Why such severe symptoms
Due to https://bugs.mysql.com/bug.php?id=53375, the DELETE was actually performing one full table scan for each row to delete.
With one million rows deleted in one go, to apply the transaction, the “replica” nodes essentially had to iterate over:
```
1,000,000 + 999,999 + 999,998 + 999,997 + 999,996 + ... rows
= 1,000,000 * (1,000,000 + 1) / 2 = 500,000,500,000 rows
```
This was the default behavior on MySQL 5.7, at least. On 8.0, the cost is far lower. More on that later!
Even waiting a week did not cut it: replica nodes never managed to apply the transaction with a big enough table.
Symptoms to look for
Quick reproduction of the issue
Starting from a sysbench oltp_delete workload:

```shell
$ sysbench oltp_delete --mysql-host=127.0.0.1 --mysql-port=3360 \
    --mysql-user=root --mysql-password=test1234# --threads=10 --tables=20 \
    --table-size=1000000 --rate=1000 --distinct_ranges=0 --order_ranges=0 \
    --point_selects=0 --index_updates=0 --skip_trx=on --time=0 \
    --report-interval=1 --mysql-ignore-errors=all prepare
$ sysbench oltp_delete --mysql-host=127.0.0.1 --mysql-port=3360 \
    --mysql-user=root --mysql-password=test1234# --threads=10 --tables=20 \
    --table-size=1000000 --rate=1000 --distinct_ranges=0 --order_ranges=0 \
    --point_selects=0 --index_updates=0 --skip_trx=on --time=0 \
    --report-interval=1 --mysql-ignore-errors=all run
```
With steady performance:

```
[ 314s ] thds: 10 tps: 1046.09 qps: 1046.09 (r/w/o: 0.00/427.04/619.06) lat (ms,95%): 2.07 err/s: 0.00 reconn/s: 0.00
[ 315s ] thds: 10 tps: 953.99 qps: 953.99 (r/w/o: 0.00/372.99/580.99) lat (ms,95%): 2.07 err/s: 0.00 reconn/s: 0.00
[ 316s ] thds: 10 tps: 1023.95 qps: 1023.95 (r/w/o: 0.00/414.98/608.97) lat (ms,95%): 2.18 err/s: 0.00 reconn/s: 0.00
```
We make a copy of one table; created this way, it will lack a primary key:

```sql
create table sbtest_no_pk as select * from sbtest1;
```
Now, any delete on it can bring our whole cluster down:
```
mysql> delete from sbtest.sbtest_no_pk;
Query OK, 999973 rows affected (16.81 sec)
```
Only 16 seconds, but now sysbench seems stuck:
```
[ 544s ] thds: 10 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
```
It eventually took around two and a half hours, because this was tested on the latest Percona XtraDB Cluster (PXC) 8.0; on PXC 5.7, it could have taken almost four days if none of the workarounds were applied.
Blocked commits
On the writer node receiving the queries, you would see:

```
SHOW PROCESSLIST;
(...)
| 209 | root | 172.18.0.1:56492 | sbtest | Execute | 277 | wsrep: replicating and certifying write set(-1) | DELETE FROM sbtest18 WHERE id=523304 | 276700 | 0 | 1 |
| 210 | root | 172.18.0.1:56494 | sbtest | Execute | 277 | wsrep: replicating and certifying write set(-1) | DELETE FROM sbtest19 WHERE id=508568 | 276699 | 0 | 1 |
| 211 | root | 172.18.0.1:56516 | sbtest | Execute | 277 | wsrep: replicating and certifying write set(-1) | DELETE FROM sbtest7 WHERE id=501242 | 276701 | 0 | 1 |
| 212 | root | 172.18.0.1:56532 | sbtest | Execute | 277 | wsrep: replicating and certifying write set(-1) | DELETE FROM sbtest7 WHERE id=500892 | 276700 | 0 | 1 |
| 213 | root | 172.18.0.1:56550 | sbtest | Execute | 277 | wsrep: replicating and certifying write set(-1) | DELETE FROM sbtest7 WHERE id=499751 | 276696 | 0 | 1 |
| 214 | root | 172.18.0.1:56540 | sbtest | Execute | 277 | wsrep: replicating and certifying write set(-1) | DELETE FROM sbtest17 WHERE id=503867 | 276701 | 0 | 1 |
| 215 | root | 172.18.0.1:56570 | sbtest | Execute | 277 | wsrep: replicating and certifying write set(-1) | DELETE FROM sbtest6 WHERE id=497555 | 276703 | 0 | 1 |
| 216 | root | 172.18.0.1:56562 | sbtest | Execute | 277 | wsrep: replicating and certifying write set(-1) | DELETE FROM sbtest1 WHERE id=393311 | 276700 | 0 | 1 |
| 217 | root | 172.18.0.1:56596 | sbtest | Execute | 277 | wsrep: replicating and certifying write set(-1) | DELETE FROM sbtest2 WHERE id=498986 | 276700 | 0 | 1 |
| 218 | root | 172.18.0.1:56582 | sbtest | Execute | 277 | wsrep: replicating and certifying write set(-1) | DELETE FROM sbtest14 WHERE id=606188 | 276704 | 0 | 1 |
```
It indicates another node in the cluster is struggling with its replication queues.
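To confirm which node is struggling, you can check the replication queue counters on each node; wsrep_local_recv_queue is the queue of incoming write sets waiting to be applied:

```sql
-- On each node: a persistently non-zero recv queue marks the slow applier.
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';
-- The averaged variant smooths out momentary spikes.
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue_avg';
```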
From a node struggling to replicate, the processlist would show:
```
| 2  | system user |  | NULL   | Sleep | 635 | wsrep: committed write set (191726)          | NULL | 634478 | 0 | 0 |
| 10 | system user |  | sbtest | Query | 648 | Applying batch of row changes (delete)       | NULL | 155    | 0 | 0 |
| 19 | system user |  | NULL   | Sleep | 635 | wsrep: committed write set (191731)          | NULL | 634453 | 0 | 0 |
| 20 | system user |  | NULL   | Sleep | 635 | wsrep: committed write set (191727)          | NULL | 634469 | 0 | 0 |
| 21 | system user |  | NULL   | Sleep | 635 | wsrep: preparing to commit write set(191738) | NULL | 631375 | 0 | 0 |
| 22 | system user |  | NULL   | Sleep | 635 | wsrep: preparing to commit write set(191739) | NULL | 631375 | 0 | 0 |
| 23 | system user |  | NULL   | Sleep | 635 | wsrep: preparing to commit write set(191736) | NULL | 631375 | 0 | 0 |
| 24 | system user |  | NULL   | Sleep | 635 | wsrep: preparing to commit write set(191737) | NULL | 631375 | 0 | 0 |
| 25 | system user |  | NULL   | Sleep | 635 | wsrep: committed write set (191728)          | NULL | 634458 | 0 | 0 |
```
We can see the slow wsrep applier thread, “Applying batch of row changes (delete)”, running for 648 seconds in this output, while all the others are waiting.
Flow control
Flow control becomes permanently engaged, which explains why every commit is delayed. We can monitor this by comparing two values of wsrep_flow_control_paused_ns taken one second apart:
```
mysql> select variable_value into @value1 from performance_schema.global_status where variable_name='wsrep_flow_control_paused_ns';
Query OK, 1 row affected (0.01 sec)

mysql> select sleep(1);
+----------+
| sleep(1) |
+----------+
|        0 |
+----------+
1 row in set (1.00 sec)

mysql> select variable_value into @value2 from performance_schema.global_status where variable_name='wsrep_flow_control_paused_ns';
Query OK, 1 row affected (0.00 sec)

mysql> select (@value2 - @value1) / 1000000000;
+----------------------------------+
| (@value2 - @value1) / 1000000000 |
+----------------------------------+
|                      1.001653224 |
+----------------------------------+
1 row in set (0.00 sec)
```
Locks
Using:
```sql
SHOW ENGINE INNODB STATUS\G
```
We can find out some wsrep applier information:
```
---TRANSACTION 330135, ACTIVE 239 sec
mysql tables in use 1, locked 1
2597 lock struct(s), heap size 303224, 179090 row lock(s), undo log entries 179090
MySQL thread id 10, OS thread handle 131161865012992, query id 383465 Applying batch of row changes (delete)
```
The number of row locks gives an approximation of the progress made. After waiting a bit longer, we can compare the growth:
```
---TRANSACTION 330135, ACTIVE 1359 sec
mysql tables in use 1, locked 1
6350 lock struct(s), heap size 729208, 438104 row lock(s), undo log entries 438104
MySQL thread id 10, OS thread handle 131161865012992, query id 383465 Applying batch of row changes (delete)
```
Monitoring it will show how slow it is to make progress:
```shell
$ while /bin/true; do
    docker exec -it pxc-node1 /usr/bin/mysql -uroot -ptest1234# \
      -e "show engine innodb status\G show full processlist;" >> /tmp/node1_engine_status
    sleep 1
  done &
$ tail -f /tmp/node1_engine_status | grep -oP "[0-9]{2,10} row lock"
509336 row lock
509504 row lock
509672 row lock
509840 row lock
510008 row lock
510176 row lock
510344 row lock
510512 row lock
510680 row lock
```
Around 100-200 rows are deleted per second.
Solutions and workarounds
Find the responsible table
This is the hardest part; there are no straightforward methods for it.
It can be done by:
- Analyzing binlogs content
- Listing tables without primary keys
```sql
SELECT t.table_schema, t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints tco
       ON t.table_schema = tco.table_schema
      AND t.table_name = tco.table_name
      AND (tco.constraint_type = 'PRIMARY KEY' OR tco.constraint_type = 'UNIQUE')
WHERE tco.constraint_type IS NULL
  AND t.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema', 'sys')
  AND t.table_type = 'BASE TABLE';
```
- Or, a personal method (not scientific at all, but it has brought me good results): listing the most recently checkpointed ibd files while the cluster is blocked.
```
bash-4.4$ find /var/lib/mysql -name '*.ibd' -type f -printf '%T@ %p\n' | sort -nr | head -1
1710873754.4388385620 /var/lib/mysql/sbtest/sbtest_no_pk.ibd
```
slave_rows_search_algorithms
This variable has an immediate effect on how bad this behavior will be.
On 8.0, its default is already the best value: “INDEX_SCAN,HASH_SCAN”.
This means the applier avoids scanning the whole table for each row to delete.
On 5.7, though, the default was “TABLE_SCAN,INDEX_SCAN”. The performance difference is easy to observe, and the variable can be changed dynamically:
```
598292 row lock
598418 row lock
598586 row lock
=> 100-200 rows deleted per second on the wsrep applier thread

--> set global slave_rows_search_algorithms='TABLE_SCAN,INDEX_SCAN';

598631 row lock
598634 row lock
598637 row lock
598640 row lock
=> it slows down to 3 rows deleted per second
...
598788 row lock
598791 row lock
598794 row lock

--> set global slave_rows_search_algorithms='INDEX_SCAN,HASH_SCAN';

598880 row lock
599006 row lock
599132 row lock
599300 row lock
=> back to 100-200 rows deleted per second
```
pxc_strict_mode
If you are using Percona XtraDB Cluster, pxc_strict_mode will, by default, block any DML on tables missing a primary key.
Unfortunately, you may have to keep it permissive on some third-party software.
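Assuming PXC, the mode can be checked and, if a third-party schema forces your hand, relaxed at runtime:

```sql
-- ENFORCING (the default) rejects DML on tables without an explicit primary key.
SHOW VARIABLES LIKE 'pxc_strict_mode';
-- PERMISSIVE logs a warning instead of rejecting the statement.
SET GLOBAL pxc_strict_mode = 'PERMISSIVE';
```

Keep PERMISSIVE as a last resort: it silences the safety net for every table, not just the offending one.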
Having primary keys everywhere
The best workaround is still to have primary keys on every table, always, even invisible ones. Your DELETEs do not even need to filter on the primary key; it just needs to be defined, and MySQL will use it. sql_require_primary_key, a variable added in MySQL 8.0.13, can also be used to prevent such mistakes in the future.
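As a sketch, assuming MySQL 8.0.23+ for invisible columns and reusing the sbtest_no_pk table from the reproduction above, fixing such a table could look like:

```sql
-- Add an invisible auto-increment primary key: applications doing SELECT *
-- will not see the new column, but replication can now locate rows by key.
ALTER TABLE sbtest_no_pk
  ADD COLUMN _id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT INVISIBLE PRIMARY KEY FIRST;

-- Refuse future CREATE/ALTER statements that would leave a table without
-- a primary key (variable available since MySQL 8.0.13).
SET PERSIST sql_require_primary_key = ON;
```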