How easy is it to identify and debug a Percona XtraDB Cluster replication problem?
If you are using PXC, you may have already seen several log files starting with GRA_ in your data directory.
Those files correspond to a replication failure: the slave thread was not able to apply one transaction. For each of those files, a corresponding warning or error message is present in the MySQL error log.
Those errors can also be false positives, like a bad DDL statement (a DROP of a table that doesn't exist, for example), and in that case there is nothing to worry about. However, it's always recommended to understand what is happening.
As the GRA files contain the binlog events (in ROW format) of the failed transaction, this post explains how to proceed to analyze them.
The first step in analyzing a GRA file is to prepend a binlog header to it.
You can download one here: GRA-header
We can verify it easily:
file /tmp/GRA-header
/tmp/GRA-header: MySQL replication log
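If you cannot download the header, you can also extract one from any binary log written by a server of the same version: the header is simply the 4-byte binlog magic followed by the Format_description event, 107 bytes in total on the 5.5 server used here (visible as end_log_pos 107 in the mysqlbinlog output further down). A minimal sketch; the binlog path is an assumption and the byte count must match your own server version:

# Find where the Format_description event ends (its end_log_pos) in an existing binlog:
mysqlbinlog /var/lib/mysql/mysql-bin.000001 | head -20
# Copy exactly that many bytes to use as the GRA header (107 bytes on this 5.5 setup):
head -c 107 /var/lib/mysql/mysql-bin.000001 > /tmp/GRA-header
file /tmp/GRA-header   # should report: MySQL replication log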
Now we need to select one GRA log file:
[root@node2 mysql]# ls GRA_*.log
GRA_3_3.log
[root@node2 mysql]# file GRA_3_3.log
GRA_3_3.log: data
[root@node2 mysql]# ls -l GRA_3_3.log
-rw-rw----. 1 mysql mysql 106 Nov 29 23:28 GRA_3_3.log
We add the header and can then use mysqlbinlog to see its content:
[root@node2 mysql]# cat GRA-header > GRA_3_3-bin.log
[root@node2 mysql]# cat GRA_3_3.log >> GRA_3_3-bin.log
[root@node2 mysql]# file GRA_3_3.log
GRA_3_3.log: data
[root@node2 mysql]# file GRA_3_3-bin.log
GRA_3_3-bin.log: MySQL replication log
[root@node2 mysql]# mysqlbinlog -vvv GRA_3_3-bin.log
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#120715  9:45:56 server id 1  end_log_pos 107   Start: binlog v 4, server v 5.5.25-debug-log created 120715  9:45:56 at startup
# Warning: this binlog is either in use or was not closed properly.
ROLLBACK/*!*/;
BINLOG '
NHUCUA8BAAAAZwAAAGsAAAABAAQANS41LjI1LWRlYnVnLWxvZwAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA0dQJQEzgNAAgAEgAEBAQEEgAAVAAEGggAAAAICAgCAA==
'/*!*/;
# at 107
# at 160
#121129 23:28:54 server id 0  end_log_pos 53    Table_map: `sakila`.`actor` mapped to number 33
#121129 23:28:54 server id 0  end_log_pos 106   Write_rows: table id 33 flags: STMT_END_F
BINLOG '
puG3UBMAAAAANQAAADUAAAAAACEAAAAAAAEABnNha2lsYQAFYWN0b3IABAIPDwcEhwCHAAA=
puG3UBcAAAAANQAAAGoAAAAAACEAAAAAAAEABP/wvwEJR3Vkb3FmdW5lBk5pa25ldqbht1A=
'/*!*/;
### INSERT INTO sakila.actor
### SET
###   @1=447 /* SHORTINT meta=0 nullable=0 is_null=0 */
###   @2='Gudoqfune' /* VARSTRING(135) meta=135 nullable=0 is_null=0 */
###   @3='Niknev' /* VARSTRING(135) meta=135 nullable=0 is_null=0 */
###   @4=1354228134 /* TIMESTAMP meta=0 nullable=0 is_null=0 */
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
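As a side note, if you only want the decoded pseudo-SQL (the ### lines) without the raw BINLOG base64 sections, mysqlbinlog can be asked to decode the row events only:

# Same file as above; the BINLOG '...' blocks are not printed, only the ### pseudo-SQL
mysqlbinlog --base64-output=decode-rows -vv GRA_3_3-bin.log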
So it's clear that the problem occurred when inserting a record into the sakila.actor table.
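The timestamp of the GRA file (Nov 29 23:28 in the ls -l output above) tells us where to look in the MySQL error log. A quick way to jump there, assuming the error log is /var/log/mysqld.log (check the log_error variable for your actual path):

# 'Nov 29 23:28' from ls -l corresponds to '121129 23:28' in the error log timestamp format
grep '^121129 23:28' /var/log/mysqld.log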
And if we check the error log for the corresponding error message (we know at what time to check):
121129 23:28:54 [ERROR] Slave SQL: Could not execute Write_rows event on table sakila.actor; Duplicate entry '447' for key 'PRIMARY', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 106, Error_code: 1062
121129 23:28:54 [Warning] WSREP: RBR event 2 Write_rows apply warning: 121, 3
121129 23:28:55 [ERROR] WSREP: Failed to apply trx: source: 7af1ab8e-3a70-11e2-0800-66155499f3af version: 2 local: 0 state: APPLYING flags: 1 conn_id: 8 trx_id: 2582 seqnos (l: 9, g: 3, s: 2, d: 0, ts: 1354228134888418369)
121129 23:28:55 [ERROR] WSREP: Failed to apply app buffer: ¦á·P^S, seqno: 3, status: WSREP_FATAL
         at galera/src/replicator_smm.cpp:apply_wscoll():49
         at galera/src/replicator_smm.cpp:apply_trx_ws():120
121129 23:28:55 [ERROR] WSREP: Node consistency compromized, aborting...
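The error shows a duplicate primary key value (447). To double-check a failure like this, you can compare the offending row between the node that rejected the writeset and another node; a minimal sketch, assuming actor_id is the primary key of sakila.actor as in the standard sakila schema:

# Run on the failing node and on a healthy node, then compare the output
mysql -e "SELECT * FROM sakila.actor WHERE actor_id = 447"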
In this case it's obvious why it failed, but that's not always the case. Now you know how to find the cause of these replication problems.
Also, those files (GRA_*.log) are not cleaned up automatically and are kept only for troubleshooting purposes, so after you have determined whether or not they really represent a problem, you can delete them manually.
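If a node has accumulated many GRA files, the decoding steps above can be scripted before deciding what to delete; a minimal sketch, assuming the header is in /tmp/GRA-header and the datadir is /var/lib/mysql:

# Decode every GRA file into /tmp so the originals can be reviewed (and then removed)
for f in /var/lib/mysql/GRA_*.log; do
  out=/tmp/$(basename "$f" .log)-bin.log
  cat /tmp/GRA-header "$f" > "$out"
  mysqlbinlog -vvv "$out" > "$out.txt"
done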
This was also discussed on the galera-codership mailing list.
For those using PXC 5.6, this is the header for 5.6.