Replay the Execution of MySQL With RR (Record and Replay)

Chasing bugs can be a tedious task, and multi-threaded software doesn’t make it any easier. Threads will be scheduled at different times, instructions will not have deterministic results, and in order for one to reproduce a particular issue, it might require the exact same threads, doing the exact same work, at the exact same time. As you can imagine, this is not straightforward.

Let’s say your database is crashing or even having a transient stall. By the time you get to it, the crash has happened and you are stuck restoring service quickly and doing after-the-fact forensics. Wouldn’t it be nice to replay the work from right before or during the crash and see exactly what was happening?

Record and Replay is a technique where we record the execution of a program allowing it to be replayed over and over producing the same result. Engineers at Mozilla have created RR, and basically, this open source tool allows you to record the execution of the software and replay it under the well-known GDB.

A Backup Problem

To demonstrate how powerful the tool is, we will be walking through how we used it to narrow down the issue from PXB-2180 (Special thanks to Satya Bodapati, who helped with all the InnoDB internals research for this bug).

In summary, we were seeing Percona XtraBackup crashing at the prepare stage (remember, always test your backup!). The crash was happening randomly, sometimes after the second incremental, sometimes after the 10th incremental, with no visible pattern.

The stack trace was also not always the same. It was crashing on different parts of InnoDB, but here we had one commonality from all crashes – it always happened while trying to apply a redo log record to the same block page and space id:

#12 0x00000000015ad05f in recv_parse_or_apply_log_rec_body (type=MLOG_COMP_REC_INSERT, ptr=0x7f2849150556 "\003K4G", '\377' <repeats 13 times>, end_ptr=0x7f2849150573 "", space_id=<optimized out>, page_no=<optimized out>, block=0x7f2847d7da00, mtr=0x7f286857b4f0, parsed_bytes=18446744073709551615) at /home/marcelo.altmann/percona-xtrabackup/storage/innobase/log/log0recv.cc:2002
2002         ptr = page_cur_parse_insert_rec(FALSE, ptr, end_ptr, block, index, mtr);
(gdb) p block->page->id
+p block->page->id
$3 = {
  m_space = 4294967294,
  m_page_no = 5
}

#12 0x00000000015ad05f in recv_parse_or_apply_log_rec_body (type=MLOG_COMP_REC_INSERT, ptr=0x7f2849150556 "\003K4G", '\377' <repeats 13 times>, end_ptr=0x7f2849150573 "", space_id=<optimized out>, page_no=<optimized out>, block=0x7f2847d7da00, mtr=0x7f286857b4f0, parsed_bytes=18446744073709551615) at /home/marcelo.altmann/percona-xtrabackup/storage/innobase/log/log0recv.cc:2002

2002 ptr = page_cur_parse_insert_rec(FALSE, ptr, end_ptr, block, index, mtr);

(gdb) p block->page->id

+p block->page->id

$3 = {

m_space = 4294967294,

m_page_no = 5

}

Our suspicion was that the page layout on this block diverged between MySQL and XtraBackup. When working with these types of bugs, the crash is always the consequence of something that happened earlier, eg.: a crash on the sixth incremental backup could be the consequence of an issue that happened on the fourth incremental.

The main goal at this step is to prove and identify where the page layout has diverted.

With this information, we ran MySQL under RR and reran the backup until we saw the same issue at prepare. We can now replay the MySQL execution and check how it compares. Our idea is to:

Read the LSNs for this same page before/after each backup prepare.
Identify all changes to m_space = 4294967294 & m_page_no = 5 at mysqld.

Before we progress further, let’s explain a few things:

m_space = 4294967294 correspond to the MySQL data dictionary (mysql.ibd) – dict0dict.h:1146
On disk page, LSN is stored at the 16th byte of the page and has a size of 8 bytes – fil0types.h:66
Pages are written sequentially to disk, as an example, for the default 16k page size, from bytes 1 to 16384 will have the data for page 0, from byte 16385 to 32768 data from page 1, and so on.
Frame is raw data of a page – buf0buf.h:1358

Replaying the Execution

To start, let’s read what LSN we have on mysql.ibd for page five before the backup. We will be using od (check man od for more information) and the information explained above:

$ od -j $((16384 * 5 + 16)) -N 8 -t x1 full/mysql.ibd
0240020 00 00 00 00 01 10 21 85
0240030

$ od -j $((16384 * 5 + 16)) -N 8 -t x1 full/mysql.ibd

0240020 00 00 00 00 01 10 21 85

0240030

And check if it matches an LSN stamp from mysqld. For that we will add a conditional breakpoint on the replay execution of MySQL at function buf_flush_note_modification:

$ rr replay .
. . .
(rr) b buf_flush_note_modification if block->page->id->m_space == 4294967294 && block->page->id->m_page_no == 5
+b buf_flush_note_modification if block->page->id->m_space == 4294967294 && block->page->id->m_page_no == 5
Breakpoint 1 at 0x495beb1: file /home/marcelo.altmann/percona-server/storage/innobase/include/buf0flu.ic, line 69.
(rr) c
[Switching to Thread 18839.18868]

Breakpoint 1, buf_flush_note_modification (block=0x7fd2df4ad750, start_lsn=17892965, end_lsn=17893015, observer=0x0) at /home/marcelo.altmann/percona-server/storage/innobase/include/buf0flu.ic:69
69     ut_ad(!srv_read_only_mode ||
++rr-set-suppress-run-hook 1
(rr) p/x block->frame[16]@8
+p/x block->frame[16]@8
$1 =   {[0x0] = 0x0,
  [0x1] = 0x0,
  [0x2] = 0x0,
  [0x3] = 0x0,
  [0x4] = 0x1,
  [0x5] = 0x10,
  [0x6] = 0x21,
  [0x7] = 0x85}
(rr)

$ rr replay .

. . .

(rr) b buf_flush_note_modification if block->page->id->m_space == 4294967294 && block->page->id->m_page_no == 5

+b buf_flush_note_modification if block->page->id->m_space == 4294967294 && block->page->id->m_page_no == 5

Breakpoint 1 at 0x495beb1: file /home/marcelo.altmann/percona-server/storage/innobase/include/buf0flu.ic, line 69.

(rr) c

[Switching to Thread 18839.18868]

Breakpoint 1, buf_flush_note_modification (block=0x7fd2df4ad750, start_lsn=17892965, end_lsn=17893015, observer=0x0) at /home/marcelo.altmann/percona-server/storage/innobase/include/buf0flu.ic:69

69 ut_ad(!srv_read_only_mode ||

++rr-set-suppress-run-hook 1

(rr) p/x block->frame[16]@8

+p/x block->frame[16]@8

$1 = {[0x0] = 0x0,

[0x1] = 0x0,

[0x2] = 0x0,

[0x3] = 0x0,

[0x4] = 0x1,

[0x5] = 0x10,

[0x6] = 0x21,

[0x7] = 0x85}

(rr)

We can see the LSN stamp from before the preparation of full backup and the first stamp from the replay session match. Time to prepare the backup, advance the replay execution, and recheck:

xtrabackup --prepare --apply-log-only --target-dir=full/
. . .
Shutdown completed; log sequence number 17897577
Number of pools: 1
210402 17:46:29 completed OK!


$ od -j $((16384 * 5 + 16)) -N 8 -t x1 full/mysql.ibd
0240020 00 00 00 00 01 11 07 06
0240030


(rr) c
+c
Continuing.
[Switching to Thread 18839.18868]

Breakpoint 1, buf_flush_note_modification (block=0x7fd2df4ad750, start_lsn=19077332, end_lsn=19077382, observer=0x0) at /home/marcelo.altmann/percona-server/storage/innobase/include/buf0flu.ic:69
69     ut_ad(!srv_read_only_mode ||
++rr-set-suppress-run-hook 1
(rr) p/x block->frame[16]@8
+p/x block->frame[16]@8
$16 =   {[0x0] = 0x0,
  [0x1] = 0x0,
  [0x2] = 0x0,
  [0x3] = 0x0,
  [0x4] = 0x1,
  [0x5] = 0x11,
  [0x6] = 0x7,
  [0x7] = 0x6}
(rr)

xtrabackup --prepare --apply-log-only --target-dir=full/

. . .

Shutdown completed; log sequence number 17897577

Number of pools: 1

210402 17:46:29 completed OK!

$ od -j $((16384 * 5 + 16)) -N 8 -t x1 full/mysql.ibd

0240020 00 00 00 00 01 11 07 06

0240030

(rr) c

Continuing.

[Switching to Thread 18839.18868]

Breakpoint 1, buf_flush_note_modification (block=0x7fd2df4ad750, start_lsn=19077332, end_lsn=19077382, observer=0x0) at /home/marcelo.altmann/percona-server/storage/innobase/include/buf0flu.ic:69

69 ut_ad(!srv_read_only_mode ||

++rr-set-suppress-run-hook 1

(rr) p/x block->frame[16]@8

+p/x block->frame[16]@8

$16 = {[0x0] = 0x0,

[0x1] = 0x0,

[0x2] = 0x0,

[0x3] = 0x0,

[0x4] = 0x1,

[0x5] = 0x11,

[0x6] = 0x7,

[0x7] = 0x6}

(rr)

Same LSN stamp on both, server and backup. Time to move on and start to apply the incrementals:

xtrabackup --prepare --apply-log-only --target-dir=full/ --incremental-dir=inc1/
. . .
Shutdown completed; log sequence number 19082430
. . .
210402 18:12:20 completed OK!


$ od -j $((16384 * 5 + 16)) -N 8 -t x1 full/mysql.ibd
0240020 00 00 00 00 01 23 19 06
0240030


(rr) c
+c
Continuing.
Breakpoint 1, buf_flush_note_modification (block=0x7fd2df4ad750, start_lsn=20262758, end_lsn=20262808, observer=0x0) at /home/marcelo.altmann/percona-server/storage/innobase/include/buf0flu.ic:69
69     ut_ad(!srv_read_only_mode ||
++rr-set-suppress-run-hook 1
(rr) p/x block->frame[16]@8
+p/x block->frame[16]@8
$17 =   {[0x0] = 0x0,
  [0x1] = 0x0,
  [0x2] = 0x0,
  [0x3] = 0x0,
  [0x4] = 0x1,
  [0x5] = 0x23,
  [0x6] = 0x19,
  [0x7] = 0x6}
(rr)

xtrabackup --prepare --apply-log-only --target-dir=full/ --incremental-dir=inc1/

. . .

Shutdown completed; log sequence number 19082430

. . .

210402 18:12:20 completed OK!

$ od -j $((16384 * 5 + 16)) -N 8 -t x1 full/mysql.ibd

0240020 00 00 00 00 01 23 19 06

0240030

(rr) c

Continuing.

Breakpoint 1, buf_flush_note_modification (block=0x7fd2df4ad750, start_lsn=20262758, end_lsn=20262808, observer=0x0) at /home/marcelo.altmann/percona-server/storage/innobase/include/buf0flu.ic:69

69 ut_ad(!srv_read_only_mode ||

++rr-set-suppress-run-hook 1

(rr) p/x block->frame[16]@8

+p/x block->frame[16]@8

$17 = {[0x0] = 0x0,

[0x1] = 0x0,

[0x2] = 0x0,

[0x3] = 0x0,

[0x4] = 0x1,

[0x5] = 0x23,

[0x6] = 0x19,

[0x7] = 0x6}

(rr)

Once again, we have a matching LSN stamp on both sides. Moving to the next incremental:

xtrabackup --prepare --apply-log-only --target-dir=full/ --incremental-dir=inc2/
. . .
Shutdown completed; log sequence number 20269669
. . .
210402 18:15:04 completed OK!


$ od -j $((16384 * 5 + 16)) -N 8 -t x1 full/mysql.ibd
0240020 00 00 00 00 01 35 2f 98
0240030


(rr) c
+c
Continuing.

Breakpoint 1, buf_flush_note_modification (block=0x7fd2df4ad750, start_lsn=21449997, end_lsn=21450047, observer=0x0) at /home/marcelo.altmann/percona-server/storage/innobase/include/buf0flu.ic:69
69     ut_ad(!srv_read_only_mode ||
++rr-set-suppress-run-hook 1
(rr) p/x block->frame[16]@8
+p/x block->frame[16]@8
$18 =   {[0x0] = 0x0,
  [0x1] = 0x0,
  [0x2] = 0x0,
  [0x3] = 0x0,
  [0x4] = 0x1,
  [0x5] = 0x35,
  [0x6] = 0x2f,
  [0x7] = 0x98}
(rr)

xtrabackup --prepare --apply-log-only --target-dir=full/ --incremental-dir=inc2/

. . .

Shutdown completed; log sequence number 20269669

. . .

210402 18:15:04 completed OK!

$ od -j $((16384 * 5 + 16)) -N 8 -t x1 full/mysql.ibd

0240020 00 00 00 00 01 35 2f 98

0240030

(rr) c

Continuing.

Breakpoint 1, buf_flush_note_modification (block=0x7fd2df4ad750, start_lsn=21449997, end_lsn=21450047, observer=0x0) at /home/marcelo.altmann/percona-server/storage/innobase/include/buf0flu.ic:69

69 ut_ad(!srv_read_only_mode ||

++rr-set-suppress-run-hook 1

(rr) p/x block->frame[16]@8

+p/x block->frame[16]@8

$18 = {[0x0] = 0x0,

[0x1] = 0x0,

[0x2] = 0x0,

[0x3] = 0x0,

[0x4] = 0x1,

[0x5] = 0x35,

[0x6] = 0x2f,

[0x7] = 0x98}

(rr)

Incremental two applied and matching LSN stamp from mysqld. Let’s keep doing this until we find a mismatch:

xtrabackup --prepare --apply-log-only --target-dir=full/ --incremental-dir=inc3/
. . .
Shutdown completed; log sequence number 21455916
. . .
210402 18:18:25 completed OK!


$ od -j $((16384 * 5 + 16)) -N 8 -t x1 full/mysql.ibd
0240020 00 00 00 00 01 47 4d 3f
0240030


(rr) c
+c
Continuing.

Breakpoint 1, buf_flush_note_modification (block=0x7fd2df4ad750, start_lsn=25529471, end_lsn=25529521, observer=0x0) at /home/marcelo.altmann/percona-server/storage/innobase/include/buf0flu.ic:69
69     ut_ad(!srv_read_only_mode ||
++rr-set-suppress-run-hook 1
(rr) p/x block->frame[16]@8
+p/x block->frame[16]@8
$19 =   {[0x0] = 0x0,
  [0x1] = 0x0,
  [0x2] = 0x0,
  [0x3] = 0x0,
  [0x4] = 0x1,
  [0x5] = 0x47,
  [0x6] = 0x4d,
  [0x7] = 0x3f}
(rr)


xtrabackup --prepare --apply-log-only --target-dir=full/ --incremental-dir=inc4/
. . .
Shutdown completed; log sequence number 23044902
. . .
210402 18:24:00 completed OK!

$ od -j $((16384 * 5 + 16)) -N 8 -t x1 full/mysql.ibd
0240020 00 00 00 00 01 5f a3 26
0240030


(rr) c
+c
Continuing.

Breakpoint 1, buf_flush_note_modification (block=0x7fd2df4ad750, start_lsn=27218464, end_lsn=27218532, observer=0x0) at /home/marcelo.altmann/percona-server/storage/innobase/include/buf0flu.ic:69
69     ut_ad(!srv_read_only_mode ||
++rr-set-suppress-run-hook 1
(rr) p/x block->frame[16]@8
+p/x block->frame[16]@8
$242 =   {[0x0] = 0x0,
  [0x1] = 0x0,
  [0x2] = 0x0,
  [0x3] = 0x0,
  [0x4] = 0x1,
  [0x5] = 0x9f,
  [0x6] = 0x3f,
  [0x7] = 0xc9}
(rr)

xtrabackup --prepare --apply-log-only --target-dir=full/ --incremental-dir=inc3/

. . .

Shutdown completed; log sequence number 21455916

. . .

210402 18:18:25 completed OK!

$ od -j $((16384 * 5 + 16)) -N 8 -t x1 full/mysql.ibd

0240020 00 00 00 00 01 47 4d 3f

0240030

(rr) c

Continuing.

Breakpoint 1, buf_flush_note_modification (block=0x7fd2df4ad750, start_lsn=25529471, end_lsn=25529521, observer=0x0) at /home/marcelo.altmann/percona-server/storage/innobase/include/buf0flu.ic:69

69 ut_ad(!srv_read_only_mode ||

++rr-set-suppress-run-hook 1

(rr) p/x block->frame[16]@8

+p/x block->frame[16]@8

$19 = {[0x0] = 0x0,

[0x1] = 0x0,

[0x2] = 0x0,

[0x3] = 0x0,

[0x4] = 0x1,

[0x5] = 0x47,

[0x6] = 0x4d,

[0x7] = 0x3f}

(rr)

xtrabackup --prepare --apply-log-only --target-dir=full/ --incremental-dir=inc4/

. . .

Shutdown completed; log sequence number 23044902

. . .

210402 18:24:00 completed OK!

$ od -j $((16384 * 5 + 16)) -N 8 -t x1 full/mysql.ibd

0240020 00 00 00 00 01 5f a3 26

0240030

(rr) c

Continuing.

Breakpoint 1, buf_flush_note_modification (block=0x7fd2df4ad750, start_lsn=27218464, end_lsn=27218532, observer=0x0) at /home/marcelo.altmann/percona-server/storage/innobase/include/buf0flu.ic:69

69 ut_ad(!srv_read_only_mode ||

++rr-set-suppress-run-hook 1

(rr) p/x block->frame[16]@8

+p/x block->frame[16]@8

$242 = {[0x0] = 0x0,

[0x1] = 0x0,

[0x2] = 0x0,

[0x3] = 0x0,

[0x4] = 0x1,

[0x5] = 0x9f,

[0x6] = 0x3f,

[0x7] = 0xc9}

(rr)

Ok, here we have something. Backup files jumped from 0x01474d3f to 0x015fa326 when applying incremental four while the server moved from 0x01474d3f to 0x019f3fc9 . Perhaps we missed some other place where we can update the LSN stamp of a page? But now, we are at a point in the future with our replay execution of the MySQL server.

Replaying the Execution Backward

Here is (yet) another very cool feature from RR, it allows you to replay the execution backward. To eliminate the possibility of missing a place that is also updating the LSN of this block, let’s add a hardware watchpoint on the block->frame memory address and reverse the execution:

(rr) p block->frame
+p block->frame
$243 = (unsigned char *) 0x7fd2e0758000 "\327\064X["
(rr) watch *(unsigned char *) 0x7fd2e0758000
+watch *(unsigned char *) 0x7fd2e0758000
Hardware watchpoint 2: *(unsigned char *) 0x7fd2e0758000
(rr) disa 1
+disa 1
(rr) reverse-cont
+reverse-cont
+continue
Continuing.
Hardware watchpoint 2: *(unsigned char *) 0x7fd2e0758000

Old value = 215 '\327'
New value = 80 'P'

0x0000000004c13903 in mach_write_to_4 (b=0x7fd2e0758000 "P\257\"\347", n=3610531931) at /home/marcelo.altmann/percona-server/storage/innobase/include/mach0data.ic:135
135   b[0] = static_cast<byte>(n >> 24);
++rr-set-suppress-run-hook 1
++rr-set-suppress-run-hook 1
(rr) p/x buf_flush_init_for_writing::block->frame[16]@8
+p/x buf_flush_init_for_writing::block->frame[16]@8
$11 =   {[0x0] = 0x0,
  [0x1] = 0x0,
  [0x2] = 0x0,
  [0x3] = 0x0,
  [0x4] = 0x1,
  [0x5] = 0x9f,
  [0x6] = 0x3f,
  [0x7] = 0xc9}
(rr) reverse-cont
+reverse-cont
+continue
Continuing.
Hardware watchpoint 2: *(unsigned char *) 0x7fd2e0758000

Old value = 80 'P'
New value = 43 '+'
0x0000000004c13903 in mach_write_to_4 (b=0x7fd2e0758000 "+k*\304", n=1353655015) at /home/marcelo.altmann/percona-server/storage/innobase/include/mach0data.ic:135
135   b[0] = static_cast<byte>(n >> 24);
++rr-set-suppress-run-hook 1
++rr-set-suppress-run-hook 1
(rr) p/x buf_flush_init_for_writing::block->frame[16]@8
+p/x buf_flush_init_for_writing::block->frame[16]@8
$12 =   {[0x0] = 0x0,
  [0x1] = 0x0,
  [0x2] = 0x0,
  [0x3] = 0x0,
  [0x4] = 0x1,
  [0x5] = 0x47,
  [0x6] = 0x4d,
  [0x7] = 0x3f}
(rr)

(rr) p block->frame

+p block->frame

$243 = (unsigned char *) 0x7fd2e0758000 "\327\064X["

(rr) watch *(unsigned char *) 0x7fd2e0758000

+watch *(unsigned char *) 0x7fd2e0758000

Hardware watchpoint 2: *(unsigned char *) 0x7fd2e0758000

(rr) disa 1

+disa 1

(rr) reverse-cont

+reverse-cont

+continue

Continuing.

Hardware watchpoint 2: *(unsigned char *) 0x7fd2e0758000

Old value = 215 '\327'

New value = 80 'P'

0x0000000004c13903 in mach_write_to_4 (b=0x7fd2e0758000 "P\257\"\347", n=3610531931) at /home/marcelo.altmann/percona-server/storage/innobase/include/mach0data.ic:135

135 b[0] = static_cast<byte>(n >> 24);

++rr-set-suppress-run-hook 1

(rr) p/x buf_flush_init_for_writing::block->frame[16]@8

+p/x buf_flush_init_for_writing::block->frame[16]@8

$11 = {[0x0] = 0x0,

[0x1] = 0x0,

[0x2] = 0x0,

[0x3] = 0x0,

[0x4] = 0x1,

[0x5] = 0x9f,

[0x6] = 0x3f,

[0x7] = 0xc9}

(rr) reverse-cont

+reverse-cont

+continue

Continuing.

Hardware watchpoint 2: *(unsigned char *) 0x7fd2e0758000

Old value = 80 'P'

New value = 43 '+'

0x0000000004c13903 in mach_write_to_4 (b=0x7fd2e0758000 "+k*\304", n=1353655015) at /home/marcelo.altmann/percona-server/storage/innobase/include/mach0data.ic:135

135 b[0] = static_cast<byte>(n >> 24);

++rr-set-suppress-run-hook 1

(rr) p/x buf_flush_init_for_writing::block->frame[16]@8

+p/x buf_flush_init_for_writing::block->frame[16]@8

$12 = {[0x0] = 0x0,

[0x1] = 0x0,

[0x2] = 0x0,

[0x3] = 0x0,

[0x4] = 0x1,

[0x5] = 0x47,

[0x6] = 0x4d,

[0x7] = 0x3f}

(rr)

By replaying the execution backward we can see that indeed the server changed the LSN from 0x01474d3f to 0x019f3fc9. This confirms the issue is at incremental backup four as the LSN 0x015fa326 we see at end of incremental four was never a valid LSN at the server execution.

Root Cause

Now that we have limited the scope from six backups to a single one, things will become easier.

If we look closely at the log messages from the –prepare of the backup we can see that LSN of mysql.ibd matches the LSN stamp at the end of the backup:

xtrabackup --prepare --apply-log-only --target-dir=full/ --incremental-dir=inc4/
. . .
Shutdown completed; log sequence number 23044902
. . .
210402 18:24:00 completed OK!

$ od -j $((16384 * 5 + 16)) -N 8 -t x1 full/mysql.ibd
0240020 00 00 00 00 01 5f a3 26
0240030


$ echo $(( 16#015fa326 ))
23044902

xtrabackup --prepare --apply-log-only --target-dir=full/ --incremental-dir=inc4/

. . .

Shutdown completed; log sequence number 23044902

. . .

210402 18:24:00 completed OK!

$ od -j $((16384 * 5 + 16)) -N 8 -t x1 full/mysql.ibd

0240020 00 00 00 00 01 5f a3 26

0240030

$ echo $(( 16#015fa326 ))

23044902

By checking the stack trace of the issue and examining further the block we have parsed we can see that this is innodb_dynamic_metadata index:

(gdb) f 13
+f 13
#13 0x00000000015af3dd in recv_recover_page_func (just_read_in=just_read_in@entry=true, block=block@entry=0x7f59efd7da00) at /home/marcelo.altmann/percona-xtrabackup/storage/innobase/log/log0recv.cc:2624
2624       recv_parse_or_apply_log_rec_body(recv->type, buf, buf + recv->len,
(gdb) p/x block->frame[66]@8
+p/x block->frame[66]@8
$4 =   {[0x0] = 0x0,
  [0x1] = 0x0,
  [0x2] = 0x0,
  [0x3] = 0x0,
  [0x4] = 0x0,
  [0x5] = 0x0,
  [0x6] = 0x0,
  [0x7] = 0x2}

(gdb) f 13

+f 13

#13 0x00000000015af3dd in recv_recover_page_func (just_read_in=just_read_in@entry=true, block=block@entry=0x7f59efd7da00) at /home/marcelo.altmann/percona-xtrabackup/storage/innobase/log/log0recv.cc:2624

2624 recv_parse_or_apply_log_rec_body(recv->type, buf, buf + recv->len,

(gdb) p/x block->frame[66]@8

+p/x block->frame[66]@8

$4 = {[0x0] = 0x0,

[0x1] = 0x0,

[0x2] = 0x0,

[0x3] = 0x0,

[0x4] = 0x0,

[0x5] = 0x0,

[0x6] = 0x0,

[0x7] = 0x2}

You might be wondering where 66 came from; this is from examining position FIL_PAGE_DATA + PAGE_INDEX_ID. That gave us index ID 2. This is below 1024, which is reserved for Data Dictionary tables. By checking what is the second table on that list, we can see that it’s innodb_dynamic_metadata. With all this information summed up we can look at what the server does at shutdown, and it becomes clear what the issue is:
srv0start.cc:3965

/** Shut down the InnoDB database. */
void srv_shutdown() {
  . . .

  /* Write dynamic metadata to DD buffer table. */
  dict_persist_to_dd_table_buffer();
. . .
}

/** Shut down the InnoDB database. */

void srv_shutdown() {

. . .

/* Write dynamic metadata to DD buffer table. */

dict_persist_to_dd_table_buffer();

. . .

}

As part of the shutdown process, we are persisting dirty metadata back to the DD Buffer table (innodb_dynamic_metadata), which is wrong. Those changes will likely be persisted by the server and redo logged once the server performs a checkpoint. Also, more data can be merged together by the point of when the backup was taken and when the server itself persists this data to DD Tables. This is a result of the implementation of WL#7816 and WL#6204 which required Percona XtraBackup to change how it handles these types of redo records.

Summary

In this blog, we walked through the process of analyzing a real Percona XtraBackup bug. This bug exposes a challenge we face in various types of bugs, where the crash/malfunction is a consequence of something that happened way before, and by the time we have a stack trace/coredump, it is too late to perform a proper analysis. Record and Replay enabled us to consistently replay the execution of the source server, making it possible to narrow down the issue to where the root cause was.

Percona XtraBackup is a free, open source, complete database backup solution for all versions of Percona Server for MySQL and MySQL

MySQL 5.7
End of Life

Compare Percona to Leading Database Solutions

Software
Downloads

Product
Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Replay the Execution of MySQL With RR (Record and Replay)

A Backup Problem

Replaying the Execution

Replaying the Execution Backward

Root Cause

Summary

Related

Related Blog Articles

RECOMMENDED ARTICLES

High Availability: Choosing the Right Option for Your Percona Monitoring and Management

New Valkey Packages by Percona

Can We Set up a Replicate Filter Within the Percona XtraDB Cluster?

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7 End of Life

Compare Percona to Leading Database Solutions

Software Downloads

Product Documentation

Resource Hub

Financial Services

Driving Database Success

Percona Blog

Percona Community Hub

Percona Events Hub

About Percona

Percona in the News

Our Customers

Our Partners

Careers

Contact Us

Replay the Execution of MySQL With RR (Record and Replay)

A Backup Problem

Replaying the Execution

Replaying the Execution Backward

Root Cause

Summary

Related

Share This Post!

Want to get weekly updates listing the latest blog posts?

Related Blog Articles

RECOMMENDED ARTICLES

High Availability: Choosing the Right Option for Your Percona Monitoring and Management

New Valkey Packages by Percona

Can We Set up a Replicate Filter Within the Percona XtraDB Cluster?

MOST POPULAR ARTICLES

Auditing login attempts in MySQL

Deploy Django on Kubernetes With Percona Operator for PostgreSQL

MySQL “Got an error reading communication packet”

MySQL 5.7
End of Life

Software
Downloads

Product
Documentation