In this blog post, we will discuss how to validate at the operating system level the effects of changing the innodb_flush_method
to variations other than the default (particularly for O_DIRECT
which is most commonly used) and the use of innodb_use_fdatasync
.
Introduction
First, let’s define what the innodb_flush_method
parameter does. It dictates how InnoDB manages the flushing of data to disk. I won’t detail what each valid value does, but you can check the documentation link here. The list of possible values is detailed below (Unix only):
- fsync
- O_DSYNC
- littlesync
- nosync
- O_DIRECT
- O_DIRECT_NO_FSYNC
As said, we will focus on the O_DIRECT
. As part of the best practices, we recommend using O_DIRECT
to avoid double-buffering, bypassing the OS cache, and thus improving performance when writing data. Below is the InnoDB architecture extracted from the official documentation:
On platforms that support fdatasync()
system calls, the innodb_use_fdatasync variable, introduced in MySQL 8.0.26, permits innodb_flush_method options that use fsync()
to use fdatasync()
instead. An fdatasync()
system call does not flush changes to file metadata unless required for subsequent data retrieval, providing a potential performance benefit.
Because I mentioned the term system call (or syscalls), let’s define it since it is an essential point of this blog post.
To manipulate a file, MySQL and any other software must invoke syscalls. Whenever a process requires a system resource, it sends a request for that resource to the kernel by making a system call. At a high level, system calls are “services” offered by the kernel to user applications. They resemble library APIs, described as function calls with a name, parameters, and return value. The diagram below is a high-level illustration of this process:
Question: Why not directly access the resource we want (memory, disk, etc..)?
This is because Linux divides the execution of the process into two spaces. User-run processes (generally referred to as user space processes) rely on services provided by the kernel. The kernel is a particular part of the operating system that handles various low-level operations in a privileged running mode. The concept of User and Kernel space is described in detail here. System security and stability would be compromised if applications could directly read and write to the kernel’s address space. In the given scenario, one process is capable of accessing the memory area of another process. This suggests a potential issue with memory isolation and could lead to security vulnerabilities.
Question: How do I check if my Operating System supports a specific syscall?
You can use the command:
1 $ man syscallsIt will list the syscalls available and in which Linux Kernel appeared.
Test case
We will use the strace utility and the information presented in /proc/<pid>/fdinfo/<fdinfo>
to prove the theory described before. First, I will start a MySQL 8.0.33 instance with default settings.
We can list the files opened by the mysqld
process by checking the /proc/<pid>/fd/
:
1 2 3 4 5 6 | $ ls -l /proc/12006/fd/ total 0 lr-x------. 1 vinicius.grippa percona 64 Jan 15 16:59 0 -> /dev/null l-wx------. 1 vinicius.grippa percona 64 Jan 15 16:59 1 -> /home/vinicius.grippa/sandboxes/msb_8_0_33/data/msandbox.err ... lrwx------. 1 vinicius.grippa percona 64 Jan 15 16:59 9 -> /home/vinicius.grippa/sandboxes/msb_8_0_33/data/#ib_16384_1.dblwr |
We can check each file descriptor by running cat /proc/<pid>/fdinfo/<file descriptor number>
:
1 2 3 4 5 | $ cat /proc/12006/fdinfo/9 pos: 0 flags: 0100002 mnt_id: 69 lock: 1: POSIX ADVISORY WRITE 12006 fd:06:32640985 0 EOF |
We are interested in the flags description, represented by the octal number 0100002
. To interpret the flags, we can use the fdflags repository from GitHub or the command below in the shell:
1 | $ for flag in APPEND ASYNC CLOEXEC CREAT DIRECT DIRECTORY DSYNC EXCL LARGEFILE NOATIME NOCTTY NOFOLLOW NONBLOCK PATH RDWR SYNC TMPFILE TRUNC; do printf '%s: ' O_${flag}; echo O_${flag} | gcc -D_GNU_SOURCE -include fcntl.h -E - | tail -n 1; done |
And using the fdflags project to avoid manual work:
1 2 3 4 5 6 7 8 9 | fdinfo_directory="/proc/46211/fdinfo/" fd_directory="/proc/46211/fd/" for fd in $(ls ${fdinfo_directory}); do echo "Processing file descriptor ${fd}" && # Read the symbolic link to find out the file name file_name=$(readlink "${fd_directory}${fd}") echo "File Name: $file_name" ./fdflags "${fdinfo_directory}${fd}" done |
1 2 3 4 5 6 7 8 9 10 11 | # Output Processing file descriptor 5 File Name: /home/vinicius.grippa/sandboxes/msb_8_0_33/data/test/sbtest79.ibd O_LARGEFILE O_RDWR ... Processing file descriptor 99 File Name: /home/vinicius.grippa/sandboxes/msb_8_0_33/data/test/sbtest89.ibd O_LARGEFILE O_RDWR |
The output shows the file descriptor number, the file name, and the flags applied to it when it opened.
Next, we can confirm MySQL is using fsync()
to write data with strace:
1 2 | # Attaching strace to the mysqld process $ strace -f -c -o ./strace.out -p <pid> |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 | # Once you exit strace (CTRL+C), a summary of syscalls is written in the strace.out file $ cat strace.out % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 45.38 222.980851 1158 192526 50579 futex 27.50 135.114996 134309 1006 io_getevents 5.90 28.997398 1933159 15 12 restart_syscall 5.69 27.970912 537902 52 nanosleep 5.32 26.117827 122 213323 2088 read 3.86 18.968682 172 109744 write 3.00 14.731474 1133190 13 epoll_wait 1.13 5.565184 574 9688 fsync 0.82 4.050227 134 30132 clock_gettime 0.61 2.974088 148 20089 pwrite64 0.47 2.311729 708 3264 fdatasync ... 0.00 0.000270 135 2 rename 0.00 0.000256 32 8 close ------ ----------- ----------- --------- --------- ---------------- 100.00 491.324073 590620 52679 total |
Even without enabling the innodb_use_fdatasync, you will notice fdatasync()
syscall in the strace output. The fdatasync()
syscall is used by default by the binary logs when sync_binlog > 0. We can confirm in strace:
1 2 3 | $ strace -f -s2048 -yy -o ./strace.out -p <pid> ... 47252 fdatasync(70</home/vinicius.grippa/sandboxes/msb_8_0_33/data/binlog.000026> <unfinished ...> |
Suggestion: Try setting sync_binlog=0
and check if the fdatasync()
syscall is still requested by MySQL for the binary logs.
Now, we are going to add the following settings to MySQL and restart the instance:
1 2 3 | [mysqld] innodb_flush_method=O_DIRECT innodb_use_fdatasync = ON |
Checking again, we can see that a new flag, O_DIRECT
, was added to the files:
1 2 3 4 5 6 7 8 9 10 11 | Processing file descriptor 96 File Name: /home/vinicius.grippa/sandboxes/msb_8_0_33/data/test/sbtest112.ibd O_LARGEFILE O_RDWR O_DIRECT … Processing file descriptor 99 File Name: /home/vinicius.grippa/sandboxes/msb_8_0_33/data/test/sbtest94.ibd O_LARGEFILE O_RDWR O_DIRECT |
And checking with strace, we will see our table files(*.ibd) using fdatasync()
:
1 2 3 4 5 6 | $ strace -f -s2048 -yy -o ./strace.out -p 20498 ... 20551 fdatasync(48</home/vinicius.grippa/sandboxes/msb_8_0_33/data/test/sbtest23.ibd> <unfinished ...> 20551 fdatasync(26</home/vinicius.grippa/sandboxes/msb_8_0_33/data/test/sbtest5.ibd> <unfinished ...> 20550 fdatasync(34</home/vinicius.grippa/sandboxes/msb_8_0_33/data/test/sbtest15.ibd> <unfinished ...> 20550 fdatasync(24</home/vinicius.grippa/sandboxes/msb_8_0_33/data/test/sbtest1.ibd> <unfinished ...> |
Conclusion
We investigated the technical nuances of InnoDB’s data-flushing mechanisms and how they interact with the operating system. We can understand the details of optimizing MySQL performance when adjusting the innodb_flush_method
parameter and the innodb_use_fdatasync
.
Our experiments with the strace utility and examining the file descriptors in /proc/<pid>/fdinfo/
have provided concrete evidence of the behavior changes when these settings are tweaked. The use of O_DIRECT
can lead to more efficient data writing operations. Additionally, the introduction of innodb_use_fdatasync
in MySQL 8.0.26 and its preference over fsync()
in specific scenarios illustrate the ongoing evolution of MySQL to exploit specific system call advantages for performance gains.
Percona Distribution for MySQL is the most complete, stable, scalable, and secure open source MySQL solution available, delivering enterprise-grade database environments for your most critical business applications… and it’s free to use!
Try Percona Distribution for MySQL today!