IMPORTANT: DON’T TRY THIS IN PRODUCTION. As demonstrated by Marko (see comments), it may corrupt your data.
In a post written a few months ago, I found that using EXT4 transactions with the “data=journal” mount option improves write performance significantly, by 55%, without putting data at risk. Many people commented on the post saying they were not able to reproduce the results, so I decided to investigate further to find out why my results were different.
So, I ran sysbench benchmarks on a few servers and found when the InnoDB double-write buffer limitations occur and when they don’t. I also made sure some of my colleagues were able to reproduce the results. Basically, in order to reproduce the results you need the following conditions:
- Spinning disk (no SSD)
- Enough CPU power
- A dataset that fits in the InnoDB buffer pool
- A continuous high write load with many ops waiting for disk
Using the InnoDB double-write buffer on an SSD somewhat prevents us from seeing the issue, which is good performance-wise. That comes from the fact that the latency of each write operation is much lower. That makes sense: the double-write buffer is an area of 128 pages on disk used by the write threads. When a write thread needs to write a bunch of dirty pages to disk, it first writes them sequentially to free slots in the double-write buffer in a single I/O operation, and then it spends time writing the pages to their actual locations on disk, typically using one I/O operation per page. Once done writing, it releases the double-write buffer slots it was holding and another thread can do its work. The presence of a RAID controller with a write cache certainly helps, at least until the write cache is full. Since I didn’t test with a RAID controller, I suspect the controller’s write cache will delay the appearance of the symptoms, but if the write load is sustained over a long period of time, the issue with the InnoDB double-write buffer will appear.
So, to recapitulate: on a spinning disk, a write thread needs to hold a lock on some of the double-write buffer slots for at least a few milliseconds per page it needs to write, while on an SSD, the slots are released very quickly because of the low latency of the SSD storage. To actually stress the InnoDB double-write buffer on an SSD, one must push many more writes.
That leads us to the second point: the amount of CPU resources available. At first, one of my colleagues tried to reproduce the results on a small EC2 instance and failed. It turned out that by default, the sysbench oltp.lua script performs quite a lot of reads, and those reads saturate the CPU, throttling the writes. By lowering the number of reads in the script, he was then able to reproduce the results.
For my benchmarks, I used the following command:
sysbench --num-threads=16 --mysql-socket=/var/lib/mysql/mysql.sock \
  --mysql-database=sbtest --mysql-user=root \
  --test=/usr/share/doc/sysbench/tests/db/oltp.lua --oltp-table-size=50000000 \
  --oltp-test-mode=complex --mysql-engine=innodb --db-driver=mysql \
  --report-interval=60 --max-requests=0 --max-time=3600 run
Both servers used were bare-metal boxes with 12 physical cores (24 with hyperthreading). With fewer CPU resources, I suggest adding the following parameters:
--oltp-point-selects=1
--oltp-range-size=1
--oltp-index-updates=10
so that the CPU is not wasted on reads and enough writes are generated. Remember, we are not running a generic benchmark; we are just stressing the InnoDB double-write buffer.
In order to make sure something else isn’t involved, I verified the following:
- Server independence: tried on 2 physical servers and one EC2 instance, with CentOS 6 and Ubuntu 14.04
- MySQL distribution: tried MySQL Community and Percona Server
- MySQL version: tried 5.5.37 and 5.6.23 (Percona Server)
- Varied the InnoDB log file size from 32MB to 512MB
- The impacts of the number of InnoDB write threads (1,2,4,8,16,32)
- The use of Linux native asynchronous I/O
- Spinning and SSD storage
So, with all those verifications done, I can maintain that if you are using a server with spinning disks and a high write load, using EXT4 transactions instead of the InnoDB double-write buffer yields an increase in throughput of more than 50%. In an upcoming post, I’ll show how performance stability is affected by the InnoDB double-write buffer under a high write load.
Appendix: the relevant part of my.cnf
innodb_buffer_pool_size = 12G
innodb_write_io_threads = 8 # or else in {1,2,4,8,16,32}
innodb_read_io_threads = 8
innodb_flush_log_at_trx_commit = 0 # must be 0 or 2 to really stress the double write buffer
innodb_log_file_size = 512M # or 32M, 64M
innodb_log_files_in_group = 2
innodb_file_per_table
innodb_flush_method=O_DIRECT # or O_DSYNC
innodb_buffer_pool_restore_at_startup=300 # On 5.5.x, important to warm up the buffer pool
#innodb_buffer_pool_load_at_startup=ON # on 5.6, important to warm up the buffer pool
#innodb_buffer_pool_dump_at_shutdown=ON # on 5.6, important to warm up the buffer pool,
skip-innodb_doublewrite # or commented out
innodb_flush_neighbor_pages=none # or area for spinning
Be aware that setting innodb_doublewrite=OFF will make it possible to permanently corrupt the database by killing the mysqld server.
I once analyzed a crash recovery failure from our internal testing where random configuration parameters are used. The server was forcibly killed, which is normal in our internal testing. But, it failed to recover, because one page was partially overwritten. If I remember correctly, the page size was 16k and the first 4k or 8k had been replaced with a newer version.
I wrote a test program to confirm this. The program issued a single write() to write some megabytes from a memory buffer to a file. I started the test program in the background, issued a random short sleep, and killed the program. Each time, the file size was a multiple of 4 kilobytes. This confirms my understanding of how Linux paging works on x86 with the ext family of file systems.
So, if you disable the doublewrite buffer, be sure to use a file system and block device where write() is truly atomic, or use innodb_page_size=4k. I was surprised to find that a simple kill of the mysqld process is enough to corrupt the data when the doublewrite buffer is disabled.
I am also curious about what Marko is mentioning: what are the data-integrity implications of disabling the doublewrite buffer under trx_commit=0 or 2?
My perhaps-outdated understanding is that disabling the double-write buffer is always dangerous. Does EXT4 change this somehow?
Cheers,
Tim
Disregard my question, I just read: https://www.percona.com/blog/2014/05/23/improve-innodb-performance-write-bound-loads/ :D.
Thanks!
Tim
The innodb_flush_log_at_trx_commit setting affects the redo log flushing during transaction commit. If it is set to 0 or 2, a server kill may lose some of the latest transaction commits, even though the COMMIT statement returned to the client.
The partial page writes in data files are independent of the redo log flushing. No matter which setting you use for the redo log flushing, disabling the doublewrite buffer is dangerous unless the innodb_page_size matches the granularity of the page writes. The doublewrite buffer only covers writes to the data files.
Partial writes are not that much of a problem in the redo log, because the redo log block size is smaller, and crash recovery would ignore the corrupted partially written last redo log blocks, replaying the log up to the last seen LSN.
@Marko: With ext4, the data=journal mount option, and innodb_flush_method=O_DSYNC, the equivalent of the double-write buffer happens in the ext4 transactional journal; fsync behaves like a commit in a database. In that regard, even though it is not a COW filesystem, ext4 behaves like ZFS.
The granularity of the write operation is enforced to 16KB by the InnoDB fsyncs, which happen at multiples of the page size, provided the file is opened with O_DSYNC (or O_SYNC, actually). If you watch the server metrics, you’ll see that the data is still written twice, once to the ext4 journal and once to the ibd file. I invite you to test. I never corrupted my dataset, even when pulling the power plug in the middle of a heavy write benchmark, and I did that often… After a crash, the filesystem will use its transactional journal to fully write an InnoDB page that would otherwise have been partially written.
The original post and follow-up highlight a good point about disks vs. SSDs which may not be immediately apparent (at least it wasn’t to me, having only read about it). I was under the impression that spinning disks would not see nearly the same benefit from turning the double-write buffer off, since the buffer writes are always sequential while the actual writes are not. With SSDs, you would not see the same discrepancy between sequential and random writes, so you would truly be halving the work done by the disk. However, disks should still see a good benefit (which was demonstrated in the original post): InnoDB tries to group and write sequentially as much as possible, and, as was pointed out here, there is significant overhead in a single-threaded double-write buffer write on spinning disks.
Just to clarify, when I said “halving the work by the disk” I meant the overhead, not the actual writes (as the filesystem is doing the job of the double-write buffer).
It should also be noted that this is a code path in ext4 that is rarely used. In my tests I ran into some rather severe stalls after a few hours of load which are very hard to debug (see original post).
@Yves Trudeau: As far as I understand, the fsync(2) only plays a role when the whole operating system crashes or the power to the computer is cut abruptly. On my system, ext4(5) documents data=journal as follows:
All data is committed into the journal prior to being written into the main filesystem.
There is no mention of what happens when a write is interrupted by a signal. Are you trying to say that there is some ext4 file system mode where writes do not take effect until an explicit fsync() is submitted? And where non-fsync()ed writes would be rolled back if the process was killed?
The case that I am talking about is that the process is killed during a write operation. Because I do not remember whether data=journal was explicitly used on the system where the problem occurred during our internal QA, I repeated my test with the following program, on an ext4 file system mounted in data=journal mode:
cat > t.c << EOF
#include
char buf[10 < /some/where/test/file& sleep 0.0001; kill %1
ls -l /some/where/test/file
Every time I run this, the file size will be a multiple of 4096 bytes, not the full 10MiB that was submitted.
OK, InnoDB does not use the write(2) system call but pwrite(2) or aio, but the semantics should be similar.
My first experiment was on /dev/shm (tmpfs). I repeated it also with an ext4 file system like this:
dd if=/dev/zero of=/dev/shm/t bs=1M count=50
/sbin/mke2fs /dev/shm/t
sudo mount /dev/shm/t /mnt -t ext4 -o data=journal
sudo chmod 777 /mnt
./a.out > /mnt/t& sleep 0.0001; kill %1
wc -c /mnt/t
98304 /mnt/t
./a.out > /mnt/t& sleep 0.0001; kill %1
wc -c /mnt/t
118784 /mnt/t
Again, not all 10MiB were written. Instead 24 or 29 blocks of 4KiB were written. The latter count is not a multiple of 16KiB (the default innodb_page_size).
Did you successfully test disabling the InnoDB doublewrite buffer and then repeatedly killing and restarting the server during a write-heavy workload, while using some other size than innodb_page_size=4k? As far as I can tell, this can only be guaranteed to work on a file system where writes are supposed to be atomic, such as FusionIO NVMFS.
Our QA engineer only disables the doublewrite buffer when using the 4K page size on Linux, to avoid unnecessary noise.
Note: InnoDB crash recovery always requires the file pages to be readable. We could be more robust and skip the prior reading of the page if the redo log contains MLOG_INIT_FILE_PAGE or similar.
My test program was eaten by the comment system. Retrying without the less-than sign:
char buf[10485760];
int main (int argc, char**argv)
{
write(1, buf, sizeof buf);
return 0;
}
Invoke this as
./a.out > testfile & sleep 0.0001; kill %1
[hit enter to see that bash reports the job as terminated]
wc -c testfile
@Marko
Perhaps not so directl related, try your code on a ZFS on Linux (0.6.3)
always got
10485760
So it seems ZFS is OK.
@Marko: Yes, exactly, ext4 will perform a rollback, exactly like a database, provided the file is opened with O_SYNC or O_DSYNC. On recent kernels, O_DIRECT is mapped to O_DSYNC when ext4 is mounted with data=journal. Your C code is incomplete: you didn’t show how the file is opened, and you have no sync option.
Try the following code:
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
int main()
{
    int fd;
    char *str;
    str = (char *) malloc(16384);
    memset(str, 0, 16384);
    /* O_CREAT added so the program also runs against a fresh file */
    fd = open("/var/lib/mysql/append16kb.txt",
              O_DSYNC | O_WRONLY | O_TRUNC | O_CREAT, 0644);
    while (1) {
        write(fd, str, 16384);
        fdatasync(fd);
    }
    close(fd);
    return 0;
}
And then, pull the plug on the server or kill -9 the process. I never got only part of a 16KB chunk. If you remove the O_DSYNC, you have about a 75% chance of getting a partial write.
That comment editor is bad…. let’s try again
Yves, thanks for looking at this.
Actually, I ran a similar test on Monday when my colleague pointed out that I am missing O_DSYNC.
More specifically, I used
int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC | O_DSYNC, 0666);
in my test program.
Some commands that I used:
/sbin/mkfs.ext4 /dev/shm/u
sudo mount /dev/shm/u /mnt
sudo chmod 777 /mnt
cd /mnt
/tmp/a.out & sleep 0.00001; kill %1
wc -c testfile
cd
umount /mnt
sudo mount /dev/shm/u /mnt -t ext4 -o data=journal
cd /mnt
/tmp/a.out & sleep 0.00001; kill %1
wc -c testfile
After the remounting, the file was smaller (I got 16384, 36864, 98304, and 45056 bytes on 4 attempts). With the defaults (no data=journal), the size with my only attempt was 352256 bytes. So, the data=journal is definitely slowing it down, but not making the write() atomic. The mount options according to /proc/mounts are as follows:
/dev/loop0 /mnt ext4 rw,relatime,nodelalloc,data=journal 0 0
The file /proc/version identifies the kernel as follows:
Linux version 3.16.0-4-amd64 ([email protected]) (gcc version
4.8.4 (Debian 4.8.4-1) ) #1 SMP Debian 3.16.7-ckt9-3 (2015-04-23)
Also, “man 2 open” does not promise to me that O_DSYNC or the stronger O_SYNC would make writes atomic. It only seems to promise that if the write completes, it will be completely recovered from the journal, should the file system crash. In this case, the write would not complete on my system, because it was interrupted by a signal.
If this killed write happened to an InnoDB data file with the doublewrite buffer disabled, the data file would not be recoverable when mysqld is restarted. If the kill was soon enough followed by a full system crash (or pulling the plug), then you might be safe, because the partial write might not have been written to the file system journal yet.
Can you please retry with a bigger write or a smaller sleep delay? I think that it is easier to interrupt a write with a signal when the request is bigger than the 16384 bytes that you were using. Note that this is not only an academic request: I seem to remember that InnoDB can combine writes of adjacent pages into a single request.
@Marko, here’s a similar test done with the code I posted:
As you see, it is always a multiple of 16KB. Could you post your whole code with the “code” tags?
@Marko: You are right that with a larger write size, I see a distribution now. This is interesting; I’ll investigate.
Yves, thank you. I guess I do not need to post my program, which is very similar to yours.
My colleague Sunny Bains pointed out that true copy-on-write file systems could probably guarantee atomic write operations. I do not see how an update-in-place file system like ext4 could guarantee it.
Hi all,
I think I have an explanation for the results above.
It is true that, when using data=journal, EXT4 will guarantee that application data are double-written to both the journal and the main filesystem. However, _if the application crashes during a write_, the filesystem will receive incomplete data, and it does not know anything about that.
Example: a C program is writing a large enough buffer (e.g., 128KB) using a single write call. Suddenly, it is killed just as the write call is transferring the application buffer’s content into the OS page cache. After some time (or on a sync), the filesystem receives the partial data and double-writes them. The key point here is that, while the data are consistent from the filesystem’s view, they are _not_ consistent from the application’s view.
Using only a small buffer (e.g., 16KB) will rarely trigger the problem, but using a bigger buffer surely will (the example with the 128KB buffer is quite telling).
So: do NOT disable the InnoDB doublewrite buffer, unless you _really_ know not only what you are doing, but also the OS-level implications of doing so.
Hi Yves,
I’ve seen similarly impressive results mapping ZFS Journaling Device to our Flashtec NVRAM Drive (check it out here – pmcs.com/products/storage/flashtec_nvram_drives). Pretty much doubles your throughput in write intensive workloads.
Contact me if you would like to try this with the fastest NVRAM Cache Card in the market!
Cheers,
Fabian
Would it be safe to assume that if the following are true:
- the database was initialized with a 4KB page size
- the physical sector size of the underlying disk is 4KB
then the doublewrite buffer can be disabled without risking data corruption?
Also, is it now safe to assume that disabling the doublewrite buffer on a ZFS system would be OK?
ZFS is entirely different from ext4: the ZIL acts as the doublewrite buffer, so it is safe to disable the InnoDB doublewrite buffer on ZFS.