Comments on: Update on the InnoDB double-write buffer and EXT4 transactions https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/

By: Yves Trudeau https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10968791 Thu, 14 Dec 2017 13:52:15 +0000 ZFS is entirely different from ext4: the ZIL acts as the doublewrite buffer, so it is safe to disable the InnoDB doublewrite buffer on ZFS.
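For illustration, a minimal sketch of the setup being described; the dataset name is a placeholder and the values are assumptions to verify for your own workload:

# ZFS dataset tuning for an InnoDB data directory (hypothetical pool/dataset)
zfs set recordsize=16k tank/mysql   # match innodb_page_size (16 KB by default)
zfs set logbias=latency tank/mysql  # route synchronous writes through the ZIL

# my.cnf: with the ZIL absorbing torn pages, InnoDB's doublewrite can be skipped
[mysqld]
innodb_doublewrite = 0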

By: Duncan https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10968790 Thu, 14 Dec 2017 02:23:25 +0000 Also, is it now safe to assume that disabling the doublewrite buffer on a ZFS system would be OK?

By: Eduardo https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10968665 Tue, 14 Nov 2017 21:17:44 +0000 Would it be safe to assume that if the following are true:
- the database was initialized with a 4 KB page size
- the physical sector size of the underlying disk is 4 KB

then the doublewrite buffer can be disabled without risking data corruption?
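For concreteness, a sketch of the combination being asked about; the values are illustrative, and innodb_page_size can only be chosen when the data directory is initialized:

# my.cnf: 4 KB InnoDB pages on a disk with 4 KB physical sectors (hypothetical)
[mysqld]
innodb_page_size = 4k    # must be set before initializing the data directory
innodb_doublewrite = 0   # only safe if 4 KB writes are atomic end to end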

By: Fabian Trumper https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10892188 Fri, 24 Jul 2015 17:43:52 +0000 Hi Yves,

I’ve seen similarly impressive results mapping the ZFS journaling device (the ZIL) to our Flashtec NVRAM Drive (check it out here: pmcs.com/products/storage/flashtec_nvram_drives). It pretty much doubles your throughput in write-intensive workloads.

Contact me if you would like to try this with the fastest NVRAM cache card on the market!

Cheers,

Fabian

By: Gionatan Danti https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10883230 Thu, 16 Jul 2015 13:47:09 +0000 Hi all,
I think I have an explanation for the results above.

It is true that, when using data=journal, EXT4 guarantees that application data are double-written to both the journal and the main filesystem. However, _if the application crashes during a write_, the filesystem will receive incomplete data, and it has no way of knowing that.

Example: a C program is writing a large enough buffer (e.g. 128 KB) using a single write call. Suddenly, it is killed just as the write call is transferring the application buffer’s content into the OS page cache. After some time (or after a sync), the filesystem receives the partial data and double-writes it. The key point here is that, while the data is consistent from the filesystem’s point of view, it is _not_ consistent from the application’s point of view.

Using only small buffers (e.g. 16 KB) will rarely trigger the problem, but using bigger buffers surely will (the example with a 128 KB buffer is quite telling).

So: do NOT disable the InnoDB doublewrite buffer unless you _really_ know not only what you are doing, but also the OS-level implications of doing so.
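A minimal C sketch (not from the original comment) of the short-write behavior described above: write() may transfer fewer bytes than requested, so a careful application loops until the whole buffer has been handed to the kernel. Note that no retry loop helps when the process is killed outright mid-copy, which is exactly the failure mode described here.

#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write all of buf, retrying on short writes and on EINTR. */
ssize_t write_all(int fd, const char *buf, size_t len)
{
	size_t done = 0;
	while (done < len) {
		ssize_t n = write(fd, buf + done, len - done);
		if (n < 0) {
			if (errno == EINTR)
				continue;   /* interrupted before any byte was copied */
			return -1;          /* real I/O error */
		}
		done += (size_t) n;         /* partial write: resume at the offset */
	}
	return (ssize_t) done;
}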

By: Marko Mäkelä https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10863170 Thu, 25 Jun 2015 13:14:38 +0000 Yves, thank you. I guess I do not need to post my program, which is very similar to yours.

My colleague Sunny Bains pointed out that true copy-on-write file systems could probably guarantee atomic write operations. I do not see how an update-in-place file system like ext4 could guarantee it.

By: Yves Trudeau https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10863168 Thu, 25 Jun 2015 13:10:22 +0000 @Marko: You are right; with larger write sizes, I now see a distribution of file sizes. This is interesting, I’ll investigate.

By: Yves Trudeau https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10863143 Thu, 25 Jun 2015 12:30:07 +0000 @Marko, here’s a similar test done with the code I posted:

As you see, it is always a multiple of 16 KB. Could you post your whole code with the “” tags?

By: Marko Mäkelä https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10862365 Wed, 24 Jun 2015 17:03:55 +0000 Yves, thanks for looking at this.

Actually, I ran a similar test on Monday, when my colleague pointed out that I was missing O_DSYNC.

More specifically, I used
int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC | O_DSYNC, 0666);
in my test program.

Some commands that I used:

/sbin/mkfs.ext4 /dev/shm/u
sudo mount /dev/shm/u /mnt
sudo chmod 777 /mnt
cd /mnt
/tmp/a.out & sleep 0.00001; kill %1
wc -c testfile
cd
umount /mnt
sudo mount /dev/shm/u /mnt -t ext4 -o data=journal
cd /mnt
/tmp/a.out & sleep 0.00001; kill %1
wc -c testfile

After the remounting, the file was smaller (I got 16384, 36864, 98304, and 45056 bytes on 4 attempts). With the defaults (no data=journal), the size with my only attempt was 352256 bytes. So, the data=journal is definitely slowing it down, but not making the write() atomic. The mount options according to /proc/mounts are as follows:

/dev/loop0 /mnt ext4 rw,relatime,nodelalloc,data=journal 0 0

The file /proc/version identifies the kernel as follows:

Linux version 3.16.0-4-amd64 (debian-kernel@lists.debian.org) (gcc version
4.8.4 (Debian 4.8.4-1) ) #1 SMP Debian 3.16.7-ckt9-3 (2015-04-23)

Also, “man 2 open” does not promise to me that O_DSYNC or the stronger O_SYNC would make writes atomic. It only seems to promise that if the write completes, it will be completely recovered from the journal, should the file system crash. In this case, the write would not complete on my system, because it was interrupted by a signal.

If this killed write happened to an InnoDB data file with the doublewrite buffer disabled, the data file would not be recoverable when mysqld is restarted. If the kill was soon enough followed by a full system crash (or pulling the plug), then you might be safe, because the partial write might not have been written to the file system journal yet.

Can you please retry with a bigger write or a smaller sleep delay? I think that it is easier to interrupt a write with a signal when the request is bigger than the 16384 bytes that you were using. Note that this is not only an academic request: I seem to remember that InnoDB can combine writes of adjacent pages into a single request.

By: Yves Trudeau https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10860945 Tue, 23 Jun 2015 13:29:09 +0000 That comment editor is bad… let’s try again.

By: Yves Trudeau https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10860943 Tue, 23 Jun 2015 13:27:27 +0000 @Marko: Yes, exactly, ext4 will perform a rollback, much like a database, provided the file is opened with O_SYNC or O_DSYNC. On recent kernels, O_DIRECT is mapped to O_DSYNC when ext4 is mounted with data=journal. Your C code is incomplete: you didn’t show how the file is opened, and you have no sync option.

Try the following code:

#define _GNU_SOURCE

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	int fd;
	char *str;

	/* a 16 KB buffer of zeroes, matching the default InnoDB page size */
	str = malloc(16384);
	memset(str, 0, 16384);

	/* O_DSYNC makes every write synchronous; O_CREAT (with a mode) is
	   added so the test file does not have to exist beforehand */
	fd = open("/var/lib/mysql/append16kb.txt",
	          O_DSYNC | O_WRONLY | O_CREAT | O_TRUNC, 0644);

	/* keep writing 16 KB chunks until the process is killed */
	while (1) {
		write(fd, str, 16384);
		fdatasync(fd);
	}

	/* never reached: the loop above runs until the process dies */
	close(fd);
	return 0;
}

And then pull the plug on the server, or kill -9 the process. I never got only part of a 16 KB chunk. If you remove the O_DSYNC, you have about a 75% chance of getting a partial write.

By: Ben Li https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10860825 Tue, 23 Jun 2015 10:15:20 +0000 @Marko
Perhaps not so directly related, but I tried your code on ZFS on Linux (0.6.3) and always got
10485760

So it seems ZFS is OK.

By: Marko Mäkelä https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10857804 Sat, 20 Jun 2015 08:16:28 +0000 My test program was eaten by the comment system (it swallowed the less-than signs). Retrying:

#include <unistd.h>

/* one 10 MiB buffer, submitted to stdout in a single write() call */
char buf[10485760];

int main(int argc, char **argv)
{
	write(1, buf, sizeof buf);
	return 0;
}

Invoke this as
./a.out > testfile & sleep 0.0001; kill %1
[hit enter to see that bash reports the job as terminated]
wc -c testfile

By: Marko Mäkelä https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10857802 Sat, 20 Jun 2015 08:10:03 +0000 @Yves: As far as I understand, fsync(2) only plays a role when the whole operating system crashes or the power to the computer is cut abruptly. On my system, ext4(5) documents data=journal as follows:

All data is committed into the journal prior to being written into the main filesystem.

There is no mention of what happens when a write is interrupted by a signal. Are you trying to say that there is some ext4 file system mode where writes do not take effect until an explicit fsync() is submitted? And where non-fsync()ed writes would be rolled back if the process was killed?

The case that I am talking about is that the process is killed during a write operation. Because I do not remember whether data=journal was explicitly used on the system where the problem occurred during our internal QA, I repeated my test with the following program, on an ext4 file system mounted in data=journal mode:

cat > t.c << EOF
#include <unistd.h>

char buf[10 << 20]; /* 10 MiB */

int main(int argc, char **argv)
{
	write(1, buf, sizeof buf);
	return 0;
}
EOF
cc t.c
./a.out > /some/where/test/file & sleep 0.0001; kill %1
ls -l /some/where/test/file

Every time I run this, the file size will be a multiple of 4096 bytes, not the full 10MiB that was submitted.
OK, InnoDB does not use the write(2) system call but pwrite(2) or aio, but the semantics should be similar.

My first experiment was on /dev/shm (tmpfs). I repeated it also with an ext4 file system like this:

dd if=/dev/zero of=/dev/shm/t bs=1M count=50
/sbin/mke2fs /dev/shm/t
sudo mount /dev/shm/t /mnt -t ext4 -o data=journal
sudo chmod 777 /mnt
./a.out > /mnt/t& sleep 0.0001; kill %1
wc -c /mnt/t
98304 /mnt/t
./a.out > /mnt/t& sleep 0.0001; kill %1
wc -c /mnt/t
118784 /mnt/t

Again, not all 10 MiB were written. Instead, 24 or 29 blocks of 4 KiB were written. The latter count is not a multiple of 16 KiB (the default innodb_page_size).

Did you successfully test disabling the InnoDB doublewrite buffer and then repeatedly killing and restarting the server during a write-heavy workload, while using some size other than innodb_page_size=4k? As far as I can tell, this can only be guaranteed to work on a file system where writes are supposed to be atomic, such as FusionIO NVMFS.

Our QA engineer only disables the doublewrite buffer when using the 4K page size on Linux, to avoid unnecessary noise.

Note: InnoDB crash recovery always requires the file pages to be readable. We could be more robust and skip the prior reading of the page if the redo log contains MLOG_INIT_FILE_PAGE or similar.

By: Nils https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10857096 Fri, 19 Jun 2015 13:33:52 +0000 It should also be noted that this is a rarely used code path in ext4. In my tests I ran into some rather severe stalls after a few hours of load, which were very hard to debug (see the original post).

By: michael morse https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10856350 Thu, 18 Jun 2015 22:40:57 +0000 Just to clarify: when I said ‘halving the work done by the disk’ I meant the overhead, not the actual writes (as the filesystem is doing the job of the doublewrite buffer).

By: michael morse https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10856336 Thu, 18 Jun 2015 22:25:53 +0000 The original post and follow-up highlight a good point about disks vs. SSDs which may not be immediately apparent (at least it wasn’t to me, having only read about it). I was under the impression that spinning disks would not see nearly the same benefit from turning the doublewrite buffer off, since the buffer writes are always sequential while the actual page writes are mostly not. With SSDs there is no such discrepancy between sequential and random writes, so you would truly be halving the work done by the disk. However, disks should still see a good benefit (as demonstrated in the original post): InnoDB tries to group and write sequentially as much as possible, and, as pointed out here, the single-threaded doublewrite buffer write carries significant overhead on spinning disks.

By: Yves Trudeau https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10856302 Thu, 18 Jun 2015 21:52:18 +0000 @Marko: With ext4, the data=journal mount option, and innodb_flush_method=O_DSYNC, the equivalent of the doublewrite buffer happens in the ext4 transactional journal: fsync behaves like commit in a database. In that regard, even though it is not a COW filesystem, ext4 behaves like ZFS.

The granularity of the write operation is enforced to 16 KB by the InnoDB fsyncs, which happen at multiples of the page size, provided that the file is opened with O_DSYNC (or O_SYNC, actually). If you watch the server metrics, you’ll see that the data is still written twice: once to the ext4 journal and once to the ibd file. I invite you to test. I never corrupted my dataset, even when pulling the power plug in the middle of a heavy write benchmark, and I did that often… After a crash, the filesystem will use its transactional journal to fully write an InnoDB page that would otherwise have been partially written.
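For reference, a sketch of the configuration being described; the device and paths are placeholders:

# /etc/fstab: journal file data as well as metadata (placeholder device/path)
/dev/sdb1  /var/lib/mysql  ext4  data=journal  0 0

# my.cnf: flush synchronously through the journal and skip InnoDB's doublewrite
[mysqld]
innodb_flush_method = O_DSYNC
innodb_doublewrite = 0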

By: Marko Mäkelä https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10856174 Thu, 18 Jun 2015 19:37:44 +0000 The innodb_flush_log_at_trx_commit setting affects the redo log flushing during transaction commit. If it is set to 0 or 2, a server kill may lose some of the latest transaction commits, even though the COMMIT statement returned to the client.

The partial page writes in data files are independent of the redo log flushing. No matter which setting you use for the redo log flushing, disabling the doublewrite buffer is dangerous unless the innodb_page_size matches the granularity of the page writes. The doublewrite buffer only covers writes to the data files.

Partial writes are not that much of a problem in the redo log, because the redo log block size is smaller, and crash recovery would ignore the corrupted partially written last redo log blocks, replaying the log up to the last seen LSN.
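To make the tradeoff concrete, a sketch with the standard values of this setting:

# my.cnf: redo log durability vs. speed (independent of the doublewrite buffer)
[mysqld]
innodb_flush_log_at_trx_commit = 1  # write and sync the redo log at every commit
# = 2: write at commit, sync about once per second; an OS crash may lose ~1 s
# = 0: write and sync about once per second; even a mysqld crash may lose ~1 s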

By: Tim Vaillancourt https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10855831 Thu, 18 Jun 2015 12:18:11 +0000 Disregard my question, I just read: https://www.percona.com/blog/2014/05/23/improve-innodb-performance-write-bound-loads/ :D.

Thanks!

Tim
