Comments on: Update on the InnoDB double-write buffer and EXT4 transactions https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/

By: Yves Trudeau https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10968791 Thu, 14 Dec 2017 13:52:15 +0000 ZFS is entirely different from ext4: the ZIL acts as the doublewrite buffer, so it is safe to disable the InnoDB doublewrite buffer on ZFS.
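For illustration, a minimal sketch of the setup being described; the dataset name is a placeholder and the values are assumptions to verify for your own workload:

# ZFS dataset tuning for an InnoDB data directory (hypothetical pool/dataset)
zfs set recordsize=16k tank/mysql   # match innodb_page_size (16 KB by default)
zfs set logbias=latency tank/mysql  # route synchronous writes through the ZIL

# my.cnf: with the ZIL absorbing torn pages, InnoDB's doublewrite can be skipped
[mysqld]
innodb_doublewrite = 0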

By: Duncan https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10968790 Thu, 14 Dec 2017 02:23:25 +0000 Also, is it now safe to assume that disabling the doublewrite buffer on a ZFS system would be OK?

By: Eduardo https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10968665 Tue, 14 Nov 2017 21:17:44 +0000 Would it be safe to assume that if the following are true:
- the database was initialized with a 4 KB page size
- the physical sector size of the underlying disk is 4 KB

then the doublewrite buffer can be disabled without risking data corruption?
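For concreteness, a sketch of the combination being asked about; the values are illustrative, and innodb_page_size can only be chosen when the data directory is initialized:

# my.cnf: 4 KB InnoDB pages on a disk with 4 KB physical sectors (hypothetical)
[mysqld]
innodb_page_size = 4k    # must be set before initializing the data directory
innodb_doublewrite = 0   # only safe if 4 KB writes are atomic end to end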

By: Fabian Trumper https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10892188 Fri, 24 Jul 2015 17:43:52 +0000 Hi Yves,

I’ve seen similarly impressive results mapping the ZFS journaling device (the ZIL) to our Flashtec NVRAM Drive (check it out here: pmcs.com/products/storage/flashtec_nvram_drives). It pretty much doubles your throughput in write-intensive workloads.

Contact me if you would like to try this with the fastest NVRAM cache card on the market!

Cheers,

Fabian

By: Gionatan Danti https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10883230 Thu, 16 Jul 2015 13:47:09 +0000 Hi all,
I think I have an explanation for the results above.

It is true that, when using data=journal, EXT4 guarantees that application data are double-written to both the journal and the main filesystem. However, _if the application crashes during a write_, the filesystem will receive incomplete data, and it has no way of knowing that.

Example: a C program is writing a large enough buffer (e.g. 128 KB) using a single write call. Suddenly, it is killed just as the write call is transferring the application buffer’s content into the OS page cache. After some time (or after a sync), the filesystem receives the partial data and double-writes it. The key point here is that, while the data is consistent from the filesystem’s point of view, it is _not_ consistent from the application’s point of view.

Using only small buffers (e.g. 16 KB) will rarely trigger the problem, but using bigger buffers surely will (the example with a 128 KB buffer is quite telling).

So: do NOT disable the InnoDB doublewrite buffer unless you _really_ know not only what you are doing, but also the OS-level implications of doing so.
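A minimal C sketch (not from the original comment) of the short-write behavior described above: write() may transfer fewer bytes than requested, so a careful application loops until the whole buffer has been handed to the kernel. Note that no retry loop helps when the process is killed outright mid-copy, which is exactly the failure mode described here.

#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write all of buf, retrying on short writes and on EINTR. */
ssize_t write_all(int fd, const char *buf, size_t len)
{
	size_t done = 0;
	while (done < len) {
		ssize_t n = write(fd, buf + done, len - done);
		if (n < 0) {
			if (errno == EINTR)
				continue;   /* interrupted before any byte was copied */
			return -1;          /* real I/O error */
		}
		done += (size_t) n;         /* partial write: resume at the offset */
	}
	return (ssize_t) done;
}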

By: Marko Mäkelä https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10863170 Thu, 25 Jun 2015 13:14:38 +0000 Yves, thank you. I guess I do not need to post my program, which is very similar to yours.

My colleague Sunny Bains pointed out that true copy-on-write file systems could probably guarantee atomic write operations. I do not see how an update-in-place file system like ext4 could guarantee it.

By: Yves Trudeau https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10863168 Thu, 25 Jun 2015 13:10:22 +0000 @Marko: You are right; with larger write sizes, I now see a distribution of file sizes. This is interesting, I’ll investigate.

By: Yves Trudeau https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10863143 Thu, 25 Jun 2015 12:30:07 +0000 @Marko, here’s a similar test done with the code I posted:

As you see, it is always a multiple of 16 KB. Could you post your whole code with the “” tags?

By: Marko Mäkelä https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10862365 Wed, 24 Jun 2015 17:03:55 +0000 Yves, thanks for looking at this.

Actually, I ran a similar test on Monday, when my colleague pointed out that I was missing O_DSYNC.

More specifically, I used
int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC | O_DSYNC, 0666);
in my test program.

Some commands that I used:

/sbin/mkfs.ext4 /dev/shm/u
sudo mount /dev/shm/u /mnt
sudo chmod 777 /mnt
cd /mnt
/tmp/a.out & sleep 0.00001; kill %1
wc -c testfile
cd
umount /mnt
sudo mount /dev/shm/u /mnt -t ext4 -o data=journal
cd /mnt
/tmp/a.out & sleep 0.00001; kill %1
wc -c testfile

After the remounting, the file was smaller (I got 16384, 36864, 98304, and 45056 bytes on 4 attempts). With the defaults (no data=journal), the size with my only attempt was 352256 bytes. So, the data=journal is definitely slowing it down, but not making the write() atomic. The mount options according to /proc/mounts are as follows:

/dev/loop0 /mnt ext4 rw,relatime,nodelalloc,data=journal 0 0

The file /proc/version identifies the kernel as follows:

Linux version 3.16.0-4-amd64 (debian-kernel@lists.debian.org) (gcc version
4.8.4 (Debian 4.8.4-1) ) #1 SMP Debian 3.16.7-ckt9-3 (2015-04-23)

Also, “man 2 open” does not promise to me that O_DSYNC or the stronger O_SYNC would make writes atomic. It only seems to promise that if the write completes, it will be completely recovered from the journal, should the file system crash. In this case, the write would not complete on my system, because it was interrupted by a signal.

If this killed write happened to an InnoDB data file with the doublewrite buffer disabled, the data file would not be recoverable when mysqld is restarted. If the kill was soon enough followed by a full system crash (or pulling the plug), then you might be safe, because the partial write might not have been written to the file system journal yet.

Can you please retry with a bigger write or a smaller sleep delay? I think that it is easier to interrupt a write with a signal when the request is bigger than the 16384 bytes that you were using. Note that this is not only an academic request: I seem to remember that InnoDB can combine writes of adjacent pages into a single request.

By: Yves Trudeau https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10860945 Tue, 23 Jun 2015 13:29:09 +0000 That comment editor is bad… let’s try again.

By: Yves Trudeau https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10860943 Tue, 23 Jun 2015 13:27:27 +0000 @Marko: Yes, exactly, ext4 will perform a rollback, much like a database, provided the file is opened with O_SYNC or O_DSYNC. On recent kernels, O_DIRECT is mapped to O_DSYNC when ext4 is mounted with data=journal. Your C code is incomplete: you didn’t show how the file is opened, and you have no sync option.

Try the following code:

#define _GNU_SOURCE

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	int fd;
	char *str;

	/* a 16 KB buffer of zeroes, matching the default InnoDB page size */
	str = malloc(16384);
	memset(str, 0, 16384);

	/* O_DSYNC makes every write synchronous; O_CREAT (with a mode) is
	   added so the test file does not have to exist beforehand */
	fd = open("/var/lib/mysql/append16kb.txt",
	          O_DSYNC | O_WRONLY | O_CREAT | O_TRUNC, 0644);

	/* keep writing 16 KB chunks until the process is killed */
	while (1) {
		write(fd, str, 16384);
		fdatasync(fd);
	}

	/* never reached: the loop above runs until the process dies */
	close(fd);
	return 0;
}

And then pull the plug on the server, or kill -9 the process. I never got only part of a 16 KB chunk. If you remove the O_DSYNC, you have about a 75% chance of getting a partial write.

By: Ben Li https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10860825 Tue, 23 Jun 2015 10:15:20 +0000 @Marko
Perhaps not so directly related, but I tried your code on ZFS on Linux (0.6.3) and always got
10485760

So it seems ZFS is OK.

By: Marko Mäkelä https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10857804 Sat, 20 Jun 2015 08:16:28 +0000 My test program was eaten by the comment system (it swallowed the less-than signs). Retrying:

#include <unistd.h>

/* one 10 MiB buffer, submitted to stdout in a single write() call */
char buf[10485760];

int main(int argc, char **argv)
{
	write(1, buf, sizeof buf);
	return 0;
}

Invoke this as
./a.out > testfile & sleep 0.0001; kill %1
[hit enter to see that bash reports the job as terminated]
wc -c testfile

By: Marko Mäkelä https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10857802 Sat, 20 Jun 2015 08:10:03 +0000 @Yves: As far as I understand, fsync(2) only plays a role when the whole operating system crashes or the power to the computer is cut abruptly. On my system, ext4(5) documents data=journal as follows:

All data is committed into the journal prior to being written into the main filesystem.

There is no mention of what happens when a write is interrupted by a signal. Are you trying to say that there is some ext4 file system mode where writes do not take effect until an explicit fsync() is submitted? And where non-fsync()ed writes would be rolled back if the process was killed?

The case that I am talking about is that the process is killed during a write operation. Because I do not remember whether data=journal was explicitly used on the system where the problem occurred during our internal QA, I repeated my test with the following program, on an ext4 file system mounted in data=journal mode:

cat > t.c << EOF
#include <unistd.h>

char buf[10 << 20]; /* 10 MiB */

int main(int argc, char **argv)
{
	write(1, buf, sizeof buf);
	return 0;
}
EOF
cc t.c
./a.out > /some/where/test/file & sleep 0.0001; kill %1
ls -l /some/where/test/file

Every time I run this, the file size will be a multiple of 4096 bytes, not the full 10MiB that was submitted.
OK, InnoDB does not use the write(2) system call but pwrite(2) or aio, but the semantics should be similar.

My first experiment was on /dev/shm (tmpfs). I repeated it also with an ext4 file system like this:

dd if=/dev/zero of=/dev/shm/t bs=1M count=50
/sbin/mke2fs /dev/shm/t
sudo mount /dev/shm/t /mnt -t ext4 -o data=journal
sudo chmod 777 /mnt
./a.out > /mnt/t& sleep 0.0001; kill %1
wc -c /mnt/t
98304 /mnt/t
./a.out > /mnt/t& sleep 0.0001; kill %1
wc -c /mnt/t
118784 /mnt/t

Again, not all 10 MiB were written. Instead, 24 or 29 blocks of 4 KiB were written. The latter count is not a multiple of 16 KiB (the default innodb_page_size).

Did you successfully test disabling the InnoDB doublewrite buffer and then repeatedly killing and restarting the server during a write-heavy workload, while using some size other than innodb_page_size=4k? As far as I can tell, this can only be guaranteed to work on a file system where writes are supposed to be atomic, such as FusionIO NVMFS.

Our QA engineer only disables the doublewrite buffer when using the 4K page size on Linux, to avoid unnecessary noise.

Note: InnoDB crash recovery always requires the file pages to be readable. We could be more robust and skip the prior reading of the page if the redo log contains MLOG_INIT_FILE_PAGE or similar.

By: Nils https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10857096 Fri, 19 Jun 2015 13:33:52 +0000 It should also be noted that this is a rarely used code path in ext4. In my tests I ran into some rather severe stalls after a few hours of load, which were very hard to debug (see the original post).

By: michael morse https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10856350 Thu, 18 Jun 2015 22:40:57 +0000 Just to clarify: when I said ‘halving the work done by the disk’ I meant the overhead, not the actual writes (as the filesystem is doing the job of the doublewrite buffer).

By: michael morse https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10856336 Thu, 18 Jun 2015 22:25:53 +0000 The original post and follow-up highlight a good point about disks vs. SSDs which may not be immediately apparent (at least it wasn’t to me, having only read about it). I was under the impression that spinning disks would not see nearly the same benefit from turning the doublewrite buffer off, since the buffer writes are always sequential while the actual page writes are mostly not. With SSDs there is no such discrepancy between sequential and random writes, so you would truly be halving the work done by the disk. However, disks should still see a good benefit (as demonstrated in the original post): InnoDB tries to group and write sequentially as much as possible, and, as pointed out here, the single-threaded doublewrite buffer write carries significant overhead on spinning disks.

By: Yves Trudeau https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10856302 Thu, 18 Jun 2015 21:52:18 +0000 @Marko: With ext4, the data=journal mount option, and innodb_flush_method=O_DSYNC, the equivalent of the doublewrite buffer happens in the ext4 transactional journal: fsync behaves like commit in a database. In that regard, even though it is not a COW filesystem, ext4 behaves like ZFS.

The granularity of the write operation is enforced to 16 KB by the InnoDB fsyncs, which happen at multiples of the page size, provided that the file is opened with O_DSYNC (or O_SYNC, actually). If you watch the server metrics, you’ll see that the data is still written twice: once to the ext4 journal and once to the ibd file. I invite you to test. I never corrupted my dataset, even when pulling the power plug in the middle of a heavy write benchmark, and I did that often… After a crash, the filesystem will use its transactional journal to fully write an InnoDB page that would otherwise have been partially written.
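For reference, a sketch of the configuration being described; the device and paths are placeholders:

# /etc/fstab: journal file data as well as metadata (placeholder device/path)
/dev/sdb1  /var/lib/mysql  ext4  data=journal  0 0

# my.cnf: flush synchronously through the journal and skip InnoDB's doublewrite
[mysqld]
innodb_flush_method = O_DSYNC
innodb_doublewrite = 0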

By: Marko Mäkelä https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10856174 Thu, 18 Jun 2015 19:37:44 +0000 The innodb_flush_log_at_trx_commit setting affects the redo log flushing during transaction commit. If it is set to 0 or 2, a server kill may lose some of the latest transaction commits, even though the COMMIT statement returned to the client.

The partial page writes in data files are independent of the redo log flushing. No matter which setting you use for the redo log flushing, disabling the doublewrite buffer is dangerous unless the innodb_page_size matches the granularity of the page writes. The doublewrite buffer only covers writes to the data files.

Partial writes are not that much of a problem in the redo log, because the redo log block size is smaller, and crash recovery would ignore the corrupted partially written last redo log blocks, replaying the log up to the last seen LSN.
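To make the tradeoff concrete, a sketch with the standard values of this setting:

# my.cnf: redo log durability vs. speed (independent of the doublewrite buffer)
[mysqld]
innodb_flush_log_at_trx_commit = 1  # write and sync the redo log at every commit
# = 2: write at commit, sync about once per second; an OS crash may lose ~1 s
# = 0: write and sync about once per second; even a mysqld crash may lose ~1 s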

By: Tim Vaillancourt https://www.percona.com/blog/update-on-the-innodb-double-write-buffer-and-ext4-transactions/#comment-10855831 Thu, 18 Jun 2015 12:18:11 +0000 Disregard my question, I just read: https://www.percona.com/blog/2014/05/23/improve-innodb-performance-write-bound-loads/ :D.

Thanks!

Tim
