Comments on: How to improve InnoDB performance by 55% for write-bound loads https://www.percona.com/blog/improve-innodb-performance-write-bound-loads/
By: Yves Trudeau https://www.percona.com/blog/improve-innodb-performance-write-bound-loads/#comment-10968675 Thu, 16 Nov 2017 13:59:54 +0000 https://www.percona.com/blog/?p=22593#comment-10968675 @Eduardo, the latest Percona Server 5.7 has the parallel doublewrite buffer feature; it solves the contention issue with the doublewrite buffer.

]]>
By: Eduardo https://www.percona.com/blog/improve-innodb-performance-write-bound-loads/#comment-10968664 Tue, 14 Nov 2017 19:03:33 +0000 https://www.percona.com/blog/?p=22593#comment-10968664 @Yves, I am experiencing some poor write performance and came across this post. As per your last post, did you ever do any additional testing with SSDs? Is it worth trying to disable it in that case?

I was thinking about turning off the doublewrite buffer in a MySQL instance that was set up with a 4 KB page size on top of iSCSI storage with a physical block size of 4 KB, but my colleagues are not so convinced this is a good idea due to other elements in the storage stack (Linux LVM, xfs). Any thoughts on that?
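A minimal my.cnf sketch of that idea, assuming the 4 KB atomicity actually survives the whole stack (values are illustrative, not a recommendation):

```ini
[mysqld]
# 4 KB InnoDB pages to match the 4 KB physical block size of the storage;
# note this can only be chosen when the instance is first initialized.
innodb_page_size    = 4k
# Premise: one page = one device block, so a torn page cannot occur and
# the doublewrite buffer becomes redundant. This only holds if every
# layer (LVM, xfs, iSCSI) preserves 4 KB write atomicity.
innodb_doublewrite  = 0
innodb_flush_method = O_DIRECT
```

Whether LVM and xfs preserve that atomicity is exactly the open question, so a setup like this is worth validating with a pull-the-plug crash test before trusting it.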

]]>
By: Yves Trudeau https://www.percona.com/blog/improve-innodb-performance-write-bound-loads/#comment-10723384 Wed, 08 Apr 2015 14:21:20 +0000 https://www.percona.com/blog/?p=22593#comment-10723384 @Alex: I reproduced my results on 2 different servers (Ubuntu 14.04 and CentOS 6 with a 3.18 kernel) very easily, but I struggled to reproduce yours. I finally partially succeeded using SSD storage. Apparently, SSD storage doesn’t stress the doublewrite buffer, which kind of makes sense. I’ll write a follow-up post on the matter.

]]>
By: ALex https://www.percona.com/blog/improve-innodb-performance-write-bound-loads/#comment-10709715 Thu, 02 Apr 2015 19:40:11 +0000 https://www.percona.com/blog/?p=22593#comment-10709715 > I’ll try to reproduce you result on a Centos6 box, could you retry on your side on a more recent kernel, 3.2+ kernel.
Already done. As I mentioned before, I tried kernels 3.10.73-1 and 3.19.3-1 from ELRepo. The best performance I got was on 3.10.73-1 with the following settings:

thread_cache_size = 16
innodb_buffer_pool_size = 12G
innodb_log_file_size = 2048M
innodb_log_buffer_size = 64M
innodb_fast_shutdown = 0
innodb_thread_concurrency = 0
innodb_read_io_threads = 64
innodb_write_io_threads = 64
innodb_flush_log_at_trx_commit=0
innodb_doublewrite=1
innodb_flush_method=O_DIRECT

innodb_additional_mem_pool_size = 64M
innodb_use_sys_malloc = 1

OLTP test statistics:
queries performed:
read: 172605426
write: 61644795
other: 24657918
total: 258908139
transactions: 12328959 (3424.71 per sec.)
deadlocks: 0 (0.00 per sec.)
read/write requests: 234250221 (65069.44 per sec.)
other operations: 24657918 (6849.41 per sec.)

Test execution summary:
total time: 3600.0035s
total number of events: 12328959
total time taken by event execution: 57499.1132
per-request statistics:
min: 2.54ms
avg: 4.66ms
max: 21032.40ms
approx. 95 percentile: 5.58ms

Threads fairness:
events (avg/stddev): 770559.9375/1317.18
execution time (avg/stddev): 3593.6946/0.01

]]>
By: Yves Trudeau https://www.percona.com/blog/improve-innodb-performance-write-bound-loads/#comment-10709539 Thu, 02 Apr 2015 17:48:32 +0000 https://www.percona.com/blog/?p=22593#comment-10709539 @Alex, unless it has been backported, O_DIRECT is not supported by kernels before 3.2, which is why the last test performs well. Newer kernels map to O_DSYNC.

Now, regarding the bad results for the first test, there have been a lot of changes to ext4 in the 3.0+ kernel era; could this just be a bad mutex that got fixed? I’ll try to reproduce your results on a CentOS 6 box; could you retry on your side with a more recent kernel (3.2+)?

Regards,

Yves

]]>
By: Alex https://www.percona.com/blog/improve-innodb-performance-write-bound-loads/#comment-10701670 Tue, 31 Mar 2015 13:33:40 +0000 https://www.percona.com/blog/?p=22593#comment-10701670 https://www.percona.com/blog/2014/05/23/improve-innodb-performance-write-bound-loads/

@Yves Trudeau
I have the same result as Nils.

CentOS 6.6, 30 Gb of RAM, Intel Xeon E5-2680 v2 @ 2.80GHz. Percona-Server-55-5.5.42-rel37.1.el6.x86_64

It’s an EC2 instance on AWS. The MySQL datadir is mounted on a RAID 0 of 2 provisioned SSD volumes with 1800 IOPS total. According to the sysbench results, performance decreases dramatically with data=journal on the stock kernel, 2.6.32-504.12.2.el6.x86_64.

I have used the following my.cnf config

thread_cache_size = 16
innodb_thread_concurrency = 0

innodb_buffer_pool_size = 12G
innodb_log_file_size = 512M
innodb_log_buffer_size = 64M

innodb_fast_shutdown = 0

innodb_read_io_threads = 32
innodb_write_io_threads = 32

innodb_flush_log_at_trx_commit=2
innodb_doublewrite=0/1
innodb_flush_method=O_DIRECT/O_DSYNC

And run the test with the following command
# sysbench --num-threads=16 --test=oltp --oltp-table-size=50000000 --oltp-test-mode=complex --mysql-engine=innodb --mysql-db=sysbench --mysql-user=sysbench --mysql-password=7654321 --db-driver=mysql --max-requests=0 --max-time=3600 run

******* kernel-2.6.32-504.12.2.el6.x86_64 *******

noatime,nodiratime,data=journal + innodb_doublewrite=0 + innodb_flush_method=O_DSYNC

OLTP test statistics:
queries performed:
read: 16338434
write: 5835151
other: 2334061
total: 24507646
transactions: 1167030 (324.17 per sec.)
deadlocks: 1 (0.00 per sec.)
read/write requests: 22173585 (6159.26 per sec.)
other operations: 2334061 (648.34 per sec.)

http://imgur.com/2ZSpN1N

On the 3.19 kernel from ELRepo there is no difference at all.

kernel-ml-3.19.3-1.el6.elrepo.x86_64

noatime,nodiratime,data=ordered + innodb_doublewrite=0 + innodb_flush_method=O_DIRECT

OLTP test statistics:
queries performed:
read: 150480764
write: 53743130
other: 21497252
total: 225721146
transactions: 10748626 (2985.73 per sec.)
deadlocks: 0 (0.00 per sec.)
read/write requests: 204223894 (56728.78 per sec.)
other operations: 21497252 (5971.45 per sec.)

http://imgur.com/bTBfYjL

noatime,nodiratime,data=ordered + innodb_doublewrite=1 + innodb_flush_method=O_DIRECT

OLTP test statistics:
queries performed:
read: 149974454
write: 53562305
other: 21424922
total: 224961681
transactions: 10712461 (2975.68 per sec.)
deadlocks: 0 (0.00 per sec.)
read/write requests: 203536759 (56537.91 per sec.)
other operations: 21424922 (5951.36 per sec.)

http://imgur.com/1wmBwG2

noatime,nodiratime,data=journal + innodb_doublewrite=0 + innodb_flush_method=O_DIRECT

OLTP test statistics:
queries performed:
read: 147382788
write: 52636704
other: 21054682
total: 221074174
transactions: 10527340 (2924.26 per sec.)
deadlocks: 2 (0.00 per sec.)
read/write requests: 200019492 (55560.89 per sec.)
other operations: 21054682 (5848.51 per sec.)

http://imgur.com/FxISmjn

Did I miss something?

]]>
By: curt mayer https://www.percona.com/blog/improve-innodb-performance-write-bound-loads/#comment-10350135 Mon, 05 Jan 2015 21:39:11 +0000 https://www.percona.com/blog/?p=22593#comment-10350135 Gopi Aravind: disabling journaling destroys durability, methinks.

]]>
By: Gopi Aravind https://www.percona.com/blog/improve-innodb-performance-write-bound-loads/#comment-10306294 Sat, 20 Dec 2014 18:00:17 +0000 https://www.percona.com/blog/?p=22593#comment-10306294 We found something *amazing* with EXT4.

If you disable journaling on EXT4 with tune2fs -O "^has_journal" /dev/ , TPS according to sysbench increases from a terrible 10/sec to 600/sec. This is on a software RAID of two 7.2k HDDs with a 10GB InnoDB buffer pool.

Disabling double-write only increased the old figure to 20/sec, which is nothing.

Before disabling journaling, iotop showed:
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
28511 be/3 root 0.00 B/s 0.00 B/s 0.00 % 96.30 % [jbd2/md2-8]
31172 be/4 mysql 3.89 K/s 50.58 K/s 0.00 % 22.05 % mysqld
31215 be/4 mysql 7.78 K/s 54.48 K/s 0.00 % 16.12 % mysqld

Where jbd2/md2-8 (the journalling engine) caused most of the waiting on io.
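For reference, the journaling change described here can be sketched as follows (the device path and mount point are placeholders; as noted elsewhere in this thread, removing the journal trades away crash durability):

```shell
# Check whether the filesystem currently has a journal:
dumpe2fs -h /dev/sdX1 | grep -i features

# The journal can only be removed while the filesystem is unmounted:
umount /var/lib/mysql
tune2fs -O '^has_journal' /dev/sdX1
e2fsck -f /dev/sdX1        # recheck the filesystem after the change
mount /dev/sdX1 /var/lib/mysql
```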

Here’s our OLTP sysbench test with journaling on EXT4:
*Journalling enabled, double-buffer=0, O_DIRECT*

OLTP test statistics:
queries performed:
read: 0
write: 5805
other: 2322
total: 8127
transactions: 1161 (19.33 per sec.)
deadlocks: 0 (0.00 per sec.)
read/write requests: 5805 (96.63 per sec.)
other operations: 2322 (38.65 per sec.)

And after disabling it:
*Disabled journalling on drive, double-buffer=0*

OLTP test statistics:
queries performed:
read: 0
write: 274820
other: 109928
total: 384748
transactions: 54964 (916.05 per sec.)
deadlocks: 0 (0.00 per sec.)
read/write requests: 274820 (4580.24 per sec.)
other operations: 109928 (1832.10 per sec.)

]]>
By: Yves Trudeau https://www.percona.com/blog/improve-innodb-performance-write-bound-loads/#comment-8408174 Thu, 21 Aug 2014 13:57:44 +0000 https://www.percona.com/blog/?p=22593#comment-8408174 @Nils: I pushed quite hard on the server I tested and I never saw that behavior. I may also get access to a server with a RAID controller, BBU and Fusion-io card; can you tell me more about your setup so that I can try to reproduce your results?

]]>
By: Nils https://www.percona.com/blog/improve-innodb-performance-write-bound-loads/#comment-8402752 Thu, 21 Aug 2014 05:27:01 +0000 https://www.percona.com/blog/?p=22593#comment-8402752 Just a follow-up to this post: I tried it on a busy system and there was a rather peculiar effect where the system would just stall at a determinable (is that a word?) moment. The result was a single mysqld thread hogging 100% of one CPU while every other thread stalled. This didn’t happen with data=writeback, and I think I have isolated the problem to this mount option. Use caution.

]]>
By: Nils https://www.percona.com/blog/improve-innodb-performance-write-bound-loads/#comment-7094057 Tue, 03 Jun 2014 22:20:19 +0000 https://www.percona.com/blog/?p=22593#comment-7094057 I may soon have access to a system with a RAID controller, BBU and a combination of SSDs (only SATA though) and HDDs. I will run a few tests with different combinations of log, data and doublewrite locations.

Interestingly, there is already a controller for a PCIe-based RAM drive that backs up into flash, the IDT 89HF08P08AG3 [1]. It seems nobody has built a device based on it so far.

[1] http://www.techpowerup.com/171760/idt-announces-worlds-first-pci-express-gen-3-nvme-nv-dram-controller.html

]]>
By: Dimitri https://www.percona.com/blog/improve-innodb-performance-write-bound-loads/#comment-7070121 Mon, 02 Jun 2014 20:30:46 +0000 https://www.percona.com/blog/?p=22593#comment-7070121 Hi Yves,

great then if it’s so!..
(and means I have to replay all my test with the following options as well ;-))

Rgds,
-Dimitri

]]>
By: Yves Trudeau https://www.percona.com/blog/improve-innodb-performance-write-bound-loads/#comment-7070018 Mon, 02 Jun 2014 20:19:09 +0000 https://www.percona.com/blog/?p=22593#comment-7070018 Hi Dimitri,
When you use O_SYNC or O_DSYNC and do a 16KB write operation with data=journal, it is transactional: you can’t have a fraction of the 16KB written. It is the exact same principle as transactions in a database. The best way to convince yourself is to try 🙂 I did many, many tests recently. I am trying to get hold of a server with a RAID controller/BBU/spinning disks and a PCIe SSD card. I’ll redo the benchmarks splitting the journals (ext4 and InnoDB) onto spinning drives and the datafile onto SSD.
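As a sketch, the combination being described looks like this (device and paths are placeholders; O_DSYNC is used because O_DIRECT would bypass the page cache that data=journal relies on):

```shell
# /etc/fstab entry (illustrative): ext4 with full data journaling
#   /dev/sdX1  /var/lib/mysql  ext4  data=journal,noatime  0 2

# Matching my.cnf fragment:
#   [mysqld]
#   innodb_flush_method = O_DSYNC
#   innodb_doublewrite  = 0
```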

Regards,

Yves

]]>
By: Dimitri https://www.percona.com/blog/improve-innodb-performance-write-bound-loads/#comment-7003751 Fri, 30 May 2014 16:42:54 +0000 https://www.percona.com/blog/?p=22593#comment-7003751 Hi Yves,

your results are looking very promising!! thanks for sharing!
I’m also curious whether using O_DSYNC instead of the “classic” O_DIRECT for RW workloads will bring any other impact in other test cases…

then, indeed, this EXT4 data journaling is not solving the issue with “double amount of data written”.. – however, keeping in mind arriving to the market various HW solutions for “battery protected memory” (even small amount, but if 400MB is already enough then it’ll look more than attractive, and “double write amount” will be completely avoided and keep a longer life for flash storage ;-))

the only point with this EXT4 feature that remains unclear to me: will EXT4 really guarantee the absence of partially written pages?.. (keeping in mind that EXT4 operates on 4K blocks, while the InnoDB default page is 16K, and EXT4 knows just nothing about InnoDB data, so for EXT4 2 “committed” pages of 4K will remain safe, while there could still be 2 other pages “not committed”, breaking a 16K page for InnoDB).. any idea?..

and, yes, regarding the “double-write buffer” feature in general – its main bottleneck on fast storage will be its own kind of serialization due to its implementation design.. One of the solutions here could be to have several such buffers (e.g. one per BP instance) to increase the throughput (while the issue with twice-written data will still remain); or to bypass the “buffer” and let each page double write go independently of the others, etc..

while the best solution could probably be to have just a single write but to a different place, though I’m not sure we’ll really have fewer IO writes in the end 😉

or the best of the best is simply to avoid the doublewrite buffer completely by using atomic writes, but unfortunately that’s currently available only on Fusion-io and not yet a common standard..

Rgds,
-Dimitri

]]>
By: Yves Trudeau https://www.percona.com/blog/improve-innodb-performance-write-bound-loads/#comment-6895956 Mon, 26 May 2014 13:14:31 +0000 https://www.percona.com/blog/?p=22593#comment-6895956 @MarkC: If you send the ext4 journal to a file on an array behind a RAID controller with a write cache and a BBU, it may actually never be written to disk and instead stay in the write-cache memory, since the max journal size with 4KB pages is 400MB. In such a case, spinning disks will be even faster than SSDs.

@Alan: The issue with the doublewrite buffer is really internal to MySQL; those 64 pages run hot and force some kind of serialization.

@Peter: My explanation for this is the serialization effect of the doublewrite buffer: in order to use some of its 64 pages, a thread must wait for another thread’s write to the data file to complete. I did measure the amount of writes in both cases and yes, they are the same, but the ext4 journal seems to handle concurrency better. As for testing with SSDs and a RAID controller with a BBU, I’d love to do that; any hardware I could use for a few hours?
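One way to realize that layout is an external ext4 journal placed on the cached array; a sketch, with all device names as placeholders:

```shell
# Create a dedicated journal device on the BBU-backed array; the block
# size must match the data filesystem's block size:
mke2fs -O journal_dev -b 4096 /dev/cached_array

# Drop the internal journal and attach the external one
# (the data filesystem must be unmounted):
umount /var/lib/mysql
tune2fs -O '^has_journal' /dev/data_vol
tune2fs -j -J device=/dev/cached_array /dev/data_vol
mount -o data=journal /dev/data_vol /var/lib/mysql
```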

]]>
By: Peter Zaitsev https://www.percona.com/blog/improve-innodb-performance-write-bound-loads/#comment-6874915 Sun, 25 May 2014 14:52:48 +0000 https://www.percona.com/blog/?p=22593#comment-6874915 Yves,

The question I have in this case: do we know why this happens? Did you have a chance to check the amount of writes going to the device in this case?

Here is what puzzles me. As I understand it, ZFS simply stores the new data in a new location, hence it only needs to update metadata to show where the new data is located.

EXT4 with data=journal, if I understand correctly, just stores data in the journal as well as in the location where it should be, so it should still double the amount of writes, as InnoDB does.

If I’m correct, then don’t we have some problem in how the doublewrite buffer is implemented in InnoDB? I don’t see a big reason why the filesystem should be able to do it a lot more efficiently. Or has the design of data=journal changed in EXT4?

I also wonder whether the results for SSDs and RAID with BBU are the same, as this is where you can get dramatic differences: any “sync” is very expensive on conventional hard drives and much cheaper on a system with a BBU or SSD.

]]>
By: Alan Kent https://www.percona.com/blog/improve-innodb-performance-write-bound-loads/#comment-6849338 Fri, 23 May 2014 19:14:16 +0000 https://www.percona.com/blog/?p=22593#comment-6849338 Did you try sending the doublewrite buffer to a separate disk drive or SSD? Just wondering if the doublewrite syncing on the same drive as the main database is causing lots of head thrashing. (I don’t have deep MySQL-specific experience, but this was a problem on another system; giving the write-ahead buffer a separate drive helped.)

]]>
By: Mark Callaghan https://www.percona.com/blog/improve-innodb-performance-write-bound-loads/#comment-6845723 Fri, 23 May 2014 15:24:31 +0000 https://www.percona.com/blog/?p=22593#comment-6845723 What is the perf benefit for a system with HW RAID and battery backed write cache?

]]>
By: Nils https://www.percona.com/blog/improve-innodb-performance-write-bound-loads/#comment-6841864 Fri, 23 May 2014 10:54:10 +0000 https://www.percona.com/blog/?p=22593#comment-6841864 So if it is using the OS file cache, doesn’t that mean some unnecessary double buffering (once in the buffer pool, once in the OS cache)?

I was toying with an idea similar to yours with the RAID controller and BBU: I was thinking of putting just the doublewrite buffer (and maybe the InnoDB logs) behind the RAID controller, perhaps by attaching two small SSDs. I may post again once I have some hardware available.

If someone were to come up with a sort of PCIe Ramdisk with a BBU (or better, capacitors) at a competitive price, that would be a great alternative to RAID controllers, especially since the CoW filesystems like ZFS and btrfs usually also implement RAID in software.

]]>