While preparing a post on the design of ZFS based servers for use with MySQL, I stumbled on the topic of fsync call performance. The fsync call is very expensive, but it is essential to databases as it allows for durability (the “D” of the ACID acronym).
Fsync Performance
Let’s first review the type of disk IO operations executed by InnoDB in MySQL. I’ll assume the default InnoDB variable values.
The first and most obvious type of IO is page reads and writes from the tablespaces. The pages are most often read one at a time, as 16KB random read operations. Writes to the tablespaces are also typically 16KB random operations, but they are done in batches. After every batch, fsync is called on the tablespace file handle.
To avoid partially written pages in the tablespaces (a source of data corruption), InnoDB performs a doublewrite. During a doublewrite operation, a batch of dirty pages, from 1 to about 100 pages, is first written sequentially to the doublewrite buffer and fsynced. The doublewrite buffer is a fixed area of the ibdata1 file or a specific file with the latest Percona Server for MySQL 5.7. Only then do the writes to the tablespaces of the previous paragraph occur.
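The doublewrite sequence can be sketched in a few lines of Python. This is a conceptual illustration only; the function name, file handles, and page layout are invented for the sketch and are not InnoDB's actual on-disk format:

```python
import os

PAGE = 16 * 1024  # InnoDB default page size

def doublewrite(fd_dblwr, fd_tablespace, pages):
    """Write a batch of dirty pages, doublewrite style.

    pages is a list of (offset, data) pairs, each data being one 16KB page.
    """
    # Step 1: the whole batch goes sequentially to the doublewrite area...
    os.lseek(fd_dblwr, 0, os.SEEK_SET)
    for _, data in pages:
        os.write(fd_dblwr, data)
    os.fsync(fd_dblwr)  # ...and is made durable first.

    # Step 2: only then are the pages written to their final offsets.
    # A torn write here can be repaired from the intact copy above.
    for offset, data in pages:
        os.lseek(fd_tablespace, offset, os.SEEK_SET)
        os.write(fd_tablespace, data)
    os.fsync(fd_tablespace)
```

Note the cost: every page is written twice and each batch needs two fsyncs, which is why the doublewrite buffer location matters for performance.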
That leaves us with the writes to the InnoDB log files. During those writes, the transaction information — a kind of binary diff of the affected pages — is written to the log files and then the log file is fsynced. The duration of the fsync call can be a major contributor to the COMMIT latency.
Since the fsync call takes time, it greatly affects the performance of MySQL; that's why there are so many status variables related to fsyncs. To overcome the inherent limitations of the storage devices, group commit allows multiple simultaneous transactions to share a single fsync of the log file: there is no need for a transaction to call fsync for a write operation that another transaction already forced to disk. Note that a series of write transactions sent over a single database connection cannot benefit from group commit.
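The leader/follower idea behind group commit can be sketched as follows. This is a simplified illustration, not InnoDB's implementation; the class and attribute names are invented:

```python
import os
import threading

class GroupCommitLog:
    """Simplified group commit: one fsync covers every write queued so far."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT)
        self.cond = threading.Condition()
        self.written = 0      # bytes appended to the log
        self.synced_upto = 0  # bytes known to be durable
        self.syncing = False  # is some thread currently acting as leader?

    def commit(self, record):
        with self.cond:
            os.write(self.fd, record)
            self.written += len(record)
            my_end = self.written
            while self.synced_upto < my_end:
                if self.syncing:
                    # A leader is already fsyncing; its flush covers us too.
                    self.cond.wait()
                else:
                    # Become the leader and fsync for the whole group.
                    self.syncing = True
                    target = self.written
                    self.cond.release()
                    os.fsync(self.fd)  # one fsync for many transactions
                    self.cond.acquire()
                    self.syncing = False
                    self.synced_upto = max(self.synced_upto, target)
                    self.cond.notify_all()
```

With N threads committing concurrently, many commits complete per physical fsync; with a single connection, each commit still pays a full fsync.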
Fsync Results
In order to evaluate the fsync performance, I used the following Python script:
```python
#!/usr/bin/python
import os
import mmap

# Open a file; O_DIRECT requires an aligned buffer, and mmap memory
# is page aligned.
fd = os.open("testfile", os.O_RDWR | os.O_CREAT | os.O_DIRECT)
m = mmap.mmap(-1, 512)

for i in range(1000):
    os.lseek(fd, 0, os.SEEK_SET)  # rewind to the start of the file
    m[0:1] = b"1"                 # dirty the buffer
    os.write(fd, m)
    os.fsync(fd)                  # force the write to stable storage

# Close the opened file
os.close(fd)
```
The script opens a file with the O_DIRECT flag, writes and fsyncs it 1000 times, and closes the file. I added O_DIRECT after an internal discussion with my colleagues, but it doesn't change the results and it doesn't remove the need for calling fsync. We'll discuss the impacts of O_DIRECT in more detail after we review the results. The script is called with the time command like below:
```shell
root@lab:/tmp/testfsync# time python /root/fsync.py

real    0m18.320s
user    0m0.060s
sys     0m0.096s
```
In the above example, using a 7.2k RPM drive, the fsync rate is about 56/s, for a latency of 18ms. A 7.2k RPM drive performs 120 rotations per second, so on average the fsyncs require a bit more than two rotations to complete. The filesystem appears to make very little difference: ext4 and XFS show similar results. That means that if MySQL uses such a storage device for the InnoDB log files, the latency of each transaction is at least 18ms. If the application workload requires 200 write transactions per second, they'll need to be executed over at least four database connections.
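The rate and latency figures quoted in this post are derived from the `time` output with simple arithmetic:

```python
# Turn the "real" time reported by the time command into a
# per-fsync latency and an fsync rate.
iterations = 1000
elapsed = 18.320  # seconds, from the run above

latency_ms = elapsed / iterations * 1000  # ~18 ms per fsync
rate = iterations / elapsed               # ~55 fsyncs per second

print("latency: %.1f ms, rate: %.0f/s" % (latency_ms, rate))
```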
So, let’s begin with rotational devices. These are becoming a bit less common now with databases, especially without a raid controller. I could only find a few.
Drive | RPM | Rate | Latency | Notes |
---|---|---|---|---|
WDC WD2500BJKT | 5400 | 22/s | 45 ms | Laptop SATA from 2009 |
ST2000LM003 | 5400 | 15/s | 66 ms | USB-3 portable drive |
ST3750528AS | 7200 | 40/s | 25 ms | Desktop grade SATA |
WD2502ABYS-18B7A0 | 7200 | 56/s | 18 ms | Desktop grade SATA |
HUA723020ALA641 | 7200 | 50/s | 20 ms | Enterprise grade SATA, md mirror |
Dell SAS unknown | 7200 | 58/s | 17 ms | Behind Perc ctrl but no write cache |
HDWE150 | 7200 | 43/s | 23 ms | Recent Desktop grade SATA, 5TB |
I, unfortunately, didn’t have access to any 10k or 15k RPM drives that were not behind a raid controller with a write cache. If you have access to such drives, run the above script a few times and send me your results, that would help create a more complete picture! So, we can see a correlation between the rotational speed and the fsync rate, which makes sense. The faster a disk turns, the faster it can fsync. The fsync call saves the data and then updates the metadata. Hence, the heads need to move. That’s probably the main explanation for the remaining disparity. A good point, all drives appears to be fully complying with the SATA flush command even though they all have an enabled write cache. Disabling the drives write caches made no difference.
With the above numbers, the possible transaction rates in fully ACID mode are pretty depressing. But those were rotating drives; what about SSDs? SSDs are memory devices and are much faster for random IO operations. They are extremely fast for reads and good for writes but, as you will see below, not that great for fsyncs.
Drive | rate | latency | notes |
---|---|---|---|
SAMSUNG MZ7LN512 | 160/s | 6.3ms | Consumer grade SATA |
Crucial_CT480M500SSD1 | 108/s | 9.3ms | Consumer grade SATA |
Intel 520 | 2031/s | 0.49ms | Consumer grade SATA |
SAMSUNG MZVPV512HDGL | 104/s | 9.6ms | Consumer grade NVMe |
Samsung SSD 960 PRO | 267/s | 3.8ms | High-end consumer grade NVMe |
Intel PC-3100 | 1274/s | 0.79ms | Low-end consumer grade NVMe (cheat?) |
Intel 750 | 2038/s | 0.49ms | High-end consumer grade NVMe |
Intel PC-3700 | 7380/s | 0.14ms | High-end enterprise-grade NVMe |
Again, this is a small sample of the devices I have access to. All SSD/flash devices have write caches, but only the high-end ones have capacitors to flush their write cache to the flash on power loss. The PC-3100 device is actually in my home server, and it is obviously cheating: if you look at the card specs on the Intel website, it doesn't have the "Enhanced Power Loss Data Protection" and "End-to-End Data Protection" features. The much more expensive PC-3700 does. I use the PC-3100 as a ZFS L2ARC device, so I am good. In general, the performance of a flash device varies a bit more than that of rotational devices, since factors like the number of recent writes and the fill level come into play.
Even when using a high-end NVMe device like the PC-3700, you can’t reach 10k fully ACID transactions per second at low thread concurrency. How do you reach the higher levels? The answer here is the good old raid controller with a protected write cache. The write cache is basically using DRAM memory protected from power loss by a battery. SAN controllers have similar caches. The writes to the InnoDB log files are sequential writes interleaved with fsyncs. The raid controller concatenates the sequential writes, eventually writing one big chunk on disk and… ignoring the fsyncs. Here’s the result from the only device I had access to:
Drive | rate | latency | notes |
---|---|---|---|
Dell Perc with BBU | 23000/s | 0.04ms | Array of 7.2k rpm drives |
That’s extremely fast but, of course, it is memory. I modified the script to loop 10k times instead of 1k. In theory, something a single slave thread doing simple transactions could reach a rate of 20k/s or more while being fully ACID.
Discussion
We must always consider the results of the previous section in the context of a given application. For example, a server using an Intel PC-3700 NVMe card can do more than 7000 fully durable ACID transactions per second, provided those transactions are issued by a sufficient number of threads. Adding threads will not allow scaling infinitely, though; at some point, other bottlenecks like mutex contention or page flushing will dominate.
We often say that Galera-based cluster solutions like Percona XtraDB Cluster (PXC) add latency to transactions, since they involve communication over the network. With the Galera protocol, a commit operation returns only when all the nodes have received the data, so it is a good practice to relax the local durability and set innodb_flush_log_at_trx_commit to 0 or 2. On a local network, the ping time is always below 1ms, and often below 0.1ms; as a result, the transaction latency is often smaller than with full local durability.
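As a sketch, that relaxation amounts to a single my.cnf setting on a PXC node. Per the MySQL documentation, 1 (the default) fsyncs the log at every commit, 2 writes at commit but fsyncs only about once per second, and 0 both writes and fsyncs about once per second:

```ini
[mysqld]
# Galera replication provides durability across nodes; locally we
# accept up to about one second of log loss after a crash.
innodb_flush_log_at_trx_commit = 2
```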
About fdatasync
The fsync system call is not the only one that persists data to disk; there is also fdatasync. fdatasync persists the data to disk but does not update metadata such as the file size and last update time. In other words, it performs one write operation instead of two. If I replace os.fsync with os.fdatasync in the Python script, here are the results for a subset of devices:
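The change to the script is a single line. Here is a sketch of the modified loop, with O_DIRECT dropped for portability (keeping it doesn't change the relative results):

```python
import os
import mmap

fd = os.open("testfile", os.O_RDWR | os.O_CREAT)
m = mmap.mmap(-1, 512)

for i in range(1000):
    os.lseek(fd, 0, os.SEEK_SET)
    m[0:1] = b"1"
    os.write(fd, m)
    # Data only: metadata is flushed only if it is needed to read
    # the data back correctly (e.g. a file size change).
    os.fdatasync(fd)

os.close(fd)
```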
Drive | rpm | rate | latency | notes |
---|---|---|---|---|
ST2000LM003 | 5400 | 72/s | 13 ms | USB-3 portable drive |
WD2502ABYS-18B7A0 | 7200 | 118/s | 8.5 ms | Desktop grade SATA |
SAMSUNG MZ7LN512 | N/A | 333/s | 3.0ms | Consumer grade SATA |
Crucial_CT480M500SSD1 | N/A | 213/s | 4.7ms | Consumer grade SATA |
Samsung SSD 960 PRO | N/A | 714/s | 1.4ms | High-end consumer grade NVMe |
In all cases, the resulting rates have more than doubled. The fdatasync call has a troubled history: there were issues with it many years ago and, because of those issues, InnoDB never uses fdatasync, only fsync. You can find the following comment in the InnoDB source file os/os0file.cc:
```c
/* We let O_SYNC only affect log files; note that we map O_DSYNC to
O_SYNC because the datasync options seemed to corrupt files in 2001
in both Linux and Solaris */
```
2001 is a long time ago. Given the above results, maybe we should reconsider the use of fdatasync. In the Linux man page for fdatasync, you find:
```
fdatasync() is similar to fsync(), but does not flush modified
metadata unless that metadata is needed in order to allow a
subsequent data retrieval to be correctly handled.  For example,
changes to st_atime or st_mtime (respectively, time of last access
and time of last modification; see stat(2)) do not require flushing
because they are not necessary for a subsequent data read to be
handled correctly.  On the other hand, a change to the file size
(st_size, as made by say ftruncate(2)), would require a metadata
flush.
```
So, even with fdatasync, operations like extending an InnoDB tablespace will update the metadata correctly. This appears to be an interesting low-hanging fruit in terms of MySQL performance. In fact, WebScaleSQL already has fdatasync support available.
O_DIRECT
Why do we need an fsync or fdatasync with O_DIRECT? With O_DIRECT, the OS is not buffering anything along the way, so the data should be persisted, right? Actually, the OS is not buffering, but the device very likely is. Here are a few results to highlight the point, using a 7.2k RPM SATA drive:
Test | rate | latency |
---|---|---|
O_DIRECT, drive Write cache enabled | 4651/s | 0.22ms |
O_DIRECT, drive Write cache disabled | 101/s | 9.9ms |
ASYNC + fdatasync, Write cache enabled | 119/s | 8.4ms |
ASYNC + fdatasync, Write cache disabled | 117/s | 8.5ms |
The drive write cache was enabled/disabled using the hdparm command. Clearly, there's no way the drive can persist 4651 writes per second. O_DIRECT doesn't send the SATA flush command to the disk, so we are only writing to the drive write cache. If the drive write cache is disabled, the rate falls to a more reasonable value of 101/s. What is interesting (and I don't really understand why) is that opening the file in async mode and performing fdatasync is significantly faster. As expected, the presence of the drive write cache has no impact on ASYNC + fdatasync: when the fdatasync call occurs, the data is still in the OS file cache.
If you want to use only O_DIRECT, you should make sure all the storage write caches are crash safe. That’s why MySQL adds a fsync call after a write to a file opened with O_DIRECT.
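For reference, the drive write cache can be inspected and toggled with hdparm. The device name below is an assumption for the example; use your own drive, and note that this requires root:

```shell
hdparm -W /dev/sda    # report the current write-cache setting
hdparm -W0 /dev/sda   # disable the drive write cache
hdparm -W1 /dev/sda   # re-enable it
```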
ZFS
These days, I find it difficult to write a blog post without talking about ZFS. How does ZFS handle fsyncs and fdatasyncs? ZFS, like a database, performs write-ahead logging in the ZIL. That means calls like fsync and fdatasync return when the data has been persisted to the ZIL, not to the actual filesystem; the real write operation is done a few seconds later by a background thread. As a consequence, the added write for the metadata does not impact performance right away. My home server uses ZFS over a pair of 7.2k RPM drives and doesn't have a SLOG device, so the ZIL is stored on the 7.2k RPM drives. The results are the following:
Drive | rpm | rate | latency |
---|---|---|---|
ZFS fsync | 7200 | 104/s | 9.6 ms |
ZFS fdatasync | 7200 | 107/s | 9.3 ms |
Remember that with ZFS, you need to disable O_DIRECT mode. The fdatasync rate appears to be slightly faster, but the difference is not really significant. With ZFS, the fsync/fdatasync performance relates to where the ZIL is stored. If there is no SLOG device, the ZIL is stored with the data, and thus the persistence performance of the devices used for the data matters. If there is a SLOG device, the persistence performance is governed by the specs of the device(s) on which the SLOG is located. That's a very important aspect to consider when designing a MySQL server that will use ZFS. The design of such a server will be discussed in more detail in a future post.
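For illustration, adding a SLOG device to an existing pool is a one-liner; the pool and device names below are hypothetical:

```shell
# Move the ZIL off the data disks onto a dedicated, fast log device.
zpool add tank log /dev/nvme0n1
zpool status tank   # the device now appears under a "logs" section
```

After this, fsync/fdatasync latency follows the log device rather than the data disks.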
Comments
I tested your fsync.py using an Intel Optane SSD (900P 280GB, ext4). It took 0.054s to complete, but it appears that some of that time was outside the for-loop. 10 000 iterations took 0.43s, or 0.043 ms per fsync (~23,250 fsyncs/second).
And some ZFS benchmarks… https://www.servethehome.com/exploring-best-zfs-zil-slog-ssd-intel-optane-nand/
Be careful: although the Optane 900P has impressive specs, you'll find this in them: "Enhanced Power Loss Data Protection: No". I wonder if it really persists the data at each fsync or fdatasync. It would be interesting to find out, because it would be a very nice SLOG device.
Well, it’s without a volatile cache, so it shouldn’t need power loss protection.
https://www.anandtech.com/show/11953/the-intel-optane-ssd-900p-review/3
I did some research on my side and you are right, the Optane appears to be safe in terms of power loss. Quite impressive performance, and a cheap SLOG device.
Excellent post, Yves.
I tested your fsync.py using our SAN HPE 3PAR StoreServ 8400 storage.
It is relatively high level flash-based storage device.
10 000 iterations took 19.303s, or 1.93 ms per fsync (~518 fsyncs/second).
Thanks, very good article.
BTW, postgresql has pg_test_fsync utility for years
You did a great job in here Yves! Thank you for this article.
Used fsync.py and it was fantastic– thanks for the post on this code and storage device
Hi Yves, does “drive Write cache enabled” mean “Write Back” in physical RAID please?
When I mention the "drive write cache", I normally mean the on-disk write cache. Some SATA drives can cheat when the drive cache is enabled. As for the RAID controller write cache, that applies only to the Dell PERC results and there, yes, the cache was set to "write back".
Salut Yves,
Very great article.
But I have some doubts about your benchmark methodology.
Your Python script does the job, but the data collection isn't good.
Here’s the time command’s output:
$ time sleep 1
real 0m1.005s
user 0m0.000s
sys 0m0.003s
Things are given in seconds and you seem to use it as milliseconds.
Furthermore, you time the entire script and Python as an interpreted language has a long startup.
This time depends on the Python version, CPU, etc., and just adds a false overhead to your results.
Otherwise, I really love Python, really really, but its for-loops are slow, so you shouldn't expect the best performance from a Python loop.
You could time each I/O, but the whole thing may be affected by Python perfs.
Thank you again for this article
I used a loop of 1000 so the reported time can be interpreted as ms per event.