Last year, I wrote a post focused on the performance of the fsync call on various storage devices. The fsync call is extremely important for a database when durability, the “D” of the ACID acronym, is a hard requirement. The call ensures the data is permanently stored on disk. The durability requirement forces every transaction to return only once the InnoDB log file and the binary log file have been flushed to disk.
In this post, instead of focusing on the performance of various devices, we’ll see what can be done to improve fsync performance using an Intel Optane card.
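Throughout this post, “fsync rate” means how many small synchronous writes per second a device can sustain. As a minimal stand-in for the benchmark (a sketch only, not the Python script from the earlier post; the test path is an assumption), a shell loop like this gives a rough estimate:

```shell
# Rough fsync-rate estimate -- point FILE at the filesystem under test.
FILE="${1:-/tmp/fsync_test}"   # assumed path; replace with a path on the target device
N=200                          # number of synchronous writes to time
start=$(date +%s.%N)
i=0
while [ "$i" -lt "$N" ]; do
    # conv=fsync makes dd call fsync() on the output file before exiting
    dd if=/dev/zero of="$FILE" bs=4k count=1 conv=fsync 2>/dev/null
    i=$((i + 1))
done
end=$(date +%s.%N)
rate=$(awk -v n="$N" -v s="$start" -v e="$end" 'BEGIN { printf "%.0f", n / (e - s) }')
echo "fsync rate: ${rate}/s"
rm -f "$FILE"
```

This incurs one process fork per write, so it understates what a tight loop around fsync() would measure, but it ranks devices the same way.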
Intel Optane
A few years ago, Intel introduced a new type of storage device based on 3D XPoint technology, sold under the Optane brand. These devices outperform regular flash devices and have higher endurance. In the context of this post, I found they are also very good at handling the fsync call, something many flash devices struggle with.
I recently had access to an Intel Optane NVMe card, a DC P4800X card with a storage capacity of 375GB. Let’s see how it can be used to improve performance.
Optane used directly as a storage device
This is by far the simplest option if your dataset fits on the card. Just install the device, create a filesystem, mount it, and go. Using the same Python script as in the first post, the results are:
| Options | Fsync rate | Latency |
|---|---|---|
| ext4, O_DIRECT | 21200/s | 0.047 ms |
| ext4 | 20000/s | 0.050 ms |
| ext4, data=journal | 9600/s | 0.100 ms |
The above results are pretty amazing. The fsync performance is on par with a RAID controller with a write cache, for which I got a rate of 23000/s, and is much better than a regular NAND-based NVMe card like the Intel DC P3700, which delivers an fsync rate of 7300/s. Even with the full ext4 journal enabled, the rate is still excellent although, as expected, it is cut roughly in half.
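For completeness, the configurations in the table can be reproduced roughly as follows. The device and mount-point names are examples, and note that O_DIRECT is not a mount option: it is a flag the benchmark passes to open() itself.

```shell
# Example device and mount point -- adjust for your system.
mkfs.ext4 /dev/nvme0n1
mkdir -p /mnt/optane

# Plain ext4 (default data=ordered journaling)
mount /dev/nvme0n1 /mnt/optane

# Or, for the full data journaling row of the table:
mount -o data=journal /dev/nvme0n1 /mnt/optane

# The "O_DIRECT" row is the default mount; the benchmark opens its
# test file with open(..., O_DIRECT) to bypass the page cache.
```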
Optane used as the cache block device in a hybrid volume
If you have a large dataset, you can still use the Optane card as a read/write cache and improve fsync performance significantly. I did some tests with two readily available solutions, dm-cache and bcache. In both cases, the Optane card was put in front of an external USB SATA disk and the cache layer was set to writeback.
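Roughly, the two hybrid setups can be assembled as follows. This is a sketch under assumed device names (/dev/sdb is the slow USB SATA disk, /dev/nvme0n1 the Optane card); the cache-pool size and the cset UUID placeholder are hypothetical.

```shell
# --- dm-cache via LVM (lvmcache), example device names ---
pvcreate /dev/sdb /dev/nvme0n1
vgcreate vg_data /dev/sdb /dev/nvme0n1
lvcreate -n lv_data -l 100%PVS vg_data /dev/sdb       # data LV on the slow disk
lvcreate --type cache-pool -L 300G -n cpool vg_data /dev/nvme0n1
lvconvert --type cache --cachepool vg_data/cpool \
          --cachemode writeback vg_data/lv_data       # writeback, as in the test

# --- bcache, example device names ---
make-bcache -B /dev/sdb          # backing (slow) device, appears as /dev/bcache0
make-bcache -C /dev/nvme0n1     # cache device; note the cset UUID it prints
echo <cset-uuid> > /sys/block/bcache0/bcache/attach   # replace with the real UUID
echo writeback   > /sys/block/bcache0/bcache/cache_mode
```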
| Options | Fsync rate | Latency |
|---|---|---|
| No cache | 13/s | 75 ms |
| dm-cache | 3100/s | 0.32 ms |
| bcache | 2500/s | 0.40 ms |
Both solutions improve the fsync rate by two orders of magnitude. That’s still much slower than using the device directly, but a very decent trade-off.
Optane used as a ZFS SLOG
ZFS can also use a fast device for its write journal, the ZIL. In ZFS terminology, such a device is called a SLOG. With the ZFS logbias property set to “latency”, here is the impact of using an Optane device as a SLOG in front of the same slow USB SATA disk:
| Options | Fsync rate | Latency |
|---|---|---|
| ZFS, SLOG | 7400/s | 0.135 ms |
| ZFS, no SLOG | 28/s | 36 ms |
The addition of a SLOG device boosted the fsync rate by a factor of nearly 260. The rates are also more than twice as high as the ones reported with dm-cache and bcache, and about a third of the result with the Optane device used directly for storage. Considering all the added benefits of ZFS, like compression and snapshots, that’s a really interesting result.
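For reference, the SLOG configuration can be sketched like this, again under assumed device names (/dev/sdb for the slow disk, /dev/nvme0n1 for the Optane device):

```shell
# Example device names -- adjust for your system.
zpool create data /dev/sdb          # pool on the slow USB SATA disk
zpool add data log /dev/nvme0n1     # Optane device as SLOG
zfs set logbias=latency data        # bias the ZIL toward low commit latency
```

With logbias=latency, synchronous writes go to the SLOG first and are later flushed to the main pool, which is exactly what makes the fsync path fast here.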
Conclusion
If you are struggling with the commit latency of a large transactional database, 3D XPoint devices like the Intel Optane may offer you new options.
What about Optane used directly as a storage device but with ZFS instead of ext4 as the filesystem? Would ZFS be faster than ext4 in that case?
ZFS with an SSD ZIL will be fast at first, but it will slow down when the log is applied to the slow disks, unless all your fsync writes can be converted to append-style writes.
In a real workload, though, it will be a mix of random and sequential writes.
I read your article with great interest.
I’d like to use an Optane for a SLOG like yours.
How do I configure the system?
My Optane card has 3 partitions: 2 of 10GB and one with the remaining space. Keep in mind, this is not prod, just my home server. For ZFS, I use the 2nd for the SLOG and the 3rd for the cache (L2ARC):
zpool add data log /dev/disk/by-id/nvme-INTEL_SSDPED1K375GA_PHKS750500FR375AGN-part2
zpool add data cache /dev/disk/by-id/nvme-INTEL_SSDPED1K375GA_PHKS750500FR375AGN-part3
My ZFS pool is named data. In a production environment, the log should ideally be a mirror of 2 Optane cards, since losing the SLOG is pretty bad. 10GB for the SLOG is quite a lot.
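As a sketch of that recommendation, a mirrored log vdev is added in one command. The device paths below are hypothetical; substitute the by-id paths of your own two cards.

```shell
# Hypothetical device paths -- replace with the by-id paths of your two cards.
zpool add data log mirror \
    /dev/disk/by-id/nvme-OPTANE_CARD_A-part2 \
    /dev/disk/by-id/nvme-OPTANE_CARD_B-part2
```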