In my recent benchmarks for MongoDB, we can see that two engines, WiredTiger and TokuMX, suffer from periodic drops in throughput. The drops line up clearly with the checkpoint interval, so I attribute them to checkpoint activity.

The funny thing is that I thought we had solved checkpointing issues in InnoDB once and for all. There are a bunch of posts on this issue in InnoDB dating back some four years, and we did a lot of research back then while working on a fix for Percona Server.

But, as Baron said, “History Repeats”… and it seems it repeats in technical issues as well.

So, let’s take a look at what checkpointing is, and why it is a problem for storage engines.

As you may know, a transactional engine writes transactional logs (you may see them called “redo logs”, write-ahead logs, WAL, etc.) so that it can perform crash recovery after a database crash or a server power outage. To keep the recovery time bounded (we all expect the database to start quickly), the engine has to limit how many changes accumulate in the logs. For this, a transactional engine performs a “checkpoint,” which basically synchronizes the changes held in memory with the corresponding changes in the logs, so that old log records can be deleted. In practice this usually means writing changed (dirty) database pages from memory to permanent storage.
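To make the idea concrete, here is a minimal sketch of how a write-ahead log and a checkpoint interact. It is a conceptual illustration only, not the implementation of any of the engines discussed here: the checkpoint flushes dirty pages to storage, after which the older log records are no longer needed for recovery.

```python
# Conceptual sketch of write-ahead logging and checkpointing.
# Not the implementation of any real engine.

class Engine:
    def __init__(self):
        self.log = []           # write-ahead log records: (lsn, page_id, new_value)
        self.buffer_pool = {}   # in-memory pages: page_id -> value
        self.dirty = set()      # pages modified since they were last flushed
        self.storage = {}       # "permanent" storage: page_id -> value
        self.next_lsn = 0

    def write(self, page_id, value):
        # The change is logged first (write-ahead), then applied in memory.
        self.log.append((self.next_lsn, page_id, value))
        self.next_lsn += 1
        self.buffer_pool[page_id] = value
        self.dirty.add(page_id)

    def checkpoint(self):
        # Flush dirty pages so their changes are durable in the data files...
        for page_id in self.dirty:
            self.storage[page_id] = self.buffer_pool[page_id]
        self.dirty.clear()
        # ...after which the old log records can be discarded, keeping
        # crash recovery time bounded.
        self.log.clear()
```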

InnoDB takes the approach of limiting the total size of the log files (which equals innodb-log-file-size * innodb-log-files-in-group), which I call “size-limited checkpointing”.
Both TokuDB and WiredTiger limit changes by time period (60 seconds by default), which means the log files may grow without bound within a given time interval (I call this “time-limited checkpointing”).
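The two policies boil down to two different trigger conditions for the same checkpoint operation. A rough sketch of the difference follows; the names and thresholds are purely illustrative, not actual engine parameters:

```python
import time

# Illustrative values only, not actual engine settings.
LOG_SIZE_LIMIT = 2 * 1024 ** 3   # think innodb-log-file-size * innodb-log-files-in-group
CHECKPOINT_PERIOD = 60           # seconds, the TokuDB/WiredTiger-style default

def should_checkpoint_size_limited(current_log_bytes):
    # "Size-limited": checkpoint when the redo log is about to run out of space.
    return current_log_bytes >= LOG_SIZE_LIMIT

def should_checkpoint_time_limited(last_checkpoint_ts):
    # "Time-limited": checkpoint every N seconds, regardless of how many
    # changes have accumulated in memory since the last checkpoint.
    return time.time() - last_checkpoint_ts >= CHECKPOINT_PERIOD
```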

Another difference is that InnoDB takes a “fuzzy checkpointing” approach (which was not really “fuzzy” until we fixed it): instead of waiting until the size limit is reached, InnoDB checkpoints all the time, and the closer it gets to the log size limit, the more intensively it flushes. This allows it to achieve a more or less smooth throughput, without significant drops. A rough sketch of the idea follows.
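The sketch below is only a hand-wavy illustration of the “fuzzy” idea, not InnoDB’s actual adaptive flushing formula: rather than an all-or-nothing checkpoint, the engine flushes some dirty pages on every iteration, and the flush rate grows as the log fills up (the constant and the quadratic ramp-up are made up for illustration):

```python
MAX_FLUSH_RATE = 1000   # pages per iteration at 100% log fill (illustrative number)

def pages_to_flush(log_used_bytes, log_size_limit):
    # Flush continuously, more aggressively the closer we are to the log size
    # limit, so the engine never has to stop and write everything at once.
    fill_ratio = log_used_bytes / log_size_limit
    return int(MAX_FLUSH_RATE * fill_ratio ** 2)   # quadratic ramp-up, one possible shape
```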

Unlike InnoDB, both TokuDB (of that I am sure) and WiredTiger (here I speculate, as I did not look into WiredTiger internals) wait until the last moment and perform the checkpoint strictly on the prescribed interval. If the database happens to contain many changes in memory, this results in performance stalls. The effect is close to “hitting a wall at full speed”: user queries get locked until the engine has written out all the in-memory changes it needs to write.

Interestingly enough, RocksDB, because it has a different architecture (I may write about it in the future, but for now I will point to the RocksDB Wiki), does not have this problem with checkpoints (it does, however, have its own background activity, such as level compactions and tombstone maintenance, but that is a different topic).

I do not know how WiredTiger is going to approach this checkpointing issue, but we are looking to improve TokuDB to make checkpoints less painful for user queries, and eventually to move to a “fuzzy checkpointing” model.