MongoDB RollbackIn this blog post, we’ll look at how MongoDB rollback works during replicaset failovers.

In recent versions, MongoDB has provided lots of features related to replicaset and automatic failover. When it comes to failover, the next question that arises is “How does MongoDB ROLLBACK work during replicaset failover?”

If a PRIMARY member (say node A) stepped down with some data writes that were executed but not replicated to the SECONDARY members yet, then a ROLLBACK occurs on the former PRIMARY A when it rejoins the replicaset. I’ll explain below how the ROLLBACK works!

ROLLBACK Scenario:

ROLLBACK is rare in a replicaset as MongoDB tries to avoid it by replicating the operations from PRIMARY to SECONDARY without delay, under normal conditions. Most of the time ROLLBACK occurs in the event of network partitioning, or if SECONDARY members can’t keep up with the throughput of operations on the former PRIMARY.

ROLLBACK Process:

We will see the process with a test. I have used Docker for this test with the MongoDB 3.2 Jessie version to setup a replicaset with members mongo1 – A, mongo2 – B, mongo3 – C and set Priority 10 to A. Now A is PRIMARY as expected in the replicaset. We need to write some data into A and create a network partition scenario with B and C at the same time. For that, I inserted 25000 documents into A and made it out of network at the same time.

Terminal 1 (A’s mongo prompt):

Terminal2:

The member A has now become as SECONDARY, because it couldn’t reach other members in the replicaset. On the other side, B and C members see that A is not reachable and then B is elected as PRIMARY. We could see that some inserts from former A replicated to B before the network split happens.

(B node)

Now do some write operations in current PRIMARY – B and then let node A join the network back by joining the container back to the bridge network. You can observe below that the node A’s member states are changing in the mongo prompt. (I just connected to A and pressed ENTER/RETURN button many times to see the member states, or you can see them in the log file):

(A node)

ROLLBACK Internal

From MongoDB point of view, we will see the replicaset process to understand what happened above. Normally the SECONDARY member syncs the oplog entries from its syncSource (the member from where the data is replicated) by using oplogFetcher. The OplogFetcher first sends a find() command to the syncSource’s oplog, and then follows with a series of getMores on the cursor. When node A rejoins the replicaset, node A’s oplogFetcher first sends find() command to syncSource node B and check it has a greater than or equal predicate on the timestamp of the last oplog entry it has fetched. Usually the find() command should at least return one doc due to the greater than or equal predicate. If not, it means that the syncSource is behind and so it will not replicate from it and look for other syncSource.

In this case, A’s oplogFetcher sees that the first document returned from node B does not match the last entry in its oplog. That means node A’s oplog has diverged from node B’s and it should go into ROLLBACK.

Node A first finds the common point between its oplog and its syncSource B’s oplog. It then goes through all of the operations in its oplog back to the common point and figures out how to undo them. Here, 9997 inserts are missed from B and C nodes, and so these documents will be recovered from A’s oplog.

ROLLBACK data

Where would these 9997 recovered documents go? MongoDB writes these ROLLBACK documents under the rollback directory in the dbpath. These recovered collections are named with namespace as the prefix and the date time as the suffix in their names. These are in BSON format, and we need to convert into JSON to analyze them so the plan for the next course of action can be done. In our case, the testData collection’s rollback data are as follows:

That’s it? Now check the counts of the testData collection in node A:

So the records 9997 which were rollbacked into the rollback directory would also be dropped from the collection. This ensures the data consistency throughout the replicaset.

How to avoid ROLLBACK – writeConcern

The default writeConcern in the replicaSet is w:1., i.e., When a client writes into a replicaSet, then it receives an acknowledgment from the PRIMARY alone and won’t wait for SECONDARY members’ acknowledgment. If you want to avoid the ROLLBACK scenario in your environment, then you have to use the {w:majority} or {w:n}, where 1 > n <=  (no. of members in your replica set). This ensures that the writes are propagated to so many members of the replica set before sending the acknowledgment to the client. This solves the problem of ROLLBACK.

But please be careful that you are not giving higher value to writeConcern, because it also affects the write performance. The acknowledgment needs to be received from the number of members mentioned in the value. The value {w:majority} provides the acknowledgement that write operations have propagated to the majority of voting nodes, including the primary and is suitable for most of the environments.

ROLLBACK – Limitation

The main thing to note here is that mongod will not rollback more than 300MB data. In such cases, we need to manually check the instance to recover the data. You can see the below message in mongod.log in such cases:

Understanding this simple ROLLBACK background helps us to decide what needs to be done with the rollbacked data. It also helps us avoid such scenarios, because data is data and is very important!