In this blog post, we’ll look at how MongoDB ROLLBACK works during replica set failovers.
In recent versions, MongoDB has provided many features related to replica sets and automatic failover. When it comes to failover, the next question that arises is: “How does MongoDB ROLLBACK work during a replica set failover?”
If a PRIMARY member (say node A) steps down after accepting writes that have not yet been replicated to the SECONDARY members, then a ROLLBACK occurs on the former PRIMARY A when it rejoins the replica set. I’ll explain below how the ROLLBACK works!
ROLLBACK Scenario:
ROLLBACK is rare in a replica set, as MongoDB tries to avoid it by replicating operations from the PRIMARY to the SECONDARY members without delay under normal conditions. Most of the time, ROLLBACK occurs in the event of a network partition, or when SECONDARY members can’t keep up with the throughput of operations on the former PRIMARY.
ROLLBACK Process:
We will see the process with a test. I used Docker for this test with the MongoDB 3.2 Jessie image to set up a replica set with members mongo1 – A, mongo2 – B, mongo3 – C, and set priority 10 for A. Node A is therefore PRIMARY, as expected. We need to write some data into A and create a network partition between it and B and C at the same time. For that, I inserted 25000 documents into A and disconnected it from the network while the inserts were running.
Terminal 1 (A’s mongo prompt):
```
my-mongo-set:PRIMARY> for (var i = 1; i <= 25000; i++) {
...    db.testData.insert( { x : i } )
... }
WriteResult({ "nInserted" : 1 })
my-mongo-set:PRIMARY> db.testD2018-03-30T17:34:51.455+0530 I NETWORK  [thread1] trying reconnect to 127.0.0.1:30001 (127.0.0.1) failed
2018-03-30T17:34:51.464+0530 I NETWORK  [thread1] reconnect 127.0.0.1:30001 (127.0.0.1) ok
db.testD    admin.testD
my-mongo-set:SECONDARY> rs.slaveOk()
my-mongo-set:SECONDARY> db.testData.count()
25000
```
Terminal 2:
```
Vinodhs-MBP:~ vinodhkrish$ docker ps
CONTAINER ID   IMAGE                 COMMAND                  CREATED      STATUS      PORTS                      NAMES
b27d82ac2439   mongo:3.2.19-jessie   "docker-entrypoint.s…"   2 days ago   Up 1 days   0.0.0.0:30003->27017/tcp   mongo3
2b39f9e41973   mongo:3.2.19-jessie   "docker-entrypoint.s…"   2 days ago   Up 1 days   0.0.0.0:30002->27017/tcp   mongo2
105b6df757d7   mongo:3.2.19-jessie   "docker-entrypoint.s…"   2 days ago   Up 1 days   0.0.0.0:30001->27017/tcp   mongo1
Vinodhs-MBP:~ vinodhkrish$ docker network disconnect my-mongo-cluster mongo1
```
Member A has now stepped down to SECONDARY, because it can no longer reach the other members of the replica set. On the other side, B and C see that A is unreachable, and B is elected PRIMARY. We can see that some of the inserts on former PRIMARY A replicated to B before the network split happened.
(B node)
```
my-mongo-set:PRIMARY> db.testData.count()
15003
```
Now do some write operations on the current PRIMARY – B, and then let node A rejoin the network by reattaching its container to the bridge network. You can observe below that node A’s member state changes in the mongo prompt. (I just connected to A and pressed ENTER/RETURN repeatedly to see the member states; you can also see them in the log file):
(A node)
```
Vinodhs-MacBook-Pro:mongodb-osx-x86_64-3.2.19 vinodhkrish$ ./bin/mongo 127.0.0.1:30001/admin
MongoDB shell version: 3.2.19
connecting to: 127.0.0.1:30001/admin
my-mongo-set:ROLLBACK>
my-mongo-set:RECOVERING>
my-mongo-set:SECONDARY>
my-mongo-set:SECONDARY>
my-mongo-set:PRIMARY>
```
ROLLBACK Internal
From MongoDB’s point of view, let’s look at the replication process to understand what happened above. Normally, a SECONDARY member syncs oplog entries from its sync source (the member from which it replicates data) using the OplogFetcher. The OplogFetcher first sends a find() command against the sync source’s oplog, and then follows with a series of getMore commands on the cursor. When node A rejoins the replica set, its OplogFetcher sends a find() command to sync source node B with a greater-than-or-equal predicate on the timestamp of the last oplog entry it has fetched. Because of that predicate, the find() should return at least one document; if it doesn’t, the sync source is behind, so node A will not replicate from it and will look for another sync source.
In this case, A’s OplogFetcher sees that the first document returned from node B does not match the last entry in its own oplog. That means node A’s oplog has diverged from node B’s, and A must go into ROLLBACK.
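The first-batch check described above can be sketched in a few lines. This is an illustrative model, not MongoDB’s actual source code; here an oplog entry is simplified to a (term, timestamp) pair, and the function names are my own:

```python
# Hypothetical sketch of the sync-source first-batch check: the fetcher
# compares its last locally fetched oplog entry with the first document
# the sync source returns for the gte-timestamp query.
def check_first_batch(last_fetched, first_doc_from_source):
    """Decide what a rejoining node should do, given the first find() result."""
    if first_doc_from_source is None:
        # The source has no entry >= our last optime: it is behind us,
        # so we should not replicate from it and must pick another source.
        return "choose_new_sync_source"
    if first_doc_from_source == last_fetched:
        # Our last entry matches the source's: keep tailing this source.
        return "continue_fetching"
    # The entry at/after our timestamp differs from ours: our oplog has
    # diverged from the source's, and we must enter ROLLBACK.
    return "rollback"

# Node A's last op (term 4) vs. what node B returns (term 5):
print(check_first_batch((4, "12:03:52"), (5, "12:05:37")))  # rollback
```

In our test, node A’s last entry was written in term 4, while node B’s first returned entry is from term 5, so the check falls through to the ROLLBACK branch.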
Node A first finds the common point between its oplog and its sync source B’s oplog. It then walks back through all of the operations in its own oplog to that common point and figures out how to undo them. Here, 9997 inserts never reached the B and C nodes, so those documents will be undone on A and saved as rollback data.
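The common-point search can be sketched as follows. Again, this is a simplified model (entries reduced to (term, timestamp) pairs, function name my own), not MongoDB’s implementation: walk our oplog backwards until we hit an entry the sync source also has, collecting everything after it as the set of operations to undo.

```python
# Simplified common-point search between a rolled-back node's oplog and
# its sync source's oplog. Entries are (term, timestamp) tuples.
def find_common_point(our_oplog, their_oplog):
    theirs = set(their_oplog)
    to_undo = []
    # Scan our oplog from newest to oldest.
    for entry in reversed(our_oplog):
        if entry in theirs:
            # Found the last entry both sides share; everything newer
            # than it on our side must be undone.
            return entry, list(reversed(to_undo))
        to_undo.append(entry)
    raise RuntimeError("no common point found; cannot roll back")

ours   = [(4, 1), (4, 2), (4, 3), (4, 4)]   # diverged after (4, 2)
theirs = [(4, 1), (4, 2), (5, 3), (5, 4)]   # new primary's history
common, undo = find_common_point(ours, theirs)
print(common)  # (4, 2)
print(undo)    # [(4, 3), (4, 4)]
```

In the real server the undo step is more involved (for an insert it deletes the document, for an update or delete it refetches the document from the sync source), but the backwards walk to a shared entry is the core idea, as the log lines below show.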
```
2018-03-30T12:08:37.160+0000 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 4, timestamp: Mar 30 12:03:52:139). source's GTE: (term: 5, timestamp: Mar 30 12:05:37:1) hashes: (3789163619674410187/3226093795606474294)
2018-03-30T12:08:37.160+0000 I REPL     [rsBackgroundSync] rollback 0
2018-03-30T12:08:37.160+0000 I REPL     [ReplicationExecutor] transition to ROLLBACK
2018-03-30T12:08:37.163+0000 I REPL     [rsBackgroundSync] beginning rollback
2018-03-30T12:08:37.163+0000 I REPL     [rsBackgroundSync] rollback 1
2018-03-30T12:08:37.164+0000 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
2018-03-30T12:08:37.166+0000 I REPL     [rsBackgroundSync] rollback our last optime:   Mar 30 12:03:52:139
2018-03-30T12:08:37.166+0000 I REPL     [rsBackgroundSync] rollback their last optime: Mar 30 12:08:17:1c5
2018-03-30T12:08:37.166+0000 I REPL     [rsBackgroundSync] rollback diff in end of log times: -265 seconds
2018-03-30T12:08:37.269+0000 I REPL     [rsBackgroundSync] rollback common point is (term: 4, timestamp: Mar 30 12:03:46:d2)
2018-03-30T12:08:37.269+0000 I REPL     [rsBackgroundSync] rollback 3 fixup
2018-03-30T12:08:38.240+0000 I REPL     [rsBackgroundSync] rollback 3.5
2018-03-30T12:08:38.240+0000 I REPL     [rsBackgroundSync] Setting minvalid to (term: 5, timestamp: Mar 30 12:08:17:1c5)
2018-03-30T12:08:38.241+0000 I REPL     [rsBackgroundSync] rollback 4 n:1
2018-03-30T12:08:38.241+0000 I REPL     [rsBackgroundSync] rollback 4.6
2018-03-30T12:08:38.241+0000 I REPL     [rsBackgroundSync] rollback 4.7
2018-03-30T12:08:38.391+0000 I REPL     [rsBackgroundSync] rollback 5 d:9997 u:0
2018-03-30T12:08:38.391+0000 I REPL     [rsBackgroundSync] rollback 6
2018-03-30T12:08:38.394+0000 I REPL     [rsBackgroundSync] rollback done
2018-03-30T12:08:38.396+0000 I REPL     [rsBackgroundSync] rollback finished
```
ROLLBACK data
Where do these 9997 recovered documents go? MongoDB writes them as BSON files under the rollback directory in the dbpath. The recovered files are named with the namespace as the prefix and the date and time as the suffix. Since they are in BSON format, we need to convert them to JSON to analyze them and plan the next course of action. In our case, the testData collection’s rollback data looks as follows:
```
root@105b6df757d7:/# cd /data/db
root@105b6df757d7:/data/db# ls -l rollback/
total 324K
-rw-r--r-- 1 mongodb mongodb 323K Mar 30 12:08 admin.testData.2018-03-30T12-08-38.0.bson
```
```
root@105b6df757d7:/data/db/rollback# bsondump admin.testData.2018-03-30T12-08-38.0.bson > rollback.json
2018-03-30T12:13:00.033+0000    9997 objects found
root@105b6df757d7:/data/db/rollback# head rollback.json
{"_id":{"$oid":"5abe279f97044083811b5975"},"x":15004.0}
{"_id":{"$oid":"5abe279f97044083811b5976"},"x":15005.0}
{"_id":{"$oid":"5abe279f97044083811b5977"},"x":15006.0}
{"_id":{"$oid":"5abe279f97044083811b5978"},"x":15007.0}
{"_id":{"$oid":"5abe279f97044083811b5979"},"x":15008.0}
{"_id":{"$oid":"5abe279f97044083811b5980"},"x":15009.0}
{"_id":{"$oid":"5abe279f97044083811b5981"},"x":15010.0}
```
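Because bsondump emits one JSON document per line, the converted file is easy to inspect with a small script before deciding whether to re-insert the rolled-back documents. A minimal sketch, using a couple of embedded sample lines in place of the real rollback.json:

```python
import json

# Two sample lines standing in for the real rollback.json produced by
# bsondump (one JSON document per line).
sample = '''{"_id":{"$oid":"5abe279f97044083811b5975"},"x":15004.0}
{"_id":{"$oid":"5abe279f97044083811b5976"},"x":15005.0}'''

# Parse each line and summarize the range of rolled-back values.
docs = [json.loads(line) for line in sample.splitlines()]
xs = [d["x"] for d in docs]
print(len(docs), min(xs), max(xs))  # 2 15004.0 15005.0
```

In practice you would read the file with `open("rollback.json")` instead of the embedded sample, then decide whether the documents should be merged back into the collection or discarded.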
That’s it? Now check the count of the testData collection on node A:
```
my-mongo-set:PRIMARY> db.testData.count()
15003
```
So the 9997 records that were rolled back into the rollback directory were also removed from the collection itself. This ensures data consistency throughout the replica set.
How to avoid ROLLBACK – writeConcern
The default writeConcern in a replica set is w:1, i.e., when a client writes to the replica set, it receives an acknowledgment from the PRIMARY alone and does not wait for acknowledgments from the SECONDARY members. If you want to avoid the ROLLBACK scenario in your environment, you have to use {w:"majority"} or {w:n}, where 1 < n <= (number of members in your replica set). This ensures that the writes are propagated to that many members of the replica set before the acknowledgment is sent to the client, which greatly reduces the chance of ROLLBACK.
But please be careful not to give writeConcern too high a value, because it also affects write performance: the acknowledgment must be received from the number of members specified before the write returns. The value {w:"majority"} acknowledges that write operations have propagated to a majority of the voting nodes, including the primary, and is suitable for most environments.
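To see why {w:"majority"} is usually the sweet spot, it helps to work out what “majority” means for common replica set sizes. A quick back-of-the-envelope sketch (a majority of n voting members is floor(n / 2) + 1):

```python
# Number of acknowledgments w:"majority" waits for, given the count of
# voting members in the replica set: floor(n / 2) + 1.
def majority(voting_members):
    return voting_members // 2 + 1

for n in (3, 5, 7):
    print(n, "members -> majority =", majority(n))
```

On a 3-member set, for example, {w:"majority"} waits for 2 acknowledgments (the primary plus one secondary), so a write acknowledged this way has already reached a node that will survive any single-member failure, which is exactly what prevents it from being rolled back.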
ROLLBACK – Limitation
The main thing to note here is that mongod will not roll back more than 300MB of data. In such cases, you need to intervene manually on the instance to recover the data; mongod writes a message to that effect in mongod.log and waits for manual intervention.
Understanding this simple ROLLBACK background helps us decide what needs to be done with the rolled-back data. It also helps us avoid such scenarios, because data is data and is very important!