WiredTiger MongoDBThis article continues on from Part 1: Building “wt” and  “Part 2: wt dump” to show how to extract any of your MongoDB documents directly from WiredTiger’s raw data files. It’ll also show how to take a peek into the index files. Lastly, it’ll show how to also look in the WT transaction log to see updates made since the latest checkpoint.

[porto_content_box align=”left”]⚠ Warning: the wt dump tool usually opens files in read-write mode, even for commands you’d think would be read-only. It will automatically step through its normal recovery process most of the time, so it may change files.
Until you know its effects on data files do not use it on your only copy of precious data – make a copy of the data directory and learn with the copy first.[/porto_content_box]

List up the Collections and Indexes

WiredTiger doesn’t name any of its data files according to the MongoDB object names, so as a first step you’ll have to extract a table of WT idents (=identifiers) vs. collection and index names.

The _mdb_catalog.wt file is not the top table in the WiredTiger storage engine’s own hierarchy of data sources. To the MongoDB layer of code though it is the complete definition of collection and index objects in the database. This includes both MongoDB system and user-made collections and indexes.

Dump WT ident vs Collections

As explained in the previous Part 2: wt dump blog post, you should reuse this command to save a tab-delimited file of WiredTiger ident(ifier)s vs collection names. I’ll call this file wt_ident_vs_collection_ns.tsv in this article.

Eg. if an ident is “collection-4-5841128870485063129” then there will be a file collection-4-5841128870485063129.wt that has the data of a MongoDB collection.

In case you have a vague memory of seeing strings like these somewhere whilst using the mongo shell, you probably have. These idents are the same as the ones shown in the “wiredTiger.uri” field of the db.collection.stats() command output.

Optional: Dump WT ident vs Indexes

The index WT idents can be dumped as well. The example below will save them to wt_vs_index_ns.tsv in three columns [WT ident, collection name, index name].

Looking at the Application (= MongoDB) Table Data

collection-*.wt and index-*.wt files contain the data of the MongoDB collections and indexes that you observe as a client connected to a MongoDB instance. Once you’ve identified the WT ident value for the collection or index you want to inspect you can use that in the URI argument to the wt dump command.

wt dump collections to *.bson files

The wt dump output of collection-* tables has a basic numeric type key and BSON data as the values.

Use wt dump -x in your terminal on the WT file for a collection and you will see, after the header section, records of the WT table in alternating lines of key and value.

Eg. in the example below the first key is numeric/binary value 0x81 and the first value is the one starting with binary bytes 0xce000000025f6964001e. The second key is 0x82, and its value starts with 0xce000000025f69640021. The values are BSON which were much longer in reality, I only trimmed it for readability at the moment.

FYI Using the alternative URI argument syntax “wt dump -x table:collection-X-XXXXXXXXXX” or “wt dump -x file:collection-X-XXXXXXXXXX.wt” will produce the same.

The key value in the collection-* tables isn’t needed to see the document content so print only every second line for the values using the following command:

The command above prints binary BSON in a hex string format with a newline separating each record. We can translate that hexadecimal back to the original binary using the xxd command utility with the “-r” and “-p” flags. (Note: Don’t combine as one “-rp” flag. It doesn’t work like most unix command’s short options.)

“wt read” a single record?

Sorry, you can’t read your MongoDB collection data with the “wt read” shell command. The blocker is trivial – as of WT v10.0.0 the “wt read” command only accepts plain text or its own “r” recordid numeric value as the lookup key value. The keys for the mongodb collections and indexes however are ‘q’ and ‘u’ types respectively. (Documentation: WT Schema Format types.)

In MongoDB, you might know you can use a showRecordId cursor option so it’s tempting to think that this is the same “r” type that wt read can currently accept, but unfortunately it is not. See the key_format=X value in the header of wt dump output samples to confirm.

If wt read (code = utilities/util_read.c) was modified to accept an -x argument so we could pass the hex strings we already see in wt dump output this issue would be solved. But as it isn’t, for now, you have to dump all records even if you just want one.

This is only a limitation in the shell. If you use the API from within a programming language, including the Python SWIG binding available, you should be able to read just a single record.

Looking at MongoDB’s Index Files

wt dump index-*.wt files

index-* WiredTiger tables have two different formats that are easy to see when using wt dump to look inside them – one with just keys, and one with both keys and values. The WT keys are binary values generated from MongoDB’s KeyString structure.

Below is an example of an index on {state: 1, process: 1} on a collection called config.locks. This is a case where there are no values in the WT table for the index. 

Below is an example of an {_id: 1} index on the collection test.rlru. This is a case when there are both WT keys and values.

Given the point of an index record lookup is to have a value that points to a record in the collection-X-XXXX WT table, you should be asking “How can that first index above be useful without values?”

The answer is the recordid is packed on the end of the key. You’ll notice in the first example they all have 0x04 as the third-last byte. This is how MongoDB packs a recordId when the value is between 0 and 2^10 – 1 I believe. See KeyString appendRecordId() if you want to get into it further.

By the way, an index’s field names are constant and thus irrelevant to sort order, so they’re not part of the keystrings.

Writing about the KeyString format even in just the shortest summary would take a whole blog post. So I’ll just punch out some translations of the binary above as a teaser and stop there.

Keystring 293c436f6e66696753657276657200040008 =>

  • 0x29 = type marker for numeric value 0.
  • 0x3c type marker for (UTF-8?) string
  • 36f6e66696753657276657200 hex of string “ConfigServer” plus tailing null
  • 0x04 type marker for a recordId
  • 0x0008 => binary 000 + binary 0000000001 + binary 000 = (recordId) value 1

Keystring 2b043c436f6e66696753657276657200040060 =>

  • 0x2b = type marker, positive integer in small range (< 2^7?)
  • 04 = binary 0000010 + bit 0 = value 2
  • 0x3c type marker for (UTF-8?) string
  • 36f6e66696753657276657200 hex of string “ConfigServer” plus tailing null
  • 0x04 type marker for a recordId
  • 0x0060 => binary 000 + binary 00.0000.1100 + binary 000 = (recordId) value 12

Keystring 2b0404 =>

  • 0x2b = type marker, positive integer in small range (< 2^7?)
  • 04 = binary 000010 + 00 = value 2, not sure what the tailing two bytes are for.
  • 0x04 type marker to indicate the following value is recordId?

Value 001001 =>

  • 0018 = binary 000 + binary value 00 0000 0011 = 3, plus 3 bits 000 on the end. T.b.h. I don’t always know how to translate this one, but commonly 3 bits at the beginning and end of the recordId format are used as a byte size indicator.

If looking for matching records in the collection-X-XXX table look for key (0x80 + the recordid) value from the index file. Eg. 2 -> 0x82; 12 -> 0x9c. For some reason 0x0 – 0x80 seems to be reserved and so key values in the collection-X-XXX.wt files are all incremented by 0x80 higher than the recordId value in the index-X-XXX.wt records.

In the end though, as we don’t have an ability to use wt read, all this poking around in the indexes from a shell command can only be for satisfying curiosity. Not for making single-record access fast.

Looking at the WT Transaction Log

A WiredTiger checkpoint saves a copy of documents in all collections and indexes as they were at one exact point in time (the time the checkpoint is started). MongoDB will call for a checkpoint to be made once per minute by default.

Without something else being saved to disk, a sudden crash would mean that restores/restarts could only revert to the last checkpoint. The classic database concept of write-ahead log is the solution to this of course. In WiredTiger this is provided by the transaction log, often just called “log” in its own documentation. Or as the documentation also says it adds “commit-level” durability to checkpoint durability.

At restart, whether it is after a perfectly normal shutdown or a crash, WiredTiger will read and replay the writes it finds in its transaction log onto the tables, in memory. In time, when the MongoDB layer requests a new checkpoint be created, the in-memory restored data will be saved to disk.

When you use wt dump it’s not easy to say if you’re looking at the collection (or index) restored as of the last checkpoint or with the transaction log “recovered” (read and applied) as well. If the “-R” global option of the wt command is used then yes log is recovered; if the opposite “-r” option is used then no. But which is in effect if you specify neither is unclear. Also, I’ve seen comments or diagnostic messages that suggest the -r option isn’t perfect.

Two Layers of Recovery

WiredTiger would, by default, keep no transaction log for the tables it manages for the application embedding it. It’s only the request of the application that engages it.

  • If you are using a standalone mongod MongoDB code will enable WT log for every collection and index created
    • Unless the “recoverFromOplogAsStandalone” server parameter is used. This is a trick that is part of point-in-time recovery of hot backups.
  • When replication is enabled (the typical case) only the WT tables for “local” db collections and indexes get “log=(enabled=true)” in their WT config string made by WiredTigerRecordStore::generateCreateString() in wiredtiger_record_store.cpp, and something similar in wiredtiger_index.cpp.

When a mongod node restarts with an existing data directory the WiredTiger library will run recovery. It can’t be stopped (at least not by the MongoDB user). This is not the end of the story though. When replication was used it means only the “local” db collections, in particular the oplog, is restored. The replication system code has a ReplicationRecovery class that is used next, and this will apply updates made since the last durable checkpoint time from the oplog to the user and system collections they’re supposed to be in. Only after that occurs will recovery be complete and the db server will make the data available to clients.

ReplicationRecovery allows MongoDB to do rollbacks (at least to the checkpoint time) and also to trim off ops in the last replication batch application. It’s an ugly point but MongoDB applies replication ops in small batches in parallel to improve performance; this only happens on secondaries usually of course, but also after a restart. If all the writes in the batch finish that is fine, but if they don’t some writes may be missing which is a clear consistency violation. So the unfinished batch’s writes should all be cleared, back to the last saved oplogTruncateAfterPoint value. repl.oplogTruncateAfterPoint is a single-document collection in the local db. It exists only for this purpose as far as I know.

Diagram for Typical Restart Recovery

“Typical” = restarting a node that is a replica. Doesn’t matter if it was a primary or secondary, or even the only node in the replica set.

WiredTiger’s recovery

WT table for local.oplog.rs updated

ReplicationRecovery

Trim oplog docs where “ts” > oplogTruncateAfterPoint

oplog application

Pseudo code:
db.oplog.rs.find({ts > durable checkpoint time}).forEach(function(d) {
     //re-apply write to intended collection in “test”, “mydb”, “admin”, etc. db
     applyOps(opdoc);
}

Ready!

wt printlog

You can use the wt printlog command to see what is in the log currently in the WiredTigerLog.<nnnnnn> files in the journal/ subdirectory. If you decode the write ops in there you can understand what document versions will be restored by MongoDB when it restarts.

Preparation to Using wt printlog – Map WT idents to fileid

Until now you’ve learned that every MongoDB collection and index is in its own WT table file, which means needing to learn what the WT ident(ifier) is to find the right raw WT file to look into.

There’s another small integer id value, a.k.a. fileid, in WT metadata / WT table config strings for each file too. I suppose it saves space; at any rate, the WT transaction log references files only by this number. This means you’ll have to build a handy mapping or list of which WT ident is which fileid. Use the following command to create a file I call wt_ident_vs_fileid.tsv.

Subscribe
Notify of
guest

0 Comments
Inline Feedbacks
View all comments