Merging Empty Chunks in MongoDB

I recently wrote about one of the problems we can encounter while working with sharded clusters: Finding Undetected Jumbo Chunks in MongoDB. Another issue we might run into is managing empty chunks.

Chunk Maintenance

As we know, the autoSplitter process partitions chunks when they become too big, and the balancer process moves chunks to keep the distribution even across all shards. So as data grows, chunks are split and perhaps moved to other shards, and all is well.

But what happens when we delete data? Some chunks may now be empty, and if we delete a lot of data, a large share of the chunks can end up that way. This is a common situation for sharded collections with a TTL index, where documents are deleted continuously as they expire.

Potential Issues

One of the potential problems when dealing with a high percentage of empty chunks is uneven data distribution. The balancer will make sure the number of chunks on each shard is roughly the same, but it does not take into account whether the chunks are empty or not. So you might end up with a cluster that looks balanced, but in reality, a few shards have way more data than the rest.
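To make the imbalance concrete, here is a minimal sketch in plain JavaScript. The chunk list, the shard names, and the sizeBytes values are invented for illustration, and summarizeShards is a hypothetical helper, not something MongoDB provides.

```javascript
// Hypothetical sample data: each entry stands for one chunk.
// sizeBytes is made up here; on a real cluster it would have to be measured.
const chunks = [
  { shard: "shardA", sizeBytes: 64 * 1024 * 1024 },
  { shard: "shardA", sizeBytes: 60 * 1024 * 1024 },
  { shard: "shardB", sizeBytes: 0 },
  { shard: "shardB", sizeBytes: 0 },
];

// Group chunks by shard and report chunk count, empty-chunk count, and total bytes.
function summarizeShards(chunkList) {
  const summary = {};
  for (const chunk of chunkList) {
    if (!summary[chunk.shard]) {
      summary[chunk.shard] = { chunks: 0, emptyChunks: 0, bytes: 0 };
    }
    const entry = summary[chunk.shard];
    entry.chunks += 1;
    entry.bytes += chunk.sizeBytes;
    if (chunk.sizeBytes === 0) entry.emptyChunks += 1;
  }
  return summary;
}

const shardSummary = summarizeShards(chunks);
console.log(JSON.stringify(shardSummary, null, 2));
```

Both shards hold two chunks, so a count-based view calls them balanced, yet all of the data sits on one of them.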

To deal with this problem, the first step is to identify empty chunks.

Identifying Empty Chunks

To illustrate this, let’s consider a client’s collection that is sharded by the “org_id” field. Let’s assume the collection currently has the following chunk ranges:

minKey → 1
1 → 5
5 → 10
10 → 15
15 → 20
...

We can use the dataSize command to determine the size of a chunk. This command receives the chunk range as part of the arguments. For example, to check how many documents we have on the third chunk, we would run:

db.runCommand({ dataSize: "mydatabase.clients", keyPattern: { org_id: 1 }, min: { org_id: 5 }, max: { org_id: 10 } })

This returns a document like the following:

{
    "size" : 0,
    "numObjects" : 0,
    "millis" : 30,
    "ok" : 1,
    "operationTime" : Timestamp(1641829163, 2),
    "$clusterTime" : {
        "clusterTime" : Timestamp(1641829163, 3),
        "signature" : {
            "hash" : BinData(0,"LbBPsTEahzG/v7I6oe7iyvLr/pU="),
            "keyId" : NumberLong("7016744225173049401")
        }
    }
}
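As a rough sketch of how this check could be wrapped for reuse, the helpers below build the command document for a given range and decide whether a reply describes an empty chunk. The names buildDataSizeCommand and isEmptyChunk are hypothetical, and the reply is an abbreviated copy of the one shown above; nothing here talks to a server.

```javascript
// Build the dataSize command document for one chunk range.
// All four values are supplied by the caller; nothing is sent to a server here.
function buildDataSizeCommand(ns, keyPattern, min, max) {
  return { dataSize: ns, keyPattern: keyPattern, min: min, max: max };
}

// Treat a chunk as empty when the command succeeded and reported no data.
function isEmptyChunk(result) {
  return result.ok === 1 && result.size === 0 && result.numObjects === 0;
}

const cmd = buildDataSizeCommand(
  "mydatabase.clients",
  { org_id: 1 },
  { org_id: 5 },
  { org_id: 10 }
);
console.log(JSON.stringify(cmd));

// Abbreviated copy of the reply shown above.
const reply = { size: 0, numObjects: 0, millis: 30, ok: 1 };
console.log(isEmptyChunk(reply));
```

In the shell, the command document would be passed to db.runCommand() and the reply handed to the second helper.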

If the size is 0 we know we have an empty chunk, and we can consider merging it with either the chunk that comes right after it (with the range 10 → 15) or the one just before it (with the range 1 → 5).

Merging Chunks

Assuming we take the first option, here is the mergeChunks command that helps us get this done:

db.adminCommand( {
   mergeChunks: "mydatabase.clients",
   bounds: [ { org_id: 5 },
             { org_id: 15 } ]
} )

The chunk ranges would now be as follows:

minKey → 1
1 → 5
5 → 15
15 → 20
...
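The effect of the merge on the list of ranges can be reproduced with a small in-memory sketch. This only models the bookkeeping; mergeWithNext is a hypothetical helper, and "minKey" is a plain string placeholder rather than the BSON MinKey type.

```javascript
// Chunk ranges from the example, as [min, max) pairs on org_id.
// "minKey" is a string placeholder here, not the BSON MinKey type.
const ranges = [
  { min: "minKey", max: 1 },
  { min: 1, max: 5 },
  { min: 5, max: 10 },
  { min: 10, max: 15 },
  { min: 15, max: 20 },
];

// Return a new list in which the chunk at `index` is merged with the chunk after it.
function mergeWithNext(list, index) {
  const current = list[index];
  const next = list[index + 1];
  if (!current || !next || current.max !== next.min) {
    throw new Error("chunks must be adjacent to be merged");
  }
  const merged = { min: current.min, max: next.max };
  return [...list.slice(0, index), merged, ...list.slice(index + 2)];
}

// Merge the third chunk (5 to 10) with the one right after it (10 to 15).
const afterMerge = mergeWithNext(ranges, 2);
afterMerge.forEach((r) => console.log(r.min + " -> " + r.max));
```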

One caveat is that the chunks we want to merge might not be on the same shard. If that is the case we need to move them together first, using the moveChunk command.
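A minimal sketch of that decision, assuming two adjacent chunks are described by their bounds and shard name: planMerge is a hypothetical helper and the shard names are made up. It only builds the command documents, which would then be run with db.adminCommand().

```javascript
// Given two adjacent chunks (left.max equals right.min), return the admin
// commands needed to merge them. If they live on different shards, a
// moveChunk for the right-hand chunk comes first.
function planMerge(ns, left, right) {
  const commands = [];
  if (left.shard !== right.shard) {
    commands.push({ moveChunk: ns, bounds: [right.min, right.max], to: left.shard });
  }
  commands.push({ mergeChunks: ns, bounds: [left.min, right.max] });
  return commands;
}

// Hypothetical shard names, for illustration only.
const left = { min: { org_id: 5 }, max: { org_id: 10 }, shard: "shard01" };
const right = { min: { org_id: 10 }, max: { org_id: 15 }, shard: "shard02" };

const plan = planMerge("mydatabase.clients", left, right);
plan.forEach((c) => console.log(JSON.stringify(c)));
```

Moving the right-hand chunk to the left-hand chunk's shard is an arbitrary choice in this sketch; moving the emptier of the two is usually the cheaper option.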

Putting it All Together

Following the above logic, we can iterate through all the chunks in shard key order and check their size. If we find an empty chunk, we merge it with the chunk just before it. If the chunks are not on the same shard, we move them together. The following script can be used to print all the commands required:

var mergeChunkInfo = function(ns){
    // note: this query assumes config.chunks documents carry an "ns" field; newer MongoDB releases identify the collection by uuid instead
    var chunks = db.getSiblingDB("config").chunks.find({"ns" : ns}).sort({min:1}).noCursorTimeout();
    // some counters for overall stats at the end
    var totalChunks = 0;
    var totalMerges = 0;
    var totalMoves = 0;
    var previousChunk = {};
    var previousChunkInfo = {};
    var ChunkJustChanged = false;

    chunks.forEach(
        function printChunkInfo(currentChunk) {

        var db1 = db.getSiblingDB(currentChunk.ns.split(".")[0]);
        // the shard key pattern is needed by dataSize
        var key = db.getSiblingDB("config").collections.findOne({_id:currentChunk.ns}).key;
        // run the size checks against secondaries
        db1.getMongo().setReadPref("secondary");
        var currentChunkInfo = db1.runCommand({dataSize:currentChunk.ns, keyPattern:key, min:currentChunk.min, max:currentChunk.max, estimate:true });
        totalChunks++;

        // if the current chunk is empty and the chunk before it was not merged in the previous iteration (or was the first chunk) we have candidates for merging
        if(currentChunkInfo.size == 0 && !ChunkJustChanged) {
          // if the chunks are contiguous
          if(JSON.stringify(previousChunk.max) == JSON.stringify(currentChunk.min) ) {
            // if they belong to the same shard, merge with the previous chunk
            if(previousChunk.shard.toString() == currentChunk.shard.toString() ) {
              // mergeChunks is an admin command, so the generated line uses db.adminCommand
              print('db.adminCommand( { mergeChunks: "' + currentChunk.ns.toString() + '",' + ' bounds: [ ' + JSON.stringify(previousChunk.min) + ',' + JSON.stringify(currentChunk.max) + ' ] });');
              // after a merge or move, we don't consider the current chunk for the next iteration. We skip to the next chunk.
              ChunkJustChanged=true;
              totalMerges++;
            }
            // if they are contiguous but on different shards, both chunks need to be on the same shard before merging, so move the current one and don't merge for now
            else {
              // moveChunk is also an admin command
              print('db.adminCommand( { moveChunk: "' + currentChunk.ns.toString() + '",' + ' bounds: [ ' + JSON.stringify(currentChunk.min) + ',' + JSON.stringify(currentChunk.max) + ' ], to: "' + previousChunk.shard.toString() + '" });');
              // after a merge or move, we don't consider the current chunk for the next iteration. We skip to the next chunk.
              ChunkJustChanged=true;
              totalMoves++;
            }
          }
          else {
            // chunks are not contiguous (this shouldn't happen unless this is the first iteration)
            previousChunk=currentChunk;
            previousChunkInfo=currentChunkInfo;
            ChunkJustChanged=false;
          }
        }
        else {
          // if the current chunk is not empty or we already operated with the previous chunk let's continue with the next chunk pair
          previousChunk=currentChunk;
          previousChunkInfo=currentChunkInfo;
          ChunkJustChanged=false;
        }
      }
    )

    print("***********Summary Chunk Information***********");
    print("Total Chunks: "+totalChunks);
    print("Total Move Commands to Run: "+totalMoves);
    print("Total Merge Commands to Run: "+totalMerges);
}

We can invoke it from the mongo shell as follows:

mergeChunkInfo("mydb.mycollection")

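The script generates the commands needed to merge pairs of adjacent chunks where at least one is empty. Because a chunk that was just merged or moved is skipped for the rest of that pair, a single run cannot collapse a long run of empty chunks in one go; running the generated commands should cut the number of empty chunks roughly in half. Running the script multiple times will eventually get rid of all the empty chunks that have a chunk before them to merge into.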
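The halving behavior can be seen with a small simulation of the pairing logic. This is a simplified model, not the script itself: chunks are reduced to booleans (true meaning empty), every chunk is assumed to be on the same shard so no moves are needed, and mergePass is a hypothetical helper.

```javascript
// One pass of the pairing logic over an in-memory list (true = empty chunk).
// Simplification: all chunks are assumed to sit on the same shard, so no moves.
function mergePass(chunks) {
  const result = [];
  let justMerged = false;
  for (const isEmpty of chunks) {
    if (isEmpty && !justMerged && result.length > 0) {
      // Merge into the previous chunk; the result is empty only if both were.
      const previous = result.pop();
      result.push(previous && isEmpty);
      justMerged = true;
    } else {
      result.push(isEmpty);
      justMerged = false;
    }
  }
  return result;
}

const countEmpty = (list) => list.filter(Boolean).length;

// Made-up starting layout: 8 chunks, 6 of them empty.
let state = [false, true, true, true, true, false, true, true];
const emptyCounts = [];
let passes = 0;
while (true) {
  const next = mergePass(state);
  if (next.length === state.length) break; // nothing left to merge
  state = next;
  passes += 1;
  emptyCounts.push(countEmpty(state));
}
console.log("passes:", passes, "empty chunks after each pass:", emptyCounts.join(", "));
```

Note that an empty chunk at the very start of the key range never gets merged by this logic, because it has no previous chunk to merge into.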

Final Notes

Most people are aware of the problems with jumbo chunks; now we have seen how empty chunks can also be problematic in certain scenarios.

It is a good idea to stop the balancer before attempting any operation that modifies chunks (like merging the empty chunks). This ensures that no conflicting operations happen at the same time. Don’t forget to re-enable the balancer afterward.
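One way to avoid forgetting that last step is to wrap the work in a helper that always turns the balancer back on. The sketch below assumes an object exposing stopBalancer() and startBalancer(), as the shell's sh helper does; withBalancerStopped is a hypothetical name, and fakeSh is a stand-in that only records calls so the example can run outside the shell.

```javascript
// Run `work` with the balancer stopped, and turn it back on afterwards even if
// `work` throws. `shell` is expected to expose stopBalancer() and startBalancer(),
// as the mongo shell's `sh` helper does.
function withBalancerStopped(shell, work) {
  shell.stopBalancer();
  try {
    return work();
  } finally {
    shell.startBalancer();
  }
}

// Stand-in for `sh` so the sketch can run outside the shell; it only records calls.
const calls = [];
const fakeSh = {
  stopBalancer: () => calls.push("stopBalancer"),
  startBalancer: () => calls.push("startBalancer"),
};

withBalancerStopped(fakeSh, () => calls.push("run merge commands"));
console.log(calls.join(" -> "));
```

This sketch restarts the balancer unconditionally; if the balancer was already disabled on purpose before the maintenance, that state would need to be checked and preserved instead.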
