Refining Shard Keys in MongoDB 4.4: A First Stab at Correcting Bad Shard Key Selections

MongoDB is designed to scale and handle very large data sets. You can scale vertically in your replica sets by adding resources such as CPUs, RAM, and disks. However, many applications today require more resources than can be added to a single server. 

The next level of scaling in MongoDB happens with sharding. Sharding allows for very large data sets and/or high throughput and is accomplished by spreading the data and the associated resource needs across multiple hosts.

Those data sets often include a variety of data types – from structured, to semi-structured, to unstructured. The ability to handle this variety and scale, provide redundancy and high availability, AND continue to be developer-friendly is why MongoDB has grown into the most used NoSQL database in the market. Let’s face it, developers love its schemaless nature and ease of use.

But that schemaless nature can sometimes cause problems with the ability to scale in a way that provides the high performance today’s applications demand. When selecting shard keys, one must think about the data (as it exists now and as it may become), the access patterns, and the expected growth of the collections to be sharded. Therefore MongoDB users should consider those schema-like elements when selecting shard keys in order to architect the system optimally from the start.  

The issues that arise when bad shard keys are selected can at best be considered suboptimal, and at worst, downright terrible for performance. There are many blogs and webinars about choosing the best possible shard keys. See the bottom of this blog for links to a few of those. 

Now we’ll talk about how to refine your shard key for collections that already exist and already have data.

Most of the time, this is important when you have very large collections that are critical to the overall performance of your application (and thus to your business). After all, if there is no data or not that much data, you could drop and recreate the collection using a different shard key and reload the data with just a little downtime. So let’s go ahead and take a look at how you can refine a shard key. 

In earlier versions of MongoDB, once a shard key was chosen and implemented, you were stuck with it – for better or worse. You could not change that shard key at all. 

Until MongoDB v4.2 you could not even run an update CRUD operation that changed the values of the field or fields that the shard key included. The shard key and its values were immutable. Luckily, with MongoDB 4.2 this changed and shard key field values can now be changed.

Despite this, bad shard keys continued to cause performance problems for applications, developers, and others. Users have long wished for a way to change their shard key and avoid the difficult situations and performance problems that come along with bad shard keys.

MongoDB 4.4 Refinable Shard Keys

So, along comes MongoDB 4.4 and for the first time, there was something to help mitigate the effects of bad shard keys. MongoDB 4.4 included a new feature that allowed you to refine your existing shard keys for a collection. This was done via the new refineCollectionShardKey command. Additional fields could be added to the existing shard key, thus creating a new shard key. By doing this, chunks of data (that previously did not have enough distinctiveness or had too much frequency) could now use the additional data from the added fields to be subdivided.  

One of the biggest problems bad shard keys cause is the creation of jumbo chunks. These are chunks that grow beyond the 64 MB default chunk size and are indivisible.

Those jumbo chunks cannot be split and so they will not be moved by the balancer.  This can create situations where operations all go to the same shard (hotspotting) and where more data resides on some shards than others, causing disk size imbalances. 

Both of these situations negatively impact performance. By refining existing shard keys, data and existing chunks can become more divisible – broken down into smaller chunks. This can help relieve the pressure caused by hotspotting and disk imbalances.

Requirements for Refining Shard Keys

In order to use the refineCollectionShardKey command, the following prerequisites must be met:

  1. Your cluster must be at least version 4.4 and also have a feature compatibility version of 4.4.
  2. The new shard key must still have the same prefix –  that is, it must start with the existing shard key.
  3. The new fields added to the shard key can only be added as suffixes.
  4. A new index must be created to support the modified shard key.
  5. Before issuing the refineCollectionShardKey command the balancer must be stopped.
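The prefix and suffix rules above can be checked mechanically before touching the cluster. What follows is only a minimal sketch in plain JavaScript: `isValidRefinement` is a hypothetical helper name, not part of MongoDB, and it also applies the single-hashed-field limit discussed later in this post.

```javascript
// Hypothetical helper (not a MongoDB API): checks that a proposed shard key
// keeps the existing key as an unchanged prefix and only appends new fields.
function isValidRefinement(currentKey, refinedKey) {
  const current = Object.entries(currentKey);
  const refined = Object.entries(refinedKey);

  // At least one field must be added as a suffix.
  if (refined.length <= current.length) return false;

  // Every existing field must keep its position and its type (1 or "hashed").
  const prefixMatches = current.every(
    ([field, type], i) => refined[i][0] === field && refined[i][1] === type
  );
  if (!prefixMatches) return false;

  // A compound shard key may contain at most one hashed field.
  const hashedCount = refined.filter(([, type]) => type === "hashed").length;
  return hashedCount <= 1;
}
```

For example, `isValidRefinement({ index: "hashed" }, { index: "hashed", _id: 1 })` passes, while putting `_id` first fails the prefix check. The sketch assumes field names are not integer-like strings, since JavaScript would reorder those keys.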

Steps and Commands for Refining Shard Keys

Let’s take a look at the steps and commands required to refine a shard key.  Then we will take a look at a simple example with commands and results.

Step 1. Check the current database version. The required version is v4.4 or greater.

Step 2. Check the current feature compatibility version (fCV). The required version is v4.4 or greater.
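As a rough sketch of Steps 1 and 2, the version strings can be compared with a small helper. `meetsMinimumVersion` is a hypothetical name used only for illustration; the shell calls in the comments are the standard ways to read the server version and the fCV.

```javascript
// Hypothetical helper: true when a dotted version string such as "4.4.6"
// is at least the required major.minor release (default "4.4").
function meetsMinimumVersion(version, required = "4.4") {
  const [major, minor] = version.split(".").map(Number);
  const [reqMajor, reqMinor] = required.split(".").map(Number);
  return major > reqMajor || (major === reqMajor && minor >= reqMinor);
}

// In the mongo shell, the inputs would come from:
//   db.version()                         // server version string
//   db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })
//     .featureCompatibilityVersion.version   // fCV string, e.g. "4.4"
```

Comparing the parts numerically rather than as strings matters here, because a plain string comparison would rank "4.10" below "4.4".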

*Note* The fCV check must be run on a mongod node, so connect to a member of a shard replica set as a local user.

Step 3. Stop the Balancer  (on mongos)

Step 4. Set the feature compatibility version (fCV) if needed

*Note* Setting the fCV for a sharded cluster must happen on the mongos.

Step 5. Create the index to support the desired REFINED shard key

Confirm that the index was created as desired:
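One way to confirm the index, sketched under the assumption that you feed it the array returned by `getIndexes()`: the hypothetical helper `hasSupportingIndex` below reports whether some index starts with the refined shard key fields in the same order. The collection name in the comments is the one used in the example later in this post.

```javascript
// Hypothetical helper: given the array returned by getIndexes(), report
// whether some index starts with the refined shard key fields, in order.
function hasSupportingIndex(indexes, refinedKey) {
  const wanted = Object.entries(refinedKey);
  return indexes.some((index) => {
    const have = Object.entries(index.key);
    return (
      have.length >= wanted.length &&
      wanted.every(
        ([field, type], i) => have[i][0] === field && have[i][1] === type
      )
    );
  });
}

// In the mongo shell, for the test.mailbox collection:
//   db.mailbox.createIndex({ index: "hashed", _id: 1 })
//   hasSupportingIndex(db.mailbox.getIndexes(), { index: "hashed", _id: 1 })
```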

Step 6. Run the refineCollectionShardKey  command to modify the existing shard key

In the key document, a value of 1 indicates range sharding on that field.
A value of "hashed" indicates a hashed field, which can be used only *IF* the shard key does not already include a hashed field.
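To make the shape of the command concrete, here is a hedged sketch that assembles the document passed to `refineCollectionShardKey`. The helper name `buildRefineCommand` is hypothetical; the `refineCollectionShardKey` and `key` fields are the ones the command itself expects.

```javascript
// Hypothetical helper: builds the command document for refineCollectionShardKey.
// `namespace` is "<database>.<collection>"; `suffix` holds the fields to append.
function buildRefineCommand(namespace, currentKey, suffix) {
  for (const field of Object.keys(suffix)) {
    if (field in currentKey) {
      throw new Error(`field "${field}" is already part of the shard key`);
    }
  }
  // Spread order keeps the existing key first, so it remains the prefix.
  return {
    refineCollectionShardKey: namespace,
    key: { ...currentKey, ...suffix },
  };
}

// In the mongo shell, connected to mongos, the result goes to the admin database:
//   db.adminCommand(
//     buildRefineCommand("test.mailbox", { index: "hashed" }, { _id: 1 })
//   )
```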

Step 7. Check shard status to ensure that the shard key has changed

Step 8. Restart the Balancer  (on mongos)
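Because Steps 3 and 8 bracket everything else, it can help to make the restart automatic. The sketch below is an assumption about how one might script this, not part of the original procedure: `withBalancerStopped` is a hypothetical wrapper, and the command runner is injected so the sketch does not depend on a live cluster. `balancerStop` and `balancerStart` are the admin commands behind the `sh.stopBalancer()` and `sh.startBalancer()` shell helpers.

```javascript
// Hypothetical wrapper: stops the balancer, runs `work`, and always restarts
// the balancer afterwards. In the mongo shell (connected to mongos), the
// runner could be: (cmd) => db.adminCommand(cmd)
function withBalancerStopped(runAdminCommand, work) {
  runAdminCommand({ balancerStop: 1 });
  try {
    return work(runAdminCommand);
  } finally {
    // Runs even if `work` throws, so the balancer is not left disabled.
    runAdminCommand({ balancerStart: 1 });
  }
}

// A real script would also inspect each reply's `ok` field, since
// db.adminCommand() can report failure in the reply instead of throwing.
```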

Now Let’s Look at a Simple Example

Collection Name: test.mailbox 

Current Shard Key: { "index" : "hashed" } 

Desired Shard Key: { "index" : "hashed", "_id" : 1 } 

Why:  Assume that the current shard key has too much frequency and is not distinct enough – similar to “Smith” as a last name – and is therefore creating jumbo chunks and causing severe disk imbalances. This is causing performance to tank.
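To put a number on "too much frequency", one option is to measure how large a share of a sample the most common shard key value takes up. This is a sketch only: `mostFrequentValue` is a hypothetical helper, and it works on an in-memory sample rather than the full collection.

```javascript
// Hypothetical helper: finds the most common value of `field` in a sample of
// documents and the fraction of the sample that shares it.
function mostFrequentValue(docs, field) {
  const counts = new Map();
  for (const doc of docs) {
    const value = doc[field];
    counts.set(value, (counts.get(value) || 0) + 1);
  }
  let top = { value: undefined, share: 0 };
  for (const [value, count] of counts) {
    const share = count / docs.length;
    if (share > top.share) top = { value, share };
  }
  return top;
}

// On the server, a similar ranking is available with the aggregation stage:
//   db.mailbox.aggregate([{ $sortByCount: "$index" }])
```

A share close to 1 means most documents carry the same shard key value, which is the "Smith" situation: those documents cannot be split into separate chunks by that key alone.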

Sample Documents in the “mailbox” collection:

Background: Currently only “index” and “_id” appear to be common fields in all docs. Therefore we will build our new shard key from the original shard key { "index" : "hashed" } plus a suffix of “_id” as a regular (ranged) field. It cannot be a hashed field, because you can only have a single hashed field in your compound index.

Step 1. Check the current version 

Step 2. Check  the current feature compatibility version (fCV)

Step 3. Stop the Balancer    (on mongos)

Step 4. Set the feature compatibility version (fCV) if needed

*Remember* Setting the fCV for a sharded cluster must happen on the mongos.

Confirm that fCV changed – 

And now let’s check the fCV value on the mongod PRIMARY again

Step 5. Create the index to support the desired REFINED shard key

Confirm that the index was created as desired

Step 6. Run the refineCollectionShardKey  command to modify the existing shard key

Step 7. Check shard status to ensure that the shard key has changed

Step 8. Restart the Balancer   (on mongos)

Sample Expected mongod and mongos Log Entries Related to Resharding

*Note* Just putting a couple here from the mongod and the mongos:

=mongod

=mongos

Tips For Refining Shard Keys

  1. Refining shard keys involves adding a field or fields to the existing shard key as a suffix. This means the refined shard key is a compound shard key, and any index that supports it is a compound index.
  2. Starting in MongoDB 4.4 you can shard using a Compound Hashed Index.
  3. You can only have a single hashed field in your compound index.
  4. Know your data and your collection “structure”. Do a findOne(). Look at older data (1st 5 docs) and latest data (last 5 docs) to make sure that the structure of your data has not changed significantly over time. Oops, not always “schemaless” after all, is it? 
  5. Prior to 4.4, all shard key fields must be populated. In 4.4 and above, sharded collections can have documents with missing shard key fields. 
  6. In general, select fields that will be used in the majority of your operations and queries.
  7. When selecting fields to add as suffixes to your existing shard keys, remember to take into account the main tenets that are important to effective and performant sharding. Those are high cardinality, low frequency, and values that do not increase monotonically.
  8. Missing Shard Key Field Values – Sometimes you may need to add fields to the shard key which do not have values for those fields populated in all documents. In MongoDB 4.4 and above, that is ok. However, it is best to populate those fields as soon as possible in order to avoid situations that might compromise the desired data to be returned. An example would be write operations that target documents with missing shard key field values.
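For the missing-values tip above, a quick way to see how many documents still need backfilling is to query for documents that lack any of the shard key fields. The sketch below only builds the filter; `missingShardKeyFilter` is a hypothetical helper name, while `$or` and `$exists` are standard query operators.

```javascript
// Hypothetical helper: builds a query filter matching documents that lack
// at least one of the shard key fields.
function missingShardKeyFilter(shardKey) {
  return {
    $or: Object.keys(shardKey).map((field) => ({
      [field]: { $exists: false },
    })),
  };
}

// In the mongo shell, for the collection used in the example above:
//   db.mailbox.countDocuments(missingShardKeyFilter({ index: "hashed", _id: 1 }))
```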

Summary

So, this was the 1st step towards making it easier to correct sharding mistakes that may be negatively impacting your performance. In the upcoming MongoDB 5.0 release, an expected new feature is fully changeable shard keys – aka RESHARDING. 

Many are eagerly awaiting the ability to change the shard key completely and reshard. Looking forward to trying that out and digging more into all of the changes that are required at the code level to make this cloning-type operation occur.

What’s Next?

We are offering MongoDB enthusiasts a unique opportunity to get a sneak peek into MongoDB 5.0 at 11 am EDT TODAY! Tune in live to watch, and ask questions, at YouTube, Twitch, and LinkedIn.

Join our live stream event as Percona experts Akira Kurogane (MongoDB Product Owner) and Kimberly Wilkins (MongoDB Tech Lead) provide an advanced look at the production release of MongoDB 5.0. 

This release has been rumored to include several exciting new features, especially one particularly long-awaited improvement to sharding. We hope to see you there!

You can also keep an eye on our blog and social channels for upcoming sharding insight, or if you just can’t wait, check out our recent sharding presentations: 

Sources:

https://docs.mongodb.com/manual/core/sharding-shard-key/#missing-shard-key

https://docs.mongodb.com/manual/reference/command/refineCollectionShardKey/#mongodb-dbcommand-dbcmd.refineCollectionShardKey

