MongoDB 3.0 was recently released. Instead of focusing on what’s new – that is so easy to find, let’s rather talk about something that has not changed a lot since the early MongoDB days. This topic is sharding and most specifically: how to choose a good sharding key. Note that most of the discussion will also apply to MySQL, so if you are more interested in sharding than in MongoDB, it could still be worth reading.

When do you want to shard?

In general sharding is recommended with MongoDB as soon as any of these conditions is met:

  • #1: A single server can no longer handle the write workload.
  • #2: The working set no longer fits in memory.
  • #3: The dataset is too large to easily fit in a single server.

Note that #1 and #2 are by far the most common reason why people need sharding. Also note that in the MySQL world, #2 does not imply that you need sharding.

What are the properties of a good sharding key?

The starting point is that a cross-shard query is very expensive in a sharded environment. It is easy to understand why: the query has to be executed independently on several shards, and then results have to be merged.

With MongoDB, mongos will transparently route queries to the right shards and will automatically merge the results: this is very handy but the hidden complexity can also make you forget that you have executed a very poorly optimized query.

This is where the choice of a sharding key is so critical: choose the right key and most queries will be simple and efficient, choose a wrong one and you’ll have ugly and slow queries.

Actually a good sharding does not need to have tens of properties, but only two:

  • Insertions should be as much balanced as possible across all shards.
  • Each query should be able to be executed by retrieving data from as little shards as possible (ideally 1 shard).

Sounds quite easy, right? However depending on your use case, it may be quite difficult to find a good sharding key. Let’s look at a few examples.

Social Network

Say we have users who can be connected to other users, who can read or write posts and who have their own wall.

All 3 collections can become very large, so sharding all will be necessary over time.

For the user and wall collection the user_id field is an obvious choice and it actually meets both criteria for a good sharding key.

For the post collection, user_id also looks like an obvious choice, but if you think about how the collection is accessed for reads, you will realize that you will need to fetch it using its post_id, not its user_id (simply because a user can have multiple posts). If you shard by user_id, any read to a single post will be broadcast to all shards: this is clearly not a good option.

So using post_id is a better choice. However it only meets criteria #2: most posts are never updated, so all the writes are insertions that will go to a single shard. However the traffic on the post collection is strongly in favor of reads, so being able to speed up reads while not slowing down writes is probably an acceptable tradeoff.

Access Logs

The workload here is very specific: write-intensive and append-only.

Sharding by ObjectId is definitely a bad idea: while data can be easily spread across all shards, all writes will only go to one shard, so you will have no benefit compared to a non-sharded setup when it comes to scale the writes.

A better solution is to use a hash of the ObjectId: that way data AND writes will be spread across all shards.

Another good option would be to use another field in your documents that you know is evenly distributed across the whole dataset. Such field may not exist though, that’s why hashing the ObjectId is a more generic solution.

Ecommerce

MongoDB can be a good option to store a product catalog: being schemaless, it can easily store products with very different attributes.

To be usable such a catalog must be searchable. This means that many indexes need to be added, and the working set will probably grow very quickly. In this case your main concern is probably not to scale the writes, but to make reads as efficient as possible.

Sharding can be an option because if done properly, each shard will act as a coarse-grained index. Now the issue is to find which field(s) will evenly distribute the dataset. Most likely a single field will not be enough, you will have to use a compound sharding key.

Here I would say that there is no generic solution, but if the products are for instance targeted at either kid, woman or man and if you have several categories of products, a potential sharding key would be (target, category, sku).

Note that in this case, reading from secondaries may be enough to ensure good scalability.

Conclusion

As you can see, choosing a correct sharding key is not always easy: do not assume that because some apps are sharded by some fields, you should do the same. You may need a totally different strategy or sharding may even not be the right solution for your issues.

If you are interested in learning more about MongoDB, I will be presenting a free webinar on March 25 at 10 a.m. Pacific time. It will be an introduction to MongoDB for MySQL DBAs and developers. Register here if you are interested.