Horizontal Scaling in MySQL: Sharding

In a previous post, A Horizontal Scalability Mindset for MySQL, I discussed the concerns around growing individual MySQL instances too large and some basic strategies:

  • Optimizing/minimizing size with proper data types
  • Removing unused/duplicate indexes
  • Keeping your Primary Keys small
  • Pruning data

Finally, if those methods have been exhausted, I touched on horizontal sharding as the approach to keep individual instances at a reasonable size. When discussing my thoughts across our internal teams, there was lots of feedback that we needed to dive into the sharding topic in more detail. This post aims to give more theory and considerations around sharding along with a lightweight ProxySQL sample implementation.

What is Sharding?

Sharding is a word that is frequently used but is often misunderstood. Classic vertical scaling (often just called “scaling”) is when you increase the resources on a single server to handle more data and/or load. Horizontal scaling (aka sharding) is when you actually split your data into smaller, independent buckets and keep adding new buckets as needed.

A sharded environment has two or more groups of MySQL servers that are isolated and independent from each other. While specific implementation details are dependent on your environment, here are some general guidelines:

  • A shard can be a single server or a replication hierarchy
  • Each shard has identical schemas – you can run the same query pattern against any shard and expect to get a result set (varying, of course, the identifier in the WHERE clause)
  • A given identifier belongs to exactly one shard at a time – you should NOT have a Customer ID spanning shards (except perhaps temporarily while migrating data between shards, which is an edge case that requires special handling)

With this understanding of sharding, let’s look at some of the key considerations and challenges for actual implementation.

Sharding Considerations and Challenges

While sharding often becomes the only way to manage exploding data, most experts and architects advise delaying it as long as possible. If it is such a great way to handle data, then why is that? The primary reason is complexity. While vertical scaling is trivial (just add more hardware and make no other changes), sharding requires significant thought and planning. The main challenges fall into a few primary buckets:

  • How should I split my data?
  • What if my data spans shards?
  • How do I find the data I need?

Each bucket has its own set of challenges, but each can also be an overall blocker when trying to implement sharding.

How Do I Split My Data?

Picking the shard key is the MOST IMPORTANT part of implementing sharding. The main concept is to pick an identifier that is shared across records, such as CustomerID, ClientID, or UserID. Generally, you’ll want to pick the smallest logical unit possible (often the UserID level), but this also depends on what level of aggregation is needed (more on this later). The risk is that the key isn’t granular enough and you end up not splitting the data evenly. However, if you are too granular, you risk breaking up units that logically should remain together (think sharding on ForumPostID vs. UserID).

What if My Data Spans Shards?

This is often a deciding factor that prevents an organization from sharding. Splitting the data and retrieving individual records, while not trivial, is relatively straightforward. In many workloads, this is enough: locate my small subset of data and fetch it. When the use case dictates aggregate queries, however, this workflow breaks down. Take, for example, a common query to find the newest users across the platform.
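
A minimal sketch of such a query, assuming an illustrative users table with a created_at timestamp (table and column names are not from a real schema):

  -- Fetch the ten most recently created users across the entire platform
  SELECT user_id, username, created_at
  FROM users
  ORDER BY created_at DESC
  LIMIT 10;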

When all the users are stored in a single server, this query is trivial. However, when the data is split across multiple shards, we now need a way to aggregate this data. Implementing this is outside the scope of this post, but the general approach looks something like this:

  1. Issue the query on every shard
  2. Collect the results from each shard in a central aggregation service
  3. Combine/filter the individual results into a single result set
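
As a rough illustration of steps 2 and 3, imagine each shard's partial result being loaded into a staging table on the aggregation host (the newest_users_staging table below is hypothetical), with the combined result produced centrally:

  -- Hypothetical staging table on the aggregation host holding each shard's partial result
  CREATE TABLE newest_users_staging (
    shard_id   INT NOT NULL,
    user_id    BIGINT NOT NULL,
    username   VARCHAR(64),
    created_at DATETIME NOT NULL
  );

  -- Combine the partial results and re-apply the original ORDER BY / LIMIT
  SELECT user_id, username, created_at
  FROM newest_users_staging
  ORDER BY created_at DESC
  LIMIT 10;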

If your use case relies heavily on aggregate queries, implementing this aggregation layer is critical and often very complicated. This is also the use case that discourages so many from sharding, even when it is necessary.

In an ideal scenario, the active OLTP workload can be contained within individual shards. Reporting and data warehousing, however, require distributed queries, which can be managed through separate systems or via custom aggregation using multi-source MySQL asynchronous replication.
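
For the multi-source replication option, each shard can feed its own asynchronous channel on a single reporting host. A minimal sketch, assuming GTID-based replication and MySQL 8.0.23+ syntax (host names and credentials are placeholders):

  -- On the reporting/aggregation host: one replication channel per shard.
  -- This only works cleanly because a given identifier lives on exactly one shard,
  -- so rows from different shards do not collide.
  CHANGE REPLICATION SOURCE TO
    SOURCE_HOST = 'shard0-primary', SOURCE_USER = 'repl',
    SOURCE_PASSWORD = '***', SOURCE_AUTO_POSITION = 1
    FOR CHANNEL 'shard0';

  CHANGE REPLICATION SOURCE TO
    SOURCE_HOST = 'shard1-primary', SOURCE_USER = 'repl',
    SOURCE_PASSWORD = '***', SOURCE_AUTO_POSITION = 1
    FOR CHANNEL 'shard1';

  START REPLICA FOR CHANNEL 'shard0';
  START REPLICA FOR CHANNEL 'shard1';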

How Do I Find My Data?

Now, assuming that you have chosen a sharding key, you need to be able to retrieve records when needed. This is the next major hurdle that needs to be overcome. There are two distinct components to this challenge:

  • Determining which shard holds the data
  • Connecting to that shard from the application

Finding the ShardID

Determining which shard holds the data can be as simple or as complex as you want, but each approach comes with tradeoffs. The simplest approach uses a hash or modulus on the identifier to determine the shard.
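
With a modulus, the shardID is just the identifier modulo the number of shards (the values below are purely illustrative):

  -- shardID = identifier MOD number_of_shards
  SELECT 123457 % 2 AS shard_id;  -- UserID 123457 with 2 shards lands on shard 1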

The most basic example would be sharding by UserID across two shards: even UserIDs land on shard 0 and odd UserIDs on shard 1. This is incredibly simple, but what do I do when two shards aren’t enough? The goal of this architecture is to just add more shards when individual shards get too big, yet adding a shard changes the modulus, so the shardID changes for many existing records and that data has to be moved.

A more flexible approach would be to use a directory service. In this approach, there is a very simple datastore that maps an identifier to a shard. To determine which shard holds the data, you simply query the datastore with the identifier and it returns the shardID. This gives incredible flexibility for moving data between shards, but also adds complexity to the system. Naturally, this service itself would need to be highly available, but doesn’t have to be in MySQL. In addition to complexity, you have potentially introduced additional latency to the system with an extra lookup.
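
A minimal sketch of such a directory, assuming a hypothetical shard_directory table keyed by the identifier:

  -- Hypothetical directory table mapping each UserID to its shard
  CREATE TABLE shard_directory (
    user_id  BIGINT NOT NULL PRIMARY KEY,
    shard_id INT    NOT NULL
  );

  -- Look up which shard holds a given user
  SELECT shard_id FROM shard_directory WHERE user_id = 123457;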

Connecting to the Shard

Once we have the proper shardID, we need to connect to the database cluster in question. Again, we have options, and the implementation depends on the application. There are two main approaches:

  • Use a different connection string for each shard
  • Leverage a proxy layer that routes intelligently

Using a different connection string is generally self-explanatory: I have a collection of shardIDs and their corresponding connection strings. When I know the shardID, I then fetch the connection string, connect to the cluster, and away we go.

But we know developers don’t want to (and shouldn’t) mess with connection strings and credentials. So a proxy layer that can route the query based on the query itself or some metadata in the query is ideal. Being a layer 7 service, ProxySQL is able to inspect the queries themselves and make decisions based on defined rules. While definitely not the only way, let’s look at a sample ProxySQL implementation that uses query comments to route to the correct shard.

Sample ProxySQL Implementation

For our example, here is the basic architecture:

  • Sharding users by UserID
  • Directory service that does simple routing based on modulus
  • ProxySQL is already used for HA
  • Data is split across 3 shards

A typical user query may look something like this:
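
(Reusing the illustrative users table from earlier; the UserID value is arbitrary.)

  SELECT user_id, username, created_at
  FROM users
  WHERE user_id = 123457;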

With UserID as the shard key and a simple modulus sharding strategy, we can easily identify the shardID for this query:
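
(With the 3 shards in this example, the modulus is taken against 3.)

  -- shardID = UserID MOD 3
  SELECT 123457 % 3 AS shard_id;  -- returns 1, so this user lives on shard 1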

With the shardID defined, we can now inject a comment into the original query that can be used by ProxySQL to properly route the query to the correct shard:
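
(The shard=1 comment format below is just one possible convention; the only requirement is that it matches the routing rules defined in ProxySQL.)

  SELECT /* shard=1 */ user_id, username, created_at
  FROM users
  WHERE user_id = 123457;

Note that the comment has to survive all the way to ProxySQL, so make sure your client or driver does not strip comments before sending the query.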

ProxySQL uses the concept of hostgroups to define routing internally. In many HA cases, this is simply a reader and a writer hostgroup. For the sharding architecture, however, we can expand that to define hostgroups as actual shards:
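
(Hostnames are placeholders; for simplicity, each hostgroup_id is set to match its shardID.)

  -- On the ProxySQL admin interface: one hostgroup per shard
  INSERT INTO mysql_servers (hostgroup_id, hostname, port) VALUES
    (0, 'shard0-primary', 3306),
    (1, 'shard1-primary', 3306),
    (2, 'shard2-primary', 3306);

  LOAD MYSQL SERVERS TO RUNTIME;
  SAVE MYSQL SERVERS TO DISK;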

With the hostgroups in place, we just need to add a ProxySQL rule to choose the hostgroup based on the injected comment:
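
(One rule per shard, matching the injected comment and pointing at the corresponding hostgroup; rule IDs are arbitrary.)

  -- On the ProxySQL admin interface: route based on the injected shard hint
  INSERT INTO mysql_query_rules (rule_id, active, match_pattern, destination_hostgroup, apply) VALUES
    (10, 1, 'shard=0', 0, 1),
    (11, 1, 'shard=1', 1, 1),
    (12, 1, 'shard=2', 2, 1);

  LOAD MYSQL QUERY RULES TO RUNTIME;
  SAVE MYSQL QUERY RULES TO DISK;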

Now, after we issue a few queries, you can see in the ProxySQL stats that we are matching the shardID and routing accordingly:
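
(The hits counter for each shard rule should increment as matching queries are routed.)

  -- On the ProxySQL admin interface
  SELECT rule_id, hits
  FROM stats_mysql_query_rules;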

Obviously, this is a massive simplification, and you’d want more generic rules and more complex hostgroup definitions. For example, your deployment would need to account for queries where no shard hint is provided or where an invalid shardID is supplied. However, it demonstrates the potential of building a custom routing layer within the very lightweight ProxySQL tool. This approach, while it requires some additional thought and logic, could be a viable first step towards splitting your data horizontally.

Again, this is just a sample proof-of-concept implementation. Each use case is different. The main takeaway from this post is that manual sharding is possible, but requires significant planning and implementation work.

Comments

Neo

With all due respect, Vitess is the solution for sharding MySQL; it supports increasing and decreasing the number of shards without downtime and also supports changing the sharding key. The scatter-gather SQL pattern is supported as well.

In my opinion, ProxySQL is an inferior sharding solution to Vitess.

Michael Coburn

Hi @Neo, thanks for your reply! You do highlight a very useful sharding tool for MySQL in Vitess, which is a product that Percona loves and often recommends and deploys as a solution for our Customers. We don’t dispute that there is great value in Vitess. What we like to do on the blog is highlight many options, and in our opinion ProxySQL is easier to deploy given it is less invasive to your database stack (at least compared to Vitess, which is really a lift-and-shift operation).

Again, we appreciate your feedback, and it has led Michael and me to consider doing (another) followup post where we explore more complex solutions such as Vitess!

Ivan Baldo

A post on TiDB would be interesting too.