For a large number of online applications, once you have implemented proper sharding you can consider your scaling problems solved – by adding more and more hardware you can grow. As I recently wrote, however, that by itself does not mean it is the optimal way to do things.

The “classical” approach to sharding involves partitioning by user_id, site_id, or something similar. This allows you to spread data more or less evenly across the boxes and to use any number of boxes. However, this may not be the optimal approach by itself, because not all data belonging to the same user is equal.
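The classical scheme boils down to a simple routing function. Here is a minimal sketch of hash-based shard routing, assuming a fixed shard count (the function name and constant are made up for illustration):

```python
import hashlib

NUM_SHARDS = 8  # assumed fixed number of shards for this sketch


def shard_for(user_id):
    """Map a user_id to a shard number using a stable hash.

    md5 is used only to get a well-mixed, deterministic value; any
    stable hash works. The same user_id always lands on the same shard.
    """
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```

The important property is determinism: every lookup for a given user goes to the same box, but nothing in the function knows (or cares) whether that user's data is hot or cold – which is exactly the limitation the rest of the post discusses.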

Consider a blog or forum as an example – most likely the few latest posts will get the majority of hits, while things written a year ago are accessed far less frequently. You can often level this off significantly for reads by using caching (if things are accessed frequently, they are served from the cache), but you still have to deal with writes, which can be significant depending on your design.
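The read-leveling effect of caching can be sketched with a tiny in-memory LRU cache: frequently read posts stay resident, while rarely touched old posts get evicted and fall through to the database. This is a minimal illustration (the `loader` callback standing in for a database fetch is hypothetical):

```python
from collections import OrderedDict


class LRUCache:
    """Tiny LRU cache: hot items stay resident, cold items get evicted."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.data = OrderedDict()  # insertion order doubles as recency order

    def get(self, key, loader):
        if key in self.data:
            self.data.move_to_end(key)  # cache hit: mark as recently used
            return self.data[key]
        value = loader(key)             # cache miss: fall through to the DB
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used item
        return value
```

Note that this only helps reads – every write still has to reach the database, which is why the hot/cold separation below matters even with a cache in front.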

It does not have to be only active portions of the data – in the same way, you can have active users and users who are almost dead.

Another interesting type of data I often find kept in the same “cluster” without good reason is logs or history data. Think about Wikipedia page version history, for example, which accumulates to a huge volume that very few users ever need to read; various change logs are another kind of data in this category.

Besides separating “cold” data from “hot” data, it often makes sense to separate data based on its importance for system operation – for example, if the page version data were currently unavailable for Wikipedia, it would still be possible to serve 99% of reads, and possibly even handle writes, by queuing new version creation.
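The queuing idea can be sketched as follows – a rough illustration, assuming a hypothetical history backend object; the class and method names are made up. When the history store is down, new versions are buffered instead of failing the write, and replayed later:

```python
import queue


class VersionStore:
    """Sketch: defer page-version writes while the history backend is down."""

    def __init__(self, history_backend):
        self.history = history_backend  # hypothetical archive-store client
        self.pending = queue.Queue()    # versions deferred during an outage

    def save_version(self, page_id, text):
        try:
            self.history.write(page_id, text)
        except ConnectionError:
            # History store unavailable: the current page still serves reads;
            # queue the version write instead of failing the user's edit.
            self.pending.put((page_id, text))

    def drain(self):
        """Replay queued writes once the history store is back up."""
        while not self.pending.empty():
            page_id, text = self.pending.get()
            self.history.write(page_id, text)
```

The point of the sketch is the failure mode: losing the archive tier degrades one feature (browsing old versions) rather than taking down the whole site.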

It may make sense to separate data at the table (or partition) level to get better “clustering” – because data is usually cached by pages rather than by individual rows and index entries, having all hot data in a separate table from the cold data is much more efficient for caching than having one table with the mix, even though the total size remains the same.
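As a toy illustration of the table-level split (using SQLite only for the sketch – the table names, column names, and cutoff are made up), cold rows can be moved out of the hot table into an archive table of the same shape, keeping the hot table small and cache-friendly:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts_hot (id INTEGER, created INTEGER, body TEXT)")
conn.execute("CREATE TABLE posts_archive (id INTEGER, created INTEGER, body TEXT)")

# A mix of year-old (cold) and recent (hot) posts, all in the hot table.
conn.executemany(
    "INSERT INTO posts_hot VALUES (?, ?, ?)",
    [(1, 100, "old post"), (2, 200, "old post"), (3, 900, "fresh post")],
)

CUTOFF = 500  # rows created before this point are considered cold

# Copy cold rows to the archive, then remove them from the hot table,
# so the pages caching posts_hot hold only data that is actually read.
conn.execute(
    "INSERT INTO posts_archive SELECT * FROM posts_hot WHERE created < ?",
    (CUTOFF,),
)
conn.execute("DELETE FROM posts_hot WHERE created < ?", (CUTOFF,))
conn.commit()
```

In a real MySQL deployment the same effect can come from an explicit archive table or from range partitioning by date; the sketch just shows the shape of the split.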

It also often makes sense to separate data at the server level. By keeping the hot, production-critical dataset small, you can both make the system perform faster and gain plenty of operational benefits – a small database takes less time to back up and restore, it is easier to run ALTER TABLE, and replication will not fall behind as easily.

You can also use different hardware for different parts of the data – you can hold “hot” data on a fast RAID volume or even SSD, while placing archive data on slow but large SATA volumes (unless response time becomes a show stopper).

Of course, not all applications need to use this technique, but there is a significant class of applications which can benefit from it dramatically.

10 Comments
mike

It would be cool if MySQL itself had some sort of LRU-type, query-cache-like mechanism for re-organizing how data is physically stored.

mike

Hmm. Even having the option would be nice. A configuration option (just like InnoDB recovery style) would probably be used by a lot of people.

Trying to determine how to partition my data and all that myself seems a bit daunting. I’d be happy with an LRU-style thing. I mean, I will deploy memcached and use that for caching in front of the database anyway. But any further optimizations to MySQL are welcome.

howa

I think the golden rule for a large site should be: don’t let your visitors hit your MySQL directly?

Such as using Squid?

Frank

I honestly think some things are out of the scope of MySQL. I think your application should be smart enough to pull from a cache. In fact, frameworks (.NET, CakePHP, Rails, etc.) have built-in cache (file, memory, memcache, etc.) support – you just need to use it. MySQL is very powerful, and while I think it needs to do more beyond master/slave replication, determining how you use its data isn’t exactly its job. Being able to treat multiple servers, DBs, and tables as a cluster, and providing easy-to-use methods for moving data (tables and databases) to optimize access, might be within its scope. Knowing when and why to move your data rests firmly with you and your app.

Wes Mahler

Interesting way to partition data!

howa

Let’s say we partition by user_id. The main difficulty, I think, is not partitioning but load balancing – e.g., if 80% of users are inactive, how do you spread the 20% of active users across all the database servers for maximum performance?