I talk with a lot of people who are really interested in Percona XtraDB Cluster (PXC), mostly as a high-availability solution. But what they tend not to think much about is whether moving from asynchronous to synchronous replication is right for their application.
Facts about Galera replication
Several facts about Galera come into play here, and it isn’t always obvious how they will affect your database workload. For example:
- Transaction commit takes approximately the worst packet round trip time (RTT) between any two nodes in your cluster.
- Transaction apply on slave nodes is still asynchronous from the client’s commit (except on the originating node, where the transaction is committed).
- Galera prevents write conflicts with these pending transactions while they are in flight, in the form of deadlock errors. (This is actually a form of eventual consistency where the client is forced to correct the problem before it can commit. It is NOT the typical form of eventual consistency, known as asynchronous repair, that most people think of.)
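To make that last point concrete, here is an illustrative sketch of first-committer-wins from two clients on different nodes. The table is made up and the exact error text varies by version, but the error code is MySQL’s standard deadlock error:

```sql
-- Client A, connected to node1:
BEGIN;
UPDATE t SET val = 1 WHERE id = 5;

-- Client B, connected to node2, within the same RTT window:
BEGIN;
UPDATE t SET val = 2 WHERE id = 5;

-- Client A commits first: it replicates, certifies, and wins.
COMMIT;

-- Client B's commit now conflicts with A's pending write on its node:
COMMIT;
-- ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction
```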
Callaghan’s Law
But what does that all actually mean? Well, at the Percona Live conference a few weeks ago I heard a great maxim that really helps encapsulate a lot of this information and puts it into context with your application workload:
[In a Galera cluster] a given row can’t be modified more than once per RTT
This was attributed to Mark Callaghan from Facebook by Alexey Yurchenko from Codership at his conference talk. Henceforth this will be known as “Callaghan’s law” in Galera circles forever, though Mark didn’t immediately recall saying it.
Applied to a standalone Innodb instance
Let’s break it down a bit. Our unit of locking in Innodb is a single row (well, the PRIMARY KEY index entry for that row). This means that on a single Innodb node we can typically have all sorts of modifications floating around, as long as they don’t touch the same row. Row locks are held for modifications until the transaction commits, and by default that commit takes an fsync to the redo log. So, applying Callaghan’s law to single-server Innodb, we’d get:
[On a single-node Innodb server] a given row can’t be modified more often than once per fsync
You can obviously relax that by simply not fsyncing every transaction (innodb_flush_log_at_trx_commit != 1), or work around it by fsyncing to memory (a battery- or capacitor-backed write cache), etc., but the principle is basically the same: if we want this transaction to persist after a crash, it has to get to disk.
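On a standalone server this durability/latency trade-off is just a config knob (the value shown here is illustrative, not a recommendation):

```ini
# my.cnf (standalone Innodb)
[mysqld]
# 1 = fsync the redo log at every commit (fully durable, the default)
# 2 = write at commit, fsync ~once per second (can lose ~1s on an OS crash)
# 0 = write and fsync ~once per second (can lose ~1s on a mysqld crash)
innodb_flush_log_at_trx_commit = 2
```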
This has no effect on standard MySQL replication from this instance, since MySQL replication is asynchronous.
What about semi-sync MySQL replication?
It’s actually much worse than Galera. As I illustrated in a blog post last year, semi-sync must serialize all transactions and wait for them one at a time. So, Callaghan’s law applied to semi-sync is:
[On a semi-sync replication master] you can’t commit (at all) more than once per RTT.
Applied to a Galera cluster
In the cluster we’re protecting the data as well, though not by ensuring it goes to disk (though you can do that). We protect the data by ensuring it gets to every node in the cluster.
But why every node and not just a quorum? Well, it turns out transaction ordering really, really matters (really!). By enforcing replication to all nodes, we can (simultaneously) establish global ordering for the transaction, so by the time the original node gets acknowledgement of the transaction back from all the other nodes, a GTID will also (by design) be established. We’ll never end up with non-deterministic ordering of transactions as a result.
So this brings us back to Callaghan’s law for Galera. We must have group communication to replicate and establish global ordering for every transaction, and the expense of doing that for Galera is approximately one RTT between the two nodes in the cluster that are furthest apart (regardless of where the commit comes from!). The least amount of data we can change in Innodb at a time is a single row, so the most any single row can be modified cluster-wide is once per RTT.
What about WAN clusters?
Callaghan’s law applies to WAN clusters as well. LANs usually have sub-millisecond RTTs; WANs can have anywhere from a few milliseconds up to several hundred. On a WAN this opens a large window: a given row may only be updatable a few times per second at best.
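As a rough back-of-the-envelope, the law translates RTT directly into a per-row update ceiling (the RTT numbers below are illustrative, not measurements):

```python
def max_row_updates_per_sec(rtt_ms):
    """Upper bound on modifications to a single row per second,
    per Callaghan's law: at most one commit per round trip."""
    return 1000.0 / rtt_ms

# A 0.5 ms LAN RTT allows up to ~2000 updates/sec on one hot row;
# a 100 ms WAN RTT allows only ~10.
lan_ceiling = max_row_updates_per_sec(0.5)   # 2000.0
wan_ceiling = max_row_updates_per_sec(100)   # 10.0
```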
Some things the rule does not mean on Galera
- It does NOT mean you can’t modify different rows simultaneously. You can.
- It does NOT mean you can’t modify data on multiple cluster nodes simultaneously. You can.
- It does NOT set a lower bound on performance, only an upper bound. The best you can expect is modifying a given row once per RTT; it can get slower than that if apply times start to lag.
So what about my application?
Think about your workload. How frequently do you update any given row? We call rows that are updated heavily “hotspots”.
Examples of hotspots
Example 1: Your application is an online game and you keep track of global achievement statistics in a single table with a row for each stat; there are just a few hundred rows. When a player makes an achievement, your application updates this table with a statement like this:
```sql
UPDATE achievements SET count = count + 1 WHERE achievement = 'killed_troll';
```
How many players might accomplish this achievement at the same time?
Example 2: You have users and groups in your application. These are maintained in separate tables and there also exists a users_groups table to define the relationship between them. When someone joins a group, you run a transaction that adds the relationship row to users_groups, but also updates groups with some metadata:
```sql
BEGIN;
INSERT INTO users_groups (user_id, group_id) VALUES (100, 1);
UPDATE groups SET last_joined=NOW(), last_user_id=100 WHERE id=1;
COMMIT;
```
How often might multiple users join the same group?
Results
In both of the above examples you can imagine plenty of concurrent clients attempting to modify the same record at once. But what will actually happen to the clients who try to update the same row within the same RTT? This depends on which node in the cluster the writes are coming from:
From the same node: This will behave just like standard Innodb. The first transaction will acquire the necessary row locks while it commits (which will take the 1 RTT). The other transactions will lock wait until the lock(s) they need are available. The application just waits in those cases.
From other nodes: First to commit wins. The others that try to commit AFTER the first and while the first is still in the local apply queue on their nodes will get a deadlock error.
So the best case (which may still not be good for your application’s database throughput) is added write latency into the cluster. The worst case is that your transactions won’t commit at all, and you have to take some action you normally wouldn’t have had to take.
Workarounds
If your hotspots were really bad in standalone Innodb, you might consider relaxing the fsync: set innodb_flush_log_at_trx_commit to something besides 1 and suddenly you can update much faster. I see this tuning very frequently for “performance” reasons when data durability isn’t as crucial. This is fine as long as you weigh both options carefully.
But in Galera you cannot relax synchronous replication. You can’t change the law; you can only adapt around it. But how might you do that?
Write to one node
If your issue is really the deadlock errors and not so much the waiting, you could simply send all your writes to one node. This should prevent the deadlock errors, but will not change the lock waiting that your application will need to do for hotspots.
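One way to do this without touching the application is to front the cluster with a TCP proxy that routes all writes to a single node, failing over only when that node dies. A minimal sketch with HAProxy (the node addresses are made up):

```
listen pxc-writes
    bind 0.0.0.0:3306
    mode tcp
    option tcpka
    # all traffic goes to node1; node2/node3 are used only if node1 fails
    server node1 192.168.1.1:3306 check
    server node2 192.168.1.2:3306 check backup
    server node3 192.168.1.3:3306 check backup
```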
wsrep_retry_autocommit
If your hotspots are all updates with autocommits, you can rely on wsrep_retry_autocommit to auto-retry the transactions for you. However, each autocommit is retried only the number of times specified by this variable (default is 1 retry). This means more waiting, and after the limit is exceeded you will still get the deadlock error.
This is not implemented for full BEGIN … COMMIT multi-statement transactions, since Galera cannot assume the application isn’t running logic between the statements that would be unsafe to retry once the database state has changed.
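The variable goes in my.cnf alongside your other wsrep settings (the value here is just an example):

```ini
[mysqld]
# Retry autocommit statements up to 4 times on certification failure
# before surfacing the deadlock error (default is 1; 0 disables retries).
wsrep_retry_autocommit = 4
```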
retry deadlocks
Now we start to get into (*gasp*) territory where your application needs to be modified. Generally, if you use Innodb, you should be able to handle deadlock errors in your application anyway. Raise your hand if your application has that logic (I usually get fewer than 5 out of 100).
But, what to do? Retrying automatically, or giving your end user a chance to retry manually are typical answers. However, this means more latency waiting for a write to go through, and possibly some poor user experience.
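A minimal retry-loop sketch of the automatic option; the DeadlockError class and the txn callable are stand-ins for whatever your driver and application actually provide:

```python
import time

class DeadlockError(Exception):
    """Stand-in for your driver's deadlock/certification-failure error."""

def with_deadlock_retry(txn, attempts=3, backoff=0.01):
    """Run txn(), retrying up to `attempts` times on DeadlockError.
    Only safe if txn is idempotent or re-reads its inputs each try."""
    for attempt in range(1, attempts + 1):
        try:
            return txn()
        except DeadlockError:
            if attempt == attempts:
                raise  # out of retries: surface the error to the caller
            time.sleep(backoff * attempt)  # back off before retrying

# Simulated transaction that deadlocks twice, then commits.
calls = {"n": 0}
def flaky_txn():
    calls["n"] += 1
    if calls["n"] < 3:
        raise DeadlockError("deadlock found when trying to get lock")
    return "committed"

result = with_deadlock_retry(flaky_txn)
```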
batch writes
Instead of updating global counters one at a time (from Example 1, above), how about maintaining the counter in memcache or redis and only flushing to the database periodically?
```php
if ( $last_count % 100 == 0 ) {
    $db->do("UPDATE achievements SET count = $last_count WHERE achievement = 'killed_troll'");
}
```
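The same idea as a slightly fuller sketch in Python, buffering increments in memory and flushing every N events; the threshold and the flush_to_db callback are assumptions, not a prescription:

```python
class BatchedCounter:
    """Buffer counter increments and flush to the database every
    `threshold` events, turning N hot-row updates into one."""
    def __init__(self, flush_to_db, threshold=100):
        self.flush_to_db = flush_to_db  # callable(name, count)
        self.threshold = threshold
        self.pending = {}

    def incr(self, name):
        self.pending[name] = self.pending.get(name, 0) + 1
        if self.pending[name] >= self.threshold:
            self.flush(name)

    def flush(self, name):
        count = self.pending.pop(name, 0)
        if count:
            # e.g. UPDATE achievements SET count = count + %s WHERE achievement = %s
            self.flush_to_db(name, count)

flushed = []
c = BatchedCounter(lambda name, n: flushed.append((name, n)), threshold=3)
for _ in range(7):
    c.incr("killed_troll")
# Two flushes of 3 hit the database; 1 increment is still pending in memory.
```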
change your schema
In Example 2 above, how about moving the ‘joined’ timestamp to the users_groups table so we don’t need to update the parent groups row so often?
```sql
INSERT INTO users_groups (user_id, group_id, joined) VALUES (100, 1, NOW());
```
Conclusion
Replicating your data across a distributed system requires trade-offs. Most of us are used to the trade-offs we make when deploying conventional stand-alone MySQL Innodb with asynchronous slaves. We may not think about those trade-offs, but we’re making them (anyone obsessively checking slave position to ensure it’s caught up with the master?).
Synchronous replication with PXC and Galera is no different in that there are trade-offs; they just aren’t the ones we commonly expect.
If Callaghan’s law is going to cause you trouble and you are not prepared to adapt to work with it, PXC/Galera Synchronous replication is probably not right for you.
Excellent description. I have been waiting for someone to begin writing about Galera at this level of detail.
I have not reviewed or tested semi-sync in official MySQL so I am surprised by this and don’t understand why it occurs. But my peers are interested in semi-sync now so we will verify this.
“Semi-sync performs dismally under concurrency. Think about it, our single inserts took ~100ms, so we are effectively serializing our 32 clients with semi-sync replication. Apparently each client must wait for it’s turn writing to the binlog and waiting for confirmation from the slave, there can be no parallelization here.”
Jay,
Great article !
A couple of comments. First, I think it is important to point out that you’re speaking about “commits” when it comes to modification; you can perfectly well modify the same row many times per RTT, assuming you commit all of those changes in one go. Some other solutions might actually incur an RTT penalty for each update.
Second, I think as the technology evolves we should see more workarounds. Queuing and batching technologies can be very interesting when used together with sync replication. Hotspot contention can also often be reduced by spreading the writes across multiple rows (say 10), so that concurrent updates modify different rows; but then your point-lookup queries have to be aggregates, i.e., run SUM() or MAX() for the examples you provided.
The single-node example with fsync is probably correct given the current design of InnoDB, but it’s not fundamental — see the 2009 paper on Early Lock Release (http://infoscience.epfl.ch/record/152158) or the followup from VLDB ’10: http://infoscience.epfl.ch/record/149436/files/vldb10aether.pdf
The paper doesn’t reference multi-node replicated systems, but I’m fairly certain the same optimization could be applied.
-Todd
We implemented early lock release for InnoDB but didn’t write a paper about it. Unfortunately it broke assumptions for crash recovery so the change was removed.
Interesting. Was this on a public branch where I might be able to see the discussion? Or FB-internal?
https://kb.askmonty.org/en/xtradb-option-innodb-release-locks-early/history
https://www.google.com/search?q=early+lock+release+mariadb
https://bugs.launchpad.net/maria/+bug/798213
@Mark: I’d be interested to know if semi-sync can be improved as well. I’m not seeing a lot of love on Semi-sync in general, though if you guys come up with a patch I’m sure it could make its way into Percona Server. 🙂
@Peter: Yes, you are right, I am talking about commits. I’m not sure how useful it is to update the same row multiple times in a single transaction, READ-UNCOMMITTED isn’t going to work in a PXC cluster, for sure. Good thoughts about the hotspot workarounds.
@Todd: Early Lock Release is new to me. I can’t imagine it working well with Galera since there is an assumption that Locks are held on the source node before global ordering is established. Certainly if you were single node writing it may be ok, but not if your HA allowed writes (however briefly) to hit multiple nodes during a writer node failover. However, glad to learn something new!
Great article!
Shouldn’t you be able to use one of the Percona Toolkit tools to find where those hotspots occur in a production environment by looking for deadlocks? They would resolve gracefully in a single-Innodb-instance scenario, but would potentially require an application rewrite or the wsrep_retry_autocommit setting as a short-term workaround. Hopefully you could then assess whether you can move to sync replication simply by monitoring your live database, rather than running separate analysis or tests.
Jay, I can think of a scenario where you update a row multiple times before committing:
Your apps send frequent updates into a message queue. Then a consumer thread reads from the MQ and applies updates to the database, many MQ events per commit.
If a given system uses this strategy for hotspot activity, but otherwise everything goes through synchronous replication, then it mitigates the overhead of the RTT. The events in the MQ may not be available synchronously, but that might be okay for a given workload. Not all of an application’s traffic requires the same level of synchronicity.
All this talk of queues and they are likely to be a worst-case for Galera because there is shared state that all transactions must update. Wonder if we need skip-locked or even more support to improve innodb-as-queue concurrency with and without Galera?
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:2060739900346201280
@Serge – I’m starting to put together some ways to analyze query workloads with pt-query-digest that can start to answer some of these questions. Stay tuned for a future blog post on the subject.
@Mark: that seems like a hard problem to solve for standalone Innodb, let alone Galera. I had considered a job queue system in Innodb in the past that, in absence of a ‘SKIP LOCKED’ feature, would use READ-UNCOMMITTED to find which rows that had been “taken” by some other job (that had an open trx and had updated some meta field, but had not committed) and so could gracefully find rows that had not been taken yet. This type of system simply won’t work across multiple nodes in Galera, however.
@Bill — is this the kind of thing you were talking about, or do I misunderstand you?
No, I was trying to describe a system that uses a separate service for a message queue, not MySQL at all. E.g. ActiveMQ, RabbitMQ, Resque, etc. All applications would send updates to the MQ in lieu of writing directly to the database. Sometime later (asynchronously), a worker app would read items from the MQ and then apply them to the database. Possibly committing sets of events in a single transaction to reduce the overhead of the RTT.
I was just trying to think of a scenario that might address your comment, “I’m not sure how useful it is to update the same row multiple times in a single transaction.”
@Bill: Gotcha. Such a system in Galera would need to be careful that transactions don’t get too big in terms of rows modified, and too long in terms of duration since any other connection modifying any one row in that large pending transaction can roll the entire thing back.
This is an excellent article. The rule of thumb that updates on individual rows cannot happen more than once per network round-trip is the most succinct summary of the Galera conflict resolution algorithm I have seen to-date.
To get Galera to work well applications really need to avoid conflicts as much as possible. This seems to be a fundamental truth of any multi-master approach. It does not really matter whether your replication approach is fully synchronous, asynchronous, or like Galera somewhere in between. The trade-off between the approaches seems to be what kind of mess you have to deal with when conflicts do occur.
Jay, following up on your reasoning it seems as if one of the places where Galera’s optimistic locking gets into trouble is large transactions that touch a lot of rows. These can happen through administrative actions for example. Say you want to change the value of a column for 1M rows in a single transaction. If there are any other updates on those rows within the RTT limit either those latter updates fail *or* the main update does.
In fact it seems in this case that the time window for conflicts is potentially far greater than the RTT limit, which is a lower bound. Any update that commits and gets into group communications from the time of the BEGIN on the large transaction could cause the large transaction to roll back. So the conflict window is actually time_of_large_transaction + RTT.
Does this reasoning seem correct or am I missing something?
@Robert –
first comment, second paragraph — exactly, I think this really applies to any multi-writer system.
second comment, first paragraph — I’d argue this is a problem in conventional replication (or Tungsten) too — huge transactions just gum up the works, it’s just Galera’s synchronous nature that turns it from inconvenient to unworkable. Background maintenance operations really need to be chunked into smaller transactions via pt-archiver and similar tools. I do believe that’s the “right” solution.
second comment, second paragraph — yes, total transaction execution time from BEGIN to COMMIT return is the window for conflicts to occur. Callaghan’s law is a lower bound and is really focused on the more typical workloads of a SQL database — small transactions.
Jay – Yoshinori found one performance bug for semi-sync — http://bugs.mysql.com/bug.php?id=69341. Wonder if there are others.
@Mark The performance bug reported by Yoshinori was in reference to MySQL 5.6 in particular. Is it correct to assume this is also occurring in 5.5? It seems consistent with testing I’ve done against my 5.5.29 WAN-distributed Galera cluster.