Percona XtraDB Cluster (PXC) has become a popular option to provide high availability for MySQL servers. However, many people still have a hard time understanding what happens to the cluster when one or several nodes leave it (gracefully or ungracefully). This is what we will clarify in this post.
Nodes leaving gracefully
Let’s assume we have a 3-node cluster and all nodes have an equal weight, which is the default.
What happens if Node1 is gracefully stopped (service mysql stop)? When shutting down, Node1 will inform the other nodes that it is leaving the cluster. We now have a 2-node cluster and the remaining members have 2/2 = 100% of the votes. The cluster keeps running normally.
What happens now if Node2 is gracefully stopped? Same thing, Node3 knows that Node2 is no longer part of the cluster. Node3 then has 1/1 = 100% of the votes and the 1-node cluster can keep on running.
In these scenarios, there is no need for a quorum vote as the remaining node(s) always know what happened to the nodes that are leaving the cluster.
Nodes becoming unreachable
On the same 3-node cluster with all 3 nodes running, what happens now if Node1 crashes?
This time Node2 and Node3 must run a quorum vote to estimate whether it is safe to continue: they have 2/3 of the votes, and 2/3 is > 50%, so the remaining 2 nodes have quorum and they keep on working normally.
Note that the quorum vote does not happen immediately when Node2 and Node3 are no longer able to reach Node1. It only happens after the ‘suspect timeout’ (evs.suspect_timeout), which is 5 seconds by default. Why? It allows the cluster to be resilient to short network failures, which can be quite useful when operating the cluster over a WAN. The tradeoff is that if a node crashes, writes are stalled during the suspect timeout.
Now what happens if Node2 also crashes?
Again a quorum vote must be performed. This time Node3 has only 1/2 of the votes: this is not > 50% of the votes. Node3 doesn’t have quorum, so it stops processing reads and writes.
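The arithmetic in the scenarios above can be sketched in a few lines of Python. This is a simplified model and not Galera's actual implementation: the function name has_quorum and its parameters are made up for illustration, and every node is assumed to weigh 1 unless a weight is passed in.

```python
def has_quorum(reachable, previous, left_gracefully=frozenset(), weights=None):
    """Return True if `reachable` holds a strict majority of the expected votes.

    previous        -- members of the last known cluster view
    left_gracefully -- members of `previous` that announced their departure;
                       they are removed from the total before counting
    weights         -- optional {node: weight} map, default weight is 1
    """
    weights = weights or {}

    def total(nodes):
        return sum(weights.get(node, 1) for node in nodes)

    expected = total(set(previous) - set(left_gracefully))
    # Strict majority, written with integers to avoid float rounding.
    return 2 * total(reachable) > expected


three_nodes = {"Node1", "Node2", "Node3"}

# Node1 crashes: Node2 and Node3 hold 2/3 of the votes.
print(has_quorum({"Node2", "Node3"}, three_nodes))  # True

# Node2 then crashes too: Node3 holds 1/2 of the votes of the 2-node view.
print(has_quorum({"Node3"}, {"Node2", "Node3"}))  # False

# Node2 stops gracefully instead: Node3 holds 1/1 of the votes.
print(has_quorum({"Node3"}, {"Node2", "Node3"}, left_gracefully={"Node2"}))  # True
```

The second call uses the 2-node view as its starting point because the cluster had already agreed on a new membership after Node1 disappeared, which is why the fraction is 1/2 and not 1/3.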
If you look at the wsrep_cluster_status status variable on the remaining node, it will show NON_PRIMARY. This indicates that the node is not part of the Primary Component.
Why does the remaining node stop processing queries?
This is a question I often hear: after all, MySQL is up and running on Node3 so why is it prevented from running any query? The point is that Node3 has no way to know what happened to Node2:
- Did it crash? In this case, it is safe for the remaining node to keep on running queries.
- Or is there a network partition between the two nodes? In this case, it is dangerous to process queries because Node2 might also process other queries that will not be replicated because of the broken network link: the result will be two divergent datasets. This is a split-brain situation, and it is a serious issue as it may be impossible to later merge the two datasets. For instance, if the same row has been changed on both nodes, which row has the correct value?
Quorum votes are not held because it’s fun, but only because the remaining nodes have to talk together to see if they can safely proceed. And remember that one of the goals of Galera is to provide strong data consistency, so any time the cluster does not know whether it is safe to proceed, it takes a conservative approach and it stops processing queries.
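To see why the strict majority rule rules out split-brain, here is a small Python sketch of a network partition. It is only an illustration under the assumption of equal weights; the helper name surviving_side is hypothetical and does not correspond to anything in Galera.

```python
def surviving_side(members, side_a):
    """Split `members` into side_a and the rest, as a network partition would.

    Returns the side that holds a strict majority of the votes, or None if
    neither side does. Two disjoint sides can never both exceed 50%, so at
    most one side is ever allowed to keep processing queries.
    """
    members = set(members)
    side_a = set(side_a) & members
    side_b = members - side_a
    for side in (side_a, side_b):
        if 2 * len(side) > len(members):
            return side
    return None


# 3-node cluster, Node1 is cut off: the Node2/Node3 side keeps running.
print(surviving_side({"Node1", "Node2", "Node3"}, {"Node1"}))

# 2-node cluster split down the middle: each side has 1/2, nobody continues.
print(surviving_side({"Node2", "Node3"}, {"Node2"}))  # None
```

From the point of view of Node3, the second case looks exactly the same as a crash of Node2, which is why it has to stop in both situations.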
In such a scenario, the status of Node3 will be set to NON_PRIMARY and a manual intervention is needed to re-bootstrap the cluster from this node by running:
SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';
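If you script this step, it is worth guarding it. The Python sketch below is one possible shape, not an official tool: the function name, the confirmed_other_nodes_down flag and the RecordingCursor stand-in are all hypothetical, and the cursor is assumed to be any DB-API style cursor connected to the node. The code cannot verify by itself that the other nodes are really down, so that confirmation has to come from the operator.

```python
class RecordingCursor:
    """Stand-in for a DB-API cursor so the sketch runs without a cluster."""

    def __init__(self, status):
        self.status = status
        self.statements = []

    def execute(self, sql):
        self.statements.append(sql)

    def fetchone(self):
        return ("wsrep_cluster_status", self.status)


def bootstrap_if_non_primary(cursor, confirmed_other_nodes_down=False):
    """Re-bootstrap the Primary Component from this node, with two guards.

    Returns True if the bootstrap statement was sent, False if the node was
    already part of the Primary Component.
    """
    if not confirmed_other_nodes_down:
        # Bootstrapping while another partition is alive causes split-brain.
        raise ValueError("operator must confirm that the other nodes are down")

    cursor.execute("SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status'")
    row = cursor.fetchone()
    if row is None:
        raise RuntimeError("wsrep_cluster_status not found; is wsrep enabled?")

    # Compare case-insensitively: only the 'primary' state is treated as healthy.
    if str(row[1]).lower() == "primary":
        return False

    cursor.execute("SET GLOBAL wsrep_provider_options='pc.bootstrap=YES'")
    return True


cur = RecordingCursor("non-Primary")
print(bootstrap_if_non_primary(cur, confirmed_other_nodes_down=True))  # True
```

With a real driver you would pass the cursor of a connection opened on the surviving node instead of RecordingCursor.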
A side question: it is now clear why writes should be forbidden in this scenario, but what about reads? Couldn’t we allow them?
Actually this is possible from PXC 5.6.24-25.11 with the wsrep_dirty_reads setting.
Conclusion
Split-brain is one of the worst enemies of a Galera cluster. Quorum votes take place every time one or several nodes suddenly become unreachable, and they are meant to protect data consistency. The tradeoff is that they can hurt availability, because in some situations a manual intervention is necessary to tell the remaining nodes that they can resume executing queries.
I always run into this question: can overlapping queries happen in a MySQL Cluster? For example, selecting a specific row before the slaves have been updated, when an update query has already been sent to the master.
Pejman,
I assume that ‘MySQL Cluster’ means ‘Galera Cluster’, not regular master-slave.
If so, the following scenario can happen:
– you update a row on node1.
– an instant later, you read that same row on node2, but you don’t see the change that comes from node1.
This can happen because after the write is replicated to node2, it is put in an apply queue and is possibly not applied immediately.
If this is an issue, you can use wsrep_sync_wait (https://www.percona.com/doc/percona-xtradb-cluster/5.6/wsrep-system-index.html#wsrep_sync_wait).
The drawback is slower reads.
Pejman,
Just to elaborate on Stephane's comment, you may perform queries as follows:
SET SESSION wsrep_sync_wait = 1;
SELECT * FROM example WHERE field = 'value';
SET SESSION wsrep_sync_wait = 0;
The application runs the first SET command to enable wsrep_sync_wait. In the next command, the application sends the SELECT query. The node initiates a causality check, blocking incoming queries while it catches up with the cluster. When the node finishes applying the new transactions, it executes the SELECT query, returning the results to the application. The application, having finished the critical read, disables wsrep_sync_wait, returning the node to normal operation.
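In application code, the enable / read / disable sequence can be wrapped so that the setting is always reset, even if the read fails. The following Python sketch assumes a DB-API style cursor; the names causal_read and RecordingCursor are made up for this example, and the %s placeholder style is an assumption that depends on the driver you use.

```python
from contextlib import contextmanager


class RecordingCursor:
    """Stand-in for a DB-API cursor so the sketch runs without a cluster."""

    def __init__(self):
        self.statements = []

    def execute(self, sql, params=None):
        self.statements.append(sql)


@contextmanager
def causal_read(cursor, mask=1):
    """Enable wsrep_sync_wait for the reads issued inside the with block."""
    cursor.execute(f"SET SESSION wsrep_sync_wait = {int(mask)}")
    try:
        yield cursor
    finally:
        # Always return the session to normal operation, even on error.
        cursor.execute("SET SESSION wsrep_sync_wait = 0")


cur = RecordingCursor()
with causal_read(cur):
    cur.execute("SELECT * FROM example WHERE field = %s", ("value",))

for statement in cur.statements:
    print(statement)
```

Only the critical reads should go inside the with block, since every query run with wsrep_sync_wait enabled pays the cost of waiting for the node to catch up.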
What is the best way to weight quorum for a 6-node, 3-datacenter cluster? Something like pc.weight=5 …… pc.weight=0?
Hi,
I have two questions:
1. Does the quorum affect the number of responses required by the certification test?
Say the cluster has five nodes and all of them are alive.
If node1 sends a certification test to all the other nodes, does the test succeed once at least 3 nodes respond with success, or must all nodes respond with success?
2. The post above says:
“Again a quorum vote must be performed. This time Node3 has only 1/2 of the votes: this is not > 50% of the votes. Node3 doesn’t have quorum, so it stops processing reads and writes.”
I don't get why Node3 has only 1/2 of the votes rather than 1/3 of the votes.
Your explanation would be much appreciated.