A question I often hear when customers want to set up a production PXC cluster is: “How many nodes should we use?”

Three nodes is the most common deployment, but when are more nodes needed? They also ask: “Do we always need to use an odd number of nodes?”

This is what we’ll clarify in this post.

This is all about quorum

I explained in a previous post that a quorum vote is held each time one node becomes unreachable. With this vote, the remaining nodes determine whether it is safe to keep on serving queries. If quorum is not reached, all remaining nodes put themselves into a state where they cannot process any query (not even reads).
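You can observe the outcome of that vote from any node: Galera exposes it through the standard wsrep status variables:

```sql
-- 'Primary' means this node is in the Primary Component (quorum reached)
-- and serves queries; 'non-Primary' means it refuses all queries, even reads.
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';

-- Number of nodes currently visible to this node.
SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';
```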

To get the right size for your cluster, the only question you need to answer is: how many nodes can fail simultaneously while leaving the cluster operational?

  • If the answer is 1 node, then you need 3 nodes: when 1 node fails, the two remaining nodes have quorum.
  • If the answer is 2 nodes, then you need 5 nodes.
  • If the answer is 3 nodes, then you need 7 nodes.
  • And so on: in general, tolerating f simultaneous failures requires 2f + 1 nodes (the arithmetic is sketched below).
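The pattern is simply the majority rule: with N = 2f + 1 nodes, losing f of them still leaves f + 1 survivors, which is a strict majority of N. A minimal illustration of the arithmetic (plain SQL, nothing PXC-specific):

```sql
-- Quorum requires a strict majority: FLOOR(n/2) + 1 surviving nodes.
SELECT n                      AS cluster_size,
       FLOOR(n / 2) + 1       AS quorum,
       n - (FLOOR(n / 2) + 1) AS failures_tolerated
FROM (SELECT 3 AS n UNION SELECT 5 UNION SELECT 7) AS sizes;
-- 3 tolerates 1 failure, 5 tolerates 2, 7 tolerates 3.
```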

Remember that group communication is not free, so the more nodes in the cluster, the more expensive group communication becomes. That’s why it would be a bad idea to run a 15-node cluster, for instance. In general, we recommend that you talk to us if you think you need more than 10 nodes.

What about an even number of nodes?

The recommendation above always specifies an odd number of nodes, so is there anything wrong with an even number of nodes? Let’s take a 4-node cluster and see what happens when nodes fail:

  • If 1 node fails, 3 nodes are remaining: they have quorum.
  • If 2 nodes fail, 2 nodes are remaining: they no longer have quorum (remember 50% is NOT quorum).

Conclusion: availability of a 4-node cluster is no better than the availability of a 3-node cluster, so why bother with a 4th node?
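As an aside: whatever the cluster size, if quorum is ever lost the surviving nodes stay non-Primary until you intervene. The usual manual recovery uses Galera's pc.bootstrap option, run only on the half you want to keep (a last-resort action, since the software cannot decide safely on its own):

```sql
-- Manual, last-resort recovery: run on ONE node of the component you
-- want to keep. It promotes that component to Primary. Never run it on
-- both halves, or you create the very split-brain quorum prevents.
SET GLOBAL wsrep_provider_options = 'pc.bootstrap=YES';
```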

The next question is: is a 4-node cluster less available than a 3-node cluster? Many people think so, especially after reading this sentence from the manual:

Clusters that have an even number of nodes risk split-brain conditions.

Many people read this as “as soon as one node fails, we have a split-brain condition and the whole cluster stops working”. This is not correct! In a 4-node cluster, you can lose 1 node without any problem, exactly like in a 3-node cluster. This is no better, but no worse.

By the way, the manual is not wrong! The sentence makes sense in its context.

There could actually be reasons why you might want to have an even number of nodes, but we will discuss that topic in the next section.

Quorum with multiple data centers

To provide more availability, spreading nodes across several datacenters is a common practice: if power fails in one DC, nodes are still available elsewhere. The typical implementation is 3 nodes in 2 DCs:

[Diagram: a 3-node cluster across 2 datacenters (2 nodes in DC1, 1 node in DC2)]

Notice that while this setup can handle any single node failure, it can’t handle all single DC failures: if we lose DC1, 2 nodes leave the cluster and the remaining node does not have quorum (1 node out of 3 is not a majority). You can try with 4, 5 or any number of nodes, and it is easy to convince yourself that in all cases, losing one DC can make the whole cluster stop operating.
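To make that concrete, here is the same majority arithmetic applied to a few 2-DC layouts (plain SQL again, purely illustrative); the worst case is always losing the DC that hosts the most nodes:

```sql
-- 2-DC layouts: losing the DC that hosts the most nodes never leaves
-- a strict majority (survivors must be > half of the previous size).
SELECT total_nodes, nodes_in_lost_dc,
       (total_nodes - nodes_in_lost_dc) > total_nodes / 2 AS still_has_quorum
FROM (
    SELECT 3 AS total_nodes, 2 AS nodes_in_lost_dc  -- 2 + 1 split
    UNION SELECT 4, 2                               -- 2 + 2 split
    UNION SELECT 5, 3                               -- 3 + 2 split
) AS layouts;
-- still_has_quorum is 0 in every case.
```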

If you want to be resilient to a single DC failure, you must have 3 DCs, for instance like this:

[Diagram: a 3-node cluster across 3 datacenters (1 node per DC)]
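Running the same check for this layout shows why it works: any single DC failure removes only 1 node, and 2 out of 3 is still a strict majority:

```sql
-- 3 DCs, 1 node each: lose any DC and 2 of 3 nodes survive.
SELECT (3 - 1) > 3 / 2 AS still_has_quorum;  -- returns 1
```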

Other considerations

Sometimes other factors will make you choose a higher number of nodes. For instance, look at these requirements:

  • All traffic is directed to a single node.
  • The application should be able to fail over to another node in the same datacenter if possible.
  • The cluster must keep operating even if one datacenter fails.

The following architecture is an option (and yes, it has an even number of nodes!):

[Diagram: a 6-node cluster (2 nodes in each of 3 datacenters)]
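For the failover requirement, the application (or its load balancer) needs a way to tell a usable node from one that is up but not serving. The exact probe depends on your proxy, but a minimal check against the standard wsrep status variables looks like this:

```sql
-- A node is safe to receive traffic when it is Synced and ready.
SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';  -- expect 'Synced'
SHOW GLOBAL STATUS LIKE 'wsrep_ready';                -- expect 'ON'
```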

Conclusion

Regarding availability, it is easy to estimate the number of nodes you need for your PXC cluster. But node failures are not the only aspect to consider: resilience to a datacenter failure can, for instance, influence the number of nodes you will be using.

Comments
Csaba Balázs

When you need more performance, upgrade the node machines (more RAM, or a change to SSD) step by step (this also serves as a failure test). As I read this article, I now know: adding more nodes increases stability, but not performance. Earlier I thought that when queries are slow, I must add more nodes.

Thanks for this article

Ivan Zahariev

I want to have 3 copies of my data, and thus be able to lose 2 nodes simultaneously. This means that I need a cluster of 5 nodes. Is it viable to create a cluster with 3 data nodes and 2 arbitrators? Is it even possible to run 2 arbitrators in the same PXC?

Ivan Zahariev

Answering my own question 🙂 I’ve tested it, and you cannot have 2 arbitrators.

Ivan Zahariev

A quick note about running an arbitrator. This can actually increase the fault tolerance of a three-node cluster. Having an arbitrator lets you lose 2 nodes in a one-by-one fashion: https://blog.famzah.net/2017/02/23/mysql-galera-cluster-how-many-nodes-do-you-really-need/

I have a question: how do you handle a DC failure with a 3-node + arbitrator setup?
1. Put 3 nodes in DC1 and the arbitrator in DC2. DC2 power outage: quorum is met, success.
2. Put 2 nodes in DC1 and the arbitrator + 1 node in DC2. DC2 outage: quorum fails, failure.

What are your thoughts?

bl

So what do you do in the odd-number scenario, like 5 or 7 data nodes?

Is there no point in adding an arbitrator for the quorum majority? Can it be useful for other things?