In a clustered environment, a node often needs to be taken down for maintenance. To rejoin, the node must re-sync with the cluster state. In PXC (Percona XtraDB Cluster), there are two ways for the rejoining node to re-sync: State Snapshot Transfer (SST) and Incremental State Transfer (IST). SST involves a full data transfer, which can be time consuming. IST is an incremental data transfer in which a DONOR donates only the missing write-sets to the rejoining node (also known as the JOINER).

In this article I will show how a DONOR for the IST process is selected.

Selecting an IST DONOR

First, a word about gcache. Each node retains some write-sets in a cache known as gcache. Once the gcache is full, the oldest write-sets are purged to make room for new ones. Depending on its gcache configuration, each node may retain a different span of write-sets. The wider the span, the greater the probability that the node can act as a prospective DONOR. The lowest seqno in gcache can be queried with: show status like 'wsrep_local_cached_downto';
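To make the idea of a retained span concrete, here is a toy model in Python. It is only a sketch: the class name `ToyGcache` is hypothetical, and it bounds the cache by a number of write-sets, whereas a real gcache is sized in bytes. It shows how two nodes that saw the same write-sets can end up with different lowest cached seqnos.

```python
from collections import deque


class ToyGcache:
    """Toy write-set cache that keeps only the newest `capacity` seqnos.

    Assumption: capacity is counted in write-sets to keep the model simple.
    """

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest entry when a new one arrives.
        self._seqnos = deque(maxlen=capacity)

    def append(self, seqno):
        self._seqnos.append(seqno)

    def cached_downto(self):
        # Plays the role of wsrep_local_cached_downto in this toy model.
        return self._seqnos[0] if self._seqnos else None


small = ToyGcache(capacity=10)
large = ToyGcache(capacity=50)
for seqno in range(1, 101):
    small.append(seqno)
    large.append(seqno)

print(small.cached_downto(), large.cached_downto())  # 91 51
```

The node with the larger cache can serve a JOINER that has fallen further behind.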

Let’s understand the IST DONOR algorithm with a topology and working example:

  • Say we have a 3-node cluster: N1, N2, N3.
  • To start with, all 3 nodes are in sync (wsrep_last_committed is the same for all 3 nodes, let’s say 100).
  • N3 is scheduled for maintenance and is taken down.
  • In the meantime, N1 and N2 process workload, thereby moving them from 100 -> 1100.
  • N1 and N2 also purge their gcache. Let’s say wsrep_local_cached_downto for N1 and N2 is 110 and 90 respectively.
  • Now N3 is restarted and discovers that the cluster has made progress from 100 -> 1100 and so it needs the write-sets from (101, 1100).
  • It starts looking for a prospective DONOR.
    • N1 can service data from (110, 1100) but the request is for (101, 1100) so N1 can’t act as DONOR
    • N2 can service data from (90, 1100) and the request is for (101, 1100) so N2 can act as DONOR.
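The check in the walkthrough above can be sketched in a few lines of Python. The function name `can_serve_ist` and the dictionary layout are illustrative assumptions, not Galera internals; the numbers come from the example.

```python
def can_serve_ist(cached_downto, first_needed):
    """A node can donate via IST only if its cache reaches back far enough."""
    return cached_downto <= first_needed


# wsrep_local_cached_downto per node, from the example above.
cached_downto = {"N1": 110, "N2": 90}
first_needed = 101  # N3 stopped at 100, so it needs write-sets from 101 onward

candidates = [
    node for node, low in cached_downto.items()
    if can_serve_ist(low, first_needed)
]
print(candidates)  # ['N2']
```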

Safety gap and how it affects DONOR selection

So far so good. But can N2 reliably act as DONOR? While N3 is evaluating the prospective DONOR, what if N2 purges more data and now wsrep_local_cached_downto on N2 is 105? In order to accommodate this, the N3 algorithm adds a safety gap.

safety gap = (current seqno of the cluster – lowest available seqno from any of the existing nodes of the cluster) * 0.008

So the N2 range is considered to be (90 + (1100 – 90) * 0.008, 1100) = (98, 1100).

Can N2 now act as DONOR? Yes: the adjusted lower bound 98 is still below the requested start 101, so (98, 1100) covers (101, 1100).

What if N2 had purged up to 95 before N3 started looking for a prospective DONOR?

In this case the N2 range would be (95 + (1100 – 95) * 0.008, 1100) = (103, 1100), ruling N2 out from the prospective DONOR list.
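Both cases can be reproduced with a short Python sketch of the safety gap arithmetic. The function names are hypothetical, and truncating the result to an integer is an assumption chosen because it matches the 98 and 103 shown above; the real implementation may round differently.

```python
SAFETY_FACTOR = 0.008  # multiplier from the safety gap formula above


def safety_gap(cluster_seqno, lowest_cached_seqno):
    return (cluster_seqno - lowest_cached_seqno) * SAFETY_FACTOR


def adjusted_range_start(cached_downto, cluster_seqno, lowest_cached_seqno):
    # Assumption: truncate to an integer seqno.
    return int(cached_downto + safety_gap(cluster_seqno, lowest_cached_seqno))


CLUSTER_SEQNO = 1100
FIRST_NEEDED = 101

results = {}
for n2_low in (90, 95):
    # In both cases N2 holds the lowest seqno in the cluster (N1 is at 110).
    start = adjusted_range_start(n2_low, CLUSTER_SEQNO, n2_low)
    results[n2_low] = (start, start <= FIRST_NEEDED)

print(results)  # {90: (98, True), 95: (103, False)}
```

With the cache reaching back to 90, N2 stays on the prospective DONOR list; with 95, the safety gap pushes it out.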

Twist at the end

Consider the latter case above (N2 purged up to 95). We established that N2 can’t act as the IST DONOR, so it seems the only way for N3 to join is through SST.

What if I say that N3 still joins back using IST? CONFUSED?

Once N3 falls back from IST to SST, it selects an SST donor. This selection is done sequentially and nominates N1 as the first choice. N1 doesn’t have the required write-sets, so SST is forced.

But what if I configure wsrep_sst_donor=N2 on N3? This will cause N2 to get selected instead of N1. But wait: N2 doesn’t qualify either, because with the safety gap its range is (103, 1100).

That’s true. But the request carries both an IST and an SST request, so even though N3 ruled out N2 as the IST DONOR, the request is sent for one last try. If N2 can service the request using IST, it is allowed to do so. Otherwise it falls back to SST.
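A simplified Python model of this last step is shown below. It is a sketch under stated assumptions: `transfer_method` and `rejoin` are hypothetical names, the donor-side decision is reduced to a single comparison against the donor's actual cache (no safety gap), and the default donor is simply the first node in the list.

```python
def transfer_method(donor_cached_downto, first_needed):
    """Donor-side decision: serve IST if the actual cache still covers the request."""
    return "IST" if donor_cached_downto <= first_needed else "SST"


# Actual wsrep_local_cached_downto per node in the "purged up to 95" case.
CACHED_DOWNTO = {"N1": 110, "N2": 95}
FIRST_NEEDED = 101


def rejoin(preferred_donor=None):
    # Assumption: without a preference, the first node in order is nominated.
    donor = preferred_donor or next(iter(CACHED_DOWNTO))
    return donor, transfer_method(CACHED_DOWNTO[donor], FIRST_NEEDED)


print(rejoin())      # ('N1', 'SST')
print(rejoin("N2"))  # ('N2', 'IST')
```

The safety gap only affects which node the JOINER considers a prospective IST DONOR. The node that finally receives the combined request decides based on what it really has in gcache.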

Interesting! This is a well thought out algorithm from Codership. I applaud them for this and the many other important control functions that go on backstage in Galera Cluster.

3 Comments
realshantih

It’s very hard to understand..

Krunal Bauskar

Indeed it is a complex algorithm that works beautifully without end-user caring about the real stuff.
Any specific section you have problem grasping let me know I can try to explain it.

I don’t understand so much about donor node. Can you explain me more detailed about this?