Recently, I had to work on an emergency for a customer who was having a problem restarting a MariaDB Galera Cluster. After a failure in the cluster, they decided to restart the cluster entirely, following the right path: bootstrap the first node, then add the rest of the members one by one. Everything went fine until one of the nodes failed to join the cluster.
Given this problem, the customer asked us to help with the problematic node, because none of the tests they tried worked and the same symptom repeated over and over: the SST started, copied a few gigabytes of data, and then apparently just hung while the node remained out of the cluster.
Identifying the issue…
Once on board with the issue, I first confirmed that the cluster was attempting an SST. Given that the whole dataset was only about 31GB, I decided to go straight for a clean solution: wipe the whole datadir and start afresh. No luck at all; the symptom was exactly the same no matter what I tried:
After reviewing the logs I noticed a few strange things. In the joiner:
2019-01-29 16:14:41 139996474869504 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (18153472-f958-11e8-ba63-fae8ac6c22f8): 1 (Operation not permitted) at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
2019-01-29 16:14:41 139996262553344 [Note] WSREP: Member 3.0 (node1) requested state transfer from '*any*'. Selected 0.0 (node3)(SYNCED) as donor.
2019-01-29 16:14:41 139996262553344 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 4902465)
2019-01-29 16:14:41 139996474869504 [Note] WSREP: Requesting state transfer: success, donor: 0
2019-01-29 16:14:41 139996474869504 [Note] WSREP: GCache history reset: 00000000-0000-0000-0000-000000000000:0 -> 18153472-f958-11e8-ba63-fae8ac6c22f8:4902465
2019-01-29 16:14:42 139996270946048 [Note] WSREP: (9864c6ca, 'tcp://0.0.0.0:4567') connection to peer 9864c6ca with addr tcp://192.168.12.21:4567 timed out, no messages seen in PT3S
2019-01-29 16:14:42 139996270946048 [Note] WSREP: (9864c6ca, 'tcp://0.0.0.0:4567') turning message relay requesting off
2019-01-29 16:16:08 139996254160640 [ERROR] WSREP: Process was aborted.
2019-01-29 16:16:08 139996254160640 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '192.168.12.21' --datadir '/var/lib/mysql/' --parent '8725' --binlog '/var/log/mysql/mariadb-bin' --binlog-index '/var/log/mysql/mariadb-bin.index': 2 (No such file or directory)
In the donor (the output has been obfuscated to avoid sharing private info, and the times deliberately do not match):
Jan 29 18:08:22 node3 -innobackupex-backup: 190129 18:08:22 >> log scanned up to (203524317205)
Jan 29 18:08:23 node3 -innobackupex-backup: 190129 18:08:23 >> log scanned up to (203524318337)
Jan 29 18:08:24 node3 -innobackupex-backup: 190129 18:08:24 >> log scanned up to (203524320436)
Jan 29 18:08:25 node3 -innobackupex-backup: 190129 18:08:25 >> log scanned up to (203524322720)
Jan 29 18:08:25 node3 nrpe[25546]: Error: Request packet type/version was invalid!
Jan 29 18:08:25 node3 nrpe[25546]: Client request was invalid, bailing out...
Jan 29 18:08:26 node3 -innobackupex-backup: 190129 18:08:26 >> log scanned up to (203524322720)
Jan 29 18:08:27 node3 -innobackupex-backup: 190129 18:08:27 >> log scanned up to (203524323538)
Jan 29 18:08:28 node3 -innobackupex-backup: 190129 18:08:28 >> log scanned up to (203524324667)
Jan 29 18:08:29 node3 -innobackupex-backup: 190129 18:08:29 >> log scanned up to (203524325358)
Jan 29 18:08:30 node3 -wsrep-sst-donor: 2019/01/29 18:08:30 socat[22843] E write(6, 0x1579220, 8192): Broken pipe
Jan 29 18:08:30 node3 -innobackupex-backup: innobackupex: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
Jan 29 18:08:30 node3 -innobackupex-backup: xb_stream_write_data() failed.
Jan 29 18:08:30 node3 -innobackupex-backup: innobackupex: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
Jan 29 18:08:30 node3 -innobackupex-backup: [01] xtrabackup: Error: xtrabackup_copy_datafile() failed.
Jan 29 18:08:30 node3 -innobackupex-backup: [01] xtrabackup: Error: failed to copy datafile.
Jan 29 18:08:30 node3 mysqld[27345]: 2019-01-29 18:08:30 140562136139520 [Warning] Aborted connection 422963 to db: 'unconnected' user: 'sstuser' host: 'localhost' (Got an error reading communication packets)
Jan 29 18:08:30 node3 -wsrep-sst-donor: innobackupex finished with error: 1. Check /var/lib/mysql//innobackup.backup.log
Jan 29 18:08:30 node3 -wsrep-sst-donor: Cleanup after exit with status:22
So the SST started correctly and then failed. I tried forcing different donors, checked firewall rules, etc. Nothing.
Additionally, I noticed that the process was starting over and over: while monitoring, the .sst folder grew to a certain size (around 7GB) and then started again from scratch. The logs kept showing the same messages; the init script failed with an error, but the process kept running until I executed service mysql stop or sent kill -9 to all the processes. It was getting stranger every minute.
At this point I was totally lost, scratching my head looking for solutions. Stranger still, a manual SST using netcat worked perfectly! So it was definitely a problem with the init script. The systemd journal was not providing any insight…
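For reference, a manual SST-style check amounts to streaming a tar of the datadir between the two nodes. The sketch below is runnable locally: a local pipe stands in for the netcat TCP connection, and the hosts/ports in the comments are illustrative assumptions, not the ones from the incident.

```shell
set -e
# On real hosts, the two halves run on donor and joiner over netcat:
#   joiner:  nc -l -p 4444 | tar xf - -C /var/lib/mysql
#   donor:   tar cf - -C /var/lib/mysql . | nc <joiner-ip> 4444
# Here a local pipe replaces the TCP link so this can run anywhere.
src=$(mktemp -d) && dst=$(mktemp -d)
echo "dummy datafile" > "$src/ibdata1"     # stand-in for the dataset
tar cf - -C "$src" . | tar xf - -C "$dst"  # donor side | joiner side
cat "$dst/ibdata1"                         # prints: dummy datafile
```

The point of the manual test is that it takes systemd (and the SST script's supervision) out of the picture entirely, which is what isolated the init script as the culprit.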
And then…
MariaDB Cluster dies in the SST process after 90 seconds
Suddenly I noticed that the failure was happening roughly 90 seconds after the start. A bit of googling later, with a more specific search, I found this page:
https://mariadb.com/kb/en/library/systemd/#ssts-and-systemd which explained precisely my problem.
The MariaDB init script changed its startup timeout from 900 seconds to 90, while MySQL Community and Percona Server keep this value at 15 minutes. It also seems this change caused some major issues with nodes crashing, as documented in MDEV-15607; the bug is reported as fixed, but we can still see timeout problems.
I observed that in case of failure, systemd was killing the mysqld process but not stopping the service. This results in an infinite SST loop that only stops when the service is killed or stopped via systemd command.
The fix was super easy: I just needed to create a systemd drop-in file that sets the timeout to a more useful value, as follows:
sudo mkdir -p /etc/systemd/system/mariadb.service.d
sudo tee /etc/systemd/system/mariadb.service.d/timeoutstartsec.conf <<EOF
[Service]
TimeoutStartSec=900
EOF

sudo systemctl daemon-reload
As you may notice, I set the timeout to 15 minutes, but it could be any value. That was it: the next SST had plenty of time to finish. This change is very well documented here
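To confirm the drop-in actually took effect after the daemon-reload, systemd can report the active value for the unit. This is standard systemctl usage, shown as the command to run on the node itself:

```shell
# Check the effective startup timeout for the unit.
systemctl show mariadb.service --property=TimeoutStartSec
# systemd prints the value in normalized units, e.g. TimeoutStartSec=15min
```

If the output still shows 1min 30s, the drop-in is in the wrong directory or the daemon-reload was skipped.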
On reflection…
One could argue about this change, and I’m still having some internal discussions about it. In my opinion, a 90-second timeout is too short for a Galera cluster; it is very likely that almost any cluster will exceed it during SST. Even for a regular MySQL server that crashes with a high proportion of dirty pages or many operations to roll back, 90 seconds doesn’t seem a feasible time for crash recovery. Why the developers changed it to such a short timeout, I have no idea. Luckily, it is very easy to fix now that I know the reason.
900 seconds is not enough either.
One point: you should mention the exact version of MariaDB. All versions from the last 6–10 months shouldn’t have this problem.
I was really annoyed by this lack of testing of MariaDB + Galera Cluster with SST; by now it should be done on each new release to prevent stupid things like this.
For me, I use:
TimeoutSec=infinity
or you can set it to 0 as well, if I remember correctly.
Aurélien, yeap, fair enough, and I agree with you regarding timeouts; infinite IMO is not good either, but maybe it’s just me against infinite processes 🙂
Regarding the version, this is 10.2.8-MariaDB-10.2.8, which is quite old actually. I will take a look at what these timeouts are in current versions, thanks for pointing it out!
How do I apply these settings when using mariadb-docker?
https://hub.docker.com/_/mariadb
I found that docker run does not use systemd. I really hope to get your help.
Been having exactly this problem for some time, and our database is somewhat bigger than that one. Thankfully, we discovered it on a test database first, so we’re aware of it, but it’s incredibly annoying if you forget about changing the systemd file (if you can find it/edit it – permissions).
In my case, it was going from 10.1.28 to 10.1.34 that highlighted the problem.
(also MDEV-15606)
It’s systemd. Use systemctl edit mysql and create an override. (Then symlink the /etc/systemd/system/mysql.service.d/ file to /etc/systemd/system/mysqld.service.d/ just for compat.) There’s a good chance you should be, or already are, doing this for the number of processes, number of open files, or memlock.
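A runnable sketch of that override-plus-symlink approach, using a scratch directory so it can execute anywhere; on a real host the root is /etc/systemd/system and systemctl edit creates the override file for you. The limit values here are illustrative assumptions, not recommendations:

```shell
set -e
root=$(mktemp -d)   # stands in for /etc/systemd/system on a real host
mkdir -p "$root/mysql.service.d"
# The kind of override a `systemctl edit mysql` session would create:
cat > "$root/mysql.service.d/override.conf" <<'EOF'
[Service]
TimeoutStartSec=900
LimitNOFILE=65535
LimitMEMLOCK=infinity
EOF
# Compat symlink so mysqld.service sees the same drop-in directory.
ln -s "$root/mysql.service.d" "$root/mysqld.service.d"
grep TimeoutStartSec "$root/mysqld.service.d/override.conf"
```

Symlinking the whole drop-in directory (rather than the single file the comment mentions) keeps any future overrides in sync between the two unit names.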
We had the exact same problem somewhere mid-2018. Since by then they had identified the issue and promised to fix it in the next release, I never gave it much thought. We simply added it to our Ansible playbook, and that solved it for us…
Honestly, I’ll not blame MariaDB, PXC or anything coming from Galera.. this comes from systemd-crap
They don’t know how to start databases. This was SST, but what happens when you start pulling the buffer pool from disk and you need more than 90 seconds to start?
The startup script should be smart enough to understand that the “start” process is in progress, perhaps giving you a WARN alert like “Long-running start process is running”, and let it continue.
Exiting and killing the process is simply wrong.
Ok, I can be the first to defend both systemd and the new MariaDB timeout; call me devil’s advocate. Why is recovery/joining considered a “startup process”? It should be marked as started; what it does internally to become a fully joined node is not systemd’s problem, and a lot can show progress there. 90 seconds is too long for a well-written server to start. The Galera init phase as advertised to systemd is broken, and this issue exposed it.
Slawomir and Jtomaszon, I would agree with you both, but (there is always a but) your reasoning would hold if we simply got a failure and the service stopped working. Instead, we got a continuous failure that kept restarting the service even though the command reported failed (notice that the timeout just killed the mysqld process but kept trying to start it, as mysqld_safe would). I’m OK with an error message and a failure, but not with a constant loop of SSTs until you decide to either kill or stop the service, which never started.
Sorry, but I don’t understand this at all. How can that timeout period help this at all?
It helps by avoiding systemd killing the mysqld process because the SST takes longer than the timeout period.
But the logs seem to indicate some kinda connection or file problem instead of timeouts.
No, in the donor you see this message:
Jan 29 18:08:30 node3 -innobackupex-backup: innobackupex: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
Meaning it can’t write a file because of broken pipe errors; in other words, the joiner dropped the connection because mysqld was killed, and the SST process restarted using another connection socket.
All you need to know about this problem, proper solutions depending on systemd version, workarounds etc is here:
https://jira.mariadb.org/browse/MDEV-17571
With the MariaDB docker image, no matter whether I add mysql.service.d/timeoutstartsec.conf or mysqld.service.d/timeoutstartsec.conf containing:

[Service]
TimeoutStartSec=0
TimeoutStopSec=0

or

TimeoutStartSec=infinity
TimeoutStopSec=infinity

the SST still fails. Why?
innobackupex: Error writing file 'UNKNOWN' (errno: 32 "Broken pipe")