In this blog post, we’ll look at how the hidepid options for /proc and Percona XtraDB Cluster can fight with one another.
One of the things I like about consulting at Percona is the opportunity to be exposed to unusual problems. I recently worked with a customer having issues getting SST to work with Percona XtraDB Cluster. A simple problem, you would think. After four hours of debugging, my general feeling was that nothing made sense.
I added a bash trace to the SST script and it claimed MySQL died prematurely:
    [ -n '' ]]
    + ps -p 11244
    + wsrep_log_error 'Parent mysqld process (PID:11244) terminated unexpectedly.'
    + wsrep_log '[ERROR] Parent mysqld process (PID:11244) terminated unexpectedly.'
    ++ date '+%Y-%m-%d %H:%M:%S'
    + local readonly 'tst=2017-11-28 22:02:46'
At the same time, the MySQL error log showed mysqld complaining that the SST script had died:
    2017-11-28 22:02:46 11244 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '172.31.4.179' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '11244' '' : 32 (Broken pipe)
    2017-11-28 22:02:46 11244 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
    2017-11-28 22:02:46 11244 [ERROR] WSREP: SST script aborted with error 32 (Broken pipe)
    2017-11-28 22:02:46 11244 [ERROR] WSREP: SST failed: 32 (Broken pipe)
    2017-11-28 22:02:46 11244 [ERROR] Aborting
The Solution
Clearly, something odd was at play. But what? At that point, I decided to try a few operations as the mysql user. Finally, I stumbled onto something:
    [root@db-01 mysql]# su mysql -
    bash-4.2$ ps fax
      PID TTY      STAT   TIME COMMAND
    11901 pts/0    S      0:00 bash -
    11902 pts/0    R+     0:00  \_ ps fax
    bash-4.2$
There are well over 100 processes on these servers, so why can’t the mysql user see them? The SST script monitors the state of its parent process using “ps”. Look at the bash trace above: 11244 is the mysqld PID, and the “ps -p 11244” check is what concluded the parent had died. After a little Googling, I found a blog post about the /proc hidepid mount option. Sure enough, the customer was using this option:
    [root@db-02 lib]# mount | grep '^proc'
    proc on /proc type proc (rw,nosuid,nodev,noexec,relatime,hidepid=2)
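A quick way to check whether a node is affected is to look at the options of the proc mount. The following is a minimal bash sketch, not part of the SST scripts or of any Percona tool: the function names hidepid_from_opts and proc_hidepid are made up for illustration, and it assumes the usual /proc/mounts field layout (device, mount point, type, options).

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names invented for this sketch).

# Print the hidepid value found in a comma-separated mount option string.
# Prints 0 when the option is absent, since hiding is off by default.
hidepid_from_opts() {
    local opts=$1 opt
    local IFS=,
    for opt in $opts; do
        case $opt in
            hidepid=*)
                printf '%s\n' "${opt#hidepid=}"
                return 0
                ;;
        esac
    done
    printf '0\n'
}

# Find the proc mount in a mounts table (default: /proc/mounts) and
# print its hidepid value. Returns 1 if no proc mount on /proc is listed.
proc_hidepid() {
    local mounts_file=${1:-/proc/mounts}
    local dev mnt fstype opts rest
    while read -r dev mnt fstype opts rest; do
        if [[ $mnt == /proc && $fstype == proc ]]; then
            hidepid_from_opts "$opts"
            return 0
        fi
    done < "$mounts_file"
    return 1
}
```

With the mount options shown above, calling proc_hidepid on that node would be expected to print 2. Any value other than 0 is a hint that tools relying on “ps” may not see every process.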
I disabled process hiding by remounting /proc with hidepid=0 on all the nodes:
    mount -o remount,rw,nosuid,nodev,noexec,relatime,hidepid=0 /proc
This simple command solved the issue. The SST scripts started to work normally. A good lesson learned: do not overlook security settings!
But now you compromised the security of the server!
I would have appreciated an alternative approach to fixing the issue.
Either amend the SST script to work with hidepid in mind or, maybe better (or perhaps the only answer), create an exclusion group, add the mysql user to that group, and (re)mount proc with gid=$EXCLUSION_GROUP_ID_NO.
But I get what you're saying, i.e. do not overlook security settings!
P.S. It might be an idea to have pt-summary recognize and warn when proc is mounted with hidepid.
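The exclusion-group idea from the comment above could be sketched as follows. This is an untested outline, not a verified fix for SST: the group name procview and the GID 990 are hypothetical, the other mount options are copied from the customer's mount output, and the script only prints the commands unless APPLY=1 is set. The gid= option of the proc filesystem expects a numeric group ID, and members of that group see /proc as if hidepid=0 were in effect.

```shell
#!/usr/bin/env bash
# Dry-run wrapper: print the command by default, execute it only
# when APPLY=1 is set in the environment (requires root).
run() {
    if [[ ${APPLY:-0} == 1 ]]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# allow_proc_access GROUP GID USER
# Create a group exempt from hidepid, add USER to it, and remount
# /proc with hidepid=2 still active for everyone else.
allow_proc_access() {
    local group=$1 gid=$2 user=$3
    run groupadd --system --gid "$gid" "$group"
    run usermod -a -G "$group" "$user"
    run mount -o "remount,rw,nosuid,nodev,noexec,relatime,hidepid=2,gid=$gid" /proc
}

# Example with hypothetical values:
#   allow_proc_access procview 990 mysql
```

mysqld would most likely need a restart to pick up the new supplementary group, and the matching /etc/fstab entry would have to be updated for the change to survive a reboot.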