The Case
Securing MySQL is always a challenge. There are general best practices you can follow to secure your installation, but the more complex your setup, the more likely you are to run into issues that are difficult to troubleshoot.
We’ve recently been working on a case (thanks to Alok Pathak and Janos Ruszo for their major contributions) where MySQL became unavailable when thread activity rose beyond a threshold, though not always the same one.
During those periods the error log filled with messages like the following, and mysqld became unresponsive for a few seconds.
```
2019-11-27T10:26:03.476282Z 7736563 [Note] Got an error writing communication packets
2019-11-27T10:26:03.476305Z 7736564 [Note] Got an error writing communication packets
```
“Got an error writing communication packets” is a fairly common log message that can have multiple causes. B.4.2.10, “Communication Errors and Aborted Connections”, is the relevant section of the official MySQL documentation, and many blog posts have been written about it as well.
How We Approached This Issue to Find the Root Cause
The first thing to do was to remotely run a simple loop, to figure out whether this was happening randomly and whether it was a network issue or a problem in mysqld itself.
```
[RDBA] percona@monitoring1: ~ $ time for i in {1..100}; do mysql -h 10.0.2.14 -Bsse "show status like '%uptime';"; done
Uptime	3540
Uptime	3540
Uptime	3540
Uptime	3541
Uptime	3541
Uptime	3541
Uptime	3541
Uptime	3542
Uptime	3542
Uptime	3542
Uptime	3543
Uptime	3543
Uptime	3543
Uptime	3543
Uptime	3543
Uptime	3544
^C
```
What we initially wanted was to confirm the behavior the customer had reported. Since all the app servers were remote (so clients connected over TCP), we wanted to confirm whether remote connections were actually being dropped: a network issue, or MySQL unresponsive for some reason? We also wanted to see whether there was a pattern, e.g. one connection out of X being dropped, or connections being dropped after a certain amount of time; having a pattern usually helps identify the root cause. Another reason for the remote connectivity loop was to verify whether the issue happened only for remote connections or also for local ones (local connections were tested later).
Troubleshooting on the network layer turned up nothing wrong, so we started connecting to mysqld locally over TCP using another loop. This test showed that MySQL was indeed intermittently unavailable, or at least that we could not reliably reach it. Unfortunately, at that point we didn’t test local connections through the UNIX socket, which bypasses the network layer entirely. Had we tried the socket, we would have immediately realized that this was not actually a MySQL issue: MySQL itself was always available, so something was blocking connections at the network level. Further details below.
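To make that distinction early in a similar case, you can probe the TCP endpoint and the UNIX socket separately. Here is a minimal sketch using only bash’s built-in /dev/tcp; the host and port are illustrative, and on the database host itself you would pair it with something like mysql -S /path/to/mysql.sock to exercise the socket path that bypasses the network layer:

```shell
# Probe a TCP endpoint; a TCP failure combined with a working socket
# connection points at the network layer rather than mysqld itself.
probe_tcp() {
  # The redirection's exit status tells us whether the connect succeeded.
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null && echo up || echo down
}

probe_tcp 127.0.0.1 3306
```

Running this in a loop alongside a socket-based loop would show immediately whether mysqld itself ever stops answering.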
Moving troubleshooting forward, netstat revealed many connections in the TIME_WAIT state. A socket sits in TIME_WAIT on the side that closed the connection first; it lingers there for a timeout so that stray packets from the old connection are not misinterpreted. Below is an example, from a testing environment, of how netstat can be used to identify such TCP connections.
```
[RDBA] percona@db4-atsaloux: ~ $ sudo netstat -a -t
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:sunrpc          0.0.0.0:*               LISTEN
tcp        0      0 db4-atsaloux:42000      0.0.0.0:*               LISTEN
tcp        0      0 localhost:domain        0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:ssh             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:nrpe            0.0.0.0:*               LISTEN
tcp        0      0 db4-atsaloux:ssh        10.0.2.10:35230         ESTABLISHED
tcp        0     36 db4-atsaloux:ssh        10.0.2.10:39728         ESTABLISHED
tcp        0      0 db4-atsaloux:49154      10.0.2.11:mysql         ESTABLISHED
tcp6       0      0 [::]:mysql              [::]:*                  LISTEN
tcp6       0      0 [::]:sunrpc             [::]:*                  LISTEN
tcp6       0      0 [::]:ssh                [::]:*                  LISTEN
tcp6       0      0 [::]:nrpe               [::]:*                  LISTEN
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50950         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50964         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50938         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50940         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:51010         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50994         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50986         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:44110         ESTABLISHED
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50984         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50978         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:51030         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50954         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:51032         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:51042         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50996         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:51046         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:51000         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50942         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:51004         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:44108         ESTABLISHED
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50992         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50988         TIME_WAIT
```
This made us suspect that we had run out of ports on the TCP layer, due to the large number of sessions lingering until the TIME_WAIT timeout expired. An interesting blog post, “Application Cannot Open Another Connection to MySQL”, was written about this some time ago; it gives a good picture of what the TIME_WAIT issue is and what can be done to remediate it.
We initially tried to fine-tune the ephemeral port range (ip_local_port_range) and adjust kernel options such as tcp_tw_reuse, but unfortunately with no luck; the behavior stayed the same.
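For reference, this kind of tuning is usually applied through sysctl. The values below are purely illustrative, not a recommendation:

```
# /etc/sysctl.d/99-tcp-tuning.conf -- illustrative values only
# Widen the ephemeral (local) port range:
net.ipv4.ip_local_port_range = 15000 65000
# Allow reuse of TIME_WAIT sockets for new outgoing connections:
net.ipv4.tcp_tw_reuse = 1
```

Files under /etc/sysctl.d/ are loaded with sysctl --system. In our case, none of this changed the behavior, which was a hint that the bottleneck was elsewhere.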
Inspecting network traffic revealed that the host was making a huge number of requests to the DNS server defined in /etc/resolv.conf. Speaking of network traffic inspection: we could not verify everything on the network layer ourselves, as the network infrastructure was not managed by us, but the customer’s IT department confirmed that nothing wrong was found there. What we could do was packet inspection of the traffic going in and out of MySQL, and tcpdump helped identify the large number of DNS requests and the slow responses coming back. The command initially used for packet inspection on the db node was tcpdump dst port 3306 or src port 3306; more specific filters were applied afterward to exclude unhelpful information, such as traffic between the master and the slaves.
At that point, another thing that came to mind was to verify whether mysqld was, for any reason, performing DNS lookups, which could explain a problem during the initial connection negotiation. Checking the skip_name_resolve variable, we found it was already ON, so mysqld itself should not perform any kind of DNS lookup.
```
db4-atsaloux (none)> select @@skip_name_resolve;
+---------------------+
| @@skip_name_resolve |
+---------------------+
|                   1 |
+---------------------+
1 row in set (0.00 sec)
```
Trying to further debug what MySQL was actually doing, we attached strace to the mysqld process.
The Root Cause
What we noticed was that mysqld was accessing the /etc/hosts.allow and /etc/hosts.deny files very frequently. Voila!
```
root@db4-atsaloux:~# strace -e open,read -p$(pidof mysqld)
strace: Process 693 attached
# /etc/hosts.allow: list of hosts that are allowed to access the system.
read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464
read(51, "", 4096)                      = 0
# /etc/hosts.deny: list of hosts that are _not_ allowed to access the system.
read(51, "# /etc/hosts.deny: list of hosts"..., 4096) = 721
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.deny: list of hosts"..., 4096) = 721
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.deny: list of hosts"..., 4096) = 721
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.deny: list of hosts"..., 4096) = 721
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.deny: list of hosts"..., 4096) = 721
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.deny: list of hosts"..., 4096) = 721
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464
read(51, "", 4096)                      = 0
```
From what we had seen, some new connections were taking a long time to connect to MySQL, and the strace of the mysqld pid showed frequent access to /etc/hosts.allow and /etc/hosts.deny. These files belong to TCP wrappers! Many sysadmins consider TCP wrappers deprecated software (development has stopped, and quite a few alternatives exist), but they are still widely used. With TCP wrappers, each new connection is checked against an ACL, and based on that ACL a decision is made whether the remote host is allowed to connect to the service or not.
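For illustration only (the network below is made up, and the daemon name must match what libwrap reports for the binary, typically mysqld), such an ACL could look like:

```
# /etc/hosts.allow -- hosts permitted to reach mysqld (illustrative)
mysqld : 10.0.2.0/255.255.255.0

# /etc/hosts.deny -- refuse everything not matched above
mysqld : ALL
```

Note that, depending on the patterns used and how libwrap was built, a reverse DNS lookup may still be performed for every incoming connection, which is exactly where DNS problems start to hurt.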
The DNS lookups we found while troubleshooting had nothing to do with MySQL’s skip_name_resolve functionality, nor with mysqld itself; at that point we knew it was not actually a mysqld issue. MySQL was up and running, able to serve requests. Personally, I’d describe this as a “by design” Achilles’ heel of the software.
Doing some more investigation, we found a DNS misconfiguration in /etc/resolv.conf: when DNS responses were slow, or the DNS server was not responding, TCP wrappers caused connections to MySQL to stall, or simply be dropped, while waiting for the DNS response.
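A quick way to sanity-check resolver latency from the shell is to time a lookup through the system’s NSS stack, which is the same path libwrap’s lookups go through. A rough sketch (GNU date assumed for nanosecond timestamps):

```shell
# Time a hostname lookup in milliseconds; slow results here point at the
# resolvers in /etc/resolv.conf (consulted after /etc/hosts, per nsswitch.conf).
time_lookup() {
  local start end
  start=$(date +%s%N)                  # nanoseconds since epoch (GNU date)
  getent hosts "$1" >/dev/null 2>&1
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))  # elapsed milliseconds
}

time_lookup localhost
```

A name served from /etc/hosts should come back within a few milliseconds; lookups taking hundreds of milliseconds, or timing out, reproduce exactly the kind of stall described above.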
Conclusion
With that in mind, if for any reason you need to use TCP wrappers, always be aware that any DNS issue may cause a stall or an outage. With TCP wrappers, a reverse lookup may happen even if you have skip_name_resolve enabled. Even if you have not manually installed TCP wrappers, some operating systems, such as certain versions of CentOS or Ubuntu, ship with them available by default, so be careful if you are going to set up security rules for MySQL. Note that not every binary can make use of them; the binary must be linked against the TCP wrappers library.
If TCP wrappers are enabled for the MySQL service and you really need them, always ensure that DNS responses will be fast, and for safety, even if it is not easily manageable, add the hosts that connect to mysqld to the /etc/hosts file so that a real DNS lookup is not needed for each connection. There are a few best practices for configuring DNS resolution: in my opinion, you should set more than one DNS server in /etc/resolv.conf, use local servers or the ones closest to your Linux host, and perhaps enable some kind of caching.
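A sketch of that safety net (all names and addresses below are made up):

```
# /etc/hosts -- pin the app servers locally so no DNS round trip is needed
10.0.2.10   app1.example.local  app1
10.0.2.11   app2.example.local  app2

# /etc/resolv.conf -- more than one nameserver, closest first, short timeout
nameserver 10.0.2.2
nameserver 10.0.2.3
options timeout:1 attempts:2
```

With entries like these, a lookup for a connecting app server is answered from /etc/hosts, and even a dead first nameserver costs at most the configured timeout.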
If you are facing a similar issue, you can check whether mysqld is built against TCP wrappers by running ldd on the mysqld binary and checking whether it links against the libwrap library. If it does, and you hit such an issue, take a look at your system’s DNS configuration and check whether DNS is performing well.
```
[RDBA] percona@db4-atsaloux: ~ $ ldd /usr/sbin/mysqld | grep libwrap
	libwrap.so.0 => /lib/x86_64-linux-gnu/libwrap.so.0 (0x00007fa80532c000)
```
hi,
“Inspecting network traffic revealed that the host was doing a crazy amount of requests to the DNS server defined into /etc/resolv.conf”
how can you check this?
by using “tcpdump dst port 3306 or src port 3306”, where dst is the IP address of the destination MySQL server and src is the source host/machine connecting to that MySQL?
” we noticed was that mysql was too frequently accessing the /etc/hosts.allow and /etc/hosts.deny files. ”
how can you know MySQL keeps accessing both files? by this command: strace -e open,read -p$(pidof mysqld) ?
“With TCP wrappers, each new connection has to be checked against an ACL and based on this ACL a decision will be taken whether the remote host is allowed to connect to a service or not.”
if not using TCP wrappers, what should we use? why do people keep using it? and what Linux or MySQL security feature can replace it?
Hi.
– For the network packet inspection, we ran tcpdump as a first step to check connections coming in and going out.
– We found that MySQL was accessing these files using strace, capturing only the open and read system calls. You can use strace in multiple ways, but this is how we reached that conclusion.
– I don’t say that you shouldn’t use TCP wrappers; you just need to pay attention to specific things, as you always should with any software you use. Firewalls based on iptables could be an alternative; there are plenty. You could also build your architecture accordingly, e.g. a closed network, VPCs, etc. You should also restrict MySQL accounts, e.g. not allow '%' as a host. And of course, use a combination of all of these methods 🙂
hi,
“we did a tcpdump as a first step to check connections coming in and going out.”
from DNS to the MySQL and MySQL to DNS ?
“You just need to pay attention on specific things, as you should always do with every software that you may be using. ”
but the problem is, it’s very hard to dig out and takes a long time! who would know it is causing the problem while people keep using it?
“Firewalls based on iptables could be an alternative.”
someone suggested removing TCP wrappers and only using a firewall,
i.e. iptables, plus the per-IP allow list in MySQL?
“from DNS to the MySQL and MySQL to DNS ?”
That was done on the MySQL host, to verify whether it was receiving connections and whether connections were going out.
“but problem is , very hard to dig it out . takes a long time! who knows it is creating problem as people keep using it.”
Problems can be caused by any software, and they are not always easy to diagnose. For example, with iptables or any other firewall you may accidentally block something legitimate. I’m not sure what you are really asking here. Going through the TCP wrappers documentation, you can easily find notes about issues caused by a DNS server that is not functioning properly.
“someone suggest remove tcpwrapper and only use firewall, the iptables, the one IP address allow list in MySQL?”
The purpose of this blog post is not to suggest best practices for securing your installation. The decision is always up to you, based on your needs; you have to evaluate what is better.