How Securing MySQL with TCP Wrappers Can Cause an Outage

The Case
Securing MySQL is always a challenge. There are general best practices for securing an installation, but the more complex your setup, the more likely you are to run into issues that are difficult to troubleshoot.
We’ve recently been working on a case (thanks to Alok Pathak and Janos Ruszo for their major contribution to it) where MySQL became unavailable whenever thread activity climbed past a threshold, though not always the same one.
During those periods the error log contained many entries like the following, and mysqld became unresponsive for a few seconds.
2019-11-27T10:26:03.476282Z 7736563 [Note] Got an error writing communication packets
2019-11-27T10:26:03.476305Z 7736564 [Note] Got an error writing communication packets
“Got an error writing communication packets” is a fairly common log message that can have multiple causes.
The official MySQL documentation covers it in section B.4.2.10, Communication Errors and Aborted Connections, and many blog posts have been written about it as well.

How We Approached This Issue to Find the Root Cause

The first step was to run a simple loop from a remote host to find out whether the failures were random, and whether they pointed to a network issue or to mysqld itself.

[RDBA] percona@monitoring1: ~ $ time for i in {1..100}; do mysql -h 10.0.2.14 -Bsse "show status like '%uptime';"; done
Uptime	3540
Uptime	3540
Uptime	3540
Uptime	3541
Uptime	3541
Uptime	3541
Uptime	3541
Uptime	3542
Uptime	3542
Uptime	3542
Uptime	3543
Uptime	3543
Uptime	3543
Uptime	3543
Uptime	3543
Uptime	3544
^C

Our first goal was to confirm the behavior reported by the customer. All the app servers were remote, so clients connected over TCP, and we wanted to confirm whether remote connections were actually being dropped (a network issue, or MySQL being unresponsive for some reason?). We also wanted to check for a pattern, such as one connection out of every X being dropped, or connections being dropped after a certain amount of time. A pattern usually helps to identify the root cause. Finally, the remote loop would tell us whether the issue happened only with remote connections or with local ones too (local connections were tested later).
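Spotting a pattern is easier when every attempt is timestamped and failures are counted. The following is only a sketch of how the loop above could be extended; probe is a hypothetical helper name, not something used in the original investigation.

```shell
# probe <attempts> <command> [args...]
# Runs the command the given number of times and prints one line per
# attempt: UTC timestamp, attempt number, and OK/FAIL from the exit status.
# The body runs in a subshell, so its variables do not leak to the caller.
probe() (
    attempts=$1
    shift
    i=1
    failures=0
    while [ "$i" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            status=OK
        else
            status=FAIL
            failures=$((failures + 1))
        fi
        printf '%s attempt=%d %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$i" "$status"
        i=$((i + 1))
    done
    printf 'failures=%d/%d\n' "$failures" "$attempts"
)

# Example (not run here), using the same host and query as the loop above:
# probe 100 mysql -h 10.0.2.14 -Bsse "show status like '%uptime';"
```

Comparing the timestamps of the FAIL lines shows whether drops cluster in time or occur every N attempts.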

Troubleshooting at the network layer turned up nothing wrong, so we started connecting to mysqld locally over TCP using another loop. This test showed that MySQL was indeed unavailable, or at least that we could not reach it reliably. Unfortunately, at that point we didn’t test local connections through the Unix socket, which bypasses the network layer entirely. Had we done so, we would have realized immediately that this was not a MySQL issue: MySQL was available the whole time, and something was blocking connections at the network level. Further details below.
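In hindsight, testing both paths side by side is cheap. A minimal sketch, assuming the stock mysql client and its --protocol and -S options; compare_paths and the MYSQL_CLIENT override are hypothetical names introduced here for illustration, and the socket path in the example is an assumption that varies by distribution.

```shell
# compare_paths <tcp_host> <socket_path>
# Tries one connection over TCP and one over the Unix socket, and prints
# which of the two worked. MYSQL_CLIENT (hypothetical) only exists so the
# client binary can be swapped out; it defaults to "mysql".
compare_paths() (
    host=$1
    sock=$2
    client=${MYSQL_CLIENT:-mysql}
    for mode in tcp socket; do
        if [ "$mode" = tcp ]; then
            # Force TCP explicitly: with "-h localhost" the client would
            # otherwise pick the Unix socket on its own.
            set -- --protocol=TCP -h "$host"
        else
            set -- --protocol=SOCKET -S "$sock"
        fi
        if "$client" "$@" -Bsse 'select 1' >/dev/null 2>&1; then
            echo "$mode: OK"
        else
            echo "$mode: FAIL"
        fi
    done
)

# Example (not run here; the socket path is an assumption):
# compare_paths 127.0.0.1 /var/run/mysqld/mysqld.sock
```

If TCP fails while the socket succeeds, mysqld itself is up and the problem sits somewhere on the TCP connection path.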

Moving forward, netstat revealed many connections in the TIME_WAIT state. A socket enters TIME_WAIT on the side that closed the connection first, and stays there for a while before it is released. Below is an example from a testing environment of how netstat can be used to identify such TCP connections.

[RDBA] percona@db4-atsaloux: ~ $ sudo netstat -a -t
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:sunrpc          0.0.0.0:*               LISTEN
tcp        0      0 db4-atsaloux:42000      0.0.0.0:*               LISTEN
tcp        0      0 localhost:domain        0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:ssh             0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:nrpe            0.0.0.0:*               LISTEN
tcp        0      0 db4-atsaloux:ssh        10.0.2.10:35230         ESTABLISHED
tcp        0     36 db4-atsaloux:ssh        10.0.2.10:39728         ESTABLISHED
tcp        0      0 db4-atsaloux:49154      10.0.2.11:mysql         ESTABLISHED
tcp6       0      0 [::]:mysql              [::]:*                  LISTEN
tcp6       0      0 [::]:sunrpc             [::]:*                  LISTEN
tcp6       0      0 [::]:ssh                [::]:*                  LISTEN
tcp6       0      0 [::]:nrpe               [::]:*                  LISTEN
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50950         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50964         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50938         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50940         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:51010         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50994         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50986         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:44110         ESTABLISHED
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50984         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50978         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:51030         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50954         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:51032         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:51042         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50996         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:51046         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:51000         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50942         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:51004         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:44108         ESTABLISHED
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50992         TIME_WAIT
tcp6       0      0 db4-atsaloux:mysql      10.0.2.10:50988         TIME_WAIT
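On a busy server, reading such a listing by eye gets tedious. As a rough sketch, the output can be summarized per TCP state with awk; count_states is a hypothetical helper name, and it assumes the netstat column layout shown above (local address in column 4, state in column 6).

```shell
# count_states [port]
# Reads netstat -a -t style output on stdin and prints how many
# connections on the given local port are in each TCP state.
# The port defaults to 3306; pass "mysql" instead when netstat was run
# without -n and shows service names, as in the listing above.
count_states() {
    awk -v port="${1:-3306}" '
        # Keep only tcp/tcp6 rows whose local address ends in ":<port>".
        $1 ~ /^tcp/ && $4 ~ (":" port "$") { n[$6]++ }
        END { for (s in n) print s, n[s] }
    ' | sort
}

# Example (not run here):
# sudo netstat -a -t -n | count_states 3306
```

A TIME_WAIT count that grows with load and dwarfs the ESTABLISHED count is the signal that prompted the next step of the investigation.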

This made us suspect that we had run out of available TCP connections because of the large number of sessions held in TIME_WAIT until their timeout expired. An interesting blog post, “Application Cannot Open Another Connection to MySQL”, written some time ago, gives a really good idea of what the TIME_WAIT issue is and what can be done to remediate it.

We initially tried tuning the local port range (ip_local_port_range) and adjusting kernel options such as tcp_tw_reuse, but with no luck: the behavior stayed the same.
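For reference, both settings can be inspected on Linux before changing anything. The sketch below is illustrative only: port_range_size is a hypothetical helper that reads the standard /proc entry and prints how many local ports the configured range provides.

```shell
# Show the current values (standard Linux sysctl keys; not run here):
# sysctl net.ipv4.ip_local_port_range net.ipv4.tcp_tw_reuse

# port_range_size [file]
# Prints the number of ports in the local (ephemeral) port range.
# The file holds two numbers, the first and last port of the range;
# it defaults to the kernel's /proc entry.
port_range_size() {
    awk '{ print $2 - $1 + 1 }' "${1:-/proc/sys/net/ipv4/ip_local_port_range}"
}
```

Note that this range limits outgoing connections made from a host, so it matters mostly on the client side of the MySQL connections.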

Inspecting network traffic revealed that the host was sending a huge number of requests to the DNS server defined in /etc/resolv.conf. We could not verify everything at the network layer ourselves, as the network infrastructure was not managed by us, but the customer’s IT department confirmed that they had found nothing wrong there. What we could do was inspect the packets going in and out of MySQL, and tcpdump helped us identify the high volume of DNS requests and the slow responses to them. The command initially used for packet inspection on the db node was tcpdump dst port 3306 or src port 3306; more specific filters were applied afterward to exclude unhelpful traffic, such as replication traffic between master and slaves.
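The exact filters used in this case are not recorded here, so the following is only a sketch of how such a narrowed filter could be built. mysql_filter is a hypothetical helper, and the excluded addresses in the example are placeholders for replication peers.

```shell
# mysql_filter <port> [host_to_exclude...]
# Prints a tcpdump filter expression that matches traffic to or from the
# given port while skipping the listed hosts (for example, replicas).
# "port N" is shorthand for "src port N or dst port N".
mysql_filter() (
    filter="port $1"
    shift
    for h in "$@"; do
        filter="$filter and not host $h"
    done
    printf '%s\n' "$filter"
)

# Examples (not run here; the addresses are placeholders):
# sudo tcpdump -nn -i any "$(mysql_filter 3306 10.0.2.11 10.0.2.12)"
# sudo tcpdump -nn -i any port 53    # watch the DNS requests themselves
```

The -nn flag keeps tcpdump from resolving names and ports itself, which avoids adding even more DNS traffic while measuring it.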

At that point, we also wanted to verify whether mysqld was, for any reason, attempting DNS lookups, which could explain a problem during the initial connection negotiation. Checking the skip_name_resolve variable, we found that it was already ON, so mysqld should not have been performing any DNS lookups.

db4-atsaloux (none)> select @@skip_name_resolve;
+---------------------+
| @@skip_name_resolve |
+---------------------+
|                   1 |
+---------------------+
1 row in set (0.00 sec)

To dig further into what MySQL was actually doing, we attached strace to the mysqld process.

The Root Cause

We noticed that mysqld was reading the /etc/hosts.allow and /etc/hosts.deny files far too frequently. Voila!

root@db4-atsaloux:~# strace -e open,read -p$(pidof mysqld)
strace: Process 693 attached
# /etc/hosts.deny: list of hosts that are _not_ allowed to access the system.
read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464
read(51, "", 4096)                      = 0
# /etc/hosts.allow: list of hosts that are allowed to access the system.
read(51, "# /etc/hosts.deny: list of hosts"..., 4096) = 721
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.deny: list of hosts"..., 4096) = 721
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.deny: list of hosts"..., 4096) = 721
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.deny: list of hosts"..., 4096) = 721
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.deny: list of hosts"..., 4096) = 721
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.deny: list of hosts"..., 4096) = 721
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.deny: list of hosts"..., 4096) = 721
read(51, "", 4096)                      = 0
read(51, "# /etc/hosts.allow: list of host"..., 4096) = 464
read(51, "", 4096)                      = 0
