The Case

Securing MySQL is always a challenge. There are general best practices that can be followed for securing your installation, but the more complex setup you have the more likely you are to face some issues which can be difficult to troubleshoot.

We’ve recently been working on a case (thanks Alok Pathak and Janos Ruszo for their major contribution to this case) where MySQL started becoming unavailable when threads activity was high, going beyond a threshold, but not always the same one.

During that time there were many logs like the following, and mysqld was becoming unresponsive for a few seconds.

The “Got an error writing communication packets” is a quite common log message which may be caused for multiple reasons.

B.4.2.10 Communication Errors and Aborted Connections is the link to the official MySQL documentation, but many blog posts have been written as well.

How We Approached This Issue to Find the Root Cause

The first thing to do was to remotely execute a simple loop to figure out if this is just randomly happening, whether this is a network issue or an issue related to mysqld itself.

What we initially wanted to do was to confirm the behavior that was reported by the customer. So given that all the app servers were remotely located (thus clients connecting over TCP), we wanted to confirm if there are actually remote connections being dropped (so a network issue? or unresponsive MySQL for any reason? ). We also wanted to verify if there is a pattern, i.e. one connection out of X being dropped or connections being dropped after a certain amount of time. Having a pattern usually helps to identify what the root cause may be. Another reason for executing this remote connectivity loop was to verify if this issue is happening only when remotely connecting or if it also happens with local connections (local connection was later tested).

Doing some troubleshooting on the network layer there was nothing wrong, so we decided to locally start connecting to mysqld over TCP using another loop. This test showed that MySQL was indeed unavailable (or at least we could not randomly access it). Unfortunately, at that point, we didn’t test local connections through a socket. Connecting through a socket totally bypasses the network layer. If we had tried connecting using a socket we would have immediately realized that it was not actually a MySQL issue, as MySQL was always available (so something was blocking connections on the network level). Further details below.

Moving troubleshooting forward, netstat revealed many connections in a TIME_WAIT state. TIME_WAIT indicates that the source side has closed the connection. Below you can find an example, on a testing environment, of how you can use netstat to identify such TCP connections.

This made us believe that we may have run out of TCP connections on the TCP layer due to an increased amount of TCP sessions which were left open until the time_wait  timeout occurs. An interesting blog post, “Application Cannot Open Another Connection to MySQL” was written some time ago. This can give you a really good idea of what the “TIME_WAIT” issue is and what actions can be taken to remediate this.

We initially tried to fine-tune the port range ip_local_port_range  and adjust some kernel-related options like tcp_tw_reuse, but unfortunately, there was no luck. The behavior was still the same.

Inspecting network traffic revealed that the host was doing a crazy amount of requests to the DNS server defined into /etc/resolv.conf. Talking about network traffic inspection, we were not able to verify a few things on the network layer as network infrastructure was not managed by us. We got a confirmation from the customer’s IT department that nothing wrong was found on the network layer. What we could do was a packet inspection on the traffic coming in and out MySQL, and tcpdump helped identify the high amount of DNS requests and its slow responses back. The command initially used for packet inspection on the db node was tcpdump dst port 3306 or src port 3306 and more specific filters were afterward to exclude and filter out any non-helpful information such as traffic between master and slaves.

At that time, another thing that came to our mind was to verify whether, for any reason, mysqld is trying to do any DNS lookups. That could explain a problem during the initial negotiation. Checking the variable the skip_name_resolve we found that it was already ON so mysqld should not perform any kind of DNS lookups.

Trying to further debug what MySQL was actually doing, we started an strace for the mysqld process.

The Root Cause

What we noticed was that mysql was too frequently accessing the /etc/hosts.allow and /etc/hosts.deny files. Voila!

From what we had seen, some new connections were taking a long time to connect MySQL. The strace on mysqld pid showed frequent access to the /etc/hosts.allow  and /etc/hosts.deny  files. These files are directly related to tcp wrappers! TCP wrappers are considered by many sysadmins as deprecated software (software development has stopped, but quite a few alternatives exist) but these are still widely used. With TCP wrappers, each new connection has to be checked against an ACL and based on this ACL a decision will be taken whether the remote host is allowed to connect to a service or not.

The DNS lookups that we found while troubleshooting had nothing to do with the skip_name_resolve mysql functionality nor mysqld itself and at that point, we knew that it was not actually a mysqld issue. MySQL was up and running being able to serve requests. Personally speaking, I’d describe this as a “by design” software Achilles’ Heel.

Doing some more investigation what we found was that there was a DNS misconfiguration ( /etc/resolv.conf ) so when DNS responses were slow or the DNS server was not responding, TCP wrappers made connections to mysql to stall or just be dropped while waiting for the DNS response.

Conclusion

Having that in mind, if for any reason you need to use TCP wrappers please always be aware that any DNS issues may cause a stall or an outage. With TCP wrappers, a reverse lookup may happen even if you have skip_name_resolve enabled. Even if you have not manually installed TCP wrappers there are some OS, like certain versions of Centos or Ubuntu, that by default have them available – so be careful if you are going to set up security rules for MySQL. Please note that not every single binary can make use of them. Binary should be linked against the TCP wrappers libraries.

If TCP wrappers are enabled for the MySQL service and you really need them, you should always ensure that DNS responses will be fast and for safety, even if this is not easily manageable, add any hosts connecting to mysqld into the /etc/hosts file so that a real DNS lookup won’t be needed for each connection. There are a few best practices while configuring DNS resolution. In my opinion, you should at least set more than one DNS server in /etc/resolv.conf and you should use some local ones or the closest ones to your linux server and maybe enable some kind of caching.

If you are facing a similar issue you can check whether mysqld is built against TCP wrappers doing an ldd over the mysqld binary and checking whether this is linked against the TCP wrappers libraries. If it is, and you face such an issue, then take a look into your system’s DNS configuration and whether DNS is performing ok or not.

 

4 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
dba100

hi,

“Inspecting network traffic revealed that the host was doing a crazy amount of requests to the DNS server defined into /etc/resolv.conf”

how can you check this out ?

by using “tcpdump dst port 3306 or src port 3306 “, where dst is the IP address of the destination MySQL? and src is the source host/machine which is connecting to the MySQL in concern ?

” we noticed was that mysql was too frequently accessing the /etc/hosts.allow and /etc/hosts.deny files. ”

how can you know the mysQL is keep accessing both files ? by this command: strace -e open,read -p$(pidof mysqld) ?

“With TCP wrappers, each new connection has to be checked against an ACL and based on this ACL a decision will be taken whether the remote host is allowed to connect to a service or not.”

if not using TCP wrappers, what should we use ? why people keep using it ? and what LInux and MySQL security feature can replace it ?

dba100

hi,

“we did a tcpdump as a first step to check connections coming in and going out.”

from DNS to the MySQL and MySQL to DNS ?

“You just need to pay attention on specific things, as you should always do with every software that you may be using. ”

but problem is , very hard to dig it out . takes a long time! who knows it is creating problem as people keep using it.

“Firewalls based on iptables could be an alternative.”
someone suggest remove tcpwrapper and only use firewall,
the iptables, the one IP address allow list in MySQL?