In this blog post, I will demonstrate how to use Percona Monitoring and Management (PMM) to find out why a MySQL server is stalling. I will cover only one typical cause of a stall in this example, but the same dashboards, graphs, and principles will help you in other cases as well.

Nobody wants it, but database servers sometimes stop handling connections. When that happens, the application slows down and then stops responding.

It is always better to learn about a stall from your monitoring tools rather than from your own customers.

PMM is a great help in this case. If you look at its graphs and notice that many of them started showing unusual behavior, you need to react. During a stall, you will typically see that some activity either drops to zero or jumps to an abnormally high value, and in both cases it then stays flat.

Let’s review the dashboard “MySQL Instance Summary” and its graph “MySQL Client Thread Activity” during normal operation:

PMM MySQL Instance Summary

As you can see, the number of active threads fluctuates, and this is normal for any healthy application: even when every connection is requesting data, MySQL puts some threads into an idle state while they wait for the storage engine to prepare data for them, or while the client application processes the rows it has already retrieved.
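
If you want to cross-check the graph directly on the server, the thread counters it is built from are exposed as MySQL status variables. A minimal check (the exact variables PMM charts may differ slightly):

-- Threads_connected vs. Threads_running roughly correspond to the
-- connected and active threads shown on the graph
SHOW GLOBAL STATUS LIKE 'Threads_%';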

The next screenshot was taken when the server was stalling:

Percona monitoring dashboard

In this picture, you can see that the number of active threads is near its maximum. At the same time, the “MySQL Temporary Objects” graph dropped to zero. This by itself shows that something unusual happened, but to understand the picture better, let’s examine the storage engine graphs.
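
As a quick sanity check outside of PMM, you can compare the number of running threads with the connection limit and look at the temporary object counters the second graph is based on (a sketch; healthy values depend entirely on your workload):

SHOW GLOBAL STATUS LIKE 'Threads_running';
SHOW VARIABLES LIKE 'max_connections';
-- Created_tmp_tables, Created_tmp_disk_tables, Created_tmp_files
SHOW GLOBAL STATUS LIKE 'Created_tmp%';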

Like most MySQL users, I used InnoDB for this example, so the next step in figuring out what is going on is to examine the graphs in the “MySQL InnoDB Details” dashboard.

MySQL InnoDB Details

First, we see that the number of rows InnoDB reads per second dropped to zero, as did the number of rows written. Something is preventing InnoDB from performing its operations.
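
These row counters are cumulative status variables, so sampling them twice a few seconds apart shows whether the per-second rate really dropped to zero. A minimal sketch:

-- Innodb_rows_read/_inserted/_updated/_deleted keep growing on a busy server
SHOW GLOBAL STATUS LIKE 'Innodb_rows%';
SELECT SLEEP(10);
SHOW GLOBAL STATUS LIKE 'Innodb_rows%';  -- identical values mean no row activity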

MySQL InnoDB Details PMM

More importantly, we see that all I/O operations have stopped. This is unusual even on a server that does not handle any user connections: InnoDB always performs background operations and is never completely idle.
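
The same can be verified from the data file I/O counters, which should keep increasing on any live server (a quick check, not a substitute for the graphs):

-- Innodb_data_reads, Innodb_data_writes, and Innodb_data_fsyncs stop growing during the stall
SHOW GLOBAL STATUS LIKE 'Innodb_data%';
-- the FILE I/O section of the engine status shows the same picture
SHOW ENGINE INNODB STATUS\G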

Percona InnoDB

You can see this in the “InnoDB Logging Performance” graph: InnoDB still writes to its log files, but only for background operations.
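
The redo log counters behind this graph can be checked the same way:

-- Innodb_os_log_written and Innodb_os_log_fsyncs grow only slowly when nothing
-- but background activity is writing to the redo log
SHOW GLOBAL STATUS LIKE 'Innodb_os_log%';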

InnoDB Logging Performance

InnoDB Buffer Pool activity has also stopped. What is interesting here is that the number of dirty pages dropped to zero. This is visible in the “InnoDB Buffer Pool Data” graph, where dirty pages are colored yellow. It shows that InnoDB was able to flush all dirty pages from the buffer pool once it stopped processing user queries.
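
The dirty page count shown in the graph is also available as a status variable, which makes it easy to confirm that the buffer pool was fully flushed:

-- Innodb_buffer_pool_pages_dirty drops to 0 once all dirty pages are flushed
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages%';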

At this point, we can draw a first conclusion: the stall was caused by some external lock that prevents MySQL and InnoDB from handling user requests.

MySQL and InnoDB

The “Transaction History” graph confirms this guess: there are no new transactions, and InnoDB was able to work through the transaction history that had accumulated before the stall happened.
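
On the server itself, the same information is available in the TRANSACTIONS section of the engine status output:

SHOW ENGINE INNODB STATUS\G
-- look for "History list length" in the TRANSACTIONS section;
-- a small and shrinking value means purge has caught up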

We can conclude that we are NOT experiencing hardware issues.

MySQL and InnoDB problems

This group of graphs shows why we are experiencing the stall. As you can see in the “InnoDB Row Lock Wait Time” graph, the wait time rose to its maximum value around 14:02 and then dropped to zero. No row lock waits were registered during the stall itself.

This means that at some point all InnoDB transactions were waiting for a row lock and then failed with a timeout. Still, the threads must be waiting for something. Since there are no hardware issues and InnoDB is healthy in the background, the conclusion is that all threads are waiting on a global MDL (metadata) lock taken at the server layer.
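
You can confirm this on the server: sessions blocked by a global read lock report a telltale state in the process list, and, if the mdl instrument of Performance Schema is enabled (it is disabled by default in MySQL 5.7 and enabled by default in 8.0), the lock itself is visible. A sketch:

SHOW FULL PROCESSLIST;  -- blocked sessions show State: "Waiting for global read lock"

SELECT object_type, lock_type, lock_status, owner_thread_id
  FROM performance_schema.metadata_locks
 WHERE object_type = 'GLOBAL';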

If we have Query Analytics (QAN) enabled, we can find the offending command easily.

Query Analytics (QAN)

For the selected time frame, we can see that many queries were running until the query with id 2 was issued; after that, the other queries stopped and only resumed a few minutes later. The query with id 2 is FLUSH TABLES WITH READ LOCK, which prevents any write activity once the tables are flushed.

This is the command that caused a full server stall.
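
If you want to see this behavior for yourself, it is easy to reproduce on a test server; the table name below is just a placeholder. Do not try this in production:

-- session 1: take the global read lock and keep this session open
FLUSH TABLES WITH READ LOCK;

-- session 2: any write now hangs in the "Waiting for global read lock" state
INSERT INTO test.t1 VALUES (1);

-- session 1: release the lock; the pending writes continue
UNLOCK TABLES;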

Once we know the reason for the stall we can perform actions to prevent similar issues in the future.
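
The immediate remedy is to find the connection that holds the global read lock and either make it release the lock or terminate it. A sketch using Performance Schema (the owner thread id has to be mapped to a connection id first):

SELECT t.processlist_id, t.processlist_user, t.processlist_info
  FROM performance_schema.metadata_locks AS ml
  JOIN performance_schema.threads AS t
    ON t.thread_id = ml.owner_thread_id
 WHERE ml.object_type = 'GLOBAL'
   AND ml.lock_status = 'GRANTED';

-- then run UNLOCK TABLES in that session, or terminate it:
-- KILL <processlist_id>;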

Conclusion

PMM is a great tool that helps you not only identify whether your database server is stalling but also figure out the reason for the stall. I used only one scenario in this blog, but you can use the same dashboards and graphs to diagnose other causes of a stall, such as a DoS attack, a hardware failure, a high number of I/O operations caused by poorly optimized queries, and many more.
