In this blog post, I’ll look at some helpful tips on troubleshooting Percona Monitoring and Management metrics.
With any luck, Percona Monitoring and Management (PMM) works for you out of the box. Sometimes, however, things go awry and you see empty or broken graphs instead of dashboards full of insights.
Before we go through troubleshooting steps, let’s talk about how data makes it to the Grafana dashboards in the first place. The PMM Architecture documentation page helps explain it:
If we focus just on the “Metrics” path, we see the following requirements:
- The appropriate “exporters” (Part of PMM Client) are running on the hosts you’re monitoring
- The database is configured to expose all the metrics you’re looking for
- The hosts are correctly configured in the repository on PMM Server side (stored in Consul)
- Prometheus on the PMM Server side can scrape them successfully – meaning it can reach them successfully, does not encounter any timeouts and has enough resources to ingest all the provided data
- The exporters can retrieve the metrics requested of them (i.e., there are no permission problems)
- Grafana can retrieve the metrics stored in Prometheus Server and display them
Now that we understand the basic requirements, let's look at troubleshooting the tool.
PMM Client
First, you need to check if the services are actually configured properly and running:
```
root@rocky:/mnt/data# pmm-admin list
pmm-admin 1.5.2

PMM Server      | 10.11.13.140
Client Name     | rocky
Client Address  | 10.11.13.141
Service Manager | linux-systemd

-------------- ------ ----------- -------- ------------------------------------------- ------------------------------------------
SERVICE TYPE   NAME   LOCAL PORT  RUNNING  DATA SOURCE                                  OPTIONS
-------------- ------ ----------- -------- ------------------------------------------- ------------------------------------------
mysql:queries  rocky  -           YES      root:***@unix(/var/run/mysqld/mysqld.sock)   query_source=slowlog, query_examples=true
linux:metrics  rocky  42000       YES      -
mysql:metrics  rocky  42002       YES      root:***@unix(/var/run/mysqld/mysqld.sock)
```
Second, you can also instruct the PMM client to perform basic network checks. These can spot connectivity problems, time drift and other issues:
```
root@rocky:/mnt/data# pmm-admin check-network
PMM Network Status

Server Address | 10.11.13.140
Client Address | 10.11.13.141

* System Time
NTP Server (0.pool.ntp.org)         | 2018-01-06 09:10:33 -0500 EST
PMM Server                          | 2018-01-06 14:10:33 +0000 GMT
PMM Client                          | 2018-01-06 09:10:33 -0500 EST
PMM Server Time Drift               | OK
PMM Client Time Drift               | OK
PMM Client to PMM Server Time Drift | OK

* Connection: Client --> Server
-------------------- -------
SERVER SERVICE       STATUS
-------------------- -------
Consul API           OK
Prometheus API       OK
Query Analytics API  OK

Connection duration | 355.085µs
Request duration    | 938.121µs
Full round trip     | 1.293206ms

* Connection: Client <-- Server
-------------- ------ ------------------- ------- ---------- ---------
SERVICE TYPE   NAME   REMOTE ENDPOINT     STATUS  HTTPS/TLS  PASSWORD
-------------- ------ ------------------- ------- ---------- ---------
linux:metrics  rocky  10.11.13.141:42000  OK      YES        -
mysql:metrics  rocky  10.11.13.141:42002  OK      YES        -
```
If everything is working, we can next check whether the exporters are providing the expected data by querying them directly.
Checking Prometheus Exporters
Looking at the output from pmm-admin check-network, we can see the “REMOTE ENDPOINT”. This shows the exporter address, which you can use to access it directly in your browser:
You can see MySQL Exporter has different sets of metrics for high, medium and low resolution, and you can click on them to see the provided metrics:
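Besides clicking through in the browser, you can fetch an exporter endpoint and count how many time series each metric name produces. The fetch command below is a hypothetical example (substitute the endpoint, credentials, and resolution path from your own `pmm-admin check-network` output); the counting step is shown against a small inline sample in the Prometheus exposition format, so it runs anywhere:

```shell
# Hypothetical fetch (adjust host, port, and credentials to your setup):
#   curl -sk https://10.11.13.141:42002/metrics-mr > metrics.txt
# For illustration, use an inline sample in the same format instead:
cat > metrics.txt <<'EOF'
mysql_global_status_threads_running 2
mysql_global_status_threads_connected 5
mysql_info_schema_table_rows{schema="test",table="t1"} 100
mysql_info_schema_table_rows{schema="test",table="t2"} 250
EOF

# Count time series per metric name -- a quick way to spot
# collectors that expose an unexpectedly large number of series
awk '!/^#/ {sub(/\{.*/, "", $1); count[$1]++} END {for (m in count) print count[m], m}' metrics.txt | sort -rn
```

Metrics with a very high series count (per-table stats on a server with many tables, for example) are the usual suspects when scrapes get slow or heavy.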
There are a few possible problems you may encounter at this stage:
- You do not see the metrics you expect to see. This could be a configuration issue on the database side (docs for MySQL and MongoDB), a permission error, or the exporter not being correctly configured to expose the needed metrics.
- The page takes too long to load. This could mean the data capture is too expensive for your configuration. For example, if you have a million tables, you probably can't afford to capture per-table data.
mysql_exporter_collector_duration_seconds is a great metric that allows you to see which collectors are enabled for different resolutions, and how much time it takes for a given collector to execute. This way you can find and potentially disable collectors that are too expensive for your environment.
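To rank collectors by cost, you can sort the `mysql_exporter_collector_duration_seconds` samples from an exporter endpoint by value. The snippet below works on an inline sample (the collector names match real mysqld_exporter collectors, but the durations are made up for illustration); point it at a saved copy of your own endpoint output instead:

```shell
# Sample mysql_exporter_collector_duration_seconds output; in practice,
# save the real endpoint output to durations.txt and skip this heredoc.
cat > durations.txt <<'EOF'
mysql_exporter_collector_duration_seconds{collector="global_status"} 0.004
mysql_exporter_collector_duration_seconds{collector="info_schema.tables"} 2.310
mysql_exporter_collector_duration_seconds{collector="perf_schema.tableiowaits"} 0.120
EOF

# Print "duration collector", slowest first, to find candidates for disabling
sed 's/.*collector="\([^"]*\)"} /\1 /' durations.txt | awk '{print $2, $1}' | sort -rn
```

In this made-up sample, `info_schema.tables` dominates, which is the typical pattern on servers with very many tables.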
Let’s look at some more advanced ways to troubleshoot exporters.
Looking at ProcessList
```
root@rocky:/mnt/data# ps aux | grep mysqld_exporter
root      1697  0.0  0.0   4508   848 ?   Ss   2017   0:00 /bin/sh -c /usr/local/percona/pmm-client/mysqld_exporter -collect.auto_increment.columns=true -collect.binlog_size=true -collect.global_status=true -collect.global_variables=true -collect.info_schema.innodb_metrics=true -collect.info_schema.processlist=true -collect.info_schema.query_response_time=true -collect.info_schema.tables=true -collect.info_schema.tablestats=true -collect.info_schema.userstats=true -collect.perf_schema.eventswaits=true -collect.perf_schema.file_events=true -collect.perf_schema.indexiowaits=true -collect.perf_schema.tableiowaits=true -collect.perf_schema.tablelocks=true -collect.slave_status=true -web.listen-address=10.11.13.141:42002 -web.auth-file=/usr/local/percona/pmm-client/pmm.yml -web.ssl-cert-file=/usr/local/percona/pmm-client/server.crt -web.ssl-key-file=/usr/local/percona/pmm-client/server.key >> /var/log/pmm-mysql-metrics-42002.log 2>&1
```
This shows us that the exporter is running, as well as specific command line options that were used to start it (which collectors were enabled, for example).
Checking out Log File
```
root@rocky:/mnt/data# tail /var/log/pmm-mysql-metrics-42002.log
time="2018-01-05T18:19:10-05:00" level=error msg="Error pinging mysqld: dial unix /var/run/mysqld/mysqld.sock: connect: no such file or directory" source="mysqld_exporter.go:442"
time="2018-01-05T18:19:11-05:00" level=error msg="Error pinging mysqld: dial unix /var/run/mysqld/mysqld.sock: connect: no such file or directory" source="mysqld_exporter.go:442"
time="2018-01-05T18:19:12-05:00" level=error msg="Error pinging mysqld: dial unix /var/run/mysqld/mysqld.sock: connect: no such file or directory" source="mysqld_exporter.go:492"
time="2018-01-05T18:19:12-05:00" level=error msg="Error pinging mysqld: dial unix /var/run/mysqld/mysqld.sock: connect: no such file or directory" source="mysqld_exporter.go:442"
time="2018-01-05T18:19:12-05:00" level=error msg="Error pinging mysqld: dial unix /var/run/mysqld/mysqld.sock: connect: no such file or directory" source="mysqld_exporter.go:616"
time="2018-01-05T18:19:13-05:00" level=error msg="Error pinging mysqld: dial unix /var/run/mysqld/mysqld.sock: connect: no such file or directory" source="mysqld_exporter.go:442"
time="2018-01-05T18:19:14-05:00" level=error msg="Error pinging mysqld: dial unix /var/run/mysqld/mysqld.sock: connect: no such file or directory" source="mysqld_exporter.go:442"
time="2018-01-05T18:19:15-05:00" level=error msg="Error pinging mysqld: dial unix /var/run/mysqld/mysqld.sock: connect: no such file or directory" source="mysqld_exporter.go:442"
time="2018-01-05T18:19:16-05:00" level=error msg="Error pinging mysqld: dial unix /var/run/mysqld/mysqld.sock: connect: no such file or directory" source="mysqld_exporter.go:442"
2018/01/06 09:10:33 http: TLS handshake error from 10.11.13.141:56154: tls: first record does not look like a TLS handshake
```
If you have problems such as authentication or permission errors, you will see them in the log file. In the example above, we can see the exporter reporting many connection errors (the MySQL Server was down).
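On a busy system, the log can contain thousands of lines, so it helps to summarize the distinct error messages rather than read it top to bottom. The snippet below demonstrates this on an inline sample in the same format as the log above; in practice you would point it at `/var/log/pmm-mysql-metrics-42002.log` directly:

```shell
# Create a small sample log in the exporter's format (for illustration only;
# use your real log path, e.g. /var/log/pmm-mysql-metrics-42002.log)
cat > sample.log <<'EOF'
time="2018-01-05T18:19:10-05:00" level=error msg="Error pinging mysqld: dial unix /var/run/mysqld/mysqld.sock: connect: no such file or directory" source="mysqld_exporter.go:442"
time="2018-01-05T18:19:11-05:00" level=error msg="Error pinging mysqld: dial unix /var/run/mysqld/mysqld.sock: connect: no such file or directory" source="mysqld_exporter.go:442"
EOF

# Count occurrences of each distinct error message, most frequent first
grep -o 'msg="[^"]*"' sample.log | sort | uniq -c | sort -rn
```

A single dominant message (like the "Error pinging mysqld" above) usually points straight at the root cause, here a MySQL server that is down or a wrong socket path.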
Prometheus Server
Next, we can take a look at the Prometheus Server. It is exposed in PMM Server at the /prometheus path. We can go to Status->Targets to see which targets are configured and whether they are being scraped correctly:
In this example, some hosts are scraped successfully while others are not. As you can see, I have some hosts that are down, so scraping fails with "no route to host". You might also see failures caused by firewall configurations and other issues.
The next area to check, especially if you have gaps in your graphs, is whether your Prometheus server has enough resources to ingest all the data reported in your environment. Percona Monitoring and Management ships with a Prometheus dashboard to help answer this question (see demo).
There is a lot of information in this dashboard, but one of the most important areas you should check is if there is enough CPU available for Prometheus:
The most typical Prometheus problem is falling into "Rushed Mode" and dropping some of the metrics data:
Not using enough memory to buffer metrics is another issue, which is shown as “Configured Target Storage Heap Size” on the graph:
Values of around 40% of total memory size often make sense. The PMM FAQ details how to tune this setting.
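As a worked example of that 40% rule of thumb: the PMM 1.x FAQ describes a `METRICS_MEMORY` option (a value in kilobytes passed to the PMM Server container) for tuning the Prometheus heap. The variable name and the 16 GB host size below are assumptions taken from that FAQ and this example; verify both against the documentation for your PMM version before applying:

```shell
# Assumed example: a PMM Server host with 16 GB of RAM
TOTAL_KB=$((16 * 1024 * 1024))

# Rule of thumb from the post: give Prometheus roughly 40% of total memory
METRICS_MEMORY=$((TOTAL_KB * 40 / 100))
echo "$METRICS_MEMORY"

# Then pass it when (re)creating the PMM Server container, per the FAQ, e.g.:
#   docker run -d ... -e METRICS_MEMORY=$METRICS_MEMORY percona/pmm-server:1.5.2
```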
If the amount of memory is already configured correctly, you can explore upgrading to a more powerful instance size or reducing the number of metrics Prometheus ingests. This can be done either by adjusting the metrics resolution (as explained in the FAQ) or by disabling some of the collectors (see the manual).
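Disabling collectors is done by removing and re-adding the monitoring service with the appropriate flags. A hedged sketch using pmm-admin 1.x syntax follows; the flag shown is one I believe exists in that version, but run `pmm-admin add mysql --help` on your client to confirm the exact set before using it:

```shell
# Remove the existing MySQL metrics service for this instance
# (instance name "rocky" matches the earlier pmm-admin list output)
pmm-admin rm mysql:metrics rocky

# Re-add it with the expensive per-table collector disabled.
# --disable-tablestats is from pmm-admin 1.x; other --disable-* flags
# may be available depending on your client version.
pmm-admin add mysql:metrics rocky --disable-tablestats
```

This trades per-table visibility for a much lighter Prometheus load, which is usually the right call on servers with tens of thousands of tables or more.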
You might wonder which collectors generate the most data. This information is available on the same Prometheus dashboard:
While these aren’t exact values, they correlate very well with the load the collectors generate. In this case, for example, we can see that the Performance Schema is responsible for a large amount of time series data. As such, disabling its collectors can reduce the Prometheus load substantially.
Hopefully, these troubleshooting steps were helpful to you in diagnosing PMM’s metrics capture. In a later blog post, I will write about how to diagnose problems with Query Analytics (Demo).