A few years ago, I wrote about how to add information about processes to your Percona Monitoring and Management (PMM) instance, as well as some helpful ways you can use this data.

Since that time, PMM has released a new major version (PMM v2) and the Process Exporter went through many changes, so it’s time to provide some updated instructions.

Why Bother?

Why do you need per-process data for your database hosts in the first place? I find this data very helpful, as it allows us to validate how much activity and load is caused by the database process rather than something else. This “something else” may range from a backup process that takes too much CPU, to a usually benign system process that went crazy today, to a crypto miner which was “helpfully” installed on your system. It is tempting to simply assume that all the load you observe on the system comes from the database process; that may be correct in most cases, but it can also lead you astray, so you need to be able to verify it.

Installation 

Installing this process-monitoring awesomeness consists of two parts. You install an exporter on every node where you want to monitor process information, and then you install a dashboard on your PMM server to visualize this data. External Exporter support was added in PMM 2.15, so you will need at least this version for these commands to work.

Installing The Exporter

The commands below will download and install the Prometheus Process Exporter and configure PMM to consume the data generated from it.

Note: Different versions of Process Exporter may also work, but this particular version is what I tested with.
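The exact commands depend on the release you pick; below is a minimal sketch of the flow, assuming process-exporter 0.7.5 from the ncabatoff/process-exporter GitHub project, its default port 9256, and a node already registered with a PMM 2.15+ client. The tarball URL and flag names here are assumptions, so verify them against the release page and `pmm-admin add external --help`.

```bash
# Download and unpack Process Exporter (version and URL assumed - adjust as needed).
wget https://github.com/ncabatoff/process-exporter/releases/download/v0.7.5/process-exporter-0.7.5.linux-amd64.tar.gz
tar -xzf process-exporter-0.7.5.linux-amd64.tar.gz
cd process-exporter-0.7.5.linux-amd64

# Group metrics by command name; '.+' matches every process on the host.
cat > all.yaml <<'EOF'
process_names:
  - name: "{{.Comm}}"
    cmdline:
    - '.+'
EOF

# Start the exporter (in production you would wrap this in a systemd unit).
sudo ./process-exporter -config.path all.yaml -web.listen-address=:9256 &

# Tell PMM to scrape this external exporter (requires PMM client 2.15+).
sudo pmm-admin add external --group=processes \
    --service-name=process-exporter --listen-port=9256
```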

Installing the Dashboard

The easiest way to install a dashboard is from the Grafana.com Dashboard Library. In your Percona Monitoring and Management install, click the “+” sign in the toolbar on the left side and select “Import”.

 

[Screenshot: Importing a dashboard in PMM’s Grafana]

 

Enter Dashboard ID 14239 and you are good to go.

If you’re looking for ways to automate this import process as you are provisioning PMM automatically, you can do that too. Just follow the instructions in the Automate PMM Dashboard Importing blog post.
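If you just need something quick and are comfortable with Grafana’s HTTP API, a rough sketch of one possible approach is below. This is not the exact procedure from that post; the PMM URL, credentials, and the DS_METRICS input name are assumptions, so check the `__inputs` section of the downloaded JSON before using it.

```bash
# Fetch dashboard 14239 from Grafana.com and import it into PMM's Grafana.
PMM_URL="https://pmm-server/graph"      # adjust to your PMM server address
curl -s https://grafana.com/api/dashboards/14239/revisions/latest/download \
     -o processes.json

# Build an import request, mapping the dashboard's datasource input to
# PMM's default "Metrics" datasource (the input name is a guess - verify it).
jq -n --slurpfile d processes.json \
  '{dashboard: $d[0], overwrite: true,
    inputs: [{name: "DS_METRICS", type: "datasource",
              pluginId: "prometheus", value: "Metrics"}]}' > import.json

curl -sk -u admin:admin -H "Content-Type: application/json" \
     -d @import.json "$PMM_URL/api/dashboards/import"
```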

Understanding Processes Running on your Linux Machine

Let’s now move to the most fun part: looking at the available dashboards, what they can tell us about the running system, and how they can help with diagnostics and troubleshooting. In the new dashboard, which I updated from the older PMM v1 version, I decided to add relevant whole-system metrics that help put the process metrics in proper context.

 

[Screenshot: Node Processes dashboard]

 

The CPU-focused row shows us how system CPU is used overall and to what extent the system or some CPU cores are overloaded, as well as top consumers of the “User” and “System” CPU Modes.

Note: because of the additional MetricsQL functionality provided by VictoriaMetrics, we can show [other] as the total resource usage of processes that did not make it into the top.

How do you use this data? Check whether the processes using CPU resources are the ones you would expect, or whether any processes are taking more CPU than you expected.

 

[Screenshot: Memory Utilization]

 

Memory Utilization does the same, but for memory. There are a number of different memory metrics which can be a bit intimidating.

Resident Memory means the memory a process (or, technically, a group of processes) takes in physical RAM. “Proportional” refers to how this consumption is counted: a single page in RAM is sometimes shared by multiple processes, and Proportional means its size is divided among all processes sharing it rather than counted in full for each of them. This avoids double counting, so you should not see the total Resident memory of your processes well in excess of the physical memory you have.


Used Memory means the space a process consumes in RAM plus the space it consumes in swap. Note that this metric is different from Virtual Memory, which also includes virtual address space that was assigned to the process but never actually allocated.

I find these two metrics the most practical for understanding how physical and virtual memory is actually used by the processes on the system.
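If you want to cross-check these numbers outside of the dashboard, the “Proportional” accounting corresponds to the kernel’s PSS (proportional set size) figure, which you can read straight from /proc. A quick sketch, assuming a mysqld process and a 4.14+ kernel that provides smaps_rollup:

```bash
# Resident (Rss), proportional (Pss), and swapped (Swap) memory for one process, in kB.
pid=$(pgrep -o mysqld)                     # pick the oldest mysqld process
grep -E '^(Rss|Pss|Swap):' /proc/$pid/smaps_rollup
```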

 

[Screenshot: Resident and Used Memory]

 

Virtual Memory is the virtual address space that was allocated to the process. In some cases, it will be close to the memory used, as with the mysqld process; in other cases, it may be very different. For example, the dockerd process running on this system takes 5GB of virtual memory and less than 70MB of memory actually used.

Swapped Memory shows us which processes are swapped out and by how much. I would pay special attention to this graph, because if the Swap Activity panel shows serious IO going on, system performance might be significantly impacted. If unused processes, or even some unused portions of processes, are swapped out, it is not a problem. However, if you have half of MySQL’s buffer pool swapped out and heavy swap IO going on… you have work to do.
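The same per-process swap usage can be read from the VmSwap field in /proc; a quick sketch (run as root so all processes are readable):

```bash
# List the ten most swapped-out processes: VmSwap in kB, name, PID.
for p in /proc/[0-9]*; do
  awk -v pid="${p#/proc/}" '/^Name:/ {n=$2} /^VmSwap:/ {print $2, n, pid}' \
      "$p/status" 2>/dev/null
done | sort -rn | head
```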

 

[Screenshot: Process Disk IO Usage]

 

Process Disk IO Usage lets us see IO bandwidth and latency for the system overall, as well as read and write bandwidth used by different processes. If you have any unexpected disk IO bandwidth consumers, you will easily spot them using this dashboard.

 

[Screenshot: Context Switches]

 

Context Switches provide more details on what kind of context switches are happening in the system and what processes they correspond to.

A high number of Voluntary Context Switches (hundreds of thousands and millions per second) may indicate heavy contention, or it may just correspond to a high number of requests being served by the process, as in many architectures starting/stopping request handling requires a context switch.

A high number of Non-Voluntary Context Switches, on the other hand, can correspond to not having enough CPU available with processes moved off CPU by the scheduler when they have exceeded their allotted time slice, or for other reasons.
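To check the raw counters for a single process directly, look at its /proc status file; a quick sketch, again assuming mysqld is the process of interest:

```bash
# Cumulative voluntary and non-voluntary context switches for one process.
pid=$(pgrep -o mysqld)
grep ctxt_switches /proc/$pid/status
```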

 

[Screenshot: Global and Per-Process File Descriptors]

 

File Descriptors shows us the global limit on file descriptors in the operating system, as well as the limits for individual processes. Running out of file descriptors for the whole system is really bad, as many things start failing at random, although on modern, powerful systems the limit is so high that you rarely hit this problem.

The per-process limit on open files still applies, so it is very helpful to see which processes use a lot of file descriptors and how this number compares to the total number of descriptors allowed for the process. In our case, we can see that no process ever allocated more than 7% of the file descriptors it is allowed, which is quite healthy.
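You can verify both the system-wide and per-process numbers from /proc as well; a small sketch, using mysqld as the example process:

```bash
# System-wide: allocated descriptors, unused, and the global maximum.
cat /proc/sys/fs/file-nr

# Per-process: the limit and how many descriptors are currently open.
pid=$(pgrep -o mysqld)
grep 'Max open files' /proc/$pid/limits
sudo ls /proc/$pid/fd | wc -l
```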

 

[Screenshot: Major and Minor Page Faults]

 

This graph shows Major and Minor page faults for given processes.

Major page faults are relatively expensive, typically causing disk IO when they happen.

Minor page faults are less expensive; they correspond to accessing pages that are not mapped into the given process’s address space but are otherwise in memory. They still require a switch to kernel mode and some housekeeping by the kernel.

See more details on Minor/Major page faults and general Linux Memory Management here.
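For a quick out-of-dashboard check, ps can report the cumulative fault counters for a process (min_flt/maj_flt are the Linux procps output specifiers); a small sketch:

```bash
# Cumulative minor and major page faults for one process.
pid=$(pgrep -o mysqld)
ps -o pid,min_flt,maj_flt,comm -p $pid
```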

 

[Screenshot: Processes by Status]

 

Processes in Linux can cycle through different statuses; for us, the most important ones to consider are the “Active” statuses, which are either “Running” or “Waiting on Disk IO”. These can roughly be seen as using CPU and disk IO resources.

In this section, we can see an overview of the number of running and waiting processes in the system (basically the same data the “r” and “b” columns in vmstat show), as well as more detailed stats showing which processes in particular were running… or waiting on disk IO.
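As a quick sanity check against the dashboard, the same running/blocked counts are visible in vmstat’s first two columns; for example:

```bash
# 'r' = runnable processes, 'b' = processes blocked on IO; sample every second, 5 times.
vmstat 1 5
```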

 

[Screenshot: Process Kernel Waits]

 

While we can see what is going on with active processes by looking at their statuses, this section shows us what is going on with sleeping processes; in particular, which kernel functions they are sleeping in. We can see data grouped by the name of the function in which the wait happens, or by the function and process name pair.
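Outside of the dashboard, the kernel function a sleeping task is currently waiting in is exposed as the wchan field; a quick sketch using ps:

```bash
# Show which kernel function the mysqld process is sleeping in (add -L for per-thread detail).
ps -o pid,state,wchan:32,comm -C mysqld
```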

If you want to focus on what types of kernel functions a given process is waiting on, you can select it in the dashboard dropdown to filter data just by this process. For example, selecting “mysqld”, I see:

 

[Screenshot: Kernel Wait Details for mysqld]

 

Finally, we have the panel which shows the processes based on their uptime.

 

[Screenshot: Processes Uptime]

 

This can be helpful to spot whether any processes were started recently. Frankly, I do not find this panel to be the most useful, but since Process Exporter captures this data, why not?

Summary

Process Exporter provides great insights into running processes, in addition to what a basic PMM installation provides. Please check it out and let us know how helpful it is in your environment. Should we consider enabling it by default in Percona Monitoring and Management?

Comments
Cihan

Hello,

Isn’t it possible to collect these metrics by enabling the --no-collector.systemd key? Also, the pmm-agent has a --collector.processes key when running; is that the same as process_exporter? I used node_exporter’s process and systemd keys to get process information. Thanks!

Francisco Miguel Biete Banon

+1 for adding this as a default service in PMM. node_exporter is nice overall, but process_exporter would give us a very much needed zoom into per-process metrics.