Integrated Alerting Design in Percona Monitoring and Management

Percona Monitoring and Management (PMM) 2.13 introduced the Integrated Alerting feature as a technical preview. It adds a user-friendly way to set up and manage alerts for your databases. You can read more about using this feature in our announcement blog post; in this article we will focus on design and implementation details.

Entities

There are four basic entities used for IA: Alert Rule Template, Alert Rule, Alert, and Notification Channel.

Everything starts from the alert rule template. You can see its YAML representation below:
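Here is a sketch of the format, based on the fields described below; the pmm_node_high_cpu_load name, the query, and the parameter values are illustrative rather than an exact built-in template:

```yaml
templates:
  - name: pmm_node_high_cpu_load        # unique template name
    version: 1                          # template format version
    summary: Node high CPU load
    expr: |-                            # MetricsQL query with a parameter placeholder
      (1 - avg by(node_name) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100
      > [[ .threshold ]]
    params:
      - name: threshold
        summary: CPU load percentage threshold
        unit: '%'
        type: float
        range: [0, 100]
        value: 80                       # default value, can be overridden by a rule
    for: 5m                             # how long the query must stay positive
    severity: warning                   # default severity level
    labels:
      environment: example              # additional labels (optional)
    annotations:
      summary: Node high CPU load ({{ $labels.node_name }})
      description: '{{ $labels.node_name }} CPU load is above [[ .threshold ]]%.'
```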

A template serves as the base for alert rules. It defines several fields; let's look at them:

  • name: uniquely identifies the template (required)
  • version: defines the template format version (required)
  • summary: a template description (required)
  • expr: a MetricsQL query string with parameter placeholders. MetricsQL is backward compatible with PromQL and provides some additional features. (required)
  • params: contains the parameter definitions required for the query. Each parameter has a name, type, and summary. It may also have a unit, an available range, and a default value.
  • for: specifies the duration for which the expression must be met; the alert query should return a positive result for this whole period, at which point the alert is fired (required)
  • severity: specifies the default alert severity level (required)
  • labels: additional labels to be added to generated alerts (optional)
  • annotations: additional annotations to be added to generated alerts (optional)

A template is designed to be re-used as the basis for multiple alert rules, so from a single pmm_node_high_cpu_load template you can have alerts for production vs. non-production, warning vs. critical, and so on.

Users can create alert rules from templates. An alert rule is what's actually executed against metrics and what produces an alert. The rule can override default values specified in the template, add filters to apply the rule only to the required services/nodes/etc., and specify target notification channels, such as email, Slack, PagerDuty, or webhooks. If a rule doesn't have any associated notification channels, its alerts will be available only via the PMM UI. It's useful to note that after creation, a rule keeps its relation to the template, and any change in the template will affect all related rules.

Here is an alert rule example:
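A representative sketch (the rule name, the resolved threshold, and the rule_id label value are illustrative):

```yaml
groups:
  - name: PMM Integrated Alerting
    rules:
      - alert: pmm_node_high_cpu_load
        # The template placeholder has been resolved to a concrete threshold.
        expr: (1 - avg by(node_name) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) * 100 > 80
        for: 5m                          # inherited from the template
        labels:
          severity: critical             # overrides the template default
          rule_id: example-rule-id       # hypothetical value; used later for Alertmanager routing
        annotations:
          summary: Node high CPU load ({{ $labels.node_name }})
```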

It uses the Prometheus alert rule format.

How it Works

The Integrated Alerting feature is built on top of Prometheus Alertmanager, the VictoriaMetrics time series database (TSDB), and VMAlert.

VictoriaMetrics TSDB is the main metrics storage in PMM, VMAlert is responsible for alert rule execution, and Prometheus Alertmanager is responsible for alert delivery. VMAlert runs queries against the VM TSDB, checks whether they are positive for the specified amount of time (for example: MySQL is down for 5 minutes), and triggers alerts. All alerts are forwarded to the PMM internal Alertmanager, but they can also be duplicated to an external Alertmanager (which can be set up on the PMM Settings page).

There are four available template sources:

  1. Built-in templates, shipped with the PMM distribution. They are embedded into the managed binary (the core component of PMM).
  2. Percona servers. This source is not available yet, but it will be similar to the STT checks delivery mechanism (HTTPS + file signatures).
  3. Templates created by the user via the PMM UI. We persist them in PMM's database.
  4. Templates created by the user as files in the /srv/ia/templates directory.

During PMM startup, managed loads templates from all sources into memory.

Alert rules can be created via the PMM UI or simply by putting rule files in the /srv/prometheus/rules directory. Alert rules created via the UI are persisted in PMM's internal PostgreSQL database. For each alert rule from the DB, the managed binary creates a YAML file in /etc/ia/rules/ and asks VMAlert to reload the configuration and reread the rule files. VMAlert executes the query from each loaded alert rule every minute; once the rule condition is met (the query is positive for the specified amount of time), VMAlert produces an alert and passes it to Alertmanager. Please note that /etc/ia/rules/ is controlled by managed, and any manual changes in that directory will be lost.
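Conceptually, the wiring looks something like this. The flags below are real VMAlert options, but the exact values PMM passes (addresses and ports in particular) are assumptions:

```sh
# A sketch of how managed might launch VMAlert against the other components.
vmalert \
  -rule="/etc/ia/rules/*.yml" \
  -datasource.url="http://127.0.0.1:9090" \
  -notifier.url="http://127.0.0.1:9093" \
  -evaluationInterval=1m
```

Here -rule points at the files managed generates, -datasource.url at the VictoriaMetrics TSDB, -notifier.url at the internal Alertmanager, and -evaluationInterval matches the once-a-minute execution described above.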

Managed generates the configuration for Alertmanager and updates it whenever any related entity changes.

Managed goes through the list of existing rules and collects the unique notification channel combinations. For example, if we have two rules and each of them has channels a, b, and c assigned, that is one unique channel combination. For each rule, managed generates a route, and for each unique channel combination, it generates a receiver in the Alertmanager configuration file. Each route has a target receiver and a filter by rule ID, and it can also contain user-defined filters. If a rule has no assigned notification channels, a special empty receiver is used. Users can redefine the empty receiver via Alertmanager's base configuration file, /srv/alertmanager/alertmanager.base.yml.

When a notification channel is disabled, managed recollects the unique channel combinations excluding the disabled channels and regenerates the receivers and routing rules. If a rule had only one channel specified and that channel was disabled, a special disabled receiver is used instead. Unlike the empty receiver, the disabled receiver can't be redefined by the user and always means "do nothing"; this prevents unexpected behavior after channels are disabled. After each Alertmanager configuration update, managed asks Alertmanager to reload it.
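A sketch of what a generated configuration fragment could look like; the receiver names, the rule_id value, and the channel settings are illustrative:

```yaml
route:
  receiver: empty                     # fallback for rules without channels
  routes:
    - match:
        rule_id: example-rule-id      # one route per rule, filtered by rule ID
      receiver: slack-ops+webhook     # receiver for this rule's channel combination
receivers:
  - name: empty                       # redefinable via alertmanager.base.yml
  - name: disabled                    # not redefinable; always "do nothing"
  - name: slack-ops+webhook
    slack_configs:
      - channel: '#ops'
        api_url: https://hooks.slack.com/services/EXAMPLE   # hypothetical
    webhook_configs:
      - url: http://127.0.0.1:8080/alert-hook               # hypothetical
```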

When Alertmanager receives an alert from VMAlert, it uses the routes to find the appropriate receiver and forwards the alert to the destination channels. Users can also observe alerts via the PMM UI; in that case, managed gets all available alerts from the Alertmanager API and applies the required filters before showing them.
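For reference, the same data managed reads can be fetched by hand from Alertmanager's v2 API (the address is an assumption for the default PMM server container):

```sh
# List the currently known alerts as JSON from inside the PMM server container.
curl -s http://127.0.0.1:9093/api/v2/alerts
```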

Conclusion

The Integrated Alerting feature has many moving parts, and functionally it's mostly about managing configuration for different components and making them work together. It provides a really nice way to stay aware of important events in your system. While it's still in a tech preview state, it's already helpful. With the built-in templates, it's easy to try without diving into documentation about Prometheus queries and other details. So please try it and tell us about your experience. What parameters of your system would you like to see covered by templates? What use cases do you have for alerting? We will be happy to hear any feedback.

1 Comment
Laurent Indermühle

In the current state, PMM is useless for alerting.

The integrated alerts don't work. And the only two tools I know of for looking at logs are less than ideal: either you download a 120 MB zip, or you run podman exec -it pmm-server bash and then navigate to /srv/logs.

What’s missing is:

– A button in "Notification Channels" to test the channel
– A button in "Alerts" to test the message sent
– A way to rename service_id, agent_id, and node_id in the messages sent. The title should contain a service_name or node_name, not a long ID that is unreadable by a human.
– A history of the alerts (where do the silenced ones go?)
– A log of the subsystems accessible from the UI

But what makes me say that integrated alerts don't work is that one day I received some alerts, then nothing for days, even though when I log into the PMM UI I see plenty of alerts fired.

Today I randomly stopped some MariaDB nodes. I never got any alerts. One alert was registered in the UI, and all the others didn't appear in the UI at all.

An alert manager is a critical system that should never, ever fail silently. If the documentation is incomplete, the process of automatically copying rules from Prometheus to VMAlert is obscure, and alerts are not always triggered, nobody will use this system.

And it would be a shame, because I love PMM2 so much. You did a very good job on the monitoring part. I really wish the Alerting part would catch up! 🙂