One of the requests we get most often on the Percona Monitoring and Management (PMM) team is “Do you support alerting?” The answer to that question has always been “Yes” but the feedback on how we offered it natively was that it was, well, not robust enough! We’ve been hard at work to change that and are excited to offer, starting with the newly released PMM version 2.3.0, a more dynamic alerting mechanism for your PMM installations: Integration with Prometheus Alertmanager.
Prometheus Alertmanager
If you don’t know what Alertmanager is, you can read all about it on the Prometheus website, but the short version is that Alertmanager is a receiver, consolidator, and router of alerting messages that offers LOTS of flexibility when it comes to configuration. In my old days as a SysAdmin, the tools I used weren’t smart enough to deduplicate alerts, so I’d have my boss yelling, my coworkers emailing, and my phone (OK…BlackBerry) draining its battery vibrating to the same alert over and over until I could put the alert in maintenance mode and the queue of alerts drained. Alertmanager is smart enough to deduplicate alerts so you don’t get 50 pages telling you the disk is 90% full before you can grow the volume or purge files. It’s also extremely easy to group alerts so that you don’t get separate alerts for ‘Application Down’, ‘MySQL Down’, ‘CPU|RAM|Disk: Unavail’, etc. because someone rebooted the DB server without putting monitoring in maintenance mode. Alertmanager also offers many native integrations so you can route alerts to email, SMS, PagerDuty, Slack, and more!
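To make the grouping, deduplication, and routing ideas above concrete, here is a minimal sketch of an Alertmanager configuration file. It is an illustration under stated assumptions, not a configuration shipped with PMM: the receiver names, email address, SMTP host, PagerDuty key placeholder, the NodeDown alert name, and the node_name label are all hypothetical and need to be replaced with values from your own environment.

```yaml
# Hypothetical alertmanager.yml sketch; all names and addresses are placeholders.
global:
  smtp_smarthost: 'smtp.example.com:587'   # assumed SMTP relay
  smtp_from: 'alertmanager@example.com'    # assumed sender address

route:
  receiver: team-email                     # default receiver for everything
  group_by: ['alertname', 'service_name']  # alerts sharing these labels become one notification
  group_wait: 30s                          # wait briefly so related alerts arrive together
  group_interval: 5m                       # minimum gap between updates for the same group
  repeat_interval: 4h                      # re-notify for a still-firing group only this often
  routes:
    - match:
        severity: error                    # matches the severity label set in the alert rule
      receiver: team-pager

receivers:
  - name: team-email
    email_configs:
      - to: 'dba-team@example.com'
  - name: team-pager
    pagerduty_configs:
      - service_key: '<your PagerDuty integration key>'

inhibit_rules:
  # While a (hypothetical) NodeDown alert fires for a host, mute the
  # database-level alert for the same host to avoid a cascade of pages.
  - source_match:
      alertname: NodeDown
    target_match:
      alertname: PostgresqlDown
    equal: ['node_name']
```

The group_by and repeat_interval settings are what keep a single full disk from producing 50 pages, and the inhibit rule is one way to handle the rebooted-server case. Newer Alertmanager releases prefer a matchers list over match, so check the documentation for the version you run.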
Now, this is our first iteration of Alertmanager support, so at this point you will need your own working Alertmanager installation that your PMM server can communicate with. The only other thing you’ll need is the set of rules you want to trigger alerts from. That’s basically it! You most likely already know how to create YAML-style rules, but for the curious, one looks something like this:
groups:
  - name: PostgresqlStatus
    rules:
      - alert: PostgresqlDown
        expr: pg_up == 0
        for: 5m
        labels:
          severity: error
        annotations:
          summary: "PostgreSQL down (instance {{ $labels.service_name }})"
          description: "PostgreSQL instance is down\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
The above will trigger an alert to let you know which PMM instances of PostgreSQL are down for more than 5 minutes. Since this first pass targets the experienced users, I’ll leave it to you to craft your own rules but we’re really excited to be adding this sorely needed functionality!
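As a second illustration, here is a sketch of a threshold-style rule for the "disk is 90% full" case mentioned earlier. It assumes your PMM server exposes the standard node_exporter metrics node_filesystem_avail_bytes and node_filesystem_size_bytes; the group name, alert name, threshold, and duration are hypothetical, so verify the expression against your own metrics before relying on it.

```yaml
# Hypothetical rule sketch; metric availability and thresholds are assumptions.
groups:
  - name: DiskUsage
    rules:
      - alert: DiskAlmostFull
        # Percentage of the filesystem in use, computed from available vs. total bytes.
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 90
        for: 10m                # must stay above the threshold this long before firing
        labels:
          severity: warning
        annotations:
          summary: "Disk almost full (instance {{ $labels.instance }})"
          description: "Filesystem {{ $labels.mountpoint }} is {{ $value }}% full"
```

The for clause is what keeps a short spike from paging anyone: the condition has to hold continuously for the whole period before the alert is sent to Alertmanager.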
For more information, you can read our Alertmanager integration documentation and FAQs. Update your instance today and let us know what you think. We would love to hear your feedback!
I’m trying to add the following in the Alertmanager rule section:
- alert: MongodbReplicationLag
expr: (avg by (cluster,environment,set)(mongodb_mongod_replset_member_optime_date{state="PRIMARY"}) - min by (cluster,environment,set) (mongodb_mongod_replset_member_optime_date{state="SECONDARY"})) > 120
for: 5m
labels:
severity: warning
source: pmmprod
Once I click the “Apply Alertmanager settings” button, an error pops up: Invalid Alert Manager rules.
What am I doing wrong here?
The text in the comment above did not preserve the YAML indentation, which was correct originally.
Could you check your expr part directly in the Prometheus UI?
https://…/prometheus/
Hello.
If I don’t want to use the new Prometheus Alertmanager, is it still possible to use the Grafana Alerting feature? I can no longer find the Alert tab on the dashboard graph panel in PMM 2.3.0.
Thanks and Regards, Elisa
elisetta1984,
Grafana Alerting still works in PMM 2.3.0; I was just able to set one up on my test instance and it’s alerting me as expected. I’ll assume you have configured a notification channel. Once that’s done, you can create a new dashboard and add a panel to it. The option for alerts isn’t a tab anymore but a “bell” icon along the left side of the “edit panel” dialog (there will be an icon each for Queries, Visualization, General, and finally Alert). If you’re looking to add alerts to the existing graphs, that’s possible but a little more involved, as many of our dashboards are templated. Here’s a blog post from PMM1, but the bulk of what’s outlined still holds true in PMM2: https://www.percona.com/blog/2017/02/02/pmm-alerting-with-grafana-working-with-templated-dashboards/
We are also looking at incorporating the latest Grafana 6.6 as there’s been some significant work done with alerting but that’s going to be a few releases out for us.
Hope this helps!
Steve
Very nice article.
Couple of questions —
1. Does this resolve the template variables issue that we have been facing up to Grafana 4.x?
2. If we have 250 mongo servers, how can we configure and send alerts only for those servers where there is an issue?
Hello Steve.
Thanks for your explanations. Actually I can add an alert for a new Dashboard.
But is there also a way to add an alert directly to an existing dashboard without creating a new one?
Thanks and Regards,
Elisa
Sai, sorry for the delay…never got an alert that there was a new post until today, so let me answer you first! This doesn’t technically resolve template variables…that’s an issue in Grafana, but I think I heard that 7.0 lays the foundation to resolve that! What the Alertmanager option does is let you do what you’re after: dynamically alert on systems that meet certain criteria. There are a ton of great Alertmanager recipes online that will allow you to set thresholds and alerting values on any parameters in Prometheus you like. CPU over a certain threshold for a certain period of time on a production-only system with greater than 12GB of installed RAM…you can alert on that! Want to restrict that alert down to only systems running mongo? No problem. We’re also working on integrating Alertmanager into PMM so you don’t have to have your own setup, but it’s not quite ready for production.
Elisa, unfortunately not. We use template variables in our dashboards to make them quite dynamic, and the built-in alerting capabilities of Grafana do not work with them. So when you create the copy of the dashboard, you’re actually stripping out the offending variables and alerting on specific instances.