Ensuring your monitoring system stays up and running is vital. High availability (HA) minimizes downtime for Percona Monitoring and Management (PMM) during hardware failures, disaster recovery, or periods of increased usage. It’s not just about extra storage, RAM, or CPU, but about having redundant systems ready to take over seamlessly, like a backup singer.

In my experience, users often ask for “HA” when they really need a “stable and trusted solution.” HA can add complexity, so it’s wise to define your needs clearly. For instance, a highly redundant and complex HA setup might be overkill for monitoring non-critical systems.

The right HA solution depends on your needs. Critical systems requiring immediate response benefit from sub-second failover HA. Less critical applications with some tolerance for downtime (seconds or minutes) have more flexibility. Keep in mind that PMM itself has a one-minute minimum alerting interval, so even with perfect HA, the fastest you’ll know about an issue is one minute after it occurs. Choose your HA solution based on your uptime and performance needs, the amount of data loss you can tolerate, and PMM’s own limitations.
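To make that trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes a deliberately simple model: monitoring first has to come back (the failover time), and then one alerting interval passes before you are notified. The option names and failover times are hypothetical examples, not measured PMM figures.

```python
from dataclasses import dataclass

# One-minute minimum alerting interval, as stated in the text above.
ALERT_INTERVAL_S = 60.0


@dataclass
class HaOption:
    name: str          # hypothetical label for an HA approach
    failover_s: float  # assumed time until monitoring is available again


def worst_case_notification_s(option: HaOption,
                              alert_interval_s: float = ALERT_INTERVAL_S) -> float:
    """Simplified upper bound: failover time plus one alerting interval."""
    return option.failover_s + alert_interval_s


# Illustrative values only; measure your own environment.
options = [
    HaOption("sub-second failover", 0.5),
    HaOption("container restart", 30.0),
    HaOption("reschedule on another node", 300.0),
]

for opt in options:
    print(f"{opt.name}: notified within about {worst_case_notification_s(opt):.1f} s")
```

Under this model, shaving failover from 30 seconds to half a second only moves the notification delay from about a minute and a half to about a minute, which is why sub-second failover is not automatically worth the added complexity.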

This post explores various HA-ish options for PMM to help you find the best fit.

Shining light on PMM’s data caching powerhouse

Before diving into HA options, let’s acknowledge a powerful PMM feature that was implemented in recent releases. When the connection between the PMM Client (the server where the monitored database resides) and the PMM Server is interrupted, the PMM Client cleverly stores data locally. Once the connection is restored, this cached data is transferred to the PMM Server, providing users with full visibility into what was happening during the network downtime. This built-in functionality offers a valuable safety net in itself. Just be aware that the cache holds 1GB by default; once it is full, the oldest metrics are dropped first (First In, First Out).

[Figure: PMM's data caching powerhouse. Panels: usual PMM data flow process; connection to the PMM server is lost; connection is restored.]
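The caching behavior described above can be pictured with a tiny simulation. This is only a sketch of the general First In, First Out idea, not PMM Client's actual implementation: the `FifoCache` class, its method names, and the 10-byte limit are all made up for illustration (the real default limit is 1GB).

```python
from collections import deque


class FifoCache:
    """Byte-bounded buffer that drops the oldest samples first when full."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.dropped = 0
        self._items = deque()

    def push(self, sample):
        """Buffer a sample while the server is unreachable."""
        self._items.append(sample)
        self.used_bytes += len(sample)
        # Evict from the front (oldest) until the buffer fits the limit again.
        while self.used_bytes > self.max_bytes and self._items:
            oldest = self._items.popleft()
            self.used_bytes -= len(oldest)
            self.dropped += 1

    def flush(self):
        """Connection restored: hand over everything, oldest first."""
        items = list(self._items)
        self._items.clear()
        self.used_bytes = 0
        return items


# Tiny limit so the eviction is visible; each sample below is 4 bytes.
cache = FifoCache(max_bytes=10)
for i in range(6):
    cache.push(f"m{i}..".encode())

sent = cache.flush()
print(f"sent {len(sent)} samples, dropped {cache.dropped}")
```

The point to take away is that a long enough outage silently costs you the oldest data, so the cache is a safety net for short interruptions rather than a substitute for getting the server back quickly.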

HA options for PMM: A three-tiered approach

Now, let’s explore the current options for achieving high availability with PMM:

1. Simple Docker restart with data caching:

This is a straightforward approach. You can simply launch the PMM Server within Docker using the recommended “--restart=always” flag. This ensures that the PMM Server will automatically restart if a minor issue occurs. Thanks to the data caching feature mentioned earlier, no data will be lost during this restart, provided the server comes back before the client-side cache fills up.

This option is suitable for scenarios where the primary concern is ensuring the ability to investigate potential issues later. However, it’s important to remember that this approach is limited by the underlying physical infrastructure. Automatic recovery might be challenging if the failure stems from a hardware issue.

To be fully transparent, this is “HA level 0”: it is less about true high availability and more about “make sure PMM keeps running if it can heal itself.”
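As a rough illustration of this first option, the snippet below assembles a `docker run` invocation with the restart policy set. Treat it as a sketch under assumptions: the image tag `percona/pmm-server:2`, the published port, the `pmm-data` volume, and the container name are example values, so check the PMM installation documentation for the ones that match your version.

```python
import shlex

# Assumed example values; adjust to your PMM version and environment.
IMAGE = "percona/pmm-server:2"


def pmm_docker_command(name="pmm-server", volume="pmm-data", https_port=443):
    """Build the argument list for starting PMM Server with auto-restart."""
    return [
        "docker", "run", "--detach",
        "--restart=always",                 # restart the container automatically
        "--publish", f"{https_port}:443",   # host port mapped to the container's HTTPS port
        "--volume", f"{volume}:/srv",       # keep data outside the container
        "--name", name,
        IMAGE,
    ]


# Print the command for review instead of executing it.
print(shlex.join(pmm_docker_command()))
```

Printing the command rather than running it keeps the sketch side-effect free; pass the list to `subprocess.run` only once you have confirmed the values for your setup.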

2. Leveraging Kubernetes (K8s) for enhanced isolation:

Kubernetes (K8s) shines when isolating applications from the underlying infrastructure. PMM offers a Helm chart that facilitates running PMM within a K8s environment. This setup offers a significant advantage: even if the physical infrastructure encounters a problem, K8s automatically handles failover, migrating the PMM instance to a healthy node.

While restarts within K8s can take up to several minutes (depending on your infrastructure configuration), PMM’s data caching ensures that information is preserved during this transition. Even if the issue started during PMM’s restart and continues after PMM is back, it will still trigger alerts to keep you informed.
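A quick sanity check helps here: how long can PMM Server be away before the 1GB client cache starts dropping data? The arithmetic below is a sketch, and the metric rate is a purely hypothetical figure, so measure the real rate on your own clients before relying on the result.

```python
# 1GB default cache from the text, taken here as 2**30 bytes.
CACHE_LIMIT_BYTES = 1024 ** 3


def max_outage_seconds(bytes_per_second, cache_limit_bytes=CACHE_LIMIT_BYTES):
    """Time until the cache is full at a constant metric rate."""
    if bytes_per_second <= 0:
        raise ValueError("bytes_per_second must be positive")
    return cache_limit_bytes / bytes_per_second


# Hypothetical client producing 50 KiB of metrics per second.
rate = 50 * 1024
budget_s = max_outage_seconds(rate)

# A restart of "several minutes"; five minutes is used as an example.
restart_s = 5 * 60

print(f"cache lasts about {budget_s / 3600:.1f} hours")
print(f"a {restart_s} s restart fits in the budget: {restart_s < budget_s}")
```

At that assumed rate, a restart lasting a few minutes sits comfortably inside the budget; a client that produces far more data, or an outage that lasts many hours, would not.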

3. Fully clustered PMM in K8s (For advanced users, coming Q3/2024):

This option caters to a specific user group: those requiring large deployments with numerous instances distributed across various locations. PMM already offers a solution for running this configuration within Docker. Additionally, we’re actively developing comprehensive K8s deployments that encompass clustered database setups (ClickHouse, VictoriaMetrics, and PostgreSQL).
You can read more about this approach in “Mastering Database Monitoring: Running PMM in High-Availability Mode.”

Additions to blog post on high availability for PMM

While the options mentioned above provide robust HA solutions for PMM, some scenarios might require a more comprehensive disaster recovery strategy. For those seeking an in-depth exploration of disaster recovery or blue-green deployments, we recommend this blog post: “Percona Monitoring and Management High Availability – A Proof of Concept.” It delves into advanced techniques for keeping PMM resilient in the face of major outages.

Addressing your PMM stability concerns

If you’re a current PMM user with concerns about availability, fret no more! We encourage you to experiment with the HA methods outlined in this blog post. If, after implementing these solutions, you encounter limitations that don’t align with your specific needs, we invite you to share your experience in our user forum. Our community thrives on open discussion, and your valuable feedback helps us continuously improve PMM’s capabilities. Let’s work together to ensure your monitoring system remains the ever-vigilant guardian it’s designed to be!

Percona Monitoring and Management is a best-of-breed open source database monitoring solution for use with MySQL, PostgreSQL, MongoDB, and the servers on which they run. Monitor, manage, and improve the performance of your databases no matter where they are located or deployed.

 

Download Percona Monitoring and Management Today
