In this blog post, we’ll look at how to set up ProxySQL for high availability.

During the last few months, we’ve had a lot of opportunities to present and discuss a very powerful tool that will be used more and more in architectures supporting MySQL: ProxySQL.

ProxySQL is becoming more flexible, solid, performant and widely used every day (see http://www.proxysql.com/ and the recent comparison at http://www.proxysql.com/compare).

The tool is a clear winner when compared to similar ones, and we should all have a clear(er) idea of how to integrate it into our architectures in order to achieve the best results, including how to make it highly available.

The first thing to keep in mind is that ProxySQL doesn’t natively provide any high availability for itself. We can set up a cluster of MySQL servers and achieve four or even five nines of HA. But if we include ProxySQL as it is, as a single block, our HA has a single point of failure (SPOF) that can drag us down in the case of a crash.

To solve this, the most common solution is setting up ProxySQL as part of a tile architecture, where Application/ProxySQL are deployed together.

This is a good solution for some cases, and it for sure reduces the network hops, but it might be less than practical when our architecture has a very high number of tiles (hundreds of application servers, which is not so unusual nowadays).
In that case, managing ProxySQL is challenging. But more problematically, ProxySQL must perform several checks on the destination servers (MySQL). If we have 400 instances of ProxySQL, we end up keeping our databases busy just performing the checks.

In short, this is not a smart move.

Another possible approach is to have two layers of ProxySQL, one close to the application and another in the middle to connect to the database.

I personally don’t like this approach for many reasons. The most relevant reasons are that this approach creates additional complexity in the management of the platform, and it adds network hops.

So what can be done?

I love the KISS principle because I am lazy and don’t want to reinvent the wheel someone else has already invented. I also like it when my customers don’t need to depend on me or any other colleagues after I am done and gone. They must be able to manage, understand and fix their environment by themselves.

To keep things simple, here is my checklist:

  • Exclude the cases where a tile (application/ProxySQL) makes sense
  • Exclude the “in the cloud” cases where tools like ELB (Elastic Load Balancer) exist
  • Exclude architectures that already include a balancer

What can I use for the remaining cases?

The answer comes from combining existing solutions and existing building blocks: Keepalived + ProxySQL + MySQL.

For an explanation of Keepalived, visit http://www.keepalived.org/.

Short description
“Keepalived is a routing software written in C. The main goal of this project is to provide simple and robust facilities for load balancing and high-availability to Linux system and Linux-based infrastructures. The load balancing framework relies on well-known and widely used Linux Virtual Server (IPVS) kernel module providing Layer4 load balancing. Keepalived implements a set of checkers to dynamically and adaptively maintain and manage load-balanced server pool according to their health. On the other hand, high-availability is achieved by VRRP protocol. VRRP is a fundamental brick for router failover. Also, Keepalived implements a set of hooks to the VRRP finite state machine providing low-level and high-speed protocol interactions. Keepalived frameworks can be used independently or all together to provide resilient infrastructures.”

Bingo! This is exactly what we need for our ProxySQL setup.

Below, I will explain how to set up:

  • A simple solution based on a single VIP
  • A more complex solution using multiple VIPs
  • An even more complex solution using VIPs and virtual servers

All we want to do is to prevent ProxySQL from becoming a SPOF. And while doing that, we need to reduce network hops as much as possible (keeping the solution SIMPLE).

Another important concept to keep in mind is that a ProxySQL (re)start takes less than a second. This means that if it crashes and can be restarted by the angel process, doing so is much more efficient than any kind of failover mechanism. As such, any solution you plan should take the ~1 second ProxySQL restart time as its baseline.

Ready? Let’s go.

Setup

Choose three machines to host the combination of Keepalived and ProxySQL.

In the following example, I will use three machines for ProxySQL and Keepalived, and three hosting Percona XtraDB Cluster. You can place the Keepalived+ProxySQL pair wherever you like (even on the same Percona XtraDB Cluster boxes).

For the following examples, we will have:
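The host names TEST1, TEST2 and TEST3 are the ones used in the examples below; the addresses are placeholders only:

  • TEST1: ProxySQL + Keepalived (for example 192.168.0.51)
  • TEST2: ProxySQL + Keepalived (for example 192.168.0.52)
  • TEST3: ProxySQL + Keepalived (for example 192.168.0.53)
  • Three Percona XtraDB Cluster nodes as backends (for example 192.168.0.21, .22, .23)
  • One VIP for the simple setup (for example 192.168.0.100)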

To check, I will use this table (please create it in your MySQL server):
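(A minimal example of such a table; the schema and column names are illustrative, not necessarily the original ones.)

  CREATE SCHEMA IF NOT EXISTS test;
  CREATE TABLE IF NOT EXISTS test.proxy_ha_check (
    id       BIGINT NOT NULL AUTO_INCREMENT,
    app_time VARCHAR(32) NOT NULL,  -- timestamp written by the test loop
    node     VARCHAR(64) NOT NULL,  -- which Percona XtraDB Cluster node served the write
    PRIMARY KEY (id)
  ) ENGINE=InnoDB;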

And this bash TEST command to use later:
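(A sketch of such a loop, assuming the example table above, the my.cnf credentials file described below, the example VIP 192.168.0.100, and ProxySQL listening on its default traffic port 6033; adjust all of these to your environment.)

  while true; do
    mydate=$(date +'%Y-%m-%d %H:%M:%S.%6N')
    # Record the application-side timestamp and the backend that served the write,
    # then echo the last row so we can watch the flow in the terminal.
    mysql --defaults-extra-file=./my.cnf -h 192.168.0.100 -P 6033 --skip-column-names -e "
      INSERT INTO test.proxy_ha_check (app_time, node) SELECT '$mydate', @@hostname;
      SELECT app_time, node FROM test.proxy_ha_check ORDER BY id DESC LIMIT 1;"
    sleep 1
  done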

  1. Install ProxySQL (refer to https://github.com/sysown/proxysql/wiki#installation)
  2. Install Keepalived (yum install keepalived; apt-get install keepalived)
  3. Set up ProxySQL users and servers

Once you have your ProxySQL up (run the same configuration on all ProxySQL nodes, it is much simpler), connect to the Admin interface and execute the following:
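(What follows is an illustrative sketch: the hostgroup 600, the backend addresses and the credentials are example values to adjust.)

  -- Backend servers, all in one hostgroup
  INSERT INTO mysql_servers (hostgroup_id, hostname, port, weight)
  VALUES (600,'192.168.0.21',3306,1000),
         (600,'192.168.0.22',3306,1000),
         (600,'192.168.0.23',3306,1000);

  -- Application user, defaulting to that hostgroup
  INSERT INTO mysql_users (username, password, active, default_hostgroup)
  VALUES ('test_user','test_pass',1,600);

  LOAD MYSQL SERVERS TO RUNTIME; SAVE MYSQL SERVERS TO DISK;
  LOAD MYSQL USERS TO RUNTIME; SAVE MYSQL USERS TO DISK;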

Create a my.cnf file in your default dir with:
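(An illustrative example, matching the example user created above.)

  [client]
  user=test_user
  password=test_pass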

Simple Setup using a single VIP, 3 ProxySQL and 3 Galera nodes

(Image: the simple setup, with a single VIP in front of the three ProxySQL/Keepalived nodes and the three-node Percona XtraDB Cluster behind them.)

First, set up the Keepalived configuration file (/etc/keepalived/keepalived.conf):
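(A sketch for TEST1, using the same process check that also appears in the comments below; the interface name, virtual_router_id, weight and addresses are assumptions to adapt.)

  vrrp_script check_proxy {
      script "killall -0 proxysql"   # exits 0 only if a proxysql process is running
      interval 2                     # the 2-second check window discussed later
      weight -20                     # subtracted from the priority while the check fails
  }

  vrrp_instance VI_01 {
      interface em1                  # adjust to your NIC name
      virtual_router_id 51
      state MASTER                   # BACKUP on TEST2 and TEST3
      priority 101                   # see the priority values below
      virtual_ipaddress {
          192.168.0.100/24           # the single VIP the application connects to
      }
      track_script {
          check_proxy
      }
  }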

Given the above, and given that I want TEST1 to hold the VIP by default, we will set the priorities as follows:
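(Illustrative values; the gaps are smaller than the script weight of 20, so a failed check always demotes the node holding the VIP.)

  • TEST1: priority 101
  • TEST2: priority 100
  • TEST3: priority 99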

Modify the configuration on each node with the above values and (re)start Keepalived.

If all is set correctly, you will see the following in the system log of the TEST1 machine:
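(With the example instance name above, lines along these lines; timestamps, host name and PID will of course differ.)

  Keepalived_vrrp[...]: VRRP_Instance(VI_01) Transition to MASTER STATE
  Keepalived_vrrp[...]: VRRP_Instance(VI_01) Entering MASTER STATE
  Keepalived_vrrp[...]: VRRP_Instance(VI_01) setting protocol VIPs.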

On the other two machines:
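(Again assuming the example instance name.)

  Keepalived_vrrp[...]: VRRP_Instance(VI_01) Entering BACKUP STATE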

This means the node is standing by as a backup. 😀

Now it’s time to test our connection to our ProxySQL pool. From an application node, or just from your laptop, open three terminals. In each one:
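One option (an assumption, not necessarily what was used originally) is to run, on each ProxySQL node, a watch over that node’s local ProxySQL statistics; stats_mysql_connection_pool is a standard ProxySQL stats table, and the admin credentials/port below are the defaults:

  watch -n 1 'mysql -h 127.0.0.1 -P 6032 -uadmin -padmin -NB -e "SELECT srv_host, ConnUsed, Queries FROM stats.stats_mysql_connection_pool;"'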

Unless you are already sending queries to the proxies, they are doing nothing. Time to start the test bash command as indicated above. If everything is working correctly, you will see it reporting this:

The other three terminals will show that ONLY the ProxySQL on TEST1 is currently getting/serving requests, because it is the one holding the VIP:

This is all as expected. Time to see if the failover-failback works along the chain.

Let us kill the ProxySQL on TEST1 while the test bash command is running.
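For example (killing every proxysql process, so the angel process cannot simply respawn the child and Keepalived has to act):

  killall -9 proxysql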

Here is what you will get:

The source changed, but not the Percona XtraDB Cluster node. If you check the system log for TEST1:

While on TEST2:

Simple, and elegant. No need to re-invent the wheel to get a smooth working process.

The total recovery time for the ProxySQL crash is about 5.06 seconds, taking the widest window (last application entry before the crash and first entry recorded in Percona XtraDB Cluster after recovery: 2017-01-10 18:19:06.188233 | 2017-01-10 18:19:11.250927).

As such, this is the worst-case scenario, keeping in mind that we run the ProxySQL check every two seconds (so the real maximum recovery window is 5 - 2 = 3 seconds).

OK, what about fail-back?

Let us restart the proxysql service:
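Depending on the init system in use:

  service proxysql start      # SysV-style init
  systemctl start proxysql    # systemd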

Here’s the output:

The worst recovery time is 4.04 seconds, of which 2 seconds are a delay due to the check interval.

Of course, the test runs only one operation per second. As such, the impact is minimal (no errors during fail-back), and the measured recovery time is stretched accordingly.

Let’s check another thing: is the failover working as expected? Test1 -> 2 -> 3? Let’s kill 1 and 2 and see:

This image is where you should be at the end:

(Image: after killing TEST1 and TEST2, the VIP has moved to Test3, the last ProxySQL node standing.)

Here the last server standing is Test3. In this case, I killed one server immediately after the other. Keepalived took a bit longer to fail over, but it still did so correctly, following the planned chain.

Fail-back is smooth as usual:

Let’s see another case

The case above is simple. But as a caveat, I can only access one ProxySQL at a time. This might be good or not. In any case, it might be nice to be able to choose which ProxySQL to use. With Keepalived, you can: we can define any number of VIPs and associate one with each test box.

The result is that each server hosting ProxySQL will also host a VIP and, if needed, will be able to fail over to either of the other two servers.

Failover and failback are fully managed by Keepalived, which checks, as before, whether ProxySQL is running. An example of the configuration of one node can be seen below:
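(A sketch for one node, TEST1, reusing the same check script; the three VIP addresses, the interface name, the router IDs and the priorities are all illustrative.)

  vrrp_script check_proxy {
      script "killall -0 proxysql"
      interval 2
      weight -20
  }

  # VIP1, preferred owner TEST1
  vrrp_instance VI_01 {
      interface em1
      virtual_router_id 51
      state MASTER
      priority 101
      virtual_ipaddress {
          192.168.0.101/24
      }
      track_script {
          check_proxy
      }
  }

  # VIP2, preferred owner TEST2
  vrrp_instance VI_02 {
      interface em1
      virtual_router_id 52
      state BACKUP
      priority 99
      virtual_ipaddress {
          192.168.0.102/24
      }
      track_script {
          check_proxy
      }
  }

  # VIP3, preferred owner TEST3
  vrrp_instance VI_03 {
      interface em1
      virtual_router_id 53
      state BACKUP
      priority 100
      virtual_ipaddress {
          192.168.0.103/24
      }
      track_script {
          check_proxy
      }
  }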

The tricky part, in this case, is to play with the PRIORITY for each VIP and each server, such that you will NOT assign the same IP twice.
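One way to lay out the priorities (illustrative values, consistent with the TEST1 snippet above) is:

  • VIP1 (VI_01): TEST1=101, TEST2=100, TEST3=99
  • VIP2 (VI_02): TEST1=99,  TEST2=101, TEST3=100
  • VIP3 (VI_03): TEST1=100, TEST2=99,  TEST3=101

Each VIP has a different preferred owner, and each follows a different failover order.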

Performing a check with the test bash, we have:

The final state of IPs on Test3:
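Assuming the interface name used in the sketches above, something like this run on Test3 should list all three VIPs as additional addresses once the other two nodes are down:

  ip -4 addr show em1 | grep 'inet '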

And this is the image:

(Image: each ProxySQL/Keepalived node holds its own VIP and can take over either of the other two.)

 

Recovery times:

Conclusions

In this example, I used a check that verifies the ProxySQL process is running. But a check can be anything that reports 0|1. The limit is defined only by what you need.
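For instance, a slightly stricter check (an assumption, not taken from this setup) could require the local ProxySQL to actually answer on its admin interface rather than merely having a live process:

  #!/bin/bash
  # Return 0 if the local ProxySQL admin interface answers a trivial query, non-zero otherwise.
  # Keepalived treats a non-zero exit code from a vrrp_script as a failed check.
  mysql -h 127.0.0.1 -P 6032 -uadmin -padmin -NB -e 'SELECT 1;' >/dev/null 2>&1
  exit $?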

The failover times can be made significantly shorter by reducing the check interval and counting only the time taken to move the VIP. I preferred to show the worst-case scenario, using an application issuing a request every second. This is a pessimistic view of what normally happens with real traffic.

I was looking for a simple, simple, simple way to add HA to ProxySQL – something that can be easily integrated with automation, and that is actually also well-established and maintained. In my opinion, using Keepalived is a good solution because it matches all the above expectations.

Implementing a set of ProxySQL nodes, and having Keepalived manage the failover between them, is pretty easy. But you can expand the usage (and the complexity) if you need to, counting on tools that are already part of the Linux stack. There is no need to re-invent the wheel with a crazy mechanism.

If you want to have fun doing crazy things, at least start with something that helps you to go beyond the basics. For instance, I was also playing a bit with Keepalived and a virtual server, creating a set of redundant ProxySQL with load balancers and . . . but that is another story (blog).

😀

Great MySQL and ProxySQL to all!

6 Comments
Okky Hendriansyah

Hi Marco,

Nice write up and experimentation. I understand the idea to simplify yet improve the availability of ProxySQL. But I still do not understand how multiple VIPs in your last case actually work for applications accessing ProxySQL. Won’t an application point to just one of the VIPs?

Thank you.

Best regards,
Okky

Ryan

If the solution dictated the use of ProxySQL as part of a tile architecture, what would you recommend as an orchestration tool to keep the ProxySQL config and users in sync?

Neo

Hi,

How do we address split brain with Keepalived in the “simple solution based on a single VIP” setup?

Ivan

Great article Marco. May I suggest removing the weight from check_proxy script? e.g.

vrrp_script check_proxy {
script “killall -0 proxysql”
interval 2
}

Not sure if there is another reason for having that, but I found Keepalived’s behaviour a bit more straightforward this way. It will simply fail over if the script is not successful; otherwise, it will subtract the script weight you specify from the priority of the vrrp_instance and decide whether to fail over based on that.

Lina, DBA

Hello

How will you replicate the creation of new users on the three ProxySQL nodes?

@Lina: you have to implement ProxySQL Cluster. More details here: https://www.percona.com/blog/2018/06/11/proxysql-experimental-feature-native-clustering/