Introducing Consul

I’m always interested in what Mitchell Hashimoto and Hashicorp are up to, I typically find their projects valuable.  If you’ve heard of Vagrant, you know their work.

I recently became interested in a newer project they have called ‘Consul‘.  Consul is a bit hard to describe.  It is (in part):

  • Highly consistent metadata store (a bit like Zookeeeper)
  • A monitoring system (lightweight Nagios)
  • A service discovery system, both DNS and HTTP-based. (think of something like haproxy, but instead of tcp load balancing, it provides dns lookups with healthy services)

What this has to do with Percona XtraDB Cluster

I’ve had some more complex testing for Percona XtraDB Cluster (PXC) to do on my plate for quite a while, and I started to explore Consul as a tool to help with this.  I already have Vagrant setups for PXC, but ensuring all the nodes are healthy, kicking off tests, gathering results, etc. were still difficult.

So, my loose goals for Consul are:

  • A single dashboard to ensure my testing environment is healthy
  • Ability to adapt to any size environment — 3 node clusters up to 20+
  • Coordinate starting and stopping load tests running on any number of test clients
  • Have the ability to collect distributed test results

I’ve succeeded on some of these fronts with a Vagrant environment I’ve been working on. This spins up:

  • A Consul cluster (default is a single node)
  • Test server(s)
  • A PXC cluster

Additionally, it integrates the Test servers and PXC nodes with Consul such that:

  • The servers setup a Consul agent in client mode to the  Consul cluster
  • Additionally, they setup a local DNS forwarder that sends all DNS requests to the ‘.consul’ domain to the local agent to be serviced by the Consul cluster.
  • The servers register services with Consul that run local health checks
  • The test server(s) setup a ‘watch’ in consul to wait for starting sysbench on a consul ‘event’.

Seeing it in action

Once I run my ‘vagrant up’, I get a consul UI I can connect to on my localhost at port 8501:

Consul's Node Overview

Consul’s Node Overview

 

I can see all 5 of my nodes.  I can check the services and see that test1 is failing one health check because sysbench isn’t running yet:

Consul reporting sysbench is not running.

Consul reporting sysbench is not running.

This is expected, because I haven’t started testing yet.  I can see that my PXC cluster is healthy:

 

Health checks are using clustercheck from the PXC package

Health checks are using clustercheck from the PXC package

 

Involving Percona Cloud Tools in the system

So far, so good.  This Vagrant configuration (if I provide a PERCONA_AGENT_API_KEY in my environment) also registers my test servers with Percona Cloud Tools, so I can see data being reported there for my nodes:

Percona Cloud Tool's Dashboard for a single node

Percona Cloud Tool’s Dashboard for a single node

So now I am ready to begin my test.  To do so, I simply need to issue a consul event from any of the nodes:

My pre-configured watchers on my test node knows what to do with that event and launches sysbench.  Consul shows that sysbench is indeed running:

Screen Shot 2014-11-26 at 9.43.29 AM

 

And I can indeed see traffic start to come in on Percona Cloud Tools:

Screen Shot 2014-11-26 at 9.53.11 AM

I have testing traffic limited for my example, but that’s easily tunable via the Vagrantfile.  To show something a little more impressive, here’s a 5 node cluster running hitting around 2500 tps total throughput:

Screen Shot 2014-11-26 at 1.08.48 PM

So to summarize thus far:

  • I can spin up any size cluster I want and verify it is healthy with Consul’s UI
  • I can spin up any number of test servers and kick off sysbench on all of them simultaneously

Another big trick of Consul’s

That so far so good, but let me point out a few things that may not be obvious.  If you check the Vagrantfile, I use a consul hostname in a few places.  First, on the test servers:

then again on the PXC server configuration:

Notice ‘pxc.service.consul’.  This hostname is provided by Consul and resolves to all the IPs of the current servers both having and passing the ‘pxc’ service health check:

So I am using this to my advantage in two ways:

  1. My PXC cluster bootstraps the first node automatically, but all the other nodes use this hostname for their wsrep_cluster_address.  This means: no specific hostnames or ips in the my.cnf file, and this hostname will always be up to date with what nodes are active in the cluster; which is the precise list that should be in the wsrep_cluster_address at any given moment.
  2. My test servers connect to this hostname, therefore they always know where to connect and they will round-robin (if I have enough sysbench threads and PXC nodes) to different nodes based on the response of the dns lookup, which returns 3 of the active nodes in a different order each time.

(Some of) The Issues

This is still a work in progress and there are many improvements that could be made:

  • I’m relying on PCT to collect my data, but it’d be nice to utilize Consul’s central key/value store to store results of the independent sysbench runs.
  • Consul’s leader election could be used to help the cluster determine which node should bootstrap on first startup. I am assuming node1 should bootstrap.
  • A variety of bugs in various software still makes this a bit clunky sometimes to manage.  Here is a sample:
    • Consul events sometimes don’t fire in the current release (though it looks to be fixed soon)
    • PXC joining nodes sometimes get stuck putting speed bumps into the automated deploy.
    • Automated installs of percona-agent (which sends data to Percona Cloud Tools) is straight-forward, except when different cluster nodes clobber each other’s credentials.

So, in summary, I am happy with how easily Consul integrates and I’m already finding it useful for a product in its 0.4.1 release.

3 Comments
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
epcim

Hi, I am about to use percona + consul for production. I am about to avoid consul-template since I already use chef for configuration.

Can you please confirm that usage of 'wsrep_cluster_address' => 'gcomm://pxc.service.consul', will result in using/specifying of “all” the hosts registered in ‘pxc’ service defined on consul? Like one..N IP addresses? It ovbiously depends on what percona does with “gcomm://” handler.

Thanks PMi

epcim

Thanks!