In this blog post, we’ll discuss how to use concurrency to compensate for WAN latency when using synchronous replication clusters.

WAN Latency Problem

Our customers often ask us for help or advice with WAN clustering problems. Historically, the usual solution for MySQL WAN deployments has been to keep the primary site in one data center, with a stand-by backup site in another data center replicating asynchronously from the primary. These days, however, there is a huge desire to employ the available synchronous replication solutions for MySQL, such as Galera (i.e., Percona XtraDB Cluster) or the recently released MySQL Group Replication. This trend is partly because these solutions are less problematic and provide more automatic failover and failback procedures, but also because businesses want to write in both data centers simultaneously.

Unfortunately, WAN link reliability and latency make synchronous replication a big challenge. In many cases, these challenges force geographically separate data centers to keep replicating asynchronously.

From a requirements point of view, the official documentation from Galera’s founders has WAN-related recommendations and some dedicated options (like segments), as described in Jay’s blog post. WAN deployments are absolutely possible with Galera, and even an advertised option. The MySQL Group Replication team, however, seems to discourage such use cases, as we can read:

Group Replication is designed to be deployed in a cluster environment where server instances are very close to each other, and is impacted by both network latency as well as network bandwidth.

Remedy

While perhaps obvious to some, I would like to point out a simple dependency that might be a viable solution in some deployments facing significant network latency. That solution is concurrency! When write throughput is limited by transaction commit latency, you can employ more writer threads. By using separate connections to MySQL, you can commit more transactions at the same time overall.

Let me demonstrate with example results based on a very simple test case. I tested both Percona XtraDB Cluster (with Galera replication) and MySQL Group Replication. In each case, I configured a minimal cluster of three nodes running as Docker containers on the same host, where the baseline network latency between nodes is around 0.05ms on average. To simulate a WAN network, I then introduced an artificial network latency of 50ms and 100ms on one of the nodes’ network interfaces. I later repeated the same tests using VirtualBox VM instances running on a completely different server, and the results were very similar. The command to simulate additional network latency is:
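Something along these lines, using tc with the netem queueing discipline (the eth0 interface name and the exact form of the command are assumptions here; run it as root on the node being delayed, and substitute the container’s or VM’s actual interface):

    # add 50ms of outbound delay on eth0 (100ms for the second round of tests)
    tc qdisc add dev eth0 root netem delay 50ms

    # adjust the delay later, or remove it once the tests are done
    tc qdisc change dev eth0 root netem delay 100ms
    tc qdisc del dev eth0 root netem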

This delays pings to the other nodes in the cluster as well, which is an easy way to verify the added latency:
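For example (10.0.0.2 stands in for another cluster node’s address):

    # round-trip times should now be roughly 50ms (or 100ms) higher
    ping -c 3 10.0.0.2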

The test is very simple: execute 500 small insert transactions, each inserting just a single row (but that is less relevant here).

For testing, I used a simple mysqlslap command:
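Along these lines (a sketch rather than the exact original invocation; the node1 host name and credentials are placeholders, the table definition matches the one shown below, and --concurrency was varied between 1 and 64 across runs):

    # each INSERT runs as its own autocommitted transaction
    mysqlslap --host=node1 --user=root --password \
      --create="CREATE TABLE t1 (id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY, b VARCHAR(100)) ENGINE=InnoDB" \
      --query="INSERT INTO t1 (b) VALUES ('test')" \
      --number-of-queries=500 \
      --concurrency=8 \
      --iterations=1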

and a simple, single table:
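A minimal sketch of what such a table can look like (matching the DDL passed to --create above); note that both Galera and Group Replication work best with, and in Group Replication’s case effectively require, a primary key on every table:

    CREATE TABLE t1 (
      id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
      b VARCHAR(100)
    ) ENGINE=InnoDB;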

Interestingly, without the added latency, the same test takes much longer against the Group Replication cluster, even though Group Replication runs by default with group_replication_single_primary_mode enabled and group_replication_enforce_update_everywhere_checks disabled. Theoretically, that should make it a lighter operation from a “data consistency checks” point of view. With WAN-type latencies, Percona XtraDB Cluster also seems to be slightly faster in this particular test. Here are the test results for the three different network latencies:

Percona XtraDB Cluster: results in seconds, by network latency

Threads    100ms     50ms      0.05ms
1          51.671    25.583    0.268
4          13.936     8.359    0.187
8           7.84      4.18     0.146
16          4.641     2.353    0.13
32          2.33      1.16     0.122
64          1.808     0.925    0.098

Group Replication (GR): results in seconds, by network latency

Threads    100ms     50ms      0.05ms
1          55.513    29.339    5.059
4          14.889     7.916    2.184
8           7.673     4.195    1.294
16          4.52      2.507    0.767
32          3.417     1.479    0.473
64          2.099     0.809    0.267

[Chart: WAN latency test results]

I used the same InnoDB settings for both clusters, with each node running in a separate Docker container or VirtualBox VM. Similar tests could produce quite different results on real production systems, where more CPU cores provide better conditions for concurrency.

It also wasn’t my goal to benchmark Galera versus Group Replication, but rather to show that the same concurrency-to-write-throughput relationship applies to both technologies. I might be missing some tuning on the Group Replication side, so I don’t claim any verified winner here.

Just to provide some more details, I was using Percona XtraDB Cluster 5.7.16 and MySQL with Group Replication 5.7.17.

One important note: when you expect higher concurrency to provide better throughput, you must make sure the concurrency is not limited by server settings. For example, look at innodb_thread_concurrency (I used 0, so unlimited), slave_parallel_workers for Group Replication, and wsrep_slave_threads for Galera (among other settings related to I/O operations, etc.).
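As a minimal my.cnf sketch (the applier-thread values below are illustrative examples rather than the exact settings from these tests; only innodb_thread_concurrency=0 matches what I used, and each setting belongs on the respective cluster type):

    [mysqld]
    # 0 = no InnoDB thread concurrency throttling
    innodb_thread_concurrency = 0

    # Percona XtraDB Cluster / Galera nodes: parallel applier threads
    wsrep_slave_threads = 8

    # Group Replication nodes: parallel applier threads
    # (typically combined with slave_parallel_type = LOGICAL_CLOCK)
    slave_parallel_workers = 8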

Apart from “concurrency tuning,” which could involve application changes if not an architectural redesign, there are of course more possible optimizations for WAN environments. For example:

https://www.percona.com/blog/2016/03/14/percona-xtradb-cluster-in-a-high-latency-network-environment/ (to deal with latency)

and:

https://www.percona.com/blog/2013/12/19/automatic-replication-relaying-galera-3/

for minimizing network utilization using binlog_row_image=minimal and other variables.

But these are out of scope for this post. I hope this simple post helps you deal with the speed of light a little better! 😉
