Minimizing Downtime from Lengthy AWS Outages

Well, it happened again… Another lengthy EBS outage in the US-East region impacted several sites across the net. While failures like this are rare, they can be quite costly and translate into headaches for the operations team when they impact production systems for any length of time. At Percona, we routinely help clients architect and deploy highly available systems designed with disaster recovery in the cloud. Here are a few high-level best practices that I’ve seen when helping clients with AWS deployments:

Plan for failure
Plan for failure
Plan for … you get the idea

Plan for Failure

The single most critical piece is to plan for and expect failure. The ease of setting up infrastructure in the cloud, combined with promises of HA, can lead to a false sense of confidence. Assume that parts of the robust cloud are going to fail and work to eliminate any single point of failure (SPOF) within the cloud architecture. Simulate random things going away at random times (cue the Chaos Monkey). While outages are going to happen, preparing for and planning on cloud failures can help you mitigate the impact to your application.
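As a rough illustration of that kind of drill, here is a minimal Python sketch that models a deployment as a list of nodes and checks which single failure (one instance, one AZ, or one region) would take the whole service down. The node names, the two-role model ("app" and "db"), and the rule that the service is up when one node of each role survives are all assumptions made for the example. It says nothing about whether the surviving database actually holds current data.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str    # hypothetical instance name
    region: str  # e.g. "us-east-1"
    az: str      # e.g. "us-east-1a"
    role: str    # "app" or "db"


def survivors(nodes, scope, value):
    """Nodes still running after every node matching scope == value fails."""
    return [n for n in nodes if getattr(n, scope) != value]


def service_is_up(nodes, required_roles=("app", "db")):
    """The service counts as up if at least one node of each role survives."""
    alive = {n.role for n in nodes}
    return all(role in alive for role in required_roles)


def single_points_of_failure(nodes):
    """List every (scope, value) whose loss alone takes the service down."""
    found = []
    for scope in ("name", "az", "region"):
        for value in sorted({getattr(n, scope) for n in nodes}):
            if not service_is_up(survivors(nodes, scope, value)):
                found.append((scope, value))
    return found


def chaos_round(nodes, rng):
    """Fail one randomly chosen instance, AZ, or region and report the result."""
    scope = rng.choice(["name", "az", "region"])
    value = rng.choice(sorted({getattr(n, scope) for n in nodes}))
    return scope, value, service_is_up(survivors(nodes, scope, value))


# Two AZs in one region: survives an instance or AZ loss, not a region loss.
one_region = [
    Node("app1", "us-east-1", "us-east-1a", "app"),
    Node("db1", "us-east-1", "us-east-1a", "db"),
    Node("app2", "us-east-1", "us-east-1b", "app"),
    Node("db2", "us-east-1", "us-east-1b", "db"),
]
# Same layout plus a standby pair in a second region.
two_regions = one_region + [
    Node("app3", "us-west-2", "us-west-2a", "app"),
    Node("db3", "us-west-2", "us-west-2a", "db"),
]

print(single_points_of_failure(one_region))
print(single_points_of_failure(two_regions))
print(chaos_round(two_regions, random.Random()))
```

With the first layout the only single point of failure reported is the region itself; adding the second region empties the list. A real drill would of course kill actual instances rather than entries in a list, but even a paper model like this tends to surface the obvious gaps.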

AWS Infrastructure (High Level)

The Amazon Web Services infrastructure contains several individual components that can be combined to create a highly available architecture. While Amazon claims that issues in a single Availability Zone should have no impact on other zones in the same region, empirical evidence from past outages has shown otherwise. It is crucial to have components geographically isolated as well as isolated at the data center level.

Regions

The top level of isolation in AWS is the region. Regions are geographically isolated from one another, in different physical data centers around the world. Bandwidth across regions is similar to standard traffic across the internet, and is charged as such. In order to have a fully redundant solution, you need to have working instances in multiple regions that are able to operate independently.

Availability Zones

Within a single region, there are multiple Availability Zones (AZs). They are designed to operate independently, but there have been examples where issues in a single AZ impacted resources in different AZs. Data transfer within a single AZ is free while data transfer across AZs (but within the same region) is charged at a discounted Regional transfer rate. Having instances in multiple AZs is a minimal level of availability, but can’t be trusted alone.

Failure Scenarios

Looking at the history of AWS outages, they have been isolated to a single region, but have impacted multiple Availability Zones. Other regions may also suffer a bit of a slowdown as customers fail over into them. In general, there will be a larger load across the system and some API calls may not be fully responsive.

Possible Strategies

I would say that the number one strategy is making sure that you are geographically isolated. While that isn’t the end-all (multiple cloud providers, physical datacenter, etc), it should give your cloud app more resilience when faced with a cascading failure within a single region. This is a very similar principle to running real gear – you can’t rely on simply keeping servers in different racks when thinking of HA and DR. Rather, you need to have geographically isolated data centers in the event of a catastrophic failure.

EBS Failover

I have seen EBS volume failover used as a viable HA option within a single AZ. Essentially, your instance starts having issues so you mount your EBS data directory on another instance and simply fire up MySQL. This works wonderfully unless EBS is the component experiencing the issue. In this case, having a hot slave in another region is really the only way to “spin up another instance”. In general, unless your data resides in another location, you can’t always assume you can simply mount your storage to another instance (thinking SAN failure in the real gear analogy).

Multi Region Replication

The easiest approach would be to use native replication from a master to a slave across regions. This will allow you to keep a relatively in-sync instance in another region ready to take over in the event of a full region outage where your primary server resides. Being asynchronous, there is always a potential for some slave lag, but 1-2 seconds of lost transactions (which may or may not be recoverable once the downed region recovers) compared with hours of downtime is probably a decent tradeoff.
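To make that tradeoff concrete, here is a small Python sketch for routine monitoring of the standby. It assumes you have already fetched one row of `SHOW SLAVE STATUS` from the slave as a dictionary; the function names, the 5-second threshold, and the 500 writes per second figure are made up for illustration. Keep in mind that `Seconds_Behind_Master` is only an approximation of lag, and that during an actual region outage the I/O thread will be disconnected, so this check is useful before the failure rather than during it.

```python
def replica_health(status, max_lag_seconds=5):
    """Classify a slave from one SHOW SLAVE STATUS row (as a dict).

    Returns "ready", "lagging", or "broken".
    """
    io_ok = status.get("Slave_IO_Running") == "Yes"
    sql_ok = status.get("Slave_SQL_Running") == "Yes"
    lag = status.get("Seconds_Behind_Master")  # None (NULL) if not replicating
    if not (io_ok and sql_ok) or lag is None:
        return "broken"
    return "ready" if lag <= max_lag_seconds else "lagging"


def writes_at_risk(lag_seconds, writes_per_second):
    """Rough count of write transactions that would be missing on failover."""
    return lag_seconds * writes_per_second


# Hypothetical status row and workload figures.
row = {
    "Slave_IO_Running": "Yes",
    "Slave_SQL_Running": "Yes",
    "Seconds_Behind_Master": 2,
}
print(replica_health(row))
print(writes_at_risk(row["Seconds_Behind_Master"], writes_per_second=500))
```

At 2 seconds of lag and an assumed 500 writes per second, roughly 1,000 transactions would be missing on the standby at the moment of failover. Whether that is acceptable depends on the application, but it is the number to weigh against hours of downtime.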

You can also combine this approach with a tool like pt-query-digest or pt-playback to keep your standby server primed in the event of failover. I only mention this because in some cases a cold start can result in degraded service for quite some time as well.

Percona XtraDB Cluster (PXC)

Another option would be to use PXC and keep a node (or more) in a separate region. While this will allow for synchronous writes to the remote node, it will also have some latency impact on write operations (the ping time to the most distant node) but that may be something your application can tolerate.
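A back-of-the-envelope way to judge that latency impact is sketched below in Python. It follows the rule of thumb above, that each commit waits for roughly one round trip to the most distant node, and adds it to the local commit time. The round-trip times and the 2 ms local commit time are invented figures, not measurements, and the throughput number applies only to commits issued one after another on a single connection.

```python
def commit_latency_ms(local_commit_ms, rtt_ms_by_node):
    """Estimated commit time: local work plus the slowest round trip."""
    return local_commit_ms + max(rtt_ms_by_node.values(), default=0.0)


def max_serial_commits_per_second(latency_ms):
    """Upper bound on back-to-back commits from one connection."""
    return 1000.0 / latency_ms


# Hypothetical round-trip times from the node taking writes.
same_region = {"node-az-b": 1.0, "node-az-c": 1.5}
cross_region = {"node-az-b": 1.0, "node-west": 80.0}

for label, rtts in (("same region", same_region), ("cross region", cross_region)):
    latency = commit_latency_ms(2.0, rtts)
    rate = max_serial_commits_per_second(latency)
    print(f"{label}: {latency:.1f} ms per commit, {rate:.0f} serial commits/s")
```

Under these assumed numbers a single connection drops from a few hundred commits per second to about a dozen once a cross-region node joins the cluster. Concurrent connections commit in parallel, so aggregate throughput falls far less, which is why the application's write pattern matters more than the raw ping time.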

In conjunction with having a node outside of the region, you can also keep your other nodes in different AZs within the same region to at least give yourself some isolation at the data center level (think keeping your nodes in separate racks).

If the extra write latency is something that is a show-stopper for your application, you may consider replicating from a PXC cluster to another cluster or standalone server in another region via standard asynchronous replication. This would be similar to the above approach, but you gain some level of resiliency within the region as well. As a note, you will definitely want to load test this solution when evaluating a cross region cluster.

Overall, issues like this are inevitable and will likely cause at least some downtime or degraded service (unless you run active/active, but that is another discussion). However, given the issues that have occurred in the past, you can mitigate that by treating the cloud like you would a physical datacenter and planning accordingly. Expect failures and simulate/test your failover procedures so you can be confident the next time #ec2pacolypse hits.

7 Comments

Karl

11 years ago

And avoid North Virginia at all cost. Most of the issues are coming from this region, since it’s their first one.

Justin Swanhart

11 years ago

I like the idea of using two PXC clusters in different regions. Use at least three nodes in each region and place each node into a different AZ. The extra latency on commit won’t be too bad as the latency between AZs is fairly low.

You can use normal asynch replication between regions. I suggest this for two reasons. First, you don’t want an Internet transit problem (suddenly very slow connections, or lots of packet loss etc) to slow down writing into your primary region.

Second, while PXC performs much better when geographically distributed compared to semi-sync replication, the extra latency on commit may not be acceptable for your application if you extend your cluster across regions, particularly if you want your primary region to be in the US and the secondary region in Europe.

Some other things to keep in mind:
Make sure you encrypt your traffic between zones and regions. Use SSL, stunnel, openvpn, etc between your nodes.

Make sure you run backups in each of the regions.

Make sure you test failures and failure modes in each region.

Jacky

11 years ago

Just curious, has anyone done performance tests for PXC in a single Availability Zone, multiple Availability Zones, and multiple regions? This would give good insight into how well PXC does in an HA setup on EC2.

Jeremiah

11 years ago

Not using “The Cloud” seems to be the obvious way to avoid the impact of outages like this.

William

11 years ago

@Jeremiah Avoiding AWS will help you avoid AWS outages, but then you’ll have your own infrastructure problems. Mixed in with a few AWS specifics, there are some geographic distribution tips as well.

Aldo

11 years ago

Thanks for the post! The benefits of using AWS are reason enough to have a good contingency plan. I’ll suggest this topic be included in the course on AWS.

Anil

10 years ago

Please post your experience with Percona XtraDB Cluster (PXC) on AWS. It would help us since we want to explore this option.
