Root cause analysis: outage of October 22, 2012

Summary

Several Acquia Cloud customers experienced an outage on Monday, October 22 due to the loss of a single Amazon Web Services (AWS) Availability Zone (AZ) in the US-EAST region. This outage affected the uptime of some Managed Cloud and Dev Cloud sites and required some sites to be restored from backups. It was a rolling outage across the impacted AZ: over time, customers who were healthy at the beginning of the event eventually became impaired, and in some cases customers whom we brought back online early in the event were affected a second time.

The Managed Cloud impact was limited to clients with servers in the affected AZ. However, during parts of the outage the Acquia Network and portions of Acquia Search were impacted, and customers were unable to launch new Acquia Cloud servers.

Due to the tools Acquia has built to mitigate AWS failures such as these, Acquia’s operations team was able to restore affected sites within 90 minutes of the initial outage, well in advance of AWS restoring full service to the affected AZ.

What Was The Root Cause?

What began as failures of a subset of AWS Elastic Block Storage (EBS) volumes in one AZ in US-EAST turned into a larger disruption that reportedly affected multiple AWS products, including EC2, in the affected AWS Availability Zone. Furthermore, the incident resulted in AWS throttling API traffic; the API is the mechanism by which vendors like Acquia control their own servers and storage, so the throttling significantly affected Acquia’s ability to mitigate impact on customers. AWS’s own summary of the October 22, 2012 AWS Service Event in the US-East Region can be found here: https://aws.amazon.com/message/680342/
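
To illustrate what API throttling means in practice, here is a minimal sketch of retrying a throttled EC2 API call with exponential backoff. This is not Acquia’s tooling; it is written against boto3 (the current AWS SDK for Python, not what existed in 2012), and the volume IDs, attempt count, and delays are assumptions.

    # Minimal sketch (not Acquia's tooling): retry an EC2 API call with
    # exponential backoff when AWS throttles API traffic.
    # 'RequestLimitExceeded' is the EC2 throttling error code; the volume
    # IDs, attempt count, and delays here are illustrative assumptions.
    import time
    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def describe_volumes_with_backoff(volume_ids, max_attempts=5):
        """Call DescribeVolumes, backing off whenever the API is throttled."""
        delay = 1.0
        for _ in range(max_attempts):
            try:
                return ec2.describe_volumes(VolumeIds=volume_ids)
            except ClientError as err:
                if err.response["Error"]["Code"] != "RequestLimitExceeded":
                    raise  # a real failure, not throttling
                time.sleep(delay)
                delay *= 2  # wait longer before each retry
        raise RuntimeError("EC2 API still throttled after %d attempts" % max_attempts)

When the API layer itself is throttled, even a wrapper like this can only wait; that is why the throttling slowed recovery rather than simply adding retries.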

Acquia Dev Cloud customers with instances in the affected zone were impacted and unable to quickly recover their sites because the AWS API was unavailable.

To continuously monitor our Acquia Cloud environment, the Acquia Operations team maintains a number of monitoring servers outside of AWS. This offsite monitoring environment is designed to ensure that our monitoring systems are not affected in the event of an AWS failure. Unfortunately, during the AWS outage the Acquia Operations team also temporarily lost one of its non-AWS monitoring servers, compounding the issues associated with the outage. The temporary loss of the monitoring server slowed the team’s ability to diagnose exactly what was broken and make the appropriate repairs to restore the Acquia Cloud environment.
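
As a rough illustration of the kind of check such an offsite monitor performs, the sketch below polls site endpoints over HTTP from outside AWS. The URLs, polling interval, and alerting behavior are hypothetical and are not Acquia’s actual monitoring configuration.

    # Hypothetical sketch of an offsite HTTP health check of the kind a
    # non-AWS monitoring server might run; the URLs, polling interval, and
    # alerting hook are assumptions, not Acquia's monitoring configuration.
    import time
    import requests

    SITES = ["https://example-customer-site.com/healthcheck"]  # hypothetical endpoints

    def is_healthy(url, timeout=10):
        """Return True if the site answers HTTP 200 within the timeout."""
        try:
            return requests.get(url, timeout=timeout).status_code == 200
        except requests.RequestException:
            return False

    while True:
        for url in SITES:
            if not is_healthy(url):
                print("ALERT: %s is unreachable or unhealthy" % url)  # e.g. page on-call
        time.sleep(60)  # poll every minute from outside AWS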

My Site Is on Managed Cloud’s HA Architecture. Why Did My Site Go Down?

Acquia Managed Cloud uses GlusterFS (a distributed file system) to allow multiple web servers to access the files needed by customer sites, which are stored on EBS volumes in two separate AZs. Managed Cloud customers with their primary or secondary EBS volumes in the affected AWS AZ were impacted because GlusterFS hung when it made calls to impaired EBS volumes. When this problem was detected, Acquia Operations initiated a fencing procedure to disconnect the failed EBS volume, which unblocked GlusterFS and brought the affected sites back online.
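
The following is a hypothetical sketch of what fencing an impaired brick can look like, not Acquia’s actual procedure: the replica backed by the failed EBS volume is force-removed so the surviving replica continues to serve files. The volume and brick names are made up, and the command syntax follows the GlusterFS 3.x CLI, which may differ across versions.

    # Hypothetical fencing sketch, not Acquia's actual procedure: force-remove
    # the GlusterFS brick backed by the impaired EBS volume so the surviving
    # replica keeps serving files. Volume and brick names are made up; the
    # CLI syntax follows the GlusterFS 3.x "remove-brick" command and may
    # differ in other versions.
    import subprocess

    VOLUME = "sitefiles"                      # hypothetical Gluster volume name
    BAD_BRICK = "fs-node-1b:/mnt/gfs/brick"   # brick on the failed EBS volume (made up)

    def fence_brick(volume, brick):
        """Drop the impaired brick, reducing the replicated volume to one copy."""
        subprocess.check_call([
            "gluster", "volume", "remove-brick",
            volume, "replica", "1", brick, "force",
        ])

    fence_brick(VOLUME, BAD_BRICK)

Once the impaired brick is out of the volume, client mounts stop blocking on it and reads and writes continue against the healthy replica in the unaffected AZ.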

What is Acquia Doing To Mitigate Future Risk?

AWS is constantly improving the reliability of its EC2 and EBS infrastructure, and has already fixed this issue and verified that it is not present in any other AZs. AWS is also improving the ability of its API layer to remain responsive in the event of an outage.

Acquia is in the process of testing a newer version of the GlusterFS file system that does not hang when an underlying EBS volume is impaired in a single AZ. Acquia has also improved its external monitoring systems (located outside of AWS) to be more resilient.

Status: Resolved