Acquia Operations detected that the secondary database server supporting the Acquia Network had failed and was no longer synchronized with the primary database. As a result, it could not serve as a reliable backup. A notification was sent to customers on Tuesday, July 24, 2012, alerting them that we would perform maintenance on July 26 to address this issue.
Two problems occurred during the maintenance: (1) maintenance began on July 25 (EDT), which was not the day specified in our communication to customers, and (2) the primary database failed during the operation, leaving the Acquia Network significantly impaired until 10:45 AM EDT on July 26 and partially impaired until 3:18 AM EDT on July 27.
Root Cause Analysis
At 2:02 AM EDT on July 26, Acquia Operations began shutting down the Acquia Network database server in order to migrate it to a larger and more powerful set of hardware. This is a routine operation that is performed daily. During this shutdown, the database server software did not terminate cleanly, and no error was generated. The result was database corruption that was not readily identifiable and that prevented the database server from restarting. The corruption became evident only when a restart was attempted.
By 2:15 AM EDT, multiple senior Acquia Operations and Client Advisory team members were engaged in diagnosing the problem and scoping the impact of the failure. At this time, Acquia customers were unable to access any Acquia Network functionality (Insight, UI-based code deployments, and other administrative features). This failure did not disrupt the operation of any customer web sites.
A repair of the Acquia Network database was started and ran to completion at 8:09 AM EDT. By 10:45 AM EDT, all Acquia Network functionality was restored with the exception of Acquia Network Connector features, including Insight. Full Acquia Network functionality was restored in a fully redundant configuration at 3:18 AM EDT on July 27.
Acquia is taking several steps to respond to issues like these more quickly and to mitigate and fix them faster.
- We will maintain additional levels of database backups, with an extra backup set aside immediately before any database maintenance activity.
- The code developed during the remediation process to temporarily mitigate disruption is being fully integrated into the Acquia Network platform so that Acquia Network service can be restored more quickly.
- New systems and processes have been put in place to dramatically improve the speed with which we communicate to our customers during service incidents.