Amazon promised a detailed postmortem of its massive Elastic Compute Cloud (EC2) crash from last week, and it certainly delivered today with a nearly 6,000-word breakdown of the fiasco by the Amazon Web Services (AWS) team.
But despite being more open about the problem, the company still took around 5,700 words before it finally apologized to its customers, who include major sites and services like Foursquare, Reddit, and Quora. Amazon will be offering affected users 10 days of credit, but that certainly won’t make up for the business lost by sites that depend on the service.
The post is incredibly dense and technical, but there are a few compelling points to note: Amazon says that the crash was caused by a network configuration change, and that it will make sure similar changes go smoothly in the future. The company also says that its experience with this crash will inform how it protects its cloud service from now on.
“The issues affecting EC2 customers last week primarily involved a subset of the Amazon Elastic Block Store (“EBS”) volumes in a single Availability Zone within the US East Region that became unable to service read and write operations,” Amazon wrote. The company refers to those troublesome volumes as “stuck.”
The event has also led Amazon to reassess the structure of its Elastic Compute Cloud (EC2), which currently consists of “Regions” (isolated sections of EC2) and “Availability Zones” (located within Regions). “Our EBS control plane is designed to allow users to access resources in multiple Availability Zones while still being tolerant to failures in individual zones. This event has taught us that we must make further investments to realize this design goal,” Amazon wrote.
The company says it will make it easier for customers to build fault-tolerant services that take advantage of multiple Availability Zones. It will also host a series of free webinars, starting May 2, on how companies can better build their services for the cloud. Additionally, Amazon says it will “invest in increasing our visibility, control, and automation to recover volumes in an EBS cluster.”
Amazon also responded to complaints about its lack of communication during the event. The AWS team says it initially felt that focusing its energy on fixing the problem at hand was more important than communicating, though it now recognizes that it needs to do a better job of keeping customers in the loop. The company began to post more frequent updates towards the end of the cloud crash, and it will staff up its developer support team in the future to keep customers updated. It’s also working on tools that will let you see if your Amazon service is being disrupted.