Sunlight is finally beginning to shine through Amazon’s disastrous cloud crash, which seemed impossible to avoid yesterday as it took down major sites like Foursquare and Reddit. Certain parts of the service are still struggling to get back online.
Downtime is to be expected for any web service, but the prolonged failure of Amazon’s cloud (it first went down overnight on Wednesday, April 20) is particularly painful, since so many companies rely on it for their sites and services. It serves as a reminder of the vulnerability of cloud services. Amazon had backup systems in place to prevent a scenario like this, but clearly the company didn’t foresee a breakdown on this scale.
At the moment, Amazon’s infrastructure in Northern Virginia is having the most trouble, with both the Elastic Compute Cloud (EC2) and the Relational Database Service (RDS) affected. Here’s the latest update from Amazon’s engineers:
We’re starting to see more meaningful progress in restoring volumes (many have been restored in the last few hours) and expect this progress to continue over the next few hours. We expect that we’ll reach a point where a minority of these stuck volumes will need to be restored with a more time consuming process, using backups made to S3 yesterday (these will have longer recovery times for the affected volumes). When we get to that point, we’ll let folks know. As volumes are restored, they become available to running instances, however they will not be able to be detached until we enable the API commands in the affected Availability Zone.
For some customers, Amazon’s lack of communication was the biggest problem: “Starting at 1:41 a.m. PST, Amazon’s updates read as if they were written by their attorneys and accountants who were hedging against their stated SLA rather than being written by a tech guy trying to help another tech guy,” wrote BigDoor CEO Keith Smith.
“We aren’t just sitting around waiting for systems to recover,” he continued. “We are actively moving instances to areas within the AWS cloud that are actually functioning. If Amazon had been more forthcoming with what they are experiencing, we would have been able to restore our systems sooner.”