Amazon takes 5,700 words before apologizing for cloud crash

Amazon promised a detailed postmortem of its massive Elastic Compute Cloud (EC2) crash from last week, and it certainly delivered today with a nearly 6,000-word breakdown of the fiasco by the Amazon Web Services (AWS) team.

But despite being more open about the problem, it still took the company around 5,700 words before it finally apologized to its customers, which include major sites and services like Foursquare, Reddit, and Quora. Amazon will be offering affected users 10 days of credit, but that certainly won’t make up for the business lost by sites that depend on it.

The post is incredibly dense and technical, but there are a few compelling points to note: Amazon says that the crash event was caused by a network configuration change, and it will make sure that similar changes in the future go smoothly. The company also says that its experience with this crash will inform how it protects its cloud service in the future.

“The issues affecting EC2 customers last week primarily involved a subset of the Amazon Elastic Block Store (“EBS”) volumes in a single Availability Zone within the US East Region that became unable to service read and write operations,” Amazon wrote. The company refers to those troublesome volumes as “stuck.”

The event has also led Amazon to reassess the structure of its Elastic Compute Cloud (EC2), which currently consists of “Regions” (isolated sections of EC2) and “Availability Zones” (located within Regions). “Our EBS control plane is designed to allow users to access resources in multiple Availability Zones while still being tolerant to failures in individual zones. This event has taught us that we must make further investments to realize this design goal,” Amazon wrote.

The company says it will make it easier for customers to build fault-tolerant services that take advantage of multiple Availability Zones. It will also host a series of free webinars, starting May 2, on how companies can better build their services for the cloud. Additionally, Amazon says it will “invest in increasing our visibility, control, and automation to recover volumes in an EBS cluster.”

Amazon also responded to complaints about its lack of communication during the event. The AWS group says it initially felt that fixing the problem at hand was more important than communicating, though it now recognizes that it needs to do a better job of keeping customers in the loop. The company began posting more frequent updates toward the end of the cloud crash, and it will staff up its developer support team so it can keep customers updated in the future. It’s also working on tools that will let you see if your Amazon service is being disrupted.

  • http://twitter.com/juvus juvus

    Wow! Of all the things you could complain about today you are complaining that Amazon apologize at the end of the letter. I am sure a lot of customers were expecting an apology but a whole lot more wanted to know what happened to make their weekend go to crap. Amazon delivered on their promise for an in-depth review. The nitpicking is not warranted.

  • idontalwayscomment

    Well said Juvus, though the author might be trolling. Where is Sony's in depth technical report of why PSN was down? Amazon has kindly left little to blind faith. Releasing a report like that is allot harder than apologising. I commend them.

  • http://www.devindra.org Devindra Hardawar

    Oh, I'm sorry, I suppose it's unusual for me to expect a company to apologize first when they've royally screwed up the lives and businesses of others ;)

  • StupidPeopleShouldntBreed

    It's not unusual, but if you read your own article. You would realize thats exactly what they did. Don't be such a douchbag to your commentor's, they pay your salary. If those companies would have used the cloud for what it's actually supposed to be used for, they wouldn't have had any down time.

  • http://www.devindra.org Devindra Hardawar

    No commenters don't pay my salary. Thank god. And while I realize better preparation would have helped these companies, pretending that they're somehow more responsible for the downtime than Amazon's fumbling is just madness. If you haven't gathered this by now, the real problem with the postmortem is that it was written by an engineer when Amazon desperately needed a human face on this issue. A better response would have been a simpler document that apologizes first and explains the key issues, which then linked to this more technical explanation.

  • StupidPeopleShouldntBreed

    Ok they might not, but if your act like. Watch your reader's go elsewhere to read. Regardless how stupid the orignal comment might have been. Pretty sure venturebeat is able to pay you because of all these god awful ad's they have on this site. And let me remind you.. My ad view's pay your salary. I am not implying that Amazon shouldn't be sorry, but too put the complete blame on them is non-sense. We all know that computers will go down, it's just a matter of time. Like I said earlier, IF those companies knew how to properly use the cloud.. their site would have NEVER went down. While I agree, an engineer shouldn't be answering questions for the end user. They need to hire a few people, of which they already said that. Amazon is hiring like crazy here in Washington. With my rant being done, I must say.. Nice article. Was some good reading.

  • http://twitter.com/TheOnlyThapa Jain Thapa

    I think Amazon needs to learn some tips from Sony on how to apologize.

  • http://venturebeat.com/2011/08/09/amazon-ec2-outage/ Amazon suffers yet another cloud crash | VentureBeat

    [...] unlike the company’s disastrous cloud crash in April, which lasted around 10 days, Amazon’s Web Services was back up and running without issue in [...]

  • http://www.hubspan.com/cloud-computing/amazon-web-services-growing-pains/ I love Amazon but AWS is clearly having growing pains – Hubspan

    [...] running EC2 (Elastic Compute Cloud).   This one fortunately only lasted a few hours, unlike the company’s disastrous cloud crash in April, when multiple eCommerce sites were down for days, losing potentially millions of dollars in sales. [...]
