Amazon has released a detailed explanation of its massive EC2 outage this past Friday and Saturday. But even with all the new info, services that went down, including Netflix, Instagram, and Pinterest, have yet to fill in some crucial blanks.
This past weekend, Amazon data centers in Northern Virginia failed after a powerful storm ripped through the area. The company’s AWS service health dashboard slowly updated with details concerning “Power Issues,” but it offered very little explanation beyond those brief updates.
In its new detailed remarks, Amazon says that a series of bugs and small problems ultimately cut off power from its backup generators. The generators ran for a few minutes and then failed, leaving the data centers in the dark. Technicians brought the generators back online in just 10 minutes, but rebooting all the affected servers required another three hours.
Amazon also hit a bug that screwed up its Elastic Load Balancers, which help distribute and redirect traffic across different Amazon data centers. Another bug in its Relational Database Service kept a small number of database instances from recovering normally.
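To make the load balancer's role concrete, here is a minimal sketch of the general idea: rotate traffic across zones and skip any zone marked as down. This is a toy illustration, not Amazon's actual Elastic Load Balancer logic. The class name, method names, and zone names are all hypothetical.

```python
class ZoneAwareBalancer:
    """Toy round-robin balancer that skips zones marked unhealthy.

    Hypothetical illustration only; not how Amazon's ELB is implemented.
    """

    def __init__(self, zones):
        self.zones = list(zones)
        self.healthy = set(self.zones)
        self._next = 0  # round-robin cursor into self.zones

    def mark_down(self, zone):
        # Stop sending traffic to a zone that has failed.
        self.healthy.discard(zone)

    def mark_up(self, zone):
        # Resume sending traffic to a known zone that has recovered.
        if zone in self.zones:
            self.healthy.add(zone)

    def route(self):
        # Try each zone at most once, starting from the cursor.
        for _ in range(len(self.zones)):
            zone = self.zones[self._next]
            self._next = (self._next + 1) % len(self.zones)
            if zone in self.healthy:
                return zone
        raise RuntimeError("no healthy zones available")


# Hypothetical zone names, for illustration.
balancer = ZoneAwareBalancer(["zone-a", "zone-b", "zone-c"])
balancer.mark_down("zone-b")
picks = [balancer.route() for _ in range(4)]
```

With one zone marked down, requests keep flowing to the remaining zones. If the routing layer itself breaks, healthy zones stop receiving traffic even though their servers are fine.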
We contacted Instagram, Netflix, and Pinterest to get more details that could better explain their outages, but all three services declined to give us access to someone who could talk about their infrastructure. Netflix, for example, has redundancies that should keep it from going down in a single Amazon outage, yet it still failed this past Friday. We’re unsure at this time whether Instagram and Pinterest have set up their architectures in a similar way to avoid outages like these.
Yes, the blame mostly falls on Amazon. But it would be good to have an official explanation from a player like Netflix, especially if Amazon screwed up so badly that it defeated whatever spared Netflix from crashing during Amazon’s notable April 2011 outage. A tweet by Netflix cloud architect Adrian Cockcroft points to the Elastic Load Balancers as the real problem. He writes that the company “lost instances in one zone, but lost ELB traffic routing to the zones that were working.”
Instagram should also give an explanation, because its outage lasted well into Saturday, whereas Netflix and Pinterest were down for only a matter of hours.
Amazon claims it will repair and retest its data center equipment and software to improve its services. The company’s apology at the bottom of its detailed remarks reads:
We apologize for the inconvenience and trouble this caused for affected customers. We know how critical our services are to our customers’ businesses. If you’ve followed the history of AWS, the customer focus we have, and the pace with which we iterate, we think you know that we will do everything we can to learn from this event and use it to drive improvement across our services. We will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make further changes to improve our services and processes.
Let us know in the comments (or send me a note at sean AT venturebeat.com) if you have any further insight into the outage that Amazon didn’t cover in its explanation.