Amazon Web Services (AWS), a public cloud infrastructure provider, today apologized for disruptions to its S3 storage service and dozens of other services in its Northern Virginia data center region earlier this week. The Amazon subsidiary also detailed steps it’s taking to prevent similar outages in the future.
“Finally, we want to apologize for the impact this event caused for our customers. While we are proud of our long track record of availability with Amazon S3, we know how critical this service is to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further,” the company said at the conclusion of a postmortem on the outage.
The event was triggered by human error, which is not unusual for web service outages. GitLab’s recent outage, for instance, was caused by human error. Human error also led to an outage of AWS’ closest competitor, Microsoft Azure, back in 2014.
“The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected,” AWS said in the postmortem. “At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.”
On top of the outage itself, AWS struggled to communicate clearly what was going wrong on February 28. The Service Health Dashboard (SHD) did display a message showing that S3 was having issues, but for a time all other services looked OK even though they were not. That’s because the icons for the many AWS services were hosted in the Northern Virginia region.
“We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions,” AWS said.
AWS is also starting now, rather than later in 2017, to update its index subsystem, which handles metadata and location information for S3 objects, so that affected systems will take less time to come back online.
AWS is running audits on its operational systems, and it has already adjusted one “to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level,” AWS said.
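To make that kind of safeguard concrete, here is a minimal Python sketch of a capacity floor check. It is not AWS’ tooling: the `Subsystem` class, the `remove_servers` function, the `CapacityGuardError` exception and the server counts are all hypothetical, chosen only to illustrate the idea of refusing a removal that would leave a subsystem below its minimum required capacity.

```python
from dataclasses import dataclass


class CapacityGuardError(Exception):
    """Raised when a removal would breach a subsystem's minimum capacity."""


@dataclass
class Subsystem:
    # All fields are hypothetical; the postmortem does not describe a data model.
    name: str
    active_servers: int
    min_required: int  # assumed per-subsystem capacity floor


def remove_servers(subsystem: Subsystem, count: int) -> int:
    """Remove `count` servers, refusing if that would go below the floor.

    Returns the number of servers still active after the removal.
    """
    if count < 0:
        raise ValueError("count must be non-negative")

    remaining = subsystem.active_servers - count
    if remaining < subsystem.min_required:
        # Reject the whole request instead of partially applying it.
        raise CapacityGuardError(
            f"removing {count} servers from {subsystem.name} would leave "
            f"{remaining}, below the minimum of {subsystem.min_required}"
        )

    subsystem.active_servers = remaining
    return remaining
```

With a check like this, a mistyped input that asks for far more servers than intended fails outright and leaves the subsystem untouched, rather than quietly removing capacity that other subsystems depend on.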
Read the full postmortem here.