AWS outage shines a light on hybrid cloud

As the dust begins to settle on yet another cloud outage, chatter will once again center on the wisdom of companies putting all their digital eggs in a single cloud provider's basket.

Amazon's AWS "US-East-1" cloud region went down in Northern Virginia yesterday, disrupting some of Amazon's own applications and a slew of third-party services that rely on AWS. The cause? An "impairment of several network devices" led to multiple API errors, which in turn impacted myriad AWS services including Amazon Elastic Compute Cloud (EC2), Connect, DynamoDB, Athena, Chime, and more.

This isn't the first time AWS and its customers have suffered at the hands of technical glitches -- a similar event occurred just last November that impacted the very same AWS region. And while all the major cloud providers, including Microsoft and Google, have suffered similar fates at various junctures in the past, AWS is the world's largest public cloud provider, so its outages often have the farthest-reaching impact.

For several hours yesterday, services such as Disney+, Netflix, Instacart, and McDonald's were impacted, often to humorous (and somewhat inconvenient) effect, as one McDonald's visitor demonstrated.

While contingency planning for such outages might include using third-party data backup services, major cloud outages also bolster those who argue in favor of hybrid or multi-region cloud strategies -- particularly for mission-critical services. With hybrid, companies can use their own on-premises infrastructure, leaning on the public cloud only to ensure that their in-house systems don't crumble under peak traffic.

Chris Gladwin, founder and CEO of "exabyte-scale" database technology company Ocient, says that despite all the hype around cloud migration, the risks posed by major outages mean that "hybrid" will likely be the best approach for many bigger companies.

"This is not the first time AWS has experienced these issues," Gladwin said. "For mission-critical applications, we see organizations turning to on-premise and hybrid cloud deployments that ensure they have greater line-of-sight and control over their deployments, uptime, and ultimately, business results."

Service level agreements (SLAs) also play an important part in companies' cloud strategies. While any amount of downtime -- even minutes -- can cost businesses a lot of money, this needs to be balanced against the cost of using public cloud platforms. For example, a company that requires 100% uptime for their application will likely want to host their application across multiple regions, even though this will cost more -- but a company that can live with a few hours of downtime once or twice a year might want to hedge their bets and pay less for a single cloud region or zone with a 99% uptime guarantee.

"A cloud service level agreement of 99% uptime still allows almost eight hours per month of downtime," said John Pescatore, director of emerging security trends at cybersecurity training and certification company SANS Institute. "Businesses need to invest in redundant or backup capabilities, or pay for higher levels of guaranteed availability to preserve critical business services when running in the cloud."
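The arithmetic behind that figure is easy to check. The short Python sketch below converts an uptime percentage into the downtime it permits. The function name and the 30-day month are assumptions made for illustration, not terms from any provider's actual SLA; real agreements define their own measurement periods and exclusions.

```python
HOURS_PER_DAY = 24


def allowed_downtime_hours(uptime_percent: float, period_days: float = 30) -> float:
    """Return the hours of downtime an uptime guarantee permits over a period.

    Assumes a simple percentage-of-elapsed-time definition of uptime and a
    30-day month by default; both are illustrative assumptions.
    """
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_hours = period_days * HOURS_PER_DAY
    return (100 - uptime_percent) / 100 * total_hours


if __name__ == "__main__":
    # Compare a few commonly quoted uptime levels over a 30-day month.
    for sla in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(sla)
        print(f"{sla}% uptime allows about {hours:.2f} hours of downtime per month")
```

Under these assumptions, a 99% guarantee works out to a little over seven hours a month, in line with Pescatore's "almost eight hours," while each additional "nine" cuts the allowance by a factor of ten.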

Pescatore also highlighted the potential "concentration risk" that large companies face if too many parties in their supply chain use the same single cloud service provider.

"Larger businesses need to look at their suppliers and see if they are subject to concentration risk -- too high a percentage of suppliers on one cloud service, and even a short outage can be disastrous to business," he said.
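As a rough illustration of how a business might screen for this, the Python sketch below tallies what share of suppliers sits on each cloud provider and flags any provider above a chosen threshold. The supplier names, provider labels, and the 50% threshold are all made up for the example; Pescatore did not specify what percentage counts as "too high."

```python
from collections import Counter


def provider_shares(supplier_clouds: dict[str, str]) -> dict[str, float]:
    """Map each cloud provider to the fraction of suppliers that depend on it."""
    total = len(supplier_clouds)
    if total == 0:
        return {}
    counts = Counter(supplier_clouds.values())
    return {provider: count / total for provider, count in counts.items()}


def concentrated_providers(
    supplier_clouds: dict[str, str], threshold: float = 0.5
) -> list[str]:
    """List providers whose share of suppliers exceeds the threshold.

    The 0.5 default is an arbitrary illustrative cutoff, not an industry standard.
    """
    shares = provider_shares(supplier_clouds)
    return sorted(p for p, share in shares.items() if share > threshold)


if __name__ == "__main__":
    # Hypothetical supplier inventory: supplier name -> primary cloud provider.
    suppliers = {
        "payments-vendor": "cloud-a",
        "logistics-vendor": "cloud-a",
        "support-desk-vendor": "cloud-a",
        "analytics-vendor": "cloud-b",
    }
    for provider, share in sorted(provider_shares(suppliers).items()):
        print(f"{provider}: {share:.0%} of suppliers")
    print("Over threshold:", concentrated_providers(suppliers))
```

A real assessment would also weigh how critical each supplier is and which region it runs in, rather than treating every supplier equally.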
