Affecting more than 3.5 billion people globally and disrupting what has become one of the world’s primary communications and business platforms, the five-hour-plus disappearance of Facebook and its family of apps on Oct. 4 was a technology outage for the ages.
Then, this past Friday afternoon, Facebook again acknowledged that some users were unable to access its platforms.
These back-to-back incidents, kicked off by a series of human and technology miscues, were not only a reminder of how dependent we’ve become on Facebook, Instagram, Messenger, and WhatsApp but have also raised the question: If such a misfortune can befall the most widely used social media platform, is any site or app safe?
The uncomfortable answer is no. Outages of varying scope and duration were a fact of life before last week, and they will be after. Technology breaks, people make mistakes, stuff happens.
The right question for every company has always been and remains not whether an outage could occur — of course it could — but what can be done to reduce the risk, duration, and impact.
We watched the episodes unfold from the perspective of industry insiders who have spent years managing outages. The Oct. 4 incident alone cost Facebook between $60 million and $100 million in advertising, according to various estimates.
One of us (Anurag) was a vice president at Amazon Web Services for more than seven years and is currently the founder and CEO of a company that specializes in website and app performance. The other (Niall) spent three years as the global head of site reliability engineering (SRE) for Microsoft Azure and 11 years before that in the same specialty at Google. Together, we’ve lived through countless outages at tech giants.
In assorted ways, these outages should serve as a wake-up call for organizations to look within and make sure they have created the right technical and cultural atmosphere to prevent or mitigate a Facebook-like disaster. Four key steps they should take:
1. Acknowledge human error as a given and aim to compensate for it
It’s remarkable how often IT debacles begin with a typo.
According to an explanation by Facebook infrastructure vice president Santosh Janardhan, engineers were performing routine network maintenance when “a command was issued with the intention to assess the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally.”
This is reminiscent of an Amazon Web Services (AWS) outage in February 2017 that incapacitated a slew of websites for several hours. The company said one of its employees was debugging an issue with the billing system and accidentally took more servers offline than intended, which led to cascading failure of yet more systems. Human error contributed to a previous large AWS outage in April 2011.
Companies mustn’t pretend that if they just try harder, they can stop humans from making mistakes. The reality is that if you have hundreds of people manually keying in thousands of commands every day, it is only a matter of time before someone makes a disastrous flub. Instead, companies need to investigate why a seemingly small slip-up in a command line can do such widespread damage.
The underlying software should naturally limit the blast radius of any individual command, in effect acting as a circuit breaker that caps the number of elements a single command can affect. Facebook had such a control, according to Janardhan, “but a bug in that audit tool prevented it from properly stopping the command.” The lesson: Companies need to be diligent in checking that such capabilities are working as intended.
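To make the idea concrete, here is a minimal sketch of a blast-radius guard. It is illustrative only and is not how Facebook's audit tool works; the names `guarded_apply`, `BlastRadiusExceeded`, and `max_targets` are our own assumptions. The point it shows is that the size check runs before anything is touched, so a rejected command has no side effects.

```python
class BlastRadiusExceeded(Exception):
    """Raised when a command would touch more targets than the guard allows."""


def guarded_apply(targets, action, max_targets):
    """Run `action` on every target, but only if the command is small enough.

    `max_targets` is a hypothetical per-command cap chosen by the operator.
    The check happens before any target is touched.
    """
    targets = list(targets)
    if len(targets) > max_targets:
        raise BlastRadiusExceeded(
            f"command touches {len(targets)} elements; limit is {max_targets}"
        )
    return [action(target) for target in targets]
```

A command that targets two elements with a cap of three goes through; one that targets four is refused outright. As the Facebook incident shows, the guard itself needs regular testing, for example by deliberately submitting an oversized command in a staging environment and confirming it is rejected.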
In addition, organizations should look to automation technologies to reduce the number of repetitive, often tedious manual processes where so many gaffes occur. Automations need circuit breakers too, to keep repairs from spiraling out of control and causing yet more problems. Slack’s outage in January 2021 shows how automations can also cause cascading failures.
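One simple form of circuit breaker for automation is a repair budget: the automation may take only so many actions in a given time window before it must stop and hand off to a human. The sketch below is an assumption-laden illustration, not a description of any vendor's product; the class name `RepairBudget` and its parameters are hypothetical.

```python
import time
from collections import deque


class RepairBudget:
    """Allow at most `max_actions` automated repairs per `window_seconds`."""

    def __init__(self, max_actions, window_seconds, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the logic can be tested without waiting
        self._timestamps = deque()

    def try_acquire(self):
        """Return True and record the action if budget remains, else False."""
        now = self.clock()
        # Forget actions that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_actions:
            return False  # breaker open: stop and escalate to a person
        self._timestamps.append(now)
        return True
```

An automation would call `try_acquire()` before each restart or failover and page an engineer when it returns `False`, rather than continuing to act on a system it may be making worse.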
2. Conduct blameless post-mortems
Facebook’s Mark Zuckerberg wrote on Oct. 5, “We’ve spent the past 24 hours debriefing on how we can strengthen our systems against this kind of failure.” That’s important, but it also raises a critical point: Companies that suffer an outage should never point fingers at individuals but rather consider the bigger picture of what systems and processes could have thwarted it.
As Jeff Bezos once said, “Good intentions don’t work. Mechanisms do.” What he meant is that trying or working harder doesn’t solve problems; you need to fix the underlying system. It’s the same here. No one gets up in the morning intending to make a mistake. Mistakes simply happen. Thus, companies should focus on the technical and organizational means to reduce errors. The conversation should go: “We’ve already paid for this outage. What benefit can we get from that expenditure?”
3. Avoid the “deadly embrace”
The deadly embrace describes the deadlock that occurs when too many systems in a network are mutually dependent — in other words, when one breaks, the other also fails.
This was a major factor in Facebook’s outages. That single erroneous command sparked a domino effect that shut down the backbone connecting all of Facebook’s data centers globally.
Furthermore, a problem with Facebook’s DNS servers (DNS, short for Domain Name System, translates human-readable hostnames to numeric IP addresses) “broke many of the internal tools we’d normally use to investigate and resolve outages like this,” Janardhan wrote.
There’s a good lesson here: Maintain a deep understanding of dependencies in a network so you’re not caught flat-footed if trouble begins. And have redundancies and fallbacks in place so that efforts to resolve an outage can proceed quickly. The thinking should be similar to how, if a natural disaster takes down first responders’ modern communication systems, they can still turn to older technologies like ham radio channels to do their jobs.
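Understanding dependencies can start with something as basic as writing them down and checking for loops. The sketch below is a minimal example of that check using a depth-first search; the function name `find_cycle` and the system names in the usage note are hypothetical and do not describe Facebook's actual architecture.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of names, or None if there is none.

    `deps` maps each system to the systems it needs in order to work.
    """
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in deps.get(node, ()):
            dep_state = state.get(dep, UNVISITED)
            if dep_state == IN_PROGRESS:
                # We reached a system that is already on the current path.
                return path[path.index(dep):] + [dep]
            if dep_state == UNVISITED:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in deps:
        if state.get(node, UNVISITED) == UNVISITED:
            found = visit(node)
            if found:
                return found
    return None
```

Given an illustrative map in which repair tools need DNS, DNS needs the backbone, and restoring the backbone needs the repair tools, the function returns that loop. Any loop it reports is a place to add an independent fallback, the equivalent of the ham radio.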
4. Favor decentralized IT architectures
It may have surprised many tech industry insiders to discover how remarkably monolithic Facebook has been in its IT approach. For whatever reason, the company has wanted to manage its network in a highly centralized manner. But this strategy made the outages worse than they should have been.
For example, it was probably a misstep for Facebook to put its DNS servers entirely within its own network, rather than deploying some in the cloud via an external DNS provider that could be reached when the internal ones couldn’t.
Another issue was Facebook’s use of a “global control plane,” that is, a single management point for all of the company’s resources worldwide. With a more decentralized, regional control plane, the apps might have gone offline in one part of the world, say America, but continued working in Europe and Asia. By comparison, AWS and Microsoft Azure use regional control planes, and Google has moved partway toward that design.
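The regional idea can be sketched as a rollout that changes one region at a time and halts at the first sign of trouble, so untouched regions keep serving traffic. This is a simplified illustration under our own assumptions, not the design used by any of the providers named above; `rollout_by_region`, `apply_change`, and `is_healthy` are hypothetical names.

```python
def rollout_by_region(regions, apply_change, is_healthy):
    """Apply a change one region at a time, halting at the first unhealthy region.

    Returns (completed, failed): the regions changed successfully, and the
    region that failed its health check (or None if all passed).
    """
    completed = []
    for region in regions:
        apply_change(region)
        if not is_healthy(region):
            # Stop here: later regions are never touched.
            return completed, region
        completed.append(region)
    return completed, None
```

If the change breaks the first region, the function stops there and the remaining regions never receive it, which is the containment a single global control plane cannot offer.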
Facebook may have suffered the mother of all outages — and back to back at that — but both episodes have provided valuable lessons for other companies to avoid the same fate. These four steps are a great start.
Anurag Gupta is founder and CEO at Shoreline.io, an incident automation company. He was previously Vice President at AWS and VP of Engineering at Oracle.
Niall Murphy is a member of Shoreline.io’s advisory board. He was previously Global Head of Azure SRE at Microsoft and head of the Ads Site Reliability Engineering team at Google Ireland.