No one likes to talk about outages. They’re horrible to experience as an employee and they take a heavy toll on customer confidence and future revenue. But they do happen.
Even publicly traded tech powerhouses, such as eBay and Microsoft, which have more technical resources than you’ll ever have, fall prey to outages. And when they do, they are closed for business, much to the chagrin of their shareholders and executive teams. In fact, Aberdeen Group estimates that the average cost of downtime for businesses is $161,000 per hour.
Root causes of outages include:
- The infamous fat finger (human error)
- Gaps in knowledge about complex systems and their interdependencies
- Equipment failures, including out-of-date or incorrectly configured machines
- Hacking or other security breaches
- Poor or missing processes
- Any combination of the above
Consequences of outages include:
- Irretrievably lost revenue, such as the estimated half a million dollars that Facebook reportedly lost in a half-hour outage in June
- Lost productivity, like when Office 365 recently went down and stranded its customers without email
- Irate customers, such as the small businesses dependent on eBay and their reaction to recent “intermittent service issues”
- Outright failure of the business, like when Code Spaces suffered a crippling DDoS attack from a hacker attempting extortion, who gained access to its Amazon EC2 control panel and deleted unrecoverable customer data
It’s not so much a question of whether an outage will occur in your company but when. The secret to surviving outages is to get better at handling them and to learn from the mistakes of others. Nobody is perfect all the time (LogicMonitor included), but we hope that by talking about these mistakes, we can all begin the hard work required to avoid them in the future.
The top 10 mistakes companies make handling outages:
1. Not having a tried-and-true outage response plan
Does this sound familiar?
An outage occurs. A barrage of emails is fired to the Tech Ops team from Customer Support. Executives begin demanding updates every five minutes. Tech team members all run to their separate monitoring tools to see what data they can dredge up, often only seeing a part of the problem. Mass confusion ensues as groups point their fingers at each other and Sys Admins are unsure whether to respond to the text from their boss demanding an update or to continue to troubleshoot and apply a possible fix. Marketing (“We’re getting trashed on social media! We need to send a mass email and do a blog post telling people what is happening!”) and Legal (“Don’t admit liability!”) jump in to help craft a public-facing response. Cats begin mating with dogs and the world explodes.
OK, that last part may not happen. But if the rest sounds familiar, your company might be making Mistake #1.
How to Avoid:
A well-formed process for handling outages must define who is accountable for resolving issues, who is in the escalation path and who is responsible for communicating about issues. It should also include a post-mortem process for analyzing the root cause of the outage and addressing any gaps. Fixes can range from building redundancy into systems to changing monitoring settings so that issues are caught and resolved before another outage occurs.
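One way to keep such a plan from living only in people’s heads is to write it down as data. The sketch below is a minimal illustration, not a prescribed format: the class names, role names and wait times are all hypothetical and would need to be replaced with your own.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EscalationStep:
    contact: str        # hypothetical role name, not a real person or system
    wait_minutes: int   # minutes after the outage starts before this contact is paged


@dataclass
class OutageResponsePlan:
    incident_owner: str   # accountable for resolving the issue
    communicator: str     # responsible for internal and external updates
    escalation_path: List[EscalationStep] = field(default_factory=list)

    def contacts_to_page(self, minutes_elapsed: int) -> List[str]:
        """Return everyone who should have been paged by now, in path order."""
        return [
            step.contact
            for step in self.escalation_path
            if step.wait_minutes <= minutes_elapsed
        ]


# Example plan; every value here is an assumption made for illustration.
plan = OutageResponsePlan(
    incident_owner="on-call sysadmin",
    communicator="support lead",
    escalation_path=[
        EscalationStep("on-call sysadmin", 0),
        EscalationStep("tech ops manager", 15),
        EscalationStep("vp engineering", 45),
    ],
)

print(plan.contacts_to_page(20))
```

Twenty minutes into an outage, this plan says the on-call sysadmin and the tech ops manager should both have been paged, and nobody has to guess who sends the customer update.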
2. Lack of communication about the outage with impacted customers
In the heat of trying to get your company back online, it’s easy to “go dark.” Unfortunately, not communicating with customers often causes a host of negative consequences, including a flood of support calls, longer hold times and a poor customer experience. It can also create the perception that your company is unresponsive, untrustworthy or not in control.
The fault often lies in poor or missing lines of communication between customer-facing groups and your Tech Ops team. Not having systems (blogs, forums, mass email, RSS feeds, etc.) with which to notify customers of issues can be a big problem. Other companies stay silent in the mistaken belief that customers might not notice the issue (customers will notice) and that damage will somehow be minimized (lack of communication only makes it worse).
How to Avoid:
Ensure you have a defined communication process in place with clearly assigned responsibilities for both internal and external communication during and after the outage. Make sure everyone involved is familiar with it. Don’t just store it on your company’s website, because that may not be accessible during the outage.
3. Playing the blame game
Blaming a partner or vendor is a tactic companies sometimes employ in responding to outages. It rarely proves successful, because customers see it as abdicating responsibility for a decision the company ultimately made. (Who chose to depend on that vendor or partner? You did.) By not accepting responsibility, the company is also not taking steps to prevent recurrence of the problem, which is unlikely to be a crowd pleaser.
How to Avoid:
Taking broader responsibility is a better option than playing the blame game. That can mean instituting a review of the vendors involved, setting up redundancy or reviewing the processes that might have contributed to the issue. Ensure post-mortems are blameless and get to the root cause of the failing process by using the 5 Whys.
4. Not knowing they are having an outage in the first place
The worst way to hear about an outage is to have your customers tell you (or possibly your boss). Hosting your monitoring infrastructure in the same datacenter it monitors is an excellent way to have outages that you never get an alert about, because the monitoring is offline too. That is true even if your datacenter is Amazon, which is what happened to Loggly during an extended outage a few years ago.
The best way to hear about an outage is to get an alert from a unified SaaS-based platform that tells you if your whole datacenter is down. Your monitoring platform should provide a complete view of websites (including performing synthetic transaction checks), applications, databases, network, servers, virtualization and the Cloud (wherever your IT infrastructure is housed), so that you can proactively fix issues before customer experience is impacted.
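To make the idea of a synthetic check concrete, here is a rough sketch of one that could run from a machine outside the datacenter it watches. It is an illustration only, not how any particular monitoring product works. The function names, the URL, the expected text and the thresholds are all assumptions.

```python
import time
import urllib.request
from typing import Callable, Tuple


def http_fetch(url: str, timeout: float) -> Tuple[int, str]:
    """Fetch a URL and return (status code, body text)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def synthetic_check(
    url: str,
    expected_text: str,
    timeout: float = 5.0,
    max_seconds: float = 2.0,
    fetch: Callable[[str, float], Tuple[int, str]] = http_fetch,
) -> Tuple[bool, str]:
    """Return (healthy, reason) for a single page check."""
    start = time.monotonic()
    try:
        status, body = fetch(url, timeout)
    except Exception as exc:  # DNS failure, refused connection, timeout, HTTP error
        return False, f"unreachable: {exc}"
    elapsed = time.monotonic() - start

    if status != 200:
        return False, f"unexpected status {status}"
    if expected_text not in body:
        return False, "expected text missing"
    if elapsed > max_seconds:
        return False, f"slow response: {elapsed:.2f}s"
    return True, "ok"


# Offline demonstration with a stand-in fetch function, so no network is needed.
def fake_fetch(url: str, timeout: float) -> Tuple[int, str]:
    return 200, "<html>Sign in</html>"


# The URL and the "Sign in" marker are hypothetical placeholders.
print(synthetic_check("https://example.com/login", "Sign in", fetch=fake_fetch))
```

Dropping the `fetch` argument makes the check use a real HTTP request. The important design choice is where it runs: if the check shares a datacenter with the site, both go dark together and nobody gets the alert.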