ChatOps was birthed from one of the more notorious plagues afflicting the modern operations team: information overload. Ongoing digital transformation initiatives, the relentless accumulation of data and alerts, and the proliferation of operational consoles have threatened to overwhelm teams. Alert fatigue is an all-too-real and damaging phenomenon, making it increasingly difficult for operations engineers to work effectively and efficiently.
ChatOps does well in small teams
Chat — via integrable platforms like Slack and Hipchat — can serve as a single, unifying console for operations teams, acting as a clearinghouse for all other applications and collaboration tools. This is the promise of ChatOps, and it holds tremendous potential for solving information overload.
For smaller operations teams, an unstructured ChatOps model is completely viable. You can have single #major-incident channels and loop people in on a need-by-need basis with a simple @mention. Typically, you wouldn’t need to keep conversational records around single issues for compliance, and it’s relatively easy to onboard new engineers into the platform. It’s not a big deal for new users to enter a channel with limited context, because the environment isn’t too large or complex.
But while ChatOps is an ingenious premise for smaller teams, it becomes problematic at scale. Even a slick tool like Slack starts to become clumsy and noisy at the enterprise operations level and ultimately creates more inefficiencies than it solves. It might be manageable for a 10-person startup to filter various messages to separate QA or release channels, but what happens when those channels contain over 200 people apiece? The paradigm stops working; it’s no longer effective to send that same alert to the whole channel. You’d be nudging hundreds of people in an effort to grab the attention of a single employee. So we run into the original problem that ChatOps was meant to solve: too much noise and too much irrelevant information. Alert fatigue rides again.
So how do we add sophistication to ChatOps to make it viable at the enterprise level?
Scaling ChatOps is daunting
A large enterprise is likely to have multiple task and issue management platforms that operations teams update directly or that require indirect updates to keep various communities up-to-date. Operations teams might need to update incident records in platforms such as ServiceNow or Cherwell. Customer-facing teams would then need updates in their own systems, such as Zendesk or Salesforce Service Cloud, in order to communicate accurate, timely information to consumers. And applications teams would need to track items that require their direct attention in platforms like Atlassian’s JIRA.
These types of challenges are multiplied by every class of system and organization that operations teams work with.
The answer to scaling ChatOps for this type of environment lies in the very strength that gave rise to ChatOps in the first place — integrations. We can make ChatOps more intelligent through the use of bots and interactions with other applications that capture the workflow and structure necessary for scalability.
Bots and heuristics can solve this problem
Consider this example: A telecommunications provider uses Hipchat as its ChatOps focal point. Alerts from Sensu and Splunk first pass through heuristics (e.g., on-call schedule, priority, availability, skills, location) that refine and focus the targeting, and only then are they directed at specific operations engineers. So rather than contacting the entire team of 100 operations engineers, the company reaches only the engineers who need to act and drastically reduces the volume of chat messages. For a small team, this intelligence wouldn’t be necessary, but at scale, it becomes absolutely critical.
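To make the targeting step concrete, here is a minimal Python sketch of how such heuristics might narrow an alert down to one or two recipients. Everything in it is an assumption for illustration: the `Engineer` and `Alert` fields, the `pick_recipients` function, the rule that top-priority alerts also page a backup, and the sample roster. It does not describe how the provider in the example actually implemented its routing.

```python
from dataclasses import dataclass, field


@dataclass
class Engineer:
    handle: str
    on_call: bool
    available: bool
    skills: set = field(default_factory=set)
    location: str = ""


@dataclass
class Alert:
    source: str          # e.g. "sensu" or "splunk"
    priority: int        # 1 = most urgent (assumed convention)
    required_skill: str
    site: str


def pick_recipients(alert, engineers, max_recipients=1):
    """Return the handles of the best-matching engineers for an alert."""
    # Hard filters: on call, available, and holding the required skill.
    candidates = [
        e for e in engineers
        if e.on_call and e.available and alert.required_skill in e.skills
    ]
    # Soft preference: engineers at the affected site come first.
    # The sort is stable, so roster order breaks ties.
    candidates.sort(key=lambda e: e.location != alert.site)
    # Assumed policy: top-priority alerts also reach one backup engineer.
    limit = max_recipients + 1 if alert.priority == 1 else max_recipients
    return [e.handle for e in candidates[:limit]]


# Hypothetical roster and alert.
team = [
    Engineer("@ana", True, True, {"network"}, "london"),
    Engineer("@raj", True, True, {"network", "database"}, "toronto"),
    Engineer("@lee", False, True, {"network"}, "toronto"),
    Engineer("@kim", True, False, {"database"}, "toronto"),
]
alert = Alert("sensu", 2, "network", "toronto")
recipients = pick_recipients(alert, team)
print(recipients)
```

With this roster, only one engineer is mentioned in chat instead of the whole team. The hard filters and the soft ordering are kept separate so that new criteria can be added without touching the rest.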
Next, the engineer targeted by the alert can take action directly from Hipchat (or mobile app, SMS, email, phone). In this example, the specific action is the creation of a JIRA ticket containing the full details of the incident at hand. A simple “Create ticket” response from the engineer automatically triggers a series of activities, including creation, completion, and assignment of the ticket.
Performing this action directly from the chat console is not only incredibly efficient for the operations engineer but also starts a critical record-keeping process. This is necessary for a team with over 100 operations engineers — not to mention thousands of additional tech team members — who might need to be involved or refer back to the incident at some point.
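The "Create ticket" reply could be handled along the following lines. This is only a sketch: `InMemoryTracker` is a hypothetical stand-in for a real issue tracker client, not the JIRA API, and the alert fields, the `OPS` project key, and the `handle_reply` function are invented for illustration.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Ticket:
    key: str
    summary: str
    description: str
    assignee: str


class InMemoryTracker:
    """Hypothetical stand-in for a real issue tracker client."""

    def __init__(self, project="OPS"):
        self._project = project
        self._ids = itertools.count(1)
        self.tickets = {}

    def create(self, summary, description, assignee):
        key = f"{self._project}-{next(self._ids)}"
        self.tickets[key] = Ticket(key, summary, description, assignee)
        return self.tickets[key]


def handle_reply(reply, alert, engineer, tracker):
    """Turn a chat reply to an alert into a tracker action, if it is one."""
    if reply.strip().lower() != "create ticket":
        return None
    # Copy every alert field into the ticket so the record is complete.
    description = "\n".join(f"{k}: {v}" for k, v in sorted(alert.items()))
    # Creation, completion, and assignment happen in one step.
    return tracker.create(alert["summary"], description, engineer)


tracker = InMemoryTracker()
alert = {
    "summary": "Disk usage above 95% on db-07",
    "source": "sensu",
    "priority": "high",
}
ticket = handle_reply("  Create ticket ", alert, "@raj", tracker)
print(ticket.key, ticket.assignee)
```

The engineer types two words, and the ticket is populated from the alert itself, which is where the record-keeping starts.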
The next hurdle in the scaling challenge arises when the operations engineer needs to engage people from other teams to help with triage, diagnosis, and resolution. For a small team, tribal knowledge makes it easy to know exactly whom to bring in, but enterprises with thousands of engineers don’t have that luxury. In this case, the company captures organizational knowledge and makes it accessible directly from the chat platform by using two components. The first is a bot that an operations engineer can ask questions like “Who is the go-to person on the security team right now?” Replace security team with any other team, application, service, or infrastructure element, and you have a scalable mechanism for the operations engineer to determine whom they should @mention for assistance.
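A bot of this kind can be very small. The sketch below assumes a hard-coded `ON_CALL` roster and a single question pattern; both are hypothetical, and a real deployment would pull the roster from whatever scheduling system the organization uses.

```python
import re

# Hypothetical roster; in practice this would come from a scheduling system.
ON_CALL = {
    "security": "@maya",
    "network": "@ana",
    "database": "@raj",
    "payment processing": "@tom",
}

# Matches questions such as
# "Who is the go-to person on the security team right now?"
QUESTION = re.compile(
    r"who is the go-to person (?:on|for) the (?P<team>.+?)(?: team)? right now\??$",
    re.IGNORECASE,
)


def answer(message, roster=ON_CALL):
    """Reply with the current contact for the team named in the question."""
    match = QUESTION.search(message.strip())
    if not match:
        return "Sorry, I did not understand that."
    team = match.group("team").lower()
    handle = roster.get(team)
    if handle is None:
        return f"I have no on-call entry for '{team}'."
    return f"{handle} is on call for {team}."


print(answer("Who is the go-to person on the security team right now?"))
```

Because the team name is just a dictionary key, the same question works for an application, a service, or an infrastructure element once it has a roster entry.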
The second component combines the knowledge of the bot and the system that powers it with a directive. In this case, the operations engineer can issue a single command such as “I need help with ticket 43514 from network, database, and payment processing,” and this request will be automatically logged in the JIRA system of record. Then, a specific engineer from each of the necessary teams is automatically engaged and pulled into a new Hipchat channel created specifically for collaboration on the incident at hand. And closing this channel automatically adds all activities that took place to the JIRA ticket as well, ensuring that compliance and record-keeping requirements are met. This kind of ChatOps model works at scale. The operations engineer is able to act directly from the Hipchat console, automatically targeting and adding teams as necessary into specific channels, eliminating the need to mass-blast thousands of employees with irrelevant alerts.
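The single-command escalation described above might be sketched as follows. The command grammar, the `ROSTER` contents, the `incident-<ticket>` channel naming, and the dictionary standing in for the ticket log are all assumptions for illustration; no real Hipchat or JIRA calls are made.

```python
import re

# Hypothetical mapping from team name to the engineer currently on call.
ROSTER = {
    "network": "@ana",
    "database": "@raj",
    "payment processing": "@tom",
}

HELP_REQUEST = re.compile(
    r"i need help with ticket (?P<ticket>\d+) from (?P<teams>.+)$",
    re.IGNORECASE,
)


def parse_help_request(message):
    """Return (ticket_id, [team, ...]), or None if this is not a help request."""
    match = HELP_REQUEST.search(message.strip().rstrip(".,"))
    if not match:
        return None
    # Split "a, b, and c" or "a and b" into individual team names.
    parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", match.group("teams"))
    teams = [p.strip().lower() for p in parts if p.strip()]
    return match.group("ticket"), teams


def open_incident_channel(message, roster, ticket_log):
    """Log the request against the ticket and describe the channel to create."""
    parsed = parse_help_request(message)
    if parsed is None:
        return None
    ticket_id, teams = parsed
    # Engage one engineer per requested team; unknown teams are skipped.
    members = [roster[t] for t in teams if t in roster]
    ticket_log.setdefault(ticket_id, []).append(
        "Requested help from: " + ", ".join(teams)
    )
    return {"name": f"incident-{ticket_id}", "members": members}


ticket_log = {}
channel = open_incident_channel(
    "I need help with ticket 43514 from network, database, and payment processing",
    ROSTER,
    ticket_log,
)
print(channel)
```

Three people are pulled into one dedicated channel, and the request is already attached to the ticket before anyone replies.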
Tackling ChatOps at scale involves refining the communication model, using advanced analytics and reporting, taming the noise with flexible helpers such as bots, and coupling chat with tight integrations. It requires buy-in from the team, but it’s worth it and generates great results when done right!
Abbas Haider Ali is CTO at xMatters.