PagerDuty expands incident response capabilities to build user trust and loyalty

Staffing shortages, distributed teams that have had minimal collaboration, high-stakes “interrupt work” disrupting IT workflows, rising tech costs prompting consolidation.

This set of “colliding macro issues” demands an elevated level of incident response.

As Sean Scott, chief product development officer at PagerDuty, put it, organizations must move beyond the idea of "incident response" to a more comprehensive understanding of "incident management."

“Incident response used to be all about ‘how quickly can we get back up’ when your digital operations are disrupted, but today it is much deeper than that,” he said.

For this reason, PagerDuty today announced enhancements to PagerDuty Operations Cloud to help expand capabilities around incident workflows.

“Consumer expectations are higher than ever: Seconds of latency can be the difference between building loyalty and losing a customer,” said Scott. “Incident management is about both reducing the risk of that outcome and keeping teams focused on rewarding work like strategic innovation, not firefighting— and especially not at 3 a.m.”

Bigger mistakes, increasing demand

Considering that the average cost of a data breach is now $4.35 million, the global incident and emergency management market continues to grow — by one estimate, it will total nearly $172 billion by 2026.

According to KPMG, the top cyber incident response mistakes include:

Also, data pertinent to incidents isn't readily available, the firm says, and incident response teams lack authority and visibility. Users, meanwhile, are often unclear about their role in the organization’s security posture.

Furthermore, “there is no ‘intelligence’ in the threat intelligence provided to incident responders,” reports the firm.

Thus, it's important to integrate technology including AIOps, automation and tools for site reliability engineering (SRE), said Scott. “Incident management goes into service levels that may be difficult to untangle," he said.

Automating response, standardizing runbooks

For instance, a shopping cart is slow, or there is a partial outage because service APIs in a specific region are down, he said. This requires a platform that identifies operations that aren’t functioning as intended and, once the root cause is pinpointed, routes an alert to the best person to resolve it.

Businesses should audit telemetry (that is, how they are monitoring/ingesting signals from their digital systems), and determine a threshold for alerting the best on-call expert (who can ideally resolve the problem themselves).
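As a rough illustration of what an alerting threshold can look like, the sketch below pages only after several consecutive latency samples exceed a limit. The function name, the 800 ms limit and the three-sample rule are assumptions made for this example; they are not PagerDuty settings or anything Scott described.

```python
from typing import Sequence


def should_page(latencies_ms: Sequence[float],
                threshold_ms: float,
                min_consecutive: int = 3) -> bool:
    """Return True once `min_consecutive` samples in a row exceed the threshold.

    Requiring a streak (an assumed policy) avoids paging on a single spike.
    """
    streak = 0
    for value in latencies_ms:
        # Extend the streak on a breach, reset it on a healthy sample.
        streak = streak + 1 if value > threshold_ms else 0
        if streak >= min_consecutive:
            return True
    return False


# Hypothetical checkout latencies in milliseconds.
sustained = [120, 900, 950, 980]
spiky = [900, 110, 920, 105]

print(should_page(sustained, threshold_ms=800))  # three breaches in a row
print(should_page(spiky, threshold_ms=800))      # isolated spikes only
```

Tuning the threshold and streak length is the kind of decision a telemetry audit would inform: too low and responders are paged for noise, too high and a real slowdown goes unnoticed.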

Organizations often have many different processes for different types of interruptions, and each use case may have different remediation "runbooks," said Scott. These should be audited and standardized so that responders aren’t “hunting for a checklist on a wiki when a high-severity incident occurs,” he said.

With automatic telemetry and diagnostics, response plays can become more sophisticated (and further automated). This could potentially enable organizations to remediate an issue before needing to alert on-call experts, he said. Just those few critical moments can mean preserving customers and saving money.

“As businesses are increasing their digital maturity and enhancing incident response, they shouldn’t think of automation as this big, scary, all-or-nothing choice,” said Scott. “Get teams comfortable with it; little automations can move you closer, step-by-step, from human speed to machine speed.”

PagerDuty prioritizing action

PagerDuty’s new Incident Workflows feature allows teams to configure response workflows for different types of incidents based on various triggers, such as changes in urgency, status and priority. It also provides a list of incident actions.

For example, an event in digital infrastructure comes in for a critical extract, transform, load (ETL) job failure. An on-call responder is then notified and goes to work to diagnose and remediate that issue rated with “moderate” severity.

But then, a second event comes in: A mobile app is down for the Northwest region. This is “obviously a much bigger issue than the ETL issue, and should be prioritized as such,” said Scott.

Additionally, users can automatically alert customer support and public relations teams so that they can be more proactive and deflect additional customer feedback to the mobile team. Slack channels and Zoom bridges can also be created automatically, and an automated diagnostic can be run to gather information or telemetry.
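The trigger-and-action pattern in this example can be sketched in a few lines. This is an illustrative model only, not PagerDuty's API: the `Incident` and `Workflow` classes, the "P1" and "P3" priority labels and the three action functions are all hypothetical names chosen for the sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Incident:
    title: str
    priority: str  # assumed labels, "P1" being the most severe


@dataclass
class Workflow:
    name: str
    trigger: Callable[[Incident], bool]
    actions: List[Callable[[Incident], str]] = field(default_factory=list)

    def run(self, incident: Incident) -> List[str]:
        """Run every action if the trigger matches; otherwise do nothing."""
        if not self.trigger(incident):
            return []
        return [action(incident) for action in self.actions]


# Placeholder actions: each just describes what a real integration would do.
def notify_support(incident: Incident) -> str:
    return f"notify customer support: {incident.title}"


def open_chat_channel(incident: Incident) -> str:
    return f"open chat channel: {incident.title}"


def run_diagnostics(incident: Incident) -> str:
    return f"run diagnostics: {incident.title}"


major_incident = Workflow(
    name="major-incident",
    trigger=lambda inc: inc.priority == "P1",
    actions=[notify_support, open_chat_channel, run_diagnostics],
)

etl_failure = Incident("ETL job failure", priority="P3")
app_outage = Incident("Mobile app down (Northwest)", priority="P1")

print(major_incident.run(etl_failure))  # trigger does not match
print(major_incident.run(app_outage))   # all three actions fire
```

Keeping the trigger separate from the actions means the same set of actions can be reused for other conditions, such as a change in urgency or status.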

A new PagerDuty Status Page allows users to communicate real-time operational updates to specific cohorts of customers. This can be fully automated or keep humans in the loop for approval, said Scott. For instance, a communications team can approve a customer/stakeholder-facing update before it is made public, while internal status pages can automatically alert the organization behind a firewall.

Incident Workflows will move to early availability on November 15, and PagerDuty Status Page will follow on November 29.

Tailoring alerts

Meanwhile, flexible time windows for intelligent alert grouping let users tailor alerts and reduce noise. Furthermore, PagerDuty’s machine learning engine calculates and recommends ideal time windows for a specific service, said Scott.

He reported that a sample from PagerDuty’s early access program shows that teams using the feature see a 10% to 45% increase in average compression rate on their noisiest services within weeks.
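To make the effect of a time window concrete, here is a minimal sketch of time-based grouping. It assumes a simple rule (an alert joins the current group if it arrives within the window of the previous alert) and defines compression rate as the share of alerts absorbed into existing groups. Both are assumptions for illustration; PagerDuty's grouping logic and its exact compression metric may differ.

```python
from typing import List, Sequence


def group_alerts(timestamps: Sequence[float], window_s: float) -> List[List[float]]:
    """Group alert timestamps (seconds) that fall within `window_s` of the previous alert."""
    groups: List[List[float]] = []
    for ts in sorted(timestamps):
        if groups and ts - groups[-1][-1] <= window_s:
            groups[-1].append(ts)  # close enough: same incident
        else:
            groups.append([ts])    # gap too large: start a new incident
    return groups


def compression_rate(alert_count: int, group_count: int) -> float:
    """Assumed definition: fraction of alerts that did not open a new incident."""
    if alert_count == 0:
        return 0.0
    return 1 - group_count / alert_count


# Hypothetical alert arrival times for one noisy service, in seconds.
alerts = [0, 30, 70, 400, 420, 900]

for window in (20, 60):
    groups = group_alerts(alerts, window_s=window)
    rate = compression_rate(len(alerts), len(groups))
    print(f"window={window}s -> {len(groups)} incidents, compression {rate:.0%}")
```

With these sample timestamps, widening the window from 20 to 60 seconds cuts the number of incidents from five to three. Too wide a window, however, risks merging unrelated problems, which is why a per-service recommendation is useful.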

Flexible time windows are currently in early availability, and will move to general availability in late November.

Finally, a new custom fields on incidents feature provides more contextual information on the issue and the ability to view and access that information from any surface. The feature will first become available in early 2023.

Scott said that the company’s existing PagerDuty Digital Operations Maturity Curve model enables customers to identify where their digital operations fall (from manual/reactive to proactive and predictive). The company also continues to share lessons and best practices from its own incident response experience.

“Regardless of how we label it, incident response/incident management is about preserving a seamless customer experience, and maintaining the trust and loyalty of customers,” said Scott. “This ultimately translates to protecting and growing revenue.”
