Presented by Elastic


We live in a digital world -- and as a result, the ability to reliably operate key software systems and services can make or break a business. 

Downtime or performance hiccups can translate into any number of harmful scenarios: lost revenue when potential customers are driven to a competitor’s website that’s just a click away, or lost productivity when employees suddenly find themselves unable to carry out their deadline-driven work. 

While it might seem like a never-ending battle for site reliability engineers (SREs) and DevOps professionals to keep critical websites and applications up and running without incident, there’s good news on this front. Generative AI -- which has made a big splash over the past year with its intuitive Q&A interface -- can supercharge traditional observability methods, creating a multiplier effect that can solve reliability, security and speed challenges more quickly and efficiently.

Just ask the AI

Monitoring and observability have traditionally been about spotting the “signal in the noise” and diagnosing the unknowns: when something unexpected happens to a system, SREs and DevOps teams must identify the issue and take the proper remediation steps to resolve it.

Generative AI can take away a lot of the toil of this process and accelerate the ability of SREs and DevOps to respond to incidents with greater speed, flexibility and confidence.

Take the example of a newly hired on-call engineer who might not have enough accumulated institutional knowledge to understand every single system within the organization and how it operates.

Now imagine they get woken up in the middle of the night by an alert that something seems irregular with system X, with which they are unfamiliar. They can have a conversation with an AI assistant to quickly get up to speed, asking “What is the purpose of this system?” or “What other systems within the organization does this one connect to?” 

In seconds, the engineer is provided with useful contextual information that has been summarized by the large language model (LLM) that underpins the generative AI.

What’s particularly noteworthy here is the engineer can “converse” with the LLM using natural language -- a.k.a., plain old English. They don’t need to understand complicated query languages or the different models vendors use to structure their data to get an answer to their question. They just ask away, the same way they would ask a more experienced colleague the next cubicle over, and instantly have the information they need to better interpret their environment and troubleshoot errors.

A wealth of collective knowledge

More than just providing useful contextual information when asked for it, generative AI can proactively summarize that context and provide it to an SRE.

For instance, an on-call engineer can receive a full summary of an issue in their Slack channel -- including all steps that have been taken to date, and who’s been involved in those steps -- even before they get woken up by an alert. Rather than having to spend valuable minutes or hours digesting what’s happened so far, they are ready to respond nearly instantaneously.

When proactively pushing out these summaries, the LLMs can even provide an overview of the playbook that was used to address this type of situation the last time it occurred. From there, the engineer simply needs to run that playbook themselves or, simpler still, instruct the LLM to go ahead and run it.
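The “playbook that was used last time” lookup can be pictured as a simple query over incident history. The sketch below is purely illustrative -- the incident records, field names and playbook names are assumptions for the example, not Elastic’s data model:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical incident history; alert and playbook names are illustrative.
@dataclass
class Incident:
    alert: str
    playbook: str
    timestamp: int  # Unix epoch seconds

HISTORY = [
    Incident("high_latency_checkout", "restart-cache-tier", 1700000000),
    Incident("disk_full_db01", "rotate-logs-and-expand-volume", 1700100000),
    Incident("high_latency_checkout", "scale-out-web-pool", 1700200000),
]

def last_playbook_for(alert: str) -> Optional[str]:
    """Return the playbook used the most recent time this alert fired."""
    matches = [i for i in HISTORY if i.alert == alert]
    if not matches:
        return None  # no prior occurrence -- nothing to suggest
    return max(matches, key=lambda i: i.timestamp).playbook

print(last_playbook_for("high_latency_checkout"))  # scale-out-web-pool
```

In practice the history would live in the observability platform and the LLM would summarize the result in natural language, but the underlying retrieval is this kind of most-recent-match lookup.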

This type of assistance effectively puts the entire knowledge base of the organization at the engineer’s disposal. That comprehensive access to knowledge and best practices helps them to quickly make the most effective decision and to efficiently nip any website or application glitches or security incidents in the bud -- regardless of proficiency or institutional experience.

Corporate giants like T-Mobile Netherlands are already taking advantage of this type of functionality, using powerful AI technology to aid their network operations team, network planning, and customer operations -- helping to ensure greater network availability and rapid fault resolution for any network issues that arise.

Today and tomorrow

While today generative AI can serve as an assistant that can helpfully explain things and provide context, as well as a co-pilot that works side-by-side with professionals and offers “an extra set of hands,” the evolution will continue. In the not-too-distant future, generative AI will go one step further and serve as an AI agent that can automate many of the responses on behalf of the engineer.

For example, if the agent has seen a specific alert enough times and has confidence that “we always run playbook X when we see a set of alerts/conditions A, B, and C,” it will run that playbook for the engineer and provide a summary and confirmation of the actions it has taken. That’s one more item -- and potentially a sleepless night -- taken off the SRE’s plate.
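The decision rule described above -- act automatically only once a playbook has resolved a given set of alerts often enough, and with enough consistency -- can be sketched as a simple frequency check. The thresholds and names (`playbook-x`, the alert sets) are assumptions for illustration, not how any particular agent is implemented:

```python
from collections import Counter

# Hypothetical history: how often each playbook resolved each alert set.
resolution_counts: Counter = Counter()

def record_resolution(alerts: frozenset, playbook: str) -> None:
    """Log that this playbook resolved an incident with this alert set."""
    resolution_counts[(alerts, playbook)] += 1

def suggest_automation(alerts: frozenset,
                       min_occurrences: int = 5,
                       min_confidence: float = 0.9):
    """Return a playbook to auto-run, or None to escalate to a human."""
    candidates = {pb: n for (a, pb), n in resolution_counts.items()
                  if a == alerts}
    total = sum(candidates.values())
    if total < min_occurrences:
        return None  # not enough history -- wake the engineer
    playbook, wins = max(candidates.items(), key=lambda kv: kv[1])
    if wins / total < min_confidence:
        return None  # no single playbook dominates -- escalate
    return playbook

# After six identical resolutions, the agent would act on its own.
for _ in range(6):
    record_resolution(frozenset({"A", "B", "C"}), "playbook-x")
print(suggest_automation(frozenset({"A", "B", "C"})))  # playbook-x
```

The key design choice is the escalation path: when history is thin or ambiguous, the agent does nothing automatically and the incident still reaches a human -- which is what keeps this kind of automation trustworthy.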

Alongside this trajectory toward automation, it’s also likely LLMs will increasingly be able to combine observability data with data from other systems within the organization, whether that’s ERP, financials or security. As that data comes together, engineers will be able to ask more sophisticated, business-critical questions of the LLM -- not just “When was the last time this alert happened, and what runbook did we use?” but “What was the revenue impact the last time an incident like this happened?” or “What was the operational impact on our supply chain?”

A real gamechanger

Observability professionals have always had powerful technology at their disposal -- but generative AI provides an innovative new tool to turbocharge their workflows. 

Crucially, generative AI doesn’t replace SREs or DevOps professionals -- it just reduces much of the toil they need to do as part of their day-to-day work so they can devote more of their time to higher-level problem-solving.

By helping them zero in on the most relevant information, gain better insights and make better decisions faster, the combination of generative AI and observability data is more than a breakthrough -- it’s a gamechanger.

Abhishek Singh is GM, Observability at Elastic.


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com