Presented by Virtana


Enterprise IT teams are losing the battle against modern application failures. The problem isn’t a lack of monitoring tools. It’s that the tools they rely on were built for an infrastructure era that no longer exists.

IT teams have long relied on the premise that when something breaks, monitoring tools will explain why. That assumption no longer holds. Legacy application performance monitoring (APM) treats the application as the boundary of truth. It surfaces slow transactions and errors, but stops short of root cause, leaving teams to speculate.

Meanwhile, AI is multiplying the impact, scaling failure rates to a level that legacy monitoring architectures cannot absorb. Mounting observability spend is making the problem worse, not better.

As AI workloads push failure surfaces below the application layer, the gap between what monitoring tools can see and what operations teams actually need to know has become a systemic business risk that more spending on the current model will not fix. The only credible exit is a fundamental architectural shift, from siloed, application-layer tools to a unified, full-stack system of record that can correlate signals across every layer of the environment autonomously.

Research exposes systemic operational gaps

New research reveals that AI adoption is exposing systemic operational gaps even faster than most leadership teams realize. Virtana’s report, “AI Is Breaking Human-Managed Operations,” which surveyed over 350 senior IT and technology leaders, found that three-quarters of enterprises report AI job failure rates exceeding 10%, and a third are above 25%. At enterprise scale, where millions of AI decisions execute continuously, these failure rates represent systemic risk, not operational noise.

The survey also exposes a leadership disconnect: 59% of executives believe their organizations are ready for AI-scale operations, but 62% of practitioners report fragmented systems and persistent visibility deficiencies. When decision-makers and operators hold fundamentally different views of operational reality, investments flow to the wrong problems while the real issues persist: visibility gaps, alert storms, and systemic AI failures that no one can explain.

“The data is unambiguous,” says Paul Appleby, president and CEO of Virtana. “While executive confidence is rising, operational fragility is rising faster. When three-quarters of enterprises report double-digit AI job failure rates and one-third exceed 25%, the operating model is clearly outdated. At enterprise scale, these rates translate into thousands of failed executions per day, driving retries, wasted compute capacity, cascading delays, and escalating operational risk. As AI workloads expand and agentic systems begin operating autonomously, modest failure percentages compound into systemic volatility.”

Why legacy APM can’t explain modern failures

APM was designed for discrete, bounded applications running on predictable infrastructure, but that world no longer exists. Today’s applications are distributed systems of decomposed services spanning the entire enterprise, with code running on Kubernetes. They interact with databases and shared infrastructure while competing for GPU resources and supporting AI workloads across hybrid environments. The performance of a user-facing transaction is not determined by any single component; it emerges from the interaction of all of them together.

“Mission-critical applications such as airline reservation systems, payment processing systems, health care delivery systems, and emergency dispatch are no longer just code, but complex systems spanning software, services, infrastructure, and AI workloads,” explains Appleby. “At this scale and complexity, legacy APM focused on code and human-only operations is no longer a credible way to understand how applications behave.”

APM is limited to the “where,” not the “why.” It surfaces symptoms such as elevated latency, rising error rates, or dropping throughput, but it can’t explain what’s triggering them, because the root cause increasingly lives below the application layer, in infrastructure the tool was never designed to observe.

APM misses sub-application-layer root causes like storage degrading under I/O pressure from competing workloads, Kubernetes nodes starved of resources by adjacent services, and GPU contention from AI workloads cascading upward into application latency. It sees the application layer but cannot see below it, so it registers the symptom and simply stops. As a result, operations teams are left to manually correlate cause and effect across fragmented tools while the incident clock runs and the business escalates.

The research reflects this directly: 56% of practitioners cite storage and networking bottlenecks as their top AI constraint — precisely the layers traditional APM tools were never designed to observe. These are not edge cases; they are the conditions under which most enterprise AI environments now operate.

More spending on observability is producing less clarity

The intuitive response to a monitoring problem is to add more monitoring tools. Many enterprises have done exactly that, layering new observability platforms, dashboards, and alerting systems on top of existing infrastructure. But this tool-centric model was never designed to keep pace with the scale and complexity of modern enterprise environments. The result is an investment paradox: more spending, more dashboards, more alerts — and still no reliable answer when an incident fires.

The observability market will approach $14.2 billion by 2028, according to Gartner, yet fewer than half of IT leaders are confident they can operate at scale with the tools they already have. Enterprises are responding by layering new tools onto fragmented data foundations without unifying the underlying data model. Every new tool adds a new silo, new alert logic, and a new dashboard that does not talk to the other dashboards, creating less visibility and more noise.

Full-stack application observability is the path forward for AI-scale operations

When a third of enterprises see more than one in four AI jobs fail, running AI at enterprise scale becomes operationally and financially unsustainable. It quickly becomes not just an IT problem, but a balance-sheet problem: failed executions, retries, wasted compute, and cascading delays all carry real costs.

To close the gap, organizations need to rethink observability architecture, moving away from a collection of domain-specific tools toward a unified system of record. Signal volume in modern AI environments exceeds what teams can reason over in real time, so the solution is to delegate correlation and first-pass diagnosis to autonomous systems operating continuously over a unified context.

The goal is to allow operators to ask questions in plain language while AI agents interact programmatically through structured protocols, both reasoning over the same unified operational context. Capabilities like Virtana’s AI-native, system-aware application observability automatically correlate signals across code, services, infrastructure, networks, storage, and AI workloads, enabling agentic investigation and surfacing evidence-backed root cause without manual correlation across disconnected tools. The platform exposes this operational context through an MCP server compatible with major AI assistants including ChatGPT, Claude, Gemini, and Microsoft Copilot.
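For readers curious what “interacting programmatically through structured protocols” looks like in practice, here is a minimal sketch of an MCP client built with the open-source MCP Python SDK. The server command and the correlate_signals tool name are hypothetical placeholders for illustration, not Virtana’s published interface.

```python
# Minimal MCP client sketch (open-source Python SDK: `pip install mcp`).
# NOTE: the server command and tool name below are hypothetical
# placeholders, not Virtana's actual MCP server or tool schema.
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# Hypothetical launch command for a full-stack observability MCP server.
server = StdioServerParameters(command="observability-mcp-server", args=[])

async def investigate_latency_spike() -> None:
    async with stdio_client(server) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()

            # Discover the tools the server exposes to agents.
            tools = await session.list_tools()
            print([tool.name for tool in tools.tools])

            # Hypothetical tool call: request correlated full-stack signals
            # around a latency spike so the agent reasons over one context.
            result = await session.call_tool(
                "correlate_signals",  # placeholder tool name
                arguments={"service": "checkout", "window_minutes": 15},
            )
            print(result.content)

asyncio.run(investigate_latency_spike())
```

An AI assistant such as Claude or ChatGPT performs the same initialize, list-tools, and call-tool handshake behind the scenes, which is what lets a human asking a plain-language question and an autonomous agent running a diagnostic loop draw on the same operational context.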

“Modern applications are distributed systems, and performance constraints often originate in infrastructure, network, or platform layers that traditional APM was never designed to see,” says Doug Syer, chief engineer for AI monitoring and observability at NWN. “To support more than 6,000 CIOs across enterprise and public sector organizations, we need visibility across the full stack. Virtana Application Observability offers true system-level visibility, correlating signals across the full stack, enabling the immediate transition from symptoms to evidence-backed root cause.”

How to scale AI reliably in production

In the past, an operations team managing a reservation system spanning on-premises infrastructure and multiple cloud regions might spend hours in war-room escalations, manually correlating alerts across fragmented tools. With full-stack observability, teams can determine within minutes whether a latency spike originates in application code, a saturated storage path, or a Kubernetes node under I/O pressure from a competing AI workload. The same MCP interface lets operators pose those questions in natural language while autonomous agents investigate programmatically.

As enterprises reach an inflection point, graduating from AI pilots to full production, the stakes rise dramatically and failure rates scale beyond what human-managed operations can absorb. Adding more monitoring tools is clearly not the answer. Scaling AI reliably in production requires an observability platform that already understands the entire system, not just the code layer sitting on top of it.


Sponsored articles are content produced by a company that is either paying for the post or has a business relationship with VentureBeat, and they’re always clearly marked. For more information, contact sales@venturebeat.com.