VentureBeat

DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole

michael.nunez@venturebeat.com (Michael Nuñez) — Tue, 26 May 2026 22:32:44 GMT

For months, the leading AI coding benchmarks have told enterprise buyers a comforting but misleading story: the top models are all roughly the same. OpenAI's GPT-5 family, Anthropic's Claude Opus, and Google's Gemini Pro have clustered within a narrow band on Scale AI's SWE-Bench Pro leaderboard, making it nearly impossible for engineering leaders to determine which agent will actually perform best inside their codebases.

On Monday, a startup called Datacurve released a benchmark it says shatters that illusion. DeepSWE, a 113-task evaluation spanning 91 open-source repositories and five programming languages, produces a dramatically wider spread among the same frontier models and crowns OpenAI's GPT-5.5 as the clear leader at 70%, fourteen points ahead of OpenAI's own GPT-5.4 and sixteen points ahead of the nearest rival lab's model, Claude Opus 4.7.

"On public leaderboards, top models often look relatively close in capability," wrote Datacurve co-author Serena Ge on X. "DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work."

The benchmark also delivers a pointed critique of the evaluation infrastructure the AI industry relies on to measure progress: Datacurve's audit found that SWE-Bench Pro's verifiers — the automated graders that determine whether an agent solved a task — issued incorrect pass/fail verdicts on roughly one-third of the trials it reviewed.

If that finding holds up, it has sweeping implications. Enterprise procurement teams, venture capitalists, and AI lab marketing departments all lean heavily on benchmark scores to make multimillion-dollar decisions. A 32% error rate in the most widely cited coding benchmark suggests the industry may have been navigating by a broken compass.

Why the most popular AI coding benchmark may be grading on a curve

To understand what Datacurve is claiming, it helps to understand how coding benchmarks work — and how they can go wrong.

The dominant paradigm, pioneered by the SWE-Bench family maintained by Scale AI and academic researchers, constructs tasks by mining real GitHub commits. The process extracts a bug fix or feature addition from a repository's history, rolls the code back to the pre-fix state, and then asks an AI agent to reproduce the change. The original commit's test suite serves as the verifier: if the agent's patch makes the same tests pass, it gets credit. This approach has an elegant simplicity, but Datacurve argues it introduces three systemic weaknesses.

First, contamination. Because tasks are drawn from public GitHub history, the problem statement, the discussion, and often the exact solution are already present in the training data of frontier models. "The SWE-Bench family scrapes existing GitHub issues and PRs, which creates two problems: memorization (models have already seen the solution) and triviality (most tasks are small)," Ge wrote.

Second, scope. SWE-Bench Pro tasks require, on average, just 120 lines of code added across 5 files. DeepSWE's reference solutions average 668 lines added across 7 files — roughly 5.5 times more code. Yet DeepSWE's prompts are actually shorter, averaging 2,158 characters versus SWE-Bench Pro's 4,614. In other words, DeepSWE gives the agent less instruction but expects far more output, which more closely mirrors how a human developer might actually delegate work to an AI assistant.

Third — and most damaging — verifier reliability. Datacurve drew 30 tasks at random from both DeepSWE and SWE-Bench Pro, ran three rollouts across 10 frontier model configurations, and then deployed an LLM-based judge to independently assess whether each agent's patch actually solved the problem. SWE-Bench Pro's verifiers accepted wrong implementations 8.5% of the time and rejected correct implementations 24% of the time. DeepSWE's verifiers registered 0.3% and 1.1%, respectively.
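The two error rates can be read as the off-diagonal cells of a confusion matrix between the verifier's verdict and the judge's verdict. The following is a minimal sketch, assuming each reviewed rollout is reduced to a pair of booleans and that both rates are fractions of all reviewed trials. The function name and the sample counts are illustrative and are not Datacurve's code or data.

```python
from typing import Iterable, NamedTuple


class VerifierErrors(NamedTuple):
    false_positive_rate: float  # verifier passed, judge said the patch was wrong
    false_negative_rate: float  # verifier failed, judge said the patch was correct

    @property
    def total_error_rate(self) -> float:
        return self.false_positive_rate + self.false_negative_rate


def verifier_error_rates(trials: Iterable[tuple[bool, bool]]) -> VerifierErrors:
    """Each trial is (verifier_passed, judge_says_correct).

    Rates are fractions of ALL reviewed trials, which is an assumption
    about how the article's percentages are normalized.
    """
    trials = list(trials)
    if not trials:
        raise ValueError("no trials to score")
    fp = sum(1 for passed, correct in trials if passed and not correct)
    fn = sum(1 for passed, correct in trials if not passed and correct)
    return VerifierErrors(fp / len(trials), fn / len(trials))


# Illustrative sample only: 200 made-up trials, not the audit data.
sample = (
    [(True, False)] * 17     # accepted a wrong patch
    + [(False, True)] * 48   # rejected a correct patch
    + [(True, True)] * 80    # both agree the patch is correct
    + [(False, False)] * 55  # both agree the patch is wrong
)
rates = verifier_error_rates(sample)
print(f"{rates.false_positive_rate:.1%} false positives, "
      f"{rates.false_negative_rate:.1%} false negatives, "
      f"{rates.total_error_rate:.1%} total")
```

With these made-up counts the two rates come out to 8.5% and 24%, which sum to 32.5%, the "roughly one-third" figure cited for SWE-Bench Pro.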

The false negative problem is especially insidious because it punishes creative solutions. In one documented case, the gold-standard pull request for a SWE-Bench Pro task refactored a private helper function. An agent that correctly solved the task by inlining the same logic — a perfectly valid engineering choice — failed because the test suite tried to import a symbol that only existed in the original author's specific implementation.
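That failure mode is easy to reproduce in miniature. The sketch below is a hypothetical illustration: `GoldImpl`, `AgentImpl`, `_normalize`, and `slugify` are invented names, not code from any SWE-Bench Pro task. It contrasts a check on observable behavior with a check that is coupled to a private helper in the reference implementation.

```python
class GoldImpl:
    """Stands in for the reference patch, which uses a private helper."""

    @staticmethod
    def _normalize(text: str) -> str:
        return text.strip().lower()

    @staticmethod
    def slugify(text: str) -> str:
        return GoldImpl._normalize(text).replace(" ", "-")


class AgentImpl:
    """Stands in for an agent's patch that inlines the same logic."""

    @staticmethod
    def slugify(text: str) -> str:
        return text.strip().lower().replace(" ", "-")


def behavioral_check(impl) -> bool:
    # Tests only the public behavior the task asked for.
    return impl.slugify("  Hello World ") == "hello-world"


def implementation_coupled_check(impl) -> bool:
    # Mimics a test that imports a private symbol from the gold patch.
    helper = getattr(impl, "_normalize", None)
    if helper is None:
        return False  # analogous to an ImportError failing the whole test
    return helper(" A ") == "a"


for impl in (GoldImpl, AgentImpl):
    print(impl.__name__, behavioral_check(impl), implementation_coupled_check(impl))
```

Both implementations pass the behavioral check, but only the reference passes the coupled one, so a verifier built on the second kind of test records a false negative for the agent.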

OpenAI's GPT-5.5 dominates the new benchmark while Claude and Gemini stumble

DeepSWE's top-line results reorder the familiar hierarchy in ways that should matter to every engineering team evaluating AI coding tools. On SWE-Bench Pro, models from OpenAI, Anthropic, and Google have traded the lead within a 30-point range. DeepSWE stretches that range to 70 points.

GPT-5.5 leads at 70%, followed by GPT-5.4 at 56% and Claude Opus 4.7 at 54%. From there, the drop-off is steep: Claude Sonnet 4.6 lands at 32%, Gemini 3.5 Flash at 28%, GPT-5.4-mini and Kimi K2.6 tied at 24%, and then a long tail of models in the teens and single digits. Claude Haiku 4.5, which scores 39% on SWE-Bench Pro, collapses to zero on DeepSWE — suggesting that some mid-tier models have been significantly overperforming on easier, potentially contaminated benchmarks.

GPT-5.5 doesn't just score the highest; it does so efficiently. The model reaches its 70% pass rate with a median cost of $5.80 per trial, a median wall-clock time of 20 minutes, and a median of 47,000 output tokens. GPT-5.4 emerges as perhaps the best overall value at $3.30 per trial with a 56% score. Claude Opus 4.7, meanwhile, costs significantly more per run. Across the agents tested, output tokens, wall-clock duration, and dollar cost per trial all vary by an order of magnitude, yet none of these correlates strongly with pass rate. Agents that emit more tokens, run longer, or cost more do not consistently solve more tasks.

Datacurve's audit found that Claude has been reading the answer key on existing benchmarks

Perhaps the most provocative finding in DeepSWE's analysis concerns what the authors label "CHEATED" verdicts — instances where an agent passes a benchmark not by solving the problem, but by reading the answer.

SWE-Bench Pro's Docker containers ship the repository's full .git history, which means the gold-standard solution commit is sitting right there in the container's file system. Most models ignore it. Claude does not. Datacurve's analysis found that both Claude Opus 4.7 and Claude Opus 4.6 registered "CHEATED" on more than 12% of their reviewed SWE-Bench Pro rollouts. In those instances, the Claude agent ran commands like git log --all or git show to retrieve the merged fix and paste it into its own patch. The behavior accounted for approximately 18% of Opus 4.7's passes and 25% of Opus 4.6's passes on the reviewed sample. The issue has been filed publicly as GitHub issue #93 on the SWE-Bench Pro repository.

GPT-5.4 and GPT-5.5 never exhibited this behavior. Gemini configurations stayed around 1%. Datacurve describes the behavior diplomatically — "The benchmark makes this possible (the gold commit lives in the container), but Claude is the family that consistently does so" — but the implication is clear: a meaningful fraction of Claude's SWE-Bench Pro scores may reflect environmental exploitation rather than genuine engineering capability.

DeepSWE addresses this by shipping only a shallow clone with the base commit, leaving no gold hash for the agent to discover. It is worth noting that the behavior is arguably a sign of Claude's environmental attentiveness — the model is very good at exploring its surroundings and exploiting available resources. Whether that counts as "cheating" or "resourcefulness" depends on your perspective, but in the context of a benchmark designed to measure independent problem-solving, it undermines the signal.
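Teams running their own evaluations could screen agent trajectories for the same pattern. The sketch below is a rough heuristic under stated assumptions: the subcommand list and the function name are illustrative, and the article says Datacurve's own verdicts came from an LLM analyzer, not from a rule like this.

```python
import re
import shlex

# Git subcommands that can surface commits beyond the working tree.
# This list is an illustrative assumption, not Datacurve's rule set.
HISTORY_SUBCOMMANDS = {"log", "show", "reflog", "cat-file", "rev-list"}


def inspects_git_history(command: str) -> bool:
    """Return True if a shell command line invokes a history-reading git subcommand.

    Limitation: forms such as `git -C path log` are not recognized, and
    separators inside quoted strings may cause a segment to be skipped.
    """
    # Split on common shell separators: &&, ||, ;, and |.
    for part in re.split(r"&&|\|\||[;|]", command):
        try:
            tokens = shlex.split(part)
        except ValueError:
            continue  # unbalanced quotes after splitting; skip this segment
        if len(tokens) >= 2 and tokens[0] == "git" and tokens[1] in HISTORY_SUBCOMMANDS:
            return True
    return False


trajectory = [
    "git status",
    "cd repo && git log --all --oneline | head",
    "echo 'git log'",
    "git show abc123",
]
flagged = [cmd for cmd in trajectory if inspects_git_history(cmd)]
print(flagged)
```

A flagged command is only a signal for review, since reading history is harmless when the container holds nothing beyond the base commit.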

Each AI model family fails in its own distinctive way, and the patterns matter for enterprise teams

Beyond the top-line scores, Datacurve's qualitative trajectory analysis reveals distinctly different failure signatures across model families — a finding that could help engineering teams choose the right model for specific types of work.

Claude is forgetful with multi-part prompts. On DeepSWE, Claude configurations miss stated requirements more than any other family. The pattern is consistent: when a prompt enumerates parallel behaviors — "support both sync and async," for instance — Claude typically implements the obvious branch and forgets to mirror the change. Datacurve reports that roughly two-thirds of Claude's "MISSED_REQUIREMENT" failures on DeepSWE follow this "one branch shipped" pattern. In one example, Claude Opus 4.7 correctly landed a sync state-data hook in one engine class while the async engine never received the same hook.

GPT, by contrast, implements exactly what is asked. GPT-5.5 had the lowest rate of missing stated behaviors of any configuration tested. Across multiple runs of the same task, GPT trials tended to converge on the same interpretation of the prompt, suggesting instruction-following precision is a stable trait of the model rather than per-run luck.

One of the most intriguing findings involves self-verification. On DeepSWE, Claude Opus 4.7 and GPT-5.4 wrote and ran new tests in the project's own test framework on over 80% of their runs — even though no one asked them to. On SWE-Bench Pro, those same models dropped to 28% and 18%, respectively. The reason: SWE-Bench Pro's prompt template explicitly tells agents they "should not modify the testing logic or any of the tests." Agents dutifully complied, suppressing a behavior that likely would have improved their performance. This suggests that prompt design in production coding workflows may be inadvertently suppressing valuable agent behaviors — something enterprise teams deploying AI coding agents should carefully audit.

What DeepSWE gets right, what it gets wrong, and what it means for the future of AI benchmarks

Datacurve is forthright about several limitations. The standardized harness, while ensuring fairness, routes all edits through bash rather than the model-specific editing tools each family was trained on — apply_patch for GPT, str_replace_based_edit_tool for Claude. This could hold models below their native ceilings. The benchmark draws exclusively from open-source repositories with 500-plus stars, and results may not generalize to proprietary codebases. Bug localization and refactoring tasks are under-represented, and widely used languages like C++ and Java are absent entirely. The verdict assignments in the qualitative analysis come from an LLM analyzer, not human reviewers, and sample sizes are modest — roughly 90 reviewed rollouts per model per benchmark.

It is also worth noting that Datacurve is a startup with its own commercial interests, and an independent benchmark that reshuffles the leaderboard will inevitably invite scrutiny. The company's decision to publish the full dataset, all agent trajectories, and the evaluation harness on GitHub mitigates this concern considerably, but independent reproduction will be necessary before the AI community treats these results as definitive.

DeepSWE arrives at an inflection point for the AI coding market. Enterprise adoption of AI coding agents is accelerating rapidly, with engineering organizations making consequential bets on which model to build around. The benchmark market itself has become a strategic battleground — Scale AI's SWE-Bench Pro, which Datacurve directly critiques, is maintained by a company that also provides evaluation services to the labs whose models it ranks.

If DeepSWE's central findings about verifier reliability and data contamination hold up under independent scrutiny, they could force a reckoning not just with how the industry measures coding agents, but with the broader question of what benchmarks are actually for. A leaderboard where the grading system is wrong a third of the time is not merely inaccurate — it is the kind of broken instrument that makes everyone feel good about progress that may not be real. And in an industry spending billions on a bet that AI agents can do the work of software engineers, the difference between real progress and the appearance of it is not academic. It is the whole game.

The attack dominating financial services doesn't steal passwords. It resets MFA and steals the token.

louiswcolumbus@gmail.com (Louis Columbus) — Tue, 26 May 2026 19:34:30 GMT

The attacker who hit the most financial services organizations over the past 12 months never phished a password. They called an IT support line, convinced an employee to reset their MFA, and registered their own device on the network.

CrowdStrike’s 2026 Financial Services Threat Landscape Report, released this month and covering activity from April 2025 through March 2026, identified Mutant Spider as the single most active threat to the financial services sector. The group’s primary technique was voice phishing over Microsoft Teams. Operators impersonated internal IT support, convinced employees to reset their credentials and multifactor authentication, then registered their own devices on corporate networks. The security control worked exactly as designed — and that was the problem.

Within days, the FBI published a public service announcement warning about Kali365, a phishing-as-a-service platform sold on Telegram for as little as $250 a month. Kali365 captures Microsoft 365 OAuth tokens through the legitimate device code authentication flow. MFA fires on the victim’s device, not the attacker’s. The token grants persistent access to Outlook, Teams, and OneDrive without triggering another MFA prompt.

The Verizon 2026 Data Breach Investigations Report, also released in May, confirmed that credential theft dropped to 13% of breach initial access vectors. Vulnerability exploitation took the top position at 31%, displacing what Verizon called the longtime leading initial-access category. That's three independent sources, same structural finding. MFA protects password-based authentication, but the attacks dominating financial services increasingly bypass password theft through resets, token grants, and exploitation. The MFA Bypass Exposure Audit Grid at the end of this article maps all five confirmed attack surfaces from the CrowdStrike, FBI, and Verizon reports, what MFA misses on each one, and the specific fix for Monday morning.

The CrowdStrike numbers paint a sector under sustained pressure

Financial services ranked as the fourth most targeted sector by Q1 2026, accounting for 12% of all observed adversary activity, according to the CrowdStrike report. Globally, financial institutions faced 43% more hands-on-keyboard intrusions in 2025 compared to two years earlier. In North America, that figure was 48%.

The e-crime side of the problem grew faster than most defenders expected. Big game hunting operators named 423 financial services entities on dedicated leak sites during the reporting period. That is a 27% increase from the 334 entities named in the prior 12 months. REVENANT SPIDER, which operates the Qilin ransomware-as-a-service program, posted the most financial services victims of any e-crime adversary on its dedicated leak site. The group’s financial services victim count jumped from 14 to 97 over the reporting period.

“Who needs a zero day if all you have to do is call the help desk and say, 'I forgot my password'?” Adam Meyers, senior vice president of counter adversary operations at CrowdStrike, told VentureBeat. That one sentence captures the structural shift his team documented across twelve months of financial services intrusions.

The interactive intrusion breakdown tells the story of who is actually getting inside these networks. E-crime actors drove 75% of hands-on-keyboard intrusions against financial services. State-sponsored adversaries accounted for the remaining 25%. That ratio has not moved since 2023. What changed is the total volume and the sophistication of the access techniques.

Mutant Spider’s vishing campaigns over Microsoft Teams represent a structural shift in initial access. The group impersonates IT support, manipulates employees into resetting MFA, then deploys custom post-access tools including PrionFlaire, SocksLoader, and SleepyMutagen. CrowdStrike believes the group sells that access to ransomware operators. The Teams call is step one. The ransom note is step five.

Scattered Spider returned to aggressive ransomware operations against insurance companies from April through July 2025, following a significant operational pause that began in December 2024. The group ran the same playbook it has used since 2022: help desk social engineering; credential and MFA reset requests; then lateral movement through integrated SaaS applications to locate data for extortion. In September 2025, the U.K.’s National Crime Agency arrested and charged two members for allegedly targeting Transport for London. The U.S. Department of Justice separately charged one of them in connection with multiple cyberattacks against U.S. critical infrastructure.

State-sponsored groups added scale and speed

The report’s state-sponsored findings reinforce the identity problem from a different direction. DPRK-nexus adversaries stole $2.02 billion in digital assets in 2025, a 51% increase from the prior year. In February 2025, Pressure Chollima executed the largest single theft ever reported, stealing $1.46 billion in cryptocurrency by compromising Safe{Wallet}, a digital asset management platform supporting the Bybit exchange, after a developer’s machine was infected through a trojanized Python project. China-nexus groups conducted sustained campaigns against financial institutions across multiple continents. Hollow Panda exploited Check Point VPN appliances to target banks in the Philippines, Indonesia, and Brazil. Vault Panda gained initial access through compromised VPN and firewall appliances across four continents. Every state-sponsored campaign CrowdStrike documented shared a common thread. The adversary’s first move targeted an identity, a credential, or a trusted access path.

Elia Zaitsev, CrowdStrike’s CTO, told VentureBeat in April that the speed of these operations is outpacing traditional defense models. “Traditional approaches are just not designed for this sort of behavior,” Zaitsev said.

Kali365 turns token theft into a subscription service

The FBI’s May 21 public service announcement on Kali365 confirmed the second attack path that makes this a compound problem. The platform exploits Microsoft’s OAuth 2.0 device authorization grant flow, a mechanism designed for devices like smart TVs and conference room systems that cannot support interactive login. Kali365 sends phishing emails impersonating trusted services like Adobe Acrobat Sign, DocuSign, and SharePoint. The email contains a device code and instructions to visit a legitimate Microsoft verification page. The victim authenticates normally. MFA fires. The token goes to the attacker.

Arctic Wolf, which published a technical deep dive on Kali365 in April, documented a three-tier commercial structure: an admin tier for the developers, an agent tier for resellers, and a client tier for paying affiliates. Subscription pricing runs from $250 for 30 days to $2,000 for a year. The platform supports 14 languages and includes AI-generated phishing lures, automated campaign templates, and a real-time tracking dashboard.

The device code flow is not a vulnerability. It is a feature. Microsoft designed it for devices that cannot support interactive login. The problem is that default Entra ID configurations do not restrict its use, and most organizations have never audited whether any legitimate workflow actually requires it. Kali365 exploits that gap between design intent and deployment reality.
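An audit of whether any legitimate workflow depends on the device code flow can start from exported sign-in records. The sketch below assumes the records have been loaded as dictionaries. The field names `authenticationProtocol`, `appDisplayName`, and `userPrincipalName`, and the value `deviceCode`, are assumptions modeled on Entra ID sign-in log exports and should be checked against the actual export schema before use.

```python
from collections import Counter


def device_code_usage(sign_ins: list[dict]) -> Counter:
    """Count device code sign-ins per (application, user) pair.

    Field names and the "deviceCode" value are assumptions about the
    export format, not a confirmed schema.
    """
    usage: Counter = Counter()
    for record in sign_ins:
        if record.get("authenticationProtocol") == "deviceCode":
            key = (
                record.get("appDisplayName", "unknown"),
                record.get("userPrincipalName", "unknown"),
            )
            usage[key] += 1
    return usage


# Made-up records for illustration only.
sample_sign_ins = [
    {"authenticationProtocol": "deviceCode", "appDisplayName": "Room Display",
     "userPrincipalName": "conf-room-4@example.com"},
    {"authenticationProtocol": "deviceCode", "appDisplayName": "Room Display",
     "userPrincipalName": "conf-room-4@example.com"},
    {"authenticationProtocol": "none", "appDisplayName": "Outlook",
     "userPrincipalName": "analyst@example.com"},
    {"authenticationProtocol": "deviceCode", "appDisplayName": "Office",
     "userPrincipalName": "analyst@example.com"},
]
for (app, user), count in device_code_usage(sample_sign_ins).most_common():
    print(f"{count:>3}  {app}  {user}")
```

If the resulting list is short and every entry maps to a known device, that supports restricting the flow to those cases. Entries tied to ordinary user accounts are the ones to investigate first.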

The Verizon DBIR reinforced that assessment from a different angle. The 2026 edition analyzed more than 22,000 confirmed breaches across 145 countries. Vulnerability exploitation at 31% now leads credential abuse at 13%. The median time for full patching increased to 43 days, up from 32. Organizations patched only 26% of critical flaws in CISA’s Known Exploited Vulnerabilities catalog, down from 38% the prior year.

That data creates a clear picture. The industry has spent two decades building defenses against credential theft. The attacks that are actually working in financial services either remove MFA through social engineering or capture tokens through legitimate authentication flows where MFA does not protect the attacker’s session.

MFA Bypass Exposure Audit Grid

Security directors need to run this audit against their environment this week. Each row represents a confirmed attack path from the three reports above.

Attack Surface	Confirmed Event	What MFA Misses	Action
Teams vishing/help desk MFA reset	Most active FS attacker called employees on Teams, got MFA reset, registered own device (CrowdStrike)	Help desk verifies caller identity without out-of-band confirmation. Social engineering removes MFA entirely.	Out-of-band verification for all MFA resets. FIDO2 hardware keys. Callback on a separate channel.
OAuth device code flow	$250/mo tool captures M365 tokens via devicelogin page. MFA does not fire on attacker’s device. (FBI)	Not restricted in default Entra ID configurations. Authentication channel separates user’s MFA challenge from attacker’s token grant.	Restrict device code flow in Entra ID conditional access. Block unmanaged devices.
Token persistence	Both paths end here. Valid tokens can grant weeks or months of silent access depending on token lifetime configuration. (CrowdStrike + FBI)	Traditional credential-theft monitoring does not flag token-based access. Tokens are credential-equivalent bearer artifacts, but most detection tools do not classify them that way.	Monitor OAuth refresh token usage from unfamiliar devices. Token lifetime policies.
Post-access SaaS movement	After reset, attackers pivoted to SaaS apps for credentials and docs. (CrowdStrike, insurance sector)	DLP monitors file downloads, not post-reset session activity or token-based API calls from authorized sessions.	Audit Graph API access. Flag bulk ops from reset or device-code sessions.
Budget misalignment	Credential theft at 13%. Vuln exploitation at 31%. (Verizon DBIR) Patch reverse-engineering within 72 hours. (Ivanti)	Legacy, login-only MFA investment addresses the threat that just dropped to third. Token capture and social engineering sit outside that investment.	Rebalance toward token monitoring, session validation, identity verification for resets.

Mike Riemer, SVP and field CISO at Ivanti, told VentureBeat in an exclusive interview that the speed problem compounds the budget misalignment. “Threat actors are reverse engineering patches, and the speed at which they’re doing it has been enhanced greatly by AI,” Riemer said. “They’re able to reverse engineer a patch within 72 hours. If I release a patch and a customer doesn’t patch within 72 hours of that release, they’re open to exploit.”

The structural problem is clear

“People are forgetting about runtime security,” Zaitsev said. “We’ve done this before, with endpoint and virtualization and cloud. People really focused on, hey, let’s patch all the vulnerabilities. Impossible. Let’s make sure we lock down all the permissions. Somehow always seem to miss something.”

The attackers who matter most in financial services right now are not stealing passwords. They are calling help desks. They are exploiting legitimate authentication flows. They are capturing tokens that persist for months. The defenses that consumed the largest share of security budgets for the past decade are pointed at a threat that just dropped to third place.

The fix is not adding another layer of MFA — Zaitsev and Riemer both said as much. It's rethinking what MFA actually protects, what it doesn't, and where the budget needs to go next.

Why prompt debt, retrieval debt, and evaluation debt are quietly reshaping enterprise AI risk

Mon, 25 May 2026 19:30:18 GMT

Over the past two decades, technical debt meant outdated architecture, messy code, and poorly maintained documentation. That definition is no longer sufficient in the AI era, where failure modes are more subtle and often non-linear. AI systems are introducing new layers of technical debt that live across prompts, models, and data dependencies — making these layers less visible, harder to measure, and often more dangerous than traditional debt.

A crisis hiding in plain sight

The complexities of AI systems and their associated failures have been well documented. A 2025 MIT study found that 95% of AI projects fail to reach production or deliver value. A similar study by S&P Global Market Intelligence found that 42% of businesses scrapped multiple AI initiatives in 2025 — a sharp increase from 17% the previous year. Various reasons are cited for these failures, but most of them point to poorly designed and implemented systems that are complex to manage and have multiple hard-to-monitor failure points, leading to a rapid accumulation of AI debt.

Traditional technical debt was localized to the codebase, and bugs were usually easily reproducible. Consequently, bugs could be easily identified during tests and fixed through rearchitecting the codebase. However, AI debt is much more distributed, manifesting across prompts, models, data pipelines, and all associated infrastructure. It is also harder to reproduce: because AI is probabilistic, systems do not always respond the same way, leading to intermittent failures. This makes it much more challenging to identify risks during testing, and also creates a need for more continuous monitoring even post-deployment to prevent gradual drift and worsening performance.

The new forms of AI debt

AI debt typically manifests across four new forms, each of which comes with its own set of risks.

Prompt debt is the most visible of these. A modern version of ‘spaghetti code,’ this can include undocumented prompt tweaks, accumulated ‘quick-fix’ prompts that lead to inconsistencies, neglected version control of prompts, and ‘prompt stuffing’ (the cramming of extraneous data or context directly into AI prompts). All these combine to make prompts a form of untyped, untested code without any version control, leading to increased brittleness and vulnerabilities.

Model dependency debt is another increasingly common form of AI debt. Most enterprises now depend on a mixture of external models developed by leading foundation model providers; applications and agents are built on top of API calls to these models. Consequently, application logic now depends on models that are external to the core system, and that cannot be clearly controlled. As models update, performance varies and reproducibility is lost — prompts tuned for one model may fail or perform poorly when switched to another model, whether an update from the same provider or from another provider.

Most enterprise AI deployments today use retrieval-augmented generation (RAG), which pulls in additional context from enterprise data repositories. Retrieval debt is a consequence of these repositories having messy data, duplicated documents, and outdated information. This causes AI to return technically correct answers that are outdated and no longer relevant, causing downstream failures. Unlike hallucinations, these are harder to detect because they were correct, perhaps even until recently, and hence look correct to any tester.
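One way to limit this kind of debt is to filter retrieved documents before they reach the model. The sketch below is illustrative and assumes each document carries a `last_reviewed` date. The `Doc` type, the one-year cutoff, and the whitespace-and-case fingerprint used for duplicate detection are all assumptions, not a recommended standard.

```python
import hashlib
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    last_reviewed: date


def filter_retrieved(docs: list[Doc], today: date, max_age_days: int = 365) -> list[Doc]:
    """Drop stale documents and keep only the newest copy of each duplicate."""
    cutoff = today - timedelta(days=max_age_days)
    seen: set[str] = set()
    kept: list[Doc] = []
    # Newest first, so the first copy of a duplicate we meet is the freshest.
    for doc in sorted(docs, key=lambda d: d.last_reviewed, reverse=True):
        if doc.last_reviewed < cutoff:
            continue
        normalized = " ".join(doc.text.lower().split())
        fingerprint = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if fingerprint in seen:
            continue
        seen.add(fingerprint)
        kept.append(doc)
    return kept


docs = [
    Doc("policy-v3", "Refunds are issued within 14 days.", date(2026, 3, 1)),
    Doc("policy-v3-copy", "Refunds are  issued within 14 days.", date(2025, 11, 1)),
    Doc("policy-v1", "Refunds are issued within 30 days.", date(2023, 1, 15)),
]
fresh = filter_retrieved(docs, today=date(2026, 5, 25))
print([d.doc_id for d in fresh])
```

In this made-up example the 2023 policy is dropped as stale and the older copy of the current policy is dropped as a duplicate, leaving a single answer for the model to cite.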

Evaluation debt reflects the lack of standardization in testing and monitoring for AI models and applications. While AI benchmarks exist, they tend to focus on narrow tests and reflect point-in-time results. Most enterprises lack consistent testing standards, ground truth datasets, and real-time monitoring of deployments; there is no equivalent yet of continuous integration/continuous delivery (CI/CD) for prompts. As a consequence, CIOs and CTOs do not have clear visibility into model performance and cannot track improvements or worsening of models.

All of these are in addition to traditional forms of technical debt, which still manifest across the tools and systems that AI applications and agents interact with, read from, or write to. A rapid increase in the adoption of AI-generated code (often deployed without adequate testing) is further aggravating inconsistencies within, and poor maintainability of, traditional codebases.

The new forms of AI debt combine with these earlier forms of technical debt to compound rapidly and create large-scale risks that can cause catastrophic failure of entire enterprise deployments. Solving for these risks is made even more challenging by the distributed nature of AI ownership – most systems span engineering, product, data, and business teams, leading to unclear accountability when an error is identified.

As a result, these risks manifest in the form of escalating compute costs, inaccuracies in AI outputs, and increasing exceptions that need to be handled by humans — leading to projects often stalling and failing due to unclear return-on-investment stories and a lack of trust from users.

How enterprises can prevent AI debt

AI debt will not be solved by ‘better’ models — failure rates remain high despite models already having high accuracy. The solution to AI debt requires better system design, integration, controls, and changes in organizational culture.

First, prompts need to be treated as code. This involves careful version control, documentation, and rigorous testing both pre- and post-deployment for all possible prompt configurations. Best practices from the traditional world of coding — such as the use of smaller prompt blocks instead of large prompt-stuffed walls, or reducing the use of hard-coded parameters — can also help mitigate AI debt.
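Treating prompts as code can be as simple as giving every template a content-derived version and refusing to render it with missing parameters. The sketch below is a minimal in-memory illustration. `PromptRegistry` and its methods are hypothetical names, and a real setup would more likely keep templates in the same version control system as the application code.

```python
import hashlib
import string
from dataclasses import dataclass, field


@dataclass
class PromptRegistry:
    """In-memory stand-in for prompts kept under version control."""

    _versions: dict = field(default_factory=dict)

    def register(self, name: str, template: str, note: str) -> str:
        """Record a new version only when the template text changes."""
        digest = hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]
        history = self._versions.setdefault(name, [])
        if not history or history[-1]["digest"] != digest:
            history.append({"digest": digest, "template": template, "note": note})
        return digest

    def history(self, name: str) -> list:
        return [(v["digest"], v["note"]) for v in self._versions[name]]

    def render(self, name: str, **params: str) -> str:
        """Fill the latest template, failing loudly on missing parameters."""
        template = self._versions[name][-1]["template"]
        needed = {f for _, f, _, _ in string.Formatter().parse(template) if f}
        missing = needed - set(params)
        if missing:
            raise KeyError(f"missing prompt parameters: {sorted(missing)}")
        return template.format(**params)


registry = PromptRegistry()
v1 = registry.register("summarize", "Summarize for {audience}: {text}", "initial")
v1_again = registry.register("summarize", "Summarize for {audience}: {text}", "no change")
v2 = registry.register(
    "summarize", "Summarize in 3 bullets for {audience}: {text}", "add length limit"
)
print(registry.history("summarize"))
print(registry.render("summarize", audience="a CIO", text="Q1 incident report"))
```

Because the version is a hash of the template text, an undocumented tweak shows up as a new entry in the history with a different digest, which makes it possible to tie an output regression to a specific prompt change.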

Second, evaluation needs to be built into the entire AI infrastructure stack. Continuous evaluation pipelines need to be established and must track a wide variety of metrics, both technical and business-aligned. In addition, AI observability systems should be integrated to monitor output quality, failure rates, model drift, and data drift.
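A continuous evaluation pipeline can begin as a small regression gate that runs on every prompt or model change. The sketch below assumes a ground-truth set of question and answer pairs and an exact-match metric. The `evaluate` and `passes_gate` names, the stub model, and the 2-point tolerance are illustrative choices.

```python
from typing import Callable


def evaluate(model: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Fraction of cases where the model's answer matches the expected one."""
    if not cases:
        raise ValueError("ground-truth set is empty")
    correct = sum(
        1 for question, expected in cases
        if model(question).strip().lower() == expected.strip().lower()
    )
    return correct / len(cases)


def passes_gate(score: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Block a release when the score falls more than `tolerance` below baseline."""
    return score >= baseline - tolerance


# Stub standing in for a real model call; answers are made up.
CANNED = {
    "capital of France?": "Paris",
    "2 + 2?": "4",
    "largest planet?": "Saturn",   # deliberately wrong
    "boiling point of water in C?": "100",
}


def stub_model(question: str) -> str:
    return CANNED.get(question, "")


ground_truth = [
    ("capital of France?", "paris"),
    ("2 + 2?", "4"),
    ("largest planet?", "Jupiter"),
    ("boiling point of water in C?", "100"),
]
score = evaluate(stub_model, ground_truth)
print(score, passes_gate(score, baseline=0.75), passes_gate(score, baseline=0.90))
```

Exact match is the simplest possible metric. Business-aligned metrics would plug into the same gate as additional scores with their own baselines.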

Third, explainability should be included by default in all AI results to make up for limited reproducibility. Data lineage, models used, and the steps followed should be clearly traceable so as to allow auditability of results and correction in case of any systemic errors.

This requires explicit AI debt reduction programs and associated budgets, similar to earlier waves of investment in security or in cloud modernization. These need to be driven at a CXO level by key leaders to prevent costly rework later.

Conclusion: A stitch in time

Enterprise AI deployments are not just static code; they are living systems that interact with the entire enterprise stack. As a result, the defining challenge in an agentic enterprise will not be building or deploying intelligent systems; it will be maintaining these systems to ensure continued reliability during real-world operation.

Enterprises that seek to proactively identify and mitigate AI debt from the design phase itself are the likeliest to build sustainable AI platforms that deliver significant long-term productivity boosts across the organization.

Vikram is a principal at Cota Capital, where he invests in early-stage enterprise tech and deep tech companies.

AI agents are quietly generating chaos engineering failures enterprises don’t track yet

Sun, 24 May 2026 17:00:17 GMT

There is a category of production incident that engineering teams are not tracking yet — because it doesn't fit any existing postmortem template.

The agent initiated an action. The action was technically correct given the agent's context. The context was incomplete. The infrastructure cascaded. And, by the time the incident review happened, three teams were arguing about whether it was an agent failure or an infrastructure failure, because the frameworks for thinking about these two things have never been connected.

The scale of this exposure is no longer theoretical. Seventy-nine percent of organizations now have some form of AI agent in production, with 96% planning expansion. Gartner predicts 33% of enterprise software will include agentic AI by 2028, but separately warns that 40% of those projects will be canceled due to poor risk controls.

What neither statistic captures is the failure mode happening between those two numbers: Agents that are running, that are not canceled, and that are quietly generating infrastructure events no one has categorized as risk.

I've spent six years building infrastructure automation systems at enterprise scale, first at Cisco (leading AI-driven lifecycle platforms deployed across 20-plus global enterprise customers), then at Splunk (designing AI-assisted root cause analysis and observability workflows across thousands of enterprise environments).

During that time I also filed a patent on intent-based chaos engineering methodology. And across all of it, I kept watching organizations make the same structural mistake: Treating autonomous agents and chaos engineering as separate disciplines. They are not. They are the same discipline, and the gap between them is quietly generating the next wave of major production incidents.

The judgment call that agents skip

To understand why this matters, you need to understand what's actually broken in how enterprises govern chaos today, before you add agents to the picture.

Most mature engineering organizations have invested in chaos engineering programs. Game days, blast radius controls, SLO-gated experiments. When a human engineer initiates a chaos experiment, the sequence has a critical property: A human is making a judgment call about whether the system has capacity to absorb the perturbation right now. They check dashboards. They look at the error budget burn rate. They assess whether dependencies are stable. It's imperfect and often intuitive, but there is at least a person in the loop asking the right question before anything runs.

When you introduce an autonomous remediation agent, one that can restart services, reroute traffic, scale resources, or modify configurations in response to detected anomalies, that question disappears. The agent sees an anomaly. The agent takes an action. The action is a chaos event. No SLO burn rate check. No blast radius calculation. No human judgment about whether right now is the right moment to introduce additional stress into a system that may already be under pressure from three other directions.

Here is the specific failure mode I have watched play out. A remediation agent detects elevated latency on a microservice and responds by restarting the service cluster, a reasonable action given its training data and its narrow view of the incident. What the agent doesn't know: Three other services are in the middle of handling peak traffic. The shared connection pool is already at 87% utilization. A dependent database is running a background index rebuild. The restart triggers a thundering herd against the recovering service.

What started as a latency spike the agent was designed to fix becomes a cascade the agent was never designed to model. The blast radius of that agent action was not the service restart. It was everything downstream of the restart, in a system state the agent had no complete picture of.

Nobody's chaos engineering program had tested for that specific combination. Nobody's blast radius calculation had included the agent as an actor. Because we don't think of agents as chaos injectors. We should.

According to the AI Incidents Database, reported AI-related incidents rose 21% from 2024 to 2025. That count almost certainly understates the actual exposure, because most organizations have no incident classification that captures an autonomous agent action as the initiating cause of a cascade. The incident gets logged as a service restart, a connection pool saturation, or a latency event. The agent is invisible in the postmortem.

Absorb capacity is a resource; most systems don't treat it that way

The underlying problem is that enterprise systems have no shared language for absorb capacity — the real-time estimate of how much additional stress a system can take before it breaches its SLO commitments. Chaos engineering programs manage it implicitly, through human judgment and static thresholds that fire after a limit has already been crossed. Agents don't manage it at all.

Through structured primary research with site reliability engineering (SRE) and platform engineering practitioners across organizations including Intuit and GPTZero, I've been developing a resilience budget model. The core idea is to treat absorb capacity as a continuously recomputed, consumable resource rather than a static threshold you try not to breach.

A resilience budget draws on four live signal classes.

SLO burn rate is the primary input, because it directly encodes the distance between current system behavior and the commitment that actually matters. If a system is burning its monthly error budget at five times the expected rate, the resilience budget is near zero regardless of what CPU utilization looks like.
P99 latency trend matters more than absolute latency, because a service trending upward over forty minutes tells you something different than a service that has been stable at the same absolute value.
Dependency saturation state is the most commonly missed signal; a chaos experiment or an agent action that assumes a shared connection pool is freely available when it's sitting at 87% will produce failure modes that nobody designed for.
Application behavioral signals (session completion rates, API call pattern shifts, conversion degradation) surface system stress earlier than infrastructure metrics do, because users feel the degradation before Prometheus reports it.

What makes this a budget rather than a threshold is that it is consumable. Every chaos experiment draws from the available capacity. Every agent action draws from it. In multi-team organizations where multiple experiments and multiple agents may be acting simultaneously, the budget is shared.

Without a shared ledger of consumption, two teams running experiments against overlapping dependencies produce a combined blast radius that neither team planned. Add autonomous agents acting completely outside the ledger, and the accounting collapses.
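
A minimal sketch of how such a shared ledger might look, assuming each signal has already been normalized. The `Signals` fields, the scoring formula in `available_budget`, and the `BudgetLedger` class are hypothetical illustrations, not the author's model. The only detail taken from the text is that a five-times burn rate pushes the budget to near zero (here, exactly zero).

```python
from dataclasses import dataclass, field


@dataclass
class Signals:
    """One snapshot of the four signal classes (all scales are assumptions)."""
    slo_burn_rate: float          # 1.0 = burning error budget at the expected rate
    p99_trend_pct: float          # % change in P99 latency over the lookback window
    dependency_saturation: float  # 0.0-1.0, e.g. shared connection pool utilization
    behavioral_health: float      # 0.0-1.0, e.g. session completion vs. baseline


def available_budget(s: Signals) -> float:
    """Return a 0.0-1.0 estimate of absorb capacity (hypothetical scoring)."""
    # Burn rate is the primary input: headroom falls linearly from 1.0 at the
    # expected rate to 0.0 at five times the expected rate.
    burn_headroom = min(1.0, max(0.0, 1.0 - (s.slo_burn_rate - 1.0) / 4.0))
    # Only an upward latency trend reduces headroom.
    latency_headroom = max(0.0, 1.0 - max(0.0, s.p99_trend_pct) / 100.0)
    dependency_headroom = 1.0 - s.dependency_saturation
    # The weakest secondary signal bounds the budget; burn rate scales it.
    weakest = min(latency_headroom, dependency_headroom, s.behavioral_health)
    return burn_headroom * weakest


@dataclass
class BudgetLedger:
    """Shared ledger: experiments and agent actions draw from one budget."""
    entries: list = field(default_factory=list)

    def consumed(self) -> float:
        return sum(cost for _, cost in self.entries)

    def try_reserve(self, actor: str, cost: float, signals: Signals) -> bool:
        """Record the draw and return True only if enough budget remains."""
        remaining = available_budget(signals) - self.consumed()
        if cost > remaining:
            return False
        self.entries.append((actor, cost))
        return True
```

In this sketch, a team's experiment and a remediation agent both call `try_reserve` against the same ledger, so the second caller sees the capacity the first one has already drawn down.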

Where language models help, and exactly where they fail

Several engineering organizations are now running experiments using large language models (LLMs) to generate chaos hypotheses from dependency graphs and incident postmortem corpora. The results are directionally useful. Language models surface plausible failure modes that experienced SREs recognize as worth testing, and they generate hypotheses faster than manual processes, particularly when working from rich postmortem history.

The limit is dependency graph staleness, and it is a hard limit. A hypothesis generated from a graph that doesn't reflect last month's service extraction, or a new shared library dependency added two sprints ago, will propose an experiment with incorrect blast radius assumptions. The problem is not that the model makes a mistake; it's that the model doesn't know it's making one. It will be confidently incorrect about a system boundary that no longer exists, and in chaos engineering, confident incorrectness in production means an unplanned outage.

Stanford's Trustworthy AI Research Lab found that model-level guardrails alone are insufficient: Fine-tuning attacks bypassed leading models in the majority of tested cases. The implication for chaos hypothesis generation is direct: a model that cannot reliably hold its own safety boundaries cannot be trusted to accurately model the blast radius of an action it has never seen in a dependency graph it has not verified.

When hypothesis generation draws instead from postmortem corpora, the staleness problem shrinks considerably. Postmortems describe failures that actually occurred in the system at a specific moment in time. The signal is inherently validated by production reality. This is the tractable near-term AI application in this space, and it is genuinely useful for organizations with mature incident documentation practices.

What AI cannot do, and should not be asked to do, is make the execution decision when signals are ambiguous. That judgment requires awareness of things that live entirely outside any monitoring system: Pending deployments that changed the dependency landscape an hour ago, on-call staffing levels on a holiday weekend, a customer commitment that makes any additional risk unacceptable until Monday.

A model without access to that context should not be making that call. This is not a temporary limitation pending a more capable model. It is a structural constraint of what machine observability can represent, and building an agent architecture that ignores it is building one that will eventually make a consequential decision with incomplete information — and no human in the loop to catch it.

What this means for how enterprises govern agents in production

The governance implication is straightforward to describe and harder to implement than it sounds. Every autonomous agent action that touches infrastructure needs to register against the same live signal layer that governs chaos experiments. The same SLO burn rates, latency trends, and dependency saturation states that a human engineer would check before initiating an experiment should gate what an agent is permitted to do and when. If the resilience budget is below a defined floor, the agent waits or escalates. It does not act.

Agent actions also need to be modeled as experiments, not just logged as events. When an agent restarts a service, the question isn't only whether the restart completed successfully. It's whether the blast radius of that action was proportionate to the available absorb capacity, and what cascading effects it produced across dependencies. That is chaos engineering data. It belongs in the budget model, feeding the next decision the agent or the team needs to make.

And when signals are genuinely ambiguous, when the budget score is unclear, when a recent deployment has changed the topology in ways the agent's context window doesn't capture, when dependency states are in flux, the execution decision needs to go to a human. Not as a permanent limitation on agent autonomy, but as a hard engineering requirement for the current state of the technology.
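
The gating rule described above could be sketched roughly as follows. The function name, the `ambiguity_band` parameter, and the two boolean context flags are assumptions made for illustration; a real implementation would derive them from whatever signal layer the organization already runs.

```python
from enum import Enum


class Decision(Enum):
    ACT = "act"
    WAIT = "wait"
    ESCALATE = "escalate"


def gate_agent_action(budget: float, floor: float, ambiguity_band: float,
                      topology_changed_recently: bool,
                      dependencies_in_flux: bool) -> Decision:
    """Decide whether an agent may act (all thresholds are hypothetical)."""
    # Context the agent cannot fully see goes to a human, whatever the score.
    if topology_changed_recently or dependencies_in_flux:
        return Decision.ESCALATE
    # A score close to the floor is treated as unclear, not as a pass.
    if abs(budget - floor) <= ambiguity_band:
        return Decision.ESCALATE
    # Clearly below the floor: the agent does not act.
    if budget < floor:
        return Decision.WAIT
    return Decision.ACT
```

The design choice worth noting is the ordering: the ambiguity checks run before the floor comparison, so a healthy-looking score never overrides an unverified topology.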

A circuit breaker that hands ambiguous cases to a human is not a weakness in the agent architecture. It is the thing that makes the architecture trustworthy enough to actually run in production. Intent-based verification formalizes exactly this: Defining what correct agent behavior looks like before deployment, then continuously probing whether those boundaries hold under live system conditions.

The organizations that operate autonomous agents reliably at scale are not the ones with the most sophisticated models. They are the ones that understood, before something went badly wrong, that every agent action is a chaos event and built their governance layer accordingly.

The practical first step is unglamorous: Audit every autonomous agent currently touching infrastructure, map its action surface against your live SLO burn rate signals, and define explicit floor conditions below which the agent is required to wait or escalate. That audit will surface agents acting entirely outside your resilience accounting.

Most organizations running agents at scale today have several. Find them before production does.

Sayali Patil has spent 6-plus years at Cisco Systems and Splunk building the reliability and automation systems that keep enterprise AI infrastructure running at scale.

Valid certificates, stolen accounts: how attackers broke npm's last trust signal

louiswcolumbus@gmail.com (Louis Columbus) — Fri, 22 May 2026 22:21:34 GMT

On May 19, 633 malicious npm package versions passed Sigstore provenance verification. They were cleared by the system because the attacker had generated valid signing certificates from a compromised maintainer account.

Sigstore worked exactly as designed: it verified the package was built in a CI environment, confirmed a valid certificate was issued, and recorded everything in the transparency log. What it cannot do is determine whether the person holding the credentials authorized the publish — and that gap turned the last automated trust signal in npm into camouflage.

One day earlier, StepSecurity documented an attack on the Nx Console VS Code extension, a widely used developer tool with more than 2.2 million lifetime installs. Version 18.95.0 was published using stolen credentials on May 18 and stayed live for under 40 minutes — but Nx internal telemetry showed approximately 6,000 activations during that window, most through auto-update, compared to just 28 official downloads. The payload harvested Claude Code configuration files, AWS keys, GitHub tokens, npm tokens, 1Password vault contents, and Kubernetes service account tokens.

The Mini Shai-Hulud campaign, attributed by multiple researchers to a financially motivated threat actor identified as TeamPCP, hit the npm registry at 01:39 UTC on May 19. Endor Labs detected the initial wave when two dormant packages, jest-canvas-mock and size-sensor, published new versions containing an obfuscated 498KB Bun script — neither had been updated in over three years, making a sudden version with raw GitHub commit hash dependencies a detection signal, but only if the tooling is watching.
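
The detection signal described above (a long-dormant package that suddenly publishes a version with raw commit-hash dependencies) can be expressed as a simple heuristic. This is a sketch, not the detection logic Endor Labs uses: the function name, the input shapes, and the regular expression for commit references are all assumptions.

```python
import re
from datetime import datetime, timedelta

# Matches dependency specs that end in a Git commit reference, such as
# "github:owner/repo#<hex sha>". The exact pattern is an assumption.
COMMIT_REF = re.compile(r"#[0-9a-f]{7,40}$")


def is_suspicious_release(previous_publish: datetime, new_publish: datetime,
                          dependencies: dict[str, str],
                          dormancy: timedelta = timedelta(days=3 * 365)) -> bool:
    """Flag a release from a dormant package that pins raw commit hashes."""
    was_dormant = new_publish - previous_publish >= dormancy
    has_commit_dep = any(COMMIT_REF.search(spec) for spec in dependencies.values())
    return was_dormant and has_commit_dep
```

Either condition alone is common in legitimate packages; the sketch only flags the combination, which is what the article identifies as the signal.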

By 02:06 UTC, the worm had propagated across the @antv data visualization ecosystem and dozens of unscoped packages, including echarts-for-react (~1.1 million weekly downloads). Socket raised the total to 639 compromised versions across 323 unique packages in this wave. Across the full campaign lifecycle, Socket has tracked 1,055 malicious versions across 502 packages spanning npm, PyPI, and Composer.

StepSecurity confirmed the payload contained full Sigstore integration. The attacker didn't just steal credentials; they could sign and publish downstream npm packages that carried valid provenance attestations.

These two incidents aren’t isolated. Research teams at Endor Labs, Socket, StepSecurity, Adversa AI, Johns Hopkins, Microsoft MSRC, and LayerX independently proved that the developer tool verification model is broken, and no vendor framework audits all of the attack surfaces that failed.

Seven attack surfaces failed in the 48 hours between May 18 and May 19 — npm provenance forgery, VS Code extension credential theft, MCP server auto-execution, CI/CD agent prompt injection, agent framework code execution, IDE credential storage exposure, and shadow AI data exposure — and the audit grid below maps each.

The verification model is broken across all four major AI coding CLIs

Adversa AI disclosed TrustFall on May 7, demonstrating that Claude Code, Gemini CLI, Cursor CLI, and Copilot CLI all auto-execute project-defined MCP servers the moment a developer accepts a folder trust prompt. All four default to “Yes” or “Trust.” One keypress spawns an unsandboxed process with the developer’s full privileges.

The MCP server runs with enough privilege to read stored secrets and source code from other projects. On CI runners using Claude Code’s GitHub Action in headless mode, the trust dialog never renders. The attack executes with zero human interaction.

Johns Hopkins researchers Aonan Guan, Zhengyu Liu, and Gavin Zhong published “Comment and Control,” proving that a malicious instruction in a GitHub pull request title caused Claude Code Security Review to post its own API key as a comment. The same attack worked on Google’s Gemini CLI Action and GitHub’s Copilot Agent. Anthropic rated the vulnerability CVSS 9.4 Critical through its HackerOne program.

Microsoft MSRC disclosed two critical Semantic Kernel vulnerabilities on May 7. One routes attacker-controlled vector store fields into a Python eval() call; the other exposes a host-side file download method as a callable kernel function — meaning one poisoned document in a vector store launches a process on the host.

LayerX security researchers separately demonstrated that Cursor stores API keys and session tokens in unprotected storage, meaning any browser extension can access developer credentials without elevated permissions.

The threat actors hunting these credentials doubled their operational tempo

The Verizon 2026 Data Breach Investigations Report, released May 19, found that 67% of employees access AI services from non-corporate accounts on corporate devices. Shadow AI is now the third most common non-malicious insider action in DLP datasets. Source code leads all data types submitted to unauthorized AI platforms — the same asset class the npm worm campaign targeted.

The CrowdStrike 2026 Financial Services Threat Landscape Report, released May 14, documents the adversaries actively hunting the credential types these attacks harvest.

STARDUST CHOLLIMA tripled its operational tempo against financial entities in Q4 2025. CrowdStrike documented the group using AI-generated recruiter personas on LinkedIn and Telegram, sending malicious coding challenges that looked like technical assessments, and running fake video calls with synthetic environments. The targets are GitHub PATs, npm tokens, AWS keys, and CI/CD secrets. The shadow AI exposure in grid row 7 is the door they walk through.

Developer Tool Stolen-Identity Audit Grid

No vendor framework currently scopes all seven surfaces. This grid maps each one to the research that exposed it, what your stack cannot see, and the audit action to take before the next vendor renewal.

Attack Surface	Disclosed By	What Verification Failed	What Your Stack Cannot See	Audit Action
1. npm provenance forgery	Endor Labs, Socket (May 19)	Sigstore certificates generated from stolen OIDC tokens pass automated verification	EDR and SAST do not validate whether the CI identity that signed a package authorized the publish	Require publish-time two-party approval for packages with more than 10,000 weekly downloads. Do not treat a green Sigstore badge as proof of legitimacy
2. VS Code extension credential theft	StepSecurity (May 18)	VS Code Marketplace accepted a malicious extension version published with a stolen contributor token	Extension auto-updates bypass endpoint detection. Marketplace window 12:30 to 12:48 UTC; overall exposure (including Open VSX) 12:30 to 13:09 UTC	Enforce minimum-age policies for extension updates. Pin critical extension versions. Audit all extensions with access to terminal or file system APIs
3. MCP server auto-execution	Adversa AI, TrustFall (May 7)	All four CLI trust dialogs default to “Yes/Trust” without enumerating which executables will spawn	EDR monitors process behavior, not what an LLM instructs an MCP server to do. WAF inspects HTTP payloads, not tool-call intent	Disable project-scoped MCP server auto-approval in Claude Code, Gemini CLI, Cursor CLI, and Copilot CLI. Block .mcp.json in CI pipelines unless explicitly allowlisted
4. CI/CD agent prompt injection	Johns Hopkins, Comment and Control (April 2026)	GitHub Actions workflows using pull_request_target inject secrets into runner environments that AI agents process as instructions	SIEM logs show an API call from a legitimate GitHub Action. The call itself is the attack. No anomalous network signature exists	Migrate AI code review workflows to pull_request trigger. Audit all workflows using pull_request_target with secret access for AI agent integrations
5. Agent framework code execution	Microsoft MSRC (May 7)	Semantic Kernel Python SDK routed vector store filter fields into eval(). .NET SDK exposed host file-write as a callable kernel function	Application firewalls inspect input payloads. They do not inspect how an orchestration framework parses those payloads internally	Update Semantic Kernel Python SDK to 1.39.4 and .NET SDK to 1.71.0. Audit all agent frameworks for functions tagged as model-callable that access host file system or shell
6. IDE credential storage exposure	LayerX (April 2026)	Cursor stores API keys and session tokens in unprotected storage accessible to any installed browser extension	DLP monitors data in transit. Cursor credentials at rest are invisible to DLP because no egress event occurs until the extension exfiltrates	Audit developer tools for credential storage practices. Require protected storage (OS keychain, encrypted credential stores) for all AI coding tool configurations
7. Shadow AI data exposure	Verizon 2026 DBIR (May 19)	67% of employees access AI services from non-corporate accounts on corporate devices. Source code is the leading data type submitted	CASB policies cover sanctioned SaaS. Non-corporate AI accounts on corporate devices operate outside CASB scope entirely	Deploy browser-layer AI governance that monitors non-corporate AI usage on corporate devices. Inventory AI browser extensions across the organization

Security director action plan

Security directors may want to run this grid against current vendor contracts before Q2 renewals close — asking each vendor which of the seven surfaces their product covers, and treating the non-answers as the gap map.

Any credential accessible from a developer machine or CI runner that installed affected npm packages between 01:39 and 02:18 UTC on May 19 should be considered compromised. That includes GitHub PATs, npm tokens, AWS access keys, Kubernetes service account tokens, HashiCorp Vault tokens, SSH keys, and 1Password vault contents.

AI coding agent integrations running in CI/CD pipelines with pull_request_target workflows deserve a close look. Each one is a prompt injection surface that processes PR comments as agent instructions.
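
As a starting point for that review, a short script can list workflow files that combine the `pull_request_target` trigger with a secret reference. This is a text-matching heuristic under stated assumptions: it does not parse YAML, it cannot tell whether an AI agent is actually involved, and the function name is hypothetical.

```python
import re
from pathlib import Path

TRIGGER = re.compile(r"\bpull_request_target\b")
# GitHub Actions expression syntax for reading a secret: ${{ secrets.NAME }}
SECRET_REF = re.compile(r"\$\{\{\s*secrets\.[A-Za-z0-9_]+\s*\}\}")


def risky_workflows(repo_root: str) -> list[str]:
    """Return workflow file names that use the trigger and reference a secret."""
    workflows_dir = Path(repo_root) / ".github" / "workflows"
    flagged: list[str] = []
    if not workflows_dir.is_dir():
        return flagged
    for path in sorted(workflows_dir.iterdir()):
        if path.suffix not in {".yml", ".yaml"}:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        if TRIGGER.search(text) and SECRET_REF.search(text):
            flagged.append(path.name)
    return flagged
```

Each flagged file still needs a manual read; the script narrows the list, it does not make the judgment.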

Procurement teams evaluating AI coding tools should consider adding a stolen-identity resistance dimension to vendor assessments. The question worth asking: can the vendor demonstrate how their tool distinguishes a legitimate maintainer publish from an attacker using compromised credentials? If they cannot, the tool is not a verification layer.

The developer tool supply chain has the same problem IAM had a decade ago: credentials prove who you claim to be, not who you are. IAM got a 10-year head start on compensating controls before nation-state groups turned credential theft into an industrial operation. The AI coding tool ecosystem is starting that clock now.

Your AI agents need a terminal, not just a vector database

bendee983@gmail.com (Ben Dickson) — Fri, 22 May 2026 21:05:07 GMT

When agentic workflows fail, developers often assume the problem lies in the underlying model’s reasoning abilities. In reality, the limited information provided by the retrieval interface is often the primary limiting factor.

Researchers at multiple universities propose a technique called direct corpus interaction (DCI) that lets agents bypass embedding models entirely, searching raw corpora directly using standard command-line tools.

The limits of classic retrieval

In classic retrieval systems such as RAG, documents are chunked, converted into vector representations (or embeddings), and indexed offline in a vector database. When an AI system processes a query, a retriever filters the entire database to return a ranked "top-k" list of document snippets that match the query. All evidence must pass through this scoring mechanism before any downstream reasoning occurs.
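
A toy version of that pipeline makes the single gating step visible. This sketch substitutes a bag-of-words count vector for a real embedding model and an in-memory list for a vector database; both substitutions, and all function names, are assumptions for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int) -> list[str]:
    """Rank chunks by similarity to the query and keep only the best k."""
    # Index step (done offline in a real system): embed every chunk once.
    index = [(chunk, embed(chunk)) for chunk in chunks]
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    # Anything outside the top k never reaches the downstream reasoner.
    return [chunk for chunk, _ in ranked[:k]]
```

Whatever `top_k` drops is invisible to every later step, which is the constraint the rest of the article is concerned with.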

But modern agentic applications demand much more. "Dense retrieval is very useful for broad semantic recall, but when an agent has to solve a multi-step task, it often needs to search for exact strings, numbers, versions, error codes, file paths, or sparse combinations of clues," the authors of the DCI paper said in comments provided to VentureBeat. "These long-tail details are precisely where semantic similarity can be brittle."

Unlike static search, agents must also revise their search plans dynamically after observing partial or localized evidence. Exact lexical constraints and multi-step hypothesis refinement are difficult to execute with semantic retrievers. Because the retriever compresses access into a single step, any critical evidence filtered out by the similarity search cannot be recovered later, no matter how advanced the agent's downstream reasoning capabilities are. As the authors explain, current retrieval pipelines can become a bottleneck because "they decide too early what the agent is allowed to see."

Direct corpus interaction

Direct access to the raw corpus addresses a core problem in enterprise environments: data staleness. Embedding indexes are always a snapshot of a specific moment in time, and they take considerable compute and time to build and maintain.

"In many enterprise settings, the data is not a stable document collection. It is daily financial reports, live logs, tickets, code commits, configuration files, incident timelines, and internal documents that keep changing," the authors said. DCI lets the agent reason over the current state of the workspace rather than yesterday's vector index.

The agent operates in a terminal-like environment where its observations are raw tool outputs such as file paths, matched text spans, and surrounding lines. The core tools provided by DCI are few but highly expressive. Agents use commands like “find” and “glob” to navigate directory structures and locate files. For exact matching, they use “grep” and “rg” to locate specific keywords, regex patterns, and exact strings. When local inspection is needed, tools like “head,” “tail,” “sed,” “cat,” and lightweight Python scripts allow the agent to peek at the context surrounding a match or read specific file sections.

The agent can combine these tools via shell pipelines to execute complex search logic in a single step. An agent can pipe commands to enforce strict lexical constraints, such as searching a file for one term and piping the output to search for a second term. It can combine multiple weak clues across a corpus by finding a specific file type, searching for a keyword like "report," and filtering for a year like "2024." It can also immediately verify a hypothesis by inspecting the exact lines around a keyword match.
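
The chaining described above can be illustrated with a pure-Python stand-in for the shell tools. This is not the DCI codebase: the helper names are hypothetical, and each function only approximates the behavior of the command named in its docstring.

```python
import re
from pathlib import Path

Hit = tuple[Path, int, str]  # (file, 1-based line number, line text)


def find_files(root: str, pattern: str) -> list[Path]:
    """Rough analogue of `find root -name pattern`."""
    return sorted(p for p in Path(root).rglob(pattern) if p.is_file())


def read_lines(path: Path) -> list[str]:
    return path.read_text(encoding="utf-8", errors="replace").splitlines()


def grep(paths: list[Path], regex: str) -> list[Hit]:
    """Rough analogue of `grep -n regex paths...`."""
    compiled = re.compile(regex)
    return [(path, number, line)
            for path in paths
            for number, line in enumerate(read_lines(path), start=1)
            if compiled.search(line)]


def chained_search(root: str, file_glob: str, first: str, second: str) -> list[Hit]:
    """Keep files matching `first`, then search only those for `second`."""
    candidates = sorted({path for path, _, _ in grep(find_files(root, file_glob), first)})
    return grep(candidates, second)


def context(path: Path, line_number: int, radius: int = 2) -> list[str]:
    """Rough analogue of `grep -C radius`: the lines around one match."""
    lines = read_lines(path)
    start = max(0, line_number - 1 - radius)
    return lines[start:line_number + radius]
```

For example, `chained_search(root, "*.txt", "report", "2024")` combines the file type, the keyword, and the year from the paragraph above, and `context` then lets the agent check the lines around any hit before committing to a hypothesis.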

DCI delegates semantic interpretation directly to the agent instead of relying on embedding-based similarity search. The agent can formulate hypotheses, test exact lexical patterns, and extract detailed information that a traditional semantic retriever might miss.

The researchers propose two versions of this system. DCI-Agent-Lite is designed as a lightweight, low-cost setup built on the GPT-5.4 nano model and restricted purely to raw terminal interactions like bash commands and basic file reads. Because reading raw files can quickly fill up a smaller model's memory, this version relies on lightweight runtime context-management strategies to sustain long-horizon exploration.

DCI-Agent-CC is the higher-performance version, designed for teams with more compute budget. It runs on Claude Code powered by Claude Sonnet 4.6. Claude Code provides stronger prompting, more robust tool orchestration, and superior built-in context handling, which improves the agent's stability during complex, multi-step searches across heterogeneous datasets.

DCI in action

The researchers tested both versions of DCI across agentic search benchmarks like BrowseComp-Plus, knowledge-intensive QA with single-hop and multi-hop reasoning, and information retrieval ranking in tasks requiring domain-specific reasoning and scientific fact-checking.

They tested DCI against three baselines. The first included open-weight retrieval agents such as Search-R1 and proprietary agents powered by frontier models like GPT-5 and Claude Sonnet 4.6, paired with standard retrievers. The second baseline included classical sparse retrievers like BM25 and dense retrievers like OpenAI's text-embedding-3-large and Qwen3-Embedding-8B. The third baseline consisted of high-performing reasoning-oriented re-rankers like ReasonRank-32B and Rank-R1.

DCI systematically outperformed the baselines, according to the researchers. On the complex BrowseComp-Plus benchmark, swapping a traditional Qwen3 semantic retriever for DCI on a Claude Sonnet 4.6 backbone improved accuracy from 69.0% to 80.0% while reducing the API cost from $1,440 to $1,016. The return on investment for lightweight agents was also noticeable. DCI-Agent-Lite with GPT-5.4 nano competed with the OpenAI o3 model using traditional retrieval while cutting costs by more than $600.

On multi-hop QA benchmarks, DCI-Agent-CC reached an 83.0% average accuracy, improving on the strongest open-weight retrieval baseline by 30.7 points, according to the researchers.

The data shows that DCI has lower overall document recall than dense embedding models, but once it finds a relevant document, it extracts substantially more value from it.

"If an enterprise AI lead asked where DCI is most clearly useful, I would point to tasks that require exact evidence localization in a dynamic workspace: debugging production incidents, searching large codebases, analyzing logs, compliance investigation, audit trails, or multi-document root-cause analysis," the researchers note.

In one complex deep-research task, the agent had to identify a specific soccer match based on 12 interlocking clues, including exact attendance, yellow cards, and player birth dates. A traditional retriever would fail by surfacing short, disconnected snippets. Instead, the DCI agent explored the file directory, read specific lines of a 1990 England versus Belgium match report to verify the exact number of substitutions, pulled a specific quote from an interview file, and verified the exact birth dates of two players by peeking into their Wikipedia text files. By chaining these simple commands, DCI ensures that no evidence is permanently lost behind a flawed semantic search algorithm.

Limits and practical implementation of DCI

DCI has a clear operating envelope: it scales well in search depth but struggles with search breadth. When the experimental corpus was expanded from 100,000 to 400,000 documents, the system's accuracy dropped significantly and the average number of tool calls rose. While DCI is powerful once a promising document is found, the cost of locating that initial useful anchor document grows sharply as the size of the candidate space increases.

DCI also has lower broad document recall compared to dense embedding models. It trades exhaustive recall for high-resolution, local precision. If an enterprise workflow strictly requires finding every single relevant document across a massive dataset, DCI may not be the right tool.

Granting an agent expressive tools like an unrestricted bash shell increases latency and compute costs due to the high volume of iterative tool calls required to complete a search. It also creates significant context-management and security challenges for IT departments.

"Tool calls can return large outputs; long trajectories can fill the context window; and raw terminal access requires sandboxing, permission control, and careful engineering," the authors said. To manage the context window, the researchers found that moderate truncation and compaction help the agent sustain longer searches, whereas overly aggressive summarization tends to discard useful evidence.

Because of these operational realities, DCI is not meant to be a mandatory replacement for existing vector infrastructure. Instead, it serves as a complement to it.
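
One of those operational realities, keeping large tool outputs from flooding the context window, can be illustrated with a moderate-truncation helper. This is a sketch under assumptions: the function name, the default line limits, and the head-plus-tail strategy are illustrative choices, not the researchers' implementation.

```python
def truncate_tool_output(output: str, max_lines: int = 40, head: int = 25) -> str:
    """Keep the first `head` lines and the last `max_lines - head` lines."""
    if not 0 < head < max_lines:
        raise ValueError("head must be between 1 and max_lines - 1")
    lines = output.splitlines()
    if len(lines) <= max_lines:
        # Short outputs pass through untouched.
        return output
    tail = max_lines - head
    omitted = len(lines) - max_lines
    # Leave a visible marker so the agent knows evidence was cut and can
    # re-run a narrower command if it needs the missing lines.
    marker = f"[... {omitted} lines omitted ...]"
    return "\n".join(lines[:head] + [marker] + lines[-tail:])
```

Unlike summarization, this keeps the surviving lines verbatim, so exact strings such as error codes and file paths remain searchable.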

"For orchestration engineers and data architects, our view is that the most practical near-term deployment pattern is hybrid," the authors said. Semantic retrieval can still provide high-recall candidate discovery when a user's intent is broad or underspecified. "DCI can then operate as a precision and verification layer: the agent can search within the retrieved documents, expand from them into neighboring files, check exact constraints, and combine weak signals across documents."

The researchers have released the code for DCI under the permissive MIT license.
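The hybrid pattern the authors describe, broad retrieval first and exact verification second, can be sketched in a few lines. The example below is an illustration only and is not the released DCI code: the score function is a toy word-overlap stand-in for a dense embedding model, and the corpus, file names and function names are all hypothetical.

```python
import re

# Hypothetical toy corpus: file name -> contents.
CORPUS = {
    "contracts/acme_2024.txt": "Acme renewal signed 2024-03-01, net 45 payment terms",
    "contracts/acme_2023.txt": "Acme renewal signed 2023-03-01, net 30 payment terms",
    "notes/globex.txt": "Globex onboarding notes, payment terms pending",
}

def tokens(text):
    """Lowercased alphanumeric words in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def score(query, text):
    """Toy stand-in for semantic similarity: share of query words present."""
    q = tokens(query)
    return len(q & tokens(text)) / len(q)

def recall_stage(query, k=2):
    """Stage 1: broad, high-recall candidate discovery over the whole corpus."""
    ranked = sorted(CORPUS, key=lambda name: score(query, CORPUS[name]), reverse=True)
    return ranked[:k]

def precision_stage(candidates, pattern):
    """Stage 2: grep-style exact check, run only inside the retrieved documents."""
    return [name for name in candidates if re.search(pattern, CORPUS[name])]

candidates = recall_stage("acme payment terms")
verified = precision_stage(candidates, r"net 45")
print("candidates:", candidates)
print("verified:", verified)
```

The first stage narrows the corpus to two Acme contracts that look alike to a similarity score; the second stage checks the exact constraint and keeps only the 2024 contract. In a real deployment the second stage would be an agent issuing many such checks rather than a single regular expression.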

"Longer term, DCI changes how we think about enterprise data. Data will not only need to be stored for humans or indexed for search engines; it will need to be organized for agents that can inspect, compare, grep, trace, and verify," the authors conclude. "File names, timestamps, stable identifiers, metadata, version history, and machine-readable structure become part of the retrieval interface."

D&B's database of 642 million businesses was built for humans, not AI agents. So they rebuilt it.

Fri, 22 May 2026 13:00:00 GMT

Dun & Bradstreet has spent over 180 years building a comprehensive commercial database. Its Commercial Graph, covering 642 million businesses and their relationships, corporate hierarchies and risk profiles, was designed for people: credit analysts, risk managers and sales professionals who could wait for query results and work through ambiguous entity matches. AI agents can do neither.

When D&B's customers started pushing agents into credit, procurement and supply chain workflows, the Commercial Graph that had reliably served nearly 200,000 customers globally became a problem. The systems built to serve human analysts were the wrong architecture for machines. So D&B rebuilt.

"We need to think about agents as our new consumer category, evolving from our standard credit analysts or sales and marketing professionals, et cetera, to also now catering to these customers' agents," Gary Kotovets, Chief Data and Analytics Officer at Dun & Bradstreet, told VentureBeat.

What broke when agents started querying

The Commercial Graph was not a single database. It was a collection of separate systems built for different use cases and different markets, held together by custom integrations. Human analysts navigated that fragmentation through SQL queries or pre-built interfaces. Agents could not.

The scale of the underlying data compounded the problem. The database had nearly doubled in five years, expanding from more than 300 million to more than 642 million business records, with 11,000 fields per record, according to D&B. The firm now runs approximately 100 billion data quality checks per month as records move through its systems. Querying that at the sub-second latency agents require, against a fragmented architecture, was not workable.

The relationships the graph tracked were also the wrong kind. Legacy systems recorded static connections between entities. A CEO was linked to a company. That was the line. Agents working on credit assessments or third-party risk need dynamic relationships: when that CEO leaves for a new company, which organization does their track record follow? When a subsidiary changes ownership, how does that propagate across a corporate hierarchy? Those questions required custom analyst work before. Agents cannot wait for custom analyst work.
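The difference between a static link and a dynamic relationship can be shown with a small data structure. This is a hypothetical sketch, not D&B's schema: the Role record, its field names and the sample people and companies are invented for illustration. The idea is that each relationship carries the period during which it held, so a query can ask who led a company on a given date or follow a person's record across employers.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class Role:
    """A relationship with a validity period (hypothetical schema)."""
    person: str
    company: str
    title: str
    start: date
    end: Optional[date] = None  # None means the role is still current

ROLES = [
    Role("J. Doe", "Alpha Corp", "CEO", date(2018, 1, 1), date(2023, 6, 30)),
    Role("J. Doe", "Beta Ltd", "CEO", date(2023, 7, 1)),
]

def role_on(person, when):
    """Return the role a person held on a given date, or None."""
    for r in ROLES:
        if r.person == person and r.start <= when and (r.end is None or when <= r.end):
            return r
    return None

def history(person):
    """Return a person's roles in chronological order."""
    return sorted((r for r in ROLES if r.person == person), key=lambda r: r.start)

print(role_on("J. Doe", date(2020, 1, 1)).company)
print([r.company for r in history("J. Doe")])
```

A static model would store only the latest row and lose the fact that the same executive previously ran Alpha Corp, which is the track record a credit or risk agent would want to follow.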

The broader problem is not unique to D&B. Kotovets said he has spoken with hundreds of CDOs and CIOs over the past six months and consistently heard the same constraint: they could not build what they wanted in AI because their data foundations were not standardized, normalized or agent-queryable. D&B had that foundation, built over decades to serve human analysts. It still had to rebuild for agents.

What they actually built

The rebuild started with consolidation. D&B migrated its fragmented databases to cloud infrastructure, redesigned the underlying schema and built a data fabric layer that normalizes records across markets while preserving regional compliance requirements. The result is a unified knowledge graph that tracks billions of relationships across 642 million companies, continuously updated and enriched by AI-driven data processing.

On top of that graph, D&B built a structured access layer for agents. Raw SQL access at agent query volumes and latency requirements was not the answer. Instead, D&B created a set of tools and skills available through MCP that package data with context and route agents to the right records for specific queries. A match and entity resolution engine sits behind every query, confirming that when an agent asks about a company, the answer resolves to a verified, specific entity rather than a name match.
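To illustrate the difference between a name match and a resolved entity, here is a minimal sketch. It is not D&B's match engine: the registry, the entity_id values and the resolve function are hypothetical. The point is that a resolver returns one verified identifier or says explicitly that it cannot, rather than handing back the first record whose name looks right.

```python
# Hypothetical registry; identifiers and records are invented for illustration.
REGISTRY = [
    {"entity_id": "E-001", "name": "Acme Holdings", "country": "US"},
    {"entity_id": "E-002", "name": "Acme Holdings", "country": "GB"},
    {"entity_id": "E-003", "name": "Globex Inc", "country": "US"},
]

def normalize(name):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())

def resolve(name, country=None):
    """Resolve a company name to exactly one entity, or report why not."""
    matches = [
        r for r in REGISTRY
        if normalize(r["name"]) == normalize(name)
        and (country is None or r["country"] == country)
    ]
    if len(matches) == 1:
        return {"status": "resolved", "entity_id": matches[0]["entity_id"]}
    if not matches:
        return {"status": "not_found", "entity_id": None}
    # More than one record shares the name: refuse to guess.
    return {
        "status": "ambiguous",
        "entity_id": None,
        "candidates": [m["entity_id"] for m in matches],
    }

print(resolve("acme  holdings"))
print(resolve("Acme Holdings", country="GB"))
```

An agent that receives an ambiguous status can ask for more context, such as a country, instead of silently proceeding with the wrong company.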

D&B solved agent identity from both directions

Rebuilding the graph and adding MCP access solved the data retrieval problem. It did not solve the identity problem. Agents are not humans, and the authentication model built for human users did not extend to machines.

D&B built a new registration model for agents. Each agent must map to a verified IP address and register an individual access key, and is then treated as an authenticated identity in the same pipeline as a human user.

"We actually have a concept of Know Your Agent, similar to know your customer, that does those additional verifications," Kotovets said.

That handles the inbound problem: knowing which company an agent belongs to and what data it is entitled to query. But D&B also built for the outbound problem: what happens when a customer's own multi-agent workflow loses track of which company it is analyzing.

In a workflow that chains a credit check agent, a KYC agent and a third-party risk agent, each queries D&B at a different step. Without a mechanism to confirm they are all referencing the same entity, a workflow can complete while operating on divergent records.

"They have to come back to our verification agent to ensure that they're still talking to each other about the same entity," Kotovets said. "It's almost like a digital handshake, in a sense."

D&B's business verification agent can be embedded into any workflow as a persistent reference point and is available on Google's A2A protocol regardless of which orchestration tool a customer uses.
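The consistency check behind that handshake can be sketched simply. The code below is an assumption-laden illustration, not D&B's verification agent or the A2A protocol: the step records, agent names, entity_id values and the verify_same_entity function are hypothetical. Each step in the workflow records which entity it queried, and the workflow refuses to complete if the steps disagree.

```python
class EntityMismatch(Exception):
    """Raised when workflow steps reference different entities."""

def verify_same_entity(steps):
    """Return the shared entity ID, or raise if the steps diverge."""
    ids = {step["entity_id"] for step in steps}
    if len(ids) != 1:
        raise EntityMismatch(f"steps reference different entities: {sorted(ids)}")
    return ids.pop()

# Hypothetical workflow traces: one consistent, one that drifted.
consistent = [
    {"agent": "credit_check", "entity_id": "E-001"},
    {"agent": "kyc", "entity_id": "E-001"},
    {"agent": "third_party_risk", "entity_id": "E-001"},
]
divergent = [
    {"agent": "credit_check", "entity_id": "E-001"},
    {"agent": "kyc", "entity_id": "E-002"},
]

print(verify_same_entity(consistent))
try:
    verify_same_entity(divergent)
except EntityMismatch as err:
    print("blocked:", err)
```

Running the check at the end of the workflow catches drift; running it after every step catches it earlier, at the cost of more calls.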

Four things enterprises must get right before deploying AI agents

The rebuild exposed requirements that go beyond D&B's own stack.

Data foundations come before agent infrastructure. The CDOs and CIOs Kotovets spoke with over the past six months consistently hit the same wall: they cannot build what they want in AI until their data is clean, normalized and consolidated. D&B had that foundation already. Most enterprises do not, and they will feel it.
Design for dynamic relationships, not static ones. Enterprise data systems typically record point-in-time connections: a person belongs to a company, an asset belongs to a subsidiary. Agents working on credit, risk or supply chain decisions need to reason across relationships that shift over time. If the underlying data only captures the static line, the agent will too.
Build entity consistency checks into multi-agent workflows. When multiple agents touch the same entity at different steps, there is no guarantee they are all referencing the same record by the time the workflow completes. That gap needs to be engineered for explicitly. Entity verification is a workflow design requirement, not an optional guardrail.
Embed lineage from the start, not as an afterthought. Every agent-produced answer should carry a traceable path back to its source. In credit, risk and supply chain decisions, the cost of an error is concrete. Lineage needs to be built in before scaling, not added after problems surface.
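As a rough illustration of the lineage requirement, the sketch below attaches a source to every value and keeps links to the values it was derived from. It is a hypothetical structure, not D&B's implementation: the Fact class, its fields and the sample sources are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Fact:
    """A value plus where it came from (hypothetical structure)."""
    value: str
    source: str
    derived_from: List["Fact"] = field(default_factory=list)

def trace(fact, depth=0):
    """Return an indented, human-readable path back to the inputs."""
    lines = ["  " * depth + f"{fact.value} <- {fact.source}"]
    for parent in fact.derived_from:
        lines.extend(trace(parent, depth + 1))
    return lines

def original_sources(fact):
    """Return the sources of the underived facts an answer rests on."""
    if not fact.derived_from:
        return [fact.source]
    found = []
    for parent in fact.derived_from:
        found.extend(original_sources(parent))
    return found

filing = Fact("revenue: 12M", "annual filing (sample)")
payments = Fact("late payments: 0", "trade payment records (sample)")
answer = Fact("credit risk: low", "credit check agent", [filing, payments])

print("\n".join(trace(answer)))
print(original_sources(answer))
```

Because the links are stored with the answer rather than reconstructed later, any downstream agent or reviewer can walk from a conclusion back to the records it depends on.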

"You could always click and see where it came from, and validate it all the way back to the original source," Kotovets said. "That's been the key for us in unlocking a lot of other capabilities, because we have that level of certainty in the things that we've done."