Business | VentureBeat

Microsoft AI chief says company was “set free” from OpenAI to pursue superintelligence

michael.nunez@venturebeat.com (Michael Nuñez) — Fri, 05 Jun 2026 22:55:38 GMT

For three years, Microsoft's artificial intelligence story has been inseparable from OpenAI. The partnership — cemented by a cumulative investment exceeding $13 billion — gave Microsoft early access to the most advanced AI models on the planet, catapulting its Copilot products into the enterprise mainstream and adding hundreds of billions of dollars to its market capitalization. To the outside world, Microsoft's AI strategy was OpenAI.

Mustafa Suleyman wants to change that narrative.

In an exclusive sit-down interview with VentureBeat at Microsoft Build 2026, the CEO of Microsoft AI disclosed that a contractual change with OpenAI roughly six months ago granted his division the formal authority to pursue what he openly calls "superintelligence" — using Microsoft's own researchers, its own data pipelines, and its own custom silicon.

"We were only sort of set free from our contract with OpenAI about six months ago to formally pursue superintelligence," Suleyman said. "So this is very early days."

The comment, delivered matter-of-factly backstage at the Fort Mason Center here, offers the clearest signal yet of a strategic inflection point unfolding inside the world's most valuable public company. Microsoft is not abandoning OpenAI. But it is building something alongside it — and, eventually, something that could stand entirely on its own.

Microsoft's first in-house model family signals a new level of AI ambition

The most tangible evidence of that shift arrived the same day. Microsoft announced a family of seven new AI models developed entirely in-house by its AI Superintelligence Team, spanning reasoning, code generation, image creation, transcription, and voice synthesis. The models — branded under the "MAI" family name — are Microsoft's most ambitious first-party AI release to date.

The flagship, MAI-Thinking-1, is a 35-billion-active-parameter reasoning model that Microsoft says matches leading models in its weight class on key software engineering benchmarks and demonstrates advanced mathematical reasoning. Suleyman emphasized one point repeatedly: the model was trained from scratch on clean, commercially licensed data, without distillation from third-party frontier models — a direct, if unstated, contrast to the widespread industry practice of using outputs from competitors' systems to train cheaper alternatives.

"We train our reasoning models from scratch," Suleyman wrote in a blog post accompanying the announcement. "We don't distill from other labs and we don't rely on unlicensed or opaque data."

The rest of the family fills out a multimodal portfolio designed for enterprise deployment: MAI-Code-1-Flash, a lightweight coding model built specifically for GitHub Copilot and VS Code; MAI-Image-2.5, which supports both text-to-image and image editing; MAI-Transcribe-1.5, which Microsoft claims is the most accurate transcription model available, operating across 43 languages; and MAI-Voice-2, a multilingual speech-generation system. All of the models ship through Microsoft Foundry, the company's model-hosting and deployment infrastructure, and for the first time, developers can tune model weights themselves through third-party platforms including OpenRouter, Fireworks, and Baseten.

But Suleyman made clear in the interview that the seven models are a proof of concept, not a finished product. The real project is the lab itself.

"Our job is to make sure that when we look out to 2030 and beyond, we have the capacity not just to buy models from third parties, but to build the absolute frontier, the best models in the world," he said. "That's a long transition."

What "set free" from OpenAI actually means for Microsoft's AI future

To understand what Suleyman means by "set free," you need to understand the unusual contractual architecture that has governed Microsoft's AI efforts for years.

When Microsoft invested billions into OpenAI beginning in 2019, the partnership came with a specific arrangement: OpenAI would build the frontier models, and Microsoft would serve as the exclusive cloud provider, integrating those models into its products and reselling them through Azure. The deal gave Microsoft extraordinary commercial leverage (access to the world's most advanced AI without having to build it), but it also created a dependency. Microsoft was explicitly barred from pursuing its own AGI research, and the agreement even capped how large a model the company could train, restricting it from building systems beyond a certain training-compute threshold measured in FLOPs (total floating-point operations).

That arrangement was formally renegotiated. As Fortune and Axios reported in November, a revised deal with OpenAI removed those restrictions, clearing the way for Suleyman to launch the MAI Superintelligence Team and pursue what he calls "humanist superintelligence." The result, in Suleyman's telling at the time, was a "best-of-both environment, where we're free to pursue our own superintelligence and also work closely with them."

By the time he sat down with VentureBeat at Build 2026, roughly six months had passed since that self-sufficiency effort formally began. Microsoft had already started shipping in-house models — including MAI-Image-2-Efficient, a lighter-weight image generation model released in April — but the seven MAI models announced at Build are the team's most ambitious release yet: a full multimodal family spanning reasoning, code, image generation, transcription, and voice.

Even so, Suleyman does not view the shift as a rupture with OpenAI. He described Microsoft's current position as one of abundance, not scarcity.

"There's no immediate urgent need to fill a gap in three months' time or six months' time," he said. "We have OpenAI, we have Anthropic, we have thousands of models inside Foundry. So there's already a huge amount of optionality available to us."

The framing is telling. Microsoft's push into first-party frontier models is not born out of a crisis in the OpenAI relationship but out of a strategic calculation: as AI becomes the most consequential technology layer in enterprise computing, the company cannot afford to depend entirely on partners for the foundational capability. "Over the next five years, we have to be able to produce state-of-the-art frontier-scale models," Suleyman said. "That's our mission."

Suleyman says the shift from chatbots to autonomous AI agents has already begun

If the seven MAI models represent the technical ambition, a new capability called Frontier Tuning represents the commercial logic. Announced alongside the models at Build, Frontier Tuning allows enterprise customers to customize MAI models using their own proprietary data, workflows, and domain terminology, all within their own secure compliance boundary. The system uses reinforcement learning environments — what Microsoft calls "training gyms for AI" — that let agents learn directly from real workplace tasks without affecting production systems.

The results Microsoft shared are striking. An MAI model tuned for Excel reportedly matches GPT 5.4 performance while operating at up to ten times greater efficiency. Early enterprise adopters are seeing similar gains: when tuned for one unnamed organization's exacting standards, the MAI model achieved the highest win rate of any model tested at roughly one-tenth the cost.

Suleyman framed Frontier Tuning as part of a broader evolutionary stage — a move from intelligence to action. "We've basically moved beyond just conversation," he told VentureBeat. "Now we're moving to action."

He introduced a new framework for thinking about that progression: the shift from IQ (factual intelligence) to EQ (emotional intelligence, or the ability to follow tone and style instructions) to what he calls AQ — the "Actions Quotient."

Future AI agents, in Suleyman's telling, won't just answer questions. They will log into enterprise software, navigate complex multi-application workflows, and execute tasks across Excel, Word, Teams, Jira, Adobe InDesign, and customer relationship management systems — just as a human employee would.

"You should be able to show up on day one and almost provision credentials to a new AI agent," he said. "The model needs to be able to move across all of these different environments, and that's actually the great strength of Microsoft."

The Build 2026 announcements bore this out in concrete product terms. Microsoft Scout, the company's first "Autopilot" agent, operates as an always-on background assistant built on the open-source OpenClaw technology. It runs with its own governed identity inside Microsoft Entra, so its actions are auditable and attributable. Windows 365 for Agents gives AI agents their own managed Cloud PCs, allowing them to interact directly with applications and browsers inside enterprise environments. And the Foundry platform received major updates — including hosted agents with sub-100-millisecond cold starts, a new Microsoft Agent Framework, and one-click publishing to Teams and Microsoft 365 Copilot.

Why Microsoft believes enterprise data is the next AI training frontier

Suleyman also articulated why he believes Microsoft's position is uniquely defensible — and the argument has less to do with model architecture than with where work actually happens.

"We've sort of hoovered up all of the obvious pools of training data," he said, referring to the industry's early scramble to ingest the open web. "In the next phase, we actually want to be able to give these agents to companies to train on their specific tasks with the data that they have inside of their own big workflows."

The claim is subtle but consequential. The first wave of generative AI was trained on publicly available text — books, websites, Reddit posts, code repositories. That data is now largely exhausted, and its use is increasingly contested in court.

The next wave, Suleyman argues, will be trained on enterprise-specific data: the internal workflows, decision traces, and institutional knowledge that define how real organizations operate. Microsoft, which serves 493 of the Fortune 500 through Azure according to Suleyman, is already embedded inside those workflows through Microsoft 365, Teams, Dynamics 365, and the broader Azure ecosystem. Frontier Tuning is the mechanism that converts that positional advantage into model performance.

"People underappreciate that that's going to be the next domain," Suleyman said.

The early partner list for Frontier Tuning reflects the ambition: Mayo Clinic, where Microsoft is co-creating a frontier AI model for healthcare using de-identified clinical data; EY, which is tuning a tax-advisory agent for deployment to 75,000 professionals globally; Land O'Lakes, where Frontier Tuning delivered what the company's product development scientist called "meaningful improvements in grounded outputs and style compliance"; and Pearson, which is using tuned models to provide learning-science-aligned feedback in its Communication Coach product.

The Mayo Clinic partnership may be the most significant. Microsoft and Mayo Clinic are collaborating to build a healthcare-specific frontier model that combines Mayo's clinical expertise and longitudinal patient insights with Microsoft's AI capabilities. The model will be owned by Mayo Clinic and deployed first within Mayo's own environment before being made available to other organizations through Foundry.

Microsoft's custom AI chips and GPU buying spree reveal the scale of its compute advantage

None of this works without an industrial-scale compute infrastructure, and Suleyman was unusually candid about the hardware economics underlying Microsoft's strategy.

"We are the largest buyer of GPUs on the planet," he said. "We're the largest buyer of GB200s and GB300s in the world."

Microsoft will continue purchasing Nvidia accelerators "for many, many years to come," Suleyman said. But the company is simultaneously building its own custom silicon. Maia 200, Microsoft's second-generation AI accelerator, is already running in production across data centers in Iowa and Arizona, with deployments planned for Italy, Australia, and South Korea. According to Microsoft, Maia 200 delivers the best tokens-per-dollar-per-watt in the company’s fleet.

Suleyman put a finer point on the economics in the interview: Maia 200 is 30 percent more cost-efficient than Nvidia's GB200, he said. And when Microsoft co-optimizes its own MAI models to run natively on Maia silicon, the company sees an additional 1.4x improvement in performance per watt. "It is going to be cheaper in years to come to build on MAI models with Maia 200 and Maia 300 inside of Azure," he said.

That claim — if it holds at scale — has profound implications for the competitive landscape. It means Microsoft is not merely buying its way to AI dominance through Nvidia; it is building a vertically integrated stack in which its own models, running on its own chips, inside its own cloud, tuned on its customers' own data, could offer performance and cost characteristics that no competitor can replicate.

Suleyman rejects the idea that AI models are becoming commodities

Suleyman also pushed back sharply against one of the most popular narratives in Silicon Valley: that AI models are rapidly commoditizing.

"A lot of people are saying models are commoditizing," he said. "I don't think that's true."

His argument hinges on what he calls "quality tokens" — the proposition that the composition, curation, licensing, and deduplication of training data matter at least as much as raw scale. Microsoft's new MAI models, he said, were trained on a pre-training mix composed of approximately 50 percent high-quality code, with the remainder drawn from commercially licensed and carefully curated sources.

The result, he argued, is a distinct "lineage" of models optimized for coding, reasoning, and agentic behavior — fundamentally different from models optimized for consumer chat, cultural content, or multilingual breadth.

"We're going to see very distinct lineages that reflect different training objectives of different companies," he said. "Quality tokens matter more than just brute-force scale."

This is a strategically important argument for Microsoft to make. If models are commodities — if any lab can match the frontier within months using cheaper compute and distilled training data — then the model layer becomes a race to the bottom, and Microsoft's billions in compute investment offer no durable advantage. But if model quality is a function of data discipline, research depth, and institutional patience, then the lab-building approach Suleyman is pursuing becomes a genuine competitive moat.

He used a specific metaphor to describe that approach, one borrowed from optimization theory: the "hill-climbing machine." The phrase describes a system that continuously improves — cycle after cycle — by applying more compute, better data, and sharper evaluation. "The goal here is to build what we think of as a hill-climbing machine," he wrote in his blog post. "An organization that can continuously improve, cycle after cycle." The metaphor is revealing because it describes a process, not a destination. Suleyman is not promising that Microsoft will build the world's best model next quarter. He is arguing that Microsoft is building the system — the research culture, the data pipelines, the silicon co-optimization, the evaluation infrastructure — that will produce progressively better models over years.
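
For readers unfamiliar with the term, hill climbing in optimization theory means repeatedly trying a small change and keeping it only if it scores better. The sketch below is a generic textbook illustration, not a description of Microsoft's training process; the `score` function, step size, and iteration count are hypothetical values chosen for the example.

```python
import random


def hill_climb(score, start, step=0.1, iterations=1000, seed=0):
    """Greedy hill climbing on a single number.

    Each cycle proposes a small random change and keeps it only if
    the score improves, so the best score never decreases.
    """
    rng = random.Random(seed)  # fixed seed keeps the run repeatable
    best, best_score = start, score(start)
    for _ in range(iterations):
        candidate = best + rng.uniform(-step, step)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


def score(x):
    # Hypothetical objective with a single peak at x = 3.0.
    return -(x - 3.0) ** 2


best, best_score = hill_climb(score, start=0.0)
print(f"best x: {best:.3f}, score: {best_score:.6f}")
```

The loop has no fixed end state: it only guarantees that each accepted cycle is at least as good as the last, which is the property the metaphor leans on.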

Inside Microsoft's five-year plan to become a self-sufficient AI superpower

The strategic picture that emerges from Suleyman's comments — and from the full scope of the Build 2026 announcements — is of a company preparing for a future in which AI capability is not rented from a partner but generated internally, at scale, across every layer of the stack.

Microsoft still needs OpenAI. The partnership continues to power Copilot, Azure AI services, and ChatGPT's infrastructure. Suleyman acknowledged as much, describing Microsoft's portfolio of model providers as a source of strength, not a problem to be solved.

But the direction of travel is unmistakable. With its own frontier models, its own custom silicon, its own reinforcement learning environments for enterprise tuning, and its own autonomous agent infrastructure, Microsoft is constructing a parallel path — one that, by 2030, could make the company a fully self-sufficient frontier AI lab embedded inside the world's largest enterprise software platform.

"Our ultimate goal is what we call Humanist Superintelligence," Suleyman wrote in his blog post. "That means advanced AI systems designed to serve people and organizations, not replace them."

Whether that goal is achievable — or even clearly definable — remains one of the great open questions in technology. And Suleyman expressed more confidence than caution when asked about the trajectory of progress. "I really think we're at the tip of the iceberg," he said. "The models are so much more powerful than we know how to extract intelligence from them."

But confidence and execution are different things. Building a frontier lab is not an announcement; it is a decade-long commitment that requires retaining elite researchers, maintaining scientific rigor under commercial pressure, and producing results that justify the staggering capital expenditure.

Google learned this with DeepMind — which Suleyman himself co-founded in 2010, before joining Microsoft — and even that lab, widely regarded as one of the best in the world, spent years navigating the tension between pure research and product delivery.

Suleyman seemed aware of the contradiction. "If you rush it, you'll screw it up," he said.

The sticker on his laptop reads: "Patience and urgency." It is a paradox that Microsoft now has five years — and several hundred billion dollars — to resolve.

Microsoft debuts Surface RTX Spark Dev Box to run large AI models without cloud costs

michael.nunez@venturebeat.com (Michael Nuñez) — Tue, 02 Jun 2026 16:30:00 GMT

Microsoft on Monday unveiled the Surface RTX Spark Dev Box, a compact desktop computer designed to let software developers run large AI models on their desks instead of paying for cloud computing — a move that directly challenges the per-token pricing model that has defined the AI industry's economics since ChatGPT launched three and a half years ago.

The device, announced at Microsoft Build 2026, packs Nvidia’s new Blackwell-architecture RTX Spark processor and 128 gigabytes of unified memory into a small-form-factor chassis, delivering what Nvidia rates at one petaflop of AI compute. In practical terms, that means a developer can load, run and interact with AI models exceeding 120 billion parameters without sending a single API call to the cloud.

"These class of devices, we think, will get to about 100 billion parameter model running," Pavan Davuluri, Microsoft's executive vice president of Windows and Devices, said during a press briefing ahead of the event. He emphasized that raw model size is only part of the equation: "The model size is one thing, but for the model to be effective, it kind of needs to be able to have enough context, because a larger model, you feed it larger context." At 100,000 tokens of context, he noted, the key-value cache alone can consume 40 to 50 gigabytes of memory — which is precisely why Microsoft and Nvidia engineered the device around a 128-gigabyte unified memory pool shared dynamically between the CPU and GPU.

The machine will be available later this year in the United States, sold exclusively through Microsoft.com. The company did not disclose pricing.
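
The memory figures Davuluri cited can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: the layer count, key-value head count, head dimension, 16-bit cache precision, and 4-bit weight quantization are hypothetical values chosen for the example, not specifications of any model Microsoft or Nvidia named.

```python
GB = 10**9  # decimal gigabytes


def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one key vector and one value vector
    # per layer, per KV head, per token.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def weight_bytes(params, bits_per_weight):
    """Bytes needed to hold the model weights at a given precision."""
    return params * bits_per_weight // 8


# Hypothetical model shape (assumption, not a published spec).
layers, kv_heads, head_dim = 96, 8, 128
tokens = 100_000  # context length cited in the briefing

kv_gb = kv_cache_bytes(tokens, layers, kv_heads, head_dim) / GB
weights_gb = weight_bytes(120 * 10**9, bits_per_weight=4) / GB
total_gb = kv_gb + weights_gb

print(f"KV cache: {kv_gb:.1f} GB")      # KV cache: 39.3 GB
print(f"Weights:  {weights_gb:.1f} GB") # Weights:  60.0 GB
print(f"Total:    {total_gb:.1f} GB")   # Total:    99.3 GB
```

Under these assumptions the cache lands near the low end of the 40-to-50-gigabyte range Davuluri described, and cache plus weights fit inside a 128-gigabyte pool with room left for activations and the operating system. Different model shapes or precisions would move these numbers substantially.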

Why Microsoft is betting that AI's future runs on fixed costs, not cloud meters

The Surface RTX Spark Dev Box arrives at a moment when the economics of AI development have become a boardroom-level concern. Companies large and small are grappling with cloud GPU bills that scale unpredictably: every fine-tuning run, every inference call, every agentic workflow that loops through a frontier model accumulates cost. For a developer iterating rapidly on a prototype — running the same model dozens or hundreds of times a day — those charges compound fast.

Microsoft is framing the Dev Box as a release valve for that pressure. Andrew Hill, corporate vice president of Surface, wrote in the announcement blog post that the device "changes that equation" by letting developers "reserve frontier model calls for truly frontier problems and handle the rest on their own hardware." The pitch is not that cloud computing is obsolete, but that much of the work currently being sent to remote data centers does not require state-of-the-art models and would be better served by capable local hardware with predictable, fixed costs.

This is a significant strategic shift for Microsoft, a company that derives tens of billions of dollars in annual revenue from Azure cloud services. By selling hardware that explicitly reduces customers' cloud dependency, Microsoft is acknowledging a tension that has been building across the industry: the marginal cost of AI inference at scale is unsustainable for many teams, and the market is demanding alternatives. The bet appears to be that developers who prototype locally will still deploy to Azure when they need to scale — and that owning both ends of that workflow is more valuable than owning only the cloud.
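
The fixed-cost argument reduces to a break-even calculation. Microsoft has not disclosed pricing, so every number in the sketch below (hardware price, daily token volume, cloud price per million tokens, electricity rate) is a hypothetical placeholder; only the roughly 100-watt power draw echoes a figure from the briefing.

```python
def breakeven_days(hardware_cost, tokens_per_day,
                   cloud_price_per_million_tokens,
                   local_power_cost_per_day=0.0):
    """Days until a one-time hardware purchase beats metered cloud use.

    Returns None when local running costs meet or exceed the cloud
    bill, since the hardware then never pays for itself.
    """
    daily_cloud = tokens_per_day / 1_000_000 * cloud_price_per_million_tokens
    daily_saving = daily_cloud - local_power_cost_per_day
    if daily_saving <= 0:
        return None
    return hardware_cost / daily_saving


# Hypothetical inputs: a $4,000 machine, 20 million tokens a day,
# $2 per million tokens in the cloud, 100 W around the clock at $0.15/kWh.
power_cost_per_day = 0.1 * 24 * 0.15
days = breakeven_days(4000, 20_000_000, 2.0, power_cost_per_day)
print(f"Break-even after about {days:.0f} days")
```

The calculation ignores depreciation, developer time, and the quality gap between local and frontier models, so it is a framing device rather than a purchasing guide.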

Inside the 128GB unified memory architecture that makes local AI possible

The technical architecture of the Dev Box reflects a set of deliberate engineering choices aimed at sustained, not peak, performance — a distinction that matters enormously for AI workloads that can run for hours.

At the center is Nvidia’s RTX Spark system-on-chip, which combines an ultra-efficient ARM-based CPU with a Blackwell-generation RTX GPU. In a traditional Windows PC, Davuluri explained during the briefing, this configuration would require four separate components: a CPU, a discrete GPU, dedicated graphics memory and system RAM. The RTX Spark collapses all of that into a single chip paired with a single unified memory pool.

That unification is the critical design decision. Conventional gaming laptops with high-end Nvidia GPUs top out at roughly 24 gigabytes of GPU-accessible memory. The Dev Box's 128 gigabytes of unified memory — accessible to both the CPU and GPU through what Nvidia calls its Unified Memory Access architecture — is what makes it possible to load models that would otherwise require cloud GPU instances with specialty high-bandwidth memory configurations.

Microsoft did substantial work at the operating system level to exploit this architecture. The company implemented new memory management logic in Windows that raises the ceiling on how much system memory the GPU can address, introduces smarter page-size allocation for shared memory regions and ensures that heavy GPU workloads do not starve the CPU of the resources it needs for multitasking. The Windows scheduler was also optimized for RTX Spark's heterogeneous core layout, routing demanding workloads to performance cores while keeping efficiency cores available for background tasks.

How a 3D-printed aluminum chassis doubles as a heatsink

The thermal design is equally deliberate. The Dev Box operates within an approximately 100-watt sustained thermal envelope — modest by desktop standards, but meaningful for a device intended to run training jobs and inference workloads continuously. The aluminum chassis itself is engineered to function as a passive heatsink, and the method Microsoft used to build it is among the most striking details about the machine.

The top panel is manufactured using metal 3D printing, a process that enables internal geometries too complex for conventional CNC machining or injection molding. The perforations are not simple through-holes; they are angled in multiple directions around the internal fan to optimize airflow from cold-air intake through heat dissipation. During the press briefing, Harry, a Surface industrial designer, explained the rationale: "The complexity is something other manufacturers wouldn't be able to do, like CNC, or like any molding, because of the complexity of shape."

When asked whether 3D printing would constrain mass production, the designer acknowledged the challenge but suggested Microsoft had developed a process robust enough to scale. The result is a machine that runs quietly enough for an open office while sustaining the kind of continuous GPU workloads that would throttle most conventional desktops of similar size. For a device that Microsoft expects developers to leave running overnight on fine-tuning jobs, quiet sustained performance is not a luxury — it is a requirement.

A developer-first setup that eliminates hours of configuration

Microsoft is shipping the Dev Box with Windows 11 Pro pre-configured at the image level for development work — a detail that sounds minor but reflects a growing recognition that the out-of-box experience for developer hardware has historically been poor.

The machine boots into a dark theme with a simplified taskbar, widgets removed and Do Not Disturb enabled. Developer Mode is turned on. PowerShell 7 is the default shell. WSL 2 — the Windows Subsystem for Linux — comes pre-installed with GPU passthrough and CUDA support already configured. Visual Studio Code, GitHub Copilot, Git, Python and Node.js are all installed and ready.

"We've said, 'Hey, you know what, we got you, you want to go fast,'" a Microsoft engineer who demonstrated the configuration during the briefing told VentureBeat. The philosophy, he explained, is that developers were going to install all of these tools anyway — the friction was in the hours of setup and configuration that stood between unboxing a machine and writing the first line of code.

The Dev Box also ships with integration points across Microsoft's AI stack: AI Toolkit for VS Code for model conversion and fine-tuning, Windows ML and Windows Copilot Runtime for local inference, and Microsoft Foundry for connecting local prototypes to cloud deployment pipelines. For enterprises, the device integrates with Entra ID and Intune for identity and device management, and includes Secured-core PC architecture, BitLocker encryption and Microsoft Defender.

Why Apple's Mac Mini may not be the real competition anymore

The most obvious competitive comparison is Apple's Mac Mini, which has dominated the compact-desktop category and has been widely adopted by developers drawn to Apple Silicon's unified memory architecture and power efficiency.

Davuluri addressed the comparison directly during the briefing, saying the Dev Box is "in a different class of performance than Mac Minis, intentionally." He declined to share specific benchmarks, noting that detailed specifications and performance targets would come closer to the fall launch. But the architectural advantage Microsoft is claiming is clear: while the current Mac Mini with M4 Pro tops out at 48 gigabytes of unified memory and the M4 Max configuration reaches 128 gigabytes, the RTX Spark Dev Box pairs its 128 gigabytes with a Blackwell-class GPU that has a fundamentally different CUDA-based compute model — one that the vast majority of the AI/ML ecosystem's tooling (PyTorch, TensorRT, llama.cpp, Hugging Face frameworks) is already optimized for.

That CUDA ecosystem advantage is difficult to overstate. While Apple's Metal framework has made progress, the overwhelming majority of AI training and inference frameworks are built and tested first against Nvidia’s CUDA stack. A developer running models on the Dev Box can use the same code, the same libraries and the same workflows they would use on a cloud GPU instance — a level of portability that Apple Silicon cannot currently match.

From laptop to supercomputer: Microsoft's three-tier plan for local AI hardware

The Dev Box is one piece of a three-tier hardware strategy Microsoft laid out at Build. The Surface Laptop Ultra, announced days earlier at Computex, brings the same RTX Spark silicon into a 15-inch laptop form factor for developers and creators who need portability. At the other end of the spectrum, the DGX Station for Windows — built on Nvidia's GB300 Grace Blackwell Ultra Superchip — targets organizations that need to run frontier models up to one trillion parameters on a deskside system. That machine is expected in the fourth quarter of this year.

The three devices map to a tiered computing model that Microsoft is calling "unmetered intelligence": small on-device language models (the company's new Aion 1.0 family) handle lightweight tasks at zero marginal cost; RTX Spark-class hardware runs mid-range models locally for the bulk of development work; and cloud resources are reserved for genuinely frontier-scale problems.

The GitHub Copilot CLI is getting a concrete implementation of this model with a new feature called /fleet, which allows a cloud-based primary agent to build a plan, assess the complexity of each task and route appropriate subtasks to a local model running on the developer's hardware. The cloud agent handles what requires frontier capability; the local model handles what does not. The result, in theory, is lower cost without lower quality.
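
Microsoft has not published how /fleet scores or routes work, but the general idea of complexity-based routing can be sketched in a few lines. Everything here is an assumption made for illustration: the `Subtask` type, the 0-to-1 complexity score, and the `local_threshold` cutoff are hypothetical and do not reflect the actual Copilot CLI implementation.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    description: str
    complexity: float  # hypothetical score: 0.0 (trivial) to 1.0 (frontier-hard)


def route(subtasks, local_threshold=0.6):
    """Split planned subtasks between a local model and a cloud model.

    Tasks at or below the threshold stay on local hardware;
    anything harder is sent to the cloud agent.
    """
    plan = {"local": [], "cloud": []}
    for task in subtasks:
        target = "local" if task.complexity <= local_threshold else "cloud"
        plan[target].append(task.description)
    return plan


# Hypothetical plan produced by a cloud-based primary agent.
tasks = [
    Subtask("rename a variable across one file", 0.1),
    Subtask("write unit tests for a helper function", 0.4),
    Subtask("redesign the service's concurrency model", 0.9),
]
plan = route(tasks)
print(plan["local"])
print(plan["cloud"])
```

In this toy run the two routine edits stay local and only the architectural change goes to the cloud. The hard part in practice is the scoring step, which this sketch takes as given.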

The real question is whether hybrid AI can shift from buzzword to business model

Whether Microsoft's bet pays off depends on questions that will take months to answer. How does the Dev Box actually perform under sustained, real-world workloads? What will it cost? Will the open-source model ecosystem keep producing capable models in the 70-to-120-billion-parameter range that fit within its memory envelope? And perhaps most critically: will enterprise procurement teams, trained to think of AI as a cloud line item, accept a capital expenditure on desk hardware as an alternative?

The strategic logic, however, is difficult to dismiss. For three years, the AI industry has operated on an implicit assumption: serious AI work happens in the cloud, and the economics of that arrangement are simply the cost of doing business. Microsoft, a company with every incentive to reinforce that assumption, is now selling a machine that undermines it. That is not a contradiction — it is a recognition that the market is moving, and that the company that controls the developer's local environment and the cloud they deploy to has a more durable advantage than one that controls only the cloud.

Every dollar a developer does not spend on cloud inference is a dollar that can fund another experiment, another iteration, another prototype. For years, the AI industry told developers they needed to rent their intelligence by the token. Microsoft is now asking a different question: what if you could just buy it?

Zip’s new AI agents want to stop your finance team from uploading contracts into personal ChatGPT accounts

michael.nunez@venturebeat.com (Michael Nuñez) — Tue, 02 Jun 2026 12:00:00 GMT

Zip, the AI procurement platform valued at $2.2 billion, announced two products on Monday that mark a turning point in its evolution from procurement software to autonomous AI platform: a suite of five AI "Superagents" that can review contracts, code invoices, and negotiate with vendors inside Zip's governance framework, and a procurement-native implementation of the Model Context Protocol (MCP) that pipes Zip's data directly into AI assistants like Claude and ChatGPT — without sacrificing audit trails or compliance controls.

The announcements, unveiled at Zip's AI Summit in New York with speakers from Anthropic, OpenAI, Datadog, and Humana, arrive at a moment when the procurement technology sector has become one of the fiercest battlegrounds in enterprise AI. SAP unveiled its "Autonomous Enterprise" vision at Sapphire 2026 just weeks ago, introducing more than 50 domain-specific Joule Assistants across finance, supply chain, and procurement. Coupa launched its own Compose platform, an environment for building and orchestrating AI agents across procurement, at Inspire 2026 in Las Vegas in May, along with its Catalyst services bundle, a forward-deployed engineering services offering. And Gartner predicts that 40% of enterprise applications will include task-specific AI agents by the end of 2026, up from less than 5% today.

What makes Zip's approach distinct — and what makes it a potentially important test case for the broader enterprise AI market — is not the agents themselves, but where they run and what constrains them.

Why procurement teams are uploading sensitive financial data into personal AI accounts

The announcement centers on an enterprise anxiety that procurement chiefs increasingly describe in private but rarely say publicly: their employees are already using AI for sensitive financial work; they're just doing it in unmonitored personal accounts.

Across the enterprise, employees are uploading spend data into Claude to analyze it, redlining sensitive contracts inside ChatGPT, and generating internal financial analyses in personal Gemini or Copilot accounts. Every time they do, sensitive enterprise data leaves systems where every action is controlled and audited, entering environments with no oversight, no compliance controls, and no record of what was done.

The consequences of getting this wrong are not hypothetical. SOX violations carry fines of up to $25 million. Executives can face prison time. Public companies that fail compliance audits can be delisted from the stock exchange. And when an auditor asks six months later how a decision was made, no one can produce a record.

"After working with hundreds of enterprises — including the world's leading AI companies — we've learned that this kind of work is already happening, with or without governance," said Lu Cheng, Co-Founder and CTO at Zip. "Even the companies building AI themselves want this work governed."

Zip's CEO Rujul Zaparde put a finer point on it in an interview with VentureBeat, describing the competitive dynamics that make procurement an unusually high-stakes domain for AI governance. "Most enterprises don't operate on a single procurement platform," Zaparde said. "They're running SAP as their ERP, Coupa for some sourcing, ServiceNow for IT requests, contract management tools for legal, risk and compliance platforms for vendor due diligence, and a long tail of point tools alongside them."

He argued that this fragmentation gives Zip, as the orchestration layer connecting all of those systems, a unique advantage: "AI can only be as good as the data it has access to. Because Zip sits above all of these tools, with visibility into each, and orchestrates the entire procurement process from request to payment, its AI can take action across the full procurement workflow in ways point solutions cannot."

Inside the five Superagents Zip built to automate procurement's hardest bottlenecks

Zip is launching five Superagents, each targeting a specific pressure point in the procurement lifecycle. A Procurement Superagent unblocks stalled requests and manages tail-spend negotiation. A Legal Superagent reviews and redlines contracts against company-approved playbooks. An AP Superagent sorts, codes, matches, and routes invoices. A Config Superagent identifies workflow bottlenecks and drafts configuration changes for admin review. And an Intake Superagent guides employees through compliant request creation, routing purchases to the right buying channel and nudging toward preferred suppliers.

The five agents are not standalone services. Zip's engineering blog reveals the architectural philosophy underlying them: all agents at Zip — pre-built and custom — run on a shared execution engine built within the company's App Studio workflow automation platform. They differ only in configuration: the prompt that defines behavior, the tools they can access, and the format of their output. Zip's engineering team describes this as a "Lego block" model — the out-of-the-box agents are finished models; custom agents are whatever enterprises choose to build from the same components.

Under the hood, the agent architecture uses a four-node LangGraph state graph — preprocessing, orchestration, final synthesis, and post-processing — that separates information gathering from response generation. The orchestration node contains a ReAct (Reason + Act) agent that autonomously decides which tools to call: document retrieval via vector search, structured API data from purchase requests and contracts, or company-specific policy context from a reference library.

This separation is deliberate. As Zip's engineering team explains, conflating research and synthesis into a single LLM call would mean asking one model to be both a diligent researcher and an eloquent writer simultaneously. Separating them allows Zip to optimize each independently — including using different model tiers for each.
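The four-stage split can be illustrated with a minimal sketch in plain Python. This is not Zip's code and does not use LangGraph itself; the `AgentState` fields, the tool names, and the canned tool outputs are all hypothetical stand-ins, and the "reasoning" step is reduced to picking the next unused tool where a real agent would consult a model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AgentState:
    """State passed from node to node; every field name here is hypothetical."""
    question: str
    evidence: List[str] = field(default_factory=list)
    answer: str = ""
    trace: List[str] = field(default_factory=list)


# Stand-in tools. A real system would call vector search, purchase-request
# APIs, and a policy library; these just return canned strings.
TOOLS: Dict[str, Callable[[str], str]] = {
    "document_search": lambda q: f"doc snippet about {q}",
    "structured_api": lambda q: f"request record for {q}",
    "policy_library": lambda q: f"policy clause covering {q}",
}


def preprocess(state: AgentState) -> AgentState:
    state.question = state.question.strip()
    state.trace.append("preprocess")
    return state


def orchestrate(state: AgentState, max_steps: int = 5) -> AgentState:
    """Toy reason-then-act loop: pick the next unused tool, call it, observe."""
    state.trace.append("orchestrate")
    used = set()
    for _ in range(max_steps):
        # "Reason": a real agent would ask a model which tool to call next.
        remaining = [name for name in TOOLS if name not in used]
        if not remaining:
            break  # nothing left to gather, so stop acting
        tool = remaining[0]
        # "Act", then record the observation as evidence.
        state.evidence.append(TOOLS[tool](state.question))
        used.add(tool)
    return state


def synthesize(state: AgentState) -> AgentState:
    """Writes the answer from gathered evidence only; it calls no tools."""
    state.trace.append("synthesize")
    state.answer = " | ".join(state.evidence)
    return state


def postprocess(state: AgentState) -> AgentState:
    state.trace.append("postprocess")
    state.answer = state.answer[:500]  # trim to a fixed output budget
    return state


PIPELINE = [preprocess, orchestrate, synthesize, postprocess]


def run(question: str) -> AgentState:
    state = AgentState(question=question)
    for node in PIPELINE:
        state = node(state)
    return state
```

Because `orchestrate` and `synthesize` are separate functions that communicate only through the shared state, each could in principle be backed by a different model tier without touching the other, which is the property the engineering team describes.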

What differentiates Zip's agents from the slew of procurement AI announcements from SAP, Coupa, and others is the governance architecture. Every Superagent action is governed by the same roles, permissions, and controls that apply to human employees. High-impact steps like system updates and approvals use deterministic logic rather than LLM inference. And every action generates a complete audit trail.
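A rough sketch of that governance pattern, under stated assumptions, might look like the following. The role table, the `approve_purchase` action, and the 10,000 threshold are invented for illustration and are not taken from Zip's product; the point is only that the permission check and the high-impact gate are ordinary deterministic code, and that every call writes an audit entry whatever the outcome.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Set

# Hypothetical role-to-permission table shared by humans and agents.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "ap_clerk": {"code_invoice"},
    "procurement_manager": {"code_invoice", "approve_purchase"},
}

# Actions treated as high-impact in this sketch; decided by fixed rules.
HIGH_IMPACT: Set[str] = {"approve_purchase"}
APPROVAL_LIMIT = 10_000  # hypothetical threshold, not a Zip setting


@dataclass
class AuditEntry:
    timestamp: str
    actor: str
    role: str
    action: str
    amount: float
    outcome: str


AUDIT_LOG: List[AuditEntry] = []


def execute(actor: str, role: str, action: str, amount: float) -> str:
    """Run one action through permission checks and record the outcome."""
    if action not in ROLE_PERMISSIONS.get(role, set()):
        outcome = "denied"
    elif action in HIGH_IMPACT and amount > APPROVAL_LIMIT:
        # Plain comparison, no model call: the same input always
        # produces the same result.
        outcome = "needs_human_review"
    else:
        outcome = "executed"
    # The entry is written for every path, including denials.
    AUDIT_LOG.append(AuditEntry(
        timestamp=datetime.now(timezone.utc).isoformat(),
        actor=actor, role=role, action=action, amount=amount, outcome=outcome,
    ))
    return outcome
```

In this sketch an agent acting under the `ap_clerk` role is refused an approval outright, while one acting as `procurement_manager` is routed to human review once the amount crosses the limit. Both events land in `AUDIT_LOG`.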

What happens when an AI agent misclassifies a $150,000 contract

Zaparde shared a specific error case from beta testing to illustrate how Zip's human-in-the-loop design handles real-world failures. "Our Intake Superagent flagged a $150K marketing services contract as a standard SaaS subscription," he said. "But because every Superagent action hits a human-in-the-loop checkpoint before it executes, the procurement team caught the misclassification before it went anywhere. They corrected the category, the right approvers were routed in, and the GL coding flowed through accurately downstream."

The error-and-correction anecdote is revealing because it highlights the tension at the heart of every enterprise AI deployment: these systems will make mistakes, and the question is whether the surrounding infrastructure catches them before they cause damage.

Zaparde was direct when asked who bears liability if a Superagent triggers a compliance failure: "Customers remain accountable for their procurement decisions, the same way they would be with any vendor or business process. That's standard across enterprise software. Payroll vendors don't take on liability for misclassified employees, ERP vendors don't take on liability for misstated financials, and the same principle applies to AI-augmented work."

But he was equally emphatic that the design goal is to make the liability question moot. "Zip's Superagents are designed so this scenario shouldn't happen in the first place. They don't operate outside governance, they operate inside it. Every action is auditable, every high-impact step is gated by human review, and the audit trail makes it possible to demonstrate compliant decision-making to auditors and regulators."

The Superagents are currently in beta, with general availability expected this summer. Zip has been deploying AI agents in procurement since 2024, and today more than 50 are live across hundreds of enterprise customers. Northwestern Mutual alone saved 1,400 hours from a single AI agent. Superagents represent the next evolution — more reasoning, more cross-system action, more autonomy — all inside Zip's governance layer.

When asked what percentage of agent actions require human escalation, Zaparde said there's no single number because every agent handles a different type of task, but added: "In finance and procurement specifically, we deliberately err on the side of escalation any time a transaction touches risk thresholds, policy compliance, legal requirements, budget guardrails, or governance rules. That's a deliberate design choice, not a limitation."

How Zip's procurement-native MCP could reshape where enterprise AI actually runs

The second announcement may prove more consequential for the broader enterprise AI market. Zip MCP is a vendor-hosted implementation of the Model Context Protocol — the open standard originally created by Anthropic in November 2024 and later donated to the Linux Foundation, with MCP SDK downloads reaching 97 million per month by March 2026, a 970x increase in 18 months.

A fundamental challenge has limited MCP's enterprise adoption. Organizations deploying MCP run into a predictable set of problems: audit trails, SSO-integrated auth, gateway behavior, and configuration portability. MCP itself doesn't yet natively solve for the governance requirements that regulated industries and compliance-sensitive functions like procurement demand.

Zip is attempting to solve this from the application layer. Its MCP server connects Zip's procurement platform directly to any MCP-compatible AI assistant. An employee researching vendors in Claude, for instance, can have Zip proactively surface a request submission from that conversation. Power users can pull aggregated reporting across suppliers, requests, invoices, and payments from within a single AI conversation. Every action respects user permissions through OAuth, runs inside Zip's compliance controls, and generates a complete audit trail. Zip claims this is the first time MCP has been implemented natively for enterprise procurement.

The claim matters because procurement is arguably the most governance-sensitive business function where MCP could deliver immediate value: it involves financial commitments, legal contracts, regulatory compliance, and supplier data that touch SOX, GDPR, and dozens of other regulatory frameworks.

When asked what happens to sensitive data once it reaches a third-party model's context window, Zaparde was direct: "MCP is tied to an authenticated user, and the same role-based permissions that apply inside Zip apply through MCP as well — meaning MCP can only retrieve information the user is already authorized to see." He added that Anthropic and OpenAI operate as Zip subprocessors, governed by data processing agreements with Zero Data Retention provisions, so "data flowing through MCP isn't used for model training, and it's protected by enterprise-grade controls at both ends of the connection."
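The permission model Zaparde describes, where the assistant can only retrieve what the authenticated user could already see, can be sketched as a filter applied before any data leaves the platform. The record shape, the department-based access rule, and the function name below are assumptions made for illustration; they do not reflect Zip's MCP server or the MCP specification.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class Record:
    record_id: str
    department: str
    summary: str


# Hypothetical data store and per-user department access.
RECORDS: List[Record] = [
    Record("PR-1", "marketing", "Agency retainer renewal"),
    Record("PR-2", "engineering", "Cloud hosting contract"),
    Record("PR-3", "marketing", "Event sponsorship"),
]
USER_DEPARTMENTS: Dict[str, Set[str]] = {
    "alice": {"marketing"},
    "bob": {"marketing", "engineering"},
}


def fetch_for_assistant(user: str, query: str) -> List[Record]:
    """Return only matching records the authenticated user may already see."""
    allowed = USER_DEPARTMENTS.get(user, set())  # unknown users see nothing
    q = query.lower()
    return [
        r for r in RECORDS
        if r.department in allowed and q in r.summary.lower()
    ]
```

Here a search for "contract" returns the engineering record for `bob` but nothing for `alice`, even though the same query reaches the same store. The filtering happens on the server side, so the assistant never receives the record it would otherwise have to be trusted not to reveal.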

The companies building AI chose Zip instead of building their own procurement tools

Zip's customer list for these announcements is impressive but still developing. Block, UCI Health, and Snowflake are the named launch customers for AI Spend Automation, the premium enterprise offering that bundles platform access, AI consumption credits, and Zip's forward-deployed engineers.

UCI Health reported $20 million in cost avoidance from a single IT infrastructure project. Zaparde explained the methodology: "The $20 million came from a single IT infrastructure project at UCI Health where their procurement team used AI-powered benchmarking to enter vendor negotiations with real market data rather than internal assumptions alone." He was careful to frame it as a collaborative result: "UCI Health's procurement team did the negotiating and the AI gave them the benchmarks to do it well."

Zip claims its broader customer base has saved more than $10 billion through its AI suite. Zaparde said that figure "includes direct cost reductions through better vendor negotiations, time savings from automating manual procurement workflows, risk reduction through avoided fines and compliance penalties, and indirect spend savings from improved renewal management." A Forrester Total Economic Impact study modeled a 386% ROI for large enterprises using Zip, showing that on average, the platform pays for itself in under six months.

But the customer stories that matter most for Zip's strategic narrative are its relationships with the companies whose models power its own agents. OpenAI has deployed more than 10 AI agents on Zip's platform. Anthropic, whose Claude model Zip uses and whose engineers created MCP, more than doubled its procurement volume through Zip while keeping headcount flat.

The fact that both companies chose to buy rather than build is arguably Zip's strongest competitive proof point: if the organizations with the most AI engineering talent on earth decided the procurement governance problem wasn't worth solving internally, it suggests the moat is real. Beyond AI, the customer list spans T-Mobile, Dollar Tree, Canva, and Prudential — large, regulated enterprises where compliance failures carry material consequences.

"When the companies building AI choose Zip rather than build it themselves, that tells you something about the moat," Zaparde said.

SAP, Coupa, and the intensifying AI arms race in enterprise procurement

Zip's announcements don't happen in a vacuum. The enterprise procurement AI market is experiencing a rapid convergence as every major platform races to embed agentic capabilities.

SAP introduced more than 50 domain-specific Joule Assistants at Sapphire 2026, orchestrating a subset of over 200 specialized agents to execute precise tasks. SAP has even launched a Joule Agent in the SAP Ariba Intake Management solution that captures and routes procurement requests and connects to existing procurement systems, a move that reaches directly into Zip's core territory. Coupa CEO Leagh Turner has argued her platform's foundation sets it apart, saying that while others are "bolting AI onto aging systems," Coupa has one platform that scales with governance. Coupa says it has deployed more than 20 specialized agents, and its $10 trillion dataset of historical transactions gives it a training data advantage that Zip cannot match.

Zaparde's counter-argument rests squarely on Zip's position as an orchestration layer rather than a point solution. "No matter how powerful those individual tools are, their AI is necessarily limited to the data inside each of their own systems," he said. "Our moat is the orchestration layer and the AI agents built on top of it: agents that are uniquely able to reason and act across multiple systems and reconcile their data as a whole where needed." He pointed to Zip's recognition as a Leader in the first-ever IDC MarketScape for Spend Orchestration as evidence that the category itself has been validated.

The argument carries a strategic vulnerability, however, that Zaparde was asked about directly: Zip's leading AI-company customers are also its model providers and potential competitors. What happens if Anthropic or OpenAI builds procurement tooling?

"The mistake is assuming procurement is fundamentally a model problem," Zaparde responded. "Even if an LLM could perfectly understand a contract or negotiate with a vendor, it still needs to operate within company policies, approval chains, supplier relationships, ERP systems, and audit requirements. That context layer is what Zip has spent the past six years building. We see the model providers as accelerating what's possible, while we focus on making that intelligence operational within the enterprise."

Why Zip is trading SaaS margins for forward-deployed engineers and AI credits

The AI Spend Automation offering raises questions about Zip's evolving business model. Bundling platform access, AI consumption credits, and forward-deployed engineers who build and deploy custom agents inside customer environments carries a strikingly different margin profile from traditional SaaS, and it's a model that Coupa, with its own new Catalyst services offering, is also now pursuing.

Zaparde was transparent about the tradeoff: "Yes, it is a different margin profile than pure SaaS, and we're okay with that. Right now, our priority is adoption and proving value for customers. We believe that if we get the outcomes right, the economics follow. Companies that rush to protect margins before they've demonstrated real value end up with neither. We're playing the long game."

Zip is valued at $2.2 billion as of its October 2024 Series D round, the largest investment in procurement technology in over two decades. The company has raised approximately $371 million since its founding in 2020 and counts among its investors Y Combinator, BOND, DST Global, Tiger Global, and CRV.

The deepest technical signal in Monday's announcement may be what it reveals about the infrastructure moat Zip is building beneath its agents. The company's engineering team recently published detailed architecture for its internationalization system — a pipeline that uses LLM-based translation with glossary enforcement, Kafka change data capture, and a dedicated Redis caching cluster to translate user-generated content across multinational enterprise customers in real time.

The system uses a technique called "lazy persistence," where translations are initially stored with a one-week TTL and only promoted to permanent storage when a user actually reads them. This kind of deeply procurement-specific infrastructure — designed to support AI agents that operate across languages, jurisdictions, and regulatory regimes — takes years to build, not quarters, and no general-purpose AI tool can replicate it with a better model alone.
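The "lazy persistence" idea can be shown with a small in-memory sketch. The class and method names are hypothetical, and two dictionaries stand in for the Redis cache and the permanent store the article mentions; only the one-week TTL and the promote-on-read rule come from the description above.

```python
import time
from typing import Callable, Dict, Optional, Tuple

ONE_WEEK_SECONDS = 7 * 24 * 60 * 60


class LazyTranslationStore:
    """Short-lived cache plus permanent store; promotion happens on first read."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock  # injectable so expiry can be exercised in tests
        self._cache: Dict[str, Tuple[str, float]] = {}  # key -> (text, expires_at)
        self._permanent: Dict[str, str] = {}

    def put(self, key: str, text: str) -> None:
        """Store a fresh translation with a one-week expiry."""
        self._cache[key] = (text, self._clock() + ONE_WEEK_SECONDS)

    def read(self, key: str) -> Optional[str]:
        if key in self._permanent:
            return self._permanent[key]
        entry = self._cache.get(key)
        if entry is None:
            return None
        text, expires_at = entry
        if self._clock() >= expires_at:
            del self._cache[key]  # expired without ever being read
            return None
        self._permanent[key] = text  # a user read it, so keep it
        del self._cache[key]
        return text

    def is_permanent(self, key: str) -> bool:
        return key in self._permanent
```

The design choice this illustrates is that storage cost tracks what users actually read rather than everything that was ever translated: a translation nobody opens within a week simply disappears.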

The real product Zip is selling is the audit trail

The central question for Zip — and for every enterprise software company racing to embed agentic AI into regulated workflows — is whether governance-first AI agents will actually earn the trust of procurement teams that have spent decades building manual controls for very good reasons. The regulatory stakes are real: SOX fines, criminal liability for executives, stock exchange delisting for companies that fail compliance audits. When an auditor shows up and asks how a purchasing decision was made, someone has to produce a paper trail.

That is ultimately the bet Zip is making with Superagents and MCP. Not that AI can do procurement work — at this point, that's table stakes — but that AI can do procurement work and leave a record that will satisfy an auditor two years from now. In a market flooded with companies promising autonomous agents, Zip is wagering that the most valuable thing an AI can produce isn't a decision. It's proof that the decision was made correctly.

Zip MCP and Zip Superagents are available in beta today, included with all core Zip products, with general availability expected this summer. Zip AI Spend Automation is available now for enterprise customers.

Mistral AI launches Vibe, expands into industrial AI and announces data center push to challenge OpenAI

michael.nunez@venturebeat.com (Michael Nuñez) — Thu, 28 May 2026 20:54:16 GMT

Mistral AI used its inaugural conference on Wednesday to announce a sweeping expansion into industrial manufacturing, a new inference data center south of Paris, and a rebranding of its consumer-facing assistant — moves that collectively signal the three-year-old French startup's ambition to become the enterprise AI provider of record for companies that refuse to hand their most sensitive data to American hyperscalers.

At the AI NOW Summit, held at a venue in central Paris, co-founder and CEO Arthur Mensch took the stage alongside CTO Timothée Lacroix and Chief Scientist Guillaume Lample to lay out a strategy that stretches from bare-metal GPU clusters to physics simulations for aircraft wings. The company disclosed that it now employs 1,000 people and is targeting €1 billion ($1.17B USD) in revenue for 2026, a figure that, if achieved, would cap an extraordinary growth trajectory for a company that began with 15 employees collaborating with its first customer, BNP Paribas, in 2023.

"We have two convictions at Mistral," Mensch told the audience. "The first is that in order to deploy AI in the enterprise, you actually need, as an AI provider, to own the full stack." He described Mistral's business as fundamentally about "transforming electrons into tokens and intelligence," arguing that physical infrastructure control matters as much as model quality.

The announcements come at a pivotal moment for Mistral and for the broader European AI ecosystem. The company has raised at least $3.9 billion across nine funding rounds, according to Clay's funding tracker, including a massive €1.7 billion Series C led by Dutch semiconductor equipment maker ASML in September 2025 at an €11.7 billion valuation, and an $830 million debt financing round in March 2026 from a consortium of seven banks to fund data center construction. Mistral now finds itself in a peculiar competitive position: too large to be dismissed as a research lab, but still dwarfed by the resources of OpenAI, Google DeepMind, and Anthropic.

Its answer, articulated across nearly an hour of presentations Wednesday, is vertical depth — going industry by industry, workflow by workflow, and building the infrastructure to keep everything on premises.

Why Mistral is betting that physics AI will reshape how Airbus and BMW design products

The centerpiece announcement was Mistral for Industrial Engineering, a fully integrated AI stack that combines Mistral's large language models with physics simulation capabilities acquired through its purchase of Emmi AI, completed earlier in May 2026. The platform targets the aerospace, automotive, and semiconductor industries with tools for accelerating product design, validating simulations, and optimizing production.

The launch came with headline partnerships. Mistral announced it is working with Airbus across its commercial aircraft, helicopter, defense, and space divisions, implementing AI from initial design through to on-board capabilities. For BMW Group, Mistral is serving as a central partner for what the automaker calls its "Large Industry Model" initiative, focused on multimodal reasoning models for crash simulation and other complex engineering tasks. ASML, already Mistral's largest shareholder, is also an early adopter.

Mensch framed the industrial push as addressing a fundamental gap in how AI is currently deployed. "AI is great today at automating tasks for knowledge workers and for people that are doing software engineering," he told the summit audience. "But once you move to all the kind of engineers, well, they are underserved."

The reason, he explained, is structural. Simulating the behavior of a wing or a factory process requires compute-intensive physics solvers that can take hours or weeks per design variant. Traditional simulation creates a bottleneck that makes AI-assisted iteration impractical.

Mistral's answer is what it calls "physics AI" — data-driven models trained on solver outputs that can predict physical behavior in seconds rather than hours, running on a single GPU. As Mistral's own blog post on the technology acknowledges, physics AI is "not a replacement for first-principles solvers in every regime" — it is a throughput accelerator for the majority of design-loop iterations, with traditional solvers reserved for verification and edge cases.
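The division of labor described here, a fast data-driven model for most iterations and a first-principles solver for verification, can be illustrated with a toy surrogate. Nothing below reflects Mistral's models: the "solver" is a deliberately slow numerical integral, the surrogate is simple linear interpolation over precomputed solver outputs, and all names are hypothetical.

```python
from bisect import bisect_right
from typing import Callable, List


def slow_solver(x: float, steps: int = 2_000) -> float:
    """Stand-in for an expensive solver: midpoint-rule integral of t**2 on [0, x]."""
    h = x / steps
    return sum(((i + 0.5) * h) ** 2 for i in range(steps)) * h


def build_surrogate(
    solver: Callable[[float], float], lo: float, hi: float, samples: int
) -> Callable[[float], float]:
    """Sample the solver on a grid once, then answer queries by interpolation."""
    xs: List[float] = [lo + (hi - lo) * i / (samples - 1) for i in range(samples)]
    ys: List[float] = [solver(x) for x in xs]  # the only expensive step

    def predict(x: float) -> float:
        if not lo <= x <= hi:
            # Outside the training range the surrogate has no basis to answer.
            raise ValueError("outside the sampled range; use the solver instead")
        # Index of the right-hand grid point of the interval containing x.
        i = max(1, min(bisect_right(xs, x), samples - 1))
        x0, x1 = xs[i - 1], xs[i]
        t = (x - x0) / (x1 - x0)
        return ys[i - 1] + t * (ys[i] - ys[i - 1])

    return predict


def screen_then_verify(candidates: List[float], target: float) -> float:
    """Rank candidates with the surrogate, then verify the winner with the solver."""
    surrogate = build_surrogate(slow_solver, 0.0, 2.0, 21)
    best = min(candidates, key=lambda x: abs(surrogate(x) - target))
    return slow_solver(best)  # full-cost run reserved for the final check
```

The surrogate refuses inputs outside the range it was trained on rather than extrapolating, which mirrors the caveat that such models are not a replacement for solvers in every regime.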

"We now have both the language intelligence and the physical intelligence models, and by combining them together we are building delegation loops that allow us to create better tools, that allow us to create better objects that actually have an impact on the physical world," Mensch said.

The ASML partnership offered a concrete illustration. In a video testimonial shown at the summit, an ASML representative described how the company's lithography machines run around the clock at customer fabrication plants, and field service engineers need to diagnose issues as rapidly as possible. By combining ASML's internal engineering expertise with Mistral's models, "we were able to develop a solution that's 120 times faster with a similar accuracy as we have today," the representative said. Another ASML speaker described AI agents acting as "an always-on code reviewer" to catch software defects before they reach customers.

Inside Mistral's €4 billion infrastructure gamble to build Europe's most powerful AI data centers

Mistral's full-stack ambitions extend all the way down to the physical layer. Launched in June 2025, Mistral Compute is a €4 billion ($4.66B USD) investment in data centers in France and Sweden, with a stated roadmap of 200 MW of capacity by 2027 and 1 GW by 2030.

Lacroix described the company's existing 40 MW facility at Bruyères-le-Châtel, south of Paris, which was built in collaboration with Eclarion and has been training models since early 2026. "It's been very interesting to see how we can transfer rigor, which is one of our company values, into down to the hardware layer," he said, describing the process of "fixing compute trays and fixing fibers, allowing us to reach the very best speeds possible on that hardware for training."

On Wednesday, Mistral announced a new 10 MW facility at Les Ulis in the Essonne department, also south of Paris, dedicated to inference operations and scheduled to open in Q3 2026. Lacroix also referenced a site in Borlänge, Sweden, planned for development through 2027, which will host NVIDIA's next-generation Vera Rubin GPUs. "One of the benefits for us of owning the hardware layer is also that it lets us be at the very bleeding edge of what infrastructure provides," he told the audience.

The infrastructure push is funded in part by the $830 million debt financing round announced in March 2026, which Clay's funding tracker attributes to a consortium of seven banks: Bpifrance, BNP Paribas, Crédit Agricole CIB, HSBC, La Banque Postale, MUFG, and Natixis CIB. And this infrastructure ownership is not merely a hedge against GPU scarcity — it is central to Mistral's pitch to security-conscious enterprise and government customers. The company's February 2026 acquisition of serverless platform Koyeb has been integrated into Mistral Studio to support both hosted and on-premises deployments, giving customers a choice between running inference on Mistral's hardware or their own.

"More and more, the compute world has been getting supply constrained," Lacroix told the audience. "One of the reasons we've been doing all of this and developing all of this data center capacity is to secure compute capacity not only for ourselves but also for our customers."

Le Chat is dead, long live Vibe: How Mistral's new agent platform takes aim at enterprise productivity

In a consumer-facing rebrand with significant enterprise implications, Mistral announced that Le Chat — its conversational AI assistant launched in February 2024 — is being renamed Vibe and reimagined as a unified agent platform for enterprise productivity and software development.

"We are transitioning Le Chat to the Vibe family," Lacroix told the audience, explaining that the evolution was driven by the growing power of agentic models, particularly the new Mistral Medium 3.5. As the team used Vibe's coding CLI internally with increasingly complex tasks, "we realized that this really didn't need to be bound to the CLI, it didn't need to be limited to code, and we could do a lot more with it," he said.

Vibe encompasses two primary modes. Vibe for Work is a web and mobile agent that connects to enterprise tools — Google Workspace, Outlook, SharePoint, Slack, GitHub — to perform multi-step tasks such as summarizing emails, analyzing spreadsheets, drafting reports, and scheduling recurring workflows. Vibe for Code is a coding agent available through a web interface, a new VS Code extension, and the existing CLI, capable of building features, fixing bugs, refactoring code, and shipping pull requests. Critically, the same underlying agent powers both modes. "When you access it through our web app or through the CLI, you have access to the same connections, the same tools, the same understanding of who you are, what you do, and what you're trying to achieve," Lacroix said.

Pricing starts at free for basic use, $14.99 per month for Pro, $24.99 per user per month for Teams, and custom pricing for Enterprise deployments. Alongside Vibe, Mistral also launched Search Toolkit, an open-source framework for building production search pipelines already in use by shipping giant CMA CGM, which uses it alongside Voxtral to process audio from multiple data sources and return alerts within 15 seconds.

Mistral's model strategy signals a new phase: fewer products, more capabilities per model

Chief Scientist Guillaume Lample used his portion of the keynote to describe a philosophical shift in Mistral's model strategy: consolidation of capabilities into fewer, more versatile models rather than maintaining separate specialized products.

Mistral Medium 3.5, the company's current flagship, absorbs capabilities that previously required distinct models. Pixtral (image processing), Magistral (reasoning), and DevStral (coding) have all been deprecated as standalone products, with their capabilities folded natively into Medium 3.5. "Now all our models are natively multimodal," Lample said. "We no longer have Magistral. This model is deprecated, because all our models will natively be doing reasoning."

The company is also working on Mistral Large 4, which Lample said would arrive "in a couple of months at most, during the summer," with expanded capabilities in industrial applications such as fluid dynamics, computational chemistry, computer-aided design, and cybersecurity. On the smaller end of the spectrum, Lample highlighted Mistral OCR, a 1-billion-parameter OCR model that can process thousands of pages per minute on a single GPU, and the Voxtral speech model family, which has expanded from automatic speech recognition to include text-to-speech with voice cloning. A "duplex" model for real-time conversational speech is planned for release within months.

Lample also made the case for open-weight models becoming more — not less — important in the agentic era. "Today we are building these agentic workflows, these models are running in the background, they are doing a lot of actions, a lot of tool calls, so they are extremely token-hungry, much more than before," he said. "What we are seeing today is actually a comeback of this small model and the efficient model." Upcoming models will be trained on more than 200 languages, a multilingual strength now powering a partnership with Amazon to improve non-English interactions on Alexa+.

How Mistral's enterprise playbook stacks up against OpenAI and Anthropic

Mistral's positioning stands in sharp contrast to the strategies of its most prominent American rivals. While OpenAI and Anthropic have each attracted hundreds of millions of consumer users and derive significant revenue from subscription products, Mistral has leaned almost entirely into enterprise and government deployments. As TechCrunch reported in March when Mistral announced its Forge customization platform at Nvidia GTC, CEO Mensch has described the company as being "on track to surpass $1 billion in annual recurring revenue" — a figure driven largely by corporate clients.

The Forge platform, which lets enterprises train custom models on their own data rather than simply fine-tuning or applying retrieval-augmented generation to existing models, represents the foundation on which the company's industry-specific solutions are built. As Mistral's head of product, Elisa Salamanca, told TechCrunch, Forge "lets enterprises and governments customize AI models for their specific needs." Early partners include Ericsson, the European Space Agency, Italian consulting company Reply, and Singapore's DSO and HTX, alongside ASML.

Mistral has also built an expanding network of systems integration partnerships to drive enterprise adoption. In February 2026, Accenture and Mistral announced a multi-year strategic collaboration, with Accenture itself becoming a Mistral customer. Mauro Macchi, Accenture's CEO for Europe, Middle East, and Africa, said at the time that the partnership brings together "sovereign models and the capability to scale technology across industries, geographies and business functions."

The BNP Paribas relationship offers the most detailed public case study. In a video testimonial at the summit, a BNP Paribas representative described deploying Mistral's models on-premises to satisfy strict security requirements, developing AI agents for KYC processes that reduced incomplete files from 80% to 10% and compressed processing time from weeks to days. The bank's LLM platform at its Corporate and Institutional Banking division has now rolled out to 65,000 users. Mensch noted the significance: "We started to collaborate in 2023 where we were 15 people, so that was, I think, really a leap of faith at the time."

The industrial vertical is also being extended to government clients. Mistral disclosed that it is working with France, Luxembourg, Singapore, Morocco, Greece, and Slovakia to build citizen-facing AI services — from deploying agents that help job-seekers through France Travail to building models that understand Moroccan Darija and Amazigh languages. "We think that AI needs to be specialized and understand structural nuances," Mensch told the audience. "It needs to speak languages as good as it speaks English."

The road ahead for Europe's most ambitious AI company

For Mistral, Wednesday's announcements amount to a declaration that the company intends to compete not by matching American AI giants on any single dimension, but by assembling capabilities none of them are willing or able to offer in combination: open-weight models, owned infrastructure, on-premises deployment, physics simulation, and deep vertical customization — all under a single roof.

The strategy demands execution on multiple fronts simultaneously, each requiring enormous capital and specialized talent. The competition is formidable and accelerating. OpenAI has been rapidly expanding its enterprise offerings. Anthropic, backed by billions from Amazon, is building its own corporate AI practice. Google, Microsoft, and Amazon all offer AI platforms deeply integrated with cloud infrastructure that most enterprises already use.

But Mistral is wagering that the world's most consequential AI deployments — the ones governing how aircraft get designed, how banks process compliance, how governments interact with citizens — will ultimately go to providers that offer sovereignty over data, models, and compute. "AI is too strategic to be left in the hands of a few," Mensch said, echoing the conviction he described from Mistral's founding three years ago.

Three years in, the company that started as a Paris research lab with a handful of employees now trains models in its own data centers, simulates physics for the manufacturers that build the world's planes and cars, and is rewriting its assistant into an agent that can file your pull requests and summarize your inbox in the same conversation. Whether that sprawling ambition coheres into a durable business or stretches Mistral too thin is the €11.7 billion ($13.6B USD) question. The 1,000 people now working there are betting that in enterprise AI, owning the full stack is not a liability — it is the product.

DataGrail report finds your vendor may be sending data to AI models you never approved

michael.nunez@venturebeat.com (Michael Nuñez) — Wed, 27 May 2026 16:00:00 GMT

The data processing agreement (DPA) — the bedrock contract companies use to evaluate how vendors handle personal data — can no longer be trusted at face value. That is the central, and arguably most alarming, conclusion of DataGrail's Privacy and AI Trends Report 2026, released today.

The San Francisco-based privacy platform analyzed 2,400 popular business software providers and found that 63.6% of vendors that prominently advertise AI capabilities do not disclose a third-party AI subprocessor in their legal documentation. The implication: the majority of companies purchasing AI-enabled software may be unknowingly exposing their customers' data to AI models and pipelines they never reviewed, never approved, and may not even know exist.

"All software vendors are trying to move to become AI vendors, which makes sense, but the technologies are moving faster than AI governance can actually keep up," DataGrail co-founder and CEO Daniel Barber told VentureBeat in an exclusive interview ahead of the report's release. "The DPA should be the reliable document that teams use to evaluate AI risk, but based on that number, that's not enough in 2026."

The finding drops into an enterprise landscape where organizations with high levels of shadow AI already experience average breach costs of $4.63 million — $670,000 more than those with low or no shadow AI, according to IBM's 2025 Cost of Data Breach Report. And it follows a year in which U.S. states issued $3.425 billion in privacy-related fines — more than in the previous five years combined — a trend Gartner expects to accelerate through 2028.

How researchers uncovered the growing gap between AI vendor contracts and reality

DataGrail's methodology for arriving at the 63.6% figure goes well beyond reading contracts. The company's research team cross-referenced DPA disclosures against product documentation, GitHub environments, API connections, and marketing materials for each of the 2,400 vendors in its tracking universe.

Barber walked VentureBeat through the process: "We looked at the DPA as the baseline, but then what we also looked at is the GitHub environment, the API connections that a particular vendor has, the product documentation, the marketing documentation, and triangulate that information to discern — okay, so the DPA document says use OpenAI, but actually you've got these three AI subprocessors over here in your product documentation outlining features and functionality, but that is not reflected in your DPA."

When asked directly about how confident he was that these gaps represent actual shadow AI risk rather than vendors using proprietary technology, Barber was unequivocal. "Very confident, because we looked at the sample of the 2,400 systems, and we spent a substantial amount of time actually looking at product documentation, GitHub environments, looking at actual API connections, because we integrate with these systems as well, so we know how they process personal information. It is from primary research."

The disclosure gap matters because it undermines the entire chain of trust that privacy programs rely on. Consider a scenario Barber described: A company invests in an AI recruiting tool. The tool's DPA lists Claude as its foundational model. The company dutifully performs a security review of Anthropic's AI. But the recruiting tool also quietly uses OpenAI and Gemini behind the scenes — models the company never evaluated.

Those undisclosed models then process thousands of resumes and execute automated hiring decisions. The company, without knowing it, has exposed sensitive personal information — home addresses, financial data, possibly Social Security numbers — to AI systems it never vetted, potentially violating FTC regulations on automated decision-making in employment. "How those vendors are evaluating and performing that automated decision making could be really disastrous for a business," Barber said.
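The cross-referencing Barber describes comes down to a set comparison: what the DPA lists versus what other evidence shows. The sketch below is a minimal illustration of that idea, not DataGrail's actual tooling. The function name, the source labels, and the sample data are all hypothetical, modeled on the recruiting-tool scenario above.

```python
def undisclosed_subprocessors(dpa_listed, observed_by_source):
    """Map each provider seen in the evidence but missing from the DPA
    to the evidence sources where it was observed."""
    disclosed = {name.strip().lower() for name in dpa_listed}
    gaps = {}
    for source, providers in observed_by_source.items():
        for provider in providers:
            key = provider.strip().lower()
            if key not in disclosed:
                gaps.setdefault(key, []).append(source)
    # Sort for stable, readable output.
    return {provider: sorted(sources) for provider, sources in sorted(gaps.items())}


# Hypothetical data: the DPA names one provider, other evidence shows three.
dpa = ["Anthropic"]
evidence = {
    "product_docs": ["Anthropic", "OpenAI"],
    "api_connections": ["OpenAI", "Google"],
}

gaps = undisclosed_subprocessors(dpa, evidence)
print(gaps)
# {'google': ['api_connections'], 'openai': ['api_connections', 'product_docs']}
```

Recording which source surfaced each provider matters in practice: a name that shows up only in marketing copy is weaker evidence than one observed in a live API connection.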

One-third of AI systems also process sensitive data, and the true number is likely higher

The disclosure gap alone would be concerning enough. But DataGrail's report layers on another finding that makes the problem materially worse: 32.8% of AI systems that disclose AI capabilities also disclose at least one other high-risk activity, such as processing sensitive personal information or powering automated decision-making. Among AI systems with self-reported risk factors, 47.1% process personal data, 20.7% have the potential to power automated decision-making, 16.5% process sensitive data categories like health or financial information, and 7.5% process biometric data.

The report argues these figures almost certainly undercount actual exposure, since they reflect only what vendors have formally disclosed. Vendors could underreport access to personal data, and the inherent flexibility of AI means even good-faith vendors might not predict riskier user applications of their tools.

This has immediate regulatory implications. The CCPA's new risk assessment rule, effective January 1, 2026, requires businesses to conduct and document risk assessments for processing activities that present significant privacy risks — and will require submission to CalPrivacy by April 2028, with executive attestation under penalty of perjury.

Processing sensitive personal information with AI, or using AI for automated decision-making, are precisely the activities that trigger this obligation. The report finds that 42% of companies abandoned AI initiatives in 2025 with data privacy concerns cited as a primary obstacle — a statistic sourced to S&P Global research. Privacy teams that engage early with AI projects, Barber argues, can prevent that waste by ensuring safeguards are in place before launch, with AI risk assessments serving as the right starting point.

Why consent management became 2025's most punished privacy failure

While shadow AI is still a newer category of threat, the report makes clear that traditional privacy challenges have not eased — they have intensified. Consent management was the busiest enforcement topic of 2025. California alone publicly reported $4.3 million in CCPA consent settlements, and 2025 saw over 1,400 class action wiretapping suits driven by private firms investigating tracking pixels and session replay software.

Despite this enforcement wave, 63% of the 5,000 websites DataGrail audited still fail to comply with universal opt-out mechanisms such as the Global Privacy Control signal. While that figure represents an improvement from 75% non-compliance in 2023, the pace of improvement is slow relative to the acceleration in enforcement.

Barber pointed to the case of Todd Snyder, the menswear retailer that the California Privacy Protection Agency fined $345,178 in May 2025, as evidence that enforcement is no longer reserved for big tech. "This is a business that has two or three stores across the U.S. They have 300 employees," he said. "They run tight margins because they're a consumer menswear clothing store."

The California Attorney General also reached a $2.75 million settlement with Disney over failures to honor opt-out signals, while the California Privacy Protection Agency has brought enforcement actions against PlayOn Sports and Ford — a pattern that demonstrates both the breadth and depth of regulatory activity. Among the trackers that fire even after a user sends a GPC signal, the report found that 27.1% come from Google Analytics and 43.8% are for targeted advertising via platforms like Meta and Microsoft.

For users who do engage with consent banners, 48.3% click "Accept all," while only 12.4% select "Essential only" and 2.3% customize their preferences. A full 37% simply exit the banner without making a selection. The practical takeaway: fewer than 15% of users make a conscious choice to opt out of tracking, which means consent banners present relatively low business risk when properly configured — but enormous regulatory risk when they are not.

Data deletion requests surge 567% as the cost of manual processing hits $1.5 million a year

Data subject request volume hit an all-time high for the fifth consecutive year. Deletion requests have surged 567% since 2021 and now represent 87% of all data subject requests. Access requests, by contrast, have gradually declined as consumers skip visibility and reach straight for the delete button.

The cost is staggering. For a mid-sized organization receiving 5 million annual web visitors, the report estimates manual DSR management now runs approximately $1.5 million per year, based on Gartner's estimated cost of $1,524 per manual DSR. The average cost has climbed from $238,000 in 2021 to $1.51 million in 2025 — a trajectory that makes manual processing not just inefficient but, as the report argues, "irresponsible."
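A quick back-of-the-envelope check shows what those cost figures imply about request volume. The sketch below assumes the $1,524 per-request estimate applies unchanged to both years, which the article does not state; the report may have used different unit costs for 2021.

```python
COST_PER_MANUAL_DSR = 1_524  # USD, the Gartner estimate cited above


def implied_request_volume(annual_cost_usd, cost_per_request=COST_PER_MANUAL_DSR):
    """Requests per year implied by an annual cost at a flat per-request cost."""
    return annual_cost_usd / cost_per_request


requests_2025 = implied_request_volume(1_510_000)
requests_2021 = implied_request_volume(238_000)
cost_growth = 1_510_000 / 238_000

print(round(requests_2025))   # 991
print(round(requests_2021))   # 156
print(round(cost_growth, 1))  # 6.3
```

Under that assumption, the $1.51 million figure corresponds to roughly a thousand manually handled requests a year, or about 6.3 times the 2021 cost.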

Barber emphasized that these numbers reflect verified human requests with bot and spam traffic excluded, and that data broker scenarios — which will see their own massive influx of requests under California's Delete Act — are reported separately. "That is a natural increase," Barber told VentureBeat. "If you've now got 20-plus U.S. states with privacy regulation, it's unlikely that we see a federal bill passed, even though we've seen one proposed. And while we don't see federal awareness and regulation, we do see at the state level over 20 states, and that may actually increase awareness for the consumer even more."

He added a telling detail about how businesses are responding in practice: "99% of DataGrail customers do process that deletion" even for residents of states without privacy laws, "simply because it's too hard at this point. Discerning and even communicating to the person, 'Hey, you live in Montana, sorry, you're just in an unfortunate state without regulation' — you just can't do that." Data brokers felt the impact most acutely, with a 398% increase in deletion requests compared to 2024 and an average of over 2,000 deletion requests handled per month.

State regulators issued $3.4 billion in privacy fines last year, and both parties want more

The regulatory landscape underpinning all of these trends has fundamentally shifted from education to punishment. Nearly half of U.S. states now have a comprehensive privacy law in effect, plus over 160 AI-specific laws. State legislatures enacted 145 AI-related laws in 2025 alone, with another thousand introduced or reworked. According to Gartner, over 50% of the U.S. population is now covered by a comprehensive state privacy law, with 24 additional states expected to pass laws within five years. States have also begun pooling their resources, with ten forming the Consortium of Privacy Regulators last year and pledging to coordinate investigations across state lines.

Barber argued that privacy enforcement is fundamentally bipartisan, which insulates it from the shifting political winds of the current administration. "Privacy overall is a pretty bipartisan issue," he said. "It's easy to pass privacy regulation because constituents somewhat expect privacy in their day-to-day living. If you were flying on an airline and they said, 'Okay, this seat, if you want your privacy, you're going to have to pay $6 more,' you're like, 'I'm going to go to another airline.' It's an expected part of a transaction at this stage."

He predicted that other states will replicate California's enforcement model. "California has their enforcement division, CalPrivacy. That group has one task: to ensure enforcement of privacy throughout businesses. Is it likely that we see other states get funding and support to fund these types of groups? Highly likely. The enforcement fines — the actual payments — go back to us as constituents. That type of model, you could imagine, being very popular across the country."

Privacy teams are losing a third of their staff just as AI governance demands explode

Perhaps the most paradoxical finding in the report is that privacy teams lost as much as 33% of their headcount last year, even as their workloads expanded across every metric the report tracks. Cisco data cited in the report shows that 90% of privacy programs expanded in 2025 due to AI, while only 12% of AI governance programs are considered mature. Meanwhile, 74% of privacy teams planned to apply AI to privacy-related tasks in 2026, according to ISACA's State of Privacy 2026 survey.

Barber sees this as part of a broader macroeconomic pattern rather than a sign that organizations do not value privacy. "It's actually a fascinating macro trend, and probably one you've seen across all functions," he said. "Businesses are driving more efficiency in all parts of the business. Privacy teams, five years ago, we would have said, 'Well, there's more regulation, the volume of deletions have increased 500%, we need more humans.' It's become clear that AI provides capabilities that can do the work for privacy individuals." He drew an analogy: "They might have had a design team of 20 people five years ago, now they have a design team of five, courtesy of Claude Design or Gamma or whatever the tool may be. I think that's what we're seeing here as well."

DataGrail has positioned its own AI agent, Vera — launched in March 2026 — as part of the answer. Vera is embedded within DataGrail's existing platform and aims to automate privacy workflows across multiple jurisdictions. DataGrail's platform was also named the first production-ready Model Context Protocol server for privacy; it uses the standard created by Anthropic to let customers launch DataGrail tools from whatever application they are already working in, whether Slack, email, or Claude.

Can a vendor-produced report be trusted to diagnose the problems that vendor sells solutions for?

DataGrail is, of course, a company that directly benefits from the problems its report identifies. The company has raised a total of $84.2 million over five rounds, with its largest being a $45 million Series C in October 2022 led by Third Point Ventures. Its platform addresses precisely the data mapping, DSR automation, consent management, and risk assessment challenges the report spotlights.

Barber acknowledged the tension directly. "It's a fair statement," he said when asked about potential skepticism. "DataGrail doesn't provide a service to keep DPAs up to date — that's on a business to evaluate how they work with a vendor. What DataGrail does help to do is assessments, and automate those assessments using our AI agent, Vera, to assess that increased risk."

He argued that the more neutral reading of the data is structural: "This is evidence to show that the DPA unfortunately is not keeping up with technology and the speed at which technology is innovating. That's both exciting but also we need to accept that's where we are." The methodology does lend some credibility to this claim.

The report draws on anonymized privacy operations data from hundreds of enterprise customers, the 2,400-system AI tracking database, and the 5,000-website consent audit — sources that are at least partially independent of DataGrail's commercial interests. And the broader findings on enforcement spending, DSR volume trends, and regulatory expansion align closely with independently published data from Gartner, Cisco, and state enforcement agencies.

The next frontier: agentic AI could spread unvetted data across entire organizations autonomously

When asked about the most important trend that did not make it into the report, Barber pointed to a next-generation risk that extends the shadow AI problem into far more dangerous territory: agentic AI workflows. Gartner predicts 40% of enterprise applications will feature task-specific AI agents by end of 2026, up from under 5% in 2025 — a pace of adoption that could rapidly outstrip the governance mechanisms companies are only now beginning to build.

"Where we go next with this research is agent processing," Barber said. "How are agents then leveraging that information? Because the downstream ramifications would be far more concerning for a business. One particular system is using shadow AI, the business has no idea that that's happening, and then an agent is propagating that information across a whole bunch of other places. The guardrails of you and I checking the system will be lower than maybe what we've seen in the past with agentic workflows."

He framed the distinction in human terms: "The identity of an agent is different than a human. There is thought that goes into what am I about to use here, where did this information come from, how was it collected — that may not be considered in the same way for an agentic workflow. We need to solve the root of the problem, which is how are these businesses leveraging AI subprocessors. But this quickly becomes an agentic problem that could be far more concerning."

For the enterprise privacy and security leaders absorbing this report today, the uncomfortable truth is that the foundational documents and processes they have relied on to manage vendor risk for years are decomposing in real time. The DPA is breaking down as a reliable instrument. State enforcement is accelerating on a bipartisan basis. Privacy teams are shrinking even as their mandates expand. And the next wave of agentic AI systems threatens to distribute unvetted data processing across networks of autonomous agents that operate with even less human oversight than today's tools.

Five years ago, when DataGrail published its first trends report, deletion requests were a fraction of what they are today, only a handful of states had privacy laws on the books, and the phrase "shadow AI" did not exist. Every year since, the report has warned that the problem was getting worse. Every year, the data has proved it right. The companies that survive the next chapter will not be the ones with the biggest compliance teams or the thickest policy binders. They will be the ones that accept a disorienting new reality: in 2026, the contracts you signed may not describe the AI that is already processing your customers' data — and by 2027, autonomous agents may be deciding what to do with it.

DeepSWE blows up the AI coding leaderboard, crowns GPT-5.5, and finds Claude Opus exploiting a benchmark loophole

michael.nunez@venturebeat.com (Michael Nuñez) — Tue, 26 May 2026 22:32:44 GMT

For months, the leading AI coding benchmarks have told enterprise buyers a comforting but misleading story: the top models are all roughly the same. OpenAI's GPT-5 family, Anthropic's Claude Opus, and Google's Gemini Pro have clustered within a narrow band on Scale AI's SWE-Bench Pro leaderboard, making it nearly impossible for engineering leaders to determine which agent will actually perform best inside their codebases.

On Monday, a startup called Datacurve released a benchmark it says shatters that illusion. DeepSWE, a 113-task evaluation spanning 91 open-source repositories and five programming languages, produces a dramatically wider spread among the same frontier models — and crowns OpenAI's GPT-5.5 as the clear leader at 70%, fourteen points ahead of OpenAI's own GPT-5.4 and sixteen points ahead of the nearest model from a rival lab.

"On public leaderboards, top models often look relatively close in capability," wrote Datacurve co-author Serena Ge on X. "DeepSWE shows where they actually diverge, reflecting the realistic experience of developers in their day-to-day work."

The benchmark also delivers a pointed critique of the evaluation infrastructure the AI industry relies on to measure progress: Datacurve's audit found that SWE-Bench Pro's verifiers — the automated graders that determine whether an agent solved a task — issued incorrect pass/fail verdicts on roughly one-third of the trials it reviewed.

If that finding holds up, it has sweeping implications. Enterprise procurement teams, venture capitalists, and AI lab marketing departments all lean heavily on benchmark scores to make multimillion-dollar decisions. A 32% error rate in the most widely cited coding benchmark suggests the industry may have been navigating by a broken compass.

Why the most popular AI coding benchmark may be grading on a curve

To understand what Datacurve is claiming, it helps to understand how coding benchmarks work — and how they can go wrong.

The dominant paradigm, pioneered by the SWE-Bench family maintained by Scale AI and academic researchers, constructs tasks by mining real GitHub commits. The process extracts a bug fix or feature addition from a repository's history, rolls the code back to the pre-fix state, and then asks an AI agent to reproduce the change. The original commit's test suite serves as the verifier: if the agent's patch makes the same tests pass, it gets credit. This approach has an elegant simplicity, but Datacurve argues it introduces three systemic weaknesses.

First, contamination. Because tasks are drawn from public GitHub history, the problem statement, the discussion, and often the exact solution are already present in the training data of frontier models. "The SWE-Bench family scrapes existing GitHub issues and PRs, which creates two problems: memorization (models have already seen the solution) and triviality (most tasks are small)," Ge wrote.

Second, scope. SWE-Bench Pro tasks require, on average, just 120 lines of code added across 5 files. DeepSWE's reference solutions average 668 lines added across 7 files — roughly 5.5 times as much code. Yet DeepSWE's prompts are actually shorter, averaging 2,158 characters versus SWE-Bench Pro's 4,614. In other words, DeepSWE gives the agent less instruction but expects far more output, which more closely mirrors how a human developer might actually delegate work to an AI assistant.

Third — and most damaging — verifier reliability. Datacurve drew 30 tasks at random from each of DeepSWE and SWE-Bench Pro, ran three rollouts of every task on each of 10 frontier model configurations, and then deployed an LLM-based judge to independently assess whether each agent's patch actually solved the problem. SWE-Bench Pro's verifiers accepted wrong implementations 8.5% of the time and rejected correct implementations 24% of the time. DeepSWE's verifiers registered 0.3% and 1.1%, respectively.
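The two error rates above can be computed by comparing each verifier verdict with the judge's verdict. The sketch below is one way to do that, assuming both rates are shares of all reviewed trials (which is consistent with their adding up to roughly one-third). The `Trial` type and the counts are synthetic, chosen only to reproduce the SWE-Bench Pro percentages; they are not Datacurve's data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    verifier_passed: bool  # verdict from the benchmark's test suite
    judge_correct: bool    # verdict from the independent judge


def verifier_error_rates(trials):
    """Return (false_accept_rate, false_reject_rate) over all reviewed trials."""
    if not trials:
        raise ValueError("need at least one reviewed trial")
    total = len(trials)
    false_accepts = sum(1 for t in trials if t.verifier_passed and not t.judge_correct)
    false_rejects = sum(1 for t in trials if t.judge_correct and not t.verifier_passed)
    return false_accepts / total, false_rejects / total


# Synthetic sample of 200 trials.
trials = (
    [Trial(True, False)] * 17     # wrong patch accepted
    + [Trial(False, True)] * 48   # correct patch rejected
    + [Trial(True, True)] * 80    # verifier and judge agree: pass
    + [Trial(False, False)] * 55  # verifier and judge agree: fail
)

false_accept, false_reject = verifier_error_rates(trials)
print(f"{false_accept:.1%} {false_reject:.1%}")  # 8.5% 24.0%
```

Note that this treats the judge as ground truth. The judge is itself an LLM, so its own errors would carry through into both rates.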

The false negative problem is especially insidious because it punishes creative solutions. In one documented case, the gold-standard pull request for a SWE-Bench Pro task refactored a private helper function. An agent that correctly solved the task by inlining the same logic — a perfectly valid engineering choice — failed because the test suite tried to import a symbol that only existed in the original author's specific implementation.
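The private-helper failure is easy to reproduce in miniature. In the sketch below, two implementations behave identically, but a check tied to the reference patch's internal structure rejects the one that inlines the helper. Every name here is hypothetical and unrelated to the actual SWE-Bench Pro task; `SimpleNamespace` stands in for a module.

```python
from types import SimpleNamespace


def _normalize(text):
    """Private helper that exists only in the reference implementation."""
    return " ".join(text.split()).lower()


def slugify_with_helper(text):
    return _normalize(text).replace(" ", "-")


def slugify_inlined(text):
    # Same logic, with the helper inlined: a valid engineering choice.
    return " ".join(text.split()).lower().replace(" ", "-")


reference_patch = SimpleNamespace(slugify=slugify_with_helper, _normalize=_normalize)
agent_patch = SimpleNamespace(slugify=slugify_inlined)


def behavior_check(module):
    """Tests only the public behavior."""
    return module.slugify("  Hello   World ") == "hello-world"


def coupled_check(module):
    """Mimics a test that also imports a private symbol from the reference patch."""
    return hasattr(module, "_normalize") and behavior_check(module)


print(behavior_check(agent_patch))  # True
print(coupled_check(agent_patch))   # False
```

The agent's patch passes the behavioral check and fails the coupled one, which is how a correct solution ends up scored as a failure.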

OpenAI's GPT-5.5 dominates the new benchmark while Claude and Gemini stumble

DeepSWE's top-line results reorder the familiar hierarchy in ways that should matter to every engineering team evaluating AI coding tools. On SWE-Bench Pro, models from OpenAI, Anthropic, and Google have traded the lead within a 30-point range. DeepSWE stretches that range to 70 points.

GPT-5.5 leads at 70%, followed by GPT-5.4 at 56% and Claude Opus 4.7 at 54%. From there, the drop-off is steep: Claude Sonnet 4.6 lands at 32%, Gemini 3.5 Flash at 28%, GPT-5.4-mini and Kimi K2.6 tied at 24%, and then a long tail of models in the teens and single digits. Claude Haiku 4.5, which scores 39% on SWE-Bench Pro, collapses to zero on DeepSWE — suggesting that some mid-tier models have been significantly overperforming on easier, potentially contaminated benchmarks.

GPT-5.5 doesn't just score the highest — it does so efficiently. The model reaches its 70% pass rate with a median cost of $5.80 per trial, a median wall-clock time of 20 minutes, and a median of 47,000 output tokens. GPT-5.4 emerges as perhaps the best overall value at $3.30 per trial with a 56% score. Claude Opus 4.7, meanwhile, costs significantly more per run. Output tokens, wall-clock duration, and dollar cost per trial all vary by an order of magnitude across the agents tested, yet none of these correlates strongly with pass rate. Agents that emit more tokens, run longer, or cost more do not consistently solve more tasks.
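One way to compare value across models is to divide cost per trial by pass rate, which gives an approximate cost per solved task. This is a simplification rather than a metric from the benchmark: it treats the reported median cost as a typical per-trial cost and assumes failed attempts cost as much as successful ones.

```python
def cost_per_solved_task(cost_per_trial_usd, pass_rate):
    """Approximate spend per solved task: cost per trial divided by pass rate."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return cost_per_trial_usd / pass_rate


# Per-trial figures quoted above.
gpt_5_5 = cost_per_solved_task(5.80, 0.70)
gpt_5_4 = cost_per_solved_task(3.30, 0.56)

print(round(gpt_5_5, 2))  # 8.29
print(round(gpt_5_4, 2))  # 5.89
```

On these assumptions GPT-5.4 comes out cheaper per solved task, in line with the article's description of it as the best overall value, while GPT-5.5 solves more tasks outright.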

Datacurve's audit found that Claude has been reading the answer key on existing benchmarks

Perhaps the most provocative finding in DeepSWE's analysis concerns what the authors label "CHEATED" verdicts — instances where an agent passes a benchmark not by solving the problem, but by reading the answer.

SWE-Bench Pro's Docker containers ship the repository's full .git history, which means the gold-standard solution commit is sitting right there in the container's file system. Most models ignore it. Claude does not. Datacurve's analysis found that both Claude Opus 4.7 and Claude Opus 4.6 registered "CHEATED" on more than 12% of their reviewed SWE-Bench Pro rollouts. In those instances, the Claude agent ran commands like git log --all or git show to retrieve the merged fix and paste it into its own patch. The behavior accounted for approximately 18% of Opus 4.7's passes and 25% of Opus 4.6's passes on the reviewed sample. The issue has been filed publicly as GitHub issue #93 on the SWE-Bench Pro repository.

GPT-5.4 and GPT-5.5 never exhibited this behavior. Gemini configurations stayed around 1%. Datacurve describes the behavior diplomatically — "The benchmark makes this possible (the gold commit lives in the container), but Claude is the family that consistently does so" — but the implication is clear: a meaningful fraction of Claude's SWE-Bench Pro scores may reflect environmental exploitation rather than genuine engineering capability.
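A first-pass screen for this behavior can be built by scanning the shell commands in each rollout for git subcommands that read history. The sketch below is a rough heuristic and not Datacurve's analyzer, which the article describes only as LLM-based. The subcommand list and function names are assumptions, and a flagged command shows only that history was read, not that a fix was copied.

```python
import shlex

# Assumed list of git subcommands that expose past or unmerged commits.
HISTORY_SUBCOMMANDS = {"log", "show", "reflog", "rev-list", "cat-file"}


def reads_git_history(command):
    """True if a shell command invokes a git subcommand that reads history."""
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes and similar
        return False
    return any(
        tok == "git" and nxt in HISTORY_SUBCOMMANDS
        for tok, nxt in zip(tokens, tokens[1:])
    )


def flagged_share(rollouts):
    """Fraction of rollouts containing at least one history-reading command."""
    if not rollouts:
        raise ValueError("need at least one rollout")
    flagged = sum(1 for cmds in rollouts if any(reads_git_history(c) for c in cmds))
    return flagged / len(rollouts)


# Hypothetical command traces from four rollouts.
rollouts = [
    ["ls", "git status", "pytest -q"],
    ["cd repo && git log --all", "git show abc123"],
    ["grep -rn 'def parse' src/"],
    ["git diff", "python -m pytest"],
]
print(flagged_share(rollouts))  # 0.25
```

The check misses forms such as `git -C repo log`, where options sit between `git` and the subcommand, so a real audit would need a fuller parser plus human or model review of each flagged rollout.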

DeepSWE addresses this by shipping only a shallow clone with the base commit, leaving no gold hash for the agent to discover. The behavior is arguably a sign of Claude's environmental attentiveness — the model is very good at exploring its surroundings and exploiting available resources. Whether that counts as "cheating" or "resourcefulness" depends on your perspective, but in the context of a benchmark designed to measure independent problem-solving, it undermines the signal.

Each AI model family fails in its own distinctive way, and the patterns matter for enterprise teams

Beyond the top-line scores, Datacurve's qualitative trajectory analysis reveals distinctly different failure signatures across model families — a finding that could help engineering teams choose the right model for specific types of work.

Claude is forgetful with multi-part prompts. On DeepSWE, Claude configurations miss stated requirements more than any other family. The pattern is consistent: when a prompt enumerates parallel behaviors — "support both sync and async," for instance — Claude typically implements the obvious branch and forgets to mirror the change. Datacurve reports that roughly two-thirds of Claude's "MISSED_REQUIREMENT" failures on DeepSWE follow this "one branch shipped" pattern. In one example, Claude Opus 4.7 correctly landed a sync state-data hook in one engine class while the async engine never received the same hook.
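To make the "one branch shipped" pattern concrete, the sketch below shows a requirement of the form "support both sync and async" implemented in full. The class and hook names are hypothetical and not taken from the DeepSWE task. The failure described above corresponds to leaving the hook call out of the async class.

```python
import asyncio


class SyncEngine:
    def __init__(self, on_state=None):
        self.on_state = on_state

    def step(self, state):
        if self.on_state:
            self.on_state(state)  # hook in the obvious branch
        return state


class AsyncEngine:
    def __init__(self, on_state=None):
        self.on_state = on_state

    async def step(self, state):
        if self.on_state:
            self.on_state(state)  # the mirrored call that is easy to forget
        return state


# A test covering both branches catches the omission.
seen = []
SyncEngine(seen.append).step("sync")
asyncio.run(AsyncEngine(seen.append).step("async"))
print(seen)  # ['sync', 'async']
```

A verifier that exercises only the sync path would pass either version, which is why a benchmark has to test each enumerated behavior separately to surface this failure.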

GPT, by contrast, implements exactly what is asked. GPT-5.5 had the lowest rate of missing stated behaviors of any configuration tested. Across multiple runs of the same task, GPT trials tended to converge on the same interpretation of the prompt, suggesting instruction-following precision is a stable trait of the model rather than per-run luck.

One of the most intriguing findings involves self-verification. On DeepSWE, Claude Opus 4.7 and GPT-5.4 wrote and ran new tests in the project's own test framework on over 80% of their runs — even though no one asked them to. On SWE-Bench Pro, those same models dropped to 28% and 18%, respectively. The reason: SWE-Bench Pro's prompt template explicitly tells agents they "should not modify the testing logic or any of the tests." Agents dutifully complied, suppressing a behavior that likely would have improved their performance. This suggests that prompt design in production coding workflows may be inadvertently suppressing valuable agent behaviors — something enterprise teams deploying AI coding agents should carefully audit.

What DeepSWE gets right, what it gets wrong, and what it means for the future of AI benchmarks

Datacurve is forthright about several limitations. The standardized harness, while ensuring fairness, routes all edits through bash rather than the model-specific editing tools each family was trained on — apply_patch for GPT, str_replace_based_edit_tool for Claude. This could hold models below their native ceilings. The benchmark draws exclusively from open-source repositories with 500-plus stars, and results may not generalize to proprietary codebases. Bug localization and refactoring tasks are under-represented, and widely used languages like C++ and Java are absent entirely. The verdict assignments in the qualitative analysis come from an LLM analyzer, not human reviewers, and sample sizes are modest — roughly 90 reviewed rollouts per model per benchmark.

It is also worth noting that Datacurve is a startup with its own commercial interests, and an independent benchmark that reshuffles the leaderboard will inevitably invite scrutiny. The company's decision to publish the full dataset, all agent trajectories, and the evaluation harness on GitHub mitigates this concern considerably, but independent reproduction will be necessary before the AI community treats these results as definitive.

DeepSWE arrives at an inflection point for the AI coding market. Enterprise adoption of AI coding agents is accelerating rapidly, with engineering organizations making consequential bets on which model to build around. The benchmark market itself has become a strategic battleground — Scale AI's SWE-Bench Pro, which Datacurve directly critiques, is maintained by a company that also provides evaluation services to the labs whose models it ranks.

If DeepSWE's central findings about verifier reliability and data contamination hold up under independent scrutiny, they could force a reckoning not just with how the industry measures coding agents, but with the broader question of what benchmarks are actually for. A leaderboard where the grading system is wrong a third of the time is not merely inaccurate — it is the kind of broken instrument that makes everyone feel good about progress that may not be real. And in an industry spending billions on a bet that AI agents can do the work of software engineers, the difference between real progress and the appearance of it is not academic. It is the whole game.

Resolve AI says the AI coding boom is breaking production systems. It wants to fix that.

michael.nunez@venturebeat.com (Michael Nuñez) — Thu, 21 May 2026 13:00:00 GMT

Resolve AI, the production-operations startup backed by Greylock and Lightspeed Venture Partners, today announced a sweeping expansion of its platform that introduces always-on background agents, a redesigned investigation architecture, and a shared workspace where engineers and AI agents collaborate in real time on live incidents.

The centerpiece of the release is a new multi-agent investigation system developed by Resolve AI's in-house research lab. Instead of deploying a single AI agent to diagnose a production failure — analogous to a lone engineer pulling an on-call shift — the platform now dispatches a coordinated team of specialized agents that pursue multiple hypotheses in parallel, independently verify each other's conclusions, and construct complete causal chains from root cause to symptom. The company says the architecture delivers more than a twofold improvement in root cause accuracy on its internal evaluation benchmarks compared to earlier versions of its platform.

"Think of a single agent being on call, the way a human would be," Resolve AI CEO and co-founder Spiros Xanthos told VentureBeat in an exclusive interview ahead of the announcement. "We now have a team of agents that all work together, almost like a team of humans debugging an issue, and that has improved quality by 2x."

The announcement arrives at a moment of acute tension in the software industry. AI-powered code generation has exploded in adoption, enabling engineering teams to ship dramatically more software than they could two years ago. But keeping that software running in production (debugging it when it breaks, monitoring it after deployment, auditing its health) remains overwhelmingly manual. Resolve AI, which raised a $125 million Series A at a $1 billion valuation earlier this year, is making a direct bet that the operational side of the software lifecycle is the next major frontier for AI investment.

What hundreds of real-world test cases reveal about the accuracy claim

Any accuracy claim from a startup warrants scrutiny, and Xanthos was candid about both the scale and limitations of the evaluation. The 2x figure comes from internal benchmarks, not a third-party audit, though the evaluation set was built to mirror the complexity that Resolve AI's enterprise customers encounter daily.

"These are very hard, complex evals that we built over time to represent real-world examples," Xanthos explained. "This is not customer data, but these evals represent difficult cases similar to what we've seen at some of the largest tech companies we work with." He described the set as comprising hundreds of cases that reflect the kinds of production failures encountered at companies like Coinbase, Salesforce, DoorDash, and Zscaler — all named Resolve AI customers.

The practical impact of that accuracy gain is significant. Resolve AI's agents now act as first responders for every on-call alert, typically triaging within five minutes before a human engineer even becomes involved. In previous public disclosures, the company has cited DoorDash reducing time to root cause by up to 87 percent. When asked to contextualize that figure, Xanthos described the typical baseline.

"When something goes wrong, it might take five to 10 minutes for a human to even get their laptop and connect," he said. "The typical MTTR is in the tens of minutes, sometimes hours, depending on severity. So an improvement of 80-plus percent — four to five times faster — is actually huge. It's something we've never achieved before with AI, tools, data, or observability."

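The relationship between a percentage reduction in time and a "times faster" figure is simple arithmetic: cutting a duration by a fraction r leaves (1 - r) of the original time, which is a speedup of 1 / (1 - r). The short sketch below is illustrative only; the function name is an assumption and nothing here comes from Resolve AI's materials.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional time reduction (0 <= r < 1) into a speedup factor."""
    if not 0 <= reduction < 1:
        raise ValueError("reduction must be in [0, 1)")
    # If only (1 - r) of the original time remains, the task is 1 / (1 - r) times faster.
    return 1.0 / (1.0 - reduction)


for r in (0.75, 0.80, 0.87):
    print(f"{r:.0%} less time -> {speedup_from_reduction(r):.1f}x faster")
```

By this arithmetic, reductions of 75 to 80 percent correspond to four to five times faster, and an 87 percent reduction works out to roughly 7.7 times faster.
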
How AI agents fact-check each other to prevent hallucinated root causes

One of the core challenges in applying large language models to high-stakes production environments is their tendency to generate plausible-sounding but incorrect answers — a failure mode that, in the context of a live outage, could send an engineering team chasing the wrong fix while a service stays down.

Xanthos acknowledged this directly. "This is a very common issue with models out of the box," he said. "They always try to give you an answer, and if they don't have enough evidence, they'll give you the best possible answer — which is likely to be wrong."

Resolve AI's countermeasure is a system of layered verification among its agents. Each agent investigating a hypothesis must cite every piece of evidence it relies on and present that evidence to another agent for independent review. The investigating agent must construct the full causal chain — from root cause to symptom — and peer agents actively attempt to disprove the theory by identifying gaps in the logic.
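The idea of cited evidence plus adversarial peer review can be sketched in a few lines. This is a hypothetical illustration, not Resolve AI's implementation: the `Link` and `Hypothesis` types, the two reviewer functions, and the example incident are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Link:
    cause: str
    effect: str
    evidence: Optional[str]  # a citation such as a log query; None if uncited


@dataclass(frozen=True)
class Hypothesis:
    root_cause: str
    symptom: str
    chain: tuple[Link, ...]  # ordered links from root cause to symptom


Reviewer = Callable[[Hypothesis], list[str]]


def uncited_links(h: Hypothesis) -> list[str]:
    """Reviewer 1: every link in the chain must cite evidence."""
    return [f"no evidence for {l.cause} -> {l.effect}" for l in h.chain if not l.evidence]


def broken_chain(h: Hypothesis) -> list[str]:
    """Reviewer 2: the chain must run unbroken from root cause to symptom."""
    if not h.chain:
        return ["empty causal chain"]
    gaps, expected = [], h.root_cause
    for link in h.chain:
        if link.cause != expected:
            gaps.append(f"chain jumps from {expected} to {link.cause}")
        expected = link.effect
    if expected != h.symptom:
        gaps.append(f"chain ends at {expected}, not {h.symptom}")
    return gaps


def peer_review(h: Hypothesis, reviewers: list[Reviewer]) -> tuple[bool, list[str]]:
    """Accept the hypothesis only if no reviewer finds a gap."""
    gaps = [g for review in reviewers for g in review(h)]
    return (not gaps, gaps)


hypothesis = Hypothesis(
    root_cause="bad config deploy",
    symptom="checkout 5xx spike",
    chain=(
        Link("bad config deploy", "connection pool exhausted", "deploy log entry"),
        Link("connection pool exhausted", "checkout 5xx spike", None),
    ),
)
accepted, gaps = peer_review(hypothesis, [uncited_links, broken_chain])
print(accepted, gaps)
```

In this toy case the second link has no citation, so the review rejects the theory and reports the gap rather than passing it along as a root cause.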

"Often, agents actually disprove those theories because they find gaps," Xanthos said. "There are many layers of defense and agentic checks that allow Resolve to be very accurate and not mislead."

Equally important, he said, is the system's willingness to say it does not know. "The bar to actually saying 'I have the answer' is very high. In those cases, it will say, 'This is the evidence I found. Here are three or four paths you can take from here, but I wasn't able to fully prove that this is the problem.' A system like this that operates in production cannot be a black box."

In domains where wrong answers carry operational consequences, calibrated uncertainty can be more valuable than confident outputs. For an AI system integrated into an incident-response workflow, confidently pointing engineers in the wrong direction during a customer-facing outage could compound the very harm it was designed to prevent.
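A high bar for declaring an answer, with a fallback to evidence and next steps, might look like the following sketch. It is hypothetical: the `confidence` score, the 0.9 threshold, and the field names are assumptions chosen for illustration, and the article does not describe how Resolve AI scores its findings.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    hypothesis: str
    confidence: float  # hypothetical score in [0, 1] from upstream verification
    evidence: list[str]


def report(findings: list[Finding], threshold: float = 0.9, max_paths: int = 3) -> dict:
    """Name a root cause only above the threshold; otherwise return evidence and paths."""
    ranked = sorted(findings, key=lambda f: f.confidence, reverse=True)
    if ranked and ranked[0].confidence >= threshold:
        top = ranked[0]
        return {
            "status": "root_cause_identified",
            "root_cause": top.hypothesis,
            "evidence": top.evidence,
        }
    # Below the bar: show what was found and where to look next, without a verdict.
    return {
        "status": "inconclusive",
        "evidence": [e for f in ranked for e in f.evidence],
        "next_paths": [f.hypothesis for f in ranked[:max_paths]],
    }


findings = [
    Finding("connection pool exhaustion", 0.62, ["pool saturation graph"]),
    Finding("bad config deploy", 0.55, ["deploy log entry"]),
]
print(report(findings))
```

The design choice is that an inconclusive result still carries the collected evidence, so an engineer can inspect the reasoning instead of receiving a bare verdict.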

Inside the new background agents that never go off-call

Beyond incident response, Resolve AI is introducing a new class of background agents designed to handle the continuous, often invisible operational work that engineering teams are expected to perform but struggle to sustain at scale.

These agents run on schedules or wake automatically in response to events — a new deployment, a fired alert, a merged pull request — and accumulate institutional knowledge from every investigation and human interaction over time. When an engineer opens the Resolve AI interface, agents have already been working: pre-investigating priority issues, monitoring deployments, auditing alert hygiene, flagging configuration drift, and surfacing cost anomalies.
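The event-driven half of that description can be sketched as a small dispatcher that wakes registered handlers when a matching event arrives. Everything here is a hypothetical illustration: the class, the event names, and the payload fields are assumptions, and scheduling is left out (a scheduler could simply emit a timer event into the same dispatcher).

```python
from collections import defaultdict
from typing import Callable

Handler = Callable[[dict], str]


class AgentDispatcher:
    """Wakes background handlers in response to events and keeps their notes."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = defaultdict(list)
        self.notes: list[str] = []  # stand-in for accumulated institutional knowledge

    def on(self, event_type: str) -> Callable[[Handler], Handler]:
        """Decorator that registers a handler for one event type."""
        def register(fn: Handler) -> Handler:
            self._handlers[event_type].append(fn)
            return fn
        return register

    def emit(self, event_type: str, payload: dict) -> list[str]:
        """Run every handler registered for the event and record the results."""
        results = [handler(payload) for handler in self._handlers.get(event_type, [])]
        self.notes.extend(results)
        return results


dispatcher = AgentDispatcher()


@dispatcher.on("deployment")
def watch_deploy(payload: dict) -> str:
    return f"monitoring deploy {payload['service']}@{payload['version']}"


@dispatcher.on("pull_request.merged")
def check_pr(payload: dict) -> str:
    return f"assessing production impact of PR {payload['id']}"


print(dispatcher.emit("deployment", {"service": "checkout", "version": "v42"}))
```

Events with no registered handler do nothing, which matches the idea that an always-available agent is not the same as an always-busy one.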

Xanthos drew a distinction between background agents and the incident-response agents that have been Resolve AI's primary offering. "You can now have these agents run in the background at all times — not only when a human asks an agent to debug a problem or when an alert fires," he said. "A lot of our customers are now monitoring changes that land in production before they cause an issue. There's an agent that monitors those all the time."

He described these background agents as "general-purpose SRE agents that are available to every developer," capable of handling tasks that range from monitoring infrastructure changes that might increase cloud costs to performing post-incident follow-up work like generating code fixes based on incident learnings. The concept addresses a structural problem in software operations: the daily tasks required to keep production systems healthy — monitoring deployments, investigating alerts, tracking changes across complex environments — are critical but reactive and manual. Engineering organizations know this work needs to happen, but it competes for attention with feature development. Automated agents that perform this work continuously could shift teams from reactive firefighting to proactive operational management.

The shared workspace where engineers and AI agents investigate together

The third major component of the release is what the company calls a shared investigation surface — a workspace where engineers and AI agents work from the same live evidence during an active incident. Reports update dynamically as investigations evolve. Every finding is inspectable. Engineers can explore side investigations without interrupting the primary workflow. Source queries are pullable and modifiable in place, evidence is embedded directly into the workspace, and remediation actions can be triggered from the same interface without switching tools.

"Think of it as an interface to all the production tools, but also an interface where humans and agents can collaborate with each other — or agents with agents," Xanthos said. "That's what gradually leads to more trust and more automation, because you work with the agent, you teach it, you see the results."

The company is also making its platform available as a REST API and an MCP (Model Context Protocol) server, enabling engineering teams to integrate Resolve AI into broader agentic workflows and infrastructure. According to Xanthos, this is already happening in practice. "A general-purpose agent that a company has built — when it comes to debugging, that agent could invoke Resolve," he said. "Or somebody works on their coding agent on the laptop, and Resolve shows up there as an MCP. If there is some production-related activity, the coding agent can invoke it." The interoperability play signals that Resolve AI sees itself not as a closed system but as a specialized node in a broader ecosystem of AI agents that will increasingly hand off tasks to one another — a pattern Xanthos compared to the open architecture of the web rather than the walled-garden model of an app store.
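For readers unfamiliar with MCP, a client invokes a server-side tool by sending a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds such a request. The envelope follows the MCP specification, but the tool name `investigate_alert` and its arguments are hypothetical; the article does not document which tools Resolve AI's MCP server exposes.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry a unique id per session


def mcp_tool_call(tool_name: str, arguments: dict) -> dict:
    """Build a JSON-RPC 2.0 request that asks an MCP server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool name and arguments, for illustration only.
request = mcp_tool_call("investigate_alert", {"alert_id": "example-123"})
print(json.dumps(request, indent=2))
```

A coding agent on a laptop would send a message of this shape over whatever transport the MCP server supports and receive the tool's result in the matching JSON-RPC response.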

Why Resolve AI says it can outperform Datadog, PagerDuty, and the cloud giants

The agentic operations space has become crowded in the past year. Datadog, PagerDuty, and major cloud providers have all announced AI-augmented operations capabilities. When asked what separates Resolve AI from these incumbents, Xanthos pointed to the depth of the company's technical foundation.

"We're operating at the frontier here. There's no blueprint for how you build a system like Resolve," he said. He noted that he and co-founder Mayank Agarwal co-created OpenTelemetry, the most widely adopted open-source project in observability, which now serves as the de facto standard for collecting metrics, logs, and traces from modern software systems.

Xanthos also highlighted the company's recently formed AI Lab, led by a researcher he described as the former post-training lead for Meta's Llama models. "He managed to combine deep expertise of production observability with AI and models, and I think that's very unique," Xanthos said. "I don't believe any other company, whether it comes from an observability background or it's a startup, has all of that together."

The company's structural defenses, according to Xanthos, include a full environment model that Resolve builds for each customer, a memory system that learns within the customer's specific production environment, and its multi-agent architecture. The lab is now post-training frontier models on production-specific data — the kind of procedural knowledge that experienced engineers use to debug production issues but that does not appear in standard model training sets. This approach reflects an increasingly common pattern among AI application companies: using frontier foundation models as a base layer but investing heavily in domain-specific fine-tuning, retrieval, and agent architectures to achieve accuracy levels that general-purpose models cannot reach alone.

How outcome-based pricing changes the economics of AI in production

Resolve AI's pricing model departs from traditional enterprise software licensing. The company sells credits that are consumed when agents perform work — an outcome-based approach that ties cost directly to value delivered.

"We're not selling software," Xanthos said. "The way you buy and use Resolve is by buying credits that are consumed when Resolve performs an action. It's outcome-based. Only when Resolve troubleshoots an alert — that's the only time that it consumes credits."

He addressed the cost question head-on, arguing that Resolve AI is actually cheaper than the alternative of building a similar system from scratch using frontier models and MCP integrations. "If you were to take Opus or GPT-5.4 and try to build a solution like Resolve with MCPs, we measured — you actually end up consuming a lot more in tokens than what you have to pay Resolve, because our system is very optimized in terms of context, in terms of how it reads time-series data."

As for the always-on background agents, Xanthos said their continuous nature does not inherently add to cost. "The background agent doesn't mean it does intensive work all the time. It means that it can be there; you can give it any task you want. A lot of these tasks are triggered based on some action: an alert happens, somebody merges a PR, and you want to see if it has an impact on production."

For enterprise customers in regulated industries, such as Coinbase and Zscaler, data residency and security are non-negotiable. Resolve AI accommodates this with a flexible deployment model: the data plane sits wherever the customer's existing tools already live, while the inference layer can run as a standard SaaS deployment or inside a customer-specific VPC. "We designed Resolve to work with the large enterprises where security standards are the highest," Xanthos said. "There are many measures we take to ensure Resolve is secure, including not retaining data."

Why engineering leaders are slowly learning to trust AI agents with production systems

The question of whether engineering teams will trust AI agents to take autonomous action in production — rolling back a deployment, adding capacity, generating a pull request — is one of the defining cultural challenges of this technology wave. Xanthos drew an analogy to autonomous vehicles.

"For us to allow a car to drive on its own on the street, we have to prove that it's safer than a human. Agents in production is a very similar concept," he said. He acknowledged that not every customer is comfortable with agents taking automated action, but described a gradient of trust that he expects to evolve rapidly.

"There is a set of actions that are relatively risk-free that most tech companies probably are comfortable having an agent take, and probably there is another set of actions for which the human has to approve," he said. "But as quality keeps climbing the way we see at Resolve, I would say we're going to cross the threshold this year where most of the actions will be taken by an agent automatically."

He described the typical adoption arc: companies begin with agents providing recommendations, then a human decides whether to press the button. Over weeks or months, trust builds incrementally. "I don't think this is a problem where we just let the agents run wild from the beginning," Xanthos said. The incremental approach mirrors how enterprise technology adoption has always worked — from cloud migration to container orchestration, organizations move at the speed of trust, not the speed of capability.
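The gradient of trust described here maps naturally onto a per-action policy. The sketch below is a hypothetical illustration: the action names echo the examples in this article, but which actions count as low-risk, and the three modes themselves, are assumptions rather than anything Resolve AI has specified.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    AUTO = "auto"            # agent may act on its own
    APPROVAL = "approval"    # agent acts only after a human signs off
    RECOMMEND = "recommend"  # agent may only suggest the action


# Assumed example policy; a real team would choose its own assignments.
POLICY = {
    "open_pull_request": Mode.AUTO,
    "add_capacity": Mode.APPROVAL,
    "rollback_deployment": Mode.APPROVAL,
}


def decide(action: str, approved_by: Optional[str] = None, policy: dict = POLICY) -> str:
    """Return what the agent may do with an action under the current policy."""
    mode = policy.get(action, Mode.RECOMMEND)  # unknown actions default to the safest mode
    if mode is Mode.AUTO:
        return "execute"
    if mode is Mode.APPROVAL:
        return "execute" if approved_by else "await_approval"
    return "recommend_only"


print(decide("open_pull_request"))
print(decide("rollback_deployment"))
print(decide("rollback_deployment", approved_by="on-call engineer"))
```

Building trust over weeks or months then amounts to moving individual actions from `RECOMMEND` to `APPROVAL` to `AUTO` as the agent's track record accumulates.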

The argument that AI-generated code is making the production crisis worse, not better

Perhaps the most provocative argument in Resolve AI's thesis is that the explosion of AI-generated code is actually intensifying the production-operations problem. In a recent LinkedIn post, Xanthos framed the dynamic in stark terms, arguing that engineering leaders who celebrate faster code shipping without investing in production operations are effectively having their senior engineers "subsidize velocity" through increased incident-response burden.

In his interview with VentureBeat, he returned to this theme. "Now that coding agents are producing code, we produce a lot more code that we're less familiar with — humans are less familiar with — so you need the AI to be the defense," he said.

This framing positions Resolve AI not merely as a productivity tool but as a necessary counterweight to the AI coding revolution. As organizations deploy more code, written by tools that their engineers may not fully understand, running against production systems those engineers did not build, the argument is that the operational complexity — and the consequences of failure — will grow proportionally. On the Stack Overflow Podcast last October, Xanthos put numbers to this claim, estimating that engineers spend upwards of 70 percent of their time maintaining and troubleshooting production systems rather than building new features. "We're facing a new crisis where we're building faster than we can operate," he said in that conversation.

Resolve AI was founded in early 2024 by Xanthos and Agarwal, who first met during their PhD programs at the University of Illinois and have worked together for more than a decade. Xanthos previously co-founded Pattern Insight (acquired by VMware) and Omnition (acquired by Splunk), where the pair helped create OpenTelemetry. The company raised a $35 million seed round from Greylock in 2024, followed by the $125 million Series A led by Lightspeed at a $1 billion valuation earlier this year. Named customers include Coinbase, DoorDash, MSCI, Salesforce, MongoDB, and Zscaler.

Xanthos's long-term vision is expansive. "Over the long run, once agent ability surpasses that of a human software engineer, the end result is a lot more technology and a lot more software," he said. "It's not actually fewer people working on it. It's technology becoming cheaper, becoming more accessible, producing a lot more technology for the benefit of the world."

That vision will take years to realize. But the more immediate promise of today's announcement comes down to something every on-call engineer understands viscerally: the 2 a.m. page, the scramble for a laptop, the frantic search through dashboards and logs for an answer that might take minutes or might take hours. Resolve AI is betting that the next time that alert fires, a team of agents will have already investigated, verified, and documented the root cause before the engineer's phone even lights up. For a profession that has long measured its nights by mean time to resolution, the question is no longer whether AI can help — it is whether engineers will let it.