Google says Gemini 3.5 Flash can slash enterprise AI costs by more than $1 billion a year

Google unveiled Gemini 3.5 Flash at its annual I/O developer conference on Tuesday. The company says the new artificial intelligence model shatters what had become a seemingly iron law of the AI industry: that the smartest models must also be the slowest and most expensive to run.

The model sits at the center of a sweeping set of announcements — from a video-generating "world model" called Gemini Omni to a 24/7 personal AI agent called Gemini Spark — but 3.5 Flash carries perhaps the most immediate consequence for the enterprises pouring billions of dollars into AI infrastructure. Sundar Pichai, Google's chief executive, told reporters during a press briefing Monday that companies running roughly one trillion tokens per day on Google Cloud could save more than $1 billion annually by shifting 80 percent of their workloads to a mix of Flash and other frontier models.

"You've probably heard anecdotes from other CIOs that companies are already blowing through their annual token budgets, and it's only May," Pichai said, framing the model not just as a technical achievement but as a financial lifeline for organizations struggling with the runaway costs of deploying AI at scale.

If the claim holds, it would mark one of the most significant shifts in the economics of enterprise AI since large language models entered corporate computing.

Why enterprises have been forced to choose between AI quality and AI speed

For the past three years, organizations adopting generative AI have faced a painful trade-off. The most capable models — the ones that can reason through complex multistep problems, write reliable code, and parse dense financial documents — tend to be large, slow, and expensive to query. Faster, cheaper models sacrifice accuracy. Chief information officers have been forced into a kind of AI portfolio management: routing simple queries to lightweight models and reserving the heavy-duty reasoning engines for high-stakes tasks. It is a complex, brittle system that adds engineering overhead and often delivers inconsistent user experiences.

Gemini 3.5 Flash attacks that trade-off directly. According to Google's internal benchmarks and a third-party analysis from Artificial Analysis, the model outperforms Google's own Gemini 3.1 Pro — a model the company positioned as its top-tier flagship just four to five months ago — on nearly every major benchmark. It scores 76.2 percent on Terminal-Bench 2.1, reaches 1656 Elo on GDPval-AA, hits 83.6 percent on MCP Atlas, and leads in multimodal understanding with 84.2 percent on CharXiv Reasoning.

Yet it does all of this while generating output tokens at four times the speed of comparable frontier models from competitors. Koray Kavukcuoglu, chief technology officer of Google DeepMind and chief AI architect for Google, told reporters the team has pushed even further: "We have developed an even more optimized version of Flash, not just four times, but actually 12 times faster with the same quality." That turbo variant is available starting Tuesday inside Antigravity, Google's agentic development platform.

Pichai put the performance gap in blunt terms: "3.5 Flash is better than 3.1 Pro, which was just four months ago, and it's at the almost, I would say, 90% of the performance of frontier models, 4x faster, much faster in Antigravity, maybe 12x, and about 1/3 to one half the cost."

Flash is the only model to land in what Artificial Analysis calls the "top-right quadrant" of its intelligence-versus-speed index, a position no competitor currently holds.

The trillion-token math behind Google's $1 billion savings claim

To understand why Flash matters so much to enterprise buyers, you need to understand the economics of tokens — the fundamental units of data that AI models process. Every query a customer service chatbot answers, every legal document an AI summarizes, every line of code an agent writes, consumes tokens. And at frontier-model pricing, those tokens add up fast.

Google says its model APIs now process around 19 billion tokens per minute. Across all of Google's own surfaces — Search, the Gemini app, Workspace, and more — the company processes over 3.2 quadrillion tokens per month, a figure that has jumped seven-fold in the past year alone. Two years ago, at I/O 2024, the number was 9.7 trillion per month.
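For a sense of scale, the figures above can be put on a common monthly basis. The short Python sketch below uses only the numbers quoted in this article. The 30-day month is an assumption, and the article does not say whether API traffic is counted inside the 3.2 quadrillion total, so the comparison is indicative only.

```python
# Figures quoted in the article.
API_TOKENS_PER_MINUTE = 19e9       # Google's model APIs
ALL_SURFACES_PER_MONTH = 3.2e15    # Search, the Gemini app, Workspace, and more
IO_2024_PER_MONTH = 9.7e12         # figure cited for I/O 2024
GROWTH_OVER_PAST_YEAR = 7          # "seven-fold"

# Assumption (not from the article): a 30-day month.
MINUTES_PER_MONTH = 60 * 24 * 30

api_per_month = API_TOKENS_PER_MINUTE * MINUTES_PER_MONTH
implied_year_ago = ALL_SURFACES_PER_MONTH / GROWTH_OVER_PAST_YEAR
two_year_multiple = ALL_SURFACES_PER_MONTH / IO_2024_PER_MONTH

print(f"API traffic per month:    {api_per_month:.3e} tokens")
print(f"Implied total a year ago: {implied_year_ago:.3e} tokens per month")
print(f"Growth since I/O 2024:    about {two_year_multiple:.0f}x")
```

On these assumptions, the API rate works out to roughly 0.8 quadrillion tokens per month, and the two-year increase to roughly 330-fold.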

The explosion in token consumption is not unique to Google. Enterprises across industries are discovering that the more capable their AI deployments become, the more tokens they burn. Agentic workflows — where AI systems autonomously execute multistep tasks, call tools, write and run code, and iterate on their own output — are particularly token-hungry. A single agentic coding session can consume orders of magnitude more tokens than a simple question-and-answer exchange.

This is where Flash's cost advantage becomes transformative. The model delivers what Google describes as frontier-level capabilities at less than half the price, and in some cases almost a third the price, of comparable frontier models. For a hypothetical enterprise processing one trillion tokens per day on Google Cloud, a scale Pichai said top customers are already reaching, the savings from shifting 80 percent of workloads to a Flash-and-frontier blend would exceed $1 billion per year by Google's estimate.

That is not a rounding error. It is the kind of number that reshapes procurement decisions, accelerates deployment timelines, and fundamentally alters the return-on-investment calculus for AI initiatives that many boards of directors have been scrutinizing with increasing impatience.
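The savings claim can be turned around to ask what price gap it implies. The sketch below is a back-of-the-envelope check, not Google's calculation. It assumes a 365-day year, treats the full 80 percent of workloads as moving to the cheaper model, and takes Pichai's "1/3 to one half the cost" range at face value. The function name and the resulting per-token prices are illustrative and do not come from any published price list.

```python
def implied_saving_per_million(tokens_per_day, shifted_share,
                               annual_savings_usd, days_per_year=365):
    """Dollar saving per million shifted tokens needed to reach the annual total."""
    shifted_millions = tokens_per_day * days_per_year * shifted_share / 1e6
    return annual_savings_usd / shifted_millions


# Inputs from the article: 1 trillion tokens/day, 80% shifted, $1 billion saved.
saving = implied_saving_per_million(1e12, 0.80, 1e9)
print(f"Implied saving: ${saving:.2f} per million shifted tokens")

# If the cheaper model costs a given fraction of the baseline price,
# the saving per token is (1 - fraction) * baseline, so solve for baseline.
for cost_ratio in (1 / 3, 1 / 2):
    baseline = saving / (1 - cost_ratio)
    print(f"Cost ratio {cost_ratio:.2f}: implied baseline ${baseline:.2f} per million tokens")
```

Under those assumptions, the claim implies a saving of about $3.42 per million shifted tokens, which corresponds to an average baseline price of roughly $5 to $7 per million tokens. The figure is sensitive to the workload mix, which the article does not break down.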

How Google's own engineers created a data flywheel that rivals cannot easily copy

Perhaps the most strategically significant detail Google shared Tuesday was not a benchmark score or a price point. It was a chart showing the company's own internal token consumption on Antigravity 2.0, its reimagined agentic development platform.

In March 2026, Google's developers were processing roughly half a trillion tokens per day inside Antigravity. By the time of the I/O press briefing in mid-May, that figure had surged past three trillion — a six-fold increase in approximately ten weeks, with usage doubling "literally every few weeks," according to Pichai.
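The "doubling every few weeks" remark can be checked against the two data points given. The sketch below assumes smooth exponential growth between them, which the article does not state, and uses the approximate ten-week window. The function name is illustrative.

```python
import math


def doubling_time(growth_factor, elapsed):
    """Time to double, assuming steady exponential growth over `elapsed`."""
    return elapsed * math.log(2) / math.log(growth_factor)


# From the article: about 0.5 trillion tokens/day rising to about 3 trillion
# over roughly ten weeks.
growth = 3e12 / 0.5e12
weeks = doubling_time(growth, 10)
print(f"{growth:.0f}x in 10 weeks implies doubling about every {weeks:.1f} weeks")
```

A six-fold rise in ten weeks corresponds to a doubling time of just under four weeks, which is consistent with Pichai's description.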

This internal usage creates what AI researchers call a data flywheel: the more Google's own engineers use 3.5 Flash to build products, the more real-world signal the model team collects on where the model excels and where it stumbles. That signal feeds back into model improvement, which makes the model more useful, which drives more usage, which generates more signal. It is a virtuous cycle — and it is one that competing AI labs, which rely primarily on external developer usage and synthetic benchmarks, cannot easily replicate at the same speed or fidelity.

"That scale creates a powerful feedback loop, and that is what has allowed us to keep improving the 3.5 series of models," Pichai said.

When pressed during the Q&A about the competitive frontier — particularly in light of recent advances from rival labs — Pichai acknowledged the landscape is "very dynamic" and "moving fast" but expressed confidence in Google's breadth. He added that the company's focus with the 3.5 series has been on "taking the model intelligence, making sure tool use, instruction following, long horizon use cases, agent decoding all work well."

Kavukcuoglu reinforced the agentic emphasis, noting that 3.5 Flash "can now handle multi-hour autonomous sessions" and "can independently execute complex coding pipelines or manage iterative research projects entirely by itself." The team, he said, even tested the model by having agents build a working operating system entirely from scratch.

Antigravity 2.0 transforms Google's code editor into an agent command center

The arrival of 3.5 Flash is tightly coupled with the launch of Antigravity 2.0, a significant expansion of the agentic development platform Google first introduced six months ago. What began as a coding environment has evolved into what Google describes as a full platform for developing and managing teams of autonomous AI agents, and the company says millions of developers are already building with it.

Antigravity 2.0 ships as a new standalone desktop application that serves as a central hub for orchestrating multiple agents simultaneously. Google offered the example of running one agent to code a website, a second to generate brand assets, and a third to plan product architecture — all in parallel, all managed from a single interface. For developers who prefer command-line workflows, there is Antigravity CLI. And for those building programmatic integrations, the new Antigravity SDK provides direct access to the same agent harness powering Google's own first-party products.

The co-development of 3.5 Flash and Antigravity 2.0 is no accident. "We have co-developed 3.5 Flash together with Google Antigravity, our agentic development platform," Kavukcuoglu said. This tight integration means Flash's strengths — speed, tool use, long-context reasoning, and code generation — are specifically tuned for the kinds of workloads developers execute inside the platform.

Google is also launching Managed Agents in the Gemini API, which lets developers spin up, with a single API call, an agent that reasons, uses tools, and executes code in an isolated Linux environment. It also introduced CodeMender, an AI security agent that uses Gemini's advanced reasoning to automatically find and fix critical code vulnerabilities, a capability Kavukcuoglu described as essential now that agentic systems write an increasing share of the world's code.

Google's $190 billion infrastructure bet and the custom silicon powering cheaper AI

The models and platforms sit atop a staggering infrastructure investment that Pichai revealed during the briefing: Google expects capital expenditures of approximately $180 billion to $190 billion in 2026 — roughly six times the $31 billion the company spent in 2022, just four years ago.

A key component of that spending is custom silicon. The company recently unveiled its eighth generation of Tensor Processing Units, adopting for the first time a dual-chip architecture with specialized designs for training (TPU 8o) and inference (TPU 8i). Google says it can now distribute model training across multiple data center sites using a system called Pathways, scaling beyond one million TPUs globally — a setup the company claims constitutes the largest training cluster in the world.

"This means training larger, more capable models in weeks, rather than months," Pichai said. The infrastructure advantage matters enormously for Flash's economics. Custom silicon optimized for inference means Google can run Flash at lower cost per token than competitors relying on general-purpose GPUs, and the savings get passed along — at least partially — to customers.

The capex figure also signals something strategic about Google's long-term posture. While some investors have grown nervous about the astronomical sums cloud providers are spending on AI infrastructure, Google is framing the spending as a competitive moat. The more infrastructure it builds, the cheaper it can run inference, the more attractive its models become, and the more usage it captures to improve the next generation. It is the flywheel logic again, extended from software all the way down to silicon.

Gemini Omni, Spark, and the consumer products Flash now powers at massive scale

While the enterprise cost story dominates the Flash narrative, Google also made sweeping moves on the consumer side that put the model to work across products reaching billions of people. Flash is now the default model powering the Gemini app — which has surpassed 900 million monthly active users, more than doubling from 400 million a year ago — and AI Mode in Google Search, which has crossed one billion monthly users in its first year.

Google introduced Gemini Spark, a 24/7 personal AI agent that runs on dedicated virtual machines in Google Cloud and operates in the background even when a user's device is off. Powered by 3.5 Flash with the full Antigravity harness, Spark integrates with Gmail, Docs, Sheets, and Slides. Josh Woodward, who leads Google Labs and the Gemini app, described the experience vividly: "When you use it, it almost feels like you're tossing things over your shoulder, Spark's catching them and gets the job done." On the safety front, Spark requires explicit user approval before high-stakes actions. Google also announced the Agent Payments Protocol, which lets users set strict guardrails — approved brands, spending caps, specific merchants — before an agent can spend money on their behalf. Woodward compared the design to "giving a teenager their first debit card — there's sort of limits and sort of constraints around it."

Alongside Flash, Google unveiled Gemini Omni, a model capable of generating any output from any input, starting with video. Kavukcuoglu drew a sharp distinction from Google's existing Veo model: "Veo is a text-to-video model. Omni is a true and true multimodal input, multimodal output model." All Omni-generated content carries Google's SynthID watermark, and the company announced that OpenAI, Kakao, and ElevenLabs are adopting SynthID as well.

The company also reimagined its search box for the first time in over 25 years, introduced information agents that monitor the web around the clock for user-defined conditions, and launched the Universal Cart — an AI-powered cross-merchant shopping cart built on Google Wallet. Liz Reid, who leads Google Search, called the new search box "the biggest upgrade to our iconic search box since its debut."

What Google's six-month model cadence means for the enterprise AI cost curve

Google signaled that 3.5 Flash is just the opening act of the 3.5 series. Gemini 3.5 Pro is currently in internal testing and will roll out to everyone next month. Kavukcuoglu indicated the company has been operating on roughly a six-month cadence for major model updates — Gemini 3 in November, 3.5 in May — and expects that rhythm to continue.

When a reporter from The New York Times asked how Google determines whether a release warrants a full numerical jump or a half-step increment, Kavukcuoglu said the numbering reflects the magnitude of research progress: "What defines the numbering update is really the progress that we see in our research and how it is reflected in the models and the impact that they have."

For enterprise buyers, that cadence carries an important implication: the cost-performance curve is not just improving — it is improving on a predictable schedule. A model that outperforms the previous flagship at a third the cost every six months fundamentally changes the planning horizon for AI investments. It means the token budgets that companies are blowing through today may look quaint by the end of the year.
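To see what such a schedule would mean if it continued, the sketch below compounds a cut to one-third of the cost every six months. This is a hypothetical extrapolation: the article reports one such step, and nothing in it says Google has committed to repeating it. The function name and defaults are illustrative.

```python
def relative_cost(months, cut_to=1 / 3, cycle_months=6):
    """Cost relative to today if each cycle cuts cost to `cut_to` of the prior level."""
    return cut_to ** (months / cycle_months)


for months in (6, 12, 18, 24):
    print(f"After {months:2d} months: {relative_cost(months):.4f} of today's cost")
```

On that assumption, equivalent capability would cost about one-ninth as much after a year and about one-eightieth after two, which is why budgets set on today's prices could date quickly.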

Google's announcements arrive at a moment of intense competition. OpenAI, Anthropic, Meta, and a constellation of smaller labs are all racing to deliver models that balance capability with cost. Microsoft has been aggressively integrating OpenAI's models into Azure and Copilot. But Google benefits from a structural advantage that is easy to overlook: distribution. With 13 products serving more than a billion users each — five of which exceed three billion — Google can deploy Flash to an audience no pure-play AI lab can match. Every improvement immediately benefits Search, Gmail, Docs, Maps, and YouTube. And the usage data flowing back from those billions of interactions feeds the very flywheel that makes the next model better.

The question now is whether the $1 billion savings figure — an eye-catching projection based on a specific workload mix — will survive contact with the messy reality of corporate AI deployments, where legacy systems, compliance requirements, and organizational inertia have a way of blunting even the most compelling cost curves. But if Google's own internal usage is any guide — three trillion tokens a day and climbing, doubling every few weeks, with no sign of slowing — the company is not just selling the bet. It is making the bet itself, with its own engineers, on its own infrastructure, at a scale no customer has yet attempted. In the AI cost wars, the most persuasive pitch may simply be: we did it first.