As frontier models move into production, they're running up against major barriers like power caps, inference latency, and rising token-level costs, exposing the limits of traditional scale-first architectures. For the latest stop on the VB AI Impact Series in San Francisco, VentureBeat CEO Matt Marshall welcomed Val Bercovici, Chief AI Officer at WEKA, to explore the inference crisis, and talk about the critical architectural decisions that separate profitable AI deployments from costly failures.
The evidence of the crisis lives in the gap between the gross numbers and the net numbers of leading AI providers, Bercovici said. The gross numbers show the cost of inference plunging almost a thousand-fold over the last two years, so inference must be cheap these days, right? But that's not the whole picture.
The rest of the picture comes into focus when you factor in the sheer volume of tokens being consumed. As the price per token declines in one dimension, demand for tokens is spiking in another. The net reality is that the cost of AI is going up, and the economics of AI applications, and of the entire stack that delivers them, are fundamentally upside down.
"When you take a look at the net unit cost, it’s negative right now," Bercovici said. "We’re back at this classic Uber game of investors subsidizing the real cost of the product."
The explosion of reasoning tokens
The issue came to a head at the end of last year, Bercovici explained, when OpenAI became one of the first to publicly introduce a reasoning model. In Nvidia's last earnings call, CEO Jensen Huang noted that the number of reasoning tokens generated at the base model layer has grown by two orders of magnitude compared with the tokens generated by the prior generation of pre-trained models.
This past summer, the hype around agentic AI transformed into a surge in adoption as the actual business value of AI agents became clear. To be successful, agents travel in swarms, performing tasks and subtasks in parallel. The net benefits are legitimately transformative.
Professional developers using agentic swarms today aren't writing code; instead, they're thinking like product managers, writing detailed specifications for agent swarms, which turn those requirements into tested, documented applications that have automatic agentic security and vulnerability scanning as well as performance tuning. The price, however, is that the number of tokens generated has jumped another 10X.
"Even if the price of inference has decreased by maybe a factor of 1,000, optimistically, the actual demand for tokens is at least 10,000X now," he explained. "We’re looking at a 10X order of magnitude difference."
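The arithmetic behind that claim can be sketched in a few lines. The factors below are the rough round numbers Bercovici cites (a ~1,000x drop in per-token price against a ~10,000x rise in token demand), not measured data:

```python
# Illustrative back-of-the-envelope math, not measured figures.
base_price_per_token = 1.0      # normalized per-token price two years ago
base_token_volume = 1.0         # normalized token demand two years ago

price_drop_factor = 1_000       # per-token price fell ~1,000x
demand_growth_factor = 10_000   # token demand grew ~10,000x

old_spend = base_price_per_token * base_token_volume
new_spend = (base_price_per_token / price_drop_factor) * \
            (base_token_volume * demand_growth_factor)

print(new_spend / old_spend)    # prints 10.0: net spend is ~10x higher
```

Despite tokens becoming individually cheaper, total spend rises by the ratio of the two factors, which is the "10X order of magnitude difference" in the quote.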
AGI and future-proof architecture
Agentic coding tools like Claude Code and Cursor seem to be just a hair's breadth from artificial general intelligence (AGI), able to improve developer productivity by 30% or more. Developers are seeing material benefits, completing projects in hours that would otherwise take weeks or months.
"Agents for generating quality code have now turned that corner where I see the scaling going from compute and data in the pre-training world, to reasoning models and test time compute, to now agents – [these] are the new scaling laws," Bercovici said. "I can now see how we’re getting to AGI when you combine and compound all these layers together into a true form of intelligence."
To capture the productivity gains of these agents, leaders need to revisit their processes now. Swarms and models operate fundamentally differently than human coders do. Humans modularize code into separate functions, calls, and files so the information is digestible with human-level attention and context. For an agent swarm, the opposite holds: all of your source code should be consolidated into one giant file to give those agents context.
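The "one giant file" practice amounts to flattening a repository into a single context document. A minimal sketch follows; the function name, paths, and extension filter are illustrative, not a tool Bercovici describes:

```python
# Sketch: concatenate a repo's source files into one agent-context file.
from pathlib import Path

def flatten_repo(repo_dir: str, out_file: str, suffix: str = ".py") -> int:
    """Write every matching source file under repo_dir into out_file,
    each preceded by a header naming it; return the file count."""
    sources = sorted(Path(repo_dir).rglob(f"*{suffix}"))
    with open(out_file, "w", encoding="utf-8") as out:
        for src in sources:
            out.write(f"# ===== {src} =====\n")   # header locates each file
            out.write(src.read_text(encoding="utf-8"))
            out.write("\n")
    return len(sources)
```

In practice a real pipeline would also respect ignore rules and token budgets; the point is simply that the whole codebase lands in one contiguous context the swarm can read.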
"Context is everything. Better tokens, both reasoning and output tokens, are generated as a result," Bercovici said. "But then your infrastructure really matters. The volume of tokens here is so large that if you can’t afford these tokens, then you’re not going to be able to do anything. Looking at your inference, making sure you understand it, making sure you work with experts that understand it, and optimizing it so you can go from negative unit economics to actually positive is essential."
Another critical way to architect for a high volume of tokens is to challenge some first principles and older best practices. For instance, repurposing existing Non-Volatile Memory Express (NVMe) flash drives as a memory extension rather than treating them purely as storage.
"If you can redeploy low-cost, high-capacity NVMe devices as actual DRAM [Dynamic Random-Access Memory] and get that memory bandwidth utilization that you need at inference time, you can now radically transform the economics of inference by rethinking the first principles of your infrastructure," he explained.
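As a loose illustration of the underlying technique (not WEKA's implementation), memory-mapping a file that lives on an NVMe device lets a process address flash-backed data as if it were RAM, with the operating system paging it in on demand. The file name and sizes below are made up:

```python
# Sketch: address NVMe-backed data like memory via an OS memory map.
import mmap
import os

path = "kv_cache.bin"       # hypothetical file on an NVMe-backed filesystem
size = 64 * 1024 * 1024     # 64 MiB working set, allocated sparsely

with open(path, "wb") as f:
    f.truncate(size)        # reserve the backing file without writing it

with open(path, "r+b") as f, mmap.mmap(f.fileno(), size) as buf:
    buf[:8] = b"KVCACHE0"   # writes hit the page cache, flushed to flash
    header = bytes(buf[:8]) # reads are served like ordinary memory access

os.remove(path)             # clean up the demo file
```

Production systems layer far more on top (tiering policies, RDMA paths, cache-aware schedulers), but the first-principles move is the same: treat cheap, high-capacity flash as an extension of the memory hierarchy rather than as cold storage.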
The path to profitable AI
The benefits of efficient AI architecture aren't just about faster token processing. Designing for efficiency also unlocks the power and cost savings that make AI profitable and enable rapid innovation cycles.
"The harsh reality of AI factories today is there are no assembly lines. It’s some of the most inefficient, obtuse systems in inference right now, driving up the cost of tokens," Bercovici said. "As assembly lines and efficiencies get introduced in inference, we’re going to see that radical efficiency in inference. Those leading inference providers that adopt this are going to just be big first movers, and perhaps gain an insurmountable advantage in the market."
