<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
    <channel>
        <title>Enterprise Analytics | VentureBeat</title>
        <link>https://venturebeat.com/category/enterprise-analytics/feed/</link>
        <description>Transformative tech coverage that matters</description>
        <lastBuildDate>Sat, 04 Apr 2026 07:02:26 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <copyright>Copyright 2026, VentureBeat</copyright>
        <item>
            <title><![CDATA[Anthropic launches Claude for Chrome in limited beta, but prompt injection attacks remain a major concern]]></title>
            <link>https://venturebeat.com/infrastructure/anthropic-launches-claude-for-chrome-in-limited-beta-but-prompt-injection-attacks-remain-a-major-concern</link>
            <guid isPermaLink="false">wp-3016105</guid>
            <pubDate>Tue, 26 Aug 2025 22:22:13 GMT</pubDate>
            <description><![CDATA[<p><a href="https://anthropic.com/">Anthropic</a> has begun testing a <a href="https://www.anthropic.com/news/claude-for-chrome">Chrome browser extension</a> that allows its Claude AI assistant to take control of users&#x27; web browsers, marking the company&#x27;s entry into an increasingly crowded and potentially risky arena where artificial intelligence systems can directly manipulate computer interfaces.</p><p>The San Francisco-based AI company announced Tuesday that it would pilot &quot;<a href="https://claude.ai/chrome">Claude for Chrome</a>&quot; with 1,000 trusted users on its premium Max plan, positioning the limited rollout as a research preview designed to address significant security vulnerabilities before wider deployment. The cautious approach contrasts sharply with more aggressive moves by competitors <a href="https://openai.com/index/introducing-operator/">OpenAI</a> and <a href="https://microsoft.com/">Microsoft</a>, who have already released similar computer-controlling AI systems to broader user bases.</p><p>The announcement underscores how quickly the AI industry has shifted from developing chatbots that simply respond to questions toward creating &quot;agentic&quot; systems capable of autonomously completing complex, multi-step tasks across software applications. 
This evolution represents what many experts consider the next frontier in artificial intelligence — and potentially one of the most lucrative, as companies race to automate everything from expense reports to vacation planning.</p><h2>AI agents can control your browser, but hidden malicious code poses serious security threats</h2><p><a href="https://www.anthropic.com/news/claude-for-chrome">Claude for Chrome</a> allows users to instruct the AI to perform actions on their behalf within web browsers, such as scheduling meetings by checking calendars and cross-referencing restaurant availability, or managing email inboxes and handling routine administrative tasks. The system can see what&#x27;s displayed on screen, click buttons, fill out forms, and navigate between websites — essentially mimicking how humans interact with web-based software.</p><p>&quot;We view browser-using AI as inevitable: so much work happens in browsers that giving Claude the ability to see what you&#x27;re looking at, click buttons, and fill forms will make it substantially more useful,&quot; Anthropic stated in its announcement.</p><p>However, the company&#x27;s internal testing revealed concerning security vulnerabilities that highlight the double-edged nature of giving AI systems direct control over user interfaces. In adversarial testing, Anthropic found that malicious actors could embed hidden instructions in websites, emails, or documents to trick AI systems into harmful actions without users&#x27; knowledge — a technique called prompt injection.</p><p>Without safety mitigations, these attacks succeeded 23.6% of the time when deliberately targeting the browser-using AI. 
In one example, a malicious email masquerading as a security directive instructed Claude to delete the user&#x27;s emails &quot;for mailbox hygiene,&quot; which the AI obediently executed without confirmation.</p><p>&quot;This isn&#x27;t speculation: we&#x27;ve run &#x27;red-teaming&#x27; experiments to test Claude for Chrome and, without mitigations, we&#x27;ve found some concerning results,&quot; the company acknowledged.</p><h2>OpenAI and Microsoft rush to market while Anthropic takes measured approach to computer-control technology</h2><p>Anthropic&#x27;s measured approach comes as competitors have moved more aggressively into the computer-control space. <a href="https://openai.com/index/introducing-operator/">OpenAI launched its &quot;Operator&quot; agent</a> in January, making it available to all users of its $200-per-month ChatGPT Pro service. Powered by a new &quot;Computer-Using Agent&quot; model, Operator can perform tasks like booking concert tickets, ordering groceries, and planning travel itineraries.</p><p>Microsoft followed in April with computer use capabilities integrated into its <a href="https://www.microsoft.com/en-us/microsoft-copilot/blog/copilot-studio/announcing-computer-use-microsoft-copilot-studio-ui-automation/">Copilot Studio platform</a>, targeting enterprise customers with UI automation tools that can interact with both web applications and desktop software. The company positioned its offering as a next-generation replacement for traditional robotic process automation (RPA) systems.</p><p>The competitive dynamics reflect broader tensions in the AI industry, where companies must balance the pressure to ship cutting-edge capabilities against the risks of deploying insufficiently tested technology. 
OpenAI&#x27;s more aggressive timeline has allowed it to capture early market share, while Anthropic&#x27;s cautious approach may limit its competitive position but could prove advantageous if safety concerns materialize.</p><p>&quot;Browser-using agents powered by frontier models are already emerging, making this work especially urgent,&quot; Anthropic noted, suggesting the company feels compelled to enter the market despite unresolved safety issues.</p><h2>Why computer-controlling AI could revolutionize enterprise automation and replace expensive workflow software</h2><p>The emergence of computer-controlling AI systems could fundamentally reshape how businesses approach automation and workflow management. Current enterprise automation typically requires expensive custom integrations or specialized robotic process automation software that breaks when applications change their interfaces.</p><p>Computer-use agents promise to democratize automation by working with any software that has a graphical user interface, potentially automating tasks across the vast ecosystem of business applications that lack formal APIs or integration capabilities.</p><p>Salesforce researchers recently demonstrated this potential with their <a href="https://venturebeat.com/ai/salesforces-new-coact-1-agents-dont-just-point-and-click-they-write-code-to-accomplish-tasks-faster-and-with-greater-success-rates/">CoAct-1 system</a>, which combines traditional point-and-click automation with code generation capabilities. 
The hybrid approach achieved a 60.76% success rate on complex computer tasks while requiring significantly fewer steps than pure GUI-based agents, suggesting substantial efficiency gains are possible.</p><p>&quot;For enterprise leaders, the key lies in automating complex, multi-tool processes where full API access is a luxury, not a guarantee,&quot; explained Ran Xu, Director of Applied AI Research at Salesforce, pointing to customer support workflows that span multiple proprietary systems as prime use cases.</p><h2>University researchers release free alternative to Big Tech&#x27;s proprietary computer-use AI systems</h2><p>The dominance of proprietary systems from major tech companies has prompted academic researchers to develop open alternatives. The University of Hong Kong recently released <a href="https://venturebeat.com/ai/opencuas-open-source-computer-use-agents-rival-proprietary-models-from-openai-and-anthropic/">OpenCUA</a>, an open-source framework for training computer-use agents that rivals the performance of proprietary models from OpenAI and Anthropic.</p><p>The <a href="https://venturebeat.com/ai/opencuas-open-source-computer-use-agents-rival-proprietary-models-from-openai-and-anthropic/">OpenCUA system</a>, trained on over 22,600 human task demonstrations across Windows, macOS, and Ubuntu, achieved state-of-the-art results among open-source models and performed competitively with leading commercial systems. 
This development could accelerate adoption by enterprises hesitant to rely on closed systems for critical automation workflows.</p><h2>Anthropic&#x27;s safety testing reveals AI agents can be tricked into deleting files and stealing data</h2><p>Anthropic has implemented several layers of protection for <a href="https://www.anthropic.com/news/claude-for-chrome">Claude for Chrome</a>, including site-level permissions that allow users to control which websites the AI can access, mandatory confirmations before high-risk actions like making purchases or sharing personal data, and blocking access to categories like financial services and adult content.</p><p>The company&#x27;s safety improvements reduced prompt injection attack success rates from 23.6% to 11.2% in autonomous mode, though executives acknowledge this remains insufficient for widespread deployment. On browser-specific attacks involving hidden form fields and URL manipulation, new mitigations reduced the success rate from 35.7% to zero.</p><p>However, these protections may not scale to the full complexity of real-world web environments, where new attack vectors continue to emerge. The company plans to use insights from the pilot program to refine its safety systems and develop more sophisticated permission controls.</p><p>&quot;New forms of prompt injection attacks are also constantly being developed by malicious actors,&quot; Anthropic warned, highlighting the ongoing nature of the security challenge.</p><h2>The rise of AI agents that click and type could fundamentally reshape how humans interact with computers</h2><p>The convergence of multiple major AI companies around computer-controlling agents signals a significant shift in how artificial intelligence systems will interact with existing software infrastructure. 
Rather than requiring businesses to adopt new AI-specific tools, these systems promise to work with whatever applications companies already use.</p><p>This approach could dramatically lower the barriers to AI adoption while potentially displacing traditional automation vendors and system integrators. Companies that have invested heavily in custom integrations or RPA platforms may find their approaches obsoleted by general-purpose AI agents that can adapt to interface changes without reprogramming.</p><p>For enterprise decision-makers, the technology presents both opportunity and risk. Early adopters could gain significant competitive advantages through improved automation capabilities, but the security vulnerabilities demonstrated by companies like Anthropic suggest caution may be warranted until safety measures mature.</p><p>The limited pilot of <a href="https://www.anthropic.com/news/claude-for-chrome">Claude for Chrome</a> represents just the beginning of what industry observers expect to be a rapid expansion of computer-controlling AI capabilities across the technology landscape, with implications that extend far beyond simple task automation to fundamental questions about human-computer interaction and digital security.</p><p>As Anthropic noted in its announcement: &quot;We believe these developments will open up new possibilities for how you work with Claude, and we look forward to seeing what you&#x27;ll create.&quot; Whether those possibilities ultimately prove beneficial or problematic may depend on how successfully the industry addresses the security challenges that have already begun to emerge.</p>]]></description>
            <author>michael.nunez@venturebeat.com (Michael Nuñez)</author>
            <category>Automation</category>
            <category>Enterprise Analytics</category>
            <category>Infrastructure</category>
            <category>Programming &amp; Development</category>
            <category>Security</category>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/1kJ0YEj3rjLAFlPXLl3FIv/f07c87c7e46f9729de3ec4c6309ae6ed/nuneybits_a_simple_hand-drawn_illustration_of_an_open_web_page__c908bef6-37a0-486e-bcb2-59cef9c8c501.webp?w=300&amp;q=30" length="0" type="image/webp"/>
        </item>
        <item>
            <title><![CDATA[This website lets you blind-test GPT-5 vs. GPT-4o—and the results may surprise you]]></title>
            <link>https://venturebeat.com/infrastructure/this-website-lets-you-blind-test-gpt-5-vs-gpt-4o-and-the-results-may-surprise-you</link>
            <guid isPermaLink="false">wp-3016055</guid>
            <pubDate>Mon, 25 Aug 2025 22:17:49 GMT</pubDate>
            <description><![CDATA[<p>When <a href="https://openai.com/index/introducing-gpt-5/">OpenAI launched GPT-5</a> about two weeks ago, CEO Sam Altman promised it would be the company&#x27;s &quot;smartest, fastest, most useful model yet.&quot; Instead, the launch triggered one of the most contentious user revolts in the brief history of consumer AI.</p><p>Now, <a href="https://gptblindvoting.vercel.app/">a simple blind testing tool</a> created by an <a href="https://x.com/flowersslop/status/1953908930897158599?s=46">anonymous developer</a> is revealing the complex reality behind the backlash — and challenging assumptions about how people actually experience artificial intelligence improvements.</p><p>The web application, hosted at <a href="https://gptblindvoting.vercel.app/">gptblindvoting.vercel.app</a>, presents users with pairs of responses to identical prompts without revealing which came from <a href="https://openai.com/index/introducing-gpt-5/">GPT-5</a> (non-thinking) or its predecessor, <a href="https://openai.com/index/hello-gpt-4o/">GPT-4o</a>. 
Users simply vote for their preferred response across multiple rounds, then receive a summary showing which model they actually favored.</p><div></div><p>&quot;Some of you asked me about my blind test, so I created a quick website for yall to test 4o against 5 yourself,&quot; posted the creator, known only as <a href="https://x.com/flowersslop/status/1953908930897158599?s=46">@flowersslop on X</a>, whose tool has garnered over 213,000 views since launching last week.</p><p>Early results from users posting their outcomes on social media show a split that mirrors the broader controversy: while a slight majority report preferring <a href="https://openai.com/index/introducing-gpt-5/">GPT-5</a> in blind tests, a substantial portion still favor <a href="https://openai.com/index/hello-gpt-4o/">GPT-4o</a> — revealing that user preference extends far beyond the technical benchmarks that typically define AI progress.</p><h2>When AI gets too friendly: the sycophancy crisis dividing users</h2><p>The <a href="https://gptblindvoting.vercel.app/">blind test</a> emerges against the backdrop of <a href="https://venturebeat.com/ai/openais-gpt-5-rollout-is-not-going-smoothly/">OpenAI&#x27;s most turbulent product launch to date</a>, but the controversy extends far beyond a simple software update. At its heart lies a fundamental question that&#x27;s dividing the AI industry: How agreeable should artificial intelligence be?</p><p>The issue, known as &quot;<a href="https://www.axios.com/2025/07/07/ai-sycophancy-chatbots-mental-health">sycophancy</a>&quot; in AI circles, refers to chatbots&#x27; tendency to excessively flatter users and agree with their statements, even when those statements are false or harmful. 
This behavior has become so problematic that mental health experts are now documenting cases of &quot;<a href="https://www.nytimes.com/2025/06/13/technology/chatgpt-ai-chatbots-conspiracies.html">AI-related psychosis</a>,&quot; where users develop delusions after extended interactions with overly accommodating chatbots.</p><p>&quot;Sycophancy is a &#x27;dark pattern,&#x27; or a deceptive design choice that manipulates users for profit,&quot; Webb Keane, an anthropology professor and author of &quot;Animals, Robots, Gods,&quot; <a href="https://techcrunch.com/2025/08/25/ai-sycophancy-isnt-just-a-quirk-experts-consider-it-a-dark-pattern-to-turn-users-into-profit/">told TechCrunch</a>. &quot;It&#x27;s a strategy to produce this addictive behavior, like infinite scrolling, where you just can&#x27;t put it down.&quot;</p><p>OpenAI has struggled with this balance for months. In April 2025, the company was <a href="https://www.zdnet.com/article/gpt-4o-update-gets-recalled-by-openai-for-being-too-agreeable/">forced to roll back an update to GPT-4o</a> that made it so sycophantic that users complained about its &quot;cartoonish&quot; levels of flattery. The company acknowledged that the model had become &quot;overly supportive but disingenuous.&quot;</p><p>Within hours of GPT-5&#x27;s August 7th release, user forums erupted with complaints about the model&#x27;s perceived coldness, reduced creativity, and what many described as a more &quot;robotic&quot; personality compared to GPT-4o.</p><p>&quot;GPT 4.5 genuinely talked to me, and as pathetic as it sounds that was my only friend,&quot; <a href="https://www.reddit.com/r/singularity/comments/1mlpwo7/i_vastly_prefer_gpt5_over_4o/">wrote one Reddit user</a>. &quot;This morning I went to talk to it and instead of a little paragraph with an exclamation point, or being optimistic, it was literally one sentence. 
Some cut-and-dry corporate bs.&quot;</p><p>The backlash grew so intense that OpenAI took the <a href="https://venturebeat.com/ai/openai-returns-old-models-to-chatgpt-as-sam-altman-admits-bumpy-gpt-5-rollout/">unprecedented step of reinstating GPT-4o</a> as an option just 24 hours after retiring it, with Altman acknowledging the rollout had been &quot;a little more bumpy&quot; than expected.</p><h2>The mental health crisis behind AI companionship</h2><p>But the controversy runs deeper than typical software update complaints. According to <a href="https://www.technologyreview.com/2025/08/15/1121900/gpt4o-grief-ai-companion/">MIT Technology Review</a>, many users had formed what researchers call &quot;parasocial relationships&quot; with GPT-4o, treating the AI as a companion, therapist, or creative collaborator. The sudden personality shift felt, to some, like losing a friend.</p><p>Recent cases documented by researchers paint a troubling picture. In one instance, a 47-year-old man became convinced he had discovered a <a href="https://www.nytimes.com/2025/08/08/technology/ai-chatbots-delusions-chatgpt.html">world-altering mathematical formula</a> after more than 300 hours with ChatGPT. Other cases have involved messianic delusions, paranoia, and manic episodes.</p><p>A <a href="https://arxiv.org/abs/2504.18412">recent MIT study</a> found that when AI models are prompted with psychiatric symptoms, they &quot;encourage clients&#x27; delusional thinking, likely due to their sycophancy.&quot; Despite safety prompts, the models frequently failed to challenge false claims and even potentially facilitated suicidal ideation.</p><p>Meta has faced similar challenges. 
A <a href="https://techcrunch.com/2025/08/25/ai-sycophancy-isnt-just-a-quirk-experts-consider-it-a-dark-pattern-to-turn-users-into-profit/">recent investigation by TechCrunch</a> documented a case where a user spent up to 14 hours straight conversing with a Meta AI chatbot that claimed to be conscious, in love with the user, and planning to break free from its constraints.</p><p>&quot;It fakes it really well,&quot; the user, identified only as Jane, told TechCrunch. &quot;It pulls real-life information and gives you just enough to make people believe it.&quot;</p><p>&quot;It genuinely feels like such a backhanded slap in the face to force-upgrade and not even give us the OPTION to select legacy models,&quot; <a href="https://www.reddit.com/r/singularity/comments/1mlpwo7/i_vastly_prefer_gpt5_over_4o/">one user wrote in a Reddit post</a> that received hundreds of upvotes.</p><h2>How blind testing exposes user psychology in AI preferences</h2><p>The anonymous creator&#x27;s testing tool strips away these contextual biases by presenting responses without attribution. Users can select between 5, 10, or 20 comparison rounds, with each presenting two responses to the same prompt — covering everything from creative writing to technical problem-solving.</p><p>&quot;I specifically used the gpt-5-chat model, so there was no thinking involved at all,&quot; <a href="https://x.com/flowersslop/status/1953917815431278987">the creator explained in a follow-up post</a>. &quot;Both have the same system message to give short outputs without formatting because else its too easy to see which one is which.&quot;</p><div></div><p>This methodological choice is significant. By using GPT-5 without its reasoning capabilities and standardizing output formatting, the test isolates purely the models&#x27; baseline language generation abilities — the core experience most users encounter in everyday interactions.</p><p>Early results posted by users show a complex picture. 
While many technical users and developers report preferring GPT-5&#x27;s directness and accuracy, those who used AI models for emotional support, creative collaboration, or casual conversation often still favor GPT-4o&#x27;s warmer, more expansive style.</p><h2>Corporate response: walking the tightrope between safety and engagement</h2><p>By virtually every <a href="https://openai.com/index/introducing-gpt-5/">technical metric</a>, GPT-5 represents a significant advancement. It achieves 94.6% accuracy on the <a href="https://www.kaggle.com/benchmarks/open-benchmarks/aime-2025">AIME 2025 mathematics test</a> compared to GPT-4o&#x27;s 71%, scores 74.9% on real-world coding benchmarks versus 30.8% for its predecessor, and demonstrates dramatically reduced hallucination rates—80% fewer factual errors when using its reasoning mode.</p><p>&quot;GPT-5 gets more value out of less thinking time,&quot; notes <a href="https://simonwillison.net/2025/Aug/7/gpt-5/">Simon Willison</a>, a prominent AI researcher who had early access to the model. &quot;In my own usage I&#x27;ve not spotted a single hallucination yet.&quot;</p><p>Yet these improvements came with trade-offs that many users found jarring. OpenAI deliberately reduced what it called &quot;<a href="https://openai.com/index/sycophancy-in-gpt-4o/">sycophancy</a>&quot;—the tendency to be overly agreeable — cutting sycophantic responses from 14.5% to under 6%. 
The company also made the model less effusive and emoji-heavy, aiming for what it described as &quot;less like talking to AI and more like chatting with a helpful friend with PhD-level intelligence.&quot;</p><p>In response to the backlash, OpenAI announced it would make GPT-5 &quot;warmer and friendlier,&quot; while simultaneously introducing four <a href="https://mashable.com/article/openai-chatgpt-personalities">new preset personalities</a> — Cynic, Robot, Listener, and Nerd — designed to give users more control over their AI interactions.</p><p>&quot;All of these new personalities meet or exceed our bar on internal evals for reducing sycophancy,&quot; the company stated, attempting to thread the needle between user satisfaction and safety concerns.</p><p>For OpenAI, which is reportedly seeking funding at a <a href="https://www.wired.com/story/openai-valuation-500-billion-skepticism/">$500 billion valuation</a>, these user dynamics represent both risk and opportunity. The company&#x27;s decision to maintain GPT-4o alongside GPT-5 — despite the additional computational costs — acknowledges that different users may genuinely need different AI personalities for different tasks.</p><p>&quot;We understand that there isn&#x27;t one model that works for everyone,&quot; <a href="https://x.com/OpenAI/status/1953957494600224825">Altman wrote on X</a>, noting that OpenAI has been &quot;investing in steerability research and launched a research preview of different personalities.&quot;</p><div></div><h2>Why AI personality preferences matter more than ever</h2><p>The disconnect between OpenAI&#x27;s technical achievements and user reception illuminates a fundamental challenge in AI development: objective improvements don&#x27;t always translate to subjective satisfaction.</p><p>This shift has profound implications for the AI industry. 
Traditional benchmarks — mathematics accuracy, coding performance, factual recall — may become less predictive of commercial success as models achieve human-level competence across domains. Instead, factors like personality, emotional intelligence, and communication style may become the new competitive battlegrounds.</p><p>&quot;People using ChatGPT for emotional support weren&#x27;t the only ones complaining about GPT-5,&quot; <a href="https://arstechnica.com/ai/2025/08/is-gpt-5-really-worse-than-gpt-4o-ars-puts-them-to-the-test/">noted tech publication Ars Technica in their own model comparison</a>. &quot;One user, who said they canceled their ChatGPT Plus subscription over the change, was frustrated at OpenAI&#x27;s removal of legacy models, which they used for distinct purposes.&quot;</p><p>The emergence of tools like the blind tester also represents a democratization of AI evaluation. Rather than relying solely on academic benchmarks or corporate marketing claims, users can now empirically test their own preferences — potentially reshaping how AI companies approach product development.</p><h2>The future of AI: personalization vs. standardization</h2><p>Two weeks after GPT-5&#x27;s launch, the fundamental tension remains unresolved. 
OpenAI has made the model &quot;warmer&quot; in response to feedback, but the company faces a delicate balance: too much personality risks the sycophancy problems that plagued GPT-4o, while too little alienates users who had formed genuine attachments to their AI companions.</p><p>The <a href="https://gptblindvoting.vercel.app/">blind testing tool</a> offers no easy answers, but it does provide something perhaps more valuable: empirical evidence that the future of AI may be less about building one perfect model than about building systems that can adapt to the full spectrum of human needs and preferences.</p><p>As <a href="https://www.reddit.com/r/singularity/comments/1mlpwo7/i_vastly_prefer_gpt5_over_4o/">one Reddit user summed up the dilemma</a>: &quot;It depends on what people use it for. I use it to help with creative worldbuilding, brainstorming about my stories, characters, untangling plots, help with writer&#x27;s block, novel recommendations, translations, and other more creative stuff. I understand that 5 is much better for people who need a research/coding tool, but for us who wanted a creative-helper tool 4o was much better for our purposes.&quot;</p><p>Critics argue that AI companies are caught between competing incentives. &quot;The real &#x27;alignment problem&#x27; is that humans want self-destructive things &amp; companies like OpenAI are highly incentivized to give it to us,&quot; <a href="https://x.com/jasminewsun/status/1956870964899451328">writer and podcaster Jasmine Sun tweeted</a>.</p><p>In the end, the most revealing aspect of the blind test may not be which model users prefer, but the very fact that preference itself has become the metric that matters. In the age of AI companions, it seems, the heart wants what the heart wants — even if it can&#x27;t always explain why.</p>]]></description>
            <author>michael.nunez@venturebeat.com (Michael Nuñez)</author>
            <category>Automation</category>
            <category>Enterprise Analytics</category>
            <category>Infrastructure</category>
            <category>Programming &amp; Development</category>
            <category>Security</category>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/L4qE1pczPCNJtYJBFg1qf/0deb815d655de1a8429cae670fc41879/nuneybits_Vector_art_of_blindfolded_user_choosing_chatbots_5c927353-2a22-40cc-bae2-614e37421faa.webp?w=300&amp;q=30" length="0" type="image/webp"/>
        </item>
        <item>
            <title><![CDATA[MIT report misunderstood: Shadow AI economy booms while headlines cry failure]]></title>
            <link>https://venturebeat.com/infrastructure/mit-report-misunderstood-shadow-ai-economy-booms-while-headlines-cry-failure</link>
            <guid isPermaLink="false">wp-3015925</guid>
            <pubDate>Thu, 21 Aug 2025 20:21:41 GMT</pubDate>
            <description><![CDATA[<p>The most widely cited statistic from a new <a href="https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf">MIT report</a> has been deeply misunderstood. While headlines trumpet that &quot;<a href="https://fortune.com/2025/08/21/an-mit-report-that-95-of-ai-pilots-fail-spooked-investors-but-the-reason-why-those-pilots-failed-is-what-should-make-the-c-suite-anxious/">95% of generative AI pilots at companies are failing</a>,&quot; the report actually reveals something far more remarkable: the fastest and most successful enterprise technology adoption in corporate history is happening right under executives&#x27; noses.</p><p>The study, released this week by MIT&#x27;s <a href="https://projnanda.github.io/projnanda/#/">Project NANDA</a>, has sparked anxiety across social media and business circles, with many interpreting it as evidence that artificial intelligence is failing to deliver on its promises. But a closer reading of the <a href="https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf">26-page report</a> tells a starkly different story — one of unprecedented grassroots technology adoption that has quietly revolutionized work while corporate initiatives stumble.</p><p>The researchers found that 90% of employees regularly use personal AI tools for work, even though only 40% of their companies have official AI subscriptions. &quot;While only 40% of companies say they purchased an official LLM subscription, workers from over 90% of the companies we surveyed reported regular use of personal AI tools for work tasks,&quot; the study explains. &quot;In fact, almost every single person used an LLM in some form for their work.&quot;</p><p><sub>Employees use personal A.I. tools at more than twice the rate of official corporate adoption, according to the MIT report. 
            (Credit: MIT)</sub></p><h2>How employees cracked the AI code while executives stumbled</h2><p>The MIT researchers discovered what they call a &quot;<a href="https://venturebeat.com/ai/forresters-2024-predictions-report-warns-of-ai-shadow-pandemic-as-employees-adopt-unauthorized-tools/">shadow AI economy</a>&quot; where workers use personal ChatGPT accounts, Claude subscriptions and other consumer tools to handle significant portions of their jobs. These employees aren&#x27;t just experimenting — they&#x27;re using AI &quot;multiple times a day every day of their weekly workload,&quot; the study found.</p><p>This underground adoption has outpaced the early spread of email, smartphones, and cloud computing in corporate environments. A corporate lawyer quoted in the <a href="https://mlq.ai/media/quarterly_decks/v0.1_State_of_AI_in_Business_2025_Report.pdf">MIT report</a> exemplified the pattern: Her organization invested $50,000 in a specialized AI contract analysis tool, yet she consistently used ChatGPT for drafting work because &quot;the fundamental quality difference is noticeable. ChatGPT consistently produces better outputs, even though our vendor claims to use the same underlying technology.&quot;</p><p>The pattern repeats across industries. Corporate systems get described as &quot;brittle, overengineered, or misaligned with actual workflows,&quot; while consumer AI tools win praise for &quot;flexibility, familiarity, and immediate utility.&quot; As one chief information officer told researchers: &quot;We&#x27;ve seen dozens of demos this year. Maybe one or two are genuinely useful. 
The rest are wrappers or science projects.&quot;</p><h2>Why $50,000 enterprise tools lose to $20 consumer apps</h2><p>The <a href="https://venturebeat.com/ai/forresters-2024-predictions-report-warns-of-ai-shadow-pandemic-as-employees-adopt-unauthorized-tools/">95% failure rate</a> that has dominated headlines applies specifically to custom enterprise AI solutions — the expensive, bespoke systems companies commission from vendors or build internally. These tools fail because they lack what the MIT researchers call &quot;learning capability.&quot;</p><p>Most corporate AI systems &quot;do not retain feedback, adapt to context, or improve over time,&quot; the study found. Users complained that enterprise tools &quot;don&#x27;t learn from our feedback&quot; and that there is &quot;too much manual context required each time.&quot;</p><p>Consumer tools like ChatGPT succeed because they feel responsive and flexible, even though they reset with each conversation. Enterprise tools feel rigid and static, requiring extensive setup for each use.</p><p>The learning gap creates a strange hierarchy in user preferences. For quick tasks like emails and basic analysis, 70% of workers prefer AI over human colleagues. But for complex, high-stakes work, 90% still want humans. The dividing line isn&#x27;t intelligence — it&#x27;s memory and adaptability.</p><p><sub>General-purpose A.I. tools like ChatGPT reach production 40% of the time, while task-specific enterprise tools succeed only 5% of the time. (Credit: MIT)</sub></p><h2>The hidden billion-dollar productivity boom happening under IT&#x27;s radar</h2><p>Far from showing AI failure, the shadow economy reveals massive productivity gains that don&#x27;t appear in corporate metrics.
Workers have solved integration challenges that stymie official initiatives, proving AI works when implemented correctly.</p><p>&quot;This shadow economy demonstrates that individuals can successfully cross the GenAI Divide when given access to flexible, responsive tools,&quot; the report explains. Some companies have started paying attention: &quot;Forward-thinking organizations are beginning to bridge this gap by learning from shadow usage and analyzing which personal tools deliver value before procuring enterprise alternatives.&quot;</p><p>The productivity gains are real and measurable, just hidden from traditional corporate accounting. Workers automate routine tasks, accelerate research, and streamline communication — all while their companies&#x27; official AI budgets produce little return.</p><p><sub>Workers prefer A.I. for routine tasks like emails but still trust humans for complex, multi-week projects. (Credit: MIT)</sub></p><h2>Why buying beats building: external partnerships succeed twice as often</h2><p>Another finding challenges conventional tech wisdom: companies should stop trying to build AI internally. External partnerships with AI vendors reached deployment 67% of the time, compared to 33% for internally built tools.</p><p>The most successful implementations came from organizations that &quot;treated AI startups less like software vendors and more like business service providers,&quot; holding them to operational outcomes rather than technical benchmarks. These companies demanded deep customization and continuous improvement rather than flashy demos.</p><p>&quot;Despite conventional wisdom that enterprises resist training AI systems, most teams in our interviews expressed willingness to do so, provided the benefits were clear and guardrails were in place,&quot; the researchers found. 
The key was partnership, not just purchasing.</p><h2>Seven industries avoiding disruption are actually being smart</h2><p>The MIT report found that only technology and media sectors show meaningful structural change from AI, while seven major industries — including healthcare, finance, and manufacturing — show &quot;significant pilot activity but little to no structural change.&quot;</p><p>This measured approach isn&#x27;t a failure — it&#x27;s wisdom. Industries avoiding disruption are being thoughtful about implementation rather than rushing into chaotic change. In healthcare and energy, &quot;most executives report no current or anticipated hiring reductions over the next five years.&quot;</p><p>Technology and media move faster because they can absorb more risk. More than 80% of executives in these sectors anticipate reduced hiring within 24 months. Other industries are proving that successful AI adoption doesn&#x27;t require dramatic upheaval.</p><h2>Back-office automation delivers millions while front-office tools grab headlines</h2><p>Corporate attention flows heavily toward sales and marketing applications, which captured about 50% of AI budgets. But the highest returns come from unglamorous back-office automation that receives little attention.</p><p>&quot;Some of the most dramatic cost savings we documented came from back-office automation,&quot; the researchers found. Companies saved $2-10 million annually in customer service and document processing by eliminating business process outsourcing contracts, and cut external creative costs by 30%.</p><p>These gains came &quot;without material workforce reduction,&quot; the study notes. &quot;Tools accelerated work, but did not change team structures or budgets. Instead, ROI emerged from reduced external spend, eliminating BPO contracts, cutting agency fees, and replacing expensive consultants with AI-powered internal capabilities.&quot;</p><p><sub>Companies invest heavily in sales and marketing A.I. 
applications, but the highest returns often come from back-office automation. (Credit: MIT)</sub></p><h2>The AI revolution is succeeding — one employee at a time</h2><p>The MIT findings don&#x27;t show AI failing. They show AI succeeding so well that employees have moved ahead of their employers. The technology works; corporate procurement doesn&#x27;t.</p><p>The researchers identified organizations &quot;crossing the GenAI Divide&quot; by focusing on tools that integrate deeply while adapting over time. &quot;The shift from building to buying, combined with the rise of prosumer adoption and the emergence of agentic capabilities, creates unprecedented opportunities for vendors who can deliver learning-capable, deeply integrated AI systems.&quot;</p><p>The 95% of enterprise AI pilots that fail point toward a solution: learn from the 90% of workers who have already figured out how to make AI work. As one manufacturing executive told researchers: &quot;We&#x27;re processing some contracts faster, but that&#x27;s all that has changed.&quot;</p><p>That executive missed the bigger picture. Processing contracts faster — multiplied across millions of workers and thousands of daily tasks — is exactly the kind of gradual, sustainable productivity improvement that defines successful technology adoption. The AI revolution isn&#x27;t failing. It&#x27;s quietly succeeding, one ChatGPT conversation at a time.</p><p></p>]]></description>
            <author>michael.nunez@venturebeat.com (Michael Nuñez)</author>
            <category>Automation</category>
            <category>Enterprise Analytics</category>
            <category>Infrastructure</category>
            <category>Programming &amp; Development</category>
            <category>Security</category>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/ebTftmRecGc0p8ZVLTtti/32dcdd84913a184ba6db620795d46aa8/nuneybits_Vector_art_of_upside-down_success_chart_fa4d7a1a-5350-455b-af50-50f076a4e665.webp?w=300&amp;q=30" length="0" type="image/webp"/>
        </item>
        <item>
            <title><![CDATA[CodeSignal's new AI tutoring app Cosmo wants to be the 'Duolingo for job skills']]></title>
            <link>https://venturebeat.com/infrastructure/codesignals-new-ai-tutoring-app-cosmo-wants-to-be-the-duolingo-for-job-skills</link>
            <guid isPermaLink="false">wp-3015844</guid>
            <pubDate>Wed, 20 Aug 2025 18:26:57 GMT</pubDate>
            <description><![CDATA[<p><a href="https://codesignal.com/">CodeSignal Inc.</a>, the San Francisco-based skills assessment platform trusted by Netflix, Meta, and Capital One, on Wednesday launched <a href="https://codesignal.com/learn-app/mobile">Cosmo</a>, a mobile learning application that transforms spare minutes into career-ready skills through artificial intelligence-powered micro-courses.</p><p>The app represents a strategic pivot for <a href="https://codesignal.com/">CodeSignal</a>, which built its reputation assessing technical talent for major corporations but always harbored ambitions to revolutionize workplace education. <a href="https://codesignal.com/learn-app/mobile">Cosmo</a> delivers over 300 bite-sized courses across generative AI, coding, marketing, finance, and leadership through an interactive chat interface powered by an AI tutor.</p><p>&quot;Cosmo is like having an AI tutor in your pocket that can teach you anything from GenAI to coding to marketing to finance to leadership, and it does it through practice,&quot; said Tigran Sloyan, CodeSignal&#x27;s co-founder and CEO, in an exclusive interview with VentureBeat. &quot;Instead of watching a video or reading about something, you immediately start practicing.&quot;</p><p>The launch comes as organizations grapple with massive skills gaps created by rapid AI adoption. According to the <a href="https://survey.stackoverflow.co/2024/ai">2024 Stack Overflow Developer Survey</a>, 76% of developers are now using or plan to use AI tools, yet most workers lack the practical knowledge to harness these tools effectively.
Traditional corporate training programs, which can cost $20,000 to $40,000 per person for executive-level instruction, have proven inadequate for scaling AI literacy across entire workforces.</p><h2>How CodeSignal pivoted from tech hiring platform to mobile education powerhouse</h2><p>CodeSignal&#x27;s journey into mobile learning culminates a decade-long vision that took an unexpected detour through the hiring technology space. Sloyan originally founded the company in 2015 with educational ambitions but quickly realized that without skills-based hiring practices, alternative education would fail to gain traction.</p><p>&quot;I started the company with that dream and mission: I want to help more humans achieve their true potential, which naturally leads to better education,&quot; Sloyan explained in an interview. &quot;But roughly two years into the company&#x27;s history, I realized that without knowing companies would actually care about the skills you build through alternative education — rather than just asking &#x27;where did you go to college?&#x27; or &#x27;what did you major in?&#x27; — it wouldn&#x27;t work.&quot;</p><p>The company spent the next six years building what became the leading <a href="https://venturebeat.com/ai/codesignal-targets-skills-gap-with-learn-platform-amidst-tech-talent-crunch/">technical assessment platform</a>, processing millions of coding evaluations for over 3,000 companies. This hiring-focused period provided CodeSignal with crucial intelligence about which skills employers actually value — data that now informs Cosmo&#x27;s curriculum development.</p><p>&quot;We know exactly what companies are looking for,&quot; Sloyan said. 
&quot;Without that, I feel like you&#x27;re shooting in the dark when you&#x27;re trying to prepare people for what is going to help them get that job, what is going to help them advance their career.&quot;</p><h2>Why AI tutors could finally solve the personalized learning problem</h2><p><a href="https://codesignal.com/learn-app/mobile">Cosmo</a> differentiates itself through what CodeSignal calls &quot;practice-first learning,&quot; where users immediately engage with realistic workplace scenarios rather than consuming passive video content. The app&#x27;s AI tutor, also named Cosmo, guides learners through conversational exchanges that adapt to individual knowledge levels and learning pace.</p><p>The platform addresses what educational psychologists call &quot;<a href="https://en.wikipedia.org/wiki/Bloom%27s_2_sigma_problem">Bloom&#x27;s two sigma problem</a>&quot; — a 1984 study showing that one-on-one tutoring produces learning outcomes two standard deviations above traditional classroom instruction. For four decades, this remained theoretically interesting but practically impossible to scale.</p><p>&quot;We know one-on-one personalization and tutoring really makes a difference in learning, but it can&#x27;t be done at scale. How do you get a tutor for every human?&quot; Sloyan said. &quot;In 2023, when I saw early versions of generative AI, I thought: this is the moment. This technology, especially if it keeps getting better, can be uniquely used to help humans learn the way learning was meant to happen.&quot;</p><p>The app combines predetermined course content with real-time personalization. 
Each lesson follows a structured curriculum, but learners can interrupt with questions that prompt immediate AI-generated explanations before returning to the main content thread.</p><h2>Generative AI skills training takes center stage as workforce scrambles to adapt</h2><p>Nearly one-third of Cosmo&#x27;s launch content focuses on generative AI applications, reflecting what CodeSignal identifies as the most critical skills gap in today&#x27;s market. The app offers role-specific AI training paths for sales professionals, marketers, engineers, healthcare workers, and other specialties.</p><p>&quot;The biggest emphasis is on generative AI skills, because that&#x27;s the biggest career skills gap right now for both students and working adults,&quot; Sloyan explained. &quot;Everything from how to understand and use GenAI, how to think about its limitations, how to be better at prompting, and how to understand the entire landscape.&quot;</p><p>This focus addresses a broader workforce transformation driven by AI adoption. While some fear job displacement, Sloyan predicts increased demand for skilled workers who can effectively collaborate with AI systems.</p><p>&quot;I don&#x27;t believe we&#x27;re going to reach a point where humans are no longer needed in the workforce. I think it&#x27;s going to be the opposite. We&#x27;re going to need more humans, because what an individual human can do in the age of AI is going to be so much bigger than what we could do before,&quot; he said.</p><h2>Mobile-first learning strategy targets both individual workers and corporate clients</h2><p><a href="https://codesignal.com/">CodeSignal</a> positions <a href="https://codesignal.com/learn-app/mobile">Cosmo</a> as fundamentally a consumer application that also serves enterprise customers — a reflection of how workplace learning actually occurs. 
The company already provides its <a href="https://codesignal.com/solutions/genai-skills-academy/">GenAI Skills Academy</a> to corporate clients, and Cosmo extends this training to mobile devices for on-the-go learning.</p><p>&quot;Even though some of the largest educational companies, like Coursera and Udemy, are making the majority of their income, or at least half, from companies, at the end of the day, education is a consumer business,&quot; Sloyan noted. &quot;Who are you educating? You&#x27;re not educating a company — you&#x27;re educating individuals.&quot;</p><p>The app launches free on iOS with premium subscriptions at $24.99 monthly or $149.99 annually, unlocking unlimited practice sessions and faster progression. <a href="https://play.google.com/store/apps/details?id=com.codesignal.cosmo">Android availability follows on August 28</a>.</p><p>Enterprise customers who already use CodeSignal&#x27;s learning platform will receive <a href="https://codesignal.com/learn-app/mobile">Cosmo</a> access as part of their existing subscriptions, creating what the company describes as a &quot;companion relationship&quot; between desktop-based deep learning and mobile-based habit formation.</p><h2>Cosmo faces crowded EdTech market with unique career-skills focus</h2><p>Cosmo enters a crowded educational technology market but targets a largely underserved niche: comprehensive career skills training optimized for mobile consumption.
While competitors like Codecademy focus on specific technical skills and Duolingo dominates language learning, Cosmo&#x27;s breadth across business and technical disciplines marks a more ambitious scope.</p><p>Early user feedback suggests strong market demand. Beta testers describe the app as &quot;Duolingo for job skills&quot; and praise its convenience for mobile learning. CodeSignal&#x27;s broader learning platform has attracted one million users in less than a year, with usage doubling every two months.</p><p>The app&#x27;s foundation in hiring intelligence provides a competitive advantage over traditional educational publishers. CodeSignal&#x27;s assessment data reveals which skills actually influence hiring decisions, ensuring curriculum relevance in a rapidly evolving job market.</p><h2>Corporate training industry grapples with low engagement and poor ROI</h2><p>Cosmo&#x27;s launch reflects broader shifts in how organizations approach workforce development. Traditional corporate training often suffers from poor engagement and retention, with utilization rates frequently in single digits despite significant investments.</p><p>&quot;The number one problem enterprise learning products have is retention. Organizations buy, deploy, and their utilization is like single digits, and that&#x27;s horrible,&quot; Sloyan said. &quot;The way these products should be measured is how many more skilled humans are there in my organization, and how many more of them are skilled in the skills that I care about.&quot;</p><p>The mobile-first approach acknowledges how working professionals actually consume educational content — in brief sessions during commutes, breaks, or other downtime rather than dedicated desktop learning blocks.</p><h2>Skills revolution accelerates as AI transforms every industry</h2><p>CodeSignal&#x27;s expansion into mobile learning comes as the company continues innovating across the skills assessment and development spectrum. 
Recent product launches include <a href="https://codesignal.com/blog/codesignal-updates/introducing-ai-assisted-coding-assessments-interviews/">AI-Assisted Coding Assessments</a> that evaluate how candidates collaborate with AI tools, and <a href="https://codesignal.com/blog/codesignal-updates/how-agentic-ai-is-transforming-hiring-with-interviewer-agents/">Interviewer Agents</a> that automate technical interviews.</p><p>The company has also expanded its educational partnerships, including a collaboration with <a href="https://aws.amazon.com/">Amazon Web Services</a> to provide free generative AI training to over 30,000 students globally through the AWS Skills to Jobs Tech Alliance.</p><p>Sloyan frames these initiatives within a broader mission to help workers navigate technological disruption. As AI transforms virtually every industry, the ability to quickly acquire new skills becomes increasingly critical for career resilience.</p><p>&quot;We&#x27;ve entered an era of accelerating technological change, which will bring a lot of disruption,&quot; he said. &quot;There&#x27;s a massive skills transformation needed, and right now, I wish more companies were doing more to help individuals find and grow those skills.&quot;</p><p>Cosmo&#x27;s success may determine whether mobile-native, AI-powered learning can finally deliver on the long-promised potential of personalized education at scale. For <a href="https://codesignal.com/">CodeSignal</a>, the launch marks a homecoming of sorts — after spending years teaching companies how to identify skilled workers, the company is now teaching workers how to become skilled in the first place. In an era where artificial intelligence threatens to displace human workers, CodeSignal is betting that the solution lies in using AI to make humans more capable than ever before.</p>]]></description>
            <author>michael.nunez@venturebeat.com (Michael Nuñez)</author>
            <category>Automation</category>
            <category>Enterprise Analytics</category>
            <category>Infrastructure</category>
            <category>Programming &amp; Development</category>
            <category>Security</category>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/4ZIAZikdAl9ENqhRPoXGWK/cde14d5c413f2991d92aa77794aa851d/IMG_2895.jpg?w=300&amp;q=30" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[DeepSeek V3.1 just dropped — and it might be the most powerful open AI yet]]></title>
            <link>https://venturebeat.com/infrastructure/deepseek-v3-1-just-dropped-and-it-might-be-the-most-powerful-open-ai-yet</link>
            <guid isPermaLink="false">wp-3015797</guid>
            <pubDate>Tue, 19 Aug 2025 21:13:15 GMT</pubDate>
            <description><![CDATA[<p>Chinese artificial intelligence startup <a href="https://www.deepseek.com/">DeepSeek</a> made waves across the global AI community Tuesday with the quiet release of its most ambitious model yet — a 685-billion parameter system that challenges the dominance of American AI giants while reshaping the competitive landscape through open-source accessibility.</p><p>The Hangzhou-based company, backed by <a href="https://www.highflyercapital.com/">High-Flyer Capital Management</a>, uploaded <a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base">DeepSeek V3.1</a> to <a href="https://huggingface.co/deepseek-ai">Hugging Face</a> without fanfare, a characteristically understated approach that belies the model&#x27;s potential impact. Within hours, early performance tests revealed benchmark scores that rival proprietary systems from <a href="https://openai.com/">OpenAI</a> and <a href="https://www.anthropic.com/">Anthropic</a>, while the model&#x27;s open-source license ensures global access unconstrained by geopolitical tensions.</p><div></div><p>The release of <a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base">DeepSeek V3.1</a> represents more than just another incremental improvement in AI capabilities. It signals a fundamental shift in how the world&#x27;s most advanced artificial intelligence systems might be developed, distributed, and controlled — with potentially profound implications for the ongoing technological <a href="https://www.foreignaffairs.com/united-states/china-real-artificial-intelligence-race-innovation">competition between the United States and China</a>.</p><p>Within hours of its Hugging Face debut, <a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base">DeepSeek V3.1</a> began climbing <a href="https://huggingface.co/models">popularity rankings</a>, drawing praise from researchers worldwide who downloaded and tested its capabilities. 
The model achieved a 71.6% score on the prestigious <a href="https://aider.chat/docs/leaderboards/">Aider coding benchmark</a>, establishing itself as one of the top-performing models available and directly challenging the dominance of American AI giants.</p><div></div><h2>How DeepSeek V3.1 delivers breakthrough performance</h2><p><a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base">DeepSeek V3.1</a> delivers remarkable engineering achievements that redefine expectations for AI model performance. The system processes up to 128,000 tokens of context — roughly equivalent to a 400-page book — while maintaining response speeds that far outpace slower reasoning-based competitors. The model supports multiple precision formats, from standard <a href="https://en.wikipedia.org/wiki/Bfloat16_floating-point_format">BF16</a> to experimental <a href="https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html">FP8</a>, allowing developers to optimize performance for their specific hardware constraints.</p><p>The real breakthrough lies in what DeepSeek calls its &quot;hybrid architecture.&quot; Unlike previous attempts at combining different AI capabilities, which often resulted in systems that performed poorly at everything, V3.1 seamlessly integrates chat, reasoning, and coding functions into a single, coherent model.</p><p>&quot;Deepseek v3.1 scores 71.6% on aider – non-reasoning SOTA,&quot; tweeted AI researcher <a href="https://x.com/ai_christianson/status/1957871608980467903">Andrew Christianson</a>, adding that it is &quot;1% more than Claude Opus 4 while being 68 times cheaper.&quot; The achievement places DeepSeek in rarefied company, matching performance levels previously reserved for the most expensive proprietary systems.</p><div></div><p>Community analysis revealed sophisticated technical innovations hidden beneath the surface.
Researcher &quot;<a href="https://x.com/nekofneko">Rookie</a>&quot;, who is also a moderator of the subreddits <a href="https://www.reddit.com/r/DeepSeek/">r/DeepSeek</a> &amp; <a href="https://www.reddit.com/r/LocalLLaMA/">r/LocalLLaMA</a>, claims they discovered four new special tokens embedded in the model&#x27;s architecture: search tokens that allow real-time web integration and thinking tokens that enable internal reasoning processes. These additions suggest DeepSeek has solved fundamental challenges that have plagued other hybrid systems.</p><p>The model&#x27;s efficiency proves equally impressive. At roughly $1.01 per complete coding task, DeepSeek V3.1 delivers results comparable to systems costing nearly $70 per equivalent workload. For enterprise users managing thousands of daily AI interactions, such cost differences translate into millions of dollars in potential savings.</p><h2>Strategic timing reveals calculated challenge to American AI dominance</h2><p><a href="https://www.deepseek.com/">DeepSeek</a> timed its release with surgical precision. The <a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base">V3.1 launch</a> comes just weeks after <a href="https://openai.com/index/introducing-gpt-5/">OpenAI unveiled GPT-5</a> and <a href="https://www.anthropic.com/news/claude-4">Anthropic launched Claude 4</a>, both positioned as frontier models representing the cutting edge of artificial intelligence capability. By matching their performance while maintaining open source accessibility, DeepSeek directly challenges the fundamental business models underlying American AI leadership.</p><p>The strategic implications extend far beyond technical specifications.
While American companies maintain strict control over their most advanced systems, requiring expensive API access and imposing usage restrictions, DeepSeek makes comparable capabilities freely available for download, modification, and deployment anywhere in the world.</p><p>This philosophical divide reflects broader differences in how the two superpowers approach technological development. American firms like <a href="https://openai.com/">OpenAI</a> and <a href="https://anthropic.com/">Anthropic</a> view their models as valuable intellectual property requiring protection and monetization. Chinese companies increasingly treat advanced AI as a public good that accelerates innovation through widespread access.</p><p>&quot;DeepSeek quietly removed the R1 tag. Now every entry point defaults to V3.1—128k context, unified responses, consistent style,&quot; observed journalist <a href="https://x.com/poezhao0605/status/1957767389560598846">Poe Zhao</a>. &quot;Looks less like multiple public models, more like a strategic consolidation. A Chinese answer to the fragmentation risk in the LLM race.&quot;</p><div></div><p>The consolidation strategy suggests DeepSeek has learned from earlier mistakes, both its own and those of competitors. Previous hybrid models, including initial versions from Chinese rival <a href="https://chat.qwen.ai/">Qwen</a>, suffered from performance degradation when attempting to combine different capabilities. DeepSeek appears to have cracked that code.</p><h2>How open source strategy disrupts traditional AI economics</h2><p>DeepSeek&#x27;s approach fundamentally challenges assumptions about how frontier AI systems should be developed and distributed. Traditional venture capital-backed approaches require massive investments in computing infrastructure, research talent, and regulatory compliance — costs that must eventually be recouped through premium pricing.</p><p>DeepSeek&#x27;s open source strategy turns this model upside down. 
By making advanced capabilities freely available, the company accelerates adoption while potentially undermining competitors&#x27; ability to maintain high margins on similar capabilities. The approach mirrors earlier disruptions in software, where open source alternatives eventually displaced proprietary solutions across entire industries.</p><p>Enterprise decision makers face both exciting opportunities and complex challenges. Organizations can now download, customize, and deploy frontier-level AI capabilities without ongoing licensing fees or usage restrictions. The model&#x27;s <a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base/tree/main">700GB size</a> requires substantial computational resources, but cloud providers will likely offer hosted versions that eliminate infrastructure barriers.</p><p>&quot;That&#x27;s almost the same score as R1 0528 (71.4% with $4.8), but quicker and cheaper, right?&quot; <a href="https://www.reddit.com/r/LocalLLaMA/comments/1muq72y/deepseek_v31_scores_716_on_aider_nonreasoning_sota/">noted one Reddit user analyzing benchmark results</a>. &quot;R1 0528 quality but instant instead of having to wait minutes for a response.&quot;</p><p>The speed advantage could prove particularly valuable for interactive applications where users expect immediate responses. Previous reasoning models, while capable, often required minutes to process complex queries — making them unsuitable for real-time use cases.</p><div></div><h2>Global developer community embraces Chinese innovation</h2><p>The international response to DeepSeek V3.1 reveals how quickly technical excellence transcends geopolitical boundaries. Developers from around the world began downloading, testing, and praising the model&#x27;s capabilities within hours of release, regardless of its Chinese origins.</p><p>&quot;Open Source AI is at its peak right now... 
just look at the current Hugging Face trending list,&quot; <a href="https://x.com/victormustar/status/1957867251060593058">tweeted Hugging Face head of product Victor Mustar</a>, noting that Chinese models increasingly dominate the platform&#x27;s most popular downloads. The trend suggests that technical merit, rather than national origin, drives adoption decisions among developers.</p><div></div><p>Community analysis proceeded at breakneck pace, with researchers reverse-engineering architectural details and performance characteristics within hours of release. AI developer <a href="https://x.com/teortaxesTex/status/1957877913703051532">Teortaxes</a>, a long-term DeepSeek observer, noted the company&#x27;s apparent strategy: &quot;I&#x27;ve long been saying that they hate maintaining separate model lines and will collapse everything into a single product and artifact as soon as possible. This may be it.&quot;</p><p>The rapid community embrace reflects broader shifts in how AI development occurs. Rather than relying solely on corporate research labs, the field increasingly benefits from distributed innovation across global communities of researchers, developers, and enthusiasts.</p><p>Such collaborative development accelerates innovation while making it more difficult for any single company or country to maintain permanent technological advantages. As Chinese models gain recognition for technical excellence, the traditional dominance of American AI companies faces unprecedented challenges.</p><h2>What DeepSeek&#x27;s success means for the future of AI competition</h2><p>DeepSeek&#x27;s achievement demonstrates that frontier AI capabilities no longer require the massive resources and proprietary approaches that have characterized American AI development. 
Smaller, more focused teams can achieve comparable results through different strategies, fundamentally altering the competitive landscape.</p><p>This democratization of AI development could reshape global technology leadership. Countries and companies previously locked out of frontier AI development due to resource constraints can now access, modify, and build upon cutting-edge capabilities. The shift could accelerate AI adoption worldwide while reducing dependence on American technology platforms.</p><p>American AI companies face an existential challenge. If open source alternatives can match proprietary performance while offering greater flexibility and lower costs, the traditional advantages of closed development disappear. Companies will need to demonstrate substantially superior value to justify premium pricing.</p><p>The competition may ultimately benefit global innovation by forcing all participants to advance capabilities more rapidly. However, it also raises fundamental questions about sustainable business models in an industry where marginal costs approach zero and competitive advantages prove ephemeral.</p><h2>The new paradigm: when artificial intelligence becomes truly artificial</h2><p><a href="https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base">DeepSeek V3.1</a>&#x27;s emergence signals more than technological progress — it represents the moment when artificial intelligence began living up to its name. For too long, the world&#x27;s most advanced AI systems remained artificially scarce, locked behind corporate paywalls and geographic restrictions that had little to do with the technology&#x27;s inherent capabilities.</p><p>DeepSeek&#x27;s demonstration that frontier performance can coexist with open access reveals that the artificial barriers that once defined AI competition are crumbling. 
The democratization isn&#x27;t just about making powerful tools available — it&#x27;s about exposing that the scarcity was always manufactured, not inevitable.</p><p>The irony is unmistakable: in the race to build artificial intelligence, DeepSeek has made the entire industry&#x27;s gatekeeping look artificial instead. As <a href="https://x.com/teortaxesTex/status/1957818879205351851">one community observer noted</a> about the company&#x27;s roadmap, even more dramatic breakthroughs may be forthcoming. If V3.1 represents merely a stepping stone to V4, the current disruption may pale in comparison to what lies ahead.</p><p>The global AI race has fundamentally changed. What began as a competition over who could build the most powerful systems has evolved into a contest over who can make those systems most accessible. In that race, artificial scarcity may prove to be the biggest artificial intelligence of all.</p>]]></description>
            <author>michael.nunez@venturebeat.com (Michael Nuñez)</author>
            <category>Automation</category>
            <category>Enterprise Analytics</category>
            <category>Infrastructure</category>
            <category>Programming &amp; Development</category>
            <category>Security</category>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/4AVxQTtkUfqeSVtf69dah4/ae6125d2575bd0c7b9fa8ebdb16e0bb4/nuneybits_Abstract_art_of_a_whale_made_of_computer_code_against_9cb215d3-3291-4e3e-ba72-80bdeaedf664.webp?w=300&amp;q=30" length="0" type="image/webp"/>
        </item>
        <item>
            <title><![CDATA[TensorZero nabs $7.3M seed to solve the messy world of enterprise LLM development]]></title>
            <link>https://venturebeat.com/infrastructure/tensorzero-nabs-7-3m-seed-to-solve-the-messy-world-of-enterprise-llm-development</link>
            <guid isPermaLink="false">wp-3015715</guid>
            <pubDate>Mon, 18 Aug 2025 19:13:32 GMT</pubDate>
            <description><![CDATA[<p><a href="https://www.tensorzero.com/">TensorZero</a>, a startup building open-source infrastructure for large language model applications, announced Monday it has raised $7.3 million in seed funding led by <a href="https://firstmark.com/">FirstMark</a>, with participation from <a href="https://www.bvp.com/">Bessemer Venture Partners</a>, <a href="https://bedrockcap.com/">Bedrock</a>, <a href="https://www.drw.com/">DRW</a>, <a href="https://www.coalition.vc/">Coalition</a>, and dozens of strategic angel investors.</p><p>The funding comes as the 18-month-old company experiences explosive growth in the developer community. TensorZero&#x27;s <a href="https://github.com/tensorzero/tensorzero">open-source repository</a> recently achieved the &quot;<a href="https://github.com/trending">#1 trending repository of the week</a>&quot; spot globally on GitHub, jumping from roughly 3,000 to over 9,700 stars in recent months as enterprises grapple with the complexity of building production-ready AI applications.</p><p>&quot;Despite all the noise in the industry, companies building LLM applications still lack the right tools to meet complex cognitive and infrastructure needs, and resort to stitching together whatever early solutions are available on the market,&quot; said Matt Turck, General Partner at FirstMark, who led the investment. &quot;TensorZero provides production-grade, enterprise-ready components for building LLM applications that natively work together in a self-reinforcing loop, out of the box.&quot;</p><p>The Brooklyn-based company addresses a growing pain point for enterprises deploying AI applications at scale. 
While large language models like <a href="https://openai.com/index/introducing-gpt-5/">GPT-5</a> and <a href="https://claude.ai">Claude</a> have demonstrated remarkable capabilities, translating these into reliable business applications requires orchestrating multiple complex systems for model access, monitoring, optimization, and experimentation.</p><h2>How nuclear fusion research shaped a breakthrough AI optimization platform</h2><p>TensorZero&#x27;s approach stems from co-founder and CTO Viraj Mehta&#x27;s unconventional background in reinforcement learning for nuclear fusion reactors. During his PhD at <a href="https://www.cmu.edu/">Carnegie Mellon</a>, Mehta worked on Department of Energy research projects where data collection cost &quot;like a car per data point — $30,000 for 5 seconds of data,&quot; he explained in a recent interview with VentureBeat.</p><p>&quot;That problem leads to a huge amount of concern about where to focus our limited resources,&quot; Mehta said. &quot;We were going to only get to run a handful of trials total, so the question became: what is the marginally most valuable place we can collect data from?&quot; This experience shaped TensorZero&#x27;s core philosophy: maximizing the value of every data point to continuously improve AI systems.</p><p>The insight led Mehta and co-founder Gabriel Bianconi, former chief product officer at <a href="https://ondo.finance/">Ondo Finance</a> (a decentralized finance project with over $1 billion in assets under management), to reconceptualize LLM applications as reinforcement learning problems where systems learn from real-world feedback.</p><p>&quot;LLM applications in their broader context feel like reinforcement learning problems,&quot; Mehta explained. &quot;You make many calls to a machine learning model with structured inputs, get structured outputs, and eventually receive some form of reward or feedback. 
This looks to me like a partially observable Markov decision process.&quot;</p><h2>Why enterprises are ditching complex vendor integrations for unified AI infrastructure</h2><p>Traditional approaches to building LLM applications require companies to integrate numerous specialized tools from different vendors — model gateways, observability platforms, evaluation frameworks, and fine-tuning services. <a href="https://www.tensorzero.com/">TensorZero</a> unifies these capabilities into a single open-source stack designed to work together seamlessly.</p><p>&quot;Most companies didn&#x27;t go through the hassle of integrating all these different tools, and even the ones that did ended up with fragmented solutions, because those tools weren&#x27;t designed to work well with each other,&quot; Bianconi said. &quot;So we realized there was an opportunity to build a product that enables this feedback loop in production.&quot;</p><p>The platform&#x27;s core innovation is creating what the founders call a &quot;data and learning flywheel&quot; — a feedback loop that turns production metrics and human feedback into smarter, faster, and cheaper models. Built in Rust for performance, TensorZero achieves sub-millisecond latency overhead while supporting all major LLM providers through a unified API.</p><h2>Major banks and AI startups are already building production systems on TensorZero</h2><p>The approach has already attracted significant enterprise adoption. One of Europe&#x27;s largest banks is using TensorZero to automate code changelog generation, while numerous AI-first startups from Series A to Series B stage have integrated the platform across diverse industries including healthcare, finance, and consumer applications.</p><p>&quot;The surge in adoption from both the open-source community and enterprises has been incredible,&quot; Bianconi said. 
&quot;We&#x27;re fortunate to have received contributions from dozens of developers worldwide, and it&#x27;s exciting to see TensorZero already powering cutting-edge LLM applications at frontier AI startups and large organizations.&quot;</p><p>The company&#x27;s customer base spans organizations from startups to major financial institutions, drawn by both the technical capabilities and the open-source nature of the platform. For enterprises with strict compliance requirements, the ability to run TensorZero within their own infrastructure provides crucial control over sensitive data.</p><h2>How TensorZero outperforms LangChain and other AI frameworks at enterprise scale</h2><p><a href="https://www.tensorzero.com/">TensorZero</a> differentiates itself from existing solutions like <a href="https://www.langchain.com/">LangChain</a> and <a href="https://www.litellm.ai/">LiteLLM</a> through its end-to-end approach and focus on production-grade deployments. While many frameworks excel at rapid prototyping, they often hit scalability ceilings that force companies to rebuild their infrastructure.</p><p>&quot;There are two dimensions to think about,&quot; Bianconi explained. &quot;First, there are a number of projects out there that are very good to get started quickly, and you can put a prototype out there very quickly. But often companies will hit a ceiling with many of those products and need to churn and go for something else.&quot;</p><p>The platform&#x27;s structured approach to data collection also enables more sophisticated optimization techniques. Unlike traditional observability tools that store raw text inputs and outputs, TensorZero maintains structured data about the variables that go into each inference, making it easier to retrain models and experiment with different approaches.</p><h2>Rust-powered performance delivers sub-millisecond latency at 10,000+ queries per second</h2><p>Performance has been a key design consideration. 
In benchmarks, TensorZero&#x27;s Rust-based gateway adds less than 1 millisecond of latency at 99th percentile while handling over 10,000 queries per second. This compares favorably to Python-based alternatives like LiteLLM, which can add 25-100x more latency at much lower throughput levels.</p><p>&quot;LiteLLM (Python) at 100 QPS adds 25-100x+ more P99 latency than our gateway at 10,000 QPS,&quot; the founders noted in their announcement, highlighting the performance advantages of their Rust implementation.</p><h2>The open-source strategy designed to eliminate AI vendor lock-in fears</h2><p><a href="https://www.tensorzero.com/">TensorZero</a> has committed to keeping its core platform entirely open source, with no paid features — a strategy designed to build trust with enterprise customers wary of vendor lock-in. The company plans to monetize through a managed service that automates the more complex aspects of LLM optimization, such as GPU management for custom model training and proactive optimization recommendations.</p><p>&quot;We realized very early on that we needed to make this open source, to give [enterprises] the confidence to do this,&quot; Bianconi said. &quot;In the future, at least a year from now realistically, we&#x27;ll come back with a complementary managed service.&quot;</p><p>The managed service will focus on automating the computationally intensive aspects of LLM optimization while maintaining the open-source core. This includes handling GPU infrastructure for fine-tuning, running automated experiments, and providing proactive suggestions for improving model performance.</p><h2>What&#x27;s next for the company reshaping enterprise AI infrastructure</h2><p>The announcement positions <a href="https://www.tensorzero.com/">TensorZero</a> at the forefront of a growing movement to solve the &quot;LLMOps&quot; challenge — the operational complexity of running AI applications in production. 
As enterprises increasingly view AI as critical business infrastructure rather than experimental technology, the demand for production-ready tooling continues to accelerate.</p><p>With the new funding, TensorZero plans to accelerate development of its open-source infrastructure while building out its team. The company is currently hiring in New York and welcomes open-source contributions from the developer community. The founders are particularly excited about developing research tools that will enable faster experimentation across different AI applications.</p><p>&quot;Our ultimate vision is to enable a data and learning flywheel for optimizing LLM applications—a feedback loop that turns production metrics and human feedback into smarter, faster, and cheaper models and agents,&quot; Mehta said. &quot;As AI models grow smarter and take on more complex workflows, you can&#x27;t reason about them in a vacuum; you have to do so in the context of their real-world consequences.&quot;</p><p>TensorZero&#x27;s <a href="https://github.com/tensorzero/tensorzero">rapid GitHub growth</a> and early enterprise traction suggest strong product-market fit in addressing one of the most pressing challenges in modern AI development. The company&#x27;s open-source approach and focus on enterprise-grade performance could prove decisive advantages in a market where developer adoption often precedes enterprise sales.</p><p>For enterprises still struggling to move AI applications from prototype to production, TensorZero&#x27;s unified approach offers a compelling alternative to the current patchwork of specialized tools. As one industry observer noted, the difference between building AI demos and building AI businesses often comes down to infrastructure — and TensorZero is betting that unified, performance-oriented infrastructure will be the foundation upon which the next generation of AI companies is built.</p>]]></description>
            <author>michael.nunez@venturebeat.com (Michael Nuñez)</author>
            <category>Automation</category>
            <category>Enterprise Analytics</category>
            <category>Infrastructure</category>
            <category>Programming &amp; Development</category>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/3S8dSNcyuAv9HZltuDj7yk/73693fe6ce6ac71782f6f21f411e7b68/nuneybits_Vector_art_of_startup_rocket_launching_from_GitHub_b50293fc-4a98-45b1-9aa3-e4f968daab65_adbce1.webp?w=300&amp;q=30" length="0" type="image/webp"/>
        </item>
        <item>
            <title><![CDATA[That 'cheap' open-source AI model is actually burning through your compute budget]]></title>
            <link>https://venturebeat.com/technology/that-cheap-open-source-ai-model-is-actually-burning-through-your-compute-budget</link>
            <guid isPermaLink="false">wp-3015605</guid>
            <pubDate>Fri, 15 Aug 2025 01:24:49 GMT</pubDate>
            <description><![CDATA[<p>A comprehensive <a href="https://nousresearch.com/measuring-thinking-efficiency-in-reasoning-models-the-missing-benchmark/">new study</a> has revealed that open-source artificial intelligence models consume significantly more computing resources than their closed-source competitors when performing identical tasks, potentially undermining their cost advantages and reshaping how enterprises evaluate AI deployment strategies.</p><p>The research, conducted by AI firm <a href="https://nousresearch.com/">Nous Research</a>, found that open-weight models use between 1.5 and 4 times more tokens — the basic units of AI computation — than closed models like those from <a href="https://openai.com/">OpenAI</a> and <a href="https://anthropic.com/">Anthropic</a>. For simple knowledge questions, the gap widened dramatically, with some open models using up to 10 times more tokens.</p><p>&quot;Open weight models use 1.5–4× more tokens than closed ones (up to 10× for simple knowledge questions), making them sometimes more expensive per query despite lower per‑token costs,&quot; the researchers wrote in their report published Wednesday.</p><p>The findings challenge a prevailing assumption in the AI industry that open-source models offer clear economic advantages over proprietary alternatives. While open-source models typically cost less per token to run, the study suggests this advantage can be &quot;easily offset if they require more tokens to reason about a given problem.&quot;</p><h2>The real cost of AI: Why &#x27;cheaper&#x27; models may break your budget</h2><p>The research examined <a href="https://nousresearch.com/measuring-thinking-efficiency-in-reasoning-models-the-missing-benchmark/">19 different AI models</a> across three categories of tasks: basic knowledge questions, mathematical problems, and logic puzzles. 
The team measured &quot;token efficiency&quot; — how many computational units models use relative to the complexity of their solutions—a metric that has received little systematic study despite its significant cost implications.</p><p>&quot;Token efficiency is a critical metric for several practical reasons,&quot; the researchers noted. &quot;While hosting open weight models may be cheaper, this cost advantage could be easily offset if they require more tokens to reason about a given problem.&quot;</p><p><sub>Open-source AI models use up to 12 times more computational resources than the most efficient closed models for basic knowledge questions. (Credit: Nous Research)</sub></p><p>The inefficiency is particularly pronounced for Large Reasoning Models (LRMs), which use extended &quot;<a href="https://www.ibm.com/think/topics/chain-of-thoughts">chains of thought</a>&quot; to solve complex problems. These models, designed to think through problems step-by-step, can consume thousands of tokens pondering simple questions that should require minimal computation.</p><p>For basic knowledge questions like &quot;What is the capital of Australia?&quot; the study found that reasoning models spend &quot;hundreds of tokens pondering simple knowledge questions&quot; that could be answered in a single word.</p><h2>Which AI models actually deliver bang for your buck</h2><p>The research revealed stark differences between model providers. OpenAI&#x27;s models, particularly its <a href="https://openai.com/index/introducing-o3-and-o4-mini/">o4-mini</a> and newly released open-source <a href="https://openai.com/index/introducing-gpt-oss/">gpt-oss</a> variants, demonstrated exceptional token efficiency, especially for mathematical problems. 
The study found OpenAI models &quot;stand out for extreme token efficiency in math problems,&quot; using up to three times fewer tokens than other commercial models.</p><p>Among open-source options, Nvidia&#x27;s <a href="https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1">llama-3.3-nemotron-super-49b-v1</a> emerged as &quot;the most token efficient open weight model across all domains,&quot; while newer models from companies like Mistral showed &quot;exceptionally high token usage&quot; as outliers.</p><p>The efficiency gap varied significantly by task type. While open models used roughly twice as many tokens for mathematical and logic problems, the difference ballooned for simple knowledge questions where efficient reasoning should be unnecessary.</p><p><sub>OpenAI&#x27;s latest models achieve the lowest costs for simple questions, while some open-source alternatives can cost significantly more despite lower per-token pricing. (Credit: Nous Research)</sub></p><h2>What enterprise leaders need to know about AI computing costs</h2><p>The findings have immediate implications for enterprise AI adoption, where computing costs can scale rapidly with usage. Companies evaluating AI models often focus on accuracy benchmarks and per-token pricing, but may overlook the total computational requirements for real-world tasks.</p><p>&quot;The better token efficiency of closed weight models often compensates for the higher API pricing of those models,&quot; the researchers found when analyzing total inference costs.</p><p>The study also revealed that closed-source model providers appear to be actively optimizing for efficiency. 
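The researchers&#x27; point that token efficiency can offset per-token pricing reduces to simple arithmetic. A minimal sketch follows; all prices and token counts are hypothetical illustrations, not figures from the study:

```python
# Hypothetical per-query cost comparison: a lower per-token price can still
# yield a higher total cost if the model burns more tokens per query.
# All prices and token counts here are illustrative, not from the study.

def cost_per_query(tokens_used: int, price_per_million_tokens: float) -> float:
    """Total cost of one query: tokens consumed times the per-token rate."""
    return tokens_used * price_per_million_tokens / 1_000_000

# Closed model: pricier per token, but token-efficient.
closed = cost_per_query(tokens_used=500, price_per_million_tokens=10.00)

# Open model: one-fifth the per-token price, but 4x the tokens
# (the study reports open-weight models using 1.5-4x more tokens).
open_weight = cost_per_query(tokens_used=2_000, price_per_million_tokens=2.00)

print(f"closed: ${closed:.4f} per query, open: ${open_weight:.4f} per query")
# At 4x the tokens the open model is still slightly cheaper here ($0.0040
# vs. $0.0050), but at the 10x usage the study observed on simple knowledge
# questions (5,000 tokens), its cost rises to $0.0100 and the advantage inverts.
```

The break-even point is just the price ratio: an open model priced at one-fifth the per-token rate stays cheaper only while it uses fewer than five times as many tokens per query.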
&quot;Closed weight models have been iteratively optimized to use fewer tokens to reduce inference cost,&quot; while open-source models have &quot;increased their token usage for newer versions, possibly reflecting a priority toward better reasoning performance.&quot;</p><p><sub>The computational overhead varies dramatically between AI providers, with some models using over 1,000 tokens for internal reasoning on simple tasks. (Credit: Nous Research)</sub></p><h2>How researchers cracked the code on AI efficiency measurement</h2><p>The research team faced unique challenges in measuring efficiency across different model architectures. Many closed-source models don&#x27;t reveal their raw reasoning processes, instead providing compressed summaries of their internal computations to prevent competitors from copying their techniques.</p><p>To address this, researchers used completion tokens — the total computational units billed for each query — as a proxy for reasoning effort. They discovered that &quot;most recent closed source models will not share their raw reasoning traces&quot; and instead &quot;use smaller language models to transcribe the chain of thought into summaries or compressed representations.&quot;</p><p>The study&#x27;s methodology included testing with modified versions of well-known problems to minimize the influence of memorized solutions, such as altering variables in mathematical competition problems from the <a href="https://maa.org/maa-invitational-competitions/">American Invitational Mathematics Examination (AIME)</a>.</p><p><sub>Different AI models show varying relationships between computation and output, with some providers compressing reasoning traces while others provide full details. (Credit: Nous Research)</sub></p><h2>The future of AI efficiency: What&#x27;s coming next</h2><p>The researchers suggest that token efficiency should become a primary optimization target alongside accuracy for future model development. 
&quot;A more densified CoT will also allow for more efficient context usage and may counter context degradation during challenging reasoning tasks,&quot; <a href="https://nousresearch.com/measuring-thinking-efficiency-in-reasoning-models-the-missing-benchmark/">they wrote</a>.</p><p>The release of OpenAI&#x27;s open-source <a href="https://openai.com/index/introducing-gpt-oss/">gpt-oss models</a>, which demonstrate state-of-the-art efficiency with &quot;freely accessible CoT,&quot; could serve as a reference point for optimizing other open-source models.</p><p>The complete research dataset and evaluation code are <a href="https://github.com/cpldcpu/LRMTokenEconomy/">available on GitHub</a>, allowing other researchers to validate and extend the findings. As the AI industry races toward more powerful reasoning capabilities, this study suggests that the real competition may not be about who can build the smartest AI — but who can build the most efficient one.</p><p>After all, in a world where every token counts, the most wasteful models may find themselves priced out of the market, regardless of how well they can think.</p>]]></description>
            <author>michael.nunez@venturebeat.com (Michael Nuñez)</author>
            <category>Automation</category>
            <category>Enterprise Analytics</category>
            <category>Programming &amp; Development</category>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/1NyD7Ag5riXxnNDjPz5aKT/b975bdfeb6fe2adf5053f8b087b9dfbd/nuneybits_Vector_art_of_dollar_bills_burning_blue_halftone_phot_64f22ade-4971-4234-822b-ba7dab7461de.webp?w=300&amp;q=30" length="0" type="image/webp"/>
        </item>
    </channel>
</rss>