<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
    <channel>
        <title>Dev | VentureBeat</title>
        <link>https://venturebeat.com/category/dev/feed/</link>
        <description>Transformative tech coverage that matters</description>
        <lastBuildDate>Sat, 04 Apr 2026 14:59:14 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <copyright>Copyright 2026, VentureBeat</copyright>
        <item>
            <title><![CDATA[OpenAI debuts GPT‑5.1-Codex-Max coding model and it already completed a 24-hour task internally]]></title>
            <link>https://venturebeat.com/technology/openai-debuts-gpt-5-1-codex-max-coding-model-and-it-already-completed-a-24</link>
            <guid isPermaLink="false">qmhAiqqbDOrYa2qQWGaMm</guid>
            <pubDate>Wed, 19 Nov 2025 19:26:00 GMT</pubDate>
            <description><![CDATA[<p>OpenAI has <a href="https://openai.com/index/gpt-5-1-codex-max/"><b>introduced GPT‑5.1-Codex-Max</b></a>, a new frontier agentic coding model now available in its Codex developer environment. The release marks a significant step forward in AI-assisted software engineering, offering improved long-horizon reasoning, efficiency, and real-time interactive capabilities. GPT‑5.1-Codex-Max will now replace GPT‑5.1-Codex as the default model across Codex-integrated surfaces.</p><p>The new model is designed to serve as a persistent, high-context software development agent, capable of managing complex refactors, debugging workflows, and project-scale tasks across multiple context windows.</p><p>It comes on the heels of <a href="https://venturebeat.com/ai/google-unveils-gemini-3-claiming-the-lead-in-math-science-multimodal-and">Google releasing its powerful new Gemini 3 Pro model</a> yesterday, yet GPT‑5.1-Codex-Max still outperforms or matches it on key coding benchmarks: </p><p>On <b>SWE-Bench Verified</b>, <b>GPT‑5.1-Codex-Max achieved 77.9% accuracy</b> at extra-high reasoning effort, edging past Gemini 3 Pro’s 76.2%. </p><p>It also led on <b>Terminal-Bench 2.0, with 58.1% accuracy versus Gemini’s 54.2%, </b>and matched Gemini’s score of 2,439 on LiveCodeBench Pro, a competitive coding Elo benchmark.</p><p>When measured against Gemini 3 Pro’s most advanced configuration — its Deep Thinking model — Codex-Max holds a slight edge in agentic coding benchmarks, as well. </p><h3><b>Performance Benchmarks: Incremental Gains Across Key Tasks</b></h3><p>GPT‑5.1-Codex-Max demonstrates measurable improvements over GPT‑5.1-Codex across a range of standard software engineering benchmarks. </p><p>On SWE-Lancer IC SWE, it achieved 79.9% accuracy, a significant increase from GPT‑5.1-Codex’s 66.3%. 
In SWE-Bench Verified (n=500), it reached 77.9% accuracy at extra-high reasoning effort, outperforming GPT‑5.1-Codex’s 73.7%.</p><p>Performance on Terminal Bench 2.0 (n=89) showed more modest improvements, with GPT‑5.1-Codex-Max achieving 58.1% accuracy compared to 52.8% for GPT‑5.1-Codex. </p><p>All evaluations were run with compaction and extra-high reasoning effort enabled.</p><p>These results indicate that the new model offers a higher ceiling on both benchmarked correctness and real-world usability under extended reasoning loads.</p><h3><b>Technical Architecture: Long-Horizon Reasoning via Compaction</b></h3><p>A major architectural improvement in GPT‑5.1-Codex-Max is its ability to reason effectively over extended input-output sessions using a mechanism called <b>compaction</b>. </p><p>This enables the model to retain key contextual information while discarding irrelevant details as it nears its context window limit — effectively allowing for continuous work across millions of tokens without performance degradation.</p><p>The model has been internally observed to complete tasks lasting more than 24 hours, including multi-step refactors, test-driven iteration, and autonomous debugging.</p><p>Compaction also improves token efficiency. At medium reasoning effort, GPT‑5.1-Codex-Max used approximately 30% fewer thinking tokens than GPT‑5.1-Codex for comparable or better accuracy, which has implications for both cost and latency.</p><h3><b>Platform Integration and Use Cases</b></h3><p>GPT‑5.1-Codex-Max is currently available across multiple Codex-based environments, which refer to OpenAI’s own integrated tools and interfaces built specifically for code-focused AI agents. 
These include:</p><ul><li><p><b>Codex CLI</b>, OpenAI’s official command-line tool (@openai/codex), where GPT‑5.1-Codex-Max is already live.</p></li><li><p><b>IDE extensions</b>, likely developed or maintained by OpenAI, though no specific third-party IDE integrations were named.</p></li><li><p><b>Interactive coding environments</b>, such as those used to demonstrate frontend simulation apps like CartPole or Snell’s Law Explorer.</p></li><li><p><b>Internal code review tooling</b>, used by OpenAI’s engineering teams.</p></li></ul><p>For now, GPT‑5.1-Codex-Max is not yet available via public API, though OpenAI states this is coming soon. Users who wish to work with the model in terminal environments today can do so by installing and using the Codex CLI.</p><p>It is not currently confirmed whether or how the model will integrate into third-party IDEs unless they are built on top of the CLI or future API.</p><p>The model is capable of interacting with live tools and simulations. Examples shown in the release include:</p><ul><li><p>An interactive CartPole policy gradient simulator, which visualizes reinforcement learning training and activations.</p></li><li><p>A Snell’s Law optics explorer, supporting dynamic ray tracing across refractive indices.</p></li></ul><p>These interfaces exemplify the model’s ability to reason in real time while maintaining an interactive development session — effectively bridging computation, visualization, and implementation within a single loop.</p><h3><b>Cybersecurity and Safety Constraints</b></h3><p>While GPT‑5.1-Codex-Max does not meet OpenAI’s “High” capability threshold for cybersecurity under its Preparedness Framework, it is currently the most capable cybersecurity model OpenAI has deployed. 
It supports use cases such as automated vulnerability detection and remediation, but with strict sandboxing and disabled network access by default.</p><p>OpenAI reports no increase in scaled malicious use but has introduced enhanced monitoring systems, including activity routing and disruption mechanisms for suspicious behavior. Codex remains isolated to a local workspace unless developers opt in to broader access, mitigating risks like prompt injection from untrusted content.</p><h3><b>Deployment Context and Developer Usage</b></h3><p>GPT‑5.1-Codex-Max is currently available to users on <b>ChatGPT Plus, Pro, Business, Edu, and Enterprise</b> plans. It will also become the new default in Codex-based environments, replacing GPT‑5.1-Codex, which was a more general-purpose model.</p><p>OpenAI states that 95% of its internal engineers use Codex weekly, and since adoption, these engineers have shipped ~70% more pull requests on average — highlighting the tool’s impact on internal development velocity.</p><p>Despite its autonomy and persistence, OpenAI stresses that Codex-Max should be treated as a coding assistant, not a replacement for human review. The model produces terminal logs, test citations, and tool call outputs to support transparency in generated code.</p><h3><b>Outlook</b></h3><p>GPT‑5.1-Codex-Max represents a significant evolution in OpenAI’s strategy toward agentic development tools, offering greater reasoning depth, token efficiency, and interactive capabilities across software engineering tasks. 
By extending its context management and compaction strategies, the model is positioned to handle tasks at the scale of full repositories, rather than individual files or snippets.</p><p>With continued emphasis on agentic workflows, secure sandboxes, and real-world evaluation metrics, Codex-Max sets the stage for the next generation of AI-assisted programming environments — while underscoring the importance of oversight in increasingly autonomous systems.</p>]]></description>
            <author>carl.franzen@venturebeat.com (Carl Franzen)</author>
            <category>Dev</category>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/1EELcUCWai7BHVe2Iti1Ht/1fef9a3dd13eeed757dfda0bdccc4403/lsRydt-rLj3bqzrXKMGoN_b9a12412710e4c468f59b1b66cb31312__1_.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[The most important OpenAI announcement you probably missed at DevDay 2025]]></title>
            <link>https://venturebeat.com/infrastructure/the-most-important-openai-announcement-you-probably-missed-at-devday-2025</link>
            <guid isPermaLink="false">5vuObQXLYhA0fkYU6uaFO9</guid>
            <pubDate>Thu, 09 Oct 2025 07:00:00 GMT</pubDate>
            <description><![CDATA[<p>OpenAI’s annual developer conference on Monday was a spectacle of ambitious AI product launches, from an <a href="https://openai.com/index/introducing-apps-in-chatgpt/"><u>app store for ChatGPT</u></a> to a stunning <a href="https://openai.com/index/sora-2/"><u>video-generation API</u></a> that brought creative concepts to life. But for the enterprises and technical leaders watching closely, the most consequential announcement was the quiet <a href="https://openai.com/index/codex-now-generally-available/"><u>general availability of Codex</u></a>, the company&#x27;s AI software engineer. This release signals a profound shift in how software—and by extension, modern business—is built.</p><p>While other announcements captured the public’s imagination, the production-ready release of <a href="https://openai.com/codex/"><u>Codex</u></a>, supercharged by a <a href="https://openai.com/index/introducing-upgrades-to-codex/"><u>new specialized model</u></a> and a <a href="https://developers.openai.com/codex/sdk"><u>suite of enterprise-grade tools</u></a>, is the engine behind OpenAI’s entire vision. 
It is the tool that builds the tools, the proven agent in a world buzzing with agentic potential, and the clearest articulation of the company&#x27;s strategy to win the enterprise.</p><p>The <a href="https://openai.com/index/codex-now-generally-available/"><u>general availability of Codex</u></a> moves it from a &quot;research preview&quot; to a fully supported product, complete with a new <a href="https://developers.openai.com/codex/sdk"><u>software development kit (SDK)</u></a>, a <a href="https://developers.openai.com/codex/integrations/slack"><u>Slack integration</u></a>, and administrative controls for security and monitoring. This transition declares that Codex is ready for mission-critical work inside the world’s largest companies.</p><p>&quot;We think this is the best time in history to be a builder; it has never been faster to go from idea to product,&quot; said OpenAI CEO Sam Altman during the <a href="https://venturebeat.com/ai/openai-dev-day-2025-chatgpt-becomes-the-new-app-store-and-hardware-is-coming"><u>opening keynote</u></a> presentation. &quot;Software used to take months or years to build. You saw that it can take minutes now to build with AI.&quot; </p><p>That acceleration is not theoretical. 
It&#x27;s a reality born from OpenAI’s own internal use — a massive &quot;dogfooding&quot; effort that serves as the ultimate case study for enterprise customers.</p><h2>Inside GPT-5-Codex: The AI model that codes autonomously for hours and drives 70% productivity gains</h2><p>At the heart of the Codex upgrade is <a href="https://chatgpt.com/features/codex"><u>GPT-5-Codex</u></a>, a version of OpenAI&#x27;s latest flagship model that has been &quot;purposely trained for Codex and agentic coding.&quot; The new model is designed to function as an autonomous teammate, moving far beyond simple code autocompletion.</p><p>&quot;I personally like to think about it as a little bit like a human teammate,&quot; explained Tibo Sottiaux, an OpenAI engineer, during a technical session on Codex. &quot;You can pair program with it on your computer, you can delegate to it, or as you&#x27;ll see, you can give it a job without explicit prompting.&quot;</p><p>This new model enables &quot;<a href="https://openai.com/index/introducing-upgrades-to-codex/"><u>adaptive thinking</u></a>,&quot; allowing it to dynamically adjust the time and computational effort spent on a task based on its complexity. For simple requests, it&#x27;s fast and efficient, but for complex refactoring projects, it can work for hours.</p><p>One engineer during the technical session noted, &quot;I&#x27;ve seen the GPT-5-Codex model work for over seven hours productively... 
on a marathon session.&quot; This capability to handle long-running, complex tasks is a significant leap beyond the simple, single-shot interactions that define most AI coding assistants.</p><p>The results inside OpenAI have been dramatic. The company reported that 92% of its technical staff now uses Codex daily, and those engineers complete 70% more pull requests (a measure of code contribution) each week. Usage has surged tenfold since August. </p><p>&quot;When we as a team see the stats, it feels great,&quot; Sottiaux shared. &quot;But even better is being at lunch with someone who then goes &#x27;Hey I use Codex all the time. Here&#x27;s a cool thing that I do with it. Do you want to hear about it?&#x27;&quot; </p><h2>How OpenAI uses Codex to build its own AI products and catch hundreds of bugs daily</h2><p>Perhaps the most compelling argument for Codex’s importance is that it is the foundational layer upon which OpenAI’s other flashy announcements were built. During the <a href="https://devday.openai.com/"><u>DevDay event</u></a>, the company showcased custom-built arcade games and a dynamic, AI-powered website for the conference itself, all developed using <a href="https://openai.com/codex/"><u>Codex</u></a>.</p><p>In one session, engineers demonstrated how they built &quot;Storyboard,&quot; a custom creative tool for the film industry, in just 48 hours during an internal hackathon. &quot;We decided to test Codex, our coding agent... we would send tasks to Codex in between meetings. We really easily reviewed and merged PRs into production, which Codex even allowed us to do from our phones,&quot; said Allison August, a solutions engineering leader at OpenAI. </p><p>This reveals a critical insight: the rapid innovation showcased at DevDay is a direct result of the productivity flywheel created by <a href="https://openai.com/codex/"><u>Codex</u></a>. 
The AI is a core part of the manufacturing process for all other AI products.</p><p>A key enterprise-focused feature is the new, more robust code review capability. OpenAI said it &quot;purposely trained GPT-5-Codex to be great at ultra thorough code review,&quot; enabling it to explore dependencies and validate a programmer&#x27;s intent against the actual implementation to find high-quality bugs. Internally, nearly every pull request at OpenAI is now reviewed by Codex, catching hundreds of issues daily before they reach a human reviewer.</p><p>&quot;It saves you time, you ship with more confidence,&quot; Sottiaux said. &quot;There&#x27;s nothing worse than finding a bug after we actually ship the feature.&quot; </p><h2>Why enterprise software teams are choosing Codex over GitHub Copilot for mission-critical development</h2><p>The maturation of <a href="https://openai.com/codex/"><u>Codex</u></a> is central to OpenAI’s broader strategy to conquer the enterprise market, a move essential to justifying its massive valuation and unprecedented compute expenditures. During a press conference, CEO Sam Altman confirmed the strategic shift.</p><p>&quot;The models are there now, and you should expect a huge focus from us on really winning enterprises with amazing products, starting here,&quot; Altman said during a private press conference. </p><p>OpenAI President and Co-founder Greg Brockman immediately added, &quot;And you can see it already with Codex, which I think has been just an incredible success and has really grown super fast.&quot; </p><p>For technical decision-makers, the message is clear. While consumer-facing agents that book dinner reservations are still finding their footing, <a href="https://openai.com/codex/"><u>Codex</u></a> is a proven enterprise agent delivering substantial ROI today. 
Companies like Cisco have already rolled out Codex to their engineering organizations, cutting code review times by 50% and reducing project timelines from weeks to days.</p><p>With the new <a href="https://developers.openai.com/codex/sdk/"><u>Codex SDK</u></a>, companies can now embed this agentic power directly into their own custom workflows, such as automating fixes in a CI/CD pipeline or even creating self-evolving applications. During a live demo, an engineer showcased a mobile app that updated its own user interface in real-time based on a natural language prompt, all powered by the embedded Codex SDK. </p><p>While the launch of an <a href="https://openai.com/index/introducing-apps-in-chatgpt/"><u>app ecosystem in ChatGPT</u></a> and the breathtaking visuals of the <a href="https://openai.com/index/sora-2/"><u>Sora 2 API</u></a> rightfully generated headlines, the <a href="https://openai.com/index/codex-now-generally-available/"><u>general availability of Codex</u></a> marks a more fundamental and immediate transformation. It is the quiet but powerful engine driving the next era of software development, turning the abstract promise of AI-driven productivity into a tangible, deployable reality for businesses today.</p>]]></description>
            <author>michael.nunez@venturebeat.com (Michael Nuñez)</author>
            <category>Automation</category>
            <category>Dev</category>
            <category>Enterprise</category>
            <category>Infrastructure</category>
            <category>Programming &amp; Development</category>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/2buVKIMlQ2KFcFg8BPHoqU/c5d2dfbfd8cfe9a9432106a936fa204f/nuneybits_A_retro_glowing_computer_on_gradient_background_that__094dfc70-9906-4074-bb00-d32b04faf5f9-1.webp?w=300&amp;q=30" length="0" type="image/webp"/>
        </item>
        <item>
            <title><![CDATA[OpenAI Dev Day 2025: ChatGPT becomes the new app store — and hardware is coming]]></title>
            <link>https://venturebeat.com/technology/openai-dev-day-2025-chatgpt-becomes-the-new-app-store-and-hardware-is-coming</link>
            <guid isPermaLink="false">6na0pAdxl9xh4fN39O1ww5</guid>
            <pubDate>Tue, 07 Oct 2025 22:30:00 GMT</pubDate>
            <description><![CDATA[<p>In a packed hall at Fort Mason Center in San Francisco, against a backdrop of the Golden Gate Bridge, <a href="https://openai.com/"><u>OpenAI</u></a> CEO Sam Altman laid out a bold vision to remake the digital world. The company that brought generative AI to the mainstream with a simple chatbot is now building the foundations for its next act: a comprehensive computing platform designed to move beyond the screen and browser, with legendary designer Jony Ive enlisted to help shape its physical form.</p><p>At its <a href="https://openai.com/devday/"><u>third annual DevDay</u></a>, OpenAI unveiled a suite of tools that signals a strategic pivot from a model provider to a full-fledged ecosystem. The message was clear: the era of simply asking an AI questions is over. The future is about commanding AI to perform complex tasks, build software autonomously, and live inside every application, a transition Altman framed as moving from &quot;systems that you can ask anything to, to systems that you can ask to do anything for you.&quot; </p><p>The day’s announcements were a three-pronged assault on the status quo, targeting how users interact with software, how developers build it, and how businesses deploy intelligent agents. But it was the sessions held behind closed doors, away from the <a href="https://www.youtube.com/live/hS1YqcewH0c?si=mFE2rRx3QrK7z6NF"><u>public livestream</u></a>, that revealed the true scope of OpenAI’s ambition — a future that includes new hardware, a relentless pursuit of computational power, and a philosophical quest to redefine our relationship with technology.</p><h3><b>From chatbot to operating system: The new &#x27;App Store&#x27;</b></h3><p>The centerpiece of the public-facing keynote was the transformation of <a href="https://chatgpt.com/"><u>ChatGPT</u></a> itself. 
With the new <a href="https://openai.com/index/introducing-apps-in-chatgpt/"><u>Apps SDK</u></a>, OpenAI is turning its wildly popular chatbot into a dynamic, interactive platform, effectively an operating system where developers can build and distribute their own applications.</p><p>“Today, we&#x27;re going to open up ChatGPT for developers to build real apps inside of ChatGPT,” Altman announced during the keynote presentation to applause. “This will enable a new generation of apps that are interactive, adaptive and personalized, that you can chat with.”</p><p>Live demonstrations showcased apps from partners like <a href="https://www.coursera.org/"><u>Coursera</u></a>, <a href="https://www.canva.com/"><u>Canva</u></a>, and <a href="https://www.zillow.com/"><u>Zillow</u></a> running seamlessly within a chat conversation. A user could watch a machine learning lecture, ask <a href="https://chatgpt.com/"><u>ChatGPT</u></a> to explain a concept in real-time, and then use Canva to generate a poster based on the conversation, all without leaving the chat interface. The apps can render rich, interactive UIs, even going full-screen to offer a complete experience, like exploring a Zillow map of homes.</p><p>For developers, this represents a powerful new distribution channel. “When you build with the Apps SDK, your apps can reach hundreds of millions of chat users,” Altman said, highlighting a direct path to a massive user base that has grown to over <a href="https://venturebeat.com/ai/openai-announces-apps-sdk-allowing-chatgpt-to-launch-and-run-third-party"><u>800 million weekly active users</u></a>. </p><p>In a private press conference later, Nick Turley, head of ChatGPT, elaborated on the grander vision. &quot;We never meant to build a chatbot,&quot; he stated. &quot;When we set out to make ChatGPT, we meant to build a super assistant and we got a little sidetracked. 
And one of the tragedies of getting a little sidetracked is that we built a great chatbot, but we are the first ones to say that not all software needs to be a chatbot, not all interaction with the commercial world needs to be a chatbot.&quot;</p><p>Turley emphasized that while OpenAI is excited about natural language interfaces, &quot;the interface really needs to evolve, which is why you see so much UI in the demos today. In fact, you can even go full screen and chat is in the background.&quot; He described a future where users might &quot;start your day in ChatGPT, just because it kind of has become the de facto entry point into the commercial web and into a lot of software,&quot; but clarified that &quot;our incentive is not to keep you in. Our product is to allow other people to build amazing businesses on top and to evolve the form factor of software.&quot;</p><h3><b>The rise of the agents: Building the &#x27;do anything&#x27; AI</b></h3><p>If apps are about bringing the world into ChatGPT, the new &quot;<a href="https://openai.com/index/introducing-agentkit/"><u>Agent Kit</u></a>&quot; is about sending AI out into the world to get things done. OpenAI is providing a complete &quot;set of building blocks... to help you take agents from prototype to production,&quot; Altman explained in his keynote. </p><p><a href="https://openai.com/index/introducing-agentkit/"><u>Agent Kit</u></a> is an integrated development environment for creating autonomous AI workers. It features a visual canvas to design complex workflows, an embeddable chat interface (&quot;Chat Kit&quot;) for deploying agents in any app, and a sophisticated evaluation suite to measure and improve performance.</p><p>A compelling demo from financial operations platform <a href="https://ramp.com/"><u>Ramp</u></a> showed how Agent Kit was used to build a procurement agent. 
An employee could simply type, &quot;I need five more ChatGPT business seats,&quot; and the agent would parse the request, check it against company expense policies, find vendor details, and prepare a virtual credit card for the purchase — a process that once took weeks, now completed in minutes. </p><p>This push into agents is a direct response to a growing enterprise need to move beyond AI as a simple information retrieval tool and toward AI as a productivity engine that automates complex business processes. Brad Lightcap, OpenAI&#x27;s COO, noted that for enterprise adoption, &quot;you needed this kind of shift to more agentic AI that could actually do things for you, versus just respond with text outputs.&quot; </p><h3><b>The future of code and the Jony Ive bombshell</b></h3><p>Perhaps the most profound shift is occurring in software development itself. <a href="https://openai.com/index/codex-now-generally-available/"><u>Codex</u></a>, OpenAI&#x27;s AI coding agent, has graduated from a research preview to a full-fledged product, now powered by a specialized version of the new GPT-5 model. It is, as one speaker put it, &quot;a teammate that understands your context.&quot; </p><p>The capabilities are staggering. Developers can now assign <a href="https://openai.com/index/codex-now-generally-available/"><u>Codex</u></a> tasks directly from <a href="https://slack.com/"><u>Slack</u></a>, and the agent can autonomously write code, create pull requests, and even review other engineers&#x27; work on <a href="https://github.com/"><u>GitHub</u></a>. A live demo showed Codex taking a simple photo of a whiteboard sketch and turning it into a fully functional, beautifully designed mobile app screen. Another demo showed an app that could &quot;self-evolve,&quot; reprogramming itself in real-time based on a user&#x27;s natural language request. 
</p><p>But the day&#x27;s biggest surprise came in a closing fireside chat, which was not livestreamed, between <a href="https://openai.com/sam-and-jony/"><u>Altman and Jony Ive</u></a>, the iconic former chief design officer of Apple. The two revealed they have been collaborating for three years on a new family of AI-centric hardware.</p><p>Ive, whose design philosophy shaped the iPhone, iMac, and Apple Watch, said his creative team’s purpose &quot;became clear&quot; with the launch of ChatGPT. He argued that our current relationship with technology is broken and that AI presents an opportunity for a fundamental reset.</p><p>“I think it would be absurd to assume that you could have technology that is this breathtaking, delivered to us through legacy products, products that are decades old,” Ive said. “I see it as a chance to use this most remarkable capability to full-on address a lot of the overwhelm and despair that people feel right now.”</p><p>While details of the devices remain secret, Ive spoke of his motivation in deeply human terms. “We love our species, and we want to be useful. We think that humanity deserves much better than humanity generally is given,” he said. He emphasized the importance of &quot;care&quot; in the design process, stating, &quot;We sense when people have cared... you sense carelessness. You sense when somebody does not care about you, they care about money and schedule.&quot; </p><p>This collaboration confirms that OpenAI&#x27;s ambitions are not confined to the cloud; it is actively exploring the physical interface through which humanity will interact with its powerful new intelligence.</p><h3><b>The Unquenchable Thirst for Compute</b></h3><p>Underpinning this entire platform strategy is a single, overwhelming constraint: the availability of computing power. 
In both the private press conference and the un-streamed Developer State of the Union, OpenAI’s leadership returned to this theme again and again.</p><p>“The degree to which we are all constrained by compute... Everyone is just so constrained on being able to offer the services at the scale required to get the revenue that at this point, we&#x27;re quite confident we can push it pretty far,” Altman told reporters. He added that even with massive new hardware partnerships with AMD and others, &quot;we&#x27;ll be saying the same thing again. We&#x27;re so convinced... There&#x27;s so much more demand.&quot; </p><p>This explains the company’s aggressive, <a href="https://www.reuters.com/business/autos-transportation/companies-pouring-billions-advance-ai-infrastructure-2025-10-06/"><u>multi-billion-dollar investment in infrastructure</u></a>. When asked about profitability, Altman was candid that the company is in a phase of &quot;investment and growth.&quot; He invoked a famous quote from Walt Disney, paraphrasing, &quot;We make more money so we can make more movies.&quot; For OpenAI, the &quot;movies&quot; are ever-more-powerful AI models.</p><p>Greg Brockman, OpenAI’s President, put the ultimate goal in stark economic terms during the Developer State of the Union. &quot;AI is going to become, probably in the not too distant future, the fundamental driver of economic growth,&quot; he said. &quot;Asking ‘How much compute do you want?’ is a little bit like asking how much workforce do you want? The answer is, you can always get more out of more.&quot; </p><p>As the day concluded and developers mingled at the reception, the scale of OpenAI&#x27;s project came into focus. Fueled by new models like the powerful <a href="https://platform.openai.com/docs/models/gpt-5-pro"><u>GPT-5 Pro</u></a> and the stunning <a href="https://openai.com/index/sora-2/"><u>Sora 2</u></a> video generator, the company is no longer just building AI. 
It is building the world where AI will live — a world of intelligent apps, autonomous agents, and new physical devices, betting that in the near future, intelligence itself will be the ultimate platform.</p>]]></description>
            <author>michael.nunez@venturebeat.com (Michael Nuñez)</author>
            <category>Dev</category>
            <category>Programming &amp; Development</category>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/5SsXpDqbiyZvm7DJbht3pv/765b6ae01f75c4e84f15e6f9f026bdf2/nuneybits_A_developer_working_on_a_retro_computer_on_gradient_b_a06360d7-aa20-4d82-949f-48730f690988.webp?w=300&amp;q=30" length="0" type="image/webp"/>
        </item>
        <item>
            <title><![CDATA[OpenAI's DevDay 2025 preview: Will Sam Altman launch the ChatGPT browser?]]></title>
            <link>https://venturebeat.com/technology/openais-devday-2025-preview-will-sam-altman-launch-the-chatgpt-browser</link>
            <guid isPermaLink="false">4hlJDbxEwljBMDhPKiUhiq</guid>
            <pubDate>Fri, 03 Oct 2025 19:00:00 GMT</pubDate>
            <description><![CDATA[<p><a href="https://openai.com/"><u>OpenAI</u></a> will host more than 1,500 developers at its largest annual conference on Monday, as the company behind ChatGPT seeks to maintain its edge in an increasingly competitive artificial intelligence landscape.</p><p>The third annual <a href="https://openai.com/index/announcing-devday-2025/"><u>DevDay conference</u></a> at San Francisco&#x27;s Fort Mason represents a critical moment for OpenAI, which has seen its dominance challenged by rapid advances from Google&#x27;s <a href="https://gemini.google.com/app"><u>Gemini</u></a>, Anthropic&#x27;s <a href="https://claude.ai/"><u>Claude</u></a>, and Meta&#x27;s growing AI efforts. The event comes just days after OpenAI&#x27;s new <a href="https://www.cnbc.com/2025/10/03/openai-sora-apple-app-store.html"><u>Sora video generation app topped Apple&#x27;s App Store</u></a>, demonstrating the company&#x27;s ability to capture mainstream attention even as technical competitors close the gap.</p><p>Chief Executive Sam Altman will deliver the opening keynote at 10 a.m. Pacific time, promising &quot;announcements, live demos, and a vision of how developers are reshaping the future with AI.&quot; The session will be livestreamed, but subsequent presentations — including a developer-focused &quot;State of the Union&quot; with President Greg Brockman and a closing conversation between Altman and Apple design legend Jony Ive — will only be available to in-person attendees.</p><h2><b>Google and Meta challenge ChatGPT&#x27;s developer dominance</b></h2><p>The conference arrives at a pivotal moment for OpenAI. 
While the company&#x27;s ChatGPT remains the most recognizable AI brand globally, technical evaluations show Google&#x27;s <a href="https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/"><u>latest Gemini models</u></a> performing competitively on coding tasks, while Anthropic&#x27;s <a href="https://www.anthropic.com/news/claude-sonnet-4-5"><u>Claude</u></a> has gained traction among developers for its safety features and reasoning capabilities.</p><p>This intensifying competition has fundamentally altered OpenAI&#x27;s strategic calculus. The company that once commanded premium pricing for access to its models now finds itself in a price war, releasing more capable systems at lower costs to retain developer loyalty.</p><p>The shift reflects a maturing market where <a href="https://artificialanalysis.ai/models"><u>technical performance differences</u></a> between leading AI models have narrowed considerably, forcing companies to compete on price, developer experience, and specialized capabilities rather than raw model superiority.</p><p>The timing of <a href="https://devday.openai.com/"><u>DevDay</u></a> also follows several strategic moves by OpenAI that signal broader ambitions beyond its core chatbot business. The company recently launched <a href="https://venturebeat.com/ai/openai-debuts-sora-2-ai-video-generator-app-with-sound-and-self-insertion"><u>Sora 2</u></a>, its advanced video generation model, alongside a social media application that allows users to create and share AI-generated videos. Industry observers speculate that Monday&#x27;s event could feature the long-rumored ChatGPT browser, potentially challenging Google Chrome&#x27;s dominance.</p><h2><b>Enterprise AI adoption takes center stage as revenue strategy shifts</b></h2><p>This year&#x27;s agenda reflects OpenAI&#x27;s growing focus on enterprise customers, who provide more predictable revenue streams than consumer subscriptions. 
Sessions will cover &quot;orchestrating agents at scale,&quot; enterprise AI adoption challenges, and how OpenAI applies its own technology to internal workflows across sales, support, and finance.</p><p>The enterprise emphasis marks a shift from earlier <a href="https://devday.openai.com/"><u>DevDay</u></a> events. The <a href="https://venturebeat.com/ai/openai-announces-gpt-4-turbo-assistants-api-at-devday-aims-to-revolutionize-ai-apps"><u>inaugural 2023 conference</u></a> introduced GPT-4 Turbo and the GPT Store marketplace, while 2024&#x27;s more subdued gathering focused primarily on developer API improvements. This year&#x27;s expanded format suggests OpenAI views the developer community as crucial to its competitive positioning.</p><p>The State of the Union presentation is expected to focus on how artificial intelligence is transforming software development workflows, with anticipated demonstrations of enhanced capabilities in OpenAI&#x27;s Codex programming assistant and the introduction of new open model offerings that could expand developer access to the company&#x27;s technology.</p><h2><b>Sora cinema and interactive AI demos showcase next-generation capabilities</b></h2><p>Beyond formal presentations, <a href="https://devday.openai.com/"><u>DevDay</u></a> will feature hands-on demonstrations of emerging technologies. A &quot;Sora Cinema&quot; will showcase AI-generated short films, while custom arcade games built using GPT-5 — OpenAI&#x27;s latest model — will demonstrate the technology&#x27;s creative applications.</p><p>Perhaps most intriguingly, attendees can interact with a &quot;living portrait&quot; of computer science pioneer Alan Turing that responds to questions, representing the kind of interactive AI experiences that could define the next generation of human-computer interaction.</p><p>The presence of <a href="https://openai.com/sam-and-jony/"><u>Jony Ive</u></a> at the closing session carries particular significance. 
The former Apple executive has been collaborating with OpenAI on a consumer AI device, suggesting Monday&#x27;s conversation could provide insights into the company&#x27;s hardware ambitions.</p><h2><b>Developer ecosystem and market positioning face unprecedented competitive pressure</b></h2><p>For enterprise technology decision-makers, <a href="https://devday.openai.com/"><u>DevDay</u></a> represents more than a product showcase — it&#x27;s a window into how AI will reshape software development and business processes. The conference agenda includes sessions on context engineering, agent orchestration, and enterprise scaling challenges that reflect real-world implementation hurdles.</p><p>The developer ecosystem around <a href="https://openai.com/api/"><u>OpenAI&#x27;s APIs</u></a> has become a critical competitive moat. Companies like Cursor, Clay, and Decagon have built substantial businesses on OpenAI&#x27;s foundation models, creating network effects that make switching to alternative providers more difficult.</p><p>However, this ecosystem faces new challenges as competitors offer compelling alternatives. Google&#x27;s recent improvements to <a href="https://www.reddit.com/r/ChatGPTCoding/comments/1jkp5jw/gemini_25_pro_is_the_worlds_best_ai_for_coding/"><u>Gemini for coding tasks</u></a> and <a href="https://www.businessinsider.com/meta-superintelligence-labs-exec-ai-nat-friedman-faster-engineering-tools-2025-10"><u>Meta&#x27;s investments</u></a> in its Superintelligence Labs represent serious threats to OpenAI&#x27;s developer mindshare.</p><p>As the AI industry matures beyond initial breakthroughs, Monday&#x27;s DevDay will test whether OpenAI can maintain its leadership position through superior tooling, developer experience, and enterprise-focused innovation. 
With over <a href="https://www.reuters.com/technology/openai-hits-500-billion-valuation-after-share-sale-source-says-2025-10-02/"><u>$500 billion in market valuation</u></a> riding on continued growth, the stakes for this year&#x27;s conference extend far beyond San Francisco&#x27;s shores.</p><p>The keynote begins at 10 a.m. Pacific time and will be <a href="https://www.youtube.com/OpenAI"><u>available via livestream</u></a> on OpenAI&#x27;s YouTube channel.</p>]]></description>
            <author>michael.nunez@venturebeat.com (Michael Nuñez)</author>
            <category>Dev</category>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/61ORyI9sy034W678H86Sg1/115c8dc38810efac09cad1fb8fc751eb/nuneybits_A_retro_glowing_computer_on_gradient_background_that__787eb5d3-f84e-4f97-ba01-27ec03df4a88.webp?w=300&amp;q=30" length="0" type="image/webp"/>
        </item>
        <item>
            <title><![CDATA[Anthropic’s new Claude can code for 30 hours. Think of it as your AI coworker]]></title>
            <link>https://venturebeat.com/technology/anthropics-new-claude-can-code-for-30-hours-think-of-it-as-your-ai-coworker</link>
            <guid isPermaLink="false">1Bn23IpmMMPa8MuxWx1TZZ</guid>
            <pubDate>Mon, 29 Sep 2025 17:00:00 GMT</pubDate>
            <description><![CDATA[<p><a href="https://www.anthropic.com/"><u>Anthropic</u></a> launched <a href="https://claude.ai/"><u>Claude Sonnet 4.5</u></a> on Monday, positioning the artificial intelligence model as &quot;the best coding model in the world&quot; in a direct challenge to OpenAI&#x27;s recently released <a href="https://openai.com/index/introducing-gpt-5/"><u>GPT-5</u></a>, as the two AI giants battle for dominance in the lucrative enterprise software development market.</p><p>The San Francisco-based startup claims its newest model achieves state-of-the-art performance on critical coding benchmarks, scoring 77.2% on <a href="https://www.swebench.com/"><u>SWE-bench Verified</u></a> — a rigorous software engineering evaluation — a score the company says puts it ahead of GPT-5. More remarkably, Anthropic says Claude Sonnet 4.5 can maintain focus on complex, multi-step tasks for more than 30 hours, a dramatic leap in AI&#x27;s ability to handle sustained work.</p><p>&quot;Sonnet 4.5 achieves 77.2% on SWE-bench Verified (82% with parallel test-time compute). 
It is SOTA,&quot; an Anthropic spokesperson told VentureBeat, using industry shorthand for &quot;state of the art.&quot; The company also highlighted the model&#x27;s 50% score on <a href="https://www.tbench.ai/"><u>Terminal-bench</u></a>, another coding benchmark where it claims leadership.</p><p>The announcement follows mounting pressure from OpenAI&#x27;s recent advances and pointed criticism from high-profile figures like Elon Musk, who recently posted on X.com that &quot;<a href="https://x.com/elonmusk/status/1970537297792651492?s=46"><u>winning was never in the set of possible outcomes for Anthropic</u></a>.&quot; When asked about Musk&#x27;s statement, Anthropic declined to comment.</p><p>The release arrives just seven weeks after <a href="https://openai.com/index/introducing-gpt-5/"><u>OpenAI&#x27;s GPT-5 launch in August</u></a>, underscoring the breakneck pace of competition in artificial intelligence as companies race to capture enterprise customers increasingly relying on AI for software development. The timing is particularly noteworthy as Anthropic grapples with questions about its heavy dependence on just two major customers.</p><h2><b>Anthropic dominates coding market despite customer concentration risks</b></h2><p>The competition centers on a market that has emerged as AI&#x27;s first major profitable use case beyond chatbots. <a href="https://menlovc.com/perspective/2025-mid-year-llm-market-update/"><u>Anthropic commands 42% of the code generation market</u></a> — more than double OpenAI&#x27;s 21% share — according to a Menlo Ventures survey of 150 enterprise technical leaders. 
That dominance has translated into remarkable financial performance, with the company reaching a <a href="https://www.anthropic.com/news/anthropic-raises-series-f-at-usd183b-post-money-valuation"><u>$5 billion revenue run rate</u></a> earlier this year.</p><p>However, industry analysis reveals that coding applications <a href="https://cursor.com/"><u>Cursor</u></a> and <a href="https://github.com/features/copilot"><u>GitHub Copilot</u></a> drive approximately <a href="https://www.theinformation.com/articles/anthropics-claude-drives-strong-revenue-growth-while-powering-manus-sensation"><u>$1.4 billion of Anthropic&#x27;s revenue</u></a>, creating a potentially dangerous customer concentration that could leave the company vulnerable if either relationship falters.</p><p>&quot;Our run-rate revenue has grown significantly, even when you exclude these two customers,&quot; the Anthropic spokesperson said, pushing back on concerns about customer concentration. The company provided supportive quotes from both Cursor CEO Michael Truell and GitHub Chief Product Officer Mario Rodriguez praising Claude Sonnet 4.5&#x27;s performance.</p><p>The new model achieves significant advances in computer use capabilities, scoring 61.4% on <a href="https://os-world.github.io/"><u>OSWorld</u></a>, a benchmark that tests AI models on real-world computer tasks. Just four months ago, Claude Sonnet 4 held the lead at 42.2%, demonstrating rapid improvement in AI&#x27;s ability to interact with software interfaces.</p><h2><b>OpenAI&#x27;s aggressive pricing strategy threatens Anthropic&#x27;s premium positioning</b></h2><p>Anthropic&#x27;s announcement comes as the company grapples with competitive pressure from GPT-5&#x27;s aggressive pricing strategy. 
Early analysis shows <a href="https://simonwillison.net/2025/Aug/7/gpt-5/"><u>Claude Opus 4 costing roughly seven times more</u></a> per million tokens than GPT-5 for certain tasks, creating immediate pressure on Anthropic&#x27;s premium positioning.</p><p>The pricing disparity signals a fundamental shift in competitive dynamics that could force enterprise procurement teams to reconsider vendor relationships previously built on performance rather than price. Companies managing exponentially growing AI budgets now face comparable capability at a fraction of the cost.</p><p>Yet Anthropic is maintaining its pricing strategy with Claude Sonnet 4.5. &quot;Sonnet 4.5&#x27;s cost remains the same as Sonnet 4,&quot; the spokesperson confirmed, keeping prices at $3 per million input tokens and $15 per million output tokens.</p><h2><b>Claude Sonnet 4.5 delivers 30-hour autonomous work sessions and enhanced security</b></h2><p>Beyond performance improvements, Anthropic positions <a href="https://claude.com/product/overview"><u>Claude Sonnet 4.5</u></a> as its &quot;most aligned frontier model yet,&quot; showing significant reductions in concerning behaviors like sycophancy, deception, and power-seeking tendencies. The company has made &quot;considerable progress on defending against prompt injection attacks,&quot; a critical security concern for enterprise deployments.</p><p>The model is being released under <a href="https://www.anthropic.com/news/activating-asl3-protections"><u>Anthropic&#x27;s AI Safety Level 3 (ASL-3) protections</u></a>, which include classifiers designed to detect potentially dangerous inputs and outputs related to chemical, biological, radiological, and nuclear weapons. 
While these safeguards sometimes flag normal content, Anthropic says it has reduced false positives by a factor of ten since the classifiers were first introduced.</p><p>Perhaps most significantly for developers, Anthropic is releasing the <a href="https://docs.claude.com/en/home"><u>Claude Agent SDK</u></a> — the same infrastructure that powers its Claude Code product. &quot;We built Claude Code because the tool we needed didn&#x27;t exist yet,&quot; the company said in its announcement. &quot;The Agent SDK gives you the same foundation to build something just as capable for whatever problem you&#x27;re solving.&quot;</p><h2><b>International expansion accelerates as $1.5 billion copyright settlement finalizes</b></h2><p>The model launch coincides with Anthropic&#x27;s aggressive international expansion, as the company seeks to diversify beyond its U.S.-concentrated customer base. The startup recently announced plans to <a href="https://www.reuters.com/business/world-at-work/anthropic-triple-international-workforce-ai-models-drive-growth-outside-us-2025-09-26/"><u>triple its international workforce</u></a> and <a href="https://www.cnbc.com/2025/09/26/anthropic-global-ai-hiring-spree.html"><u>expand its applied AI team fivefold</u></a> in 2025, driven by data showing that nearly 80% of Claude usage now comes from outside the United States.</p><p>However, the expansion comes amid significant legal costs. Anthropic recently agreed to pay <a href="https://apnews.com/article/anthropic-authors-copyright-judge-artificial-intelligence-9643064e847a5e88ef6ee8b620b3a44c"><u>$1.5 billion in a copyright settlement with authors and publishers</u></a> over allegations the company illegally used their books to train AI models without permission. 
The settlement, approved by a federal judge last week, requires payments of $3,000 for each publication listed in the case.</p><h2><b>Enterprise AI spending doubles as companies prioritize performance over cost</b></h2><p>The rapid-fire model releases from both companies reflect the high stakes in enterprise AI adoption. Model API spending has <a href="https://menlovc.com/perspective/2025-mid-year-llm-market-update/"><u>more than doubled to $8.4 billion</u></a> in just six months, according to Menlo Ventures, as enterprises shift from experimental projects to production deployments.</p><p>Customer behavior patterns suggest enterprises consistently prioritize performance over price, upgrading to the newest models within weeks of release regardless of cost. This behavior could work in Anthropic&#x27;s favor if Claude Sonnet 4.5&#x27;s performance advantages prove compelling enough to overcome GPT-5&#x27;s pricing advantage.</p><p>However, the dramatic price differential introduced by GPT-5 could overcome typical switching inertia, especially for cost-conscious enterprises facing budget pressures. Industry observers note that model switching costs remain relatively low, with <a href="https://venturebeat.com/ai/anthropic-revenue-tied-to-two-customers-as-ai-pricing-war-threatens-margins"><u>66% of enterprises upgrading within existing providers</u></a> rather than switching vendors.</p><p>For enterprises, the intensifying competition delivers better performance and lower costs through continuously improving capabilities. The rapid pace of model improvements — with new versions launching monthly rather than annually — provides organizations with expanding AI capabilities while vendors compete aggressively for their business.</p><p>While the corporate rivalry between Anthropic and OpenAI dominates industry headlines, the real economic impact extends far beyond Silicon Valley boardrooms. 
The development of AI systems capable of sustained coding work for 30 hours represents a fundamental shift in how software gets built, with implications that extend across every industry relying on technology infrastructure.</p><p>These advancing capabilities signal broader workplace transformation ahead. As AI systems demonstrate increasing proficiency at complex, sustained intellectual work, the technology industry&#x27;s competition for coding supremacy foreshadows similar disruptions across fields requiring analytical thinking, problem-solving, and technical expertise.</p>]]></description>
            <author>michael.nunez@venturebeat.com (Michael Nuñez)</author>
            <category>Automation</category>
            <category>Dev</category>
            <category>Programming &amp; Development</category>
            <category>Software</category>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/qyB4FkFDwzcJMFwLkpg52/2fa0335d017169fc9296f756bc924835/nuneybits_Vector_art_of_brain_with_circuit_pathways_in_burnt_or_af313220-841a-4fca-a27c-072866d0243d.webp?w=300&amp;q=30" length="0" type="image/webp"/>
        </item>
        <item>
            <title><![CDATA[How enterprises can select the right QA tools for riding the AI vibe coding wave]]></title>
            <link>https://venturebeat.com/technology/with-vibe-coding-ai-tools-generating-more-code-than-ever-before-enterprises</link>
            <guid isPermaLink="false">7LuOgyjr1xS5OiIH6fDuQC</guid>
            <pubDate>Tue, 16 Sep 2025 13:00:00 GMT</pubDate>
            <description><![CDATA[<p>Enterprise startup <a href="https://www.coderabbit.ai/">CodeRabbit</a> today raised $60 million to solve a problem most enterprises don&#x27;t realize they have yet. As AI coding agents generate code faster than humans can review it, organizations face a critical infrastructure decision that will determine whether they capture AI&#x27;s productivity gains or get buried in technical debt.</p><p>The funding round, led by Scale Venture Partners, signals investor confidence in a new category of enterprise tooling. The code quality assurance (QA) space is a busy one, with GitHub&#x27;s bundled code review features, Cursor&#x27;s bug bot, <a href="https://venturebeat.com/ai/zencoder-just-launched-an-ai-that-can-replace-days-of-qa-work-in-two-hours"><u>Zencoder</u></a>, <a href="https://venturebeat.com/programming-development/qodo-teams-up-with-google-cloud-to-provide-devs-with-free-ai-code-review-tools-directly-within-platform"><u>Qodo</u></a> and emerging players like Graphite, and it&#x27;s rapidly attracting attention from both startups and incumbent platforms.</p><p>The market timing reflects a measurable shift in development workflows. Organizations using AI coding tools generate significantly more code. Traditional peer review processes haven&#x27;t scaled to match this velocity. The result is a new bottleneck that threatens to negate AI&#x27;s promised productivity benefits.</p><p>&quot;AI-generated code is here to stay, but speed without a centralized knowledge base and an independent governance layer is a recipe for disaster,&quot; Harjot Gill, CEO of CodeRabbit, told VentureBeat. &quot;Code review is the most critical quality gate in the agentic software lifecycle.&quot;</p><h2>The technical architecture that matters</h2><p>Unlike traditional static analysis tools that rely on rule-based pattern matching, AI code review platforms use reasoning models to understand code intent across entire repositories. 
The technical complexity is significant. These systems require multiple specialized models working in sequence over 5- to 15-minute analysis workflows.</p><p>&quot;We&#x27;re using around six or seven different models under the hood,&quot; Gill explained. &quot;This is one of those areas where reasoning models like GPT-5 are a good fit. These are PhD-style problems.&quot;</p><p>The key differentiator lies in context engineering. Advanced platforms gather intelligence from dozens of sources: code graphs, historical pull requests, architectural documents and organizational coding guidelines. This approach enables AI reviewers to catch issues that traditional tools miss. Examples include security vulnerabilities that emerge from changes across multiple files or architectural inconsistencies that only become apparent with full repository context.</p><h2>Competitive landscape and vendor positioning</h2><p>The AI code review space is attracting competition from multiple directions. </p><p>Though there are integrated QA capabilities built directly into platforms like GitHub and Cursor, there is still a need and a market for standalone solutions as well.</p><p>&quot;When it comes to the critical trust layer, organizations won&#x27;t go cheap out on that,&quot; Gill said. &quot;They will buy the best tool possible.&quot;</p><p>He noted that it&#x27;s similar in some respects to the observability market, where specialized tools like DataDog compete successfully against bundled alternatives like Amazon CloudWatch.</p><p>Gill&#x27;s view is validated by multiple industry analysts.</p><p>&quot;In an era of AI‑assisted development, code review is more important than ever; AI increases code volume and complexity that correspondingly increases code review times and raises the risk of defects,&quot; IDC analyst Arnal Dayaratna told VentureBeat. 
&quot;That reality elevates the value of an independent, platform‑agnostic reviewer that stands apart from the IDE or model vendor.&quot;</p><p>Industry analyst Paul Nashawaty told VentureBeat that CodeRabbit embeds context-aware, conversational feedback directly in developer environments, making reviews faster and less noisy for developers. Its ability to learn team preferences and provide in-editor guidance reduces friction and accelerates throughput.</p><p>&quot;That said, CodeRabbit is more of a complement than a replacement,&quot; Nashawaty said. &quot;Most enterprises will still pair it with established Static Application Security Testing (SAST)/Source Code Analysis (SCA) tools, which the industry estimates represent a $3B plus market growing at approximately 18% CAGR, for broader rule coverage, compliance reporting and governance.&quot;</p><h2>Real-world implementation results</h2><p>The Linux Foundation provides a concrete example of successful deployment. The organization supports numerous open-source projects across multiple programming languages: Golang, Python, Angular and TypeScript. Manual reviews were creating high-variance quality checks that missed critical bugs while slowing distributed teams across time zones.</p><p>The Linux Foundation&#x27;s default option before CodeRabbit was to review code manually. This approach was slow, inefficient and error-prone, involving significant time commitment from technical leads and often requiring two cycles to complete a review. 
After implementing CodeRabbit, their developers reported a 25% reduction in time spent on code reviews.</p><p>CodeRabbit caught issues that human reviewers had missed, including inconsistencies between documentation and test coverage, missing null checks, and refactoring opportunities in Terraform files.</p><h2>Evaluation framework for AI code review platforms</h2><p>Industry analysts have identified specific criteria enterprises should prioritize when evaluating AI code review platforms, based on common adoption barriers and technical requirements.</p><p><b>Agentic reasoning capabilities</b>: IDC analyst Arnal Dayaratna recommends prioritizing agentic capabilities that use generative AI to explain why changes were made, trace impact across the repository and propose fixes with clear rationale and test implications. This differs from traditional static analysis tools that simply flag issues without contextual understanding.</p><p><b>Developer experience and accuracy</b>: Analyst Paul Nashawaty emphasizes balancing developer adoption and risk coverage with focus on accuracy, workflow integration and contextual awareness of code changes.</p><p><b>Platform independence</b>: Dayaratna highlights the value of an independent, platform-agnostic reviewer that stands apart from the IDE or model vendor.</p><p><b>Quality validation and governance</b>: Both analysts stress pre-commit validation capabilities. Dayaratna recommends tools that validate suggested edits before commit to avoid new review churn and require automated tests, static analysis and safe application of one-click patches. Enterprises need governance flexibility to configure review standards. 
&quot;Every company has a different bar when it comes to how pedantic and how nitpicky they want the system to be,&quot; Gill noted.</p><p><b>Proof-of-concept approach</b>: Nashawaty recommends a 2-4 week proof-of-concept on real issues that helps measure developer satisfaction, scan accuracy, and remediation speed rather than relying solely on vendor demonstrations or feature checklists.</p><p>For enterprises looking to lead in AI-assisted development, it’s increasingly foundational to evaluate code review platforms as critical infrastructure, not point solutions. The organizations that establish robust AI review capabilities now will have competitive advantages in software delivery velocity and quality.</p><p>For enterprises adopting AI development tools later, the lesson is clear: plan for the review bottleneck before it constrains your AI productivity gains. The infrastructure decision you make today determines whether AI coding tools become force multipliers or sources of technical debt.</p>]]></description>
            <category>Dev</category>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/44kuH6NhIDtQdhtK9fkc62/aac16b67e7450387a34df2a37c20965b/rabbit-vibe-coding-qa-check-smk.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[OpenAI adds 'powerful but dangerous' support for MCP in ChatGPT dev mode]]></title>
            <link>https://venturebeat.com/dev/openai-adds-powerful-but-dangerous-support-for-mcp-in-chatgpt-dev-mode</link>
            <guid isPermaLink="false">6hpz8iMribbm8V3ZKHsG2l</guid>
            <pubDate>Thu, 11 Sep 2025 17:20:00 GMT</pubDate>
            <description><![CDATA[<p>It dropped in the midst of a chaotic and tragic news day yesterday, but OpenAI made a significant upgrade to ChatGPT that&#x27;s worth further consideration among software developers: the <a href="https://platform.openai.com/docs/guides/developer-mode">company added support for the emerging Model Context Protocol (MCP)</a> standard <b>directly into ChatGPT itself when switched into developer mode</b>, allowing third-party developers to connect their own external (MCP-compatible) servers and tools <i>directly</i> into their own ChatGPT accounts. </p><p>This provides a huge advantage to third-party developers who want to access and modify their own websites, products and services directly <i>within ChatGPT&#x27;s</i> web interface.</p><p>Instead of logging into separate apps, clicking through menus, or juggling multiple dashboards, a developer who has ChatGPT dev mode switched on (and is using the Plus or Pro plans at $20 or $200 monthly, respectively) can ask ChatGPT a natural language question and get a direct answer from <i>the developers&#x27; own </i>service, or even make changes to it, all from within a single chat.</p><p>The company cautions that while the feature is powerful, it comes with risks. </p><p>In OpenAI’s own words: <i>“it&#x27;s powerful but dangerous, and is intended for developers who understand how to safely configure and test connectors. 
When using developer mode, watch for prompt injections and other risks, model mistakes on write actions that could destroy data, and malicious MCPs that attempt to steal information.&quot;</i></p><h3><b>Why OpenAI&#x27;s move is so significant and helps further enshrine MCP as an AI industry standard</b></h3><p>MCP itself is an open standard, first <a href="https://www.anthropic.com/news/model-context-protocol">introduced by Anthropic in November 2024</a>, that provides a consistent way to connect AI assistants to external systems such as content repositories, enterprise software, or developer tools.</p><p>Anthropic likens MCP to a kind of USB-C port for AI applications: just as USB-C simplifies hardware connections, MCP standardizes how AI models communicate with external data and services. </p><p>Since its release, MCP has gained rapid adoption across the industry, with early adopters including Block, Apollo, Cloudflare, MongoDB, and PayPal, all of whom have made it possible for developers to connect AI chatbots and other gen AI tools to these respective third party services and obtain information from them. There&#x27;s a<a href="https://mcpservers.org/category/other"> whole website showing public MCP servers</a> to which AI developers can now connect large language models. </p><p>Thus, the purpose of ChatGPT’s new developer mode with MCP support is to give developers a standardized, relatively simple and easy way to connect their systems, tools, or services<i> directly into ChatGPT </i>and maintain that connection going forward, retrieving information or executing operations on the connected MCP server from <i>within the developer&#x27;s own ChatGPT interface</i>. 
</p><p>Instead of building custom integrations or relying on OpenAI’s old plugin system, developers can now host an MCP server that exposes specific functions — such as checking inventory, updating records, or processing payments — and then execute them <i>from within their own ChatGPT Plus or Pro accounts</i>. </p><p>Industry observers and participants argue that MCP is fast becoming a common language for enterprise AI. A recent <a href="https://venturebeat.com/ai/the-interoperability-breakthrough-how-mcp-is-becoming-enterprise-ais-universal-language">VentureBeat article</a> noted that despite only launching months earlier, MCP appears to be the leading candidate for interoperability in the agentic AI ecosystem. </p><p>Unlike traditional application programming interfaces (APIs), MCP allows for more granular control and security. Enterprises can define what tools are exposed, require authentication from agents, and enforce rules about what models can or cannot access. This fine-grained control makes MCP particularly appealing in corporate environments where security and compliance are top priorities.</p><p>For developers working with ChatGPT’s new developer mode, this means the connectors they create may not just serve one-off integrations — they could be building into a broader ecosystem standard. MCP was created as an open protocol so that one connector can work across different AI ecosystems, not just ChatGPT. </p><h3><b>How it works</b></h3><p>Enabling developer mode requires users to navigate to Settings → Connectors → Advanced → Developer mode.</p><p>Once active, the option to add connectors appears in conversations. Developers can link remote MCP servers over supported protocols such as SSE and streaming HTTP. Authentication options include OAuth or no authentication at all.</p><p>Within the connector settings, developers can toggle individual tools on or off, refresh connectors to pull the latest descriptions, and inspect tool details. 
</p><p>During conversations, ChatGPT can then invoke the selected tools. Developers may also steer tool usage by writing explicit prompts, such as telling the model to only use a specific connector for a given task, to avoid ambiguity with built-in features.</p><h3><b>From Stripe to Slack: example use cases</b></h3><p>OpenAI posted a <a href="https://x.com/OpenAIDevs/status/1965807401745207708">video demo on X</a> showing how its new dev mode with MCP support can perform linked actions across different services — once a developer has taken the time to expose and connect the MCP server to ChatGPT.</p><p>In one example, ChatGPT used an MCP Stripe connector to check a balance and create a customer invoice, confirming the write action before executing it. The invoice details were then viewable directly in the chat session.</p><p>A second demonstration layered multiple connectors. ChatGPT first processed a refund through Stripe, then used a Zapier connector to send a Slack message notifying the customer. </p><p>Other possible integrations include updating Jira tickets via Atlassian’s MCP server or using Cloudflare’s connector to convert web pages to Markdown.</p><p>These examples highlight the flexibility of chaining connectors together to automate multi-step processes, with ChatGPT managing the sequencing between tools.</p><h3><b>Controls and safeguards</b></h3><p>To mitigate risks, all write actions require explicit confirmation by default. Before a tool executes a write, the developer can expand the tool call details to inspect the full JSON payload, including both inputs and expected outputs. </p><p>Developers may choose to remember an approve or deny choice for the duration of a conversation, but new or refreshed sessions will require confirmation again. 
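</p><p>Concretely, the payload a developer expands is the <code>tools/call</code> request the model is about to send. As a rough sketch (the tool definitions and payload fields below are hypothetical), a client can separate read-only tools from write actions using the MCP <code>readOnlyHint</code> annotation and surface a pending write for review:</p>

```python
# Classify a pending MCP tool call as read-only or a write action:
# the distinction that decides whether a confirmation prompt is shown.
# Tool definitions and the payload below are illustrative only.

def is_write_action(tool: dict) -> bool:
    # Only tools explicitly annotated readOnlyHint are treated as
    # read-only; everything else is classified as a write.
    return not tool.get("annotations", {}).get("readOnlyHint", False)

refund_tool = {
    "name": "create_refund",
    "description": "Refund a charge. Irreversible write action.",
}
balance_tool = {
    "name": "get_balance",
    "description": "Fetch the current account balance.",
    "annotations": {"readOnlyHint": True},
}

# The JSON a developer would expand and inspect before approving.
pending_call = {
    "method": "tools/call",
    "params": {"name": "create_refund",
               "arguments": {"charge_id": "ch_123", "amount_cents": 500}},
}

if is_write_action(refund_tool):
    print("confirmation required for:", pending_call["params"]["name"])
```

<p>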
</p><p>Tools flagged with the <code>readOnlyHint</code> annotation are treated as read-only, while all others are classified as write actions.</p><h3><b>Guidance for developers</b></h3><p>OpenAI provides recommendations for making connectors easier and safer to use. </p><p>Tool names and descriptions should be action-oriented and include clear instructions about when to use them, as well as parameter explanations. </p><p>This guidance helps the model distinguish between similar tools and avoid defaulting to inappropriate built-in options.</p><p>The documentation also advises developers to explicitly disallow alternative tools when necessary, specify input shapes for tool calls, and define sequencing when multiple steps are required. </p><p>For example, one prompt might instruct ChatGPT to read a file from a repository first and then write modified content back, avoiding any intermediate or unintended actions.</p><h3><b>Connecting to the broader MCP ecosystem</b></h3><p>The release of developer mode follows other recent updates to OpenAI’s developer-facing tools, particularly the <a href="https://venturebeat.com/programming-development/openai-updates-its-new-responses-api-rapidly-with-mcp-support-gpt-4o-native-image-gen-and-more-enterprise-features">Responses API</a>, which received support for MCP in May 2025.</p><p>Among its features are support for remote MCP servers, integration of GPT-4o’s image generation model, and access to built-in tools such as Code Interpreter and improved file search.</p><p>That API was designed as a unified toolbox for building agentic applications and has already processed trillions of tokens since its launch earlier this year. </p><h3><b>Building on earlier commitments</b></h3><p>The expansion of MCP support across OpenAI’s ecosystem also ties back to comments from CEO and co-founder Sam Altman earlier this year. 
</p><p>In March, <a href="https://x.com/sama/status/1904957253456941061?lang=en">Altman wrote on X</a> that “people love MCP and we are excited to add support across our products,” noting that it was already available in the Agents SDK and would soon extend to the ChatGPT desktop app and Responses API. </p><p>With developer mode now live, that roadmap is taking clearer shape as the company moves to integrate MCP more deeply into both its APIs and its flagship ChatGPT product.</p><h3><b>Comparing OpenAI’s and Anthropic’s MCP guidance</b></h3><p>Although both OpenAI and Anthropic are building around the same open MCP standard, their guidance for developers reflects differences in focus and product integration.</p><p>OpenAI’s instructions for developer mode are tightly tied to ChatGPT’s interface. The company emphasizes practical prompting techniques for ensuring the right tool is called, such as telling the model to only use a specific connector and to avoid built-in tools. </p><p>It also advises developers to specify input formats and sequencing when chaining multiple tool calls. Much of the guidance centers on guardrails and safety: inspecting JSON payloads, confirming write actions, and understanding that model mistakes could delete or alter important data. In other words, OpenAI frames MCP use within ChatGPT as powerful but risky, and stresses developer responsibility in setting up and testing connectors safely.</p><p><a href="https://docs.anthropic.com/en/docs/mcp">Anthropic’s approach</a>, by contrast, is more infrastructure-oriented. Its documentation highlights MCP as an open protocol that enables developers to either expose data through servers or build clients that connect to them. 
Rather than focusing on prompt design or in-chat usage, Anthropic stresses the architecture: MCP servers as connectors to enterprise systems, MCP clients as AI applications, and a growing ecosystem of prebuilt servers for popular platforms like Google Drive, GitHub, and Slack.</p><p>Its guidance encourages developers to quickly spin up servers, connect them to Claude or other tools, and contribute to the open-source ecosystem.</p><p>Where they converge is in treating MCP as a way to overcome the problem of fragmented integrations. Both also frame it as critical for building agentic systems—AIs that not only generate text but also act on external systems in structured ways.</p><p>The differences reflect their respective product strategies. OpenAI’s guidance is tailored to developers who want to experiment inside ChatGPT and its surrounding APIs, where the risks of write actions are immediate and visible to end users. Anthropic, meanwhile, positions MCP as a foundational building block for enterprise infrastructure and developer platforms, encouraging organizations to standardize their tool connections at the protocol level. For developers, this means OpenAI’s focus is on safe in-chat usage, while Anthropic’s is on building scalable systems that can serve entire organizations.</p><h3><b>What it all means</b></h3><p>For developers already experimenting with MCP servers, the new mode significantly broadens what can be done inside ChatGPT. </p><p>Instead of only fetching data, users can now carry out full workflows — updating records, generating invoices, issuing refunds, or coordinating with third-party services like Slack — directly within a conversation. The ability to chain connectors together also opens the door to more complex automations.</p><p>At the same time, the emphasis on careful setup and review reflects the dual nature of the update: powerful but potentially risky if misused. 
By keeping confirmations in place and requiring developers to inspect tool calls, OpenAI appears to be prioritizing responsible use as the feature rolls out.</p>]]></description>
            <author>carl.franzen@venturebeat.com (Carl Franzen)</author>
            <category>Dev</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/1XqQUKsPTnWDMZodDPjhT5/f7ffaabef2c42ccaa9df4d83fc880b24/cfr0z3n_flat_illustration_elegant_constructivist_1920s_art_deco_1d105ac8-fbd9-4602-ac28-e1dd5c51f273.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
    </channel>
</rss>