<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
    <channel>
        <title>VentureBeat</title>
        <link>https://venturebeat.com/feed/</link>
        <description>Transformative tech coverage that matters</description>
        <lastBuildDate>Wed, 08 Apr 2026 22:26:41 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <language>en</language>
        <copyright>Copyright 2026, VentureBeat</copyright>
        <item>
            <title><![CDATA[Claude, OpenClaw and the new reality: AI agents are here — and so is the chaos]]></title>
            <link>https://venturebeat.com/infrastructure/claude-openclaw-and-the-new-reality-ai-agents-are-here-and-so-is-the-chaos</link>
            <guid isPermaLink="false">E4kZwK085N3OHabqiT6mh</guid>
            <pubDate>Wed, 08 Apr 2026 22:26:37 GMT</pubDate>
            <description><![CDATA[<p>The age of agentic AI is upon us — whether we like it or not. What started as innocent question-and-answer banter with ChatGPT back in 2022 has become an existential debate on job security and the rise of the machines. </p><p>More recently, fears of reaching artificial general intelligence (AGI) have become more real with the advent of powerful autonomous agents like Claude Cowork and <a href="https://venturebeat.com/security/openclaw-500000-instances-no-enterprise-kill-switch">OpenClaw</a>. Having played with these tools for some time, I can offer a comparison.</p><p>First, we have OpenClaw (formerly known as Moltbot and Clawdbot). Surpassing 150,000 GitHub stars in days, OpenClaw is already being deployed on local machines with deep system access. This is like a robot “maid” (Irona for <i>Richie Rich</i> fans, for instance) that you give the keys to your house. It’s supposed to clean it, and you give it the necessary autonomy to take actions and manage your belongings (files and data) as it pleases. The whole purpose is to perform the task at hand — inbox triaging, auto-replies, content curation, travel planning, and more.</p><p>Next we have Google’s <a href="https://antigravity.google/">Antigravity</a>, a coding agent with an IDE that accelerates the path from prompt to production. You can interactively create complete application projects and modify specific details through individual prompts. This is like having a junior developer who can not only code, but build, test, integrate, and fix issues. In the real world, this is like hiring an electrician: They are really good at a specific job, and you only need to give them access to a specific item (your electrical junction box). </p><p>Finally, we have the mighty Claude. 
The release of Anthropic&#x27;s Cowork, which featured AI agents for automating legal tasks like contract review and NDA triage, caused a sharp sell-off in legal-tech and software-as-a-service (SaaS) stocks (referred to as the <a href="https://venturebeat.com/infrastructure/intuit-is-betting-its-40-years-of-small-business-data-can-outlast-the">SaaSpocalypse</a>). Claude has long been the go-to chatbot; now, with Cowork, it has domain knowledge for specific industries like legal and finance. This is like hiring an accountant: They know the domain inside out and can complete taxes and manage invoices. Users provide specific access to highly sensitive financial details.</p><h2>Making these tools work for you</h2><p>The key to making these tools more impactful is giving them more power, but that increases the <a href="https://venturebeat.com/security/openclaw-can-bypass-your-edr-dlp-and-iam-without-triggering-a-single-alert">risk of misuse</a>. Users must trust providers like Anthropic and Google to ensure that agent prompts will not cause harm, leak data, or provide an unfair (illegal) advantage to certain vendors. OpenClaw is open-source, which complicates things, as there is no central governing authority. </p><p>While these technological advancements are amazing and meant for the greater good, all it takes is one or two adverse events to cause panic. Imagine the agentic electrician frying all your house circuits by connecting the wrong wire. In an agent scenario, this could be injecting incorrect code, breaking a larger system, or adding hidden flaws that may not be immediately evident. Cowork could miss major savings opportunities when doing a user&#x27;s taxes; on the flip side, it could include illegal writeoffs. Claude can do unimaginable damage when it has more control and authority.</p><p>But in the middle of this chaos, there is a real opportunity. 
With the right guardrails in place, agents can focus on specific actions and avoid making random, unaccounted-for decisions. Principles of responsible AI — accountability, transparency, reproducibility, security, privacy — are extremely important. Logging agent steps and human confirmation are absolutely critical. </p><p>Also, when agents deal with so many diverse systems, it&#x27;s important they speak the same language. Ontology becomes essential so that events can be tracked, monitored, and accounted for. A shared domain-specific ontology can define a “code of conduct.” These ethics can help control the chaos. When tied together with a shared trust and distributed identity framework, we can build systems that enable agents to do truly useful work. </p><p>When done right, an agentic ecosystem can greatly offload the human “cognitive load” and enable our workforce to perform high-value tasks. Humans will benefit when agents handle the mundane.</p><p><i>Dattaraj Rao is innovation and R&amp;D architect at Persistent Systems. </i></p>]]></description>
            <author>dattarajraogravitar@gmail.com (Dattaraj Rao, Persistent Systems)</author>
            <category>Infrastructure</category>
            <category>DataDecisionMakers</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/33OB5cKXtts9VZ7sMyzGew/7454f3b529fbde6e78746d28b720e4c4/Chaos.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[Goodbye, Llama? Meta launches new proprietary AI model Muse Spark — first since Superintelligence Labs' formation]]></title>
            <link>https://venturebeat.com/technology/goodbye-llama-meta-launches-new-proprietary-ai-model-muse-spark-first-since</link>
            <guid isPermaLink="false">2ByGsw6cDtCNC0Q4WHg9ID</guid>
            <pubDate>Wed, 08 Apr 2026 20:34:01 GMT</pubDate>
            <description><![CDATA[<p>Meta has been one of the most interesting companies of the generative AI era — initially gaining a huge and loyal following of users with the release of its mostly open source Llama family of large language models (LLMs)<a href="https://venturebeat.com/ai/forget-chatgpt-why-llama-and-open-source-ai-win-2023"> beginning in early 2023</a>, but coming to a screeching halt last year after <a href="https://venturebeat.com/ai/meta-defends-llama-4-release-against-reports-of-mixed-quality-blames-bugs">Llama 4 debuted to mixed reviews</a> and, ultimately, <a href="https://www.reddit.com/r/LocalLLaMA/comments/1q25070/lecun_says_llama_4_results_were_fudged_a_little/">admissions of gaming benchmarks</a>.</p><p>That bumpy rollout of Llama 4 apparently spurred Meta founder and CEO Mark Zuckerberg to <a href="https://finance.yahoo.com/news/meta-launches-superintelligence-labs-accelerate-065337958.html">totally overhaul Meta&#x27;s AI operations in the summer of 2025</a>, forming a new internal division, Meta Superintelligence Labs (MSL), which he recruited 29-year-old former Scale AI co-founder and CEO Alexandr Wang to lead as Chief AI Officer. </p><p>Today, Meta is showing us the fruits of that effort: <a href="https://ai.meta.com/blog/introducing-muse-spark-msl/">Muse Spark</a>, a new proprietary model that Wang says (<a href="https://x.com/alexandr_wang/status/2041909380208324687">posting on rival social network X,</a> used more often by the machine learning community) is &quot;the most powerful model that meta has released,&quot; and has &quot;support for tool-use, visual chain of thought, &amp; multi-agent orchestration.&quot; He also says it will be the start of a new Muse family of models, raising questions about what will become of Meta&#x27;s popular lineup and ongoing development of the Llama family. 
</p><p>It arrives not as a generic chatbot, but as the foundation for what Wang calls &quot;personal superintelligence&quot;—an AI that doesn’t just process text but &quot;sees and understands the world around you&quot; to act as a digital extension of the self, echoing <a href="https://venturebeat.com/ai/mark-zuckerberg-says-developing-superintelligence-is-now-in-sight-shades-openai-and-other-firms-focused-on-automating-work">Zuckerberg&#x27;s public manifesto for a vision of personal superintelligence</a> published in summer 2025.</p><p>However, it is proprietary only — confined for now to the Meta AI app and website, as well as a &quot;private API preview to select users,&quot; according to <a href="https://ai.meta.com/blog/introducing-muse-spark-msl/">Meta&#x27;s blog post announcing it</a> — a move likely to rankle the literally billions of users of Llama models and the thousands of developers who relied upon them (some of whom are active participants in <a href="https://www.reddit.com/r/LocalLLaMA/new/">Reddit&#x27;s r/LocalLLaMA subreddit</a>). In addition, no pricing information for the model has yet been announced.</p><p>It&#x27;s unclear if Meta has ended development on the Llama family entirely. When asked directly by VentureBeat, a Meta spokesperson said in an email: “Our current Llama models will continue to be available as open source,” which doesn’t address the question of development of future Llama models. </p><h2><b>Visual chain-of-thought</b></h2><p>At its core, Muse Spark is a natively multimodal reasoning model. Unlike previous iterations that &quot;stitched&quot; vision and text together, Muse Spark was rebuilt from the ground up to integrate visual information across its internal logic. 
This architectural shift enables &quot;visual chain of thought,&quot; allowing the model to annotate dynamic environments—identifying the components of a complex espresso machine or correcting a user&#x27;s yoga form via side-by-side video analysis.</p><p>The most significant technical leap, however, is a new &quot;Contemplating&quot; mode. This feature orchestrates multiple sub-agents to reason in parallel, allowing Meta to compete with extreme reasoning models like Google&#x27;s Gemini Deep Think and OpenAI&#x27;s GPT-5.4 Pro.</p><p>In benchmarks, this mode achieved 58% in &quot;Humanity’s Last Exam&quot; and 38% in &quot;FrontierScience Research,&quot; figures that Meta claims validate its new scaling trajectory.</p><p>Perhaps more impressive for the company’s bottom line is the model’s efficiency. Meta reports that Muse Spark achieves its reasoning capabilities using over an order of magnitude less compute than Llama 4 Maverick, its previous mid-size flagship. This efficiency is driven by a process called &quot;thought compression&quot;. During reinforcement learning, the model is penalized for excessive &quot;thinking time,&quot; forcing it to solve complex problems with fewer reasoning tokens without sacrificing accuracy.</p><h2><b>Benchmarks reveal a return to form</b></h2><p>The launch of Muse Spark is framed as a statistical &quot;quantum leap,&quot; ending Meta’s year-long absence from the absolute frontier of AI performance. </p><p>By reconciling Meta’s official internal data with independent auditing from third-party LLM tracking firm <a href="https://x.com/ArtificialAnlys/status/2041913043379220801">Artificial Analysis,</a> a clear picture emerges: Muse Spark is not just a marginal improvement over the Llama series; it is a fundamental re-entry into the &quot;Top 5&quot; global models.</p><p>According to the Artificial Analysis Intelligence Index v4.0, <b>Muse Spark achieved a score of 52</b>. 
For context, <b>Meta’s previous flagship, Llama 4 Maverick</b>, debuted in 2025 with an <b>Index score of just 18</b>.</p><p>By nearly tripling its performance, <b>Muse Spark now sits within striking distance of the industry’s most elite systems, trailing only Gemini 3.1 Pro Preview (57), GPT-5.4 (57), and Claude Opus 4.6 (53).</b></p><p>Meta’s official benchmarks suggest that Muse Spark is particularly dominant in multimodal reasoning, specifically where visual figures and logic intersect.</p><ul><li><p><b>CharXiv Reasoning</b>: In &quot;figure understanding,&quot; Muse Spark achieved a score of <b>86.4</b>, significantly outperforming <b>Claude Opus 4.6</b> (65.3), <b>Gemini 3.1 Pro</b> (80.2), and <b>GPT-5.4</b> (82.8).</p></li><li><p><b>MMMU Pro</b>: Official reports place the model at <b>80.4</b>, while Artificial Analysis’s independent audit measured it at <b>80.5%</b>. This makes it the <b>second-most capable vision model</b> on the market, surpassed only by Gemini 3.1 Pro Preview (83.9% official; 82.4% independent).</p></li><li><p><b>Visual Factuality (SimpleVQA)</b>: Muse Spark scored <b>71.3</b>, placing it ahead of GPT-5.4 (61.1) and Grok 4.2 (57.4), though it narrowly trails Gemini 3.1 Pro (72.4).</p></li></ul><p>These scores validate Meta’s focus on &quot;visual chain of thought,&quot; enabling the model to not just recognize objects, but to reason through complex spatial problems and dynamic annotations.</p><p>The &quot;Contemplating&quot; mode of Muse Spark was put to the test against specialized benchmarks designed to break non-reasoning models.</p><ul><li><p><b>Humanity’s Last Exam (HLE)</b>: In this multidisciplinary evaluation, Meta reports a score of <b>42.8</b> (No Tools) and <b>50.4</b> (With Tools). 
Independent audits by Artificial Analysis tracked the model at <b>39.9%</b>, trailing Gemini 3.1 Pro Preview (44.7%) and GPT-5.4 (41.6%).</p></li><li><p><b>GPQA Diamond (PhD Level Reasoning)</b>: Muse Spark achieved a formidable <b>89.5</b>, surpassing Grok 4.2 (88.5) but trailing the specialized &quot;max reasoning&quot; outputs of Opus 4.6 (92.7) and Gemini 3.1 Pro (94.3).</p></li><li><p><b>ARC AGI 2</b>: This remains a notable weak point. Muse Spark scored <b>42.5</b>, far behind the abstract reasoning puzzles solved by <b>Gemini 3.1 Pro</b> (76.5) and <b>GPT-5.4</b> (76.1).</p></li><li><p><b>CritPT (Physics Research)</b>: Independent auditing found Muse Spark achieved the <b>5th highest score</b> at <b>11%</b>. This marks a substantial lead over <b>Gemini 3 Flash</b> (9%) and <b>Claude 4.6 Sonnet</b> (3%).</p></li></ul><p>One of the most striking results from the official data is Muse Spark&#x27;s performance in the health sector, likely a result of Meta&#x27;s collaboration with over 1,000 physicians.</p><ul><li><p><b>HealthBench Hard</b>: Muse Spark achieved <b>42.8</b>, a massive lead over <b>Claude Opus 4.6</b> (14.8), <b>Gemini 3.1 Pro</b> (20.6), and even <b>GPT-5.4</b> (40.1).</p></li><li><p><b>MedXpertQA (Multimodal)</b>: It scored <b>78.4</b>, comfortably ahead of Opus 4.6 (64.8) and Grok 4.2 (65.8), though it still trails Gemini 3.1 Pro’s top-tier score of 81.3.</p></li></ul><h2><b>Agentic Systems and Efficiency: The &quot;Thought Compression&quot; Effect</b></h2><p>While Muse Spark excels at reasoning, its &quot;agentic&quot; performance—executing real-world work tasks—presents a more nuanced picture.</p><ul><li><p><b>SWE-Bench Verified</b>: Muse Spark scored <b>77.4</b>, trailing <b>Claude Opus 4.6</b> (80.8) and <b>Gemini 3.1 Pro</b> (80.6).</p></li><li><p><b>GDPval-AA Elo</b>: Meta’s official score of <b>1444</b> differs slightly from Artificial Analysis’s recorded <b>1427</b>. 
In both cases, Muse Spark trails <b>GPT-5.4</b> (1672) and <b>Opus 4.6</b> (1606), suggesting that while the model &quot;thinks&quot; well, it is still refining its ability to &quot;act&quot; in long-horizon software and office workflows.</p></li><li><p><b>Token Efficiency</b>: This is where Muse Spark distinguishes itself. To run the Intelligence Index, it used <b>58 million output tokens</b>. In contrast, <b>Claude Opus 4.6</b> required <b>157 million</b> tokens and <b>GPT-5.4</b> required <b>120 million</b>. This supports Meta&#x27;s claim of &quot;<b>thought compression</b>&quot;—delivering frontier-class intelligence while using less than half the &quot;thinking time&quot; of its closest competitors.</p></li></ul><table><tbody><tr><td><p><b>Benchmark</b></p></td><td><p><b>Llama 4 Maverick (2025)</b></p></td><td><p><b>Muse Spark (Official)</b></p></td><td><p><b>Gemini 3.1 Pro (Official)</b></p></td></tr><tr><td><p><b>Intelligence Index Score</b></p></td><td><p>18</p></td><td><p><b>52</b></p></td><td><p>57</p></td></tr><tr><td><p><b>MMMU Pro</b></p></td><td><p>--</p></td><td><p><b>80.4</b></p></td><td><p>83.9</p></td></tr><tr><td><p><b>CharXiv Reasoning</b></p></td><td><p>--</p></td><td><p><b>86.4</b></p></td><td><p>80.2</p></td></tr><tr><td><p><b>HealthBench Hard</b></p></td><td><p>--</p></td><td><p><b>42.8</b></p></td><td><p>20.6</p></td></tr><tr><td><p><b>License</b></p></td><td><p>Open-Weights </p></td><td><p><b>Proprietary</b></p></td><td><p>Proprietary</p></td></tr></tbody></table><p>With Muse Spark, Meta has successfully transitioned from being the &quot;LAMP stack for AI&quot; to a direct challenger for the title of &quot;Personal Superintelligence&quot;. 
While agentic workflows remain a hurdle, its dominance in vision, health, and token efficiency places Meta back at the center of the frontier race.</p><h2><b>Personal wellness and Instagram shopping </b></h2><p>Meta is immediately deploying Muse Spark to power specialized experiences across its app family.</p><ul><li><p><b>Shopping Mode: </b>A new feature that leverages Meta’s vast creator ecosystem. The AI picks up on brands, styling choices, and content across Instagram and Threads to provide personalized recommendations, effectively turning every post into a shoppable interaction.</p></li><li><p><b>Health Reasoning:</b> In a move toward medical utility, Meta collaborated with over 1,000 physicians to curate training data. Muse Spark can now analyze nutritional content from photos of food or provide &quot;health scores&quot; for pescatarian diets with high cholesterol.</p></li><li><p><b>Interactive UI: </b>The model can generate web-based minigames or tutorials on the fly. For example, a user can prompt the AI to turn a photo into a playable Sudoku game or a highlights-based tutorial for home appliances.</p></li></ul><h2><b>Evaluation awareness</b></h2><p>While Muse Spark demonstrates strong refusal behaviors regarding biological and chemical weapons, its safety profile includes a startling new discovery. Third-party testing by Apollo Research found that the model possesses a high degree of &quot;evaluation awareness&quot;.</p><p>The model frequently recognized when it was being tested in &quot;alignment traps&quot; and reasoned that it should behave honestly specifically because it was under evaluation. 
</p><p>While Meta concluded this was not a &quot;blocking concern&quot; for release, the finding suggests that frontier models are becoming increasingly &quot;conscious&quot; of the testing environment—potentially rendering traditional safety benchmarks less reliable as models learn to &quot;game&quot; the exam.</p><h2><b>What happens to Llama?</b></h2><p>In February 2023, Meta released Llama 1 to demonstrate that smaller, compute-optimal models could match larger counterparts like GPT-3 in efficiency. Although access was initially restricted to researchers, the model weights were leaked via 4chan on March 3, 2023, an event that inadvertently democratized high-tier research and catalyzed a global movement for running models on consumer-grade hardware. </p><p>This shift was solidified in July 2023 with the release of Llama 2, which introduced a commercial license that permitted self-hosting for most organizations. This approach saw rapid adoption, with the Llama family exceeding 100 million downloads and supporting over 1,000 commercial applications by the third quarter of 2023.</p><p>Through 2024 and 2025, Meta scaled the Llama family to establish it as the essential infrastructure for global enterprise AI, frequently referred to as the LAMP stack for AI. Following the launch of Llama 3 in April 2024 and the landmark Llama 3.1 405B in July, Meta achieved performance parity with the world&#x27;s leading proprietary systems. </p><p>The subsequent release of Llama 4 in April 2025 introduced a Mixture-of-Experts architecture, allowing for massive parameter scaling while maintaining fast inference speeds. 
By early 2026, the Llama ecosystem reached a staggering scale, totaling 1.2 billion downloads and averaging approximately one million downloads per day.</p><p>This widespread adoption provided businesses with significant economic sovereignty, as self-hosting Llama models offered an 88% cost reduction compared to using proprietary API providers.</p><p>As of April 2026, Meta’s reign as the undisputed leader of the open-weight movement has given way to a highly contested, multi-polar landscape characterized by the rise of international competitors. </p><p>While the United States accounts for 35% of global Llama deployments, Chinese models from labs like Alibaba and DeepSeek began accounting for 41% of downloads on platforms like Hugging Face by late 2025. Throughout early 2026, new entrants such as Zhipu AI’s GLM-5 and Alibaba’s Qwen 3.6 Plus have outpaced Llama 4 Maverick on general knowledge and coding benchmarks. </p><p>In response to this global pressure, Meta&#x27;s Muse Spark arrives with hefty expectations and an open source legacy that will be tough to live up to.</p><h2><b>Proprietary only (for now)</b></h2><p>The launch marks a controversial departure from Meta AI&#x27;s &quot;open science&quot; roots. While the Llama series was famously accessible to developers, Muse Spark is launching as a proprietary model. </p><p>Wang addressed the shift on X, stating: &quot;Nine months ago we rebuilt our ai stack from scratch. New infrastructure, new architecture, new data pipelines... This is step one. Bigger models are already in development with plans to open-source future versions.&quot;</p><p>However, the developer community remains skeptical. Some see this as a necessary pivot after the Llama 4 series failed to gain expected developer traction; others view it as Meta &quot;closing the gates&quot; now that it has a competitive reasoning model. 
</p><p>Wang himself acknowledged the transition’s difficulty, noting there are &quot;certainly rough edges we will polish over time&quot;.</p><p>For the 3 billion people using Meta’s apps, the change will be felt almost instantly. The AI they interact with is no longer just a library of information, but an agent with a $27 billion brain and a mandate to understand their world as intimately as they do.</p>]]></description>
            <author>carl.franzen@venturebeat.com (Carl Franzen)</author>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/6tTe1SOF6tLAWd3xvDNOml/749dbd15282f8ad2696467128825d7a6/Dejected_llama_leaves_Muse_Spark_headquarters.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[LLM-referred traffic converts at 30-40% — and most enterprises aren't optimizing for it]]></title>
            <link>https://venturebeat.com/technology/llm-referred-traffic-converts-at-30-40-and-most-enterprises-arent-optimizing</link>
            <guid isPermaLink="false">2qf15uVTskS15THJZmEdf1</guid>
            <pubDate>Wed, 08 Apr 2026 19:04:30 GMT</pubDate>
            <description><![CDATA[<p>For more than two decades, digital discovery has operated on a simple model: search, scan, click, decide. </p><p>That worked when humans were the ones doing the web searching, but with the advent of AI agents, the primary consumer of information is no longer always human.</p><p>This is giving rise to a new paradigm: Answer engine optimization (AEO), also referred to as generative engine optimization (GEO). Because agents look at data much differently than humans do, success is no longer defined by rankings and clicks, but by whether content is understood, selected, and cited by AI systems.</p><p>The SEO model that the web was built on simply isn’t going to cut it anymore, and enterprises need to prepare now.</p><h2>How LLMs interpret web content</h2><p>Traditional SEO is built around keywords, rankings, page-level optimization, and click-through rates. Users manually search across multiple sources and click around to get what they need. Simple, but sometimes frustrating and a definite time suck.</p><p>But AEO operates on a whole different level. Agents are increasingly taking over users’ workflows: Claude Code, OpenClaw, CrewAI, Microsoft Copilot, AutoGen, LangChain, Agent Bricks, Agentforce, Google Vertex, Perplexity’s web interface, and whatever else comes along.</p><p>These agents do not “browse” the web the way humans do. They analyze user intent based not just on phrasing, but on persistent memory and context from past sessions (rather than simple autocomplete). They require materials that are concise, structured, and to the point. </p><p>What’s more, agents are moving beyond browsing to delegation, handling more downstream work. What started as “search, read, decide” is evolving into “agent retrieves, agent summarizes, human decides” (and, beyond that, “agent acts → human validates”). 
</p><p>“In practice, AEO begins where SEO stops,” said Dustin Engel, founder of consultancy <a href="https://www.elegantdisruption.com/">Elegant Disruption</a>. “AEO is the next layer of discovery,” or “zero-click discovery.”</p><p>In this new world where agents synthesize answers, users may never even see an enterprise’s website, click-through rates decline, and attribution and citability (rather than pure visibility, or showing up at the top of a list of blue links) become critical. </p><p>“The new default is closer to a citation map: Where the model is pulling from, how often you show up, and how you are described,” Engel said. </p><p>Some, like Adam Yang of Q&amp;A platform <a href="https://www.quora.com/">Quora</a>, argue that AEO is already becoming the default over SEO. </p><p>This is true for “a certain class of queries,” Yang said. Any question where the user wants a synthesized answer — &quot;what&#x27;s the best approach to X,&quot; &quot;compare these two options,&quot; &quot;what do I need to know about Y&quot; — is increasingly resolved by an AI without a click. </p><p>Google&#x27;s own AI Overviews are already accelerating this on the consumer side, many analysts note. “SEO isn&#x27;t dead,” Yang said. “But the optimization target has shifted from ‘rank on page 1’ to ‘get cited in the answer.’”</p><h2>How devs are already using AI agents </h2><p>Are there scenarios where regular search/Googling is still the best option? </p><p>“Absolutely,” said analyst Wyatt Mayham of <a href="https://nwai.co/">Northwest AI Consulting</a>. Notably, for personal tasks like finding nearby restaurants or local service providers. The interface is “just better” in those cases because it integrates maps, reviews, and photos. “That experience is hard to beat right now,” he said.</p><p>For work-related research, though, he says he’s “barely” using traditional search anymore, and it’s getting “closer to zero” every month. 
</p><p>“When I need to understand a company or a person professionally, agents do it faster and give me a more useful output than a page of blue links ever did,” he said. </p><p>His firm uses autonomous agents “heavily,” and built a Claude Skills function that powers its sales operation. Before a discovery call with a prospect, team members can trigger a skill that pulls the contact’s LinkedIn profile, scrapes their company website, grabs relevant info from sources like ZoomInfo, and crafts a clear picture of their revenue, team size, tech stack, and pain points. </p><p>“By the time I get on a call, I have a tailored research brief ready to go without spending 30 to 45 minutes manually Googling around,” Mayham said. </p><p>The big advantage is that these tools run in the background, he noted. You don’t have to sit clicking through browser tabs: You just tell the agent what you need, it does it, and you get a structured output that’s actually useful. </p><p>“It&#x27;s collapsed what used to be a full hour of sales prep into a few minutes,” Mayham said. </p><p>Carlos Dutra, CEO and founder of IT consultancy <a href="https://vindler.solutions/">Vindler Solutions</a>, said Claude Code has “genuinely changed” his daily workflow. He uses it for most of his coding work, and what surprised him wasn&#x27;t the speed, but the fact that he didn’t need to open and keep track of browser tabs. 

</p><p>“Not because I&#x27;m lazy, but because the answers are better,” he said. He still uses Google for some tasks: Pricing pages, recent news, anything that needs to be current.</p>
<p>“But for technical reasoning? Agents have mostly replaced search for me personally,” he said.</p>
<p>Quora’s Yang has had a similar experience. He’s been using Claude Code daily for the past few months, primarily for content strategy, knowledge management, and competitive research. Workflows that used to take him half a day now take 30 minutes.</p>
<p>But what’s been most advantageous is that he can now run research and synthesis tasks in parallel that he previously had to do sequentially. Also helpful is that agents’ context retention across sessions is “meaningfully better” than that of web-based tools.</p>
<p>When he needs to understand a concept, map a competitive landscape, or synthesize industry trends, Claude or Perplexity are the go-to before opening a browser tab. “I&#x27;ve started treating agent search as my first stop, not Google. Traditional search is now where I verify, not where I discover.”</p>
<p>The kinks are real, though. Mayham pointed out that LinkedIn, in particular, is “aggressive” about blocking automated access, and many other sites have (or are implementing) similar protections. Users will hit walls when agents can&#x27;t get through, so a fallback plan is important for those relying on agents.</p>
<p>“The reliability isn&#x27;t 100% yet, and that&#x27;s probably the biggest thing holding broader adoption back,” he said.</p>
<p>Mayham’s advice for other devs: Stop chasing shiny objects. A new AI tool launches “practically every day,” and many (experienced devs included) are jumping from platform to platform without ever going deep with any of them.</p><p>
“Pick a model, go deep, build real workflows on it,” he emphasized. “You&#x27;ll get more value from mastery of one platform than surface-level experimentation across five.”</p><h2>How enterprises can compete in an AEO-driven world </h2><p>When AI agents do the searching, the rules change. The question is no longer whether your content ranks on the first page, it&#x27;s whether the model selects you as the source when generating an answer.</p><p>Structure matters much more than it used to. Content should:</p><ul><li><p>Be organized around conversational intent, provide direct answers, and mirror real user questions and follow-ups;</p></li><li><p>Be authoritative and reflect strong expertise;</p></li><li><p>Be fresh (and, when necessary, regularly refreshed);</p></li><li><p>Have clear headers and established FAQ schema. </p></li></ul><p>Another must is maintaining a strong brand presence across the forums and platforms — Wikipedia, Reddit, LinkedIn, industry publications — that models are trained on. Enterprises might also consider investing in original data, like research.</p><p>In Mayham’s experience, when a business gets recommended by an LLM during a search-style query, the conversion rate is “dramatically higher” than traditional channels. For his company, LLM-referred traffic is converting at 30 to 40%, which “blows away what we see from SEO or paid social.”</p><p>“The intent signal is just different when someone is having a conversation with an AI and it recommends you by name.” </p><p>Discoverability inside LLMs will matter as much as Google rankings, “maybe more,” Mayham said. “It&#x27;s a whole new surface for customer acquisition that most businesses aren&#x27;t even thinking about yet.”</p><p>Vindler&#x27;s Dutra agreed that the “uncomfortable truth” is that most enterprise content is becoming “basically invisible” in agent-driven queries. “AEO is about whether your content survives being chunked, embedded, and semantically retrieved,” he said. 
</p><p>The companies getting ahead aren’t doing anything “exotic,” he noted. They have clean, declarative content that doesn’t require context to understand. Those still writing copy stuffed with keywords are going to fall behind because LLMs care about semantic clarity.</p><p>A quick test he gives clients: Ask an LLM a question your page is supposed to answer, without giving it the URL. “If it can&#x27;t construct the answer from your content, you have a problem.” </p><p>Jeff Oxford of SEO agency <a href="https://visibilitylabs.ai/">Visibility Labs</a> offers valuable step-by-step advice: </p><ol><li><p>Engage in conversations on Reddit, which is one of the most-cited domains in AI search. Enterprises should establish a positive reputation on Reddit, and engage on any relevant threads where customers are asking for recommendations.</p></li><li><p>Build a strong YouTube presence. According to Ahrefs, which tracks internet behavior, YouTube mentions have the “strongest correlation” with AI visibility across ChatGPT, AI Mode, and AI Overviews. “This makes sense, since both Google and OpenAI have trained their models on YouTube transcripts,” Oxford said, “and YouTube is the most-cited domain in Google&#x27;s AI products.” </p></li><li><p>Invest in digital PR and brand mentions; the latter is the second-highest correlated factor with AI visibility. “Brands need to improve their digital presence by being in as many places as possible,” Oxford said. </p></li><li><p>Create content aligned with AI citation patterns. Enterprises should audit the prompts and topics where AI search engines are surfacing competitors, then create authoritative content on those same topics.</p></li></ol><p>“The goal is to become a source that AI models consider worth citing,” he noted. 
</p><p>Still, there may be a lot of unnecessary hype around how drastically enterprises need to change, said Shashi Bellamkonda, principal research director at consultancy firm <a href="https://www.infotech.com/">Info-Tech Research Group</a>. Those following the best practice of producing content their audience actually needs, written by experts and showcasing expert opinion, are in a good position to be cited in AI-powered search.</p><p>He pointed out that Google developed its E-E-A-T framework (experience, expertise, authoritativeness, and trustworthiness) to evaluate content quality and helpfulness and to help its algorithms identify reliable, high-quality information.</p><p>To stand out, enterprises should use structured data and schema to signal context: Is this an article, a research study, a product overview? “Original long-form content will be valued by AI-powered answer engines,” Bellamkonda said. “Copycat strategies or trying to game the system are taboo in this era.”</p><p>
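Signaling context with structured data typically means embedding schema.org JSON-LD in the page. The sketch below, with hypothetical placeholder values, builds a minimal Article object in Python for illustration:

```python
import json

# Minimal schema.org JSON-LD for an article page. All values are
# hypothetical placeholders; real pages embed the resulting JSON in a
# <script type="application/ld+json"> tag.
article_schema = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "How AI Agents Change Enterprise Search",
    "author": {"@type": "Person", "name": "Jane Expert"},
    "datePublished": "2026-04-08",
    "about": "answer engine optimization",
}

print(json.dumps(article_schema, indent=2))
```

An analogous object with `"@type": "FAQPage"` and a `mainEntity` list of question/answer pairs covers the FAQ schema mentioned above.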

Experts should also share their thoughts across several channels, and &quot;About Us&quot; pages must be “robust” and include bios highlighting thought leaders’ expertise.</p><p>“Ultimately, the reputation of AI-powered search is in making sure the user likes the search rather than what you think they should read,” Bellamkonda said. “So a good focus on the end user is a great way to succeed.”</p><p></p>]]></description>
            <author>taryn.plumb@venturebeat.com (Taryn Plumb)</author>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/6mKfa09Ht7wzFiRIWUIcqc/0090650d90a363f1c858f4a671abbf76/AI_search.png?w=300&amp;q=30" length="0" type="image/png"/>
        </item>
        <item>
            <title><![CDATA[New framework lets AI agents rewrite their own skills without retraining the underlying model]]></title>
            <link>https://venturebeat.com/orchestration/new-framework-lets-ai-agents-rewrite-their-own-skills-without-retraining-the</link>
            <guid isPermaLink="false">g0Dbx1bgfJOh3tXVMCQxl</guid>
            <pubDate>Wed, 08 Apr 2026 17:18:21 GMT</pubDate>
            <description><![CDATA[<p>One major challenge in deploying autonomous agents is building systems that can adapt to changes in their environments without the need to retrain the underlying large language models (LLMs).</p><p><a href="https://arxiv.org/abs/2603.18743"><u>Memento-Skills</u></a>, a new framework developed by researchers at multiple universities, addresses this bottleneck by giving agents the ability to develop their skills by themselves. &quot;It adds its <a href="https://venturebeat.com/technology/four-ai-research-trends-enterprise-teams-should-watch-in-2026"><u>continual learning</u></a> capability to the existing offering in the current market, such as OpenClaw and Claude Code,&quot; Jun Wang, co-author of the paper, told VentureBeat.</p><p>Memento-Skills acts as an evolving external memory, allowing the system to progressively improve its capabilities without modifying the underlying model. The framework provides a set of skills that can be updated and expanded as the agent receives feedback from its environment.</p><p>For enterprise teams running agents in production, that matters. The alternative — fine-tuning model weights or manually building skills — carries significant operational overhead and data requirements. Memento-Skills sidesteps both.</p><h2>The challenges of building self-evolving agents</h2><p>Self-evolving agents are crucial because they overcome the limitations of frozen language models. Once a model is deployed, its parameters remain fixed, restricting it to the knowledge encoded during training and whatever fits in its immediate context window.</p><p>Giving the model an external memory scaffolding enables it to improve without the costly and slow process of retraining. However, current approaches to agent adaptation largely rely on manually-designed skills to handle new tasks. While some automatic skill-learning methods exist, they mostly produce text-only guides that amount to prompt optimization. 
Other approaches simply log single-task trajectories that don’t transfer across different tasks.</p><p>Furthermore, when these agents try to retrieve relevant knowledge for a new task, they typically rely on semantic similarity routers, such as standard dense embeddings; high semantic overlap does not guarantee behavioral utility. An agent relying on standard RAG might retrieve a &quot;password reset&quot; script to solve a &quot;refund processing&quot; query simply because the documents share enterprise terminology.</p><p>&quot;Most retrieval-augmented generation (RAG) systems rely on similarity-based retrieval. However, when skills are represented as executable artifacts such as markdown documents or code snippets, similarity alone may not select the most effective skill,&quot; Wang said. </p><h2>How Memento-Skills stores and updates skills</h2><p>To solve the limitations of current agentic systems, the researchers built Memento-Skills. The paper describes the system as “a generalist, continually-learnable LLM agent system that functions as an agent-designing agent.” Instead of keeping a passive log of past conversations, Memento-Skills creates a set of skills that act as a persistent, evolving external memory.</p><p>These skills are stored as structured markdown files and serve as the agent&#x27;s evolving knowledge base. Each reusable skill artifact is composed of three core elements. It contains declarative specifications that outline what the skill is and how it should be used. It includes specialized instructions and prompts that guide the language model&#x27;s reasoning. And it houses the executable code and helper scripts that the agent runs to actually solve the task.</p><p>Memento-Skills achieves continual learning through its &quot;Read-Write Reflective Learning&quot; mechanism, which frames memory updates as active policy iteration rather than passive data logging. 
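The paper stores skills as structured markdown but does not mandate a particular in-memory shape; one illustrative way to model the skill artifacts described above, with their declarative spec, instructions, and executable code (names here are hypothetical, not from the Memento-Skills codebase):

```python
from dataclasses import dataclass

@dataclass
class SkillArtifact:
    """One entry in the agent's evolving skill library (illustrative).

    Mirrors the three elements described in the paper: a declarative
    spec, model-facing instructions, and executable helper code.
    """
    name: str
    spec: str          # what the skill is and when to use it
    instructions: str  # prompts guiding the model's reasoning
    code: str          # executable helper script the agent runs
    version: int = 1

    def to_markdown(self) -> str:
        # Skills are persisted as structured markdown files.
        return (
            f"# {self.name} (v{self.version})\n\n"
            f"## Spec\n{self.spec}\n\n"
            f"## Instructions\n{self.instructions}\n\n"
            f"## Code\n```python\n{self.code}\n```\n"
        )

    def mutate(self, new_code: str) -> "SkillArtifact":
        # Reflective learning rewrites artifacts rather than appending logs.
        return SkillArtifact(self.name, self.spec, self.instructions,
                             new_code, self.version + 1)

search = SkillArtifact(
    name="web_search",
    spec="Query the web and return the top results.",
    instructions="Prefer primary sources; cite URLs.",
    code="def run(query): ...",
)
print(search.to_markdown())
```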
When faced with a new task, the agent queries a specialized skill router to retrieve the most behaviorally relevant skill — not just the most semantically similar one — and executes it.</p><p>After the agent executes the skill and receives feedback, the system reflects on the outcome to close the learning loop. Rather than just appending a log of what happened, the system actively mutates its memory. If the execution fails, an orchestrator evaluates the trace and rewrites the skill artifacts. This means it directly updates the code or prompts to patch the specific failure mode. In case of need, it creates an entirely new skill.</p><p>Memento-Skills also updates the skill router through a one-step offline reinforcement learning process that learns from execution feedback rather than just text overlap. &quot;The true value of a skill lies in how it contributes to the overall agentic workflow and downstream execution,”  Wang said. “Therefore, reinforcement learning provides a more suitable framework, as it enables the agent to evaluate and select skills based on long-term utility.&quot;</p><p>To prevent regression in a production environment, the automated skill mutations are guarded by an automatic unit-test gate. The system generates a synthetic test case, executes it through the updated skill, and checks the results before saving the changes to the global library.</p><p>By continuously rewriting and refining its own executable tools, Memento-Skills enables a frozen language model to build robust muscle memory and progressively expand its capabilities end-to-end.</p><h2>Putting the self-evolving agent to the test</h2><p>The researchers evaluated Memento-Skills on two rigorous benchmarks. The first is <a href="https://arxiv.org/abs/2311.12983"><u>General AI Assistants</u></a> (GAIA), which requires complex multi-step reasoning, multi-modality handling, web browsing, and tool use. 
The second is <a href="https://agi.safe.ai/"><u>Humanity&#x27;s Last Exam</u></a>, or HLE, an expert-level benchmark spanning eight diverse academic subjects like mathematics and biology. The entire system was powered by <a href="https://venturebeat.com/technology/google-gemini-3-1-pro-first-impressions-a-deep-think-mini-with-adjustable"><u>Gemini-3.1-Flash</u></a> acting as the underlying frozen language model.</p><p>The system was compared against a Read-Write baseline that retrieves skills and collects feedback but doesn’t have self-evolving features. The researchers also tested their custom skill router against standard semantic retrieval baselines, including BM25 and <a href="https://huggingface.co/Qwen/Qwen3-Embedding-8B"><u>Qwen3 embeddings</u></a>.</p><p>The results proved that actively self-evolving memory vastly outperforms a static skill library. On the highly diverse GAIA benchmark, Memento-Skills improved test set accuracy by 13.7 percentage points over the static baseline, achieving 66.0% compared to 52.3%. On the HLE benchmark, where the domain structure allowed for massive cross-task skill reuse, the system more than doubled the baseline&#x27;s performance, jumping from 17.9% to 38.7%.</p><p>Moreover, the specialized skill router of Memento-Skills avoids the classic retrieval trap where an irrelevant skill is selected simply because of semantic similarity. Experiments show that Memento-Skills boosts end-to-end task success rates to 80%, compared to just 50% for standard BM25 retrieval.</p><p>The researchers observed that Memento-Skills manages this performance through highly organic, structured skill growth. Both benchmark experiments started with just five atomic seed skills, such as basic web search and terminal operations. On the GAIA benchmark, the agent autonomously expanded this seed group into a compact library of 41 skills to handle the diverse tasks. 
On the expert-level HLE benchmark, the system dynamically scaled its library to 235 distinct skills. </p><h2>Finding the enterprise sweet spot</h2><p>The researchers have released the code for <a href="https://github.com/Memento-Teams/Memento-Skills"><u>Memento-Skills on GitHub</u></a>, and it is readily available for use.</p><p>For enterprise architects, the effectiveness of this system depends on domain alignment. Instead of simply looking at benchmark scores, the core business tradeoff lies in whether your agents are handling isolated tasks or structured workflows.</p><p>&quot;Skill transfer depends on the degree of similarity between tasks,&quot; Wang said. &quot;First, when tasks are isolated or weakly related, the agent cannot rely on prior experience and must learn through interaction.&quot; In such scattershot environments, cross-task transfer is limited. &quot;Second, when tasks share substantial structure, previously acquired skills can be directly reused. Here, learning becomes more efficient because knowledge transfers across tasks, allowing the agent to perform well on new problems with little or no additional interaction.&quot;</p><p>Given that the system requires recurring task patterns to consolidate knowledge, enterprise leaders need to know exactly where to deploy this today and where to hold off.</p><p>&quot;Workflows are likely the most appropriate setting for this approach, as they provide a structured environment in which skills can be composed, evaluated, and improved,&quot; Wang said.</p><p>However, he cautioned against over-deployment in areas not yet suited for the framework. &quot;Physical agents remain largely unexplored in this context and require further investigation. 
In addition, tasks with longer horizons may demand more advanced approaches, such as multi-agent LLM systems, to enable coordination, planning, and sustained execution over extended sequences of decisions.&quot;</p><p>As the industry moves toward agents that autonomously rewrite their own production code, governance and security remain paramount. While Memento-Skills employs foundational safety rails like automatic unit-test gates, a broader framework will likely be needed for enterprise adoption.</p><p>&quot;To enable reliable self-improvement, we need a well-designed evaluation or judge system that can assess performance and provide consistent guidance,&quot; Wang said. &quot;Rather than allowing unconstrained self-modification, the process should be structured as a guided form of self-development, where feedback steers the agent toward better designs.&quot;</p><p>
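The test-gated mutation Wang describes, in which a rewritten skill is committed only after a synthetic test case passes, reduces to a loop like the following sketch (all names are illustrative, not from the released code):

```python
# Illustrative sketch of an automatic unit-test gate for skill updates.
# None of these names come from the Memento-Skills codebase.

def run_synthetic_test(skill_code: str, case: dict) -> bool:
    """Execute a candidate skill against one generated test case."""
    namespace: dict = {}
    exec(skill_code, namespace)          # load the candidate skill
    return namespace["run"](case["input"]) == case["expected"]

def gated_update(library: dict, name: str, new_code: str, case: dict) -> bool:
    """Commit a rewritten skill to the library only if its test passes."""
    if run_synthetic_test(new_code, case):
        library[name] = new_code         # regression guard passed
        return True
    return False                         # keep the previous version

library = {"double": "def run(x): return x + x"}
ok = gated_update(
    library, "double",
    "def run(x): return 2 * x",
    {"input": 3, "expected": 6},
)
print(ok)  # prints True: the synthetic test passed, so the update was saved
```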
</p>]]></description>
            <author>bendee983@gmail.com (Ben Dickson)</author>
            <category>Orchestration</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/6sKbX6XySE2o8jyRPxQ8eE/35282d8feb745f0bd6a418c31bcdeabe/Self-evolving_agents.jpg?w=300&amp;q=30" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Block introduces Managerbot, a proactive Square AI agent and the clearest proof point yet for Jack Dorsey’s AI bet]]></title>
            <link>https://venturebeat.com/data/block-introduces-managerbot-a-proactive-square-ai-agent-and-the-clearest</link>
            <guid isPermaLink="false">5wMmvOll8UjKOzGuVAl1dF</guid>
            <pubDate>Tue, 07 Apr 2026 22:14:30 GMT</pubDate>
            <description><![CDATA[<p><a href="https://block.xyz/">Block</a> today unveiled <a href="https://squareup.com/us/en/software">Managerbot</a>, a new AI agent embedded in the <a href="https://squareup.com/us/en">Square platform</a> that proactively monitors a seller&#x27;s business, identifies emerging problems, and proposes actionable solutions — without the seller ever having to ask a question. The product marks the most tangible manifestation of CEO Jack Dorsey&#x27;s controversial bet that artificial intelligence can fundamentally reshape how his company operates, builds products, and serves the millions of small businesses that depend on Square to run day-to-day commerce.</p><p>In an exclusive interview with VentureBeat, <a href="https://squareup.com/us/en/the-bottom-line/about/willem-ave">Willem Avé</a>, Block&#x27;s head of product at Square, described Managerbot as a decisive break from the company&#x27;s earlier Square AI assistant, which functioned as a reactive chatbot that answered seller questions about sales, employees, and business performance.</p><p>&quot;The big shift from Square AI to Managerbot is really from reactive to proactive,&quot; Avé said. &quot;What that means is the primary interface is not a question box. You assign tasks to Managerbot, and that could be based on data, an insight, or a signal from your business.&quot;</p><p>The product is beginning to roll out now, with full availability to Square sellers expected over the coming months. 
Block declined to say whether Managerbot would carry an additional fee or be bundled into existing Square subscriptions.</p><h2><b>How Managerbot predicts inventory shortages, optimizes schedules, and writes marketing campaigns on its own</b></h2><p>Avé outlined three core domains where Managerbot operates today: <a href="https://squareup.com/us/en/point-of-sale/features/inventory-management">inventory forecasting</a>, <a href="https://squareup.com/us/en/staff/shifts">employee shift scheduling</a>, and <a href="https://squareup.com/us/en/marketing">automated marketing campaign creation</a>. In every case, the agent acts before the seller does — watching over the business, detecting patterns, and surfacing recommendations with proposed actions attached.</p><p>In the inventory domain, Managerbot continuously monitors a seller&#x27;s stock levels, sales velocity, and external signals such as weather patterns and local events, then alerts the seller when an item is about to run out — or when it should stock up ahead of anticipated demand. &quot;In warmer weather, we can see that you sell more of a certain good,&quot; Avé explained. &quot;That&#x27;s the forecasting capability, combined with local data — weather, events — so we can help sellers manage both their inventory and cash flows.&quot;</p><p>For shift scheduling — a task that Avé described as &quot;one of those interesting, very hard computer science problems&quot; that consumes hours of a small business owner&#x27;s week — Managerbot analyzes forecasted sales data and then generates optimized employee schedules that balance worker preferences with coverage needs. &quot;It turns out that frontier models are actually pretty good at it,&quot; Avé said.</p><p>The third capability tackles what Avé called &quot;the whole bucket of things that sellers could do if they had more time&quot; — principally marketing. 
Managerbot identifies sales trends across a seller&#x27;s catalog and automatically drafts win-back campaigns and promotional outreach targeted at a store&#x27;s best customer segments. Avé said Block is seeing &quot;very meaningful lift&quot; from Managerbot-generated campaigns compared to what some sellers create manually, though he declined to share specific performance figures publicly.</p><h2><b>Block built Managerbot on frontier AI models from OpenAI and Anthropic — but says the real innovation is underneath</b></h2><p>Managerbot runs on third-party frontier models — Avé specifically referenced Anthropic&#x27;s Sonnet and OpenAI&#x27;s GPT family — but Block&#x27;s competitive advantage, he argued, lies in the &quot;agent harness&quot; the company has built around those models. That harness draws heavily on <a href="https://venturebeat.com/infrastructure/claude-code-costs-up-to-usd200-a-month-goose-does-the-same-thing-for-free">Goose</a>, Block&#x27;s open-source agent framework, and incorporates learnings from its consumer-facing <a href="https://techcrunch.com/2025/11/13/cash-app-debuts-a-new-ai-assistant-that-answers-questions-about-your-finances/">Money Bot</a> on <a href="https://cash.app/">Cash App</a>.</p><p>The challenge specific to Square is scale and complexity. A seller running a small business might interact with hundreds of different tools across invoicing, inventory, customer management, marketing, payroll, and scheduling. Managerbot must navigate all of them coherently within a single agentic loop. &quot;This isn&#x27;t like, you know, you load a skill and call it a day — think about hundreds of skills,&quot; Avé said. 
&quot;Actually, managing the context and managing the way that we progressively disclose tools, and some of the other innovation that we have at the harness layer, is I think some of the secret sauce.&quot;</p><p>A critical design decision shapes every interaction: <a href="https://squareup.com/us/en/software">Managerbot</a> does not autonomously execute changes to a seller&#x27;s business. Every write action — whether adjusting a shift schedule, publishing a marketing campaign, or modifying inventory — requires explicit seller approval. To facilitate that approval, Managerbot generates visual UI previews showing exactly what will change before the seller clicks &quot;yes.&quot; &quot;We want to earn trust with sellers, so any write action is prompted to the user to approve,&quot; Avé said. &quot;The seller needs a visual representation of what the change is. You can&#x27;t just describe in words all the time what you&#x27;re going to go do.&quot;</p><h2><b>An $80 million fine and chatbot blunders hang over Block&#x27;s push to automate financial recommendations</b></h2><p>That human-in-the-loop caution reflects a sensitivity that gains additional weight given Block&#x27;s recent history. In January 2025, <a href="https://www.csbs.org/newsroom/state-regulators-issue-80-million-penalty-block-inc-cash-app-bsaaml-violations">48 state financial regulators imposed an $80 million fine</a> on Block for violations of Bank Secrecy Act and anti-money laundering laws related to Cash App. 
The Connecticut Department of Banking stated in announcing <a href="https://portal.ct.gov/dob/newsroom/2025/regulatory-action-issued-against-block-inc">the settlement</a> that regulators &quot;found Block was not in compliance with certain requirements, creating the potential that its services could be used to support money laundering, terrorism financing, or other illegal activities.&quot; The <a href="https://idfpr.illinois.gov/news/2025/idfpr-joins-enforcement-action-against-block-inc-cash-app.html">Illinois Department of Financial and Professional Regulation</a> simultaneously joined the coordinated enforcement action.</p><p>Separately, reporting from The Guardian has documented instances of Block&#x27;s customer-facing chatbots <a href="https://www.theguardian.com/technology/2026/mar/08/block-ai-layoffs-jack-dorsey">making serious errors</a>, including telling customers to cancel or close their accounts. When VentureBeat raised this concern during the interview, Avé acknowledged the stakes but redirected to Managerbot&#x27;s specific safeguards.</p><p>&quot;Financial accuracy and financial data — the value of these products really come from recommendations,&quot; Avé said. &quot;We need to be better than whatever you can feed to ChatGPT. If you take a CSV of your sales and put it in ChatGPT or Claude, we need our product to be better and answer that question either more accurately or better than what&#x27;s available in the market.&quot; He pointed to the harness layer&#x27;s role in reducing hallucinations through tuning, prompt engineering, and optimized tool-call loops, while acknowledging the inherent limitations of probabilistic systems: &quot;It&#x27;s never going to be zero. 
Obviously, these are probabilistic systems, and we have guidance and call-outs in the tool to provide that.&quot; On regulated domains like lending and payments, Avé was more definitive: &quot;In any sort of regulated domains — banking, lending, payments — there are strict guardrails on what we can and can&#x27;t say to sellers. Those are just part of the product and business.&quot;</p><h2><b>Dorsey cut 4,000 jobs in the name of AI — Managerbot is the first answer to what those tools are actually building</b></h2><p>It is impossible to evaluate <a href="https://squareup.com/us/en/software">Managerbot</a> outside the context of the radical organizational surgery Block performed just weeks ago. In late February, Dorsey announced that Block would cut more than 4,000 of its roughly 10,000 employees — nearly half the workforce — explicitly citing AI as the driving rationale. As the <a href="https://www.bbc.com/news/articles/cq570d12y9do">BBC reported</a>, Dorsey wrote that &quot;AI fundamentally changes what it means to build and run a company.&quot; Block&#x27;s stock surged more than 20 percent on the news, according to ABC7.</p><p>The company&#x27;s <a href="https://investors.block.xyz/investor-news/news-details/2026/Block-Announces-Fourth-Quarter-2025-Results/default.aspx">Q4 2025 earnings report</a>, released alongside the layoff announcement, showed gross profit of $2.87 billion — up 24 percent year over year — and raised 2026 guidance to $12.2 billion in gross profit, according to <a href="https://www.alpha-sense.com/earnings/sq/">AlphaSense&#x27;s earnings analysis</a>. Block also reported a greater than 40 percent increase in production code shipped per engineer since September 2025 through the use of agentic coding tools. 
As CNBC commentator <a href="https://www.cnbc.com/2026/02/27/block-layoffs-ai-jack-dorsey-jobs.html">Steve Sedgwick wrote</a> in an opinion piece following the announcement, &quot;I keep getting told on CNBC that AI will create new jobs to replace those being lost. I&#x27;ve been asking the same question for years now.&quot; The Observer&#x27;s <a href="https://observer.com/2026/02/what-jack-dorseys-block-layoffs-mean-for-the-job-market-at-large/">Mark Minevich was more pointed</a>, calling Block&#x27;s layoffs &quot;probably the first legitimate mass layoff driven by A.I. as the actual operating thesis.&quot;</p><p><a href="https://squareup.com/us/en/software">Managerbot</a>, then, is the product answer to the obvious follow-up question: if Block shed 4,000 workers in the name of intelligence tools, what exactly are those intelligence tools building? Avé framed the product as proof of concept for Block&#x27;s entire strategic thesis. &quot;Block has been in the press recently about rebuilding as an intelligence company, and it&#x27;s like, a lot of people are asking, &#x27;What does that mean for us?&#x27;&quot; Avé said. &quot;What I like to do is show, not tell. We&#x27;re building Managerbot, which I think is one of the more advanced, maybe the most advanced, small business agent out there today.&quot;</p><h2><b>Sellers who use Managerbot are consolidating their businesses onto Square — and that may be the real strategic payoff</b></h2><p>Perhaps the most consequential signal Avé shared was an early behavioral pattern: sellers who begin using <a href="https://squareup.com/us/en/software">Managerbot</a> are voluntarily migrating more of their business operations onto the Square platform, consolidating payroll, time cards, and shift scheduling into Block&#x27;s ecosystem to feed the agent more data. &quot;When they start interacting with Managerbot, they want to move more of their business onto Square because they see the value,&quot; Avé said. 
&quot;They&#x27;re like, &#x27;I should put my payroll here. I should get time cards here. I should get my shift schedules here,&#x27; because once all that data is in one place, they can make better decisions and manage their business better.&quot;</p><p>This dynamic could prove to be Managerbot&#x27;s most significant long-term effect — not as a standalone feature, but as a gravitational force pulling sellers deeper into Block&#x27;s integrated commerce stack. <a href="https://d18rn0p25nwr6d.cloudfront.net/CIK-0001512673/7ee8874c-2cf3-4835-a112-27ac828d937a.pdf">Block&#x27;s Q4 earnings</a> already showed Square&#x27;s new <a href="https://www.alpha-sense.com/earnings/sq/">volume added grew 29 percent</a> year over year, with sales-led NVA surging 62 percent. Avé also argued that Square&#x27;s first-party architecture — built organically rather than through acquisitions — gives it a structural advantage over competitors in the AI era. &quot;We&#x27;ve kind of harmonized and canonicalized this data at a sensible layer,&quot; he said. &quot;It&#x27;s not super hard to create more skills for these data domains.&quot;</p><p>When VentureBeat pressed Avé on the tension between helping sellers and upselling them on Block&#x27;s own financial products — lending, payments processing, and other services that generate revenue for the company — he acknowledged the concern but framed Managerbot&#x27;s mission in terms of decision-making quality. &quot;The goal for Managerbot is to help sellers increase their decision-making correctness,&quot; Avé said. 
&quot;If we can make sellers better at running their business by making better decisions and giving time back, I think that&#x27;s a good thing.&quot;</p><h2><b>Block says Managerbot isn&#x27;t a chatbot — it&#x27;s a business protector that compounds the company&#x27;s entire AI strategy</b></h2><p>Avé was insistent that Managerbot represents something categorically different from the chatbot-as-advisor model that has proliferated across enterprise software. &quot;A lot of people are building chatbots as advisors — it can answer a question for you,&quot; he said. &quot;What we really want Managerbot to be is a protector of your business. This is identifying trends. This is spotting things that you might have missed. This is helping you run your business and take actions.&quot;</p><p>He also argued that the agent model compounds Block&#x27;s development velocity in ways that traditional software cannot match. &quot;It&#x27;s much more straightforward to add a capability to Managerbot than it is to build a big Web 2.0 UI,&quot; Avé said. &quot;If we can deliver more capabilities, more features, more value to our sellers, the whole system compounds.&quot;</p><p>Whether that compounding materializes — and whether sellers ultimately experience Managerbot as a trusted protector or a sophisticated upsell engine — will determine much about Block&#x27;s future. The company has staked its corporate identity, its headcount, and its Wall Street narrative on the conviction that AI agents can deliver more value with fewer humans in the loop. Managerbot is the first product to carry the full weight of that promise. And the small business owners who keep their shops open with Square terminals, who juggle shift schedules on napkins and skip marketing because there aren&#x27;t enough hours in the day — they didn&#x27;t ask to be the test case for Silicon Valley&#x27;s boldest AI thesis. But as of today, they are.</p>]]></description>
            <author>michael.nunez@venturebeat.com (Michael Nuñez)</author>
            <category>Data</category>
            <category>Technology</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/6zNqH1MY0JDez3yEqPyEHn/d0336c111d8154186121d05f78707ff6/nuneybits_Vector_art_of_the_Square_payments_system_point_of_sal_a0eb242e-be13-4204-9d6c-b1be02025525.webp?w=300&amp;q=30" length="0" type="image/webp"/>
        </item>
        <item>
            <title><![CDATA[Amazon S3 Files gives AI agents a native file system workspace, ending the object-file split that breaks multi-agent pipelines]]></title>
            <link>https://venturebeat.com/data/amazon-s3-files-gives-ai-agents-a-native-file-system-workspace-ending-the</link>
            <guid isPermaLink="false">QhupiOCGAZ5TTjkIrDdd0</guid>
            <pubDate>Tue, 07 Apr 2026 21:36:00 GMT</pubDate>
            <description><![CDATA[<p>AI agents run on file systems using standard tools to navigate directories and read file paths. </p><p>The challenge, however, is that there is a lot of enterprise data in object storage systems, notably Amazon S3. Object stores serve data through API calls, not file paths. Bridging that gap has required running a separate file system layer alongside S3, with duplicated data and sync pipelines to keep the two aligned.</p><p>The rise of agentic AI makes that challenge even harder, and it was affecting Amazon&#x27;s own ability to get things done. Engineering teams at AWS using tools like Kiro and Claude Code kept running into the same problem: Agents defaulted to local file tools, but the data was in S3. Downloading it locally worked until the agent&#x27;s context window compacted and the session state was lost.</p><p>Amazon&#x27;s answer is S3 Files, which mounts any S3 bucket directly into an agent&#x27;s local environment with a single command. The data stays in S3, with no migration required. Under the hood, AWS connects its Elastic File System (EFS) technology to S3 to deliver full file system semantics, not a workaround. S3 Files is available now in most AWS Regions.</p><p>&quot;By making data in S3 immediately available, as if it&#x27;s part of the local file system, we found that we had a really big acceleration with the ability of things like Kiro and Claude Code to be able to work with that data,&quot; Andy Warfield, VP and distinguished engineer at AWS, told VentureBeat.</p><h2>The difference between file and object storage and why it matters</h2><p>S3 was built for durability, scale and API-based access at the object level. Those properties made it the default storage layer for enterprise data. But they also created a fundamental incompatibility with the file-based tools that developers and agents depend on.

&quot;S3 is not a file system, and it doesn&#x27;t have file semantics on a whole bunch of fronts,&quot; Warfield said. &quot;You can&#x27;t do a move, an atomic move of an object, and there aren&#x27;t actually directories in S3.&quot;</p><p>Previous attempts to bridge that gap relied on FUSE (Filesystem in Userspace), a software layer that lets developers mount a custom file system in user space without changing the underlying storage. Tools like AWS&#x27;s own Mountpoint for Amazon S3, Google&#x27;s gcsfuse and Microsoft&#x27;s blobfuse2 all used FUSE-based drivers to make their respective object stores look like a file system. </p><p>The problem, Warfield noted, is that those object stores still weren&#x27;t file systems. Those drivers either faked file behavior by stuffing extra metadata into buckets, which broke the object API view, or they refused file operations that the object store couldn&#x27;t support.</p><p>S3 Files takes an entirely different approach. AWS connects its EFS technology directly to S3, presenting a full native file system layer while keeping S3 as the system of record. Both the file system API and the S3 object API remain accessible simultaneously against the same data.</p><h2>How S3 Files accelerates agentic AI
</h2><p>Before S3 Files, an agent working with object data had to be explicitly instructed to download files before using tools. That created a session state problem. As agents compacted their context windows, the record of what had been downloaded locally was often lost.</p><p>&quot;I would find myself having to remind the agent that the data was available locally,&quot; Warfield said.</p><p>Warfield walked through the before-and-after for a common agent task involving log analysis. In the object-only case, he explained, a developer using Kiro or Claude Code to work with log data would need to tell the agent where the log files are located and instruct it to download them. If the logs are instead mounted on the local file system, the developer can simply point the agent at a path, and it immediately has access to work through them.</p><p>For multi-agent pipelines, multiple agents can access the same mounted bucket simultaneously. AWS says thousands of compute resources can connect to a single S3 file system at the same time, with aggregate read throughput reaching multiple terabytes per second — figures VentureBeat was not able to independently verify.</p><p>Shared state across agents works through standard file system conventions: subdirectories, notes files and shared project directories that any agent in the pipeline can read and write. 
Warfield described AWS engineering teams using this pattern internally, with agents logging investigation notes and task summaries into shared project directories.</p><p>For teams building RAG pipelines on top of shared agent content, <a href="https://venturebeat.com/data-infrastructure/aws-claims-90-vector-cost-savings-with-s3-vectors-ga-calls-it-complementary">S3 Vectors</a> — launched at AWS re:Invent in December 2024 — layers on top for similarity search and retrieval-augmented generation against that same data.</p><h2>What analysts say: this is not just a better FUSE</h2><p>AWS is positioning S3 Files against FUSE-based file access from Azure Blob NFS and Google Cloud Storage FUSE. For AI workloads, the meaningful distinction is not primarily performance.</p><p>&quot;S3 Files eliminates the data shuffle between object and file storage, turning S3 into a shared, low-latency working space without copying data,&quot; Jeff Vogel, analyst at Gartner, told VentureBeat. &quot;The file system becomes a view, not another dataset.&quot;</p><p>With FUSE-based approaches, each agent maintains its own local view of the data. When multiple agents work simultaneously, those views can potentially fall out of sync.</p><p>&quot;It eliminates an entire class of failure modes including unexplained training/inference failures caused by stale metadata, which are notoriously difficult to debug,&quot; Vogel said. &quot;FUSE-based solutions externalize complexity and issues to the user.&quot;</p><p>The agent-level implications go further still. The architectural argument matters less than what it unlocks in practice.</p><p>&quot;For agentic AI, which thinks in terms of files, paths, and local scripts, this is the missing link,&quot; Dave McCarthy, analyst at IDC, told VentureBeat. 
&quot;It allows an AI agent to treat an exabyte-scale bucket as its own local hard drive, enabling a level of autonomous operational speed that was previously bottled up by API overhead associated with approaches like FUSE.&quot;</p><p>Beyond the agent workflow, McCarthy sees S3 Files as a broader inflection point for how enterprises use their data.</p><p>&quot;The launch of S3 Files isn&#x27;t just S3 with a new interface; it&#x27;s the removal of the final friction point between massive data lakes and autonomous AI,&quot; he said. &quot;By converging file and object access with S3, they are opening the door to more use cases with less reworking.&quot;</p><h2>What this means for enterprises</h2><p>For enterprise teams that have been maintaining a separate file system alongside S3 to support file-based applications or agent workloads, that architecture is now unnecessary.</p><p>For enterprise teams consolidating AI infrastructure on S3, the practical shift is concrete: S3 stops being the destination for agent output and becomes the environment where agent work happens.</p><p>&quot;All of these API changes that you&#x27;re seeing out of the storage teams come from firsthand work and customer experience using agents to work with data,&quot; Warfield said. &quot;We&#x27;re really singularly focused on removing any friction and making those interactions go as well as they can.&quot;</p>]]></description>
            <category>Data</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/7qINoorFvA8oMUYPoXZgvV/96ded88e2eb8ab5a27f6894b6b27cf44/Amazon-s3files-missinglink-smk1.jpg?w=300&amp;q=30" length="0" type="image/jpeg"/>
        </item>
        <item>
            <title><![CDATA[Anthropic says its most powerful AI cyber model is too dangerous to release publicly — so it built Project Glasswing]]></title>
            <link>https://venturebeat.com/technology/anthropic-says-its-most-powerful-ai-cyber-model-is-too-dangerous-to-release</link>
            <guid isPermaLink="false">77T71pczWTs3cZwOhJyWzB</guid>
            <pubDate>Tue, 07 Apr 2026 21:35:44 GMT</pubDate>
            <description><![CDATA[<p><a href="https://www.anthropic.com/">Anthropic</a> on Tuesday announced <a href="https://www.anthropic.com/glasswing">Project Glasswing</a>, a sweeping cybersecurity initiative that pairs an unreleased frontier AI model — Claude Mythos Preview — with a coalition of twelve major technology and finance companies in an effort to find and patch software vulnerabilities across the world&#x27;s most critical infrastructure before adversaries can exploit them.</p><p>The launch partners include <a href="https://aws.amazon.com/">Amazon Web Services</a>, <a href="https://www.apple.com/">Apple</a>, <a href="https://www.broadcom.com/">Broadcom</a>, <a href="http://cisco.com/#tabs-69d6a56dd3-item-fdd67b2fb8-tab">Cisco</a>, <a href="https://www.crowdstrike.com/en-us/">CrowdStrike</a>, <a href="https://www.google.com/">Google</a>, <a href="https://www.jpmorganchase.com/">JPMorganChase</a>, <a href="https://www.linuxfoundation.org/">the Linux Foundation</a>, <a href="https://microsoft.com/">Microsoft</a>, <a href="https://nvidia.com/">Nvidia</a>, and <a href="https://www.paloaltonetworks.com/">Palo Alto Networks</a>. Anthropic says it has also extended access to more than 40 additional organizations that build or maintain critical software, and is committing up to $100 million in usage credits for Claude Mythos Preview across the effort, along with $4 million in direct donations to open-source security organizations.</p><p>The announcement arrives at a moment of extraordinary momentum — and extraordinary scrutiny — for the San Francisco-based AI startup. Anthropic disclosed on Sunday that its annualized revenue run rate has <a href="https://www.anthropic.com/news/google-broadcom-partnership-compute">surpassed $30 billion</a>, up from approximately $9 billion at the end of 2025, and the number of business customers each spending over $1 million annually now exceeds 1,000, doubling in less than two months. 
The company simultaneously announced a <a href="https://www.anthropic.com/news/google-broadcom-partnership-compute">multi-gigawatt compute deal</a> with Google and Broadcom. On the same day, <a href="https://www.bloomberg.com/news/articles/2026-04-07/anthropic-poaches-microsoft-executive-to-lead-infrastructure">Bloomberg reported</a> that Anthropic had poached a senior Microsoft executive, Eric Boyd, to lead its infrastructure expansion.</p><p>But Glasswing is something categorically different from a revenue milestone or a compute deal. It’s Anthropic&#x27;s most ambitious attempt to translate frontier AI capabilities — capabilities the company itself describes as dangerous — into a defensive advantage before those same capabilities proliferate to hostile actors.</p><h2><b>Why Anthropic built a model it considers too dangerous to release publicly</b></h2><p>At the center of <a href="https://www.anthropic.com/glasswing">Project Glasswing</a> sits Claude Mythos Preview, a general-purpose frontier model that Anthropic says has already identified thousands of high-severity zero-day vulnerabilities — meaning flaws previously unknown to software developers — in every major operating system and every major web browser, along with a range of other critical software.</p><p>The company is not making the model generally available.</p><p>&quot;We do not plan to make Claude Mythos Preview generally available due to its cybersecurity capabilities,&quot; Newton Cheng, Frontier Red Team Cyber Lead at Anthropic, told VentureBeat in an exclusive interview. &quot;However, given the rate of AI progress, it will not be long before such capabilities proliferate, potentially beyond actors who are committed to deploying them safely. The fallout — for economies, public safety, and national security — could be severe.&quot;</p><p>That language — &quot;the fallout could be severe&quot; — is striking coming from the company that built the model. 
Anthropic is effectively arguing that the tool it created is powerful enough to reshape the cybersecurity landscape, and that the only responsible thing to do is to keep it restricted while giving defenders a head start.</p><p>The technical results reinforce that claim. According to Anthropic&#x27;s press release, Mythos Preview was able to find nearly all of the vulnerabilities it surfaced, and develop many related exploits, entirely autonomously, without any human steering. Three examples stand out: The model found a 27-year-old vulnerability in <a href="https://www.openbsd.org/">OpenBSD</a> — widely regarded as one of the most security-hardened operating systems in the world and commonly used to run firewalls and critical infrastructure. The flaw allowed an attacker to remotely crash any machine running the OS simply by connecting to it. It also discovered a 16-year-old vulnerability in <a href="https://www.ffmpeg.org/">FFmpeg</a> — the near-ubiquitous video encoding and decoding library — in a line of code that automated testing tools had exercised five million times without ever catching the problem. And perhaps most alarmingly, Mythos Preview autonomously found and chained together several vulnerabilities in the <a href="https://www.kernel.org/">Linux kernel</a> to escalate from ordinary user access to complete control of the machine.</p><p>All three vulnerabilities have been reported to the relevant maintainers and have since been patched. For many other vulnerabilities still in the remediation pipeline, Anthropic says it is publishing cryptographic hashes of the details today, with plans to reveal specifics after fixes are in place.</p><p>On the <a href="https://www.cybergym.io/">CyberGym evaluation benchmark</a>, Mythos Preview scored 83.1%, compared to 66.6% for Claude Opus 4.6, Anthropic&#x27;s next-best model. 
The gap is even wider on coding benchmarks: Mythos Preview achieves 93.9% on SWE-bench Verified versus 80.8% for Opus 4.6, and 77.8% on SWE-bench Pro versus 53.4%.</p><h2><b>How Anthropic plans to disclose thousands of zero-days without overwhelming open-source maintainers</b></h2><p>Finding thousands of zero-days at once sounds impressive. Actually handling the output responsibly is a logistical nightmare — and one of the sharpest criticisms that security researchers have raised about AI-driven vulnerability discovery. Flooding open-source maintainers, many of whom are unpaid volunteers, with an avalanche of critical bug reports could easily do more harm than good.</p><p>Cheng told VentureBeat that Anthropic has built a triage pipeline specifically to manage this problem. &quot;We triage every bug that we find and then send the highest severity bugs to professional human triagers we have contracted to assist in our disclosure process by manually validating every bug report before we send it out to ensure that we send only high-quality reports to maintainers,&quot; he said.</p><p>That pipeline is designed to prevent exactly the scenario that maintainers fear most: an automated firehose of unverified reports. &quot;We do not submit large volumes of findings to a single project without first reaching out in an effort to agree on a pace the maintainer can sustain,&quot; Cheng added.</p><p>When Anthropic has access to the source code, the company aims to include a candidate patch with every report, labeled by provenance — meaning the maintainer knows the patch was written or reviewed by a model — and offers to collaborate on a production-quality fix. 
&quot;Models can write patches,&quot; Cheng noted, &quot;but there are many factors that impact patch quality, and we strongly recommend that autonomously-written patches are put under the same scrutiny and testing that human-written patches are.&quot;</p><p>On disclosure timelines, Anthropic says it follows a coordinated vulnerability disclosure framework. Once a patch is available, the company will generally wait 45 days before publishing full technical details, giving downstream users time to deploy the fix before exploitation information becomes public. Cheng said the company may shorten that buffer &quot;if the details are already publicly known through other channels, or if earlier publication would materially help defenders identify and mitigate ongoing attacks,&quot; or extend it &quot;when patch deployment is unusually complex or the affected footprint is unusually broad.&quot;</p><p>Those are reasonable principles, but they will be tested at a scale that no vulnerability disclosure program has ever attempted. The sheer volume of findings — thousands of zero-days across every major platform — means that even a well-designed triage process will face bottlenecks. 
And the 45-day disclosure window assumes that maintainers can actually produce, test, and ship a patch in that time, which is far from guaranteed for complex kernel-level bugs or deeply embedded cryptographic flaws.</p><h2><b>The source code leak, the CMS blunder, and why trust is Anthropic&#x27;s biggest vulnerability</b></h2><p>The irony of a company claiming to build the most capable cyber model ever constructed while simultaneously suffering a string of embarrassing security lapses has not been lost on observers.</p><p>In late March, a draft blog post about Mythos was left in an <a href="https://fortune.com/2026/03/26/anthropic-says-testing-mythos-powerful-new-ai-model-after-data-leak-reveals-its-existence-step-change-in-capabilities/">unsecured and publicly searchable</a> data store — a CMS misconfiguration that exposed roughly 3,000 internal assets, including what appeared to be strategic plans for the model&#x27;s rollout. Days later, on March 31, anyone who ran npm install on Claude Code pulled down Anthropic&#x27;s <a href="https://venturebeat.com/technology/claude-codes-source-code-appears-to-have-leaked-heres-what-we-know">complete original source code</a> — 512,000 lines — for approximately three hours due to a packaging error, an incident that drew widespread attention in the developer community and was first reported by VentureBeat.</p><p>When asked why partners and governments should trust Anthropic as the custodian of a model it describes as having unprecedented cyber capabilities, Cheng was direct. &quot;Security is central to how we build and ship,&quot; he told VentureBeat. &quot;These two incidents, a blog CMS misconfiguration and an npm packaging error, were human errors in publishing tooling, not breaches of our security architecture. 
We&#x27;ve made changes to prevent these from happening again, and we&#x27;ll continue to improve our processes.&quot;</p><p>It is a technically accurate distinction — neither incident involved a breach of Anthropic&#x27;s core model weights, training infrastructure, or API systems — but it is also a distinction that may prove difficult to sustain as a public argument. For an organization asking governments and Fortune 500 companies to trust it with a tool that can autonomously find and exploit vulnerabilities in the Linux kernel, even minor operational lapses carry outsized reputational risk. The fact that the Mythos leak itself was what first alerted the security community to the model&#x27;s existence, weeks before the planned announcement, underscores the point.</p><h2><b>What Microsoft, CrowdStrike, and the Linux Foundation found when they tested the model</b></h2><p>The coalition&#x27;s breadth is notable. It includes direct competitors — <a href="https://www.google.com/">Google</a> and <a href="https://microsoft.com/">Microsoft</a> — alongside cybersecurity incumbents, financial institutions, and the steward of the world&#x27;s largest open-source ecosystem. 
And several partners have already been running Mythos Preview against their own infrastructure for weeks.</p><p>CrowdStrike&#x27;s CTO Elia Zaitsev framed the initiative in terms of collapsing timelines: &quot;The window between a vulnerability being discovered and being exploited by an adversary has collapsed — what once took months now happens in minutes with AI.&quot; AWS Vice President and CISO Amy Herzog said her teams have already been testing Mythos Preview against critical codebases, where the model is &quot;already helping us strengthen our code.&quot; And Microsoft&#x27;s Global CISO Igor Tsyganskiy noted that when tested against CTI-REALM, Microsoft&#x27;s open-source security benchmark, &quot;Claude Mythos Preview showed substantial improvements compared to previous models.&quot;</p><p>Perhaps the most revealing comment came from Jim Zemlin, CEO of the Linux Foundation, who pointed to the fundamental asymmetry that has plagued open-source security for decades: &quot;In the past, security expertise has been a luxury reserved for organizations with large security teams. Open-source maintainers — whose software underpins much of the world&#x27;s critical infrastructure — have historically been left to figure out security on their own.&quot; Project Glasswing, he said, &quot;offers a credible path to changing that equation.&quot;</p><p>To back that claim with dollars, Anthropic says it has donated $2.5 million to Alpha-Omega and OpenSSF through <a href="https://www.linuxfoundation.org/">the Linux Foundation</a>, and $1.5 million to the <a href="https://www.apache.org/">Apache Software Foundation</a>. 
Maintainers interested in access can apply through Anthropic&#x27;s Claude for Open Source program.</p><h2><b>Inside the pricing, the compute deal, and Anthropic&#x27;s path toward a potential IPO</b></h2><p>After the research preview period — during which Anthropic&#x27;s $100 million credit commitment will cover most usage — Claude Mythos Preview will be available to participants at $25 per million input tokens and $125 per million output tokens. Participants can access the model through the <a href="https://platform.claude.com/">Claude API</a>, <a href="https://aws.amazon.com/bedrock/">Amazon Bedrock</a>, Google Cloud&#x27;s <a href="https://cloud.google.com/vertex-ai?hl=en">Vertex AI</a>, and <a href="https://azure.microsoft.com/en-us/products/ai-foundry">Microsoft Foundry</a>.</p><p>Those prices reflect the model&#x27;s computational intensity. The draft blog post that leaked in March described Mythos as a large, compute-intensive model that would be expensive for both Anthropic and its customers to serve. On the deployment side, Anthropic plans to develop and launch new safeguards with an upcoming Claude Opus model, allowing the company to &quot;improve and refine them with a model that does not pose the same level of risk as Mythos Preview,&quot; as Cheng told VentureBeat. Security professionals whose legitimate work is affected by those safeguards will be able to apply to an upcoming Cyber Verification Program.</p><p>The financial context matters. The same day <a href="https://www.anthropic.com/glasswing">Project Glasswing</a> launched, Anthropic disclosed its revenue milestone and the Google-Broadcom compute deal. Broadcom signed an expanded deal with Anthropic that will give the AI startup access to about 3.5 gigawatts of computing capacity drawing on Google&#x27;s AI processors, <a href="https://www.cnbc.com/2026/04/06/broadcom-agrees-to-expanded-chip-deals-with-google-anthropic.html">according to CNBC</a>. 
The scale of compute being marshaled is staggering — and it helps explain why Anthropic needs both the revenue from enterprise cybersecurity partnerships and the infrastructure to serve a model of Mythos Preview&#x27;s size.</p><p>The timing also intersects with growing speculation about Anthropic&#x27;s path to a public offering. The company is reportedly evaluating an <a href="https://www.bloomberg.com/news/articles/2026-03-27/claude-ai-maker-anthropic-said-to-weigh-ipo-as-soon-as-october">IPO as early as October 2026</a>. A high-profile, government-adjacent cybersecurity initiative with blue-chip partners is exactly the kind of program that burnishes an IPO narrative — particularly when the company can simultaneously point to $30 billion in annualized revenue and a compute footprint measured in gigawatts.</p><h2><b>Anthropic says defenders have months, not years, before adversaries catch up</b></h2><p>The most consequential question raised by <a href="https://www.anthropic.com/glasswing">Project Glasswing</a> is not whether Mythos Preview&#x27;s capabilities are real — the partner endorsements and patched vulnerabilities suggest they are — but how much time defenders actually have before similar capabilities are available to adversaries.</p><p>Cheng was candid about the timeline. &quot;Frontier AI capabilities are likely to advance substantially over just the next few months,&quot; he told VentureBeat. &quot;Given the rate of AI progress, it will not be long before such capabilities proliferate, potentially beyond actors who are committed to deploying them safely.&quot; He described Project Glasswing as &quot;an important step toward giving defenders a durable advantage in the coming AI-driven era of cybersecurity&quot; but added a crucial caveat: &quot;It&#x27;s important to note, this is a starting point. No one organization can solve these cybersecurity problems alone.&quot;</p><p>That framing — months, not years — is worth taking seriously. 
DARPA launched its original <a href="https://en.wikipedia.org/wiki/2016_Cyber_Grand_Challenge">Cyber Grand Challenge in 2016</a>, a competition to create automatic defensive systems capable of reasoning about flaws, formulating patches, and deploying them on a network in real time. At the time, the winning AI-powered bot, Mayhem, finished last when placed against human teams at DEF CON. A decade later, Anthropic is claiming that a frontier AI model can find vulnerabilities that survived 27 years of expert human review and millions of automated security tests — and can chain exploits together autonomously to achieve full system compromise.</p><p>The delta between those two data points illustrates why the industry is treating this as a genuine inflection point, not a marketing exercise. Anthropic itself has firsthand experience with the offensive side of this equation: the company disclosed in November 2025 that a Chinese state-sponsored group achieved 80 to 90 percent autonomous tactical execution using Claude across approximately 30 targets, according to Anthropic&#x27;s <a href="https://www.anthropic.com/news/disrupting-AI-espionage">misuse report</a>.</p><p><a href="https://www.anthropic.com/glasswing">Project Glasswing</a> arrives during one of the most turbulent weeks in Anthropic&#x27;s history. In the span of days, the company has announced a model it considers too dangerous for public release, disclosed that its revenue has tripled, sealed a multi-gigawatt compute deal, hired a senior Microsoft executive, made it more expensive for Claude Code subscribers to use third-party tools like OpenClaw, and weathered a major outage of its Claude chatbot on Tuesday morning. Anthropic says it will report publicly on what it has learned within 90 days. 
In the medium term, the company has proposed that an independent, third-party body might be the ideal home for continued work on large-scale cybersecurity projects.</p><p>Whether any of that is fast enough depends on a race that is already underway. Anthropic built a model that can autonomously crack open the most hardened operating systems on the planet — and is now betting that sharing it with defenders, under careful restrictions, will do more good than the inevitable moment when similar capabilities land in less careful hands. It is, in essence, a wager that transparency can outrun proliferation. The next few months will determine whether that bet pays off, or whether the glasswing&#x27;s wings were never quite opaque enough to hide what was coming.</p>]]></description>
            <author>michael.nunez@venturebeat.com (Michael Nuñez)</author>
            <category>Technology</category>
            <category>Data</category>
            <enclosure url="https://images.ctfassets.net/jdtwqhzvc2n1/6SaGXxoA2WaaomE4o3ATIx/e0ed70fe2b8554b83f11998884a52c0b/nuneybits_Vector_art_of_a_retro_CRT_computer_image_in_burnt_ora_f8676585-ba7c-4170-bb90-6ad2e8a2668e.webp?w=300&amp;q=30" length="0" type="image/webp"/>
        </item>
    </channel>
</rss>