It's hard to remember, but at the start of 2025, Chinese generative AI models and agents were widely seen as second-rate compared to products from U.S. labs like OpenAI, Anthropic, Google, and even Meta.

That all changed with the January 2025 release of DeepSeek's R1 model, and the gap has only narrowed further in the eight months since, with Chinese tech giants and startups alike (from Alibaba to Baidu to Z.ai, Kimi, Manus and more) fielding impressively powerful AI products. Most are open source and free for anyone in the world to take and use, and some match or outperform paid, top-tier U.S. equivalents.

Now that moment seems to have arrived once again for AI agents — which, though still a nebulous term in the AI industry, we'll define for the purposes of this article as a generative AI product that can complete multi-step work autonomously over an extended period of time (10+ minutes) given a single block of free-form, natural-language instructions from a human user. They're usually, but not exclusively, powered by a large language model (LLM) or multimodal model.
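That working definition can be sketched as a minimal loop: the model proposes an action, a tool executes it, and the observation is fed back into context until the model decides it is done. Everything here is a hypothetical stand-in for illustration, not code from any real agent product.

```python
# Minimal sketch of an "agent" per the definition above: a loop that takes one
# free-form instruction and autonomously performs multiple steps until done.
# The model and tool below are stand-ins, not part of any real release.

def fake_llm(history):
    """Stand-in for an LLM call: decides the next action from the transcript."""
    if "search results" not in history:
        return ("search", "open-source deep research agents")
    return ("finish", "Summary: compiled findings from the search results.")

def fake_search(query):
    """Stand-in for a web-search tool."""
    return f"search results for '{query}'"

def run_agent(instruction, max_steps=10):
    history = instruction
    for _ in range(max_steps):
        action, arg = fake_llm(history)
        if action == "finish":
            return arg                  # final answer back to the user
        observation = fake_search(arg)  # execute the chosen tool
        history += "\n" + observation   # feed the result back into context
    return "gave up"

print(run_agent("Research open-source deep research agents."))
```

The key property is that the user supplies one instruction up front; all intermediate decisions happen inside the loop.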

This week, another AI agent research team at Alibaba — the Tongyi Lab, not to be confused with the Qwen Team that releases foundation models under the same parent company — unveiled a powerful new open source agent specifically for conducting "deep research" across the web and compiling thorough, accurate reports and other materials for individuals and organizations.

The new Tongyi DeepResearch Agent is setting off a furor among AI power users and experts around the globe for its high performance marks: according to its makers, it's "the first fully open-source Web Agent to achieve performance on par with OpenAI's Deep Research with only 30B (Activated 3B) parameters."

Remember that parameters are the internal settings that guide an LLM's behavior, with more parameters typically meaning a higher-performing model. OpenAI's older GPT-4 was rumored to have nearly 2 trillion. To put 30 billion parameters in perspective — with only 3 billion activated, i.e., actually used when handling each unit of information, or token — the model performs on par with models roughly 25 times its size, as AI researcher and software engineer Ahmad Osman pointed out on X:

Indeed, benchmarks released by Tongyi Lab show the Tongyi DeepResearch Agent can match or exceed many larger or proprietary alternatives. It scores:

  • 32.9 on Humanity’s Last Exam (HLE), the highest among all models tested — even beating OpenAI's o3.

  • 43.4 on BrowseComp, approaching OpenAI o3’s 49.7.

  • 46.7 on BrowseComp-ZH, second only to OpenAI o3’s 58.1.

  • 75.0 on xbench-DeepSearch.

  • 72.2 on WebWalkerQA.

  • 90.6 on FRAMES, the highest of all models.

Performance benchmarks for Tongyi DeepResearch Agent as of September 2025. Credit: Alibaba Cloud Tongyi Lab

These results place Tongyi DeepResearch above other open-source models like DeepSeek V3.1, Kimi K2, and Claude-4-Sonnet on multiple tasks, despite its relatively modest size.

In benchmark evaluations specific to legal research, Tongyi DeepResearch Agent outperforms both OpenAI and Anthropic Claude DeepResearch agents in case citation quality (64.26 vs. 57.56 and 40.43) and slightly edges out both in key point accuracy (88.28).

Like the Qwen3-30B-A3B LLM from which it is derived, Tongyi DeepResearch Agent is available free for developers and enterprises to download, customize, and deploy — even for commercial applications, products, and workflows — via Hugging Face, GitHub, and ModelScope, under a permissive Apache 2.0 license.

End-to-End Agent Training Pipeline

How did the researchers do it?

Tongyi says the agent was created using a fully automated training pipeline, without relying on human-labeled data.

Specifically, the "agent learns via trial-and-error in a custom, highly stable simulated environment" that the researchers created using a copy of Wikipedia's knowledge base, allowing their agent to explore it and perform actions that "closely mirror those of a real-world setting," i.e., the open web. This also helped keep costs down, avoiding the variability and cost of live web APIs.
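The simulated environment described above can be sketched as an offline "web" whose tools (search, read) run against a local snapshot instead of live APIs. The class, method names, and tiny corpus below are all illustrative assumptions standing in for a Wikipedia dump, not Tongyi's actual code.

```python
# Hedged sketch of an offline simulated-web environment: the agent's tools
# operate on a local corpus (here a dict standing in for a Wikipedia snapshot),
# so training avoids the cost and variability of live web APIs while the
# actions still mirror real web browsing.

class OfflineWikiEnv:
    def __init__(self, corpus):
        self.corpus = corpus  # title -> article text

    def search(self, query):
        """Return matching titles (a stand-in for a real retrieval index)."""
        q = query.lower()
        return [t for t, text in self.corpus.items() if q in text.lower()]

    def read(self, title):
        """Fetch an article, mirroring a 'visit page' action on the open web."""
        return self.corpus.get(title, "page not found")

env = OfflineWikiEnv({
    "Alan Turing": "Alan Turing proposed the Turing test in 1950.",
    "Turing test": "The Turing test measures machine intelligence.",
})
hits = env.read(env.search("turing test")[0])
print(hits)
```

Because every action is deterministic and local, the same episode can be replayed exactly, which is what makes the environment "highly stable" for training.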

They also used a custom tool sandbox to ensure predictable tool performance (e.g., the agent writing and executing Python code to structure information in its output), while a data curation engine dynamically adapted the training data set in response to model performance — that is, generating more esoteric or harder synthetic data as the agent showed it was capable of handling it. As the researchers write in a white paper (p.6) on their methods released today:

"The quality of the data directly determines the upper bound on the model’s ability to generalize to out-of-distribution scenarios through self-exploration. To address this challenge, we optimize data in real time, guided by training dynamics. This optimization is achieved through a fully automated data synthesis and filtering pipeline that dynamically adjusts the training set. By closing the loop between data generation and model training, this approach not only ensures training stability but also delivers substantial performance gains."
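The adaptive curation loop described above can be sketched as a difficulty filter: keep only the synthetic tasks near the edge of the model's current ability, so the training set hardens as the agent improves. The solve-rate oracle and thresholds below are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of difficulty-adaptive data curation: drop tasks the model
# always solves (too easy, no gradient signal) or never solves (too hard,
# pure noise), keeping the set calibrated to current ability.

def curate(tasks, solve_rate, low=0.2, high=0.8):
    """Keep tasks the model solves sometimes but not always."""
    return [t for t in tasks if low <= solve_rate(t) <= high]

# Illustrative pool: task name -> solve rate measured during training
pool = {"easy lookup": 0.95, "two-hop question": 0.6, "ten-hop maze": 0.05}
kept = curate(pool, pool.get)
print(kept)  # only the mid-difficulty task survives this round
```

Re-running this filter as solve rates shift is what "closes the loop" between data generation and model training.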

The result is a lean but capable model that already powers real-world tools, including Gaode Mate — an AI travel planner integrated into Amap — and Tongyi FaRui, a legal research assistant that autonomously retrieves and cites relevant case law and statutes.

Two complementary agent models form the backbone of the DeepResearch release:

  • AgentFounder-30B, which specializes in pretraining agentic behaviors through a novel continual pretraining method; and

  • WebSailor-V2-30B-A3B, which enhances post-training through scalable reinforcement learning in dual simulation-real environments.

Both models use the same Qwen3-30B-A3B base and produce results that rival proprietary agents up to 20x their size.

A defining feature of Tongyi DeepResearch is its training pipeline, which spans three stages: Agentic Continual Pre-training (CPT), Supervised Fine-Tuning (SFT), and Reinforcement Learning (RL). This full-stack setup supports the development of agentic capabilities from raw interaction data to refined, multi-step reasoning workflows.

For pre-training, the team introduces AgentFounder, a systematic data engine that uses knowledge graphs, documents, and tool-use trajectories to generate large-scale synthetic question-answer pairs.

A key innovation in AgentFounder is the use of two structured approaches to synthetic behavior modeling: First-order Action Synthesis (FAS), which creates diverse planning and reasoning steps without tool execution; and High-order Action Synthesis (HAS), which remaps discarded or partial agent trajectories into stepwise decision-making datasets, enhancing sample efficiency and diversity.
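The HAS idea of remapping discarded or partial trajectories can be sketched as follows: one trajectory of N steps yields N stepwise training examples, each pairing the context so far with the next action taken. The data schema and trajectory below are illustrative assumptions, not Tongyi's actual format.

```python
# Hedged sketch of High-order Action Synthesis (HAS): turn a (possibly
# discarded) agent trajectory into per-step decision examples, so even
# imperfect runs contribute training samples — improving sample efficiency.

def trajectory_to_stepwise(instruction, steps):
    """Turn one trajectory into N stepwise (context -> next action) examples."""
    examples = []
    context = instruction
    for action, observation in steps:
        examples.append({"context": context, "target_action": action})
        context += f"\n{action} -> {observation}"  # grow context for next step
    return examples

traj = [("search('X')", "3 results"), ("read(result_1)", "answer found")]
data = trajectory_to_stepwise("Find X.", traj)
print(len(data))           # 2 stepwise samples from one trajectory
print(data[1]["context"])  # second sample sees the first step's outcome
```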

This data is further refined using formal methods, including set-theory-based reasoning structure modeling, ensuring controllable complexity and scalability.

The final RL stage uses a customized Group Relative Policy Optimization (GRPO) algorithm. It applies on-policy learning with token-level gradient optimization, leave-one-out advantage estimation, and careful filtering of poor-quality negative samples.
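The leave-one-out advantage estimation mentioned above can be shown in a few lines: within a group of rollouts for the same prompt, each sample's baseline is the mean reward of the other samples, so no learned value function is needed. The reward values are illustrative; this sketches the estimator only, not the full GRPO update.

```python
# Hedged sketch of leave-one-out advantage estimation for group-based RL:
# for each rollout i, the baseline is the mean reward of the other n-1
# rollouts of the same prompt, and the advantage is reward minus baseline.

def leave_one_out_advantages(rewards):
    total = sum(rewards)
    n = len(rewards)
    # baseline for sample i = mean of the other n-1 rewards
    return [r - (total - r) / (n - 1) for r in rewards]

group = [1.0, 0.0, 0.5, 0.5]  # rewards for 4 rollouts of one prompt
adv = leave_one_out_advantages(group)
print(adv)
```

A useful property: the advantages always sum to zero within a group, so above-average rollouts are reinforced exactly as much as below-average ones are penalized.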

The team reports consistent upward trends in reward signals and stable exploration behavior, attributing much of this to the structure and quality of their synthetic training data rather than algorithm tweaks alone.

WebSailor-V2, trained in this stack, achieves 35.3 on BrowseComp-EN, 44.1 on BrowseComp-ZH, and 30.6 on HLE — results that exceed all open-source baselines and even challenge OpenAI and DeepSeek models more than 10x larger.

Two Modes of Inference: ReAct and Heavy Mode

Tongyi DeepResearch supports two modes at inference:

  1. ReAct Mode – This mode adheres to the Thought-Action-Observation loop, showcasing the model’s capabilities without prompt engineering. It offers a straightforward way to benchmark agentic performance in a clean, repeatable environment.

  2. Heavy Mode – Built on the IterResearch paradigm, this setup avoids the pitfalls of overly long contexts by breaking research tasks into discrete rounds. Each round reconstructs a focused workspace, allowing the agent to decide whether to continue gathering information or synthesize an answer. For particularly complex problems, multiple agents can run these loops in parallel, with a final synthesis agent integrating the results.
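The Heavy Mode round structure described in item 2 can be sketched as a loop that rebuilds a compact workspace each round instead of letting one context grow without bound. The function names and the stand-in researcher below are illustrative assumptions, not the IterResearch implementation.

```python
# Hedged sketch of IterResearch-style rounds: each round reconstructs a
# focused workspace (question + running report), gathers new evidence, and
# decides whether to continue researching or synthesize a final answer.

def run_rounds(question, gather, max_rounds=5):
    report = ""
    for round_no in range(1, max_rounds + 1):
        # Rebuild a compact workspace rather than appending to one long context
        workspace = f"Q: {question}\nReport so far: {report or '(empty)'}"
        finding, enough = gather(workspace, round_no)
        report += f" [{finding}]"  # fold new evidence into the running report
        if enough:                 # the agent decides to synthesize
            return f"Answer after {round_no} rounds:{report}"
    return f"Best effort:{report}"

# Stand-in researcher: declares the evidence sufficient on round 3.
result = run_rounds("Who proved X?", lambda ws, r: (f"fact{r}", r >= 3))
print(result)
```

In the parallel variant the article describes, several of these loops would run independently, with a final synthesis agent merging their reports.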

This multi-agent research-and-synthesis framework helps the model manage cognitive load more effectively during multi-step tasks, which the team calls a key enabler for pushing reasoning performance to the limits of the model’s 128k context window.

In tests, the model shows clear scaling improvements in tool efficiency and context usage, outperforming even 355B and 671B models (GLM-4.5 and DeepSeek-V3.1) both in long-context settings and under low tool-call budgets.

Real-World Use Cases Already in Deployment

While the technical architecture is designed for general-purpose agentic reasoning, Tongyi DeepResearch has already been deployed in practical applications:

  • Gaode Mate: In collaboration with Amap (Gaode), Tongyi Lab developed an in-app AI copilot named Xiao Gao. It supports natural language trip planning by searching for scenic spots, identifying pet-friendly hotels, and assembling personalized itineraries with minimal user input.

  • Tongyi FaRui: This legal research agent performs tasks similar to those handled by a junior legal professional. It retrieves case law, cross-references relevant statutes, and synthesizes findings into structured outputs. Results include direct citations, case numbers, and statutes, ensuring transparency and legal validity.

A Growing Family of Agentic Models

Tongyi DeepResearch builds on a broader body of work from the Tongyi Lab team. Over the past six months, they’ve released a family of agents, including:

  • WebWalker (web traversal),

  • WebSailor and WebSailor V2 (reasoning navigation),

  • WebShaper (task modeling),

  • WebResearch (long-horizon reasoning),

  • WebWeaver (structured web evidence),

  • WebResummer, WebWatcher, and more.

Each sub-model either contributes to the data generation process or provides task-specific insights into different aspects of agentic reasoning. Collectively, this research effort represents one of the most comprehensive open-source explorations of deep research agents to date.

According to Tongyi researchers, these agents share a common design philosophy: grounding reasoning and synthesis in agentic interaction workflows, not just static completion. This enables models to generalize better to dynamic environments and long-horizon tasks — and may signal the next evolution of open-source AI tooling.

Limitations and What’s Next

The team acknowledges a few current limitations. Its 128,000-token context window (roughly the text of a 300-400 page novel exchanged in a single input/output interaction with the model) can still be insufficient for the longest, most complex tasks, and it trails the 256,000-token or greater context windows offered by OpenAI's GPT-5 and many other leading proprietary models.

Additionally, the methods used for Tongyi DeepResearch haven’t yet been tested at a scale larger than 30B parameters.

There’s also work ahead in optimizing reinforcement learning further through techniques like partial rollouts—though this will require addressing challenges related to off-policy training.

More fundamentally, Tongyi’s researchers argue that current evaluation methods, such as trajectory imitation or static QA correctness, don’t fully capture the capabilities of agents. Their proposal to shift toward step-wise decision supervision may point the way forward for the entire agentic AI field.

Even with these caveats, the release of Tongyi DeepResearch marks a meaningful step forward for open-source agent development.

By publishing the model, tools, training methods, and results, Tongyi Lab invites the broader research community to participate, extend, and test new ideas in this evolving field.