Alibaba’s Tongyi Lab has introduced a new open-source training framework that can train open large language models (LLMs) to compete with leading commercial deep research models. The technique, called Agentic Continual Pre-training (Agentic CPT), uses a novel data synthesis framework and training pipeline to enhance LLMs' ability to learn complex, multi-step behaviors. 

AgentFounder, a deep research agent trained through this framework, sets a new performance record on key industry benchmarks, demonstrating a powerful and cost-effective path for enterprises to develop custom LLM agents for advanced research tasks.

The challenge of agentic alignment

As language models evolve from simple chatbots to autonomous agents, the definition of AI alignment needs a fundamental expansion. Aligning a model’s output with human preference in a single turn is no longer sufficient. The authors of the paper argue that for an agent to be reliable, it must achieve "agentic alignment," which requires it to maintain behavior consistent with human experts when solving complex problems in dynamic environments, such as invoking the right tools and adapting to unpredictable changes like tool failures or misleading information.

According to the authors of the paper, “language models achieving such alignment for web retrieval and knowledge-intensive tasks can be considered deep research agents,” capable of autonomously orchestrating workflows through search, browsing, and code execution to provide trustworthy answers.

However, current LLM post-training methods, such as supervised fine-tuning (SFT) and reinforcement learning (RL), have proven insufficient for this task, especially for open-source models. These models often show large performance gaps compared to proprietary systems like OpenAI’s Deep Research because they are built on general-purpose foundation models. 

The researchers stress that neither SFT nor RL provides the learning signals required to enable research agents to navigate vast and complex decision spaces. Instead, they lock models into replicating specific behaviors rather than developing flexible decision-making skills. “Fundamentally, general-purpose foundation models lack agentic inductive biases, forcing post-training to simultaneously learn capabilities and alignment, creating inherent optimization conflicts,” the paper states. “Crucially, pathways toward developing agentic foundation models themselves remain largely unexplored.”

Agentic Continual Pre-training (Agentic CPT)

To solve this problem, the Alibaba team redefined the training pipeline by introducing Agentic CPT as an intermediate stage for alignment. The core objective is to produce a "pre-aligned" foundation model that already possesses strong agentic behaviors before it undergoes final fine-tuning.

Two principles guide this approach:

  1. The initial data sources must be broad and not confined to a single domain.

  2. The training data must include a comprehensive variety of agentic behaviors, preventing the model from simply memorizing patterns and instead encouraging it to explore different problem-solving strategies.

This approach differs fundamentally from other open-source frameworks. "Other deep research agents such as WebSailor, only post-train the model on a small amount of trajectories," Xinyu Wang, a researcher at Alibaba and co-author of the paper, told VentureBeat. "Agentic CPT adds a new stage before them, which introduces a large scale of agentic data for stronger ability on agentic tasks." The result is a powerful "agentic base model" that can be adapted to various post-training methods.

In an enterprise setting, this reliability is crucial, as erratic AI behavior is a major risk. When asked how Agentic CPT leads to more predictable agents, Wang explained that the method improves an agent’s planning and self-correction abilities. Rather than eliminating uncertainty, the framework trains the agent to manage it.

"To balance exploration with reliability, we strengthen the agent’s cross-validation and path selection so it can self-correct under uncertainty," Wang said. "For example, if a target page is inaccessible, the agent reroutes to alternative channels... When a source is uncertain, it proactively seeks independent corroboration. If evidence is insufficient, it defers or flags conclusions rather than making premature judgments." This built-in robustness means an agent is less likely to “hallucinate” a faulty plan or get stuck when an internal database temporarily goes offline.
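The fallback behavior Wang describes can be illustrated with a minimal sketch: try channels in order, skip inaccessible ones, and defer when corroboration is insufficient. All function names and the corroboration threshold here are hypothetical, not part of the paper's implementation.

```python
# Hypothetical sketch of the self-correction loop Wang describes: reroute
# around inaccessible sources, collect independent corroboration, and defer
# rather than guess when evidence is insufficient. Names are illustrative.

def research_with_fallback(sources, min_corroboration=2):
    """Try each source in order; answer only with enough corroboration."""
    evidence = []
    for source in sources:
        result = source()          # each source is a callable retrieval tool
        if result is None:
            continue               # inaccessible: reroute to the next channel
        evidence.append(result)
        if len(evidence) >= min_corroboration:
            return {"status": "answer", "evidence": evidence}
    # Not enough independent corroboration: flag instead of guessing
    return {"status": "deferred", "evidence": evidence}

primary = lambda: None                     # target page is inaccessible
mirror = lambda: "fact from mirror site"
archive = lambda: "fact from web archive"

print(research_with_fallback([primary, mirror, archive])["status"])  # answer
print(research_with_fallback([primary, mirror])["status"])           # deferred
```

The key design point is that "failure" is a normal branch of the control flow, not an exception: the deferred result is itself a valid, flagged output.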

The training framework for Agentic CPT (source: arXiv)

Using this framework, the researchers trained AgentFounder, a deep research agent based on the open-source Qwen3-30B base model. Its pipeline inserts a new stage between the standard pre-training and post-training phases that LLMs typically go through, split into two parts. In Stage 1, the model processes approximately 200 billion tokens of agent data and knowledge-reasoning text, using a 32K context window. In Stage 2, these capabilities are refined with an additional 100 billion tokens of high-quality agent data and an extended 128K context window, training the model to understand complex action spaces and learn long-horizon planning strategies.
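The two-stage schedule described above can be captured as a small configuration sketch. The token counts and context windows come from the paper; the class, stage names, and data-mix labels are hypothetical, not Tongyi Lab's actual API.

```python
# Hypothetical configuration sketch of the two-stage Agentic CPT schedule.
# Token counts and context windows are from the paper; all names are
# illustrative, not from Tongyi Lab's codebase.
from dataclasses import dataclass, field

@dataclass
class CPTStage:
    name: str
    tokens: int           # total training tokens for the stage
    context_window: int   # max sequence length in tokens
    data_mix: list = field(default_factory=list)

AGENTIC_CPT_SCHEDULE = [
    CPTStage(
        name="stage1_capability",
        tokens=200_000_000_000,   # ~200B tokens of agent + knowledge-reasoning data
        context_window=32_768,    # 32K context
        data_mix=["fas_qa_pairs", "knowledge_reasoning_text"],
    ),
    CPTStage(
        name="stage2_refinement",
        tokens=100_000_000_000,   # ~100B tokens of high-quality agent data
        context_window=131_072,   # 128K context for long-horizon planning
        data_mix=["has_trajectories", "long_horizon_plans"],
    ),
]

total = sum(s.tokens for s in AGENTIC_CPT_SCHEDULE)
print(f"Total Agentic CPT tokens: {total:,}")  # 300,000,000,000
```

The staged design mirrors a common curriculum pattern: broad capability acquisition at a short context first, then refinement on higher-quality data at the long context where long-horizon planning actually matters.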

Generating data without API calls

A cornerstone of the Agentic CPT framework is a scalable data synthesis approach that trains powerful models without incurring the high costs associated with commercial API calls or human-annotated data. The process has two main components: First-order Action Synthesis (FAS) and Higher-order Action Synthesis (HAS). 

FAS is designed to create a diverse training dataset covering a wide range of scenarios, from factual retrieval to multi-hop reasoning. It converts raw data from various sources into a structured "open-world memory" and then generates complex question-answer pairs.
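As a rough illustration of the FAS idea, a pipeline might restructure raw documents into an entity-keyed memory and then derive question-answer pairs from it. Everything below is a hypothetical sketch under that reading; a real system would use an LLM to compose multi-hop questions rather than a template.

```python
# Hypothetical sketch of First-order Action Synthesis (FAS): build an
# "open-world memory" from raw source text, then generate QA training
# pairs from it. Function and field names are illustrative.
from collections import defaultdict

def build_open_world_memory(documents):
    """Index raw facts by entity, forming a simple open-world memory."""
    memory = defaultdict(list)
    for doc in documents:
        for entity, fact in doc["facts"]:
            memory[entity].append(fact)
    return memory

def synthesize_qa_pairs(memory):
    """Turn memory entries into (question, answer) pairs. A real FAS
    pipeline would phrase multi-hop questions with an LLM instead of
    this one-hop template."""
    pairs = []
    for entity, facts in memory.items():
        for fact in facts:
            pairs.append((f"What is known about {entity}?", fact))
    return pairs

docs = [
    {"facts": [("Qwen3-30B", "open-source base model used for AgentFounder")]},
    {"facts": [("Agentic CPT", "intermediate pre-training stage for alignment")]},
]
memory = build_open_world_memory(docs)
qa = synthesize_qa_pairs(memory)
print(len(qa))  # 2
```

Because the memory is keyed by entity, the same store can later be sampled for harder multi-hop questions that chain facts across entities.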

Data generation pipeline for Agentic CPT (source: arXiv)

HAS complements this by enabling the system to utilize its generated data more effectively. Instead of training the model on a single "correct" trajectory, HAS generates multiple alternative reasoning paths for each problem. This process teaches the model flexible decision-making rather than simple imitation by allowing it to explore a broader range of potential solutions. 
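The multi-path idea behind HAS can be sketched by expanding one problem into every combination of per-step actions rather than a single gold trajectory. The step structure and action names below are hypothetical, chosen only to illustrate the branching.

```python
# Hypothetical sketch of Higher-order Action Synthesis (HAS): expand one
# solved problem into several alternative action paths instead of a single
# "correct" trajectory. Step structure and action names are illustrative.
import itertools

# Candidate actions available at each decision step of a research task
STEP_OPTIONS = [
    ["search_web", "query_internal_db"],        # step 1: find sources
    ["cross_validate", "read_primary_source"],  # step 2: verify
    ["summarize", "defer_with_flag"],           # step 3: conclude or defer
]

def expand_trajectories(question, step_options):
    """Enumerate every combination of per-step actions as a training path."""
    paths = []
    for combo in itertools.product(*step_options):
        paths.append({"question": question, "actions": list(combo)})
    return paths

paths = expand_trajectories("Which firm led the 2021 acquisition?", STEP_OPTIONS)
print(len(paths))  # 2 * 2 * 2 = 8 alternative paths
```

Training on all eight paths, rather than one, is what pushes the model toward flexible decision-making instead of rote imitation of a single trajectory.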

Critically, both of these synthesis methods operate entirely offline, enabling large-scale data generation without incurring API costs.

AgentFounder in action

When tested against a host of general LLMs, commercial agents, and other open-source deep research agents, AgentFounder-30B delivered state-of-the-art results. Overall, it outperformed all existing open-source competitors across four general web search benchmarks. On the English BrowseComp benchmark, AgentFounder-30B surpassed the previous best open-source model, DeepSeek-V3.1, by 10 percentage points, bringing its performance closer to that of closed-source models from OpenAI. The paper notes, “This significant improvement demonstrates that AgentFounder-30B has effectively mastered sophisticated search strategies and reasoning capabilities.”

AgentFounder-30B also demonstrated impressive performance on specialized tasks. On the highly challenging Humanity's Last Exam (HLE) benchmark, it became the first open-source model to surpass the 30-point threshold. It also scored 75.3% on Academic Browse, substantially outperforming all other models and proving its value as an academic assistant.

AgentFounder-30B outperforms open-source deep research agents on key industry benchmarks (source: arXiv)

For a business leader, this superior performance translates into tangible benefits. Wang notes that high scores on these benchmarks indicate the agent is "more stable, accurate, and actionable on real-world, long-horizon, tool/knowledge-intensive tasks." In practice, this means faster access to reliable research reports for tasks like competitive market analysis or supply chain monitoring, which depend on "multi-source data aggregation, cross-validation of signals, and fast knowledge refresh," Wang said. However, for high-stakes applications, he recommends a human-in-the-loop setup for risk control, "with human review and approval at critical decision points."

For the enterprise, AgentFounder-30B demonstrates a practical path toward state-of-the-art deep research agents that can be deployed on-premise, offering greater control and security. More importantly, the Agentic CPT framework changes the calculus for developing custom AI. The training pipeline can be tailored to an organization's internal tools and proprietary data sources, enabling the development of highly customized AI agents.

According to Wang, this approach makes building specialized agents a much faster and more affordable process. "With a strong pre-aligned agentic base model, enterprises can perform light adaptation using their in-domain corpora and proprietary tools to rapidly build domain-specific agents, such as for financial analysis or pharmaceutical research," he said. "This approach is feasible in both cost and timeline for most companies." Looking forward, this could mean that as agentic capabilities become native to foundation models, many complex tasks may be solvable with simple prompt engineering alone.