Researchers from the Hong Kong University of Science and Technology, MiniMax, and University of Waterloo have released a new framework for creating smarter and more efficient web search and browsing agents. The framework, called WebExplorer, uses a systematic approach to automatically generate training data, removing the need for expensive manual labeling while making the examples challenging enough to train highly capable AI agents that can solve long-horizon tasks.

WebExplorer-8B, a model trained with this method, achieves state-of-the-art results for its size on key industry benchmarks. It demonstrates complex tool use, making it well-suited for reasoning and research tasks that require gathering and processing information from multiple online sources. This approach offers a practical path for enterprises to develop powerful, long-horizon web agents for custom applications.

The data quality bottleneck for web agents

Large language models (LLMs) are increasingly being used as agents that can autonomously browse the web to find information. However, developing these agents faces a significant hurdle. Existing open-source models often struggle with complex, multi-step search tasks, while the training methods for more powerful commercial models from companies like OpenAI and Google remain opaque.

The researchers behind WebExplorer hypothesize that the core challenge lies in the quality of the training data. Top benchmarks now include extraordinarily difficult questions that even human annotators struggle to answer. The paper argues that “constructing high-quality, difficult query-answer pairs is essential for developing agents that can achieve super-human performance on information-seeking tasks.”

Creating this high-quality data is a major problem. Most challenging benchmarks rely on manual curation, which is slow, expensive, and results in datasets too small for large-scale model training. This has led to a critical need for methods that can autonomously synthesize large volumes of challenging query-answer pairs. Current automated approaches, however, have their own limitations. Graph-based methods, which map out websites and their connections, are complex to design and maintain. Meanwhile, evolution-based methods, which iteratively make simple questions harder, can result in unnatural or convoluted queries.

How WebExplorer generates challenging data

The WebExplorer framework introduces a simpler, more effective way to create training data through a two-stage process: model-based exploration and iterative query evolution.

WebExplorer framework (source: GitHub)

First, instead of building a rigid, explicit map of the web, the framework takes a more flexible approach. It gives a powerful LLM a "seed" topic (e.g., "Brazil National Team") and instructs it to perform a series of search and browse actions. The model explores freely, discovering interconnected facts across different websites. This process internally simulates the creation of a knowledge graph without the complex overhead of predefined rules. Once it has built a rich context, the model creates an initial question-answer pair based on the information it found.
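In outline, that exploration loop can be sketched as below. This is a minimal illustration of the idea rather than the authors' code: the `llm` and `search` functions are hypothetical stand-ins for a real LLM API and web tools, stubbed here so the sketch runs end to end.

```python
# Minimal sketch of WebExplorer-style model-based exploration.
# `llm` and `search` are hypothetical stand-ins for a real LLM API
# and a web search tool; they are stubbed for demonstration only.

def llm(prompt: str) -> str:
    """Stub LLM: returns a canned action or QA pair."""
    if "Decide the next action" in prompt:
        return "SEARCH: Brazil National Team World Cup titles"
    return "Q: How many World Cups has Brazil won?\nA: Five"

def search(query: str) -> str:
    """Stub web search tool."""
    return f"[search results for: {query}]"

def explore(seed_topic: str, max_steps: int = 3) -> dict:
    """Explore freely from a seed topic, accumulating context across
    search actions, then draft an initial question-answer pair."""
    context = [f"Seed topic: {seed_topic}"]
    for _ in range(max_steps):
        action = llm("Decide the next action given:\n" + "\n".join(context))
        if action.startswith("SEARCH:"):
            context.append(search(action[len("SEARCH:"):].strip()))
    qa = llm("Write a QA pair grounded in:\n" + "\n".join(context))
    question, answer = qa.split("\nA: ")
    return {"question": question[len("Q: "):], "answer": answer}
```

The key design point is that the "knowledge graph" is never materialized: the accumulated context plays that role implicitly, which is what lets the framework skip predefined rules.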

In the second stage, the framework makes these initial questions more difficult. The researchers observed that the first-pass questions, while requiring information from multiple sites, were still too easy because they contained obvious clues like specific dates, names, or locations. Inspired by the design of tough benchmarks like BrowseComp, the framework begins an "evolution" process that systematically removes these salient clues and replaces explicit details with vague descriptions. This "long-to-short" evolution, which reduces information rather than adding it, forces the agent to learn deeper reasoning and exploratory search skills.
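The evolution step can be illustrated with a toy version of the same idea: repeatedly find one salient clue and replace it with a vaguer description. A real pipeline would ask an LLM to perform each rewrite; the regex-based rules below are a deterministic stand-in of my own so the sketch runs, not the paper's method.

```python
import re

# Toy "long-to-short" query evolution: strip one salient clue per round
# (an explicit year, a named entity) and substitute a vague description.
# The rewrite rules are illustrative stand-ins for LLM-driven rewrites.
VAGUE_REWRITES = [
    (re.compile(r"\b(19|20)\d{2}\b"), "some year in recent decades"),
    (re.compile(r"\bPelé\b"), "a famous forward"),
]

def evolve_query(question: str, max_rounds: int = 5) -> str:
    """Remove one salient clue per round until none remain."""
    for _ in range(max_rounds):
        for pattern, vague in VAGUE_REWRITES:
            if pattern.search(question):
                question = pattern.sub(vague, question, count=1)
                break
        else:
            break  # no clue matched: the query is fully obscured
    return question
```

For example, `evolve_query("Which title did Pelé win in 1970?")` yields `"Which title did a famous forward win in some year in recent decades?"`, forcing a solver to search and reason rather than look up the explicit facts.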

The researchers used this method to create the WebExplorer-QA dataset, which contains approximately 40,000 challenging question-answer pairs. The data is designed to train agents that can handle ambiguity and perform the kind of multi-step reasoning needed to solve complex real-world problems.

The framework then uses this data in a two-phase training recipe. First, supervised fine-tuning (SFT) provides a "cold start," teaching the model the fundamental skills of using search and browse tools and breaking down complex questions into smaller steps. This is followed by a reinforcement learning (RL) stage, where the model is given more freedom to explore different solution paths on its own. The RL phase further enhances its reasoning capabilities and ability to handle long, multi-step tasks, ultimately developing more advanced problem-solving behaviors with support for context windows of up to 128,000 tokens.
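The two-phase recipe could be expressed as a configuration like the following. Field names here are illustrative assumptions; only the SFT-then-RL ordering, the base model, and the 128K context length come from the article.

```python
# Illustrative config for the two-phase recipe. Field names are
# assumptions; only the phase ordering, base model, and 128K context
# length are taken from the described approach.
RECIPE = {
    "model": "Qwen3-8B",
    "max_context_tokens": 128_000,
    "phases": [
        {
            # Cold start: teach tool use and question decomposition
            "name": "sft_cold_start",
            "data": "WebExplorer-QA trajectories",
            "objective": "supervised_fine_tuning",
        },
        {
            # Then let the model explore solution paths on its own
            "name": "rl_exploration",
            "data": "WebExplorer-QA queries",
            "objective": "reinforcement_learning",
        },
    ],
}

def phase_order(recipe: dict) -> list[str]:
    """Return the training objectives in execution order."""
    return [p["objective"] for p in recipe["phases"]]
```

The ordering matters: SFT first gives the model competent tool-use behavior, so the subsequent RL stage explores from a reasonable starting policy instead of from scratch.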

WebExplorer in action

To test their framework, the researchers used the WebExplorer-QA data and the training recipe to fine-tune the open-source Qwen3-8B model. They then evaluated the resulting model, WebExplorer-8B, against a suite of proprietary and open-source models on several information-seeking benchmarks, including BrowseComp, GAIA, and WebWalkerQA.

The results show that WebExplorer-8B is remarkably efficient. It established a new state-of-the-art performance for its size, and consistently outperformed much larger open-source models. On the challenging BrowseComp benchmarks, it surpassed WebSailor-72B, a model nearly ten times its size. The model also demonstrated strong generalization, performing well on the difficult Humanity’s Last Exam (HLE) benchmark even though its training data was not focused on STEM questions.

WebExplorer-8B performance on key web search and browsing benchmarks (source: GitHub)

The findings have practical implications for real-world applications. The WebExplorer framework provides a recipe for creating specialized, high-performing web agents without the prohibitive costs of manual data labeling. The ability to train smaller, more efficient models that can outperform larger ones is a major advantage for practical deployment, reducing both computational costs and inference latency. Such models could be customized to interact with internal knowledge bases, perform complex market analysis, or automate deep research tasks.

As the paper concludes, “The success of WebExplorer demonstrates the potential of autonomously synthesizing challenging information-seeking QA pairs and leveraging supervised fine-tuning and reinforcement learning to build advanced, long-horizon web agents.”