Microsoft Research has developed a new reinforcement learning framework that trains large language models for complex reasoning tasks at a fraction of the usual computational cost. The framework, called rStar2-Agent, uses a combination of algorithmic innovations and software optimizations to make the training process more efficient, requiring less data and making better use of hardware.

A 14 billion-parameter model fine-tuned with rStar2-Agent shows near state-of-the-art performance, surpassing the much larger 671B-parameter DeepSeek-R1 on key math benchmarks while generating shorter, more concise answers. For enterprises, the findings point to a new path for developing more reliable and cost-effective AI agents and getting more value out of smaller, open-source models.

From ‘thinking longer’ to ‘thinking smarter’

Current AI models often improve their reasoning by generating longer chains of thought (CoT), essentially "thinking longer" about a problem. While effective, this approach has its limits, especially for difficult problems where a single mistake in a long reasoning chain can derail the entire process. In these cases, models rely on internal self-reflection, which often fails to spot errors or correct a flawed approach. Microsoft’s researchers propose a shift from "thinking longer" to "thinking smarter" by giving models advanced cognitive abilities to use tools, validate their work, and learn from feedback.

This smarter approach is achieved through a method called "agentic reinforcement learning." The model acts as an agent that interacts with tools in a dedicated environment, adapting its reasoning based on the feedback it receives. The researchers focused on using Python code and its interpreter as the tool environment. This allows the model to explore alternative solutions, run calculations, and verify intermediate steps to complement vanilla CoT traces.

Agentic reinforcement learning uses external tools to verify its reasoning traces (source: arXiv)

In this setup, the model engages in a multi-turn dialogue with the code environment. It generates a piece of reasoning, calls a Python tool to execute a command, receives the output, and incorporates that feedback into its next step of reasoning, repeating the process until it arrives at a final answer.
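The multi-turn loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `model` callable and its `(reasoning, tool_code, final_answer)` return shape are hypothetical, and a subprocess stands in for the isolated code environment.

```python
import subprocess
import sys

def run_python_tool(code: str, timeout: float = 5.0) -> str:
    """Execute a code snippet in a separate process and return its output.
    (A simple stand-in for the sandboxed code environment in the paper.)"""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
        return result.stdout if result.returncode == 0 else result.stderr
    except subprocess.TimeoutExpired:
        return "ToolError: execution timed out"

def agent_loop(model, problem: str, max_turns: int = 8) -> str:
    """Multi-turn dialogue: reason, call the tool, fold the output back in.
    `model` is a hypothetical callable that takes the running context and
    returns (reasoning, tool_code_or_None, final_answer_or_None)."""
    context = problem
    for _ in range(max_turns):
        reasoning, tool_code, final_answer = model(context)
        context += "\n" + reasoning
        if final_answer is not None:   # model decided it is done
            return final_answer
        if tool_code is not None:      # model requested a tool call
            tool_output = run_python_tool(tool_code)
            context += "\n<tool_output>\n" + tool_output + "\n</tool_output>"
    return context  # ran out of turns without a final answer
```

The key design point is that tool output is appended to the context verbatim, so the model's next reasoning step can react to real execution results rather than its own guesses.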

However, scaling this approach presents significant challenges. The complexity of programming tools can introduce noise into the process. For example, an error message from incorrect code can distract the model from the primary reasoning task. Furthermore, the infrastructure required for large-scale agentic training is demanding, as a single training batch can trigger tens of thousands of tool calls that need to be managed efficiently and safely.

How the rStar2-Agent framework works

To overcome these inherent challenges, the rStar2-Agent framework implements three key innovations:

Efficient and reliable infrastructure: Building a scalable environment for tool-using agents was a major engineering hurdle. Li Lyna Zhang, a principal researcher at Microsoft and co-author of the paper, told VentureBeat that early attempts were plagued by practical issues. "Each training step could trigger tens of thousands of tool calls," Zhang said. "At first, simple parallel execution caused lost requests, which misled the model, and CPUs were overwhelmed while GPUs sat idle, slowing rollouts." She also highlighted the risk of "unpredictable and potentially unsafe outputs from the LLM," which could destabilize the entire system.

To solve this, rStar2-Agent features a high-throughput, isolated code environment capable of handling up to 45,000 concurrent tool calls per step with an average latency of 0.3 seconds. It also incorporates a load-balanced scheduler that dynamically allocates requests across GPUs, addressing an inefficiency of RL training that arises from the varying lengths of reasoning paths (also referred to as “rollouts”). The scheduler manages asynchronous tool calls and balances the computational load so that no device sits idle while others finish their requests. This robust infrastructure enables efficient RL training even with limited GPU resources.
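The load-balancing idea can be reduced to a simple heuristic: always send the next rollout to the least-loaded worker. The sketch below is a deliberate simplification, assuming token count as a proxy for load; the actual scheduler also manages asynchronous tool calls and GPU memory.

```python
import heapq

class LoadBalancedScheduler:
    """Dispatch each rollout to the currently least-loaded worker, so a few
    long reasoning paths don't leave the other devices idle.
    (Hypothetical simplification of the scheduler described in the paper.)"""

    def __init__(self, n_workers: int):
        # min-heap of (pending_token_load, worker_id)
        self.heap = [(0, w) for w in range(n_workers)]
        heapq.heapify(self.heap)

    def assign(self, rollout_len: int) -> int:
        """Return the worker id for a rollout of the given expected length."""
        load, worker = heapq.heappop(self.heap)
        heapq.heappush(self.heap, (load + rollout_len, worker))
        return worker
```

With this policy, a worker that received one very long rollout naturally stops receiving new work until the others catch up.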

Group Relative Policy Optimization with Resampling on Correct (GRPO-RoC): This innovation builds upon Group Relative Policy Optimization (GRPO), a foundational reinforcement learning algorithm used in models such as DeepSeek-R1 and Phi-4. In standard GRPO, the model generates groups of reasoning paths for a given problem and receives a simple binary reward (correct or incorrect) for the final answer. The algorithm then optimizes the model's policy based on the relative success of these rollouts within their group. However, traditional GRPO faces challenges in agentic reinforcement learning, particularly with noisy feedback from code environments.
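The "group-relative" part of standard GRPO can be shown in a few lines: each rollout's binary reward is normalized against its own group's statistics, which replaces the separate value (critic) model used in classic policy-gradient setups. This is a generic sketch of that normalization step, not the full optimization loop.

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages as in GRPO: normalize each rollout's
    reward against the mean and standard deviation of its own group,
    so 'better than the group' is what gets reinforced."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal groups
    return [(r - mean) / std for r in rewards]
```

For a group with binary rewards, correct rollouts end up with positive advantages and incorrect ones with negative advantages of equal magnitude, no critic required.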

GRPO-RoC extends this by addressing environmental noise and improving the quality of learning signals. It achieves this through its "Resample on Correct" (RoC) strategy. RoC first oversamples a large group of rollouts and then selects a subset to create a training batch. It filters positive trajectories to retain only the highest-quality ones that have minimal tool-induced errors or formatting issues. At the same time, it downsamples failed trajectories to preserve diverse failure modes as valuable learning signals.
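The asymmetric selection at the heart of RoC can be sketched as follows. The rollout fields (`correct`, `tool_errors`) and the half-positive, half-negative split are illustrative assumptions, not the paper's exact criteria, which also account for formatting issues.

```python
import random

def resample_on_correct(rollouts: list[dict], batch_size: int) -> list[dict]:
    """Sketch of Resample-on-Correct: from an oversampled pool, keep only
    the cleanest positive trajectories (fewest tool-induced errors) while
    downsampling negatives uniformly to preserve diverse failure modes.
    Each rollout is a hypothetical dict with 'correct' (bool) and
    'tool_errors' (int) fields."""
    positives = [r for r in rollouts if r["correct"]]
    negatives = [r for r in rollouts if not r["correct"]]

    # Positives are filtered by quality: fewest tool errors first.
    positives.sort(key=lambda r: r["tool_errors"])
    k = batch_size // 2
    kept_pos = positives[:k]

    # Negatives are sampled uniformly, keeping varied failures as signal.
    kept_neg = random.sample(negatives, min(k, len(negatives)))
    return kept_pos + kept_neg
```

The asymmetry is the point: success is held to a high standard, while failure is kept diverse, so the model learns what clean success looks like without losing information about how things go wrong.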

GRPO-RoC outperforms classic GRPO as the number of training steps increases (source: arXiv)

In an enterprise setting, this focus on clean, high-quality reasoning translates directly to more reliable applications. Zhang offers a clear example: "An automated code agent generating a data processing script might otherwise produce long, error-filled code requiring multiple corrections. With GRPO-RoC, the model learns to generate concise, correct code that runs successfully on the first try, keeping the workflow smooth and predictable. This makes outputs more reliable and stable, which is crucial in enterprise applications."

A tailored training recipe: rStar2-Agent employs a unique training recipe that minimizes computational requirements. Instead of starting with complex reasoning problems, the initial training stage (a non-reasoning supervised fine-tuning, or SFT) focuses on teaching the model the basics: general instruction following and how to correctly format and use coding tools. This avoids overfitting the model to specific reasoning patterns early on. Following this, the model undergoes a multi-stage RL training process where the problem difficulty and maximum response length are gradually increased. Unlike other methods that require very long response lengths (16,000 tokens or more), this approach starts with shorter lengths (8,000 tokens) and gradually scales up (to 12,000 tokens) across stages, which is made possible by the efficiency of the GRPO-RoC algorithm.
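The staged curriculum described above can be captured as a small configuration. The stage boundaries and difficulty labels below are hypothetical placeholders; only the overall shape (start at 8,000 tokens, scale to 12,000, increase difficulty across stages) comes from the article.

```python
# Hypothetical staging of the multi-stage RL recipe: response-length
# budgets and problem difficulty grow across stages, starting well below
# the 16,000+ tokens other methods require from the outset.
STAGES = [
    {"stage": 1, "max_response_tokens": 8_000,  "difficulty": "moderate"},
    {"stage": 2, "max_response_tokens": 12_000, "difficulty": "moderate"},
    {"stage": 3, "max_response_tokens": 12_000, "difficulty": "hard"},
]
```

Keeping early stages short is what makes the recipe cheap: most of the training steps run with a smaller generation budget, and only later stages pay for longer rollouts.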

A small model that punches above its weight

To test the framework, the researchers fine-tuned the 14 billion-parameter Qwen3-14B-base model on 64 AMD MI300X GPUs. The entire process took just one week and only 510 RL training steps, a stark contrast to other methods that can require thousands of steps. The results demonstrate that a relatively small model can achieve top-tier reasoning performance with minimal compute.

Their findings show that rStar2-Agent significantly boosts the 14-billion parameter base model to state-of-the-art levels, matching and even surpassing more heavily trained and much larger frontier LLMs. On the AIME24 benchmark, rStar2-Agent-14B achieved an average accuracy of 80.6%, outperforming OpenAI's o3-mini, DeepSeek-R1, and Claude Opus 4.0 (thinking).

"These results are especially notable given the small 14B scale and highly cost-effective training compute," the researchers note in their paper.

rStar2-Agent-14B punches above its weight with fewer parameters and a shorter training cycle (source: arXiv)

While the results from the 14B model are impressive, the researchers emphasize that the true innovation lies in the method itself, which is not tied to a specific model size. "We ran our experiments on the 14B model mainly to demonstrate the effectiveness and superiority of the rStar2-Agent approach," Zhang explained. "We expect that applying rStar2-Agent to larger models would deliver even stronger reasoning performance." For enterprises, this means the framework offers a dual advantage: it can be used to create highly efficient, specialized small models, but it also provides a path to developing next-generation, state-of-the-art large models with the same principles of reliability and efficiency.

The trained model also demonstrated "smarter reasoning" by using fewer tokens. On challenging math benchmarks, rStar2-Agent-14B achieved higher accuracy than its larger counterparts while generating significantly shorter responses. The researchers note that “by reinforcing higher-quality positive trajectories, our model has effectively learned to use coding tools more intelligently to reason more efficiently.” This efficiency is a critical factor for enterprise applications, as shorter responses translate directly to lower inference costs and faster performance.

Finally, the model showed strong generalization capabilities. Despite being trained exclusively on math problems, it performed well on a diverse set of tasks, including scientific reasoning and agentic tool use. On the GPQA-Diamond science benchmark, it outperformed DeepSeek-V3, showing that the reasoning skills learned in mathematics can be effectively transferred to other domains.

Looking ahead, the researchers see this agentic, tool-centric approach extending far beyond math into other complex, high-value domains. "In drug discovery, it could access chemical and biological databases and run simulations," Zhang suggested, also pointing to applications in legal analysis and financial modeling.

However, she also noted that moving from the structured world of a Python interpreter to more ambiguous, real-world enterprise tools presents the next set of challenges. These tools introduce more "environment noise" and require specialized, reliable environments for the LLM to interact with. Successfully navigating this complexity will be the key to unlocking the next wave of agentic AI in the enterprise.