In a new paper, researchers from Tencent AI Lab Seattle and the University of Maryland, College Park, present a reinforcement learning technique that enables large language models (LLMs) to utilize inference-time scaling more effectively in solving complex reasoning problems.

The technique, called Parallel-R1, uses a special data generation pipeline and a multi-stage training process to instill “parallel thinking” into LLMs. This allows them to branch into multiple reasoning paths while generating answers, ultimately leading to more robust and accurate conclusions.

Parallel thinking, which is now being used in some closed-source frontier models, promises to unlock greater reasoning power from existing models through efficient scaling at the time of use, without the need for expensive, manually labeled training data.

The challenges of parallel thinking

The idea of exploring multiple lines of reasoning at once has shown significant value, with Google recently crediting the success of its Gemini Deep Think model at the International Mathematical Olympiad in part to this capability. 

Early AI strategies to mimic this behavior involved brute-force approaches, where a model generates multiple independent answers from the start and picks the most consistent one, often referred to as “best of N.” More nuanced methods, such as Monte Carlo Tree Search and Tree of Thoughts, offer finer-grained control over the reasoning and voting process, but they often depend on handcrafted rules and external systems to guide the reasoning process.
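The "best of N" idea can be sketched in a few lines: sample several independent answers and keep the most frequent one (often called self-consistency voting). The sampled answers below are hypothetical stand-ins for N independent model generations.

```python
from collections import Counter

def best_of_n(samples: list[str]) -> str:
    """Majority vote over N independently sampled final answers."""
    return Counter(samples).most_common(1)[0][0]

# Hypothetical final answers from five independent generations.
samples = ["42", "17", "42", "42", "23"]
print(best_of_n(samples))  # → 42
```

Note that this votes only on final answers; the tree-search methods mentioned above instead intervene inside the reasoning process itself.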

More recent efforts have focused on teaching models this skill directly through training. However, these methods face significant hurdles. Training a model via supervised fine-tuning (SFT), where it learns from pre-written examples, depends entirely on the quality of that data. High-quality data showing parallel thought processes for complex, real-world problems is extremely rare and difficult to create. This often forces the model to simply mimic patterns in the training data rather than acquiring a deep, generalizable parallel thinking skill.

Reinforcement learning (RL), where a model learns through trial and error, offers a more scalable path. But it comes with its own set of problems. LLMs are not pre-trained to think in parallel, so they don't naturally produce the kind of exploratory reasoning paths they need to learn from (a classic "cold-start" problem). Furthermore, designing the right reward system is tricky. If a model is rewarded only for getting the final answer right, it might learn to take shortcuts and abandon the more complex parallel thinking strategy. Conversely, if it's forced to think in parallel, it might do so in situations where it's unnecessary, hurting efficiency and performance.

How Parallel-R1 works

The Parallel-R1 framework is designed to overcome these challenges. The researchers describe it as the "first reinforcement learning (RL) framework that enables parallel thinking behaviors for complex real-world reasoning tasks."

"The key insight of our approach is to bypass the need for the complex data pipelines often considered essential for generating training data on final challenging problems," the researchers write.

Parallel-R1 (source: arXiv)

At its core, the approach formalizes parallel thinking into two stages: "Exploration," where the model launches multiple independent reasoning threads when it detects a critical step, and "Summary," where it aggregates the outcomes from these threads to form a conclusion before resuming its main line of thought. At inference time, the model generates text until it produces a special <Parallel> tag, at which point it branches out into different <Path> blocks. Once complete, it generates a <Summary> of the findings and continues. A model trained through Parallel-R1 can perform this branching and converging process multiple times when generating the response to a prompt.
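The <Parallel>, <Path>, and <Summary> tag names come from the paper; how a harness might inspect a response built from them is not specified, so the parser below is an illustrative assumption, not the authors' code.

```python
import re

def extract_parallel_blocks(text: str) -> list[dict]:
    """Collect each <Parallel> block's <Path> threads and its <Summary>.
    Illustrative sketch: assumes well-formed, non-nested tags."""
    blocks = []
    for block in re.findall(r"<Parallel>(.*?)</Parallel>", text, re.DOTALL):
        paths = re.findall(r"<Path>(.*?)</Path>", block, re.DOTALL)
        summary = re.search(r"<Summary>(.*?)</Summary>", block, re.DOTALL)
        blocks.append({
            "paths": [p.strip() for p in paths],
            "summary": summary.group(1).strip() if summary else None,
        })
    return blocks

# A toy response in the format described above.
response = (
    "Let x be the unknown. <Parallel>"
    "<Path>Try substitution: x = 3 satisfies the equation.</Path>"
    "<Path>Try factoring: (x - 3)(x + 1) = 0.</Path>"
    "<Summary>Both threads agree that x = 3 is a root.</Summary>"
    "</Parallel> Therefore x = 3."
)
blocks = extract_parallel_blocks(response)
print(len(blocks), len(blocks[0]["paths"]))  # → 1 2
```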

To instill this ability, the researchers developed a three-stage training recipe. First is the "Cold-Start Stage," where the model is fine-tuned on a custom dataset of AI-generated parallel-thinking examples. This initial step teaches the model the basic format of parallel reasoning. Next, in the "RL on Easy Math" stage, the framework applies reinforcement learning to the same dataset to stabilize this new behavior, using a dual reward system that incentivizes both correctness and the proper use of the parallel structure. Finally, in the "RL on General Math" stage, the model is trained on new and more difficult math problems to generalize its parallel thinking skill to more complex scenarios.
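The three stages can be summarized as a simple schedule. The sketch below is runnable but purely illustrative: the trainer functions are stubs that only record what each stage would do, and all names and signatures are assumptions, not the paper's implementation.

```python
def supervised_finetune(model: dict, data, note: str) -> dict:
    # Stub: a real implementation would run SFT here.
    model["stages"].append(("sft", note))
    return model

def reinforce(model: dict, data, reward: str, note: str) -> dict:
    # Stub: a real implementation would run an RL loop here.
    model["stages"].append(("rl", note))
    return model

def train_parallel_r1(model, parallel_sft_data, easy_math, general_math):
    # Stage 1: cold start -- SFT on AI-generated parallel-thinking
    # examples teaches the model the <Parallel>/<Path>/<Summary> format.
    model = supervised_finetune(model, parallel_sft_data, "cold-start")
    # Stage 2: RL on the same easy problems stabilizes the behavior,
    # with a dual reward for correctness and structure.
    model = reinforce(model, easy_math, reward="dual", note="easy-math")
    # Stage 3: RL on harder problems generalizes the skill.
    model = reinforce(model, general_math, reward="dual", note="general-math")
    return model

model = train_parallel_r1({"stages": []}, [], [], [])
print([note for _, note in model["stages"]])  # → ['cold-start', 'easy-math', 'general-math']
```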

Parallel-R1 training pipeline (source: arXiv)

A key innovation lies in how the initial "cold-start" data is created. Instead of relying on complex data pipelines, the team found that a powerful LLM could generate high-quality parallel reasoning examples for simple problems using straightforward prompts. In their experiments, the researchers used a distilled version of DeepSeek-R1 to generate around 7,000 parallel-thinking examples from the GSM8K math problem dataset. Crucially, they made "the strategic choice to use this ’cold-start’ data not to teach the model how to solve the final target tasks, but specifically to teach it the format of parallel thinking."

Another important part of the framework is its reward function. To solve the reward design challenge, the team developed an alternating reward strategy that switches between rewarding the model for final answer accuracy and for correctly using the parallel thinking structure. According to the paper, "this approach achieves a superior balance between high performance and consistent utilization of parallel thinking compared to using a single reward type alone."
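The alternation idea can be made concrete: on some RL steps the reward is answer accuracy, on others it is correct use of the parallel structure. The step-parity schedule and the structure check below are illustrative assumptions, not the paper's exact reward implementation.

```python
import re

def structure_reward(response: str) -> float:
    """1.0 if the response contains at least one <Parallel> block
    with two or more <Path> threads and a <Summary>, else 0.0."""
    for block in re.findall(r"<Parallel>(.*?)</Parallel>", response, re.DOTALL):
        paths = re.findall(r"<Path>.*?</Path>", block, re.DOTALL)
        if len(paths) >= 2 and re.search(r"<Summary>.*?</Summary>", block, re.DOTALL):
            return 1.0
    return 0.0

def accuracy_reward(answer: str, gold: str) -> float:
    return 1.0 if answer.strip() == gold.strip() else 0.0

def alternating_reward(step: int, response: str, answer: str, gold: str) -> float:
    # Assumed schedule: even steps reward correctness,
    # odd steps reward the parallel-thinking format.
    if step % 2 == 0:
        return accuracy_reward(answer, gold)
    return structure_reward(response)

good = "<Parallel><Path>a</Path><Path>b</Path><Summary>s</Summary></Parallel>"
print(alternating_reward(0, good, "4", "4"), alternating_reward(1, good, "5", "4"))  # → 1.0 1.0
```

Because the model cannot satisfy both reward types by collapsing to one behavior, it is pushed to keep both its accuracy and its use of the parallel structure.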

Parallel-R1 performance

Parallel-R1 shows significant improvement on key reasoning benchmarks (source: arXiv)

The researchers tested their framework by training the open-source Qwen-3-4B-Base model with Parallel-R1 and evaluating it on four standard mathematical reasoning benchmarks, including AIME, AMC, and MATH. The results show that the model trained with Parallel-R1 consistently outperformed baselines, including a model trained with a standard RL approach.

For real-world applications, Parallel-R1 offers a way to draw more reasoning power out of existing AI systems. Scaling capability at inference time, rather than simply scaling up model size, is a more efficient and practical path to deploying advanced reasoning in enterprise applications.