Parallel-R1: Towards Parallel Thinking via Reinforcement Learning

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Reasoning Paradigms, Parallel Thinking, RL, LLM
Abstract:

Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging. Existing methods mainly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced learning rather than exploration and generalization. To address this issue, we propose Parallel-R1, the first reinforcement learning (RL) framework that instills parallel thinking for complex real-world reasoning tasks. Our framework employs a progressive curriculum that addresses the cold-start problem of training parallel thinking with RL: we first use SFT on prompt-generated trajectories from easier tasks to instill the parallel-thinking behavior, then transition to RL to explore and generalize this skill on harder problems. Experiments on several math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully elicits parallel thinking, yielding an 8.4% accuracy improvement over a sequential-thinking model trained directly on difficult tasks with RL. Further analysis reveals a distinct shift in the model's thinking patterns: in the early stage, it uses parallel thinking as an exploration strategy, while in the later stage, it employs the same ability for multi-perspective verification. Most significantly, we validate parallel thinking as a mid-training exploration scaffold, where this intermediate phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the sequential RL baseline.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Parallel-R1, a reinforcement learning framework for instilling parallel thinking in large language models for mathematical reasoning. It resides in the 'Parallel Thinking via Reinforcement Learning' leaf, which contains only three papers including this work. This represents a relatively sparse research direction within the broader taxonomy of 46 papers across 36 topics, suggesting that RL-driven parallel reasoning remains an emerging area. The sibling papers in this leaf—DeepSeek-R1 and Logic-RL—share the core methodology of using RL to optimize multi-path reasoning, indicating a small but focused cluster of work.

The taxonomy reveals that parallel reasoning methods occupy one major branch, while sequential and adaptive reasoning optimization forms another substantial direction with five subtopics. Neighboring leaves include 'Adaptive Parallel Reasoning Frameworks' (2 papers) and 'Multi-Sample Aggregation' (1 paper), both exploring concurrent reasoning but without the RL-centric training focus. The 'Sequential and Adaptive Reasoning Optimization' branch, particularly 'Pure Reinforcement Learning for Sequential Reasoning,' represents an alternative paradigm that optimizes single-path reasoning rather than concurrent exploration. Parallel-R1 diverges from these by combining RL with explicit parallel path generation.

Among the 30 candidates examined, the contribution-level analysis shows mixed novelty signals. For the core RL framework for parallel thinking (Contribution 1), 10 candidates were examined with zero refutations, suggesting limited direct prior work on this specific formulation. The progressive training curriculum (Contribution 2), however, yielded 2 refutable candidates among the 10 examined, indicating some overlap with existing curriculum or staged training approaches. The third contribution, using parallel thinking as an exploration scaffold, also showed no refutations across 10 candidates. These statistics reflect a targeted search scope rather than exhaustive coverage, leaving open the possibility of additional relevant work beyond the top-30 semantic matches.

Based on the limited search scope, the work appears to occupy a relatively novel position within RL-driven parallel reasoning, though the curriculum training component shows more substantial prior art. The sparse population of the taxonomy leaf and low refutation rates for two of three contributions suggest meaningful differentiation from existing methods. However, the analysis covers only top-30 semantic matches and does not capture potential overlap in broader RL training literature or parallel reasoning architectures outside the examined candidates.

Taxonomy

Core-task Taxonomy Papers: 46
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: Instilling parallel thinking in large language models via reinforcement learning.

The field has evolved into several major branches that reflect different strategies for enhancing LLM reasoning capabilities. Parallel Reasoning Architectures and Training Methods focus on enabling models to explore multiple reasoning paths simultaneously, often through RL-based optimization techniques such as those in Parallel-R1[0] and DeepSeek-R1[2]. Sequential and Adaptive Reasoning Optimization emphasizes dynamic adjustment of reasoning depth and strategy selection, as seen in works like Adaptive Parallel Reasoning[5] and Elastic Reasoning[39]. Domain-Specific Reasoning Applications tailor these methods to particular problem settings, while Multi-Agent and Collaborative Reasoning Systems explore how multiple models or agents can coordinate their thinking. Cognitive Architectures and Theoretical Frameworks, including studies like Dual Process Thinking[13] and LLM Cognitive Architecture[35], draw inspiration from human cognition to structure model reasoning. Surveys and empirical studies such as LLM Reasoning Survey[17] and Slow Thinking Survey[19] provide broader perspectives, and Specialized Training techniques refine optimization strategies across these paradigms.

A particularly active line of work centers on whether to pursue native parallel exploration or adaptive sequential refinement. Parallel approaches like Parallel Reasoning[1] and Native Parallel Reasoner[22] enable models to maintain multiple hypotheses concurrently, trading computational cost for broader search coverage. In contrast, adaptive methods such as ReMA[4] and Think Twice[23] dynamically allocate reasoning effort based on problem difficulty. Parallel-R1[0] sits squarely within the parallel RL-driven branch, sharing methodological DNA with DeepSeek-R1[2] and Logic-RL[3] in using reinforcement learning to train models that generate and evaluate multiple reasoning trajectories. Compared to Native Parallel Reasoner[22], which emphasizes architectural changes for inherent parallelism, Parallel-R1[0] focuses more directly on RL-based policy optimization to instill parallel exploration behaviors.

Open questions remain about the scalability of parallel search, the interpretability of learned reasoning strategies, and how to best balance exploration breadth with computational efficiency across diverse task domains.

Claimed Contributions

Parallel-R1: First RL framework for parallel thinking on general mathematical reasoning

The authors introduce Parallel-R1, the first reinforcement learning framework that instills parallel thinking capabilities in large language models for complex real-world mathematical reasoning tasks. This is achieved through a progressive training curriculum that starts with supervised fine-tuning on easier problems and transitions to RL on harder tasks, combined with carefully designed reward mechanisms.

Retrieved papers: 10
Progressive training curriculum with lightweight data pipeline

The authors develop a progressive multi-stage training approach that addresses the cold-start problem by first using supervised fine-tuning on simple tasks (GSM8K) to teach basic parallel thinking formats, then applying reinforcement learning on more difficult problems to generalize the capability. This includes a lightweight data pipeline that generates high-quality parallel thinking trajectories through prompting on easier problems.

Retrieved papers: 10
Status: Can Refute
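The staged curriculum this contribution describes (SFT on easy, prompt-generated parallel-thinking trajectories, then RL on harder problems) can be sketched as a simple stage schedule. This is a minimal illustrative sketch, not the authors' implementation: the stage names, step threshold, and two-pool data split are all assumptions for illustration.

```python
# Minimal sketch of a progressive curriculum: a fixed SFT warm-up on
# easy problems to instill the parallel-thinking format, followed by
# RL on harder problems. All names and thresholds are illustrative.

def curriculum_stage(step: int, sft_steps: int = 1000) -> str:
    """Return the training stage active at a given global step."""
    return "sft_easy_parallel" if step < sft_steps else "rl_hard"

def select_batch(step: int, easy_pool: list, hard_pool: list,
                 sft_steps: int = 1000) -> list:
    """Pick the data pool matching the current curriculum stage."""
    if curriculum_stage(step, sft_steps) == "sft_easy_parallel":
        # Stage 1: supervised fine-tuning on prompt-generated
        # parallel-thinking trajectories from easier tasks (e.g. GSM8K).
        return easy_pool
    # Stage 2: reinforcement learning on harder problems, where the
    # model explores and generalizes the parallel-thinking behavior.
    return hard_pool
```

The key design point is that the format is cheap to teach via SFT on easy data, while the generalization of the behavior is left to RL, which avoids the cold-start failure of applying RL to an untrained format directly.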
Parallel thinking as mid-training exploration scaffold

The authors identify and validate a novel concept where parallel thinking serves as an exploration scaffold during the intermediate training phase. This approach uses parallel thinking to encourage broader exploration early in training, then transitions to sequential reasoning for exploitation, resulting in substantial performance improvements even after the parallel structure is no longer explicitly used.

Retrieved papers: 10
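One way to read the scaffold idea is as a reward schedule in which a structural bonus for parallel thinking is active only during an intermediate training window, after which the objective reverts to plain answer accuracy. The sketch below is a hypothetical illustration of that reading; the window boundaries, bonus weight, and reward shape are assumptions, not details taken from the paper.

```python
# Hypothetical reward schedule for a mid-training exploration scaffold:
# a small bonus rewards parallel-thinking structure only inside an
# intermediate window; outside the window, accuracy alone counts.

def scaffold_reward(step: int, correct: bool, used_parallel: bool,
                    window: tuple = (1000, 3000),
                    bonus: float = 0.1) -> float:
    """Accuracy reward, plus a parallel-structure bonus mid-training."""
    r = 1.0 if correct else 0.0
    lo, hi = window
    if lo <= step < hi and used_parallel:
        r += bonus  # encourage broad exploration during the window
    return r        # late training: pure exploitation of accuracy
```

Under such a schedule the policy is nudged toward parallel exploration mid-training, and the bonus silently disappears later, which is consistent with the report's observation that gains persist even after the parallel structure is no longer explicitly rewarded.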

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Parallel-R1: First RL framework for parallel thinking on general mathematical reasoning

The authors introduce Parallel-R1, the first reinforcement learning framework that instills parallel thinking capabilities in large language models for complex real-world mathematical reasoning tasks. This is achieved through a progressive training curriculum that starts with supervised fine-tuning on easier problems and transitions to RL on harder tasks, combined with carefully designed reward mechanisms.

Contribution

Progressive training curriculum with lightweight data pipeline

The authors develop a progressive multi-stage training approach that addresses the cold-start problem by first using supervised fine-tuning on simple tasks (GSM8K) to teach basic parallel thinking formats, then applying reinforcement learning on more difficult problems to generalize the capability. This includes a lightweight data pipeline that generates high-quality parallel thinking trajectories through prompting on easier problems.

Contribution

Parallel thinking as mid-training exploration scaffold

The authors identify and validate a novel concept where parallel thinking serves as an exploration scaffold during the intermediate training phase. This approach uses parallel thinking to encourage broader exploration early in training, then transitions to sequential reasoning for exploitation, resulting in substantial performance improvements even after the parallel structure is no longer explicitly used.