Parallel-R1: Towards Parallel Thinking via Reinforcement Learning

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Reasoning Paradigms, Parallel Thinking, RL, LLM
Abstract:

Parallel thinking has emerged as a novel approach for enhancing the reasoning capabilities of large language models (LLMs) by exploring multiple reasoning paths concurrently. However, activating such capabilities through training remains challenging. Existing methods mainly rely on supervised fine-tuning (SFT) over synthetic data, which encourages teacher-forced learning rather than exploration and generalization. To address this issue, we propose Parallel-R1, the first reinforcement learning (RL) framework that instills parallel thinking for complex real-world reasoning tasks. Our framework employs a progressive curriculum that addresses the cold-start problem of training parallel thinking with RL: we first use SFT on prompt-generated trajectories from easier tasks to instill the parallel-thinking behavior, then transition to RL to explore and generalize this skill on harder problems. Experiments on several math benchmarks, including MATH, AMC23, and AIME, show that Parallel-R1 successfully elicits parallel thinking, yielding an 8.4% accuracy improvement over a sequential-thinking model trained directly on difficult tasks with RL. Further analysis reveals a distinct shift in the model's thinking patterns: in the early stage, it uses parallel thinking as an exploration strategy, while in the later stage, it employs the same ability for multi-perspective verification. Most significantly, we validate parallel thinking as a mid-training exploration scaffold, where this intermediate phase unlocks a higher performance ceiling after RL, yielding a 42.9% improvement over the sequential RL baseline.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Parallel-R1, a reinforcement learning framework for instilling parallel thinking in large language models for mathematical reasoning. It resides in the 'Parallel Thinking via Reinforcement Learning' leaf, which contains only three papers including this work. This represents a relatively sparse research direction within the broader taxonomy of 46 papers across 36 topics, suggesting that RL-driven parallel reasoning remains an emerging area. The sibling papers in this leaf—DeepSeek-R1 and Logic-RL—share the core methodology of using RL to optimize multi-path reasoning, indicating a small but focused cluster of work.

The taxonomy reveals that parallel reasoning methods occupy one major branch, while sequential and adaptive reasoning optimization forms another substantial direction with five subtopics. Neighboring leaves include 'Adaptive Parallel Reasoning Frameworks' (2 papers) and 'Multi-Sample Aggregation' (1 paper), both exploring concurrent reasoning but without the RL-centric training focus. The 'Sequential and Adaptive Reasoning Optimization' branch, particularly 'Pure Reinforcement Learning for Sequential Reasoning,' represents an alternative paradigm that optimizes single-path reasoning rather than concurrent exploration. Parallel-R1 diverges from these by combining RL with explicit parallel path generation.

Among the 30 candidates examined, the contribution-level analysis shows mixed novelty signals. For the core RL framework for parallel thinking (Contribution 1), 10 candidates were examined with zero refutations, suggesting limited direct prior work on this specific formulation. The progressive training curriculum (Contribution 2), however, yielded 2 refutable candidates among the 10 examined, indicating some overlap with existing curriculum or staged training approaches. The third contribution, using parallel thinking as an exploration scaffold, also showed no refutations across 10 candidates. These statistics reflect a targeted search scope rather than exhaustive coverage, leaving open the possibility of additional relevant work beyond the top-30 semantic matches.

Based on the limited search scope, the work appears to occupy a relatively novel position within RL-driven parallel reasoning, though the curriculum training component shows more substantial prior art. The sparse population of the taxonomy leaf and low refutation rates for two of three contributions suggest meaningful differentiation from existing methods. However, the analysis covers only top-30 semantic matches and does not capture potential overlap in broader RL training literature or parallel reasoning architectures outside the examined candidates.

Taxonomy

Core-task Taxonomy Papers: 46
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: Instilling parallel thinking in large language models via reinforcement learning.

The field has evolved into several major branches that reflect different strategies for enhancing LLM reasoning capabilities. Parallel Reasoning Architectures and Training Methods focus on enabling models to explore multiple reasoning paths simultaneously, often through RL-based optimization techniques such as those in Parallel-R1[0] and DeepSeek-R1[2]. Sequential and Adaptive Reasoning Optimization emphasizes dynamic adjustment of reasoning depth and strategy selection, as seen in works like Adaptive Parallel Reasoning[5] and Elastic Reasoning[39]. Domain-Specific Reasoning Applications tailor these methods to particular problem settings, while Multi-Agent and Collaborative Reasoning Systems explore how multiple models or agents can coordinate their thinking. Cognitive Architectures and Theoretical Frameworks, including studies like Dual Process Thinking[13] and LLM Cognitive Architecture[35], draw inspiration from human cognition to structure model reasoning. Surveys and empirical studies such as LLM Reasoning Survey[17] and Slow Thinking Survey[19] provide broader perspectives, and Specialized Training techniques refine optimization strategies across these paradigms.

A particularly active line of work centers on whether to pursue native parallel exploration or adaptive sequential refinement. Parallel approaches like Parallel Reasoning[1] and Native Parallel Reasoner[22] enable models to maintain multiple hypotheses concurrently, trading computational cost for broader search coverage. In contrast, adaptive methods such as ReMA[4] and Think Twice[23] dynamically allocate reasoning effort based on problem difficulty. Parallel-R1[0] sits squarely within the parallel RL-driven branch, sharing methodological DNA with DeepSeek-R1[2] and Logic-RL[3] in using reinforcement learning to train models that generate and evaluate multiple reasoning trajectories. Compared to Native Parallel Reasoner[22], which emphasizes architectural changes for inherent parallelism, Parallel-R1[0] focuses more directly on RL-based policy optimization to instill parallel exploration behaviors.

Open questions remain about the scalability of parallel search, the interpretability of learned reasoning strategies, and how to best balance exploration breadth with computational efficiency across diverse task domains.

Claimed Contributions

Parallel-R1: First RL framework for parallel thinking on general mathematical reasoning

The authors introduce Parallel-R1, the first reinforcement learning framework that instills parallel thinking capabilities in large language models for complex real-world mathematical reasoning tasks. This is achieved through a progressive training curriculum that starts with supervised fine-tuning on easier problems and transitions to RL on harder tasks, combined with carefully designed reward mechanisms.

Retrieved papers: 10
Progressive training curriculum with lightweight data pipeline

The authors develop a progressive multi-stage training approach that addresses the cold-start problem by first using supervised fine-tuning on simple tasks (GSM8K) to teach basic parallel thinking formats, then applying reinforcement learning on more difficult problems to generalize the capability. This includes a lightweight data pipeline that generates high-quality parallel thinking trajectories through prompting on easier problems.

Retrieved papers: 10
Status: Can Refute
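The staged curriculum this contribution describes (SFT on easy, prompt-generated parallel-thinking trajectories, then RL on harder problems) can be sketched as a simple stage schedule. This is a minimal illustrative sketch, not the authors' implementation: the stage names, step threshold, and two-pool data split are all assumptions for illustration.

```python
# Minimal sketch of a progressive curriculum: a fixed SFT warm-up on
# easy problems to instill the parallel-thinking format, followed by
# RL on harder problems. All names and thresholds are illustrative.

def curriculum_stage(step: int, sft_steps: int = 1000) -> str:
    """Return the training stage active at a given global step."""
    return "sft_easy_parallel" if step < sft_steps else "rl_hard"

def select_batch(step: int, easy_pool: list, hard_pool: list,
                 sft_steps: int = 1000) -> list:
    """Pick the data pool matching the current curriculum stage."""
    if curriculum_stage(step, sft_steps) == "sft_easy_parallel":
        # Stage 1: supervised fine-tuning on prompt-generated
        # parallel-thinking trajectories from easier tasks (e.g. GSM8K).
        return easy_pool
    # Stage 2: reinforcement learning on harder problems, where the
    # model explores and generalizes the parallel-thinking behavior.
    return hard_pool
```

The key design point is that the format is cheap to teach via SFT on easy data, while the generalization of the behavior is left to RL, which avoids the cold-start failure of applying RL to an untrained format directly.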
Parallel thinking as mid-training exploration scaffold

The authors identify and validate a novel concept where parallel thinking serves as an exploration scaffold during the intermediate training phase. This approach uses parallel thinking to encourage broader exploration early in training, then transitions to sequential reasoning for exploitation, resulting in substantial performance improvements even after the parallel structure is no longer explicitly used.

Retrieved papers: 10
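One way to read the scaffold idea is as a reward schedule in which a structural bonus for parallel thinking is active only during an intermediate training window, after which the objective reverts to plain answer accuracy. The sketch below is a hypothetical illustration of that reading; the window boundaries, bonus weight, and reward shape are assumptions, not details taken from the paper.

```python
# Hypothetical reward schedule for a mid-training exploration scaffold:
# a small bonus rewards parallel-thinking structure only inside an
# intermediate window; outside the window, accuracy alone counts.

def scaffold_reward(step: int, correct: bool, used_parallel: bool,
                    window: tuple = (1000, 3000),
                    bonus: float = 0.1) -> float:
    """Accuracy reward, plus a parallel-structure bonus mid-training."""
    r = 1.0 if correct else 0.0
    lo, hi = window
    if lo <= step < hi and used_parallel:
        r += bonus  # encourage broad exploration during the window
    return r        # late training: pure exploitation of accuracy
```

Under such a schedule the policy is nudged toward parallel exploration mid-training, and the bonus silently disappears later, which is consistent with the report's observation that gains persist even after the parallel structure is no longer explicitly rewarded.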

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Parallel-R1: First RL framework for parallel thinking on general mathematical reasoning

The authors introduce Parallel-R1, the first reinforcement learning framework that instills parallel thinking capabilities in large language models for complex real-world mathematical reasoning tasks. This is achieved through a progressive training curriculum that starts with supervised fine-tuning on easier problems and transitions to RL on harder tasks, combined with carefully designed reward mechanisms.

Contribution

Progressive training curriculum with lightweight data pipeline

The authors develop a progressive multi-stage training approach that addresses the cold-start problem by first using supervised fine-tuning on simple tasks (GSM8K) to teach basic parallel thinking formats, then applying reinforcement learning on more difficult problems to generalize the capability. This includes a lightweight data pipeline that generates high-quality parallel thinking trajectories through prompting on easier problems.

Contribution

Parallel thinking as mid-training exploration scaffold

The authors identify and validate a novel concept where parallel thinking serves as an exploration scaffold during the intermediate training phase. This approach uses parallel thinking to encourage broader exploration early in training, then transitions to sequential reasoning for exploitation, resulting in substantial performance improvements even after the parallel structure is no longer explicitly used.