Single-stream Policy Optimization

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Large Language Models, Reinforcement Learning
Abstract:

We revisit policy-gradient optimization for Large Language Models (LLMs) from a single-stream perspective. Prevailing group-based methods like GRPO reduce variance with on-the-fly baselines but suffer from critical flaws: frequent degenerate groups erase learning signals, and synchronization barriers hinder scalability. We introduce Single-stream Policy Optimization (SPO), which eliminates these issues by design. SPO replaces per-group baselines with a persistent, KL-adaptive value tracker and normalizes advantages globally across the batch, providing a stable, low-variance learning signal for every sample. Being group-free, SPO enables higher throughput and scales effectively in long-horizon or tool-integrated settings where generation times vary. Furthermore, the persistent value tracker naturally enables an adaptive curriculum via prioritized sampling. Experiments using Qwen3-8B show that SPO converges more smoothly and attains higher accuracy than GRPO, while eliminating computation wasted on degenerate groups. Ablation studies confirm that SPO's gains stem from its principled approach to baseline estimation and advantage normalization, offering a more robust and efficient path for LLM reasoning. Across five hard math benchmarks with Qwen3-8B, SPO improves the average maj@32 by +3.4 percentage points (pp) over GRPO, driven by substantial absolute gains on challenging datasets, including +7.3 pp on BRUMO 25, +4.4 pp on AIME 25, and +3.3 pp on HMMT 25, and achieves consistent relative gains in pass@k across the evaluated k values. SPO's success challenges the prevailing trend of adding incidental complexity to RL algorithms, highlighting a path where fundamental principles, not architectural workarounds, drive the next wave of progress in LLM reasoning.

Disclaimer
This report is AI-generated using Large Language Models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Single-stream Policy Optimization (SPO), which eliminates group-based baselines in favor of a persistent KL-adaptive value tracker and global advantage normalization. It resides in the 'Single-Stream and Baseline-Free Approaches' leaf under Core Policy Gradient Algorithm Design. Notably, this leaf contains only the original paper itself—no sibling papers—indicating a sparse research direction within the taxonomy. The broader parent category includes three other leaves (Unified Frameworks, Constrained Optimization, Risk-Seeking Objectives), suggesting that single-stream methods represent a relatively unexplored niche compared to more established variance-reduction or constraint-based approaches.

The taxonomy reveals neighboring work in adjacent leaves and branches. Constrained and Regularized Policy Optimization (three papers) explores KL-divergence constraints and reward maximization, while Unified and Simplified Policy Gradient Frameworks (three papers) focuses on eliminating critic networks or unifying existing methods. The Entropy and Gradient Management branch addresses orthogonal concerns like exploration control and gradient stability, with methods such as Entropy Mechanism and gradient clipping techniques. SPO's single-stream design diverges from these directions by targeting synchronization barriers and degenerate groups rather than entropy dynamics or gradient magnitude issues, positioning it as a complementary approach to existing variance-reduction strategies.

Among thirteen candidates examined, none clearly refute SPO's contributions. For the 'Single-stream Policy Optimization algorithm', one candidate was examined, with no refutations. For the 'KL-adaptive Bayesian value tracker', four candidates were examined, all non-refutable or unclear. For the 'Prioritized prompt sampling curriculum', eight candidates were examined, again with no refutations. This suggests that within the limited search scope, SPO's specific combination of persistent value tracking, global advantage normalization, and adaptive curriculum appears novel. However, the small candidate pool (thirteen total) means the analysis covers a narrow slice of potentially relevant prior work, particularly given the broader variance-reduction literature.

Based on the limited search of thirteen candidates, SPO appears to occupy a sparsely populated research direction with no direct prior work in its taxonomy leaf. The absence of refutations across all three contributions suggests novelty within the examined scope, though the small candidate pool and lack of sibling papers limit confidence in this assessment. A more exhaustive search might reveal related work in variance reduction or baseline estimation that was not captured by top-K semantic matching.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
13 Contribution Candidate Papers Compared
0 Refutable Papers

Research Landscape Overview

Core task: policy gradient optimization for large language models. The field has evolved into several distinct branches that address complementary challenges in aligning and improving LLM behavior through reinforcement learning. Core Policy Gradient Algorithm Design explores foundational methods such as single-stream and baseline-free approaches, as well as variance reduction techniques that stabilize training. Entropy and Gradient Management focuses on controlling exploration-exploitation trade-offs and mitigating issues like vanishing gradients or degeneration, with works such as Entropy Mechanism[4] and Degeneration-free Policy[38] exemplifying this direction. Diffusion Language Model Policy Optimization adapts policy gradients to generative diffusion frameworks, while Data Selection and Offline Learning emphasizes curating high-quality training data and leveraging offline datasets to improve sample efficiency. Compositional and Multi-Module Policy Optimization, including Multi-module GRPO[5], tackles systems with multiple interacting components. Domain-Specific Applications and Extensions apply these techniques to specialized tasks like reasoning or agent control, and Gradient-Based Optimization Techniques refine the underlying mathematical machinery, addressing challenges highlighted in studies like Vanishing Gradients[18].

Recent work has intensified around variance reduction and computational efficiency, with methods like Reinforce++[2] and GVPO[3] proposing refined baseline strategies and gradient estimators to reduce noise in policy updates. Single-stream Policy[0] sits within the single-stream and baseline-free cluster, emphasizing streamlined architectures that avoid auxiliary critic networks or complex variance-reduction modules. This contrasts with approaches like GVPO[3], which incorporates value-based guidance, and Reinforce++[2], which enhances classical REINFORCE with advanced baselines.
Meanwhile, entropy-focused methods such as Entropy Mechanism[4] and gradient management techniques address orthogonal concerns about exploration collapse and training stability. The interplay between algorithmic simplicity and sample efficiency remains a central tension, with Single-stream Policy[0] representing efforts to achieve competitive performance through architectural elegance rather than auxiliary complexity.

Claimed Contributions

Single-stream Policy Optimization (SPO) algorithm

SPO is a new policy gradient method for LLM training that uses a single prompt-response pair per sample instead of groups. It employs a Bayesian value tracker with KL-adaptive memory for baseline estimation and performs global advantage normalization across batches, eliminating the computational waste and synchronization bottlenecks of group-based methods like GRPO.
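To make the contrast with group-based methods concrete, the global advantage normalization described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name `spo_advantages` and the assumption of one scalar reward and one tracked baseline per prompt are ours.

```python
import numpy as np

def spo_advantages(rewards, baselines, eps=1e-6):
    """Globally normalized advantages for a batch of single-stream samples.

    rewards:   per-sample scalar rewards (one rollout per prompt).
    baselines: per-prompt baseline values from the persistent value tracker.
    """
    adv = np.asarray(rewards, dtype=float) - np.asarray(baselines, dtype=float)
    # Normalize across the whole batch rather than within per-prompt groups,
    # so no sample's signal collapses when all rollouts of one prompt agree.
    return (adv - adv.mean()) / (adv.std() + eps)
```

In a group-based scheme, a prompt whose samples all succeed (or all fail) yields zero within-group variance and hence no gradient; normalizing over the whole batch sidesteps that degenerate case.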

1 retrieved paper
KL-adaptive Bayesian value tracker

A persistent baseline estimation mechanism that models success probability using a Beta distribution and adapts its memory dynamically based on KL divergence between current and previous policies. This provides stable, low-variance advantage estimates without requiring a separate critic network or multiple samples per prompt.
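A minimal sketch of how such a tracker might look, assuming binary success/failure rewards. The specific forgetting rule (exponential discount of old evidence toward the prior, scaled by the policy KL shift) and the names `BetaValueTracker` and `kl_scale` are our assumptions, not the paper's exact formulation.

```python
import math

class BetaValueTracker:
    """Per-prompt Beta(alpha, beta) posterior over success probability.

    Old evidence is discounted before each update by a factor that shrinks
    as the KL divergence between the current and previous policy grows,
    so stale estimates are forgotten faster when the policy moves.
    """

    def __init__(self, alpha0=1.0, beta0=1.0, kl_scale=1.0):
        self.alpha, self.beta = alpha0, beta0      # current posterior
        self.alpha0, self.beta0 = alpha0, beta0    # prior to decay toward
        self.kl_scale = kl_scale

    def update(self, success, kl):
        # Discount factor in (0, 1]: a large policy shift -> heavy forgetting.
        gamma = math.exp(-self.kl_scale * kl)
        self.alpha = gamma * self.alpha + (1 - gamma) * self.alpha0 + success
        self.beta = gamma * self.beta + (1 - gamma) * self.beta0 + (1 - success)

    def value(self):
        # Posterior mean success probability, used as the per-prompt baseline.
        return self.alpha / (self.alpha + self.beta)
```

With `kl = 0` this reduces to a plain Bayesian Beta-Bernoulli update; as `kl` grows, the posterior relaxes toward the prior before absorbing the new outcome.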

4 retrieved papers
Prioritized prompt sampling curriculum

An adaptive curriculum learning strategy that samples training prompts based on their estimated learning potential, focusing computational resources on prompts with high uncertainty while maintaining exploration through a minimum sampling weight, thereby improving data efficiency.
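One plausible instantiation of this sampling rule, assuming the tracker's success probabilities are available per prompt: the Bernoulli variance p(1-p) peaks at p = 0.5, so half-solved prompts are sampled most often, while a floor keeps fully solved or unsolved prompts reachable. The function name and the variance-based score are illustrative assumptions.

```python
import numpy as np

def prompt_sampling_weights(p_success, min_weight=0.05):
    """Sampling weights favoring prompts with uncertain outcomes.

    p_success:  tracked success probability per prompt (from the value tracker).
    min_weight: floor that preserves exploration of every prompt.
    """
    p = np.asarray(p_success, dtype=float)
    # Bernoulli variance p(1-p) as a proxy for learning potential,
    # clipped from below so no prompt is starved of samples.
    w = np.maximum(p * (1.0 - p), min_weight)
    return w / w.sum()
```

The resulting weights can be passed directly to a categorical sampler (e.g. `np.random.choice(n, p=weights)`) when drawing the next training batch.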

8 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Single-stream Policy Optimization (SPO) algorithm — one candidate examined; no refutations.

Contribution: KL-adaptive Bayesian value tracker — four candidates examined; all non-refutable or unclear.

Contribution: Prioritized prompt sampling curriculum — eight candidates examined; no refutations.
