How reinforcement learning after next-token prediction facilitates learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: large language models, reinforcement learning, length increase, theory
Abstract:

Recent advances in reasoning domains with neural networks have primarily been enabled by a training recipe that optimizes Large Language Models, previously trained to predict the next token in a sequence, with reinforcement learning algorithms. We introduce a framework to study the success of this paradigm, and we theoretically expose the optimization mechanisms by which reinforcement learning improves over next-token prediction in this setting. We study learning from mixture distributions of short and long “chain-of-thought” sequences encoding a single task. In particular, when the task consists of predicting the parity of d bits and long sequences are rare, we show how reinforcement learning after next-token prediction enables autoregressive transformers to generalize, whereas mere next-token prediction requires extreme statistical or computational resources to do so. We further explain how reinforcement learning leverages increased test-time computation, manifested in longer responses, to facilitate this learning process. In a simplified setting, we theoretically prove that autoregressive linear models following this training recipe can efficiently learn to predict the parity of d bits as long as the proportion of long demonstrations in the data mix is not exponentially small in the input dimension d. Finally, we demonstrate these same phenomena in other settings, including the post-training of Llama-series models on mixture variations of common mathematical reasoning benchmarks.
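The short/long mixture the abstract describes can be made concrete with a small sketch. The following is illustrative only, not the paper's actual data pipeline; the function name, token encoding, and sampling scheme are our assumptions. A short demonstration contains just the d input bits and the final parity, while a long demonstration also spells out the running prefix parities as a chain of thought:

```python
import random

def parity_sequences(d, p_long, n, seed=0):
    """Sample n demonstration sequences for the d-bit parity task.

    With probability p_long, a "long" chain-of-thought sequence is drawn
    that spells out the running prefix parities before the answer;
    otherwise a "short" sequence gives only the input bits and the answer.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        bits = [rng.randint(0, 1) for _ in range(d)]
        answer = sum(bits) % 2
        if rng.random() < p_long:
            # long CoT: emit each intermediate prefix parity, then the answer
            cot, acc = [], 0
            for b in bits:
                acc ^= b
                cot.append(acc)
            data.append(bits + cot + [answer])
        else:
            # short: input bits followed directly by the answer
            data.append(bits + [answer])
    return data
```

With p_long close to 0, long demonstrations are rare, which is the regime the paper studies: next-token prediction alone sees mostly short sequences, while the subsequent RL stage can still exploit the occasional long ones.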

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

This paper develops a theoretical framework explaining how reinforcement learning after next-token prediction enables autoregressive transformers to learn reasoning tasks, specifically parity prediction over d bits. It occupies the 'Theoretical Mechanisms of RL for Reasoning' leaf within the taxonomy, a leaf that contains only this paper. This isolation reflects the scarcity of formal theoretical work in a field otherwise dominated by algorithmic innovations and empirical studies. The paper's position highlights a critical gap: while numerous methods apply RL to reasoning tasks, rigorous mathematical analysis of why these methods succeed remains rare.

The taxonomy reveals substantial activity in neighboring branches. The sibling leaf 'Empirical Capability Analysis' contains three papers probing whether RL expands reasoning capabilities, while the parent branch 'Theoretical Foundations and Empirical Analysis' contrasts formal proofs with empirical investigations. Adjacent branches like 'Core RL Algorithms for LLM Reasoning' and 'Reward Modeling Innovations' house algorithmic contributions that this work seeks to explain theoretically. The taxonomy's scope notes clarify boundaries: this paper excludes empirical capability studies and method development, focusing instead on optimization mechanisms underlying the training recipe.

Among twenty candidates examined across three contributions, none clearly refutes the paper's claims. The 'Theoretical framework for RL after next-token prediction' examined ten candidates with zero refutations, while the 'Theoretical separation between next-token prediction and combined training' examined three candidates, also with zero refutations. The 'Polynomial sample complexity result for parity learning' examined seven candidates without finding overlapping prior work. This limited search scope—twenty papers from semantic search and citation expansion—suggests the analysis captures highly relevant neighbors but cannot claim exhaustive coverage of all theoretical reasoning literature.

Given the restricted search scale and the paper's unique position as the sole occupant of its taxonomy leaf, the work appears to address an underexplored theoretical niche. The absence of refutations among examined candidates, combined with the taxonomy's evidence of sparse formal analysis relative to algorithmic development, suggests the contributions occupy relatively open territory. However, the analysis reflects top-twenty semantic matches rather than comprehensive field coverage, leaving open the possibility of relevant theoretical work outside this scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: Reinforcement learning after next-token prediction for reasoning tasks. The field has evolved into a rich landscape organized around several major branches. Reinforcement Learning Training Paradigms and Algorithms explores diverse RL methods, ranging from policy gradient techniques to process-supervised approaches like ProRL[9] and SuperRL[11], that refine language models beyond standard supervised fine-tuning. Theoretical Foundations and Empirical Analysis investigates the underlying mechanisms by which RL improves reasoning, examining questions such as whether RL truly enhances reasoning capabilities (Does reinforcement learning really[2]) or primarily optimizes surface-level patterns. Reasoning Efficiency and Inference Optimization focuses on reducing computational overhead during test-time reasoning, while Multimodal and Vision-Language Reasoning extends RL-based reasoning to settings that integrate visual inputs, as seen in works like Vl-rethinker[5] and GFlowVLM[33]. Domain-Specific Applications and Extensions applies these methods to specialized areas such as finance (Fin-R1[36]) and chemistry (Reasoning-Driven Retrosynthesis Prediction[27]), and Scaling and Production Systems addresses the challenges of deploying large reasoning models in real-world environments.

A particularly active line of work centers on understanding how RL interacts with pre-trained language models to unlock deeper reasoning. How reinforcement learning after[0] sits squarely within the Theoretical Mechanisms of RL for Reasoning cluster, probing the fundamental question of what RL contributes beyond next-token prediction. This contrasts with neighboring empirical studies like Reinforcement learning on pre-training[3], which examines RL's role during the pre-training phase itself, and Reinforcement Learning for Reasoning[1], which surveys broader algorithmic strategies. Meanwhile, works such as Kimi k1.5[8] and d1[13] demonstrate large-scale implementations that blend theoretical insights with practical deployment. The central tension across these branches is whether RL genuinely fosters novel reasoning patterns or merely refines existing heuristics, a debate that shapes ongoing research into process rewards, multi-step inference, and the design of training objectives that encourage robust generalization.

Claimed Contributions

Theoretical framework for RL after next-token prediction

The authors develop a theoretical framework to analyze why reinforcement learning applied after next-token prediction enables autoregressive models to generalize on challenging tasks. They identify and prove the optimization mechanisms that make this two-stage training recipe effective, particularly when learning from mixtures of short and long chain-of-thought sequences.

10 retrieved papers
Theoretical separation between next-token prediction and combined training

The authors provide the first theoretical proof that next-token prediction combined with RL is fundamentally more sample-efficient than next-token prediction alone for autoregressive models. They also give the first optimization-theoretic explanation for why model response length increases during reinforcement learning.

3 retrieved papers
Polynomial sample complexity result for parity learning

The authors prove that autoregressive linear models trained with next-token prediction followed by reinforcement learning can learn d-bit parity in polynomial sample complexity, provided long demonstrations are not exponentially rare. This contrasts with known exponential lower bounds for learning parity without chain-of-thought demonstrations.

7 retrieved papers
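The parity construction behind this contribution can be sketched in a short derivation. The notation here is ours and assumes only the standard chain-of-thought encoding of parity, not necessarily the paper's exact formulation. The target is

```latex
f(x) \;=\; x_1 \oplus x_2 \oplus \cdots \oplus x_d, \qquad x \in \{0,1\}^d,
```

and a long demonstration exposes the running prefix parities

```latex
s_0 = 0, \qquad s_t = s_{t-1} \oplus x_t \quad (1 \le t \le d), \qquad f(x) = s_d,
```

so each autoregressive step only has to model the two-input XOR $s_t = s_{t-1} \oplus x_t$, whereas predicting $f(x)$ directly from $(x, f(x))$ pairs runs into the exponential lower bounds that this contribution contrasts against.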

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

- Theoretical framework for RL after next-token prediction
- Theoretical separation between next-token prediction and combined training
- Polynomial sample complexity result for parity learning
