How reinforcement learning after next-token prediction facilitates learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: large language models, reinforcement learning, length increase, theory
Abstract:

Recent advances in reasoning domains with neural networks have primarily been enabled by a training recipe that optimizes Large Language Models, previously trained to predict the next token in a sequence, with reinforcement learning algorithms. We introduce a framework to study the success of this paradigm, and we theoretically expose the optimization mechanisms by which reinforcement learning improves over next-token prediction in this setting. We study learning from mixture distributions of short and long “chain-of-thought” sequences encoding a single task. In particular, when the task consists of predicting the parity of d bits and long sequences are rare, we show how reinforcement learning after next-token prediction enables autoregressive transformers to generalize, whereas mere next-token prediction requires extreme statistical or computational resources to do so. We further explain how reinforcement learning leverages increased test-time computation, manifested in longer responses, to facilitate this learning process. In a simplified setting, we theoretically prove that autoregressive linear models following this training recipe can efficiently learn to predict the parity of d bits as long as the proportion of long demonstrations in the data mix is not exponentially small in the input dimension d. Finally, we demonstrate these same phenomena in other settings, including the post-training of Llama-series models on mixture variations of common mathematical reasoning benchmarks.
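The short/long mixture the abstract describes can be made concrete with a small sketch. The following is illustrative only, not the paper's actual data pipeline; the function name, token encoding, and sampling scheme are our assumptions. A short demonstration contains just the d input bits and the final parity, while a long demonstration also spells out the running prefix parities as a chain of thought:

```python
import random

def parity_sequences(d, p_long, n, seed=0):
    """Sample n demonstration sequences for the d-bit parity task.

    With probability p_long, a "long" chain-of-thought sequence is drawn
    that spells out the running prefix parities before the answer;
    otherwise a "short" sequence gives only the input bits and the answer.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        bits = [rng.randint(0, 1) for _ in range(d)]
        answer = sum(bits) % 2
        if rng.random() < p_long:
            # long CoT: emit each intermediate prefix parity, then the answer
            cot, acc = [], 0
            for b in bits:
                acc ^= b
                cot.append(acc)
            data.append(bits + cot + [answer])
        else:
            # short: input bits followed directly by the answer
            data.append(bits + [answer])
    return data
```

With p_long close to 0, long demonstrations are rare, which is the regime the paper studies: next-token prediction alone sees mostly short sequences, while the subsequent RL stage can still exploit the occasional long ones.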

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

This paper develops a theoretical framework explaining how reinforcement learning after next-token prediction enables autoregressive transformers to learn reasoning tasks, specifically parity prediction over d bits. It occupies the 'Theoretical Mechanisms of RL for Reasoning' leaf within the taxonomy, a leaf that contains only this paper. This isolation reflects the scarcity of formal theoretical work in a field otherwise dominated by algorithmic innovations and empirical studies. The paper's position highlights a critical gap: while numerous methods apply RL to reasoning tasks, rigorous mathematical analysis of why these methods succeed remains rare.

The taxonomy reveals substantial activity in neighboring branches. The sibling leaf 'Empirical Capability Analysis' contains three papers probing whether RL expands reasoning capabilities, while the parent branch 'Theoretical Foundations and Empirical Analysis' contrasts formal proofs with empirical investigations. Adjacent branches like 'Core RL Algorithms for LLM Reasoning' and 'Reward Modeling Innovations' house algorithmic contributions that this work seeks to explain theoretically. The taxonomy's scope notes clarify boundaries: this paper excludes empirical capability studies and method development, focusing instead on optimization mechanisms underlying the training recipe.

Among twenty candidates examined across three contributions, none clearly refutes the paper's claims. The 'Theoretical framework for RL after next-token prediction' examined ten candidates with zero refutations, while the 'Theoretical separation between next-token prediction and combined training' examined three candidates, also with zero refutations. The 'Polynomial sample complexity result for parity learning' examined seven candidates without finding overlapping prior work. This limited search scope—twenty papers from semantic search and citation expansion—suggests the analysis captures highly relevant neighbors but cannot claim exhaustive coverage of all theoretical reasoning literature.

Given the restricted search scale and the paper's unique position as the sole occupant of its taxonomy leaf, the work appears to address an underexplored theoretical niche. The absence of refutations among examined candidates, combined with the taxonomy's evidence of sparse formal analysis relative to algorithmic development, suggests the contributions occupy relatively open territory. However, the analysis reflects top-twenty semantic matches rather than comprehensive field coverage, leaving open the possibility of relevant theoretical work outside this scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: Reinforcement learning after next-token prediction for reasoning tasks. The field has evolved into a rich landscape organized around several major branches. Reinforcement Learning Training Paradigms and Algorithms explores diverse RL methods, ranging from policy gradient techniques to process-supervised approaches like ProRL[9] and SuperRL[11], that refine language models beyond standard supervised fine-tuning. Theoretical Foundations and Empirical Analysis investigates the underlying mechanisms by which RL improves reasoning, examining questions such as whether RL truly enhances reasoning capabilities (Does reinforcement learning really[2]) or primarily optimizes surface-level patterns. Reasoning Efficiency and Inference Optimization focuses on reducing computational overhead during test-time reasoning, while Multimodal and Vision-Language Reasoning extends RL-based reasoning to settings that integrate visual inputs, as seen in works like Vl-rethinker[5] and GFlowVLM[33]. Domain-Specific Applications and Extensions applies these methods to specialized areas such as finance (Fin-R1[36]) and chemistry (Reasoning-Driven Retrosynthesis Prediction[27]), and Scaling and Production Systems addresses the challenges of deploying large reasoning models in real-world environments.

A particularly active line of work centers on understanding how RL interacts with pre-trained language models to unlock deeper reasoning. How reinforcement learning after[0] sits squarely within the Theoretical Mechanisms of RL for Reasoning cluster, probing the fundamental question of what RL contributes beyond next-token prediction. This contrasts with neighboring empirical studies like Reinforcement learning on pre-training[3], which examines RL's role during the pre-training phase itself, and Reinforcement Learning for Reasoning[1], which surveys broader algorithmic strategies. Meanwhile, works such as Kimi k1.5[8] and d1[13] demonstrate large-scale implementations that blend theoretical insights with practical deployment. The central tension across these branches is whether RL genuinely fosters novel reasoning patterns or merely refines existing heuristics, a debate that shapes ongoing research into process rewards, multi-step inference, and the design of training objectives that encourage robust generalization.

Claimed Contributions

Theoretical framework for RL after next-token prediction

The authors develop a theoretical framework to analyze why reinforcement learning applied after next-token prediction enables autoregressive models to generalize on challenging tasks. They identify and prove the optimization mechanisms that make this two-stage training recipe effective, particularly when learning from mixtures of short and long chain-of-thought sequences.

10 retrieved papers
Theoretical separation between next-token prediction and combined training

The authors provide the first theoretical proof that next-token prediction combined with RL is fundamentally more sample-efficient than next-token prediction alone for autoregressive models. They also give the first optimization-theoretic explanation for why model response length increases during reinforcement learning.

3 retrieved papers
Polynomial sample complexity result for parity learning

The authors prove that autoregressive linear models trained with next-token prediction followed by reinforcement learning can learn d-bit parity in polynomial sample complexity, provided long demonstrations are not exponentially rare. This contrasts with known exponential lower bounds for learning parity without chain-of-thought demonstrations.

7 retrieved papers
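The parity construction behind this contribution can be sketched in a short derivation. The notation here is ours and assumes only the standard chain-of-thought encoding of parity, not necessarily the paper's exact formulation. The target is

```latex
f(x) \;=\; x_1 \oplus x_2 \oplus \cdots \oplus x_d, \qquad x \in \{0,1\}^d,
```

and a long demonstration exposes the running prefix parities

```latex
s_0 = 0, \qquad s_t = s_{t-1} \oplus x_t \quad (1 \le t \le d), \qquad f(x) = s_d,
```

so each autoregressive step only has to model the two-input XOR $s_t = s_{t-1} \oplus x_t$, whereas predicting $f(x)$ directly from $(x, f(x))$ pairs runs into the exponential lower bounds that this contribution contrasts against.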

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

- Theoretical framework for RL after next-token prediction
- Theoretical separation between next-token prediction and combined training
- Polynomial sample complexity result for parity learning
