RLP: Reinforcement as a Pretraining Objective

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Reinforcement Learning, Pretraining, Reasoning, Large Language Models
Abstract:

The dominant paradigm for training large reasoning models starts with pretraining using a next-token prediction loss on vast amounts of data. Reinforcement learning, while powerful for scaling reasoning, is introduced only in the very last phase of post-training, preceded by supervised fine-tuning. While dominant, is this an optimal way of training? In this paper, we present RLP, an information-driven reinforcement pretraining objective that brings the core spirit of reinforcement learning, exploration, to the last phase of pretraining. The key idea is to treat chain-of-thought as an exploratory action, with rewards computed from the information gain it provides for predicting future tokens. This objective encourages the model to think for itself before predicting what comes next, teaching independent thinking behavior earlier in pretraining. Concretely, the reward signal measures the increase in log-likelihood of the next token when conditioning on both the context and a sampled reasoning chain, compared to conditioning on the context alone. This yields a verifier-free, dense reward signal, allowing efficient training over the full document stream during pretraining. RLP thus reframes reinforcement learning for reasoning as a pretraining objective on ordinary text, bridging the gap between next-token prediction and the emergence of useful chain-of-thought reasoning. Pretraining with RLP on Qwen3-1.7B-Base lifts the overall average across an eight-benchmark math-and-science suite by 19%. With identical post-training, the gains compound, with the largest improvements on reasoning-heavy tasks such as AIME25 and MMLU-Pro. Applying RLP to the hybrid NVIDIA-Nemotron-Nano-12B-v2-Base increases the overall average from 42.81% to 61.32% and raises the average on scientific reasoning by 23%, demonstrating scalability across architectures and model sizes.
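The reward described in the abstract can be illustrated with a minimal sketch. The toy distributions, the `log_prob` stub, and all names below are hypothetical stand-ins for a language model's predictions, not the paper's implementation; only the log-likelihood-difference formula follows the text.

```python
import math

# Toy next-token distributions standing in for a language model.
# Keys: (context, optional chain-of-thought); values: next-token probabilities.
TOY_MODEL = {
    ("2 + 2 =", None): {"4": 0.30, "5": 0.70},
    ("2 + 2 =", "two plus two is four"): {"4": 0.90, "5": 0.10},
}

def log_prob(context, next_token, cot=None):
    """Log-likelihood of next_token given context (and an optional reasoning chain)."""
    return math.log(TOY_MODEL[(context, cot)][next_token])

def info_gain_reward(context, cot, next_token):
    """RLP-style reward: increase in log-likelihood of the observed next token
    when conditioning on the sampled chain-of-thought versus the context alone."""
    return log_prob(context, next_token, cot) - log_prob(context, next_token)

r = info_gain_reward("2 + 2 =", "two plus two is four", "4")
# r is positive here: the reasoning chain made the true token "4" more likely.
```

Because the reward is just a difference of the model's own log-likelihoods, no external verifier or task-specific checker is needed, which is what makes the signal applicable to arbitrary text.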

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces RLP, an information-driven reinforcement pretraining objective that integrates chain-of-thought reasoning directly into the pretraining phase. Within the taxonomy, it resides in the 'Pretraining-Phase RL Integration' leaf under 'RL Training Objectives and Algorithms for Reasoning'. This leaf contains only two papers, indicating a relatively sparse research direction. The taxonomy reveals that most RL-for-reasoning work focuses on post-training methods (policy optimization, value-based approaches) or domain-specific applications, making pretraining-phase integration a less crowded area.

The taxonomy structure shows that RLP's nearest neighbors include policy optimization methods (five papers using PPO/GRPO variants), value-based approaches (two papers on Q-function optimization), and reward model design (three papers on process supervision). These branches address RL integration but primarily during post-training or fine-tuning phases. The 'Reasoning Paradigms and Prompting Strategies' branch explores chain-of-thought structures but without explicit pretraining-phase RL objectives. RLP diverges by treating chain-of-thought as exploratory actions during pretraining itself, bridging the gap between next-token prediction and reasoning-aware training earlier in the model lifecycle.

Among thirty candidates examined, none clearly refute any of RLP's three core contributions: the information-driven pretraining objective (ten candidates examined, zero refutable), the verifier-free dense reward signal (ten candidates, zero refutable), and bridging next-token prediction with chain-of-thought reasoning (ten candidates, zero refutable). This suggests that within the limited search scope, the specific combination of information gain-based rewards during pretraining appears novel. However, the analysis does not claim exhaustive coverage; it reflects top-K semantic matches and citation expansion, not a comprehensive field survey.

Given the sparse population of the pretraining-phase integration leaf and the absence of refuting candidates among thirty examined papers, RLP appears to occupy a relatively unexplored niche. The limited search scope means adjacent work in policy optimization or reward design may contain relevant ideas not captured here. The taxonomy context indicates that while RL for reasoning is an active field, integrating RL objectives directly into pretraining remains less developed than post-training approaches.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Reinforcement learning for chain-of-thought reasoning during language model pretraining. The field structure reflects a broad effort to integrate RL-driven reasoning capabilities into language models across multiple dimensions. The taxonomy organizes work into branches that address training algorithms and objectives (including pretraining-phase integration as in RLP Pretraining[0]), domain-specific applications spanning finance, medicine, and knowledge graphs, multimodal reasoning that extends beyond text, tool-augmented and retrieval-enhanced methods (e.g., ReAct[3]), alternative architectural paradigms, prompting strategies, theoretical surveys (Large Reasoning Survey[2], RL Survey[20]), specialized benchmarks, reward design, and even non-reasoning LLM tasks. This structure highlights how RL for reasoning has evolved from isolated post-training fine-tuning into a more holistic enterprise that touches pretraining, inference scaling, and cross-domain generalization.

Several active lines of work reveal key trade-offs and open questions. One central theme is whether to inject RL during pretraining versus post-training: RLP Pretraining[0] explores early-stage integration, contrasting with many studies that apply RL after initial model training. Another tension involves balancing sample efficiency with exploration depth, as seen in works like One Example RL[4] and Search-r1[8], which investigate how much data and search are needed for effective reasoning. Hierarchical Reward Models[5] and Cross-Domain RL[6] address reward design and generalization across tasks, while MiMo[21] examines mixture-of-experts strategies for scaling.

RLP Pretraining[0] sits within the pretraining-phase integration cluster, emphasizing that reasoning abilities can be shaped from the earliest training stages rather than retrofitted later. This approach contrasts with post-hoc methods but shares motivations with works like Guiding Pretraining[17] and Teaching Reasoning[18], which also seek to embed reasoning structure early in model development.

Claimed Contributions

RLP: information-driven reinforcement pretraining objective

The authors introduce RLP, a novel pretraining objective that treats chain-of-thought as an exploratory action and computes rewards based on the information gain it provides for predicting future tokens. This approach encourages models to think before predicting the next token, teaching independent thinking behavior earlier in pretraining.

10 retrieved papers
Verifier-free dense reward signal for pretraining

The method provides a dense, position-wise reward signal computed from log-likelihood ratios without requiring external verifiers or task-specific checkers. This enables uniform application to domain-agnostic web-scale text during pretraining.

10 retrieved papers
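The dense, position-wise nature of this reward can be sketched as follows. The per-token log-likelihood values are invented for illustration; in practice they would come from a model's per-position predictions over a document stream.

```python
import math

def per_position_rewards(log_probs_with_cot, log_probs_without_cot):
    """Dense, position-wise reward: the log-likelihood ratio at every token
    position. No external verifier is consulted; only the model's own
    likelihoods of the observed tokens are compared."""
    return [with_c - without_c
            for with_c, without_c in zip(log_probs_with_cot, log_probs_without_cot)]

# Hypothetical per-token probabilities of the observed tokens over a short span,
# with and without conditioning on a sampled reasoning chain.
lp_with = [math.log(p) for p in (0.60, 0.50, 0.80)]
lp_without = [math.log(p) for p in (0.30, 0.50, 0.40)]

rewards = per_position_rewards(lp_with, lp_without)
# Positions where the reasoning chain helped receive positive reward;
# positions it left unchanged receive zero.
```

Because every position yields a reward rather than only verifiable final answers, the signal stays dense across ordinary web-scale text, which is what distinguishes this setup from verifier-based post-training rewards.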
Bridging next-token prediction and chain-of-thought reasoning

RLP reformulates reinforcement learning for reasoning as a pretraining objective applicable to ordinary text corpora, connecting traditional next-token prediction with the development of useful chain-of-thought reasoning capabilities.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

RLP: information-driven reinforcement pretraining objective

Contribution

Verifier-free dense reward signal for pretraining

Contribution

Bridging next-token prediction and chain-of-thought reasoning