The Art of Scaling Reinforcement Learning Compute for LLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Scaling, LLMs, Reasoning
Abstract:

Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training. Despite rapidly rising compute budgets, there is no principled understanding of how to evaluate algorithmic improvements for scaling RL compute. We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs. We fit sigmoidal compute-performance curves for RL training and ablate a wide range of common design choices to analyze their effects on asymptotic performance and compute efficiency. We observe that: (1) not all recipes yield similar asymptotic performance; (2) details such as loss aggregation, normalization, curriculum, and off-policy algorithm primarily modulate compute efficiency without materially shifting the asymptote; and (3) stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs. Combining these insights, we propose a best-practice recipe, ScaleRL, and demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours. Our work provides both a scientific framework for analyzing scaling in RL and a practical recipe that brings RL training closer to the predictability long achieved in pre-training.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper presents a predictive framework for RL scaling in LLMs using sigmoidal compute-performance curves, alongside a best-practice recipe called ScaleRL. It resides in the 'Compute-Performance Scaling Laws for RL Post-Training' leaf, which contains only three papers total, including this one. This represents a relatively sparse research direction within the broader taxonomy of 50 papers across 19 leaf nodes, suggesting the specific focus on predictive RL scaling laws for LLM post-training remains an emerging area with limited prior systematic investigation.

The taxonomy tree reveals that neighboring work concentrates on pre-training scaling laws (e.g., Compute Optimal Training, DiLoCo Scaling Laws) and cross-family prediction methods, but these explicitly exclude RL-specific post-training dynamics. The sibling papers in the same leaf—Predictive GRPO Laws and Math Reasoning Scaling—examine RL scaling in narrower contexts (specific algorithms or mathematical domains), whereas this work aims for broader predictive modeling across diverse RL training regimes. Adjacent branches address algorithmic innovations (policy gradient methods, reward design) and infrastructure optimization, but lack the systematic compute-performance prediction focus central to this contribution.

Among 13 candidates examined across three contributions, zero refutable pairs were identified. The predictive framework contribution examined one candidate with no refutation; the ScaleRL recipe examined two candidates with no refutation; and the comprehensive empirical study examined ten candidates with no refutation. This limited search scope—13 papers rather than an exhaustive review—suggests the analysis captures immediate semantic neighbors but may not reflect the full landscape of RL scaling research. The absence of refutations among examined candidates indicates that, within this bounded search, the specific combination of sigmoidal curve fitting, design choice ablations, and best-practice recipe formulation appears distinct from prior work.

Based on the limited literature search of 13 candidates, the work appears to occupy a relatively novel position within RL scaling law research, particularly in its systematic approach to predicting compute-performance trajectories. However, the sparse population of the taxonomy leaf and the constrained search scope mean this assessment reflects top-K semantic matches rather than comprehensive field coverage. The analysis does not capture potential overlaps with broader scaling law literature outside the examined candidate set.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 13
Refutable Papers: 0

Research Landscape Overview

Core task: predictive scaling of reinforcement learning compute for large language models. The field structure reflects a multifaceted effort to understand and optimize how RL post-training scales with computational resources. The taxonomy organizes work into several main branches:

- RL Training Dynamics and Scaling Laws examines fundamental relationships between compute budgets and model performance, often through empirical studies of how reward signals and policy updates behave under varying resource allocations.
- RL Algorithms and Training Methods for LLM Reasoning focuses on algorithmic innovations such as policy gradient stabilization, selective rollout strategies, and novel reward formulations that improve sample efficiency.
- Infrastructure and Deployment Optimization addresses practical concerns such as serverless inference architectures, GPU allocation strategies, and resource scheduling.
- Domain-Specific Applications and Integration explores how RL-enhanced LLMs are deployed in specialized contexts such as scientific research assistants or real-time systems.
- Foundational Concepts and Survey Literature provides broader context through reviews of LLM capabilities, emergent reasoning phenomena, and technical foundations.

Representative works like Compute Optimal Training[5] and DiLoCo Scaling Laws[15] illustrate efforts to characterize training efficiency, while Predictive GRPO Laws[12] and Math Reasoning Scaling[11] probe how specific RL methods scale in reasoning-heavy domains. Particularly active lines of work center on deriving predictive laws that relate compute investment to downstream task performance, balancing the trade-off between exploration costs and inference-time gains, and understanding when prolonged training yields diminishing returns.
Scaling RL Compute[0] sits within the branch examining compute-performance scaling laws for RL post-training, closely aligned with Predictive GRPO Laws[12] and Math Reasoning Scaling[11], which similarly investigate how algorithmic choices and problem domains modulate scaling behavior. While Math Reasoning Scaling[11] emphasizes domain-specific benchmarks in mathematical problem-solving, Scaling RL Compute[0] takes a broader view of predictive modeling across diverse RL training regimes, aiming to forecast performance gains before committing large-scale resources. This contrasts with infrastructure-focused efforts like Serverless AI Inference[3] or deployment studies such as Edge Deployment RL[39], which prioritize operational efficiency over theoretical scaling predictions. The work contributes to an emerging consensus that principled resource allocation requires not only empirical scaling curves but also interpretable models that generalize across tasks and training configurations.

Claimed Contributions

Predictive framework for RL scaling in LLMs using sigmoidal compute-performance curves

The authors introduce a sigmoidal curve framework (Equation 1) that models the relationship between expected reward and training compute, enabling extrapolation of RL performance from lower-compute runs to higher compute budgets. This framework quantifies asymptotic performance (A) and compute efficiency (B), providing a predictive methodology for evaluating RL scalability.
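The fitting-and-extrapolation workflow can be sketched in a few lines of numpy. The paper's exact parameterization (Equation 1) is not reproduced here; this sketch assumes a common saturating form, R(C) = A / (1 + (C_mid/C)^B), with asymptote A and efficiency exponent B, and all data and parameter values below are synthetic, for illustration only.

```python
import numpy as np

def sigmoid_reward(compute, A, B, C_mid):
    """Expected reward as a saturating (sigmoidal-in-log-compute) function."""
    return A / (1.0 + (C_mid / compute) ** B)

# Synthetic "training curve": compute in GPU-hours, noisy observed reward.
rng = np.random.default_rng(0)
compute = np.logspace(2, 4, 25)                 # 100 to 10,000 GPU-hours
reward = sigmoid_reward(compute, A=0.62, B=1.1, C_mid=900.0)
reward = reward + rng.normal(0.0, 0.003, size=reward.shape)

# For a candidate asymptote A, log(A/R - 1) is linear in log(compute) with
# slope -B and intercept B*log(C_mid); grid-search A, fit the line by least
# squares, and keep the candidate with the smallest residual.
best = None
x = np.log(compute)
for A in np.linspace(reward.max() + 1e-3, 1.0, 200):
    y = np.log(A / reward - 1.0)
    slope, intercept = np.polyfit(x, y, 1)
    resid = np.sum((np.polyval([slope, intercept], x) - y) ** 2)
    if best is None or resid < best[0]:
        best = (resid, A, -slope, float(np.exp(intercept / -slope)))
_, A_hat, B_hat, Cmid_hat = best
print(f"A={A_hat:.3f}  B={B_hat:.2f}  C_mid={Cmid_hat:.0f}")
# Extrapolate the low-compute fit to a 10x larger budget.
print(f"extrapolated reward at 100,000 GPU-hours: "
      f"{sigmoid_reward(1e5, A_hat, B_hat, Cmid_hat):.3f}")
```

This is the sense in which the framework is predictive: A and B are estimated from runs up to 10,000 GPU-hours, then the fitted curve is queried at budgets never trained on.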

1 retrieved paper
ScaleRL: a best-practice RL recipe that scales predictably with compute

The authors develop ScaleRL, an RL training recipe that integrates asynchronous Pipeline-RL, forced length interruptions, truncated importance sampling RL loss (CISPO), prompt-level loss averaging, batch-level advantage normalization, FP32 precision at logits, zero-variance filtering, and no-positive-resampling. This recipe achieves state-of-the-art asymptotic performance and compute efficiency while maintaining predictable scaling trajectories.
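Two of the listed ingredients, zero-variance filtering and batch-level advantage normalization, can be illustrated on a toy batch of binary rewards. This is a minimal sketch, not the paper's implementation: the function name, batch shapes, and reward values are invented, and the other ingredients (the CISPO loss, prompt-level loss averaging, FP32 logits, etc.) are omitted.

```python
import numpy as np

def scale_rl_style_advantages(rewards_per_prompt):
    """Toy sketch of two ScaleRL-style ingredients:
    - zero-variance filtering: drop prompts whose rollouts all share the same
      reward, since they carry no learning signal;
    - batch-level advantage normalization: scale advantages by one standard
      deviation computed over the whole batch, not per prompt group."""
    kept, advantages = [], []
    for i, rewards in enumerate(rewards_per_prompt):
        rewards = np.asarray(rewards, dtype=float)
        if np.ptp(rewards) == 0.0:          # zero variance: filter this prompt
            continue
        kept.append(i)
        advantages.append(rewards - rewards.mean())  # center within the group
    flat = np.concatenate(advantages)
    std = flat.std() + 1e-8                  # one scale for the whole batch
    return kept, [a / std for a in advantages]

# 3 prompts x 4 rollouts of 0/1 rewards; the middle prompt is all-correct.
batch = [[1, 0, 0, 1], [1, 1, 1, 1], [0, 0, 0, 1]]
kept, advs = scale_rl_style_advantages(batch)
print(kept)    # the all-correct prompt is filtered out
print(advs)
```

Normalizing across the batch rather than per prompt avoids inflating the advantages of near-zero-variance prompt groups, which is one plausible reading of why the recipe pairs these two choices.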

2 retrieved papers
Comprehensive empirical study identifying three key principles for RL scaling

Through over 400,000 GPU-hours of experiments, the authors systematically ablate design choices in RL training and establish three principles: different methods reach different performance ceilings, common interventions mainly affect compute efficiency rather than asymptotic performance, and scalable methods can be identified early by estimating scaling parameters from initial training dynamics.
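The first and third principles can be illustrated numerically: a recipe with a lower ceiling A but better compute efficiency B can lead at small budgets yet lose at scale, which is why estimating the scaling parameters early matters more than comparing raw scores. A common saturating sigmoid form is assumed here, and both recipes' parameter values are invented for illustration.

```python
import numpy as np

def sigmoid_reward(compute, A, B, C_mid):
    # Assumed saturating form: A is the performance ceiling, B the
    # compute-efficiency exponent, C_mid the midpoint compute.
    return A / (1.0 + (C_mid / compute) ** B)

# Two hypothetical recipes: X has a lower asymptote but is more
# compute-efficient; Y has a higher ceiling but ramps up more slowly.
X = dict(A=0.55, B=1.6, C_mid=400.0)
Y = dict(A=0.65, B=1.0, C_mid=1500.0)

for c in (300.0, 1e3, 1e4, 1e5):
    print(f"{c:>8.0f} GPU-h  "
          f"X={sigmoid_reward(c, **X):.3f}  Y={sigmoid_reward(c, **Y):.3f}")
# X leads at small budgets, but Y overtakes it as compute grows; fitting
# (A, B) on early training dynamics reveals the crossover before paying for it.
```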

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Predictive framework for RL scaling in LLMs using sigmoidal compute-performance curves

The authors introduce a sigmoidal curve framework (Equation 1) that models the relationship between expected reward and training compute, enabling extrapolation of RL performance from lower-compute runs to higher compute budgets. This framework quantifies asymptotic performance (A) and compute efficiency (B), providing a predictive methodology for evaluating RL scalability.

Contribution

ScaleRL: a best-practice RL recipe that scales predictably with compute

The authors develop ScaleRL, an RL training recipe that integrates asynchronous Pipeline-RL, forced length interruptions, truncated importance sampling RL loss (CISPO), prompt-level loss averaging, batch-level advantage normalization, FP32 precision at logits, zero-variance filtering, and no-positive-resampling. This recipe achieves state-of-the-art asymptotic performance and compute efficiency while maintaining predictable scaling trajectories.

Contribution

Comprehensive empirical study identifying three key principles for RL scaling

Through over 400,000 GPU-hours of experiments, the authors systematically ablate design choices in RL training and establish three principles: different methods reach different performance ceilings, common interventions mainly affect compute efficiency rather than asymptotic performance, and scalable methods can be identified early by estimating scaling parameters from initial training dynamics.