Pre-training under infinite compute

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: scaling laws, data efficiency, pre-training
Abstract:

Since compute grows much faster than web text available for language model pre-training, we ask how one should approach pre-training under fixed data and no compute constraints. We first show that existing data-constrained approaches of increasing epoch count and parameter count overfit, and we improve upon such recipes by tuning regularization, finding that the optimal weight decay is 30× larger than standard practice. Since our regularized recipe monotonically decreases loss following a power law in parameter count, we estimate its best possible performance via the asymptote of its scaling law rather than the performance at a fixed compute budget. We then identify that ensembling independently trained models achieves a significantly lower loss asymptote than the regularized recipe. Our best intervention combining epoching, regularization, parameter scaling, and ensemble scaling achieves an asymptote at 200M tokens using 5.17× less data than our baseline, and our data scaling laws predict that this improvement persists at higher token budgets. We find that our data efficiency gains can be realized at smaller parameter counts, as we can distill an ensemble into a student model that is 8× smaller and retains 83% of the ensembling benefit. Finally, our interventions designed for validation loss generalize to downstream benchmarks, achieving a 9% improvement for pre-training evals. Our results show that simple algorithmic improvements can enable significantly more data-efficient pre-training in a compute-rich future.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a data-efficient pre-training framework combining regularization, parameter scaling, and ensemble methods to optimize performance under fixed data budgets. It resides in the 'Scaling Law Formulation and Analysis' leaf, which contains six papers including foundational work like Chinchilla and recent extensions examining generalization trade-offs. This leaf sits within the broader 'Compute-Optimal Scaling and Resource Allocation' branch, indicating a moderately populated research direction focused on theoretical scaling principles rather than empirical recipes or system implementations.

The taxonomy reveals neighboring work in 'Empirical Training Strategies and Recipes' (three papers on practical protocols) and 'Data Curation and Selection Methods' (multiple leaves addressing quality filtering, diversity optimization, and synthetic generation). The paper's focus on asymptotic scaling laws under compute abundance distinguishes it from sibling papers examining joint compute-data constraints (Chinchilla) or generalization bounds. The scope note clarifies that this leaf excludes empirical recipes without theoretical analysis, positioning the work as extending scaling law theory rather than proposing purely practical training configurations.

Among 22 candidates examined across three contributions, no clearly refuting prior work was identified. The regularized parameter scaling contribution examined 10 candidates with zero refutations, suggesting novelty in the specific combination of tuned weight decay and asymptote-based evaluation. The ensemble scaling framework similarly examined 10 candidates without refutation, indicating the asymptote-focused methodology may be distinctive. The joint scaling recipe examined only 2 candidates, reflecting a more limited search scope for this compositional contribution. These statistics indicate the analysis covered a focused set of semantically related papers rather than an exhaustive field survey.

The limited search scope (22 candidates from semantic search) means the analysis captures closely related scaling law research but may not cover all relevant empirical training studies or data curation methods. The absence of refuting papers among examined candidates suggests the specific combination of techniques—particularly the asymptote-based evaluation framework and ensemble scaling under data constraints—appears novel within the sampled literature, though broader coverage might reveal additional overlaps in adjacent research directions.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: data-efficient language model pre-training under compute abundance. When computational resources are plentiful, the central challenge shifts from simply scaling up to making the most effective use of available data and compute together.

The taxonomy reflects this dual focus through several major branches. Compute-Optimal Scaling and Resource Allocation examines how to balance model size, data volume, and training duration, building on foundational work like Chinchilla[1] and extending into newer analyses of generalization trade-offs (Compute Optimal Generalization[49]). Data Curation and Selection Methods addresses the quality and composition of training corpora, spanning retrieval-based approaches (Corpus Aware Retrieval[3]), domain-specific filtering (DoReMi[12], DataComp LM[13]), and synthetic data generation (Synthetic Data Scaling[14]). Adaptive and Continual Pre-Training explores how models can be updated or specialized over time (Domain Adaptation Pretraining[6]), while Sample-Efficient Training Objectives and Architectures investigates alternative learning signals (ELECTRA[11]) and architectural innovations. Additional branches cover multimodal extensions, distributed infrastructure (Megatron[5], Datacenter LLM Development[17]), privacy-preserving methods (Federated Pre-text[23]), and domain-specific applications, forming a comprehensive landscape of strategies for efficient pre-training.

Several active lines of work reveal key trade-offs and open questions. One thread focuses on scaling law formulation: understanding how loss, model capacity, and data interact under different resource constraints (Chinchilla[1], Compute Optimal Analysis[28], Downstream Scaling Laws[30]). Another emphasizes data quality over sheer volume, with methods that curate, deduplicate (SoftDedup[29]), or synthesize high-value examples (Rephrasing the Web[25], Seed Free Synthetic[35]). A third direction tackles the practical realities of large-scale training, from cost modeling (LLM Cost Modeling[20]) to system-level optimizations (Cerebras GPT[4]).

Infinite Compute Pretraining[0] sits squarely within the Compute-Optimal Scaling branch, specifically addressing scaling law formulation and analysis. Its emphasis on scenarios where compute is abundant but data remains finite contrasts with earlier work like Chinchilla[1], which derived optimal ratios under joint resource constraints, and complements recent studies on generalization bounds (Compute Optimal Generalization[49]) by exploring how to allocate unlimited compute when data quality or diversity becomes the bottleneck.

Claimed Contributions

Regularized parameter scaling recipe with tuned weight decay

The authors introduce a regularized pre-training recipe that jointly tunes weight decay, learning rate, and epoch count at each parameter count. This approach achieves monotonic loss decrease following a power law in parameter count, with optimal weight decay being 30 times larger than the standard 0.1 value used in practice.

10 retrieved papers
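The tuned weight decay acts through AdamW-style decoupled decay, where the decay term shrinks the weights directly rather than flowing through the gradient. A minimal numpy sketch of one such update (illustrative only, not the authors' code; `weight_decay=3.0` stands in for a value 30× the standard 0.1):

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=3.0):
    """One AdamW update with decoupled weight decay.

    The paper reports an optimal weight decay ~30x the common 0.1,
    i.e. around 3.0; that default here is purely illustrative.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
    # Decoupled decay: weights shrink toward zero independently of the
    # adaptive gradient step, which is what makes large decay values viable.
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```

With a zero gradient the update reduces to a pure multiplicative shrink of the weights by `lr * weight_decay`, which makes the regularization strength easy to reason about in isolation.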
Ensemble scaling recipe and asymptote-based evaluation framework

The authors propose an ensembling recipe that trains multiple independent models and averages their logits, achieving a lower loss asymptote than parameter scaling alone. They also propose evaluating scaling recipes by the asymptote of their scaling law rather than by performance at a fixed compute budget.

10 retrieved papers
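The core of the ensembling recipe is averaging the logits of K independently trained members before the softmax. A self-contained numpy sketch of the ensemble's cross-entropy (shapes and names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ensemble_loss(member_logits, targets):
    """Cross-entropy of an ensemble that averages member logits.

    member_logits: (K, T, V) logits from K independently trained models
                   over T token positions and a V-token vocabulary
    targets:       (T,) integer next-token labels
    """
    avg = member_logits.mean(axis=0)            # average logits across members
    probs = softmax(avg, axis=-1)
    return -np.log(probs[np.arange(len(targets)), targets]).mean()
```

An ensemble whose members are identical reduces exactly to the single-model loss, so any gain comes from diversity across the independently trained members.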
Joint scaling recipe composing parameter and ensemble scaling

The authors develop a joint scaling recipe that composes both parameter scaling and ensemble scaling by taking the double limit as both parameter count and ensemble member count approach infinity. This combined approach achieves significantly improved data efficiency compared to standard pre-training recipes.

2 retrieved papers
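Both recipes are scored by the asymptote of a fitted power law, and the joint recipe reads off the loss in the double limit of parameter count and ensemble size. A minimal sketch of extracting an asymptote E from a one-variable fit L(N) ≈ E + A·N^(-alpha) on synthetic points (the functional form and the grid search are illustrative assumptions, not the authors' fitting code):

```python
import numpy as np

def fit_power_law(N, L, alphas=np.linspace(0.05, 1.5, 291)):
    """Fit L(N) ~= E + A * N**(-alpha); return (E, A, alpha).

    For each candidate alpha the model is linear in (E, A), so those two
    coefficients have a closed-form least-squares solution; alpha itself
    is found by grid search. E estimates the loss as N -> infinity.
    """
    N, L = np.asarray(N, float), np.asarray(L, float)
    best = None
    for a in alphas:
        X = np.stack([np.ones_like(N), N ** (-a)], axis=1)
        coef, *_ = np.linalg.lstsq(X, L, rcond=None)
        err = float(((X @ coef - L) ** 2).sum())
        if best is None or err < best[0]:
            best = (err, coef[0], coef[1], a)
    _, E, A, alpha = best
    return E, A, alpha
```

Under the joint recipe, an analogous two-variable fit of the form L(N, K) = E + A·N^(-a) + B·K^(-b), with K the ensemble member count, would give the double-limit asymptote E as both N and K grow without bound.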

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Regularized parameter scaling recipe with tuned weight decay
Contribution 2: Ensemble scaling recipe and asymptote-based evaluation framework
Contribution 3: Joint scaling recipe composing parameter and ensemble scaling
