Pre-training under infinite compute

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: scaling laws, data efficiency, pre-training
Abstract:

Since compute grows much faster than web text available for language model pre-training, we ask how one should approach pre-training under fixed data and no compute constraints. We first show that existing data-constrained approaches of increasing epoch count and parameter count overfit, and we improve upon such recipes by tuning regularization, finding that the optimal weight decay is 30× larger than standard practice. Since our regularized recipe monotonically decreases loss following a power law in parameter count, we estimate its best possible performance via the asymptote of its scaling law rather than the performance at a fixed compute budget. We then identify that ensembling independently trained models achieves a significantly lower loss asymptote than the regularized recipe. Our best intervention combining epoching, regularization, parameter scaling, and ensemble scaling achieves an asymptote at 200M tokens using 5.17× less data than our baseline, and our data scaling laws predict that this improvement persists at higher token budgets. We find that our data efficiency gains can be realized at smaller parameter counts, as we can distill an ensemble into a student model that is 8× smaller and retains 83% of the ensembling benefit. Finally, our interventions designed for validation loss generalize to downstream benchmarks, achieving a 9% improvement for pre-training evals. Our results show that simple algorithmic improvements can enable significantly more data-efficient pre-training in a compute-rich future.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a data-efficient pre-training framework combining regularization, parameter scaling, and ensemble methods to optimize performance under fixed data budgets. It resides in the 'Scaling Law Formulation and Analysis' leaf, which contains six papers including foundational work like Chinchilla and recent extensions examining generalization trade-offs. This leaf sits within the broader 'Compute-Optimal Scaling and Resource Allocation' branch, indicating a moderately populated research direction focused on theoretical scaling principles rather than empirical recipes or system implementations.

The taxonomy reveals neighboring work in 'Empirical Training Strategies and Recipes' (three papers on practical protocols) and 'Data Curation and Selection Methods' (multiple leaves addressing quality filtering, diversity optimization, and synthetic generation). The paper's focus on asymptotic scaling laws under compute abundance distinguishes it from sibling papers examining joint compute-data constraints (Chinchilla) or generalization bounds. The scope note clarifies that this leaf excludes empirical recipes without theoretical analysis, positioning the work as extending scaling law theory rather than proposing purely practical training configurations.

Among 22 candidates examined across three contributions, no clearly refuting prior work was identified. The regularized parameter scaling contribution examined 10 candidates with zero refutations, suggesting novelty in the specific combination of tuned weight decay and asymptote-based evaluation. The ensemble scaling framework similarly examined 10 candidates without refutation, indicating the asymptote-focused methodology may be distinctive. The joint scaling recipe examined only 2 candidates, reflecting a more limited search scope for this compositional contribution. These statistics indicate the analysis covered a focused set of semantically related papers rather than an exhaustive field survey.

The limited search scope (22 candidates from semantic search) means the analysis captures closely related scaling law research but may not cover all relevant empirical training studies or data curation methods. The absence of refuting papers among examined candidates suggests the specific combination of techniques—particularly the asymptote-based evaluation framework and ensemble scaling under data constraints—appears novel within the sampled literature, though broader coverage might reveal additional overlaps in adjacent research directions.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: data-efficient language model pre-training under compute abundance. When computational resources are plentiful, the central challenge shifts from simply scaling up to making the most effective use of available data and compute together.

The taxonomy reflects this dual focus through several major branches. Compute-Optimal Scaling and Resource Allocation examines how to balance model size, data volume, and training duration, building on foundational work like Chinchilla[1] and extending into newer analyses of generalization trade-offs (Compute Optimal Generalization[49]). Data Curation and Selection Methods addresses the quality and composition of training corpora, spanning retrieval-based approaches (Corpus Aware Retrieval[3]), domain-specific filtering (DoReMi[12], DataComp LM[13]), and synthetic data generation (Synthetic Data Scaling[14]). Adaptive and Continual Pre-Training explores how models can be updated or specialized over time (Domain Adaptation Pretraining[6]), while Sample-Efficient Training Objectives and Architectures investigates alternative learning signals (ELECTRA[11]) and architectural innovations. Additional branches cover multimodal extensions, distributed infrastructure (Megatron[5], Datacenter LLM Development[17]), privacy-preserving methods (Federated Pre-text[23]), and domain-specific applications, forming a comprehensive landscape of strategies for efficient pre-training.

Several active lines of work reveal key trade-offs and open questions. One thread focuses on scaling law formulation: understanding how loss, model capacity, and data interact under different resource constraints (Chinchilla[1], Compute Optimal Analysis[28], Downstream Scaling Laws[30]). Another emphasizes data quality over sheer volume, with methods that curate, deduplicate (SoftDedup[29]), or synthesize high-value examples (Rephrasing the Web[25], Seed Free Synthetic[35]). A third direction tackles the practical realities of large-scale training, from cost modeling (LLM Cost Modeling[20]) to system-level optimizations (Cerebras GPT[4]).

Infinite Compute Pretraining[0] sits squarely within the Compute-Optimal Scaling branch, specifically addressing scaling law formulation and analysis. Its emphasis on scenarios where compute is abundant but data remains finite contrasts with earlier work like Chinchilla[1], which derived optimal ratios under joint resource constraints, and complements recent studies on generalization bounds (Compute Optimal Generalization[49]) by exploring how to allocate unlimited compute when data quality or diversity becomes the bottleneck.

Claimed Contributions

Regularized parameter scaling recipe with tuned weight decay

The authors introduce a regularized pre-training recipe that jointly tunes weight decay, learning rate, and epoch count at each parameter count. This approach achieves monotonic loss decrease following a power law in parameter count, with optimal weight decay being 30 times larger than the standard 0.1 value used in practice.

10 retrieved papers
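The tuned weight decay acts through AdamW-style decoupled decay, where the decay term shrinks the weights directly rather than flowing through the gradient. A minimal numpy sketch of one such update (illustrative only, not the authors' code; `weight_decay=3.0` stands in for a value 30× the standard 0.1):

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=3.0):
    """One AdamW update with decoupled weight decay.

    The paper reports an optimal weight decay ~30x the common 0.1,
    i.e. around 3.0; that default here is purely illustrative.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)          # bias-corrected second moment
    # Decoupled decay: weights shrink toward zero independently of the
    # adaptive gradient step, which is what makes large decay values viable.
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```

With a zero gradient the update reduces to a pure multiplicative shrink of the weights by `lr * weight_decay`, which makes the regularization strength easy to reason about in isolation.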
Ensemble scaling recipe and asymptote-based evaluation framework

The authors propose an ensembling recipe that trains multiple independent models and averages their logits, achieving a lower loss asymptote than parameter scaling alone. They also propose evaluating scaling recipes by the asymptote of their scaling law rather than by performance at a fixed compute budget.

10 retrieved papers
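The core of the ensembling recipe is averaging the logits of K independently trained members before the softmax. A self-contained numpy sketch of the ensemble's cross-entropy (shapes and names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ensemble_loss(member_logits, targets):
    """Cross-entropy of an ensemble that averages member logits.

    member_logits: (K, T, V) logits from K independently trained models
                   over T token positions and a V-token vocabulary
    targets:       (T,) integer next-token labels
    """
    avg = member_logits.mean(axis=0)            # average logits across members
    probs = softmax(avg, axis=-1)
    return -np.log(probs[np.arange(len(targets)), targets]).mean()
```

An ensemble whose members are identical reduces exactly to the single-model loss, so any gain comes from diversity across the independently trained members.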
Joint scaling recipe composing parameter and ensemble scaling

The authors develop a joint scaling recipe that composes both parameter scaling and ensemble scaling by taking the double limit as both parameter count and ensemble member count approach infinity. This combined approach achieves significantly improved data efficiency compared to standard pre-training recipes.

2 retrieved papers
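Both recipes are scored by the asymptote of a fitted power law, and the joint recipe reads off the loss in the double limit of parameter count and ensemble size. A minimal sketch of extracting an asymptote E from a one-variable fit L(N) ≈ E + A·N^(-alpha) on synthetic points (the functional form and the grid search are illustrative assumptions, not the authors' fitting code):

```python
import numpy as np

def fit_power_law(N, L, alphas=np.linspace(0.05, 1.5, 291)):
    """Fit L(N) ~= E + A * N**(-alpha); return (E, A, alpha).

    For each candidate alpha the model is linear in (E, A), so those two
    coefficients have a closed-form least-squares solution; alpha itself
    is found by grid search. E estimates the loss as N -> infinity.
    """
    N, L = np.asarray(N, float), np.asarray(L, float)
    best = None
    for a in alphas:
        X = np.stack([np.ones_like(N), N ** (-a)], axis=1)
        coef, *_ = np.linalg.lstsq(X, L, rcond=None)
        err = float(((X @ coef - L) ** 2).sum())
        if best is None or err < best[0]:
            best = (err, coef[0], coef[1], a)
    _, E, A, alpha = best
    return E, A, alpha
```

Under the joint recipe, an analogous two-variable fit of the form L(N, K) = E + A·N^(-a) + B·K^(-b), with K the ensemble member count, would give the double-limit asymptote E as both N and K grow without bound.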

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Regularized parameter scaling recipe with tuned weight decay
Contribution 2: Ensemble scaling recipe and asymptote-based evaluation framework
Contribution 3: Joint scaling recipe composing parameter and ensemble scaling
