Pre-training under infinite compute
Overview
Overall Novelty Assessment
The paper proposes a data-efficient pre-training framework combining regularization, parameter scaling, and ensemble methods to optimize performance under fixed data budgets. It resides in the 'Scaling Law Formulation and Analysis' leaf, which contains six papers including foundational work like Chinchilla and recent extensions examining generalization trade-offs. This leaf sits within the broader 'Compute-Optimal Scaling and Resource Allocation' branch, indicating a moderately populated research direction focused on theoretical scaling principles rather than empirical recipes or system implementations.
The taxonomy reveals neighboring work in 'Empirical Training Strategies and Recipes' (three papers on practical protocols) and 'Data Curation and Selection Methods' (multiple leaves addressing quality filtering, diversity optimization, and synthetic generation). The paper's focus on asymptotic scaling laws under compute abundance distinguishes it from sibling papers examining joint compute-data constraints (Chinchilla) or generalization bounds. The scope_note clarifies this leaf excludes empirical recipes without theoretical analysis, positioning the work as extending scaling law theory rather than proposing purely practical training configurations.
Among 22 candidates examined across three contributions, no clearly refuting prior work was identified. The regularized parameter scaling contribution examined 10 candidates with zero refutations, suggesting novelty in the specific combination of tuned weight decay and asymptote-based evaluation. The ensemble scaling framework similarly examined 10 candidates without refutation, indicating the asymptote-focused methodology may be distinctive. The joint scaling recipe examined only 2 candidates, reflecting a more limited search scope for this compositional contribution. These statistics indicate the analysis covered a focused set of semantically related papers rather than an exhaustive field survey.
The limited search scope (22 candidates from semantic search) means the analysis captures closely related scaling law research but may not cover all relevant empirical training studies or data curation methods. The absence of refuting papers among examined candidates suggests the specific combination of techniques—particularly the asymptote-based evaluation framework and ensemble scaling under data constraints—appears novel within the sampled literature, though broader coverage might reveal additional overlaps in adjacent research directions.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce a regularized pre-training recipe that jointly tunes weight decay, learning rate, and epoch count at each parameter count. This approach achieves a monotonically decreasing loss that follows a power law in parameter count, with the optimal weight decay roughly 30 times the standard value of 0.1 used in practice.
The authors propose an ensembling recipe that trains multiple independent models and averages their logits, achieving a lower loss asymptote than parameter scaling alone. They introduce evaluating scaling recipes by the asymptote of their scaling law rather than performance at fixed compute budgets.
The authors develop a joint scaling recipe that composes both parameter scaling and ensemble scaling by taking the double limit as both parameter count and ensemble member count approach infinity. This combined approach achieves significantly improved data efficiency compared to standard pre-training recipes.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] Training compute-optimal large language models PDF
[18] Scaling laws revisited: modeling the role of data quality in language model pretraining PDF
[28] An empirical analysis of compute-optimal large language model training PDF
[30] Scaling Laws for Predicting Downstream Performance in LLMs PDF
[49] Compute-Optimal LLMs Provably Generalize Better With Scale PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
Regularized parameter scaling recipe with tuned weight decay
The authors introduce a regularized pre-training recipe that jointly tunes weight decay, learning rate, and epoch count at each parameter count. This approach achieves a monotonically decreasing loss that follows a power law in parameter count, with the optimal weight decay roughly 30 times the standard value of 0.1 used in practice.
[63] Rank minimization, alignment and weight decay in neural networks PDF
[64] How to set AdamW's weight decay as you scale model and dataset size PDF
[65] Rethinking weight decay for robust fine-tuning of foundation models PDF
[66] Weight decay induces low-rank attention layers PDF
[67] SGD with weight decay secretly minimizes the ranks of your neural networks PDF
[68] Explicit regularisation, sharpness and calibration PDF
[69] Rotational equilibrium: How weight decay balances learning across neural networks PDF
[70] Hallmarks of Optimization Trajectories in Neural Networks: Directional Exploration and Redundancy PDF
[71] Understanding decoupled and early weight decay PDF
[72] Weight Decay With Tailored Adam on Scale-Invariant Weights for Better Generalization PDF
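The power-law claim in this contribution can be made concrete with a small fitting sketch. This is illustrative only: the saturating form L(N) = E + A·N^(-α) and all coefficients below are assumed for demonstration, not taken from the paper; the fitted constant E is what an asymptote-based evaluation would compare across recipes.

```python
# Hypothetical sketch: fit a saturating power law L(N) = E + A * N**(-alpha)
# to (parameter count, loss) pairs and read off the loss asymptote E.
# Data is synthetic (generated from a known law), so the fit should recover it.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, E, A, alpha):
    return E + A * n ** (-alpha)

# parameter counts in millions; losses generated with E=2.0, A=50.0, alpha=0.3
ns = np.array([10.0, 30.0, 100.0, 300.0, 1000.0])
losses = power_law(ns, 2.0, 50.0, 0.3)

(E_hat, A_hat, alpha_hat), _ = curve_fit(
    power_law, ns, losses, p0=(1.0, 10.0, 0.5), maxfev=10000
)
print(round(E_hat, 2))  # estimated loss asymptote
```

On noiseless synthetic data the fit recovers the generating constants; on real loss curves the estimated asymptote would carry uncertainty from the chosen functional form and the range of parameter counts observed.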
Ensemble scaling recipe and asymptote-based evaluation framework
The authors propose an ensembling recipe that trains multiple independent models and averages their logits, achieving a lower loss asymptote than parameter scaling alone. They introduce evaluating scaling recipes by the asymptote of their scaling law rather than performance at fixed compute budgets.
[51] Prune and tune ensembles: low-cost ensemble learning with sparse independent subnetworks PDF
[52] Deep Learning Ensemble Method for Classifying Glaucoma Stages Using Fundus Photographs and Convolutional Neural Networks PDF
[53] Ensemble learning using decorrelated neural networks PDF
[54] Using Bayesian model averaging to calibrate forecast ensembles PDF
[55] Hybrid and Ensemble Methods of Two Days Ahead Forecasts of Electric Energy Production in a Small Wind Turbine PDF
[56] Snapshot Ensemble One-Dimensional Convolutional Neural Networks for Ballistic Target Recognition PDF
[57] Ensemble-learning approaches for network security and anomaly detection PDF
[58] Boost Neural Networks by Checkpoints PDF
[59] Counting the Cost: Quantifying the Rising Impacts of Heat-Related Productivity Losses in the United States (2001–2023) PDF
[60] When Ensembling Smaller Models is More Efficient than Single Large Models PDF
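The mechanics of the ensembling recipe described above are simple to sketch: K independently trained models each produce logits for the same batch, and the ensemble averages them in logit space before the softmax. The sketch below is a minimal reading of that recipe with stand-in random logits, not the authors' code.

```python
# Minimal logit-averaging ensemble sketch; member logits are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
K, batch, vocab = 4, 8, 32

# stand-ins for the logits of K independently trained models on one batch
member_logits = rng.normal(size=(K, batch, vocab))

# ensemble prediction: average in logit space, then softmax/cross-entropy
ensemble_logits = member_logits.mean(axis=0)

def log_softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

targets = rng.integers(0, vocab, size=batch)
nll = -log_softmax(ensemble_logits)[np.arange(batch), targets].mean()
print(f"ensemble NLL: {nll:.3f}")
```

Under the asymptote-based evaluation the paper proposes, one would fit a scaling law in ensemble member count K and compare its asymptote against that of parameter scaling, rather than comparing losses at a fixed compute budget.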
Joint scaling recipe composing parameter and ensemble scaling
The authors develop a joint scaling recipe that composes both parameter scaling and ensemble scaling by taking the double limit as both parameter count and ensemble member count approach infinity. This combined approach achieves significantly improved data efficiency compared to standard pre-training recipes.
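The double-limit construction can be illustrated with an assumed additive joint law. If the joint recipe's loss behaved as L(N, K) = E + A·N^(-α) + B·K^(-β) in parameter count N and ensemble size K (a hypothetical functional form chosen for illustration, not stated in the paper), both correction terms vanish in the double limit N, K → ∞, leaving only the joint asymptote E.

```python
# Illustrative joint scaling law; all constants are assumed for demonstration.
def joint_loss(n, k, E=2.0, A=50.0, alpha=0.3, B=5.0, beta=0.8):
    """L(N, K) = E + A*N^(-alpha) + B*K^(-beta): both corrections -> 0 as N, K -> inf."""
    return E + A * n ** (-alpha) + B * k ** (-beta)

# loss along a joint scaling trajectory (grow N and K together)
for n, k in [(10, 1), (100, 2), (1000, 4), (10_000, 8)]:
    print(n, k, round(joint_loss(n, k), 3))
# the values approach the joint asymptote E = 2.0 from above
```

Comparing this joint asymptote E against the asymptote of either recipe alone is one way to formalize the claimed data-efficiency gain of the composed recipe.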