Fantastic Pretraining Optimizers and Where to Find Them

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: optimizer, benchmarking, pretraining
Abstract:

AdamW has long been the dominant optimizer in language model pretraining, despite numerous claims that alternative optimizers offer a 1.4 to 2× speedup. We posit that two methodological shortcomings have obscured fair comparisons and hindered practical adoption: (i) unequal hyperparameter tuning and (ii) limited or misleading evaluation setups. To address these two issues, we conduct a systematic study of ten deep learning optimizers across four model scales (0.1B–1.2B parameters) and data-to-model ratios (1–8× the Chinchilla optimum). We find that fair and informative comparisons require rigorous hyperparameter tuning and evaluations across a range of model scales and data-to-model ratios, performed at the end of training. First, optimal hyperparameters for one optimizer may be suboptimal for another, making blind hyperparameter transfer unfair. Second, the actual speedup of many proposed optimizers over well-tuned baselines is lower than claimed and decreases with model size, falling to only 1.1× for 1.2B-parameter models. Third, comparing intermediate checkpoints before reaching the target training budget can be misleading, as the ranking of two optimizers can flip during training due to learning rate decay. Through our thorough investigation, we find that all of the fastest optimizers, such as Muon and Soap, use matrices as preconditioners: they multiply gradients by matrices rather than by entry-wise scalars. However, the speedup of matrix-based optimizers is inversely proportional to model scale, decreasing from 1.4× over AdamW for 0.1B-parameter models to merely 1.1× for 1.2B-parameter models.
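The distinction the abstract draws can be made concrete in code. Below is a minimal sketch, not the authors' implementation, contrasting AdamW's entry-wise (scalar) update with a Muon-style matrix preconditioner; the Newton-Schulz coefficients are those commonly quoted for Muon, and all function names and defaults here are illustrative assumptions.

```python
import numpy as np

def adamw_update(g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Entry-wise (scalar) preconditioning: each gradient entry is
    rescaled independently by a running second-moment estimate."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    return lr * m / (np.sqrt(v) + eps), m, v

def muon_like_update(g, lr=1e-3, steps=5):
    """Matrix preconditioning (Muon-style sketch): the whole gradient
    matrix is multiplied by matrices built from itself, pushing its
    singular values toward 1 via a Newton-Schulz iteration.
    Coefficients are the commonly quoted Muon values; this is an
    illustrative simplification, not the paper's implementation."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius-norm normalization
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return lr * x
```

The contrast is the point: AdamW rescales each entry independently, while the matrix-based step transforms the gradient matrix as a whole.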

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes a systematic benchmarking methodology for evaluating ten deep learning optimizers across multiple model scales and data-to-model ratios, emphasizing rigorous hyperparameter tuning and end-of-training evaluation. It resides in the 'Evaluation Methodologies and Benchmarking' leaf of the taxonomy, which contains only two papers total. This represents a relatively sparse research direction within the broader field of optimizer research for language model pretraining, suggesting that comprehensive empirical evaluation frameworks remain underexplored compared to the proliferation of novel optimizer proposals.

The taxonomy reveals that most research activity concentrates on proposing new optimizer algorithms (Second-Order Methods, Adaptive First-Order Methods) and addressing system-level efficiency (Memory and Communication Efficiency, with multiple papers on distributed training). The original paper's leaf sits adjacent to these algorithmic innovation branches but diverges by focusing on evaluation rigor rather than algorithmic novelty. The taxonomy's scope and exclude notes clarify that this work belongs in benchmarking because it systematically compares existing methods rather than proposing new optimization algorithms or training schedules, positioning it as methodological infrastructure for the field.

Among thirty candidates examined, none clearly refute the three main contributions. The systematic benchmarking methodology (ten candidates examined, zero refutable) appears novel in its comprehensive scope across model scales and data ratios. The empirical finding about matrix-based versus scalar-based optimizers (ten candidates, zero refutable) and the demonstration of scale-dependent speedup degradation (ten candidates, zero refutable) both show no substantial prior overlap within the limited search scope. These statistics suggest the work addresses gaps in rigorous comparative evaluation, though the analysis covers top-thirty semantic matches rather than exhaustive literature review.

Based on the limited search scope and taxonomy structure, the work appears to fill a methodological gap in a sparse research direction. The absence of refutable candidates across all contributions, combined with the leaf containing only one sibling paper, suggests the systematic evaluation approach may represent a substantive contribution. However, this assessment reflects analysis of thirty semantically similar papers and does not constitute comprehensive coverage of all optimizer benchmarking efforts in the broader literature.

Taxonomy

Core-task Taxonomy Papers: 40
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Benchmarking optimizers for language model pretraining. The field encompasses a broad landscape of research directions aimed at improving the efficiency, scalability, and effectiveness of training large language models. The taxonomy reveals several major branches:

- Optimizer Algorithm Design: novel variants and adaptive methods such as Sophia[2] and Minimalist Optimizer[3]
- Learning Rate and Training Schedule Optimization: scheduling strategies and decay patterns
- Data Composition: mixture optimization approaches such as DoReMi[11]
- Memory and Communication Efficiency: distributed training challenges, exemplified by DeepSpeed[17] and Diloco[28]
- Model Merging and Checkpoint Utilization: techniques for combining and reusing trained models
- Continual and Specialized Pretraining: domain adaptation
- Model Compression and Pruning: methods such as LoraShear[26]
- Evaluation Methodologies and Benchmarking: rigorous assessment frameworks
- Theoretical Analysis: implicit bias and convergence properties
- Post-Training and Fine-Tuning Optimization: refinement of pretrained models
- Domain-Specific Applications: specialized use cases

Within this landscape, particularly active lines of work contrast algorithmic innovation against practical efficiency concerns and rigorous evaluation standards. Works like Benchmarking Optimizers[1] and Scalable Pretraining Benchmarking[21] emphasize systematic comparison methodologies, while optimizer variants such as AdamS[14] and Stepsize Anything[23] push algorithmic boundaries. The original paper, Fantastic Pretraining Optimizers[0], sits squarely within the Evaluation Methodologies and Benchmarking branch alongside Benchmarking Optimizers[1], sharing a focus on comprehensive empirical assessment rather than proposing new algorithms.
Where Benchmarking Optimizers[1] may establish foundational comparison frameworks, Fantastic Pretraining Optimizers[0] likely extends or refines these evaluation practices, contributing to the growing recognition that robust benchmarking is essential for validating the myriad optimizer proposals emerging across other branches of this taxonomy.

Claimed Contributions

Systematic benchmarking methodology with rigorous hyperparameter tuning

The authors introduce a three-phase hyperparameter tuning framework that uses coordinate descent to ensure near-optimal configurations for each optimizer across multiple model scales and data-to-model ratios, addressing the problem of unfair comparisons due to unequal hyperparameter tuning.

10 retrieved papers
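The coordinate-descent tuning procedure described above can be sketched as follows. This is an illustrative reconstruction, not the paper's protocol: the hyperparameter names, grids, and the toy loss in the usage note are all assumptions.

```python
import math

def coordinate_descent_tune(loss_fn, grids, sweeps=2):
    """Coordinate-descent hyperparameter search: sweep over one
    hyperparameter at a time, holding the others fixed, keep the best
    value found on its grid, then move on to the next hyperparameter.
    Repeating the sweeps lets later choices refine earlier ones."""
    # Start from the middle of each grid (an arbitrary initialization).
    config = {name: grid[len(grid) // 2] for name, grid in grids.items()}
    for _ in range(sweeps):
        for name, grid in grids.items():
            best_val, best_loss = config[name], math.inf
            for val in grid:
                trial = dict(config, **{name: val})
                loss = loss_fn(trial)  # in practice: a full training run
                if loss < best_loss:
                    best_val, best_loss = val, loss
            config[name] = best_val
    return config
```

For example, with a toy loss minimized at `lr=1e-3` and `beta1=0.9`, the search recovers those grid points; in the paper's setting each `loss_fn` call would be an actual pretraining run, which is why near-optimal tuning per optimizer is expensive.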
Empirical finding that matrix-based optimizers consistently outperform scalar-based optimizers

The study reveals that optimizers using matrix-based preconditioning (such as Muon, Soap, and Kron) consistently achieve better performance than scalar-based methods (like AdamW, Lion, and Mars), though the speedup decreases with model scale from 1.4× at 0.1B parameters to 1.1× at 1.2B parameters.

10 retrieved papers
Demonstration that claimed optimizer speedups are lower than reported and scale-dependent

The authors show that when baselines are properly tuned and evaluations are conducted at the end of training across different scales, the speedups of alternative optimizers over AdamW are substantially lower than the 1.4× to 2× claims in prior work, and these speedups diminish as model size increases.

10 retrieved papers
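Speedups of the kind quoted above are commonly measured as a token ratio: how many fewer tokens one optimizer needs to match another's final loss. The sketch below illustrates that assumed definition; the curve shapes and interpolation scheme are illustrative, not the paper's measurement code.

```python
import numpy as np

def tokens_to_reach(loss_curve, tokens, target_loss):
    """Interpolate the token count at which a monotonically decreasing
    loss curve first reaches target_loss; inf if it never does."""
    loss_curve = np.asarray(loss_curve)
    tokens = np.asarray(tokens)
    if loss_curve[-1] > target_loss:
        return float("inf")
    # np.interp needs increasing x, so interpolate on the reversed curve.
    return float(np.interp(target_loss, loss_curve[::-1], tokens[::-1]))

def speedup_over_baseline(base_losses, cand_losses, tokens):
    """Token-ratio speedup: the full token budget divided by the tokens
    the candidate needs to match the baseline's final loss."""
    target = base_losses[-1]
    return tokens[-1] / tokens_to_reach(cand_losses, tokens, target)
```

This definition also shows why end-of-training evaluation matters: the ratio depends on the baseline's final loss, so a comparison at an intermediate checkpoint, before learning rate decay finishes, can rank the two curves differently.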

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
