Fantastic Pretraining Optimizers and Where to Find Them
Overview
Overall Novelty Assessment
The paper contributes a systematic benchmarking methodology for evaluating ten deep learning optimizers across multiple model scales and data-to-model ratios, emphasizing rigorous hyperparameter tuning and end-of-training evaluation. It resides in the 'Evaluation Methodologies and Benchmarking' leaf of the taxonomy, which contains only two papers total. This represents a relatively sparse research direction within the broader field of optimizer research for language model pretraining, suggesting that comprehensive empirical evaluation frameworks remain underexplored compared to the proliferation of novel optimizer proposals.
The taxonomy reveals that most research activity concentrates on proposing new optimizer algorithms (Second-Order Methods, Adaptive First-Order Methods) and addressing system-level efficiency (Memory and Communication Efficiency, with multiple papers on distributed training). The original paper's leaf sits adjacent to these algorithmic innovation branches but diverges by focusing on evaluation rigor rather than algorithmic novelty. The taxonomy's scope and exclude notes clarify that this work belongs in benchmarking because it systematically compares existing methods rather than proposing new optimization algorithms or training schedules, positioning it as methodological infrastructure for the field.
Among the thirty candidates examined, none clearly refutes the three main contributions. The systematic benchmarking methodology (ten candidates examined, zero refutable) appears novel in its comprehensive scope across model scales and data ratios. The empirical finding that matrix-based optimizers outperform scalar-based ones (ten candidates, zero refutable) and the demonstration of scale-dependent speedup degradation (ten candidates, zero refutable) likewise show no substantial prior overlap within the search scope. These statistics suggest the work addresses a gap in rigorous comparative evaluation, though the analysis covers the top thirty semantic matches rather than an exhaustive literature review.
Based on the limited search scope and taxonomy structure, the work appears to fill a methodological gap in a sparse research direction. The absence of refutable candidates across all contributions, combined with the leaf containing only one sibling paper, suggests the systematic evaluation approach may represent a substantive contribution. However, this assessment reflects analysis of thirty semantically similar papers and does not constitute comprehensive coverage of all optimizer benchmarking efforts in the broader literature.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce a three-phase hyperparameter tuning framework that uses coordinate descent to ensure near-optimal configurations for each optimizer across multiple model scales and data-to-model ratios, addressing the problem of unfair comparisons due to unequal hyperparameter tuning.
The study reveals that optimizers using matrix-based preconditioning (such as Muon, Soap, and Kron) consistently achieve better performance than scalar-based methods (like AdamW, Lion, and Mars), though the speedup decreases with model scale from 1.4× at 0.1B parameters to 1.1× at 1.2B parameters.
The authors show that when baselines are properly tuned and evaluations are conducted at the end of training across different scales, the speedups of alternative optimizers over AdamW are substantially lower than the 1.4× to 2× claims in prior work, and these speedups diminish as model size increases.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] Benchmarking Optimizers for Large Language Model Pretraining
Contribution Analysis
Detailed comparisons for each claimed contribution
Systematic benchmarking methodology with rigorous hyperparameter tuning
The authors introduce a three-phase hyperparameter tuning framework that uses coordinate descent to ensure near-optimal configurations for each optimizer across multiple model scales and data-to-model ratios, addressing the problem of unfair comparisons due to unequal hyperparameter tuning.
[59] Ebola Optimization Search Algorithm: A New Nature-Inspired Metaheuristic Optimization Algorithm
[60] Optimizing Crop Selection Through Hyperparameter Tuning of Neural Networks: Development and Evaluation of Deep Learning Models for Enhancing Crop Recommendation Systems
[61] Hyper-parameter optimization of deep learning model for prediction of Parkinson's disease
[62] Optimizing the Hyperparameter Tuning of YOLOv5 for Underwater Detection
[63] Hyperparameter Optimization of Neural Networks Using Grid Search for Predicting HVAC Heating Coil Performance
[64] Adaptive Hyperparameter Fine-Tuning for Boosting the Robustness and Quality of the Particle Swarm Optimization Algorithm for Non-Linear RBF Neural Network Modelling and Its Applications
[65] Optimizing the neural network hyperparameters utilizing genetic algorithm
[66] Hyperparameter Tuning for Deep Neural Networks Based Optimization Algorithm
[67] Hyperparameter Optimization in Convolutional Neural Network using Genetic Algorithms
[68] Lemur neural network dataset: Towards seamless automl
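The coordinate-descent tuning procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `train_fn`, the grid values, and the toy quadratic loss are hypothetical stand-ins for real pretraining runs, and the actual framework layers this search across three phases and multiple scales.

```python
def coordinate_descent_tune(train_fn, grid, n_sweeps=3):
    """Greedy coordinate descent over a hyperparameter grid.

    train_fn maps a config dict to a validation loss (lower is better);
    grid maps each hyperparameter name to its candidate values. Start
    from the middle of each axis, then repeatedly re-optimize one
    hyperparameter at a time while holding the others fixed.
    """
    config = {k: v[len(v) // 2] for k, v in grid.items()}
    best_loss = train_fn(config)
    for _ in range(n_sweeps):
        improved = False
        for name, values in grid.items():
            for v in values:
                trial = dict(config, **{name: v})
                loss = train_fn(trial)
                if loss < best_loss:
                    best_loss, config, improved = loss, trial, True
        if not improved:  # no axis improved: a coordinate-wise optimum
            break
    return config, best_loss

# Toy stand-in for a training run: a quadratic with a known optimum.
grid = {"lr": [1e-4, 3e-4, 1e-3, 3e-3], "wd": [0.0, 0.1, 0.2]}
toy_loss = lambda c: (c["lr"] - 1e-3) ** 2 + (c["wd"] - 0.1) ** 2
best_cfg, best_loss = coordinate_descent_tune(toy_loss, grid)
```

The appeal of coordinate descent here is budget: it scales roughly linearly with the number of hyperparameters rather than exponentially as a full grid search would, which is what makes tuning every optimizer at every scale tractable.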
Empirical finding that matrix-based optimizers consistently outperform scalar-based optimizers
The study reveals that optimizers using matrix-based preconditioning (such as Muon, Soap, and Kron) consistently achieve better performance than scalar-based methods (like AdamW, Lion, and Mars), though the speedup decreases with model scale from 1.4× at 0.1B parameters to 1.1× at 1.2B parameters.
[41] Memory-efficient 4-bit preconditioned stochastic optimization
[42] Time-Scale Coupling Between States and Parameters in Recurrent Neural Networks
[43] Matrix-Free Preconditioning in Online Learning
[44] Stability of Stochastic Delayed Recurrent Neural Networks
[45] Shampoo: Preconditioned stochastic tensor optimization
[46] Dynamic Low-rank Approximation of Full-Matrix Preconditioner for Training Generalized Linear Models
[47] PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective
[48] MARS-M: When Variance Reduction Meets Matrices
[49] Two-Level K-FAC Preconditioning for Deep Learning
[50] A distributed data-parallel pytorch implementation of the distributed shampoo optimizer for training neural networks at-scale
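The scalar-versus-matrix distinction can be made concrete with a one-step sketch. This is an illustrative simplification (single-step statistics, no momentum, no learning rate or decay), not any listed paper's exact algorithm: the AdamW-style update treats each entry independently via a diagonal second moment, while the Shampoo-style inverse-fourth-root form stands in as a representative matrix preconditioner.

```python
import numpy as np

def inv_root(M, p, eps=1e-8):
    # Inverse p-th root of a symmetric PSD matrix via eigendecomposition.
    w, V = np.linalg.eigh(M)
    return V @ np.diag((np.maximum(w, 0.0) + eps) ** (-1.0 / p)) @ V.T

def scalar_preconditioned_step(grad, second_moment, eps=1e-8):
    # AdamW-style: each entry is rescaled by its own running second
    # moment; the preconditioner is diagonal, so entries never interact.
    return grad / (np.sqrt(second_moment) + eps)

def matrix_preconditioned_step(grad, L, R):
    # Shampoo-style: G -> L^{-1/4} @ G @ R^{-1/4}, where L accumulates
    # G G^T and R accumulates G^T G. Full left/right matrices whiten
    # rows and columns jointly, capturing cross-entry correlations that
    # a diagonal preconditioner cannot represent.
    return inv_root(L, 4) @ grad @ inv_root(R, 4)

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))
adam_update = scalar_preconditioned_step(G, G * G)  # one-step moment
shampoo_update = matrix_preconditioned_step(G, G @ G.T, G.T @ G)
```

The extra expressiveness of the matrix form is the hypothesized source of the consistent win reported above; its cost is the eigendecomposition (or an iterative approximation) per preconditioned layer.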
Demonstration that claimed optimizer speedups are lower than reported and scale-dependent
The authors show that when baselines are properly tuned and evaluations are conducted at the end of training across different scales, the speedups of alternative optimizers over AdamW are substantially lower than the 1.4× to 2× claims in prior work, and these speedups diminish as model size increases.
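One common way to operationalize such a speedup claim is as a token ratio: how much data the baseline needs to reach the candidate optimizer's end-of-training loss. The sketch below is a hypothetical illustration with made-up loss curves, not the paper's measurement code; it shows why evaluating mid-training (where curves are steep and gaps look large) can inflate speedups relative to an end-of-training comparison.

```python
def tokens_to_reach(losses, tokens, target):
    """Smallest token count at which a logged loss curve first reaches
    `target`, linearly interpolating between checkpoints. Returns inf
    if the curve never gets there."""
    if losses[0] <= target:
        return tokens[0]
    for i in range(1, len(losses)):
        if losses[i] <= target:
            t0, t1 = tokens[i - 1], tokens[i]
            l0, l1 = losses[i - 1], losses[i]
            return t0 + (t1 - t0) * (l0 - target) / (l0 - l1)
    return float("inf")

def speedup_over_baseline(baseline_losses, baseline_tokens,
                          candidate_final_loss, candidate_tokens):
    # End-of-training speedup: tokens the baseline needs to match the
    # candidate's final loss, divided by the tokens the candidate used.
    needed = tokens_to_reach(baseline_losses, baseline_tokens,
                             candidate_final_loss)
    return needed / candidate_tokens

# Toy curves (tokens in billions): the candidate reaches loss 3.1 with
# 3B tokens, while the baseline needs 3.5B to match it.
baseline_losses = [4.0, 3.5, 3.2, 3.0]
baseline_tokens = [1.0, 2.0, 3.0, 4.0]
speedup = speedup_over_baseline(baseline_losses, baseline_tokens,
                                candidate_final_loss=3.1,
                                candidate_tokens=3.0)
```

Under this definition, a 1.17× speedup means the baseline would need 17% more tokens to match the candidate's final loss; the scale-dependence finding says this ratio shrinks toward 1 as the model grows.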