Fantastic Pretraining Optimizers and Where to Find Them
Overview
Overall Novelty Assessment
The paper contributes a systematic benchmarking methodology for evaluating ten deep learning optimizers across multiple model scales and data-to-model ratios, emphasizing rigorous hyperparameter tuning and end-of-training evaluation. It resides in the 'Evaluation Methodologies and Benchmarking' leaf of the taxonomy, which contains only two papers total. This represents a relatively sparse research direction within the broader field of optimizer research for language model pretraining, suggesting that comprehensive empirical evaluation frameworks remain underexplored compared to the proliferation of novel optimizer proposals.
The taxonomy reveals that most research activity concentrates on proposing new optimizer algorithms (Second-Order Methods, Adaptive First-Order Methods) and addressing system-level efficiency (Memory and Communication Efficiency, with multiple papers on distributed training). The original paper's leaf sits adjacent to these algorithmic innovation branches but diverges by focusing on evaluation rigor rather than algorithmic novelty. The taxonomy's scope and exclude notes clarify that this work belongs in benchmarking because it systematically compares existing methods rather than proposing new optimization algorithms or training schedules, positioning it as methodological infrastructure for the field.
Among the thirty candidates examined, none clearly refutes the three main contributions. The systematic benchmarking methodology (ten candidates examined, zero refutable) appears novel in its comprehensive scope across model scales and data ratios. The empirical finding that matrix-based optimizers outperform scalar-based ones (ten candidates, zero refutable) and the demonstration of scale-dependent speedup degradation (ten candidates, zero refutable) likewise show no substantial prior overlap within the search scope. These statistics suggest the work addresses a gap in rigorous comparative evaluation, though the analysis covers the top thirty semantic matches rather than an exhaustive literature review.
Based on the limited search scope and taxonomy structure, the work appears to fill a methodological gap in a sparse research direction. The absence of refutable candidates across all contributions, combined with the leaf containing only one sibling paper, suggests the systematic evaluation approach may represent a substantive contribution. However, this assessment reflects analysis of thirty semantically similar papers and does not constitute comprehensive coverage of all optimizer benchmarking efforts in the broader literature.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce a three-phase hyperparameter tuning framework that uses coordinate descent to ensure near-optimal configurations for each optimizer across multiple model scales and data-to-model ratios, addressing the problem of unfair comparisons due to unequal hyperparameter tuning.
The study reveals that optimizers using matrix-based preconditioning (such as Muon, Soap, and Kron) consistently achieve better performance than scalar-based methods (like AdamW, Lion, and Mars), though the speedup decreases with model scale from 1.4× at 0.1B parameters to 1.1× at 1.2B parameters.
The authors show that when baselines are properly tuned and evaluations are conducted at the end of training across different scales, the speedups of alternative optimizers over AdamW are substantially lower than the 1.4× to 2× claims in prior work, and these speedups diminish as model size increases.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] Benchmarking Optimizers for Large Language Model Pretraining
Contribution Analysis
Detailed comparisons for each claimed contribution
Systematic benchmarking methodology with rigorous hyperparameter tuning
The authors introduce a three-phase hyperparameter tuning framework that uses coordinate descent to ensure near-optimal configurations for each optimizer across multiple model scales and data-to-model ratios, addressing the problem of unfair comparisons due to unequal hyperparameter tuning.
[59] Ebola Optimization Search Algorithm: A New Nature-Inspired Metaheuristic Optimization Algorithm
[60] Optimizing Crop Selection Through Hyperparameter Tuning of Neural Networks: Development and Evaluation of Deep Learning Models for Enhancing Crop Recommendation Systems
[61] Hyper-parameter optimization of deep learning model for prediction of Parkinson's disease
[62] Optimizing the Hyperparameter Tuning of YOLOv5 for Underwater Detection
[63] Hyperparameter Optimization of Neural Networks Using Grid Search for Predicting HVAC Heating Coil Performance
[64] Adaptive Hyperparameter Fine-Tuning for Boosting the Robustness and Quality of the Particle Swarm Optimization Algorithm for Non-Linear RBF Neural Network Modelling and Its Applications
[65] Optimizing the neural network hyperparameters utilizing genetic algorithm
[66] Hyperparameter Tuning for Deep Neural Networks Based Optimization Algorithm
[67] Hyperparameter Optimization in Convolutional Neural Network using Genetic Algorithms
[68] Lemur neural network dataset: Towards seamless automl
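The coordinate-descent tuning procedure described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: `train_fn`, the grid values, and the toy quadratic loss are hypothetical stand-ins for real pretraining runs, and the actual framework layers this search across three phases and multiple scales.

```python
def coordinate_descent_tune(train_fn, grid, n_sweeps=3):
    """Greedy coordinate descent over a hyperparameter grid.

    train_fn maps a config dict to a validation loss (lower is better);
    grid maps each hyperparameter name to its candidate values. Start
    from the middle of each axis, then repeatedly re-optimize one
    hyperparameter at a time while holding the others fixed.
    """
    config = {k: v[len(v) // 2] for k, v in grid.items()}
    best_loss = train_fn(config)
    for _ in range(n_sweeps):
        improved = False
        for name, values in grid.items():
            for v in values:
                trial = dict(config, **{name: v})
                loss = train_fn(trial)
                if loss < best_loss:
                    best_loss, config, improved = loss, trial, True
        if not improved:  # no axis improved: a coordinate-wise optimum
            break
    return config, best_loss

# Toy stand-in for a training run: a quadratic with a known optimum.
grid = {"lr": [1e-4, 3e-4, 1e-3, 3e-3], "wd": [0.0, 0.1, 0.2]}
toy_loss = lambda c: (c["lr"] - 1e-3) ** 2 + (c["wd"] - 0.1) ** 2
best_cfg, best_loss = coordinate_descent_tune(toy_loss, grid)
```

The appeal of coordinate descent here is budget: it scales roughly linearly with the number of hyperparameters rather than exponentially as a full grid search would, which is what makes tuning every optimizer at every scale tractable.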
Empirical finding that matrix-based optimizers consistently outperform scalar-based optimizers
The study reveals that optimizers using matrix-based preconditioning (such as Muon, Soap, and Kron) consistently achieve better performance than scalar-based methods (like AdamW, Lion, and Mars), though the speedup decreases with model scale from 1.4× at 0.1B parameters to 1.1× at 1.2B parameters.
[41] Memory-efficient 4-bit preconditioned stochastic optimization
[42] Time-Scale Coupling Between States and Parameters in Recurrent Neural Networks
[43] Matrix-Free Preconditioning in Online Learning
[44] Stability of Stochastic Delayed Recurrent Neural Networks
[45] Shampoo: Preconditioned stochastic tensor optimization
[46] Dynamic Low-rank Approximation of Full-Matrix Preconditioner for Training Generalized Linear Models
[47] PolarGrad: A Class of Matrix-Gradient Optimizers from a Unifying Preconditioning Perspective
[48] MARS-M: When Variance Reduction Meets Matrices
[49] Two-Level K-FAC Preconditioning for Deep Learning
[50] A distributed data-parallel pytorch implementation of the distributed shampoo optimizer for training neural networks at-scale
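The scalar-versus-matrix distinction can be made concrete with a one-step sketch. This is an illustrative simplification (single-step statistics, no momentum, no learning rate or decay), not any listed paper's exact algorithm: the AdamW-style update treats each entry independently via a diagonal second moment, while the Shampoo-style inverse-fourth-root form stands in as a representative matrix preconditioner.

```python
import numpy as np

def inv_root(M, p, eps=1e-8):
    # Inverse p-th root of a symmetric PSD matrix via eigendecomposition.
    w, V = np.linalg.eigh(M)
    return V @ np.diag((np.maximum(w, 0.0) + eps) ** (-1.0 / p)) @ V.T

def scalar_preconditioned_step(grad, second_moment, eps=1e-8):
    # AdamW-style: each entry is rescaled by its own running second
    # moment; the preconditioner is diagonal, so entries never interact.
    return grad / (np.sqrt(second_moment) + eps)

def matrix_preconditioned_step(grad, L, R):
    # Shampoo-style: G -> L^{-1/4} @ G @ R^{-1/4}, where L accumulates
    # G G^T and R accumulates G^T G. Full left/right matrices whiten
    # rows and columns jointly, capturing cross-entry correlations that
    # a diagonal preconditioner cannot represent.
    return inv_root(L, 4) @ grad @ inv_root(R, 4)

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))
adam_update = scalar_preconditioned_step(G, G * G)  # one-step moment
shampoo_update = matrix_preconditioned_step(G, G @ G.T, G.T @ G)
```

The extra expressiveness of the matrix form is the hypothesized source of the consistent win reported above; its cost is the eigendecomposition (or an iterative approximation) per preconditioned layer.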
Demonstration that claimed optimizer speedups are lower than reported and scale-dependent
The authors show that when baselines are properly tuned and evaluations are conducted at the end of training across different scales, the speedups of alternative optimizers over AdamW are substantially lower than the 1.4× to 2× claims in prior work, and these speedups diminish as model size increases.
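One common way to operationalize such a speedup claim is as a token ratio: how much data the baseline needs to reach the candidate optimizer's end-of-training loss. The sketch below is a hypothetical illustration with made-up loss curves, not the paper's measurement code; it shows why evaluating mid-training (where curves are steep and gaps look large) can inflate speedups relative to an end-of-training comparison.

```python
def tokens_to_reach(losses, tokens, target):
    """Smallest token count at which a logged loss curve first reaches
    `target`, linearly interpolating between checkpoints. Returns inf
    if the curve never gets there."""
    if losses[0] <= target:
        return tokens[0]
    for i in range(1, len(losses)):
        if losses[i] <= target:
            t0, t1 = tokens[i - 1], tokens[i]
            l0, l1 = losses[i - 1], losses[i]
            return t0 + (t1 - t0) * (l0 - target) / (l0 - l1)
    return float("inf")

def speedup_over_baseline(baseline_losses, baseline_tokens,
                          candidate_final_loss, candidate_tokens):
    # End-of-training speedup: tokens the baseline needs to match the
    # candidate's final loss, divided by the tokens the candidate used.
    needed = tokens_to_reach(baseline_losses, baseline_tokens,
                             candidate_final_loss)
    return needed / candidate_tokens

# Toy curves (tokens in billions): the candidate reaches loss 3.1 with
# 3B tokens, while the baseline needs 3.5B to match it.
baseline_losses = [4.0, 3.5, 3.2, 3.0]
baseline_tokens = [1.0, 2.0, 3.0, 4.0]
speedup = speedup_over_baseline(baseline_losses, baseline_tokens,
                                candidate_final_loss=3.1,
                                candidate_tokens=3.0)
```

Under this definition, a 1.17× speedup means the baseline would need 17% more tokens to match the candidate's final loss; the scale-dependence finding says this ratio shrinks toward 1 as the model grows.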