Understanding the Mechanisms of Fast Hyperparameter Transfer

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: hyperparameter transfer, hyperparameter tuning, scaling laws, optimization dynamics, maximal update parameterization, science of deep learning
Abstract:

The growing scale of deep learning models has rendered exhaustive hyperparameter (HP) optimization prohibitively expensive. A promising solution is the use of scale-aware HPs, which can enable direct transfer of optimal settings from small-scale grid searches to large models with minimal performance loss. Such approaches are useful when the optimal settings converge "fast" enough with scale. While approaches like the Maximal Update Parameterization (μP) have empirically demonstrated fast transfer when scaling model width, a deeper conceptual understanding of the mechanisms that enable this is still missing. Our work establishes a systematic conceptual framework for analyzing fast HP transfer across different synthetic and practical scenarios. In synthetic settings, we present various quantitative examples where transfer either offers a provable computational advantage or fails even under μP. We then propose a key property that enables the fast transfer often observed in practice: through a novel decomposition of the optimization trajectory, we identify one component that rapidly converges with model width and determines the optimal HPs, and another that continues to improve the loss with increased width but has negligible impact on HP choice. We conjecture that this decomposition elucidates the key mechanisms behind fast transfer and empirically validate it in practical settings such as LLM training.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a conceptual framework for understanding when and why hyperparameter transfer succeeds or fails across model widths, particularly under Maximal Update Parametrization (μP). It resides in the 'Maximal Update Parametrization (μP) and Extensions' leaf, which contains seven papers—a moderately populated research direction within the broader parametrization-based transfer landscape. This leaf focuses on width-scaling methods that preserve optimization dynamics, distinguishing itself from multi-axis extensions (depth, batch size) and optimizer-specific approaches. The work aims to move beyond empirical demonstrations of μP's effectiveness toward a principled understanding of the mechanisms enabling fast transfer.

The taxonomy reveals that parametrization-based methods form one of six major branches addressing hyperparameter transfer. Neighboring leaves include 'Multi-Axis Parametrization Extensions' (six papers handling depth and batch size jointly) and 'Specialized Architecture Parametrizations' (four papers for MoE, FNO, and FP8 training). The 'Optimizer-Specific Transfer' branch explores complementary strategies through preconditioned optimizers and layerwise learning rates, while 'Bayesian and Meta-Learning Transfer' pursues probabilistic surrogate models. The paper's focus on decomposing optimization trajectories to isolate width-stable components connects conceptually to theoretical foundations but remains grounded in the parametrization paradigm rather than optimizer design or meta-learning.

Across the fifteen candidates examined, the framework contribution encountered one potentially refutable prior work among its ten candidates, while the loss decomposition (three candidates) and the synthetic examples (two candidates) showed no clear refutations. The limited search scope—top-K semantic matches plus citation expansion—means these statistics reflect a targeted sample rather than exhaustive coverage. The framework contribution appears most exposed to overlap, likely because conceptual analyses of μP's mechanisms have been explored in sibling papers. The decomposition and synthetic examples may offer more distinctive technical angles, though the small candidate pools (two to three papers each) constrain confidence in their novelty assessment.

Based on the examined literature, the work occupies a moderately crowded research direction with established parametrization methods but contributes theoretical depth to understanding transfer mechanisms. The analysis covers a focused set of candidates rather than the full field, so conclusions about novelty remain provisional. The decomposition approach and synthetic counterexamples may represent the most original contributions, while the overarching framework builds incrementally on existing μP scholarship within a well-defined but active research area.

Taxonomy

Core-task Taxonomy Papers: 48
Claimed Contributions: 3
Contribution Candidate Papers Compared: 15
Refutable Paper: 1

Research Landscape Overview

Core task: Hyperparameter transfer across neural network scales. The field addresses how to reuse hyperparameters—such as learning rates, initialization schemes, and optimizer settings—when moving from small to large models, thereby reducing the cost of tuning at scale.

The taxonomy organizes work into six main branches. Parametrization-Based Transfer Methods focus on reparametrizing network weights and learning rates so that optimal settings remain stable across widths or depths, with Maximal Update Parametrization (μP) and its extensions forming a prominent line of research. Optimizer-Specific Transfer examines how particular optimizers (e.g., Adam, SGD) behave under scaling and how their hyperparameters can be adapted. Bayesian and Meta-Learning Transfer leverages probabilistic models or learned transfer functions to predict good configurations. Application-Specific Transfer Methods tailor strategies to domains like vision or language modeling. Model Initialization and Warmstarting explores using smaller pretrained models to seed larger ones. Finally, Theoretical Foundations and Analysis provides scaling laws and convergence guarantees that underpin transfer strategies.

Several active themes emerge across these branches. One central question is whether transfer can be made nearly automatic—works like Tensor Programs V[1] and Zero Shot Transfer[15] aim for zero-shot or minimal-tuning regimes—versus methods that accept some residual search, as in CNN Hyperparameter Optimization[3] or Syne Tune[20]. Another contrast lies between width-centric parametrizations (μP-style approaches) and depth or layer-specific scaling rules.

Fast Hyperparameter Transfer[0] sits within the Parametrization-Based Transfer branch, specifically under μP and Extensions, emphasizing efficient transfer with minimal retuning. It shares conceptual ground with Tensor Programs V[1] and Mu Learning Rate[45], which also exploit structured parametrizations to stabilize hyperparameters. Compared to Zero Shot Transfer[15], which targets immediate applicability, Fast Hyperparameter Transfer[0] may allow modest adjustments while still achieving strong cross-scale performance, positioning it as a practical middle ground in the parametrization-driven landscape.
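For readers unfamiliar with the parametrization-based branch, the width-scaling rules that μP-style methods apply can be sketched as follows. This is a minimal illustration of one commonly cited set of rules (for Adam-like optimizers), not the code of any paper discussed here; the function name `mup_scales` and the base-width convention are assumptions for the example.

```python
import numpy as np

def mup_scales(width, base_width=64, base_lr=1e-2):
    """Illustrative muP-style width-scaling rules (Adam-like optimizer).

    Hidden (matrix-like) weights: init std shrinks like 1/sqrt(fan_in),
    and their learning rate shrinks like 1/width relative to a base width.
    Vector-like parameters (biases, embeddings) keep a width-independent LR.
    """
    m = width / base_width  # width multiplier relative to the tuned base model
    return {
        "hidden_init_std": 1.0 / np.sqrt(width),
        "hidden_lr": base_lr / m,  # 1/width scaling for matrix-like params
        "vector_lr": base_lr,      # unchanged for vector-like params
    }

# Doubling the width halves the effective hidden-layer learning rate,
# which is what keeps the optimal *base* LR stable across widths.
s64, s128 = mup_scales(64), mup_scales(128)
assert np.isclose(s128["hidden_lr"], s64["hidden_lr"] / 2)
```

Under such a rule, a learning rate tuned at the base width can be reused at larger widths because the parametrization absorbs the width dependence.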

Claimed Contributions

Conceptual framework for analyzing fast hyperparameter transfer

The authors develop a formal framework to analyze when hyperparameter transfer is computationally useful, defining fast transfer as occurring when optimal hyperparameters converge faster than the evaluation metric. They connect this to computational efficiency through theorems showing when transfer strategies outperform direct tuning.

10 retrieved papers (1 potentially refutable)
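The paper's definition of fast transfer—optimal HPs converging with scale faster than the evaluation metric—can be illustrated with a toy loss surface. The function below is a hypothetical construction (not from the paper): the optimal learning rate converges at rate 1/width while the loss floor converges only at rate 1/sqrt(width), so tuning at a small width and transferring costs almost nothing relative to the loss itself.

```python
import numpy as np

def toy_loss(width, lr):
    """Hypothetical loss surface: the optimal LR converges at rate 1/width,
    while the achievable loss floor converges only at rate 1/sqrt(width)."""
    lr_opt = 1.0 + 1.0 / width      # optimal HP, converges fast with width
    floor = 1.0 / np.sqrt(width)    # best achievable loss, converges slowly
    return (lr - lr_opt) ** 2 + floor

grid = np.linspace(0.5, 2.5, 201)

# Tune at a small width, then transfer the found LR to a large width.
lr_small = grid[np.argmin([toy_loss(64, lr) for lr in grid])]
direct = min(toy_loss(4096, lr) for lr in grid)
transferred = toy_loss(4096, lr_small)

# The excess loss from transferring is on the order of (1/64 - 1/4096)^2,
# far below the 1/sqrt(4096) loss floor at the large width.
assert 0 < transferred - direct < 1e-3
```

In this regime the transfer strategy dominates direct tuning: the large-width grid search is skipped at negligible cost, which is the computational advantage the framework's theorems are said to formalize.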
Loss decomposition into width-stable and width-sensitive components

The authors introduce a novel trajectory-level loss decomposition that separates the linearized loss change into top-k components (which remain width-invariant and determine optimal hyperparameters) and residual components (which improve with width but minimally affect hyperparameter choice). This decomposition provides a mechanistic explanation for fast transfer.

3 retrieved papers
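The claimed decomposition splits the linearized loss change along the trajectory into a top-k part and a residual. The sketch below shows only the algebraic skeleton of such a split—projecting the update onto a k-dimensional subspace of "dominant" directions—with a random orthonormal basis standing in for whatever directions the paper actually identifies; the function name and setup are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_linearized_loss_change(grad, update, basis_topk):
    """Split the linearized loss change g . du into top-k and residual parts.

    `basis_topk` is an orthonormal (d, k) matrix of dominant directions
    (a hypothetical stand-in for the paper's top-k components).
    """
    P = basis_topk @ basis_topk.T  # orthogonal projector onto the subspace
    topk_part = grad @ (P @ update)
    resid_part = grad @ ((np.eye(len(grad)) - P) @ update)
    return topk_part, resid_part

d, k = 50, 3
basis, _ = np.linalg.qr(rng.standard_normal((d, k)))  # random orthonormal basis
g = rng.standard_normal(d)
du = -0.1 * g  # e.g., one gradient-descent step

top, resid = split_linearized_loss_change(g, du, basis)
# The two parts sum back to the full linearized change, by construction.
assert np.isclose(top + resid, g @ du)
```

The paper's claim, in these terms, is that the top-k part is approximately width-invariant and determines the optimal HPs, while the residual part keeps improving with width without moving the argmin.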
Synthetic examples demonstrating conditions for fast transfer

The authors provide concrete synthetic examples including random features regression (where fast transfer provably occurs) and two-layer ReLU networks (where transfer can be slow even under maximal update parameterization), illustrating that fast transfer depends on structural properties of the training process rather than being guaranteed by parameterization alone.

2 retrieved papers
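A miniature version of the random-features setting can be run directly: sweep a regularization strength at two feature widths and compare where the test loss is minimized. This is a generic random-features ridge regression assembled for illustration, not the paper's exact construction; the teacher model, feature map, and λ grid are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Teacher data: y = <w*, x> + noise, d-dimensional inputs.
d, n_train = 20, 200
w_star = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n_train, d))
y = X @ w_star + 0.1 * rng.standard_normal(n_train)
Xte = rng.standard_normal((1000, d))
yte = Xte @ w_star

def rf_test_loss(width, lam, seed=0):
    """Test MSE of ReLU random-features ridge regression at a given width."""
    r = np.random.default_rng(seed)
    W = r.standard_normal((width, d)) / np.sqrt(d)
    feat = lambda Z: np.maximum(Z @ W.T, 0) / np.sqrt(width)  # kernel-scale features
    F, Fte = feat(X), feat(Xte)
    # Closed-form ridge solution: a = (F^T F + lam I)^{-1} F^T y
    a = np.linalg.solve(F.T @ F + lam * np.eye(width), F.T @ y)
    return np.mean((Fte @ a - yte) ** 2)

lams = np.logspace(-4, 1, 12)
best = {w: lams[np.argmin([rf_test_loss(w, l) for l in lams])] for w in (64, 512)}
loss = {w: min(rf_test_loss(w, l) for l in lams) for w in (64, 512)}
# Under the fast-transfer claim one expects the optimal lambda to be
# (near-)stable across widths while the achievable loss improves with width.
```

Because the 1/sqrt(width) feature scaling keeps the induced kernel approximately width-independent, the optimal λ found at the small width is typically close to that at the large width, mirroring the paper's provable fast-transfer case; the slow-transfer ReLU-network example would require a different setup.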

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Conceptual framework for analyzing fast hyperparameter transfer

Contribution

Loss decomposition into width-stable and width-sensitive components

Contribution

Synthetic examples demonstrating conditions for fast transfer