Understanding the Mechanisms of Fast Hyperparameter Transfer

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: hyperparameter transfer, hyperparameter tuning, scaling laws, optimization dynamics, maximal update parameterization, science of deep learning
Abstract:

The growing scale of deep learning models has rendered exhaustive hyperparameter (HP) optimization prohibitively expensive. A promising solution is the use of scale-aware HPs, which can enable direct transfer of optimal settings from small-scale grid searches to large models with minimal performance loss. Such approaches are useful when the optimal settings converge "fast" enough with scale. While approaches like the Maximal Update Parameterization (μP) have empirically demonstrated fast transfer when scaling model width, a deeper conceptual understanding of the mechanisms that enable this is still missing. Our work establishes a systematic conceptual framework for analyzing fast HP transfer across different synthetic and practical scenarios. In synthetic settings, we present various quantitative examples where transfer either offers a provable computational advantage or fails even under μP. We then propose a key property that enables the fast transfer often observed in practice: through a novel decomposition of the optimization trajectory, we identify one component that rapidly converges with model width and determines the optimal HPs, and another that continues to improve the loss with increased width but has negligible impact on HP choice. We conjecture that this decomposition elucidates the key mechanisms behind fast transfer and empirically validate it in practical settings such as LLM training.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a conceptual framework for understanding when and why hyperparameter transfer succeeds or fails across model widths, particularly under Maximal Update Parametrization (μP). It resides in the 'Maximal Update Parametrization (μP) and Extensions' leaf, which contains seven papers—a moderately populated research direction within the broader parametrization-based transfer landscape. This leaf focuses on width-scaling methods that preserve optimization dynamics, distinguishing itself from multi-axis extensions (depth, batch size) and optimizer-specific approaches. The work aims to move beyond empirical demonstrations of μP's effectiveness toward a principled understanding of the mechanisms enabling fast transfer.

The taxonomy reveals that parametrization-based methods form one of six major branches addressing hyperparameter transfer. Neighboring leaves include 'Multi-Axis Parametrization Extensions' (six papers handling depth and batch size jointly) and 'Specialized Architecture Parametrizations' (four papers for MoE, FNO, and FP8 training). The 'Optimizer-Specific Transfer' branch explores complementary strategies through preconditioned optimizers and layerwise learning rates, while 'Bayesian and Meta-Learning Transfer' pursues probabilistic surrogate models. The paper's focus on decomposing optimization trajectories to isolate width-stable components connects conceptually to theoretical foundations but remains grounded in the parametrization paradigm rather than optimizer design or meta-learning.

Across the fifteen candidates examined, the framework contribution encountered one potentially refutable prior work among its ten candidates, while the loss decomposition (three candidates) and the synthetic examples (two candidates) showed no clear refutations. The limited search scope—top-K semantic matches plus citation expansion—means these statistics reflect a targeted sample rather than exhaustive coverage. The framework contribution appears most exposed to overlap, likely because conceptual analyses of μP's mechanisms have been explored in sibling papers. The decomposition and synthetic examples may offer more distinctive technical angles, though the small candidate pools (two to three papers each) constrain confidence in their novelty assessment.

Based on the examined literature, the work occupies a moderately crowded research direction with established parametrization methods but contributes theoretical depth to understanding transfer mechanisms. The analysis covers a focused set of candidates rather than the full field, so conclusions about novelty remain provisional. The decomposition approach and synthetic counterexamples may represent the most original contributions, while the overarching framework builds incrementally on existing μP scholarship within a well-defined but active research area.

Taxonomy

Core-task Taxonomy Papers: 48
Claimed Contributions: 3
Contribution Candidate Papers Compared: 15
Refutable Paper: 1

Research Landscape Overview

Core task: Hyperparameter transfer across neural network scales. The field addresses how to reuse hyperparameters—such as learning rates, initialization schemes, and optimizer settings—when moving from small to large models, thereby reducing the cost of tuning at scale.

The taxonomy organizes work into six main branches. Parametrization-Based Transfer Methods focus on reparametrizing network weights and learning rates so that optimal settings remain stable across widths or depths, with Maximal Update Parametrization (μP) and its extensions forming a prominent line of research. Optimizer-Specific Transfer examines how particular optimizers (e.g., Adam, SGD) behave under scaling and how their hyperparameters can be adapted. Bayesian and Meta-Learning Transfer leverages probabilistic models or learned transfer functions to predict good configurations. Application-Specific Transfer Methods tailor strategies to domains like vision or language modeling. Model Initialization and Warmstarting explores using smaller pretrained models to seed larger ones. Finally, Theoretical Foundations and Analysis provides scaling laws and convergence guarantees that underpin transfer strategies.

Several active themes emerge across these branches. One central question is whether transfer can be made nearly automatic—works like Tensor Programs V[1] and Zero Shot Transfer[15] aim for zero-shot or minimal-tuning regimes—versus methods that accept some residual search, as in CNN Hyperparameter Optimization[3] or Syne Tune[20]. Another contrast lies between width-centric parametrizations (μP-style approaches) and depth or layer-specific scaling rules.

Fast Hyperparameter Transfer[0] sits within the Parametrization-Based Transfer branch, specifically under μP and Extensions, emphasizing efficient transfer with minimal retuning. It shares conceptual ground with Tensor Programs V[1] and Mu Learning Rate[45], which also exploit structured parametrizations to stabilize hyperparameters. Compared to Zero Shot Transfer[15], which targets immediate applicability, Fast Hyperparameter Transfer[0] may allow modest adjustments while still achieving strong cross-scale performance, positioning it as a practical middle ground in the parametrization-driven landscape.
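For readers unfamiliar with the parametrization-based branch, the width-scaling rules that μP-style methods apply can be sketched as follows. This is a minimal illustration of one commonly cited set of rules (for Adam-like optimizers), not the code of any paper discussed here; the function name `mup_scales` and the base-width convention are assumptions for the example.

```python
import numpy as np

def mup_scales(width, base_width=64, base_lr=1e-2):
    """Illustrative muP-style width-scaling rules (Adam-like optimizer).

    Hidden (matrix-like) weights: init std shrinks like 1/sqrt(fan_in),
    and their learning rate shrinks like 1/width relative to a base width.
    Vector-like parameters (biases, embeddings) keep a width-independent LR.
    """
    m = width / base_width  # width multiplier relative to the tuned base model
    return {
        "hidden_init_std": 1.0 / np.sqrt(width),
        "hidden_lr": base_lr / m,  # 1/width scaling for matrix-like params
        "vector_lr": base_lr,      # unchanged for vector-like params
    }

# Doubling the width halves the effective hidden-layer learning rate,
# which is what keeps the optimal *base* LR stable across widths.
s64, s128 = mup_scales(64), mup_scales(128)
assert np.isclose(s128["hidden_lr"], s64["hidden_lr"] / 2)
```

Under such a rule, a learning rate tuned at the base width can be reused at larger widths because the parametrization absorbs the width dependence.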

Claimed Contributions

Conceptual framework for analyzing fast hyperparameter transfer

The authors develop a formal framework to analyze when hyperparameter transfer is computationally useful, defining fast transfer as occurring when optimal hyperparameters converge faster than the evaluation metric. They connect this to computational efficiency through theorems showing when transfer strategies outperform direct tuning.

10 retrieved papers (1 potentially refutable)
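The paper's definition of fast transfer—optimal HPs converging with scale faster than the evaluation metric—can be illustrated with a toy loss surface. The function below is a hypothetical construction (not from the paper): the optimal learning rate converges at rate 1/width while the loss floor converges only at rate 1/sqrt(width), so tuning at a small width and transferring costs almost nothing relative to the loss itself.

```python
import numpy as np

def toy_loss(width, lr):
    """Hypothetical loss surface: the optimal LR converges at rate 1/width,
    while the achievable loss floor converges only at rate 1/sqrt(width)."""
    lr_opt = 1.0 + 1.0 / width      # optimal HP, converges fast with width
    floor = 1.0 / np.sqrt(width)    # best achievable loss, converges slowly
    return (lr - lr_opt) ** 2 + floor

grid = np.linspace(0.5, 2.5, 201)

# Tune at a small width, then transfer the found LR to a large width.
lr_small = grid[np.argmin([toy_loss(64, lr) for lr in grid])]
direct = min(toy_loss(4096, lr) for lr in grid)
transferred = toy_loss(4096, lr_small)

# The excess loss from transferring is on the order of (1/64 - 1/4096)^2,
# far below the 1/sqrt(4096) loss floor at the large width.
assert 0 < transferred - direct < 1e-3
```

In this regime the transfer strategy dominates direct tuning: the large-width grid search is skipped at negligible cost, which is the computational advantage the framework's theorems are said to formalize.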
Loss decomposition into width-stable and width-sensitive components

The authors introduce a novel trajectory-level loss decomposition that separates the linearized loss change into top-k components (which remain width-invariant and determine optimal hyperparameters) and residual components (which improve with width but minimally affect hyperparameter choice). This decomposition provides a mechanistic explanation for fast transfer.

3 retrieved papers
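The claimed decomposition splits the linearized loss change along the trajectory into a top-k part and a residual. The sketch below shows only the algebraic skeleton of such a split—projecting the update onto a k-dimensional subspace of "dominant" directions—with a random orthonormal basis standing in for whatever directions the paper actually identifies; the function name and setup are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_linearized_loss_change(grad, update, basis_topk):
    """Split the linearized loss change g . du into top-k and residual parts.

    `basis_topk` is an orthonormal (d, k) matrix of dominant directions
    (a hypothetical stand-in for the paper's top-k components).
    """
    P = basis_topk @ basis_topk.T  # orthogonal projector onto the subspace
    topk_part = grad @ (P @ update)
    resid_part = grad @ ((np.eye(len(grad)) - P) @ update)
    return topk_part, resid_part

d, k = 50, 3
basis, _ = np.linalg.qr(rng.standard_normal((d, k)))  # random orthonormal basis
g = rng.standard_normal(d)
du = -0.1 * g  # e.g., one gradient-descent step

top, resid = split_linearized_loss_change(g, du, basis)
# The two parts sum back to the full linearized change, by construction.
assert np.isclose(top + resid, g @ du)
```

The paper's claim, in these terms, is that the top-k part is approximately width-invariant and determines the optimal HPs, while the residual part keeps improving with width without moving the argmin.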
Synthetic examples demonstrating conditions for fast transfer

The authors provide concrete synthetic examples including random features regression (where fast transfer provably occurs) and two-layer ReLU networks (where transfer can be slow even under maximal update parameterization), illustrating that fast transfer depends on structural properties of the training process rather than being guaranteed by parameterization alone.

2 retrieved papers
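A miniature version of the random-features setting can be run directly: sweep a regularization strength at two feature widths and compare where the test loss is minimized. This is a generic random-features ridge regression assembled for illustration, not the paper's exact construction; the teacher model, feature map, and λ grid are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Teacher data: y = <w*, x> + noise, d-dimensional inputs.
d, n_train = 20, 200
w_star = rng.standard_normal(d) / np.sqrt(d)
X = rng.standard_normal((n_train, d))
y = X @ w_star + 0.1 * rng.standard_normal(n_train)
Xte = rng.standard_normal((1000, d))
yte = Xte @ w_star

def rf_test_loss(width, lam, seed=0):
    """Test MSE of ReLU random-features ridge regression at a given width."""
    r = np.random.default_rng(seed)
    W = r.standard_normal((width, d)) / np.sqrt(d)
    feat = lambda Z: np.maximum(Z @ W.T, 0) / np.sqrt(width)  # kernel-scale features
    F, Fte = feat(X), feat(Xte)
    # Closed-form ridge solution: a = (F^T F + lam I)^{-1} F^T y
    a = np.linalg.solve(F.T @ F + lam * np.eye(width), F.T @ y)
    return np.mean((Fte @ a - yte) ** 2)

lams = np.logspace(-4, 1, 12)
best = {w: lams[np.argmin([rf_test_loss(w, l) for l in lams])] for w in (64, 512)}
loss = {w: min(rf_test_loss(w, l) for l in lams) for w in (64, 512)}
# Under the fast-transfer claim one expects the optimal lambda to be
# (near-)stable across widths while the achievable loss improves with width.
```

Because the 1/sqrt(width) feature scaling keeps the induced kernel approximately width-independent, the optimal λ found at the small width is typically close to that at the large width, mirroring the paper's provable fast-transfer case; the slow-transfer ReLU-network example would require a different setup.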

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Conceptual framework for analyzing fast hyperparameter transfer

Contribution

Loss decomposition into width-stable and width-sensitive components

Contribution

Synthetic examples demonstrating conditions for fast transfer