Taming Curvature: Architecture Warm-up for Stable Transformer Training
Overview
Overall Novelty Assessment
The paper introduces a fast online estimator for the largest preconditioned Hessian eigenvalue and proposes architecture warm-up to stabilize billion-parameter Transformer training. It resides in the Direct Curvature Control for Training Stability leaf, which contains four papers total. This leaf sits within the broader Curvature-Based Training Stabilization and Optimization branch, indicating a moderately populated research direction focused on explicit curvature regulation during training. The sibling papers in this leaf address related stability mechanisms, suggesting the area is active but not overcrowded.
The taxonomy reveals neighboring leaves dedicated to Second-Order Optimization Methods and Adaptive First-Order Methods with Curvature Insights, both of which leverage curvature information but differ in computational strategy and optimization framework. The Hessian-Based Analysis and Theoretical Insights branch provides complementary theoretical perspectives on curvature evolution and training dynamics. The paper's focus on direct curvature control distinguishes it from sharpness-aware generalization methods and post-training compression techniques, which occupy separate branches with distinct objectives and application contexts.
Among the twenty-one candidates examined, the fast online estimator contribution shows the most overlap with prior work: ten candidates were reviewed for it, and two appear to be potential refuting precedents. For the architecture warm-up strategy, only one candidate was examined and none clearly refutes it, suggesting greater novelty in this direction. For the theoretical bounds on warm-started power iteration, ten candidates were examined with no refutations, indicating this contribution may be less contested. Because the search was limited to top-K semantic matches rather than exhaustive coverage, these findings should be read as indicative rather than conclusive.
Given the moderate density of the Direct Curvature Control leaf and the limited scope of the literature search, the work appears to offer incremental advances in curvature estimation alongside a potentially novel architectural strategy. The analysis covers the top twenty-one semantic matches and does not claim completeness; a broader search might reveal additional overlapping methods or confirm the relative novelty of architecture warm-up.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors develop an efficient method to track the largest eigenvalue of the preconditioned Hessian during training by warm-starting power iteration with the previous step's eigenvector. This approach requires fewer than five Hessian-vector products per step, making online curvature tracking feasible for billion-parameter Transformers while improving accuracy compared to cold-start methods.
The authors introduce a training strategy that progressively increases the effective depth of Transformers by initially freezing layers to identity and gradually unfreezing them according to a schedule. This approach controls the preconditioned curvature to match the stability threshold throughout training, particularly during learning rate warm-up phases.
The authors provide theoretical analysis (Theorems 1 and 2) establishing that the top Hessian eigenvector evolves slowly under Lipschitz continuity assumptions, and quantify the iteration count reduction achieved by warm-starting. This theoretical foundation justifies why warm-started power iteration converges faster and more accurately than random initialization.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[13] Self-stabilization: The implicit bias of gradient descent at the edge of stability
[18] Stepping on the edge: Curvature aware learning rate tuners
[29] A loss curvature perspective on training instability in deep learning
Contribution Analysis
Detailed comparisons for each claimed contribution
Fast online estimator for preconditioned Hessian eigenvalue using warm-started power iteration
The authors develop an efficient method to track the largest eigenvalue of the preconditioned Hessian during training by warm-starting power iteration with the previous step's eigenvector. This approach requires fewer than five Hessian-vector products per step, making online curvature tracking feasible for billion-parameter Transformers while improving accuracy compared to cold-start methods.
[63] Training deep and recurrent networks with hessian-free optimization
[64] A loss curvature perspective on training instabilities of deep learning models
[59] Hessian-aware zeroth-order optimization
[60] Connecting Parameter Magnitudes and Hessian Eigenspaces at Scale using Sketched Methods
[61] A New Perspective on Shampoo's Preconditioner
[62] Deep Linear Network Training Dynamics from Random Initialization: Data, Width, Depth, and Hyperparameter Transfer
[65] Spectral Architecture Optimization for Efficient Model Training
[66] Achieving Small-Batch Accuracy with Large-Batch Scalability via Adaptive Learning Rate Adjustment
[67] CAMP-HiVe: Cyclic Pair Merging based Efficient DNN Pruning with Hessian-Vector Approximation for Resource-Constrained Systems
[68] Minibatches can make neural network training repeatable for clinical applications
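As a rough illustration of the estimator described above, the sketch below tracks the top eigenvalue of a slowly drifting symmetric matrix using power iteration warm-started from the previous step's eigenvector. The explicit matrix, its spectrum, the drift scale, and the iteration budget are all illustrative stand-ins chosen here; the paper's method applies the same idea with Hessian-vector products against the preconditioned Hessian of a real model.

```python
import numpy as np

def power_iteration(matvec, v0, iters):
    """A few power-iteration steps; returns (Rayleigh-quotient estimate, vector)."""
    v = v0 / np.linalg.norm(v0)
    for _ in range(iters):
        w = matvec(v)
        v = w / np.linalg.norm(w)
    return float(v @ matvec(v)), v

rng = np.random.default_rng(0)
d = 50

# Toy stand-in for the preconditioned Hessian: a fixed eigenbasis with a
# geometrically decaying spectrum, so the top eigenvalue is well separated.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(10.0 * 0.9 ** np.arange(d)) @ Q.T

v = rng.standard_normal(d)  # cold start only at step 0
for step in range(20):
    # Small symmetric drift between "optimizer steps": the top eigenvector
    # moves slowly, which is what makes warm-starting effective.
    G = rng.standard_normal((d, d))
    A = A + 0.005 * (G + G.T) / 2
    # Warm start from the previous eigenvector: 4 matvecs per step,
    # mirroring the "fewer than five Hessian-vector products" budget.
    lam, v = power_iteration(lambda x: A @ x, v, iters=4)

true_lam = float(np.max(np.linalg.eigvalsh(A)))
print(f"tracked: {lam:.3f}  exact: {true_lam:.3f}")
```

Because the previous eigenvector starts close to the current one, four matvecs per step suffice to track the drifting eigenvalue closely, whereas a fresh random start would need far more iterations to reach the same accuracy.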
Architecture warm-up strategy for controlling curvature and stabilizing training
The authors introduce a training strategy that progressively increases the effective depth of Transformers by initially freezing layers to identity and gradually unfreezing them according to a schedule. This approach controls the preconditioned curvature to match the stability threshold throughout training, particularly during learning rate warm-up phases.
[69] Kronecker-factored approximate curvature for modern neural network architectures
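A minimal sketch of what such a depth schedule could look like, assuming a linear, bottom-up unfreezing over a fixed warm-up horizon; the paper's actual schedule and gating mechanism may differ, and the function names here are hypothetical. The idea: each residual block l computes y = x + g_l * f_l(x), so setting the gate g_l = 0 freezes that block to the identity, and effective depth grows as gates switch on.

```python
def active_layers(step, total_layers, warmup_steps):
    """Number of unfrozen residual blocks at a given step (linear ramp,
    with at least one block active from the start)."""
    if step >= warmup_steps:
        return total_layers
    return max(1, (step * total_layers) // warmup_steps)

def block_gates(step, total_layers, warmup_steps):
    """Per-block gates g_l in {0.0, 1.0}. A gate of 0.0 keeps block l at the
    identity (y = x), so blocks unfreeze bottom-up as training proceeds.
    Forward pass per block: y = x + g_l * f_l(x)."""
    n = active_layers(step, total_layers, warmup_steps)
    return [1.0 if l < n else 0.0 for l in range(total_layers)]

# Effective depth grows over the warm-up horizon:
for step in (0, 500, 750, 1000):
    print(step, block_gates(step, total_layers=4, warmup_steps=1000))
```

Coordinating this ramp with learning rate warm-up is the point of the strategy: depth (and hence curvature) is kept low exactly when the step size is still settling toward the stability threshold.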
Theoretical bounds on eigenvector change and iteration savings for warm-started power iteration
The authors provide theoretical analysis (Theorems 1 and 2) establishing that the top Hessian eigenvector evolves slowly under Lipschitz continuity assumptions, and quantify the iteration count reduction achieved by warm-starting. This theoretical foundation justifies why warm-started power iteration converges faster and more accurately than random initialization.
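The likely shape of these results can be sketched as follows; the symbols, constants, and exact assumptions below are illustrative reconstructions, not the paper's statements. Write $H_t$ for the preconditioned Hessian at step $t$, with top eigenpair $(\lambda_1(t), v_1(t))$ and eigengap $\gamma_t = \lambda_1(t) - \lambda_2(t)$; an $L$-Lipschitz Hessian along the trajectory bounds the per-step drift, and a Davis–Kahan-type argument then bounds the eigenvector rotation:

```latex
% Per-step Hessian drift for an L-Lipschitz Hessian, step size \eta_t, update u_t:
\|H_{t+1} - H_t\| \;\le\; L\,\|\theta_{t+1} - \theta_t\| \;=\; L\,\eta_t\,\|u_t\|.
% Davis--Kahan: the top eigenvector rotates slowly when the eigengap is large:
\sin\angle\bigl(v_1(t{+}1),\, v_1(t)\bigr) \;\le\; \frac{2\,\|H_{t+1} - H_t\|}{\gamma_t}.
% Power iteration started at angle \theta_0 from v_1 reaches tolerance \epsilon after
k \;\gtrsim\; \frac{\log\bigl(\tan\theta_0 / \epsilon\bigr)}{\log\bigl(\lambda_1 / \lambda_2\bigr)}
% iterations. A warm start makes \tan\theta_0 small (previous step's eigenvector),
% whereas a random start in dimension d has \tan\theta_0 = \Theta(\sqrt{d}).
```

Under this reading, the iteration savings of warm-starting are roughly the ratio of the two $\log(\tan\theta_0/\epsilon)$ terms, which is why a handful of Hessian-vector products per step can suffice.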