Abstract:

Training billion-parameter Transformers is often brittle: transient loss spikes and outright divergence waste compute. Although the recently developed Edge of Stability (EoS) theory offers a powerful lens for understanding and controlling the stability of optimization methods via the (preconditioned) curvature, curvature-controlling methods remain unpopular in large-scale Transformer training because curvature estimation is costly. To this end, we first introduce a fast online estimator of the largest (preconditioned) Hessian eigenvalue (i.e., the curvature), based on a warm-started variant of power iteration with Hessian–vector products. We show theoretically, and verify empirically, that the proposed method makes per-iteration curvature tracking feasible at billion-parameter scale while being more accurate than cold-start estimation. Using this tool, we find that training instabilities coincide with surges in preconditioned curvature and that curvature grows with depth. Motivated by these observations, we propose architecture warm-up: progressively growing network depth to carefully control the preconditioned Hessian and stabilize training. Experiments on large Transformers confirm that our approach enables efficient curvature tracking and reduces instabilities relative to existing state-of-the-art stabilization techniques, without slowing convergence.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's claimed tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a fast online estimator for the largest preconditioned Hessian eigenvalue and proposes architecture warm-up to stabilize billion-parameter Transformer training. It resides in the Direct Curvature Control for Training Stability leaf, which contains four papers total. This leaf sits within the broader Curvature-Based Training Stabilization and Optimization branch, indicating a moderately populated research direction focused on explicit curvature regulation during training. The sibling papers in this leaf address related stability mechanisms, suggesting the area is active but not overcrowded.

The taxonomy reveals neighboring leaves dedicated to Second-Order Optimization Methods and Adaptive First-Order Methods with Curvature Insights, both of which leverage curvature information but differ in computational strategy and optimization framework. The Hessian-Based Analysis and Theoretical Insights branch provides complementary theoretical perspectives on curvature evolution and training dynamics. The paper's focus on direct curvature control distinguishes it from sharpness-aware generalization methods and post-training compression techniques, which occupy separate branches with distinct objectives and application contexts.

Among the twenty-one candidate papers examined, the fast online estimator shows the most overlap with prior work: ten candidates were reviewed for this contribution, and two appear to provide refuting precedents. For the architecture warm-up strategy, only one candidate was retrieved, with no clear refutation, suggesting greater novelty in this direction. For the theoretical bounds on warm-started power iteration, ten candidates were examined with no refutations, indicating this contribution may be less contested. The limited search scope means these findings reflect top-K semantic matches rather than exhaustive coverage of the literature.

Given the moderate density of the Direct Curvature Control leaf and the limited literature search, the work appears to offer incremental advances in curvature estimation alongside a potentially novel architectural strategy. The analysis covers top-twenty-one semantic matches and does not claim completeness; a broader search might reveal additional overlapping methods or confirm the relative novelty of architecture warm-up.

Taxonomy

Core-task Taxonomy Papers: 48
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 2

Research Landscape Overview

Core task: Stabilizing large-scale Transformer training through curvature control. The field organizes around four main branches that reflect distinct uses of curvature information in deep learning. Curvature-Based Training Stabilization and Optimization focuses on methods that directly manipulate or monitor curvature properties—such as Hessian eigenvalues or loss landscape sharpness—to improve convergence and prevent instabilities during training, with works like Self-stabilization[13] and Stepping on the edge[18] exemplifying this direction. Hessian-Based Analysis and Theoretical Insights gathers studies that use second-order information to understand optimization dynamics, generalization, and the role of adaptive optimizers, as seen in Why Transformers Need Adam[2] and Hessian-based Analysis of Large[11]. Model Compression and Efficiency via Curvature leverages curvature estimates—often Hessian approximations—to guide pruning, quantization, and other compression strategies, with many studies such as Global Vision Transformer Pruning[1] and PTQ4ViT[8] demonstrating practical efficiency gains. Finally, Alternative Geometric and Architectural Approaches explores broader geometric perspectives, including Ricci curvature and manifold-based techniques, offering complementary views on network design and optimization. A particularly active theme within the first branch is the tension between theoretical guarantees and computational feasibility: while exact Hessian computation is prohibitive at scale, approximations and adaptive heuristics must balance accuracy with overhead. Taming Curvature[0] sits squarely in the Direct Curvature Control for Training Stability cluster, emphasizing explicit mechanisms to regulate curvature during large-scale Transformer training. 
Compared to Self-stabilization[13], which may rely on implicit feedback loops, and Stepping on the edge[18], which explores edge-of-stability phenomena, Taming Curvature[0] appears to advocate for more direct intervention strategies. This positioning highlights an ongoing question in the field: whether stability is best achieved through careful architectural or algorithmic design that inherently tames curvature, or through active monitoring and correction during the optimization process.

Claimed Contributions

Fast online estimator for preconditioned Hessian eigenvalue using warm-started power iteration

The authors develop an efficient method to track the largest eigenvalue of the preconditioned Hessian during training by warm-starting power iteration with the previous step's eigenvector. This approach requires fewer than five Hessian-vector products per step, making online curvature tracking feasible for billion-parameter Transformers while improving accuracy compared to cold-start methods.

10 retrieved papers (can refute)
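The warm-started scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `top_eigpair_warm` is hypothetical, and an explicit symmetric matrix stands in for a true autodiff Hessian-vector product.

```python
import numpy as np

def top_eigpair_warm(hvp, dim, v0=None, iters=5, tol=1e-6):
    """Estimate the largest eigenvalue of a (preconditioned) Hessian
    accessible only through Hessian-vector products `hvp`.
    Warm-starting from the previous step's eigenvector `v0` typically
    needs far fewer iterations than a random (cold) start."""
    rng = np.random.default_rng(0)
    v = rng.standard_normal(dim) if v0 is None else v0.copy()
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        w = hvp(v)                  # one Hessian-vector product
        lam_new = float(v @ w)      # Rayleigh quotient estimate
        v = w / np.linalg.norm(w)
        if abs(lam_new - lam) < tol * max(1.0, abs(lam_new)):
            lam = lam_new
            break
        lam = lam_new
    return lam, v

# Toy stand-in: explicit symmetric matrix instead of an autodiff HVP.
H = np.diag([5.0, 2.0, 1.0])
lam, v = top_eigpair_warm(lambda x: H @ x, dim=3, iters=50)

# Warm start from the previous eigenvector: a handful of iterations suffice.
lam_warm, _ = top_eigpair_warm(lambda x: H @ x, dim=3, v0=v, iters=3)
```

The Rayleigh quotient gives the eigenvalue estimate at each step; in real training, `hvp` would be implemented via automatic differentiation (a gradient-of-gradient product) rather than an explicit matrix.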
Architecture warm-up strategy for controlling curvature and stabilizing training

The authors introduce a training strategy that progressively increases the effective depth of Transformers by initially freezing layers to identity and gradually unfreezing them according to a schedule. This approach controls the preconditioned curvature to match the stability threshold throughout training, particularly during learning rate warm-up phases.

1 retrieved paper
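The progressive-unfreezing idea above might be expressed as a depth schedule. The paper's actual schedule is not specified in this report, so the linear ramp and the name `depth_schedule` below are assumptions for illustration only.

```python
def depth_schedule(step, warmup_steps, num_layers):
    """Hypothetical linear ramp: the number of active (unfrozen)
    Transformer layers grows from 1 to `num_layers` over the warm-up
    window; layers beyond that count act as identity (e.g., residual
    branches scaled to zero)."""
    if step >= warmup_steps:
        return num_layers
    frac = step / warmup_steps
    return max(1, int(round(frac * num_layers)))
```

Coupling such a ramp to the learning-rate warm-up window is one plausible way to keep the preconditioned curvature below the stability threshold while depth grows.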
Theoretical bounds on eigenvector change and iteration savings for warm-started power iteration

The authors provide theoretical analysis (Theorems 1 and 2) establishing that the top Hessian eigenvector evolves slowly under Lipschitz continuity assumptions, and quantify the iteration count reduction achieved by warm-starting. This theoretical foundation justifies why warm-started power iteration converges faster and more accurately than random initialization.

10 retrieved papers
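For intuition on why warm-starting saves iterations, the standard convergence analysis of power iteration (not the paper's Theorems 1 and 2, which are not reproduced in this report) goes as follows:

```latex
% Standard power-iteration bound for a symmetric matrix with
% eigenvalues \lambda_1 > \lambda_2 \ge \dots and top eigenvector u_1.
% Let \theta_k denote the angle between the k-th iterate v_k and u_1:
\[
  \sin\theta_k \;\le\; \tan\theta_0 \left(\frac{\lambda_2}{\lambda_1}\right)^{k},
\]
% so reaching accuracy \varepsilon requires roughly
\[
  k \;\gtrsim\; \frac{\log(\tan\theta_0 / \varepsilon)}{\log(\lambda_1 / \lambda_2)}
\]
% iterations. Warm-starting from the previous step's eigenvector keeps
% \theta_0 small whenever the top eigenvector drifts slowly between
% steps -- precisely what a Lipschitz-type assumption on the Hessian
% would guarantee.
```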

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
