Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Transformer, Signal Propagation, Theory of Neural Networks, Physics for Machine Learning
Abstract:

Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where all tokens collapse into similar representations, and entropy collapse, where highly concentrated attention scores lead to training instability. While previous work has studied different scaling regimes for transformers, an asymptotically exact, down-to-the-constant prescription for how to initialise transformers has so far been lacking. Here, we provide an analytical theory of signal propagation through deep transformers with self-attention, layer normalisation, skip connections and MLP. Our theory yields a simple algorithm to compute trainability diagrams that identify the correct choice of initialisation hyper-parameters for a given architecture. We overcome the key challenge, an exact treatment of the self-attention layer, by establishing a formal parallel with the Random Energy Model from statistical physics. We also analyse gradients in the backward pass and determine the regime where gradients vanish at initialisation. We demonstrate the versatility of our framework through three case studies. Our theoretical framework gives a unified perspective on the two failure modes of self-attention and gives quantitative predictions on the scale of both weights and residual connections that guarantee smooth training.
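The two failure modes named in the abstract can be illustrated with a toy experiment. The sketch below is an illustration of the generic phenomenon, not the paper's exact setup: the widths, depth, LayerNorm-style rescaling, and the two variance values are all assumptions made for the demo. It propagates random tokens through attention-only layers at initialisation; a small query/key scale yields near-uniform attention whose repeated averaging makes all tokens alike (rank collapse), while a large scale concentrates the attention scores on very few tokens (entropy collapse).

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, L = 64, 128, 4  # tokens, width, depth (toy sizes chosen for illustration)

def run(sigma_qk):
    """Propagate random tokens through L attention-only layers at init.

    Returns (first-layer attention entropy, final mean token cosine similarity).
    """
    X = rng.standard_normal((T, d))
    ent = None
    for layer in range(L):
        # Query/key weights with tunable initialisation scale sigma_qk
        Wq = sigma_qk * rng.standard_normal((d, d)) / np.sqrt(d)
        Wk = sigma_qk * rng.standard_normal((d, d)) / np.sqrt(d)
        logits = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
        logits -= logits.max(axis=1, keepdims=True)  # numerically stable softmax
        A = np.exp(logits)
        A /= A.sum(axis=1, keepdims=True)
        if layer == 0:  # entropy of the attention distribution at the first layer
            ent = -(A * np.log(A + 1e-12)).sum(axis=1).mean()
        X = A @ X
        X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)  # LayerNorm-like rescale
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = ((Xn @ Xn.T).sum() - T) / (T * (T - 1))  # mean off-diagonal cosine similarity
    return ent, sim

results = {s: run(s) for s in (0.1, 4.0)}
for s, (ent, sim) in results.items():
    print(f"sigma_qk={s}: first-layer entropy={ent:.2f} (uniform: {np.log(T):.2f}), "
          f"token similarity={sim:.2f}")
```

The paper's theory presumably locates the boundary between these regimes exactly and for the full architecture; this sketch only reproduces the qualitative picture at the two extremes.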

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper develops an analytical theory of signal propagation through deep transformers at initialization, providing exact prescriptions for initialization hyperparameters via trainability diagrams. It resides in the 'Unified Signal Propagation Theories' leaf alongside two sibling papers (d80a3981588bd30e144f2cd7681b2bfe and 07d60bf0b666ec37067d9414d7b2a5a7) within a taxonomy of 19 papers across the field. This leaf represents a relatively focused research direction within the broader 'Theoretical Frameworks for Signal Propagation Analysis' branch, suggesting the paper contributes to an active but not overcrowded area of theoretical investigation.

The taxonomy reveals neighboring work in mean field approaches (2 papers) and geometric/dynamical systems analysis (1 paper), indicating the paper's theoretical framework sits within a landscape of diverse mathematical tools for analyzing initialization. The taxonomy structure shows clear separation between theoretical frameworks and pathological phenomena studies (rank collapse, gradient issues, attention dynamics), with the paper positioned to bridge these areas by providing unified understanding of failure modes. Related architectural design work (normalization modifications, zero-initialized gating) exists in a separate branch, suggesting the paper's theoretical contributions may inform but differ from solution-oriented approaches.

Among 23 candidates examined, the first contribution (analytical theory of signal propagation) shows 1 refutable candidate out of 10 examined, indicating some prior theoretical work exists in this space but the specific formulation may offer distinguishing features. The second contribution (unified perspective on rank and entropy collapse) examined 3 candidates with none refutable, suggesting this framing could be relatively novel. The third contribution (Random Energy Model mapping for self-attention) examined 10 candidates with none refutable, potentially representing a more distinctive methodological innovation within the limited search scope.

Based on examination of 23 semantically-related candidates, the work appears to occupy a meaningful position within unified theoretical frameworks, with the Random Energy Model approach and dual-collapse perspective showing fewer overlaps in the examined literature. The limited search scope means comprehensive novelty assessment requires broader investigation, particularly given the paper's position in an active theoretical research direction with multiple complementary approaches.

Taxonomy

Core-task taxonomy papers: 19
Claimed contributions: 3
Contribution candidate papers compared: 23
Refutable papers: 1

Research Landscape Overview

Core task: signal propagation in deep transformers at initialisation. The field examines how information flows through transformer layers before training begins, a critical determinant of trainability and stability. The taxonomy organizes this landscape into four main branches: theoretical frameworks that develop mathematical tools for analyzing signal flow, studies of pathological phenomena like rank collapse and gradient explosion, architecture and initialization design strategies that ensure healthy propagation, and empirical investigations validating these principles across applications. Theoretical work such as Effective Theory Initialization[3] and Unifying Signal Propagation[15] provides rigorous foundations for understanding how variance and correlations evolve with depth, while design-oriented studies like Deepnet Scaling[2] and Autoinit[5] translate these insights into practical initialization schemes that enable training of extremely deep models.

A central tension emerges between understanding failure modes versus engineering solutions. Some lines of work focus on characterizing pathologies—Rank Collapse Transformers[4] examines representational degeneracy, while Feature Diversity Initialization[6] addresses homogenization of learned features—whereas others propose architectural interventions like Rezero[9] or normalization strategies in Stable Transformers[11].

Two Failure Modes[0] sits within the unified theoretical frameworks branch, closely aligned with Unifying Signal Propagation[15] in developing comprehensive analytical perspectives. Compared to neighboring works, Two Failure Modes[0] appears to emphasize identifying distinct breakdown regimes at initialization, whereas Stable Transformers[11] focuses more directly on architectural modifications to prevent instabilities. This positioning suggests the original paper contributes foundational understanding of what can go wrong, complementing both the broader unification efforts and the more application-driven initialization methods scattered across the taxonomy.

Claimed Contributions

Analytical theory of signal propagation through deep Transformers at initialisation

The authors develop an exact analytical framework that tracks how token similarity evolves through Transformer layers at initialisation. This theory yields simple algorithms to compute trainability diagrams identifying correct initialisation hyperparameters for a given architecture.

10 retrieved papers (1 refutable)
Unified perspective on rank collapse and entropy collapse via phase transition

The authors establish a formal parallel with the Random Energy Model from statistical physics to provide a unified explanation of rank collapse and entropy collapse. They identify a sharp phase transition governed by the variance of query/key weight initialisation that separates these two failure modes.

3 retrieved papers (none refutable)
Exact treatment of self-attention layer via Random Energy Model mapping

The authors introduce a novel mapping between self-attention at initialisation and the Random Energy Model from statistical physics. This mapping enables asymptotically exact analysis of self-attention in the infinite sequence length limit, overcoming previous approximations.

10 retrieved papers (none refutable)
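For readers unfamiliar with the Random Energy Model, the condensation phenomenon behind the claimed mapping can be sketched numerically. In the REM, N iid Gaussian energies are weighted by a Gibbs measure, exactly as N attention logits are weighted by a softmax; for unit-variance energies the measure condenses onto a handful of states above the critical inverse temperature β_c = √(2 ln N). The sketch below shows only this generic REM behaviour (how the paper maps the query/key variance onto an effective β is not reproduced here): the normalized Gibbs entropy drops from extensive to near zero across the transition.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1 << 16                        # number of states (think: key positions)
E = rng.standard_normal(N)         # iid Gaussian energies, like attention logits at init
beta_c = np.sqrt(2 * np.log(N))    # REM critical inverse temperature for N(0,1) energies

results = {}
for ratio in (0.5, 1.5):           # below and above the transition
    beta = ratio * beta_c
    w = np.exp(beta * (E - E.max()))       # numerically stable Gibbs weights (a softmax)
    p = w / w.sum()
    S = -(p * np.log(p + 1e-300)).sum()    # Gibbs/Shannon entropy of the measure
    results[ratio] = S / np.log(N)         # normalized: 1 = uniform, 0 = fully condensed
    print(f"beta/beta_c={ratio}: normalized entropy = {results[ratio]:.2f}")
```

Below the transition the entropy stays extensive, close to the high-temperature prediction 1 − (β/β_c)²; above it, almost all of the softmax mass sits on a few states, the REM analogue of entropy collapse.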

