Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Transformer, Signal Propagation, Theory of Neural Networks, Physics for Machine Learning
Abstract:

Finding the right initialisation for neural networks is crucial to ensure smooth training and good performance. In transformers, the wrong initialisation can lead to one of two failure modes of self-attention layers: rank collapse, where all tokens collapse into similar representations, and entropy collapse, where highly concentrated attention scores lead to training instability. While previous work has studied different scaling regimes for transformers, an asymptotically exact, down-to-the-constant prescription for how to initialise transformers has so far been lacking. Here, we provide an analytical theory of signal propagation through deep transformers with self-attention, layer normalisation, skip connections and MLP. Our theory yields a simple algorithm to compute trainability diagrams that identify the correct choice of initialisation hyper-parameters for a given architecture. We overcome the key challenge, an exact treatment of the self-attention layer, by establishing a formal parallel with the Random Energy Model from statistical physics. We also analyse gradients in the backward pass and determine the regime where gradients vanish at initialisation. We demonstrate the versatility of our framework through three case studies. Our theoretical framework gives a unified perspective on the two failure modes of self-attention and gives quantitative predictions on the scale of both weights and residual connections that guarantee smooth training.
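The two failure modes named in the abstract can be illustrated with a toy experiment. The sketch below is an illustration of the generic phenomenon, not the paper's exact setup: the widths, depth, LayerNorm-style rescaling, and the two variance values are all assumptions made for the demo. It propagates random tokens through attention-only layers at initialisation; a small query/key scale yields near-uniform attention whose repeated averaging makes all tokens alike (rank collapse), while a large scale concentrates the attention scores on very few tokens (entropy collapse).

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, L = 64, 128, 4  # tokens, width, depth (toy sizes chosen for illustration)

def run(sigma_qk):
    """Propagate random tokens through L attention-only layers at init.

    Returns (first-layer attention entropy, final mean token cosine similarity).
    """
    X = rng.standard_normal((T, d))
    ent = None
    for layer in range(L):
        # Query/key weights with tunable initialisation scale sigma_qk
        Wq = sigma_qk * rng.standard_normal((d, d)) / np.sqrt(d)
        Wk = sigma_qk * rng.standard_normal((d, d)) / np.sqrt(d)
        logits = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
        logits -= logits.max(axis=1, keepdims=True)  # numerically stable softmax
        A = np.exp(logits)
        A /= A.sum(axis=1, keepdims=True)
        if layer == 0:  # entropy of the attention distribution at the first layer
            ent = -(A * np.log(A + 1e-12)).sum(axis=1).mean()
        X = A @ X
        X *= np.sqrt(d) / np.linalg.norm(X, axis=1, keepdims=True)  # LayerNorm-like rescale
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = ((Xn @ Xn.T).sum() - T) / (T * (T - 1))  # mean off-diagonal cosine similarity
    return ent, sim

results = {s: run(s) for s in (0.1, 4.0)}
for s, (ent, sim) in results.items():
    print(f"sigma_qk={s}: first-layer entropy={ent:.2f} (uniform: {np.log(T):.2f}), "
          f"token similarity={sim:.2f}")
```

The paper's theory presumably locates the boundary between these regimes exactly and for the full architecture; this sketch only reproduces the qualitative picture at the two extremes.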

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper develops an analytical theory of signal propagation through deep transformers at initialization, providing exact prescriptions for initialization hyperparameters via trainability diagrams. It resides in the 'Unified Signal Propagation Theories' leaf alongside two sibling papers (d80a3981588bd30e144f2cd7681b2bfe and 07d60bf0b666ec37067d9414d7b2a5a7) within a taxonomy of 19 papers across the field. This leaf represents a relatively focused research direction within the broader 'Theoretical Frameworks for Signal Propagation Analysis' branch, suggesting the paper contributes to an active but not overcrowded area of theoretical investigation.

The taxonomy reveals neighboring work in mean field approaches (2 papers) and geometric/dynamical systems analysis (1 paper), indicating the paper's theoretical framework sits within a landscape of diverse mathematical tools for analyzing initialization. The taxonomy structure shows clear separation between theoretical frameworks and pathological phenomena studies (rank collapse, gradient issues, attention dynamics), with the paper positioned to bridge these areas by providing unified understanding of failure modes. Related architectural design work (normalization modifications, zero-initialized gating) exists in a separate branch, suggesting the paper's theoretical contributions may inform but differ from solution-oriented approaches.

Among 23 candidates examined, the first contribution (analytical theory of signal propagation) shows 1 refutable candidate out of 10 examined, indicating some prior theoretical work exists in this space but the specific formulation may offer distinguishing features. The second contribution (unified perspective on rank and entropy collapse) examined 3 candidates with none refutable, suggesting this framing could be relatively novel. The third contribution (Random Energy Model mapping for self-attention) examined 10 candidates with none refutable, potentially representing a more distinctive methodological innovation within the limited search scope.

Based on examination of 23 semantically-related candidates, the work appears to occupy a meaningful position within unified theoretical frameworks, with the Random Energy Model approach and dual-collapse perspective showing fewer overlaps in the examined literature. The limited search scope means comprehensive novelty assessment requires broader investigation, particularly given the paper's position in an active theoretical research direction with multiple complementary approaches.

Taxonomy

Core-task taxonomy papers: 19
Claimed contributions: 3
Contribution candidate papers compared: 23
Refutable papers: 1

Research Landscape Overview

Core task: signal propagation in deep transformers at initialisation. The field examines how information flows through transformer layers before training begins, a critical determinant of trainability and stability. The taxonomy organizes this landscape into four main branches: theoretical frameworks that develop mathematical tools for analyzing signal flow, studies of pathological phenomena like rank collapse and gradient explosion, architecture and initialization design strategies that ensure healthy propagation, and empirical investigations validating these principles across applications. Theoretical work such as Effective Theory Initialization[3] and Unifying Signal Propagation[15] provides rigorous foundations for understanding how variance and correlations evolve with depth, while design-oriented studies like Deepnet Scaling[2] and Autoinit[5] translate these insights into practical initialization schemes that enable training of extremely deep models.

A central tension emerges between understanding failure modes versus engineering solutions. Some lines of work focus on characterizing pathologies—Rank Collapse Transformers[4] examines representational degeneracy, while Feature Diversity Initialization[6] addresses homogenization of learned features—whereas others propose architectural interventions like Rezero[9] or normalization strategies in Stable Transformers[11].

Two Failure Modes[0] sits within the unified theoretical frameworks branch, closely aligned with Unifying Signal Propagation[15] in developing comprehensive analytical perspectives. Compared to neighboring works, Two Failure Modes[0] appears to emphasize identifying distinct breakdown regimes at initialization, whereas Stable Transformers[11] focuses more directly on architectural modifications to prevent instabilities. This positioning suggests the original paper contributes foundational understanding of what can go wrong, complementing both the broader unification efforts and the more application-driven initialization methods scattered across the taxonomy.

Claimed Contributions

Analytical theory of signal propagation through deep Transformers at initialisation

The authors develop an exact analytical framework that tracks how token similarity evolves through Transformer layers at initialisation. This theory yields simple algorithms to compute trainability diagrams identifying correct initialisation hyperparameters for a given architecture.

10 retrieved papers (1 refutable)
Unified perspective on rank collapse and entropy collapse via phase transition

The authors establish a formal parallel with the Random Energy Model from statistical physics to provide a unified explanation of rank collapse and entropy collapse. They identify a sharp phase transition governed by the variance of query/key weight initialisation that separates these two failure modes.

3 retrieved papers (none refutable)
Exact treatment of self-attention layer via Random Energy Model mapping

The authors introduce a novel mapping between self-attention at initialisation and the Random Energy Model from statistical physics. This mapping enables asymptotically exact analysis of self-attention in the infinite sequence length limit, overcoming previous approximations.

10 retrieved papers (none refutable)
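For readers unfamiliar with the Random Energy Model, the condensation phenomenon behind the claimed mapping can be sketched numerically. In the REM, N iid Gaussian energies are weighted by a Gibbs measure, exactly as N attention logits are weighted by a softmax; for unit-variance energies the measure condenses onto a handful of states above the critical inverse temperature β_c = √(2 ln N). The sketch below shows only this generic REM behaviour (how the paper maps the query/key variance onto an effective β is not reproduced here): the normalized Gibbs entropy drops from extensive to near zero across the transition.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1 << 16                        # number of states (think: key positions)
E = rng.standard_normal(N)         # iid Gaussian energies, like attention logits at init
beta_c = np.sqrt(2 * np.log(N))    # REM critical inverse temperature for N(0,1) energies

results = {}
for ratio in (0.5, 1.5):           # below and above the transition
    beta = ratio * beta_c
    w = np.exp(beta * (E - E.max()))       # numerically stable Gibbs weights (a softmax)
    p = w / w.sum()
    S = -(p * np.log(p + 1e-300)).sum()    # Gibbs/Shannon entropy of the measure
    results[ratio] = S / np.log(N)         # normalized: 1 = uniform, 0 = fully condensed
    print(f"beta/beta_c={ratio}: normalized entropy = {results[ratio]:.2f}")
```

Below the transition the entropy stays extensive, close to the high-temperature prediction 1 − (β/β_c)²; above it, almost all of the softmax mass sits on a few states, the REM analogue of entropy collapse.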

