Two failure modes of deep transformers and how to avoid them: a unified theory of signal propagation at initialisation
Overview
Overall Novelty Assessment
The paper develops an analytical theory of signal propagation through deep transformers at initialisation, providing exact prescriptions for initialisation hyperparameters via trainability diagrams. It sits in the 'Unified Signal Propagation Theories' leaf alongside two sibling papers (d80a3981588bd30e144f2cd7681b2bfe and 07d60bf0b666ec37067d9414d7b2a5a7) within a taxonomy of 19 papers. This leaf represents a relatively focused research direction within the broader 'Theoretical Frameworks for Signal Propagation Analysis' branch, suggesting the paper contributes to an active but not overcrowded area of theoretical investigation.
The taxonomy shows neighbouring work on mean-field approaches (2 papers) and geometric/dynamical-systems analysis (1 paper), so the paper's framework sits among several complementary mathematical tools for analysing initialisation. The taxonomy also separates theoretical frameworks from studies of pathological phenomena (rank collapse, gradient issues, attention dynamics), and the paper is positioned to bridge the two by offering a unified account of both failure modes. Related architectural-design work (normalisation modifications, zero-initialised gating) sits in a separate branch, so the paper's theoretical contributions may inform, but remain distinct from, solution-oriented approaches.
Of the 23 candidates examined in total, ten were compared against the first contribution (the analytical theory of signal propagation); one was judged potentially refuting, indicating that some prior theoretical work exists in this space, though the specific formulation may still offer distinguishing features. Three candidates were compared against the second contribution (the unified perspective on rank and entropy collapse), none of them refuting, suggesting this framing could be relatively novel. Ten candidates were compared against the third contribution (the Random Energy Model mapping for self-attention), again with none refuting, so within the limited search scope this appears to be the most distinctive methodological innovation.
Based on the 23 semantically related candidates examined, the work appears to occupy a meaningful position among unified theoretical frameworks, with the Random Energy Model approach and the dual-collapse perspective showing the fewest overlaps in the examined literature. Given the limited search scope and the paper's position in an active theoretical research direction with multiple complementary approaches, a comprehensive novelty assessment would require a broader investigation.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors develop an exact analytical framework that tracks how token similarity evolves through Transformer layers at initialisation. The theory yields simple algorithms for computing trainability diagrams that identify correct initialisation hyperparameters for a given architecture.
The authors establish a formal parallel with the Random Energy Model from statistical physics to provide a unified explanation of rank collapse and entropy collapse. They identify a sharp phase transition governed by the variance of query/key weight initialisation that separates these two failure modes.
The authors introduce a novel mapping between self-attention at initialisation and the Random Energy Model from statistical physics. The mapping enables an asymptotically exact analysis of self-attention in the infinite-sequence-length limit, avoiding the approximations required by earlier treatments.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[11] Transformers get stable: An end-to-end signal propagation theory for language models
[15] A Unifying Theory of Signal Propagation in Deep Transformers
Contribution Analysis
Detailed comparisons for each claimed contribution
Analytical theory of signal propagation through deep Transformers at initialisation
The authors develop an exact analytical framework that tracks how token similarity evolves through Transformer layers at initialisation. The theory yields simple algorithms for computing trainability diagrams that identify correct initialisation hyperparameters for a given architecture.
[3] Effective Theory of Transformers at Initialization
[7] Geometric Dynamics of Signal Propagation Predict Trainability of Transformers
[13] Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention
[32] Peri-LN: Revisiting normalization layer in the transformer architecture
[33] Contrastive Forward-Forward: A Training Algorithm of Vision Transformer
[34] The shaped transformer: Attention models in the infinite depth-and-width limit
[35] A Sparse Transformer-Enhanced Graph Convolutional Model for Robust Node Importance Ranking in Complex Networks
[36] Powernorm: Rethinking batch normalization in transformers
[37] Fusion Optimization of KAN and Transformer under the Hesitant Fuzzy Environment and its Application in Intelligent Transportation Planning
[38] Hybrid 8-bit floating point (HFP8) training and inference for deep neural networks
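To make concrete what a trainability diagram records, here is a minimal Monte-Carlo sketch rather than the authors' analytical recursion: random tokens are pushed through a stack of randomly initialised single-head attention blocks with residual connections, and the mean pairwise cosine similarity of the output tokens is recorded over a grid of query/key and value initialisation scales. The toy architecture (no MLP, no layer normalisation), the scale parametrisation, and all function names are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def mean_token_similarity(X):
    """Mean pairwise cosine similarity between token vectors (rows of X)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    gram = Xn @ Xn.T
    n = gram.shape[0]
    return (gram.sum() - n) / (n * (n - 1))

def propagate(X, depth, sigma_qk, sigma_v, rng):
    """Push tokens through `depth` random single-head attention blocks with
    residual connections and return the final mean token similarity."""
    n, d = X.shape
    for _ in range(depth):
        Wq = rng.normal(0.0, sigma_qk / np.sqrt(d), size=(d, d))
        Wk = rng.normal(0.0, sigma_qk / np.sqrt(d), size=(d, d))
        Wv = rng.normal(0.0, sigma_v / np.sqrt(d), size=(d, d))
        logits = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
        attn = np.exp(logits - logits.max(axis=1, keepdims=True))
        attn /= attn.sum(axis=1, keepdims=True)
        X = X + attn @ X @ Wv  # residual branch plus attention update
    return mean_token_similarity(X)

def trainability_diagram(sigmas_qk, sigmas_v, n=64, d=32, depth=30, seed=0):
    """Grid of final token similarity over the two initialisation scales.
    Cells saturating near 1 flag a collapse of the token geometry."""
    rng = np.random.default_rng(seed)
    diagram = np.zeros((len(sigmas_qk), len(sigmas_v)))
    for i, s_qk in enumerate(sigmas_qk):
        for j, s_v in enumerate(sigmas_v):
            X0 = rng.normal(size=(n, d))
            diagram[i, j] = propagate(X0, depth, s_qk, s_v, rng)
    return diagram
```

The paper's point is that this quantity can be obtained from an exact layer-wise recursion at initialisation, so no sampling is needed; the brute-force version above only illustrates which regions of hyperparameter space such a diagram flags as untrainable.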
Unified perspective on rank collapse and entropy collapse via phase transition
The authors establish a formal parallel with the Random Energy Model from statistical physics to provide a unified explanation of rank collapse and entropy collapse. They identify a sharp phase transition governed by the variance of query/key weight initialisation that separates these two failure modes.
[15] A Unifying Theory of Signal Propagation in Deep Transformers
[30] Attention is not all you need: Pure attention loses rank doubly exponentially with depth
[31] From Condensation to Rank Collapse: A Two-Stage Analysis of Transformer Training Dynamics
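As a toy illustration of the claimed phase transition, the sketch below treats one row of attention logits at initialisation as i.i.d. Gaussian with standard deviation beta, a stand-in for the scale set by the query/key initialisation variance, and measures the Shannon entropy of the corresponding softmax row. The i.i.d.-logit simplification and the reference scale sqrt(2 log T), the freezing point of a Random Energy Model with T configurations, are illustrative assumptions rather than the paper's exact setting.

```python
import numpy as np

def mean_attention_entropy(beta, seq_len, n_samples=500, seed=0):
    """Average Shannon entropy (nats) of a softmax over `seq_len` logits drawn
    i.i.d. from N(0, beta^2) -- a toy model of one attention row at init."""
    rng = np.random.default_rng(seed)
    entropies = np.empty(n_samples)
    for s in range(n_samples):
        logits = beta * rng.normal(size=seq_len)
        p = np.exp(logits - logits.max())
        p /= p.sum()
        entropies[s] = -(p * np.log(p + 1e-300)).sum()
    return entropies.mean()

if __name__ == "__main__":
    T = 1024
    beta_c = np.sqrt(2.0 * np.log(T))  # REM-style freezing scale for T Gaussian logits
    print(f"uniform entropy log(T) = {np.log(T):.2f}, reference scale = {beta_c:.2f}")
    for beta in (0.1, 1.0, 2.0, beta_c, 5.0, 8.0):
        print(f"beta = {beta:5.2f}   mean row entropy = {mean_attention_entropy(beta, T):.2f}")
    # Small beta: entropy stays near log(T); every token attends almost uniformly,
    # the averaging regime associated with rank collapse.
    # Large beta: entropy drops towards 0; each row concentrates on a single
    # token, the regime associated with entropy collapse.
```

In the paper the control parameter is the query/key weight variance itself and the transition between the two regimes is located analytically; the scalar beta here only mimics how that variance rescales the attention logits.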
Exact treatment of self-attention layer via Random Energy Model mapping
The authors introduce a novel mapping between self-attention at initialisation and the Random Energy Model from statistical physics. The mapping enables an asymptotically exact analysis of self-attention in the infinite-sequence-length limit, avoiding the approximations required by earlier treatments.
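To illustrate why the Random Energy Model analogy becomes exact only asymptotically, the sketch below reuses the same i.i.d. Gaussian-logit toy model (again an assumption, not the paper's derivation) and tracks the participation ratio of a softmax row as the sequence length grows: below the freezing scale the ratio decays with sequence length because attention weight spreads over many tokens, above it the ratio stays of order one because the weight condenses onto a few tokens, and the crossover sharpens into a true phase transition only in the infinite-sequence-length limit, the regime in which an REM-style analysis is exact.

```python
import numpy as np

def participation_ratio(beta, seq_len, n_samples=100, seed=0):
    """Mean participation ratio sum_i p_i^2 of a softmax over `seq_len`
    i.i.d. N(0, beta^2) logits; ~1/seq_len when spread out, ~1 when condensed."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        logits = beta * rng.normal(size=seq_len)
        p = np.exp(logits - logits.max())
        p /= p.sum()
        total += (p ** 2).sum()
    return total / n_samples

if __name__ == "__main__":
    betas = (1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
    for T in (256, 4096, 65536):
        beta_c = np.sqrt(2.0 * np.log(T))  # where condensation is expected
        row = "  ".join(f"{participation_ratio(b, T):.3f}" for b in betas)
        print(f"T = {T:6d}  (beta_c ~ {beta_c:.2f}):  {row}")
    # As T grows, values for beta < beta_c shrink towards zero while values for
    # beta > beta_c remain of order one, so the crossover becomes a sharp
    # transition only in the infinite-sequence-length limit.
```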