Critical attention scaling in long-context transformers

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: large language models, attention scaling, long-context, length scaling, rank-collapse, phase transition, YaRN, Qwen
Abstract:

As large language models scale to longer contexts, attention layers suffer from a fundamental pathology: attention scores collapse toward uniformity as the context length n increases, causing tokens to cluster excessively, a phenomenon known as rank-collapse. While attention scaling effectively addresses this deficiency by rescaling attention scores with a polylogarithmic factor β_n, theoretical justification for this approach remains lacking.

We analyze a simplified yet tractable model that magnifies the effect of attention scaling. In this model, attention exhibits a phase transition governed by the scaling factor β_n: insufficient scaling collapses all tokens to a single direction, while excessive scaling reduces attention to the identity, thereby eliminating meaningful interactions between tokens. Our main result identifies the critical scaling β_n ≍ log n and provides a rigorous justification for attention scaling in YaRN and Qwen, clarifying why logarithmic scaling maintains sparse, content-adaptive attention at large context lengths.
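The scaling rule described above is simple to state in code: multiply the attention logits by β_n = log n before the softmax. Below is a minimal NumPy sketch; the function name and the toy scores are illustrative, not taken from the paper.

```python
import numpy as np

def scaled_attention_weights(scores, n_ctx):
    """Softmax attention weights with logits rescaled by beta_n = log(n_ctx)."""
    beta = np.log(n_ctx)                       # critical scaling factor
    z = beta * np.asarray(scores, dtype=float)
    z -= z.max(axis=-1, keepdims=True)         # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)

# Toy example: one "relevant" key with score 1 among n-1 zero-score keys.
n = 4096
scores = np.zeros(n)
scores[0] = 1.0
w = scaled_attention_weights(scores, n)
```

In this toy, the relevant key keeps weight n / (2n - 1), about 1/2, at any context length, whereas an unscaled softmax would assign it only e / (e + n - 1), which vanishes as n grows.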

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper provides a theoretical analysis of attention scaling in long-context transformers, specifically justifying the logarithmic scaling factor used in models like YaRN and Qwen. It resides in the 'Attention Collapse and Scaling Theory' leaf under 'Theoretical Foundations and Analysis', which contains only two papers including this one. This is a notably sparse research direction within the broader taxonomy of fifty papers, suggesting that rigorous theoretical justification for attention scaling remains an underexplored area despite widespread empirical adoption of these techniques.

The taxonomy reveals that most long-context research concentrates on architecture design (linear attention, sparse patterns, hybrid mechanisms) and systems optimization rather than theoretical foundations. The paper's closest neighbor in its leaf examines embedding collapse from a representational perspective, while nearby branches address complexity analysis and probabilistic frameworks. The work diverges from the dominant empirical trend by providing mathematical grounding for a phenomenon—rank collapse—that practitioners address through heuristic scaling factors, bridging the gap between theoretical understanding and architectural practice.

Among thirty candidates examined across three contributions, none were found to clearly refute the paper's claims. The critical scaling law contribution examined ten candidates with zero refutations, as did the phase transition framework and gradient propagation analysis. This suggests that within the limited search scope, no prior work appears to have rigorously characterized the logarithmic scaling threshold or formalized the phase transition governing attention dynamics. The absence of refutable overlap across all contributions indicates potential novelty, though the search examined a modest candidate pool rather than an exhaustive literature review.

Based on the limited search scope of thirty semantically similar papers, the work appears to occupy a relatively unexplored theoretical niche. The sparse population of its taxonomy leaf and the absence of refuting candidates suggest substantive novelty, though this assessment is constrained by the top-K semantic search methodology and does not constitute comprehensive coverage of all potentially relevant theoretical work in attention mechanisms.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
30 Contribution Candidate Papers Compared
0 Refutable Papers

Research Landscape Overview

Core task: attention scaling in long-context transformers. The field has evolved into a rich landscape organized around several complementary directions. Theoretical Foundations and Analysis examines fundamental phenomena such as attention collapse and scaling behavior, providing the mathematical underpinnings for understanding how transformers behave as context grows. Architecture Design and Attention Mechanisms explores novel attention patterns—ranging from sparse schemes like Big Bird[16] and Longformer[25] to hierarchical approaches such as LongNet[2]—that reduce quadratic complexity.

Position Encoding Strategies addresses how models maintain positional awareness over extended sequences, while Systems Optimization and Efficiency tackles practical concerns like memory management and distributed computation, exemplified by Ring Attention[8] and Context Parallelism[30]. Input Processing and Compression investigates methods to condense or selectively retain information, as seen in Landmark Attention[3] and Prompt Compression[28]. Domain-Specific Applications and Survey and Comparative Studies round out the taxonomy by contextualizing these techniques in real-world settings and synthesizing progress across the field, with works like Efficient Attention Survey[4] and Context Extension Survey[5] offering broad perspectives.

Within this landscape, a particularly active line of inquiry centers on understanding and mitigating pathological behaviors that emerge at scale. Critical Attention Scaling[0] sits squarely in the Theoretical Foundations branch alongside Embedding Collapse[37], both investigating how attention distributions degrade or concentrate as context length increases. While Embedding Collapse[37] focuses on representational degradation in embedding spaces, Critical Attention Scaling[0] emphasizes the dynamics of attention weight distributions and their impact on model expressiveness.
This theoretical cluster contrasts with more architecture-driven efforts like Infini-Attention[1] or LongT5[6], which propose new mechanisms to handle long contexts without necessarily dissecting the underlying scaling laws. The interplay between these theoretical insights and architectural innovations remains an open question: understanding when and why attention collapses can inform the design of more robust long-context systems, bridging foundational analysis with practical engineering.

Claimed Contributions

Critical scaling law for attention with logarithmic factor

The authors establish that the critical scaling factor for attention scores is β_n ≍ log n, which prevents rank-collapse in long-context transformers. This result provides theoretical justification for empirical methods like YaRN and Qwen that use logarithmic scaling.

10 retrieved papers
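The effect this contribution formalizes can be checked with a back-of-the-envelope computation, using a one-relevant-token toy rather than the paper's actual model (the helper name is ours):

```python
import numpy as np

def relevant_weight(n, beta):
    """Softmax weight on a single score-1 key among n-1 score-0 keys,
    after rescaling the logits by beta."""
    return np.exp(beta) / (np.exp(beta) + n - 1)

for n in (64, 1024, 4096):
    unscaled = relevant_weight(n, beta=1.0)        # no length-dependent scaling
    critical = relevant_weight(n, beta=np.log(n))  # beta_n = log n
    print(n, unscaled, critical)
# Unscaled, the weight decays like e/n toward uniformity (rank-collapse);
# with beta_n = log n it stays at n / (2n - 1), about 1/2, at every length.
```

The unscaled column collapsing while the scaled column holds steady is exactly the behavior the logarithmic scaling law is meant to guarantee, here in its simplest possible instance.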
Phase transition framework for attention dynamics

The paper introduces a tractable mathematical model demonstrating that attention undergoes a phase transition controlled by β_n. Below the critical threshold, tokens collapse to uniformity; above it, attention becomes identity-like, eliminating meaningful token interactions.

10 retrieved papers
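The three regimes of the claimed phase transition can be caricatured with a one-relevant-token toy (an illustrative simplification, not the paper's model): sub-logarithmic scaling drives weights toward uniform, super-logarithmic scaling toward a degenerate one-hot pattern, and β_n = log n sits in between.

```python
import numpy as np

def relevant_weight(n, beta):
    # Softmax weight on one score-1 key among n-1 score-0 keys, logits scaled by beta.
    return np.exp(beta) / (np.exp(beta) + n - 1)

n = 4096
sub  = relevant_weight(n, beta=1.0)             # subcritical:   beta << log n
crit = relevant_weight(n, beta=np.log(n))       # critical:      beta  = log n
sup  = relevant_weight(n, beta=np.log(n) ** 2)  # supercritical: beta >> log n
# sub  -> order 1/n   (collapse toward uniformity)
# crit -> about 1/2   (sparse but content-adaptive)
# sup  -> about 1     (one-hot / identity-like, no mixing)
```

Only the critical choice leaves an attention distribution that is neither washed out nor frozen, mirroring the transition the contribution describes.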
Gradient propagation analysis across scaling regimes

The authors characterize how the phase transition affects gradient flow during backpropagation. They prove that gradients vanish in the subcritical regime but remain stable in the supercritical regime, connecting forward-pass rank-collapse to backward-pass gradient dynamics.

10 retrieved papers
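In a one-relevant-token toy, the gradient of the relevant key's weight with respect to its score is beta * p * (1 - p) by the softmax chain rule, so the subcritical-versus-critical contrast in gradient flow can be read off numerically. This is only a sketch of the claimed contrast, not the paper's proof, and it does not model the supercritical regime.

```python
import numpy as np

def grad_wrt_relevant_score(n, beta):
    """d/dz of the softmax weight on a score-z key (here z = 1) among n-1
    score-0 keys, with logits scaled by beta: the chain rule gives
    beta * p * (1 - p)."""
    p = np.exp(beta) / (np.exp(beta) + n - 1)
    return beta * p * (1.0 - p)

n = 4096
g_sub  = grad_wrt_relevant_score(n, beta=1.0)        # ~ e/n: vanishes as n grows
g_crit = grad_wrt_relevant_score(n, beta=np.log(n))  # ~ (log n)/4: non-vanishing
```

The subcritical gradient shrinks with context length while the critical one does not, which is the forward-collapse-to-backward-vanishing connection this contribution draws, in miniature.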

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
