Critical attention scaling in long-context transformers

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: large language models, attention scaling, long-context, length scaling, rank-collapse, phase transition, YaRN, Qwen
Abstract:

As large language models scale to longer contexts, attention layers suffer from a fundamental pathology: attention scores collapse toward uniformity as the context length n increases, causing tokens to cluster excessively, a phenomenon known as rank-collapse. While attention scaling effectively addresses this deficiency by rescaling attention scores with a polylogarithmic factor β_n, theoretical justification for this approach remains lacking.

We analyze a simplified yet tractable model that magnifies the effect of attention scaling. In this model, attention exhibits a phase transition governed by the scaling factor β_n: insufficient scaling collapses all tokens to a single direction, while excessive scaling reduces attention to the identity, thereby eliminating meaningful interactions between tokens. Our main result identifies the critical scaling β_n ≍ log n and provides a rigorous justification for attention scaling in YaRN and Qwen, clarifying why logarithmic scaling maintains sparse, content-adaptive attention at large context lengths.
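The scaling rule described above is simple to state in code: multiply the attention logits by β_n = log n before the softmax. Below is a minimal NumPy sketch; the function name and the toy scores are illustrative, not taken from the paper.

```python
import numpy as np

def scaled_attention_weights(scores, n_ctx):
    """Softmax attention weights with logits rescaled by beta_n = log(n_ctx)."""
    beta = np.log(n_ctx)                       # critical scaling factor
    z = beta * np.asarray(scores, dtype=float)
    z -= z.max(axis=-1, keepdims=True)         # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum(axis=-1, keepdims=True)

# Toy example: one "relevant" key with score 1 among n-1 zero-score keys.
n = 4096
scores = np.zeros(n)
scores[0] = 1.0
w = scaled_attention_weights(scores, n)
```

In this toy, the relevant key keeps weight n / (2n - 1), about 1/2, at any context length, whereas an unscaled softmax would assign it only e / (e + n - 1), which vanishes as n grows.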

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper provides a theoretical analysis of attention scaling in long-context transformers, specifically justifying the logarithmic scaling factor used in models like YaRN and Qwen. It resides in the 'Attention Collapse and Scaling Theory' leaf under 'Theoretical Foundations and Analysis', which contains only two papers including this one. This is a notably sparse research direction within the broader taxonomy of fifty papers, suggesting that rigorous theoretical justification for attention scaling remains an underexplored area despite widespread empirical adoption of these techniques.

The taxonomy reveals that most long-context research concentrates on architecture design (linear attention, sparse patterns, hybrid mechanisms) and systems optimization rather than theoretical foundations. The paper's closest neighbor in its leaf examines embedding collapse from a representational perspective, while nearby branches address complexity analysis and probabilistic frameworks. The work diverges from the dominant empirical trend by providing mathematical grounding for a phenomenon—rank collapse—that practitioners address through heuristic scaling factors, bridging the gap between theoretical understanding and architectural practice.

Among thirty candidates examined across three contributions, none were found to clearly refute the paper's claims. The critical scaling law contribution examined ten candidates with zero refutations, as did the phase transition framework and gradient propagation analysis. This suggests that within the limited search scope, no prior work appears to have rigorously characterized the logarithmic scaling threshold or formalized the phase transition governing attention dynamics. The absence of refutable overlap across all contributions indicates potential novelty, though the search examined a modest candidate pool rather than an exhaustive literature review.

Based on the limited search scope of thirty semantically similar papers, the work appears to occupy a relatively unexplored theoretical niche. The sparse population of its taxonomy leaf and the absence of refuting candidates suggest substantive novelty, though this assessment is constrained by the top-K semantic search methodology and does not constitute comprehensive coverage of all potentially relevant theoretical work in attention mechanisms.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
30 Contribution Candidate Papers Compared
0 Refutable Papers

Research Landscape Overview

Core task: attention scaling in long-context transformers. The field has evolved into a rich landscape organized around several complementary directions. Theoretical Foundations and Analysis examines fundamental phenomena such as attention collapse and scaling behavior, providing the mathematical underpinnings for understanding how transformers behave as context grows. Architecture Design and Attention Mechanisms explores novel attention patterns—ranging from sparse schemes like Big Bird[16] and Longformer[25] to hierarchical approaches such as LongNet[2]—that reduce quadratic complexity.

Position Encoding Strategies addresses how models maintain positional awareness over extended sequences, while Systems Optimization and Efficiency tackles practical concerns like memory management and distributed computation, exemplified by Ring Attention[8] and Context Parallelism[30]. Input Processing and Compression investigates methods to condense or selectively retain information, as seen in Landmark Attention[3] and Prompt Compression[28]. Domain-Specific Applications and Survey and Comparative Studies round out the taxonomy by contextualizing these techniques in real-world settings and synthesizing progress across the field, with works like Efficient Attention Survey[4] and Context Extension Survey[5] offering broad perspectives.

Within this landscape, a particularly active line of inquiry centers on understanding and mitigating pathological behaviors that emerge at scale. Critical Attention Scaling[0] sits squarely in the Theoretical Foundations branch alongside Embedding Collapse[37], both investigating how attention distributions degrade or concentrate as context length increases. While Embedding Collapse[37] focuses on representational degradation in embedding spaces, Critical Attention Scaling[0] emphasizes the dynamics of attention weight distributions and their impact on model expressiveness.
This theoretical cluster contrasts with more architecture-driven efforts like Infini-Attention[1] or LongT5[6], which propose new mechanisms to handle long contexts without necessarily dissecting the underlying scaling laws. The interplay between these theoretical insights and architectural innovations remains an open question: understanding when and why attention collapses can inform the design of more robust long-context systems, bridging foundational analysis with practical engineering.

Claimed Contributions

Critical scaling law for attention with logarithmic factor

The authors establish that the critical scaling factor for attention scores is β_n ≍ log n, which prevents rank-collapse in long-context transformers. This result provides theoretical justification for empirical methods like YaRN and Qwen that use logarithmic scaling.

10 retrieved papers
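The effect this contribution formalizes can be checked with a back-of-the-envelope computation, using a one-relevant-token toy rather than the paper's actual model (the helper name is ours):

```python
import numpy as np

def relevant_weight(n, beta):
    """Softmax weight on a single score-1 key among n-1 score-0 keys,
    after rescaling the logits by beta."""
    return np.exp(beta) / (np.exp(beta) + n - 1)

for n in (64, 1024, 4096):
    unscaled = relevant_weight(n, beta=1.0)        # no length-dependent scaling
    critical = relevant_weight(n, beta=np.log(n))  # beta_n = log n
    print(n, unscaled, critical)
# Unscaled, the weight decays like e/n toward uniformity (rank-collapse);
# with beta_n = log n it stays at n / (2n - 1), about 1/2, at every length.
```

The unscaled column collapsing while the scaled column holds steady is exactly the behavior the logarithmic scaling law is meant to guarantee, here in its simplest possible instance.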
Phase transition framework for attention dynamics

The paper introduces a tractable mathematical model demonstrating that attention undergoes a phase transition controlled by β_n. Below the critical threshold, tokens collapse to uniformity; above it, attention becomes identity-like, eliminating meaningful token interactions.

10 retrieved papers
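The three regimes of the claimed phase transition can be caricatured with a one-relevant-token toy (an illustrative simplification, not the paper's model): sub-logarithmic scaling drives weights toward uniform, super-logarithmic scaling toward a degenerate one-hot pattern, and β_n = log n sits in between.

```python
import numpy as np

def relevant_weight(n, beta):
    # Softmax weight on one score-1 key among n-1 score-0 keys, logits scaled by beta.
    return np.exp(beta) / (np.exp(beta) + n - 1)

n = 4096
sub  = relevant_weight(n, beta=1.0)             # subcritical:   beta << log n
crit = relevant_weight(n, beta=np.log(n))       # critical:      beta  = log n
sup  = relevant_weight(n, beta=np.log(n) ** 2)  # supercritical: beta >> log n
# sub  -> order 1/n   (collapse toward uniformity)
# crit -> about 1/2   (sparse but content-adaptive)
# sup  -> about 1     (one-hot / identity-like, no mixing)
```

Only the critical choice leaves an attention distribution that is neither washed out nor frozen, mirroring the transition the contribution describes.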
Gradient propagation analysis across scaling regimes

The authors characterize how the phase transition affects gradient flow during backpropagation. They prove that gradients vanish in the subcritical regime but remain stable in the supercritical regime, connecting forward-pass rank-collapse to backward-pass gradient dynamics.

10 retrieved papers
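In a one-relevant-token toy, the gradient of the relevant key's weight with respect to its score is beta * p * (1 - p) by the softmax chain rule, so the subcritical-versus-critical contrast in gradient flow can be read off numerically. This is only a sketch of the claimed contrast, not the paper's proof, and it does not model the supercritical regime.

```python
import numpy as np

def grad_wrt_relevant_score(n, beta):
    """d/dz of the softmax weight on a score-z key (here z = 1) among n-1
    score-0 keys, with logits scaled by beta: the chain rule gives
    beta * p * (1 - p)."""
    p = np.exp(beta) / (np.exp(beta) + n - 1)
    return beta * p * (1.0 - p)

n = 4096
g_sub  = grad_wrt_relevant_score(n, beta=1.0)        # ~ e/n: vanishes as n grows
g_crit = grad_wrt_relevant_score(n, beta=np.log(n))  # ~ (log n)/4: non-vanishing
```

The subcritical gradient shrinks with context length while the critical one does not, which is the forward-collapse-to-backward-vanishing connection this contribution draws, in miniature.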

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
