Long-Context Generalization with Sparse Attention

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: long-context, sparse attention, length generalisation
Abstract:

Transformer-based architectures traditionally employ softmax to compute attention weights, which produces dense distributions over all tokens in a sequence. While effective in many settings, this density has been shown to be detrimental for tasks that demand precise focus on fixed-size patterns: as sequence length increases, non-informative tokens accumulate attention probability mass, leading to dispersion and representational collapse. We show in this paper that dynamically sparse attention mechanisms using α-entmax can avoid these issues, due to their ability to assign exact zeros to irrelevant tokens. Furthermore, we introduce Adaptive-Scalable Entmax (ASEntmax), which endows α-entmax with a learnable temperature parameter, allowing the attention distribution to interpolate between sparse (pattern-focused) and dense (softmax-like) regimes. Our empirical evaluation on synthetic tasks and language modeling demonstrates that ASEntmax substantially outperforms softmax, scalable softmax, and fixed-temperature α-entmax baselines: it achieves up to 1000× length extrapolation on synthetic benchmarks and superior long-context generalization on language modeling, including better perplexity trends and higher retrieval accuracies at 8× the training length, while preserving short-context performance.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces ASEntmax, a learnable-temperature variant of α-entmax attention, and provides theoretical analysis of sparse attention for long-context modeling. It sits within the Sparse and Selective Attention leaf of the taxonomy, which contains only three papers total. This is a relatively sparse research direction compared to denser areas like Positional Encoding methods or Hybrid Architectures. The two sibling papers focus on structured sparsity patterns (Big Bird) and compression-based approaches, while this work emphasizes adaptive sparsity through learnable temperature parameters that interpolate between sparse and dense regimes.

The taxonomy reveals that Sparse and Selective Attention is one of six subtopics under Attention Mechanism Modifications, alongside Dense Attention Variants, Memory-Augmented Attention, and others. Neighboring leaves include Dense Attention Variants (which maintain full distributions) and Memory-Augmented Attention (which extends context through compression or recurrence). The scope note for this leaf emphasizes 'sparsity or selectivity to handle longer sequences efficiently,' distinguishing it from dense variants that modify softmax while preserving full distributions. This work bridges efficiency and generalization concerns, connecting to both the computational motivations of sparse attention and the length extrapolation goals central to the broader taxonomy.

Among 29 candidates examined across three contributions, none were found to clearly refute the proposed work. The theoretical analysis of α-entmax examined 10 candidates with no refutations; ASEntmax examined 9 candidates with no refutations; and the empirical demonstration of extreme length extrapolation examined 10 candidates with no refutations. This suggests that within the limited search scope, the combination of learnable-temperature sparse attention and its application to length extrapolation appears relatively unexplored. However, the search examined only top-K semantic matches and citations, not an exhaustive survey of sparse attention or entmax literature.

Based on the limited literature search, the work appears to occupy a distinct position within sparse attention research by combining adaptive temperature learning with length generalization objectives. The taxonomy context shows this is a less crowded area compared to positional encoding or hybrid architecture research. The absence of refuting candidates among 29 examined suggests novelty within the search scope, though broader entmax or sparse attention communities may contain relevant prior work not captured by semantic search.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: length generalization in transformer attention mechanisms. The field addresses how transformers can maintain or improve performance when processing sequences longer than those seen during training. The taxonomy organizes research into seven main branches: Positional Encoding and Embedding Schemes explore how position information affects extrapolation (e.g., ALiBi[20]); Attention Mechanism Modifications redesign core attention operations through sparse patterns, compression, or alternative formulations; Hybrid Architectures blend transformers with recurrent or state-space models (e.g., Samba[25], Mega[33]); Training Strategies and Data Augmentation develop curriculum or augmentation methods; Theoretical Foundations provide formal analyses of generalization bounds and expressiveness; Empirical Studies systematically evaluate length extrapolation across tasks; and Domain-Specific Applications tailor solutions to speech, vision, or reasoning domains.

These branches reflect complementary perspectives: some focus on architectural innovation, others on training regimes or theoretical guarantees, yet all converge on enabling transformers to handle longer contexts reliably.

Within Attention Mechanism Modifications, sparse and selective attention methods form a particularly active line of work, balancing computational efficiency with representational capacity. Big Bird[29] introduced structured sparsity patterns combining local, global, and random attention, demonstrating that carefully designed sparse schemes can preserve model quality while reducing quadratic complexity. Query-Key Compression[41] takes a different approach by compressing attention matrices to manage memory and computation. Sparse Attention Generalization[0] sits within this cluster, emphasizing how sparsity patterns themselves can be designed or learned to improve length extrapolation rather than merely reduce cost.
Compared to Big Bird[29], which fixes sparsity structure a priori, and Query-Key Compression[41], which focuses on compression mechanics, Sparse Attention Generalization[0] appears to investigate adaptive or principled sparse designs that explicitly target generalization to longer sequences, bridging efficiency concerns with the core challenge of length robustness.

Claimed Contributions

Theoretical analysis of α-entmax for long-context modeling

The authors provide theoretical guarantees demonstrating that α-entmax attention avoids attention dispersion, prevents representational collapse, and alleviates over-squashing in long-context transformers. They prove that α-entmax maintains bounded normalized entropy and reduces the number of gradient paths from O(n^L) to O(s^L), where n is the sequence length, s is the size of the sparse attention support, and L is the number of layers.
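The dispersion claim can be illustrated with the α = 2 special case of α-entmax (sparsemax), which admits a simple closed-form solution; the paper's general α-entmax requires an extra normalization step not shown here. A minimal numpy sketch, not the authors' implementation: with one relevant token among growing numbers of distractors, the softmax weight on the relevant token decays toward zero, while sparsemax keeps its full probability mass there by zeroing the distractors exactly.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sparsemax(z):
    """alpha=2 instance of alpha-entmax: Euclidean projection of the
    logits onto the probability simplex, which yields exact zeros."""
    z_sorted = np.sort(z)[::-1]                  # logits in descending order
    cssv = np.cumsum(z_sorted)                   # cumulative sums
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cssv            # tokens kept in the support
    k_z = k[support][-1]                         # support size
    tau = (cssv[support][-1] - 1) / k_z          # threshold
    return np.maximum(z - tau, 0.0)

# One "relevant" token with logit 2.0 among n-1 zero-logit distractors.
for n in (16, 256, 4096):
    z = np.zeros(n)
    z[0] = 2.0
    print(n, softmax(z)[0], sparsemax(z)[0])     # softmax mass decays; sparsemax stays 1.0
```

The softmax weight on the target shrinks roughly as 1/n, which is the dispersion the theory formalizes; sparsemax returns a one-hot distribution at every length.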

10 retrieved papers
Adaptive-Scalable Entmax (ASEntmax)

The authors propose ASEntmax, a novel attention mechanism that extends α-entmax with learnable, head-specific and query-specific temperature parameters. This allows the model to adaptively adjust sparsity based on sequence length and content, balancing between sparse and dense attention regimes.
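As a sketch of the interpolation idea only (not the authors' parameterization), a scalar temperature applied before a sparse mapping controls the support size: a low temperature sharpens the logits and shrinks the support, while a high temperature flattens them toward a dense, near-uniform distribution. ASEntmax learns such temperatures per head and per query; the sketch below uses the α = 2 sparsemax instance for simplicity, where the paper uses general α-entmax.

```python
import numpy as np

def sparsemax(z):
    """alpha=2 instance of alpha-entmax: simplex projection with exact zeros."""
    z_sorted = np.sort(z)[::-1]
    cssv = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cssv
    k_z = k[support][-1]
    tau = (cssv[support][-1] - 1) / k_z
    return np.maximum(z - tau, 0.0)

def tempered_attention(scores, temperature):
    """Temperature-scaled sparse attention: in ASEntmax the temperature
    would be a learned, head- and query-specific parameter."""
    return sparsemax(scores / temperature)

scores = np.array([2.0, 1.0, 0.5, 0.1])
print(tempered_attention(scores, 0.25))   # low temperature: sparse, near one-hot
print(tempered_attention(scores, 10.0))   # high temperature: dense, all tokens kept
```

Moving the temperature between these extremes interpolates continuously between the sparse (pattern-focused) and dense (softmax-like) regimes described above.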

9 retrieved papers
Empirical demonstration of extreme length extrapolation

The authors demonstrate through extensive experiments that ASEntmax achieves superior long-context generalization, including 1000× length extrapolation on synthetic tasks and improved perplexity trends and retrieval accuracies at 8× training length on language modeling tasks.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Theoretical analysis of α-entmax for long-context modeling

The authors provide theoretical guarantees demonstrating that α-entmax attention avoids attention dispersion, prevents representational collapse, and alleviates over-squashing in long-context transformers. They prove that α-entmax maintains bounded normalized entropy and reduces the number of gradient paths from O(n^L) to O(s^L), where n is the sequence length, s is the size of the sparse attention support, and L is the number of layers.

Contribution

Adaptive-Scalable Entmax (ASEntmax)

The authors propose ASEntmax, a novel attention mechanism that extends α-entmax with learnable, head-specific and query-specific temperature parameters. This allows the model to adaptively adjust sparsity based on sequence length and content, balancing between sparse and dense attention regimes.

Contribution

Empirical demonstration of extreme length extrapolation

The authors demonstrate through extensive experiments that ASEntmax achieves superior long-context generalization, including 1000× length extrapolation on synthetic tasks and improved perplexity trends and retrieval accuracies at 8× training length on language modeling tasks.