Abstract:

Despite the popularity of the Transformer architecture, the standard algorithm for computing Attention suffers from quadratic time complexity in the context length n. Alman and Song showed that when the head dimension d = Θ(log n), subquadratic Attention is possible if and only if the inputs have entries bounded by B = o(√(log n)) in absolute value, under the Strong Exponential Time Hypothesis (SETH). Equivalently, for d = Θ(log n), subquadratic Attention is possible if and only if the softmax is applied with high temperature. The running times of these algorithms depend exponentially on B, and thus they do not yield even a polynomial-time algorithm outside this specific range of B.

This naturally leads to the question: when can Attention be computed efficiently without strong assumptions on temperature? Are there fast Attention algorithms that scale polylogarithmically with the entry size B? In this work, we resolve this question and characterize when fast Attention for arbitrary temperatures is possible. First, for all constant head dimensions d = O(1), we give the first subquadratic Õ(n^(2 - 1/d) · polylog(B))-time algorithm for Attention with large B. Our result holds even for matrices with large head dimension, provided they have low rank. Combined with a reduction from Gradient Computation to Attention, we obtain a subquadratic algorithm for the full LLM training process. Furthermore, we show that any substantial improvement on our algorithm is unlikely. In particular, we show that even when d = 2^Θ(log* n), Attention requires n^(2 - o(1)) time under SETH.

Finally, in the regime where d = poly(n), the standard algorithm requires O(n^2 d) time, while previous lower bounds only ruled out algorithms that run in truly subquadratic time in n. We close this gap and show that the standard algorithm is optimal under popular fine-grained complexity assumptions.
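For reference, the quadratic-time baseline discussed in the abstract can be sketched as follows. This is an illustrative NumPy implementation of vanilla softmax Attention with an explicit temperature parameter, not the paper's subquadratic algorithm; the function and variable names are our own. It makes visible both the n x n score matrix behind the O(n^2 d) cost and how temperature rescales the score entries (which is why bounded entries and high temperature are equivalent conditions).

```python
import numpy as np

def attention(Q, K, V, temperature=1.0):
    """Standard softmax Attention on n x d matrices Q, K, V.

    Illustrative sketch only: materializing the n x n score matrix
    is the O(n^2 d) bottleneck the abstract describes. Dividing the
    scores by the temperature is equivalent to shrinking the entry
    magnitudes, so high temperature corresponds to small entry bound B.
    """
    scores = (Q @ K.T) / temperature             # n x n score matrix
    scores -= scores.max(axis=1, keepdims=True)  # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V                           # n x d output

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.standard_normal((3, n, d))
out = attention(Q, K, V)
print(out.shape)  # prints (8, 4)
```

As a sanity check on the temperature intuition: as the temperature grows, the softmax weights flatten toward uniform, so every output row approaches the average of the rows of V.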

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes the first truly subquadratic algorithm for computing Attention with constant head dimension and arbitrary temperature, achieving Õ(n^(2-1/d)) runtime that scales polylogarithmically with entry size B. It also establishes conditional lower bounds and optimality results under fine-grained complexity assumptions. Within the taxonomy, this work sits in the 'Algorithmic Complexity and Hardness Results' leaf under 'Theoretical Foundations and Computational Complexity'. Notably, this leaf contains only the original paper itself—no sibling papers—indicating a sparse research direction focused specifically on algorithmic complexity for arbitrary-temperature attention.

The taxonomy reveals that most related work resides in neighboring branches rather than the same leaf. The sibling leaf 'Attention Behavior Analysis and Normalization' contains one paper examining theoretical properties without complexity focus. The broader field structure shows substantial activity in adaptive temperature mechanisms (three papers in 'Self-Adaptive Temperature Mechanisms') and application-specific tuning across vision, language, and time-series domains (nine papers total). These directions emphasize empirical performance and dynamic adjustment rather than worst-case algorithmic guarantees, highlighting how this work diverges by prioritizing computational complexity theory over practical adaptation strategies.

Among fourteen candidates examined, the contribution-level analysis shows varied novelty profiles. The first contribution (subquadratic algorithm for constant d) examined zero candidates, suggesting limited direct prior work in this specific formulation. The second contribution (conditional lower bounds) examined eight candidates with one appearing to provide overlapping results, indicating some existing hardness analysis in related settings. The third contribution (optimality for large d) examined six candidates with none clearly refuting it. This pattern suggests the algorithmic result may be more novel than the lower-bound analysis, though the limited search scope (fourteen total candidates) means substantial relevant work could exist beyond top-K semantic matches.

Based on the available signals from this limited literature search, the work appears to occupy a relatively unexplored niche at the intersection of fine-grained complexity theory and attention mechanisms. The taxonomy structure confirms sparse activity in pure algorithmic complexity for arbitrary-temperature attention, with most field energy directed toward adaptive methods and applications. However, the analysis covers only top-K semantic matches and does not constitute an exhaustive survey of theoretical computer science literature on attention complexity.

Taxonomy

Core-task Taxonomy Papers: 15
Claimed Contributions: 3
Contribution Candidate Papers Compared: 14
Refutable Papers: 1

Research Landscape Overview

Core task: computing the attention mechanism with arbitrary temperature. The field structure reflects a multifaceted exploration of how temperature parameters shape attention computations in neural architectures. The taxonomy organizes work into four main branches: Theoretical Foundations and Computational Complexity examines the algorithmic underpinnings and hardness results for efficient attention; Adaptive and Learnable Temperature Control investigates methods that dynamically adjust temperature during training or inference, as seen in works like Adaptive Attention Temperature[9] and Attention Temperature ViT[4]; Temperature Tuning for Model Transfer and Generalization focuses on leveraging temperature to improve cross-domain performance and knowledge distillation, exemplified by Attention Temperature Distillation[3]; and Application-Specific Temperature Tuning applies temperature modulation to diverse domains ranging from vision tasks to reasoning and generation, including efforts like Temperature-Guided Reasoning[14] and Optimal Temperature In-Context[11].

Several active lines of work reveal contrasting emphases and open questions. One cluster explores learnable or adaptive temperature schemes that treat the parameter as trainable rather than fixed, balancing expressiveness against computational overhead. Another line investigates the role of temperature in model compression and transfer, where careful tuning can preserve performance under distillation or domain shift.

Subquadratic Attention[0] sits within the Theoretical Foundations branch, focusing on algorithmic complexity and hardness results for computing attention with arbitrary temperature efficiently. This contrasts with adaptive approaches like Adaptive Attention Temperature[9], which prioritize dynamic adjustment over worst-case complexity guarantees, and with application-driven studies such as Selective Attention[5] or TianXing[2], which emphasize empirical gains in specific settings. The interplay between theoretical tractability and practical flexibility remains a central tension, with Subquadratic Attention[0] contributing foundational insights into the computational limits of temperature-parameterized attention.

Claimed Contributions

First truly subquadratic algorithm for Attention with constant head dimension and arbitrary temperature

The authors present the first algorithm that computes Attention in truly subquadratic time for constant head dimension d while scaling polylogarithmically (rather than exponentially) with the entry size B. This removes the restriction to high-temperature softmax that previous subquadratic algorithms required.

0 retrieved papers

Conditional lower bounds showing near-optimality of the algorithm

The authors establish that any substantial improvement over their algorithm is unlikely by proving that Attention requires near-quadratic time even when the head dimension is extremely slowly growing (d = 2^Θ(log* n)), under the Strong Exponential Time Hypothesis. This demonstrates the tightness of their algorithmic result.

8 retrieved papers
Can Refute

Optimality result for large head dimension under fine-grained complexity assumptions

The authors prove that when the head dimension is polynomial in context length, the standard O(n^2 d) algorithm is conditionally optimal under the Generalized High-Dimensional Orthogonal Vectors Hypothesis, closing the gap between known upper and lower bounds in this regime.

6 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, a partial signal of novelty that remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

First truly subquadratic algorithm for Attention with constant head dimension and arbitrary temperature

The authors present the first algorithm that computes Attention in truly subquadratic time for constant head dimension d while scaling polylogarithmically (rather than exponentially) with the entry size B. This removes the restriction to high-temperature softmax that previous subquadratic algorithms required.

Contribution

Conditional lower bounds showing near-optimality of the algorithm

The authors establish that any substantial improvement over their algorithm is unlikely by proving that Attention requires near-quadratic time even when the head dimension is extremely slowly growing (d = 2^Θ(log* n)), under the Strong Exponential Time Hypothesis. This demonstrates the tightness of their algorithmic result.

Contribution

Optimality result for large head dimension under fine-grained complexity assumptions

The authors prove that when the head dimension is polynomial in context length, the standard O(n^2 d) algorithm is conditionally optimal under the Generalized High-Dimensional Orthogonal Vectors Hypothesis, closing the gap between known upper and lower bounds in this regime.