Abstract:

Despite the popularity of the Transformer architecture, the standard algorithm for computing Attention suffers from quadratic time complexity in the context length n. Alman and Song showed that when the head dimension d = Θ(log n), subquadratic Attention is possible if and only if the inputs have entries bounded by B = o(√(log n)) in absolute value, under the Strong Exponential Time Hypothesis (SETH). Equivalently, for d = Θ(log n), subquadratic Attention is possible if and only if the softmax is applied with high temperature. The running times of these algorithms depend exponentially on B, and thus they do not yield even a polynomial-time algorithm outside this specific range of B.

This naturally leads to the question: when can Attention be computed efficiently without strong assumptions on temperature? Are there fast Attention algorithms that scale polylogarithmically with the entry size B? In this work, we resolve this question and characterize when fast Attention for arbitrary temperatures is possible. First, for all constant head dimensions d = O(1), we give the first subquadratic Õ(n^(2 - 1/d) · polylog(B))-time algorithm for Attention with large B. Our result holds even for matrices with large head dimension, provided they have low rank. Combined with a reduction from Gradient Computation to Attention, we obtain a subquadratic algorithm for the full LLM training process. Furthermore, we show that any substantial improvement on our algorithm is unlikely. In particular, we show that even when d = 2^Θ(log* n), Attention requires n^(2 - o(1)) time under SETH.

Finally, in the regime where d = poly(n), the standard algorithm requires O(n^2 d) time, while previous lower bounds only ruled out algorithms that run in truly subquadratic time in n. We close this gap and show that the standard algorithm is optimal under popular fine-grained complexity assumptions.
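For reference, the quadratic-time baseline discussed in the abstract can be sketched as follows. This is an illustrative NumPy implementation of vanilla softmax Attention with an explicit temperature parameter, not the paper's subquadratic algorithm; the function and variable names are our own. It makes visible both the n x n score matrix behind the O(n^2 d) cost and how temperature rescales the score entries (which is why bounded entries and high temperature are equivalent conditions).

```python
import numpy as np

def attention(Q, K, V, temperature=1.0):
    """Standard softmax Attention on n x d matrices Q, K, V.

    Illustrative sketch only: materializing the n x n score matrix
    is the O(n^2 d) bottleneck the abstract describes. Dividing the
    scores by the temperature is equivalent to shrinking the entry
    magnitudes, so high temperature corresponds to small entry bound B.
    """
    scores = (Q @ K.T) / temperature             # n x n score matrix
    scores -= scores.max(axis=1, keepdims=True)  # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V                           # n x d output

rng = np.random.default_rng(0)
n, d = 8, 4
Q, K, V = rng.standard_normal((3, n, d))
out = attention(Q, K, V)
print(out.shape)  # prints (8, 4)
```

As a sanity check on the temperature intuition: as the temperature grows, the softmax weights flatten toward uniform, so every output row approaches the average of the rows of V.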

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes the first truly subquadratic algorithm for computing Attention with constant head dimension and arbitrary temperature, achieving Õ(n^(2-1/d)) runtime that scales polylogarithmically with entry size B. It also establishes conditional lower bounds and optimality results under fine-grained complexity assumptions. Within the taxonomy, this work sits in the 'Algorithmic Complexity and Hardness Results' leaf under 'Theoretical Foundations and Computational Complexity'. Notably, this leaf contains only the original paper itself—no sibling papers—indicating a sparse research direction focused specifically on algorithmic complexity for arbitrary-temperature attention.

The taxonomy reveals that most related work resides in neighboring branches rather than the same leaf. The sibling leaf 'Attention Behavior Analysis and Normalization' contains one paper examining theoretical properties without complexity focus. The broader field structure shows substantial activity in adaptive temperature mechanisms (three papers in 'Self-Adaptive Temperature Mechanisms') and application-specific tuning across vision, language, and time-series domains (nine papers total). These directions emphasize empirical performance and dynamic adjustment rather than worst-case algorithmic guarantees, highlighting how this work diverges by prioritizing computational complexity theory over practical adaptation strategies.

Among fourteen candidates examined, the contribution-level analysis shows varied novelty profiles. The first contribution (subquadratic algorithm for constant d) examined zero candidates, suggesting limited direct prior work in this specific formulation. The second contribution (conditional lower bounds) examined eight candidates with one appearing to provide overlapping results, indicating some existing hardness analysis in related settings. The third contribution (optimality for large d) examined six candidates with none clearly refuting it. This pattern suggests the algorithmic result may be more novel than the lower-bound analysis, though the limited search scope (fourteen total candidates) means substantial relevant work could exist beyond top-K semantic matches.

Based on the available signals from this limited literature search, the work appears to occupy a relatively unexplored niche at the intersection of fine-grained complexity theory and attention mechanisms. The taxonomy structure confirms sparse activity in pure algorithmic complexity for arbitrary-temperature attention, with most field energy directed toward adaptive methods and applications. However, the analysis covers only top-K semantic matches and does not constitute an exhaustive survey of theoretical computer science literature on attention complexity.

Taxonomy

Core-task Taxonomy Papers: 15
Claimed Contributions: 3
Contribution Candidate Papers Compared: 14
Refutable Papers: 1

Research Landscape Overview

Core task: computing the attention mechanism with arbitrary temperature. The field structure reflects a multifaceted exploration of how temperature parameters shape attention computations in neural architectures. The taxonomy organizes work into four main branches: Theoretical Foundations and Computational Complexity examines the algorithmic underpinnings and hardness results for efficient attention; Adaptive and Learnable Temperature Control investigates methods that dynamically adjust temperature during training or inference, as seen in works like Adaptive Attention Temperature[9] and Attention Temperature ViT[4]; Temperature Tuning for Model Transfer and Generalization focuses on leveraging temperature to improve cross-domain performance and knowledge distillation, exemplified by Attention Temperature Distillation[3]; and Application-Specific Temperature Tuning applies temperature modulation to diverse domains ranging from vision tasks to reasoning and generation, including efforts like Temperature-Guided Reasoning[14] and Optimal Temperature In-Context[11].

Several active lines of work reveal contrasting emphases and open questions. One cluster explores learnable or adaptive temperature schemes that treat the parameter as trainable rather than fixed, balancing expressiveness against computational overhead. Another line investigates the role of temperature in model compression and transfer, where careful tuning can preserve performance under distillation or domain shift.

Subquadratic Attention[0] sits within the Theoretical Foundations branch, focusing on algorithmic complexity and hardness results for computing attention with arbitrary temperature efficiently. This contrasts with adaptive approaches like Adaptive Attention Temperature[9], which prioritize dynamic adjustment over worst-case complexity guarantees, and with application-driven studies such as Selective Attention[5] or TianXing[2], which emphasize empirical gains in specific settings. The interplay between theoretical tractability and practical flexibility remains a central tension, with Subquadratic Attention[0] contributing foundational insights into the computational limits of temperature-parameterized attention.

Claimed Contributions

First truly subquadratic algorithm for Attention with constant head dimension and arbitrary temperature

The authors present the first algorithm that computes Attention in truly subquadratic time for constant head dimension d while scaling polylogarithmically (rather than exponentially) with the entry size B. This removes the restriction to high-temperature softmax that previous subquadratic algorithms required.

0 retrieved papers

Conditional lower bounds showing near-optimality of the algorithm

The authors establish that any substantial improvement over their algorithm is unlikely by proving that Attention requires near-quadratic time even when the head dimension is extremely slowly growing (d = 2^Θ(log* n)), under the Strong Exponential Time Hypothesis. This demonstrates the tightness of their algorithmic result.

8 retrieved papers
Can Refute

Optimality result for large head dimension under fine-grained complexity assumptions

The authors prove that when the head dimension is polynomial in context length, the standard O(n^2 d) algorithm is conditionally optimal under the Generalized High-Dimensional Orthogonal Vectors Hypothesis, closing the gap between known upper and lower bounds in this regime.

6 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, a partial signal of novelty that remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

First truly subquadratic algorithm for Attention with constant head dimension and arbitrary temperature

The authors present the first algorithm that computes Attention in truly subquadratic time for constant head dimension d while scaling polylogarithmically (rather than exponentially) with the entry size B. This removes the restriction to high-temperature softmax that previous subquadratic algorithms required.

Contribution

Conditional lower bounds showing near-optimality of the algorithm

The authors establish that any substantial improvement over their algorithm is unlikely by proving that Attention requires near-quadratic time even when the head dimension is extremely slowly growing (d = 2^Θ(log* n)), under the Strong Exponential Time Hypothesis. This demonstrates the tightness of their algorithmic result.

Contribution

Optimality result for large head dimension under fine-grained complexity assumptions

The authors prove that when the head dimension is polynomial in context length, the standard O(n^2 d) algorithm is conditionally optimal under the Generalized High-Dimensional Orthogonal Vectors Hypothesis, closing the gap between known upper and lower bounds in this regime.