Log-Linear Attention

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: subquadratic architecture, triton kernel, structured matrices
Abstract:

The attention mechanism in Transformers is an important primitive for accurate and scalable sequence modeling. However, its quadratic compute and linear memory complexity remain significant bottlenecks. Linear attention and state-space models enable linear-time, constant-memory sequence modeling and can moreover be trained efficiently through matmul-rich parallelization across sequence length. At their core, however, these models are still RNNs, and their use of a fixed-size hidden state to model the context is a fundamental limitation. This paper develops log-linear attention, an attention mechanism that balances the efficiency of linear attention with the expressiveness of softmax attention. Log-linear attention replaces the fixed-size hidden state with a logarithmically growing set of hidden states. We show that with a particular growth function, log-linear attention admits a similarly matmul-rich parallel form whose compute cost is log-linear in sequence length. Log-linear attention is a general framework and can be applied on top of existing linear attention variants. As case studies, we instantiate log-linear variants of two recent architectures---Mamba-2 and Gated DeltaNet---and find they perform well compared to their linear-time counterparts.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes the tasks and contributions of academic papers against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes log-linear attention, a mechanism that replaces fixed-size hidden states with logarithmically growing state sets to balance efficiency and expressiveness. It resides in the 'Log-Linear Attention Mechanisms' leaf under 'Logarithmic Memory Architectures for Sequence Modeling', where it is currently the sole paper. This leaf is part of a sparse taxonomy containing only eight papers across nine leaf nodes, suggesting the paper addresses a relatively underexplored research direction within efficient sequence modeling.

The taxonomy reveals neighboring work in hierarchical tree-based memory networks and adaptive vocabulary learning, both pursuing logarithmic scaling through different mechanisms. The sibling leaf 'Hierarchical Tree-Based Memory Networks' contains one paper on dynamic context summarization, while 'Curriculum-Based Vocabulary Expansion' explores token-level compression. The paper's focus on attention reformulation distinguishes it from these approaches, which emphasize explicit memory modules or vocabulary adaptation rather than core attention mechanism redesign.

Among the 27 candidates examined, the chunkwise parallel training algorithm shows overlap with one prior work, while the log-linear attention mechanism itself and the architecture instantiations (the Mamba-2 and Gated DeltaNet variants) were compared against 10 and 9 candidates respectively, with no clear refutations. The limited search scope means these statistics reflect top-K semantic matches rather than exhaustive coverage. The core mechanism appears more novel within this sample, while the training algorithm has an identifiable precedent.

Based on the limited literature search, the work appears to occupy a sparse research area with few direct competitors in its taxonomy leaf. The analysis covers 27 semantically related candidates but cannot claim exhaustive field coverage. The core attention mechanism shows stronger novelty signals than the training algorithm component within this sample.

Taxonomy

- Core-task taxonomy papers: 8
- Claimed contributions: 3
- Contribution candidate papers compared: 27
- Refutable papers: 1

Research Landscape Overview

Core task: efficient sequence modeling with logarithmically growing hidden states. The field addresses the challenge of scaling sequence models by constraining memory growth to logarithmic rather than linear or quadratic rates as sequence length increases. The taxonomy reveals four main branches:

- Logarithmic Memory Architectures for Sequence Modeling explores novel attention and memory mechanisms that achieve sublinear state expansion, exemplified by Logarithmic Memory Networks[1] and Log-Linear Attention[0].
- Adaptive Vocabulary and Tokenization Learning investigates dynamic token representations that reduce effective sequence length, as seen in Vocabulary Curriculum[2].
- Probabilistic Hidden State Models applies classical probabilistic frameworks such as HMMs to compress sequential information, including Hierarchical HMM Response[6] and HMM Malware Classification[7].
- Domain-Specific Sequence Modeling Applications tailors these efficiency techniques to specialized tasks, ranging from ECG Arrhythmia Classification[3] to reinforcement learning contexts such as Uncertainty State-Space RL[4].

These branches collectively aim to balance expressiveness with computational tractability. A central tension across them concerns the trade-off between architectural simplicity and domain adaptability: logarithmic memory architectures pursue general-purpose efficiency through novel attention designs, while domain-specific applications often achieve compression by exploiting task structure. Bio-Inspired Temporal Memory[5] bridges these perspectives by drawing on neuroscience principles to inform memory organization. The original paper, Log-Linear Attention[0], sits squarely within the architectural-innovation branch, proposing a mechanism whose attention cost scales log-linearly in sequence length. Compared to Logarithmic Memory Networks[1], which may emphasize explicit memory modules, Log-Linear Attention[0] focuses on reformulating the attention operation itself. This positions it as a foundational contribution to efficient attention design, distinct from probabilistic approaches like Hidden Structure Models[8] that rely on latent-variable inference, and from domain-tailored methods that sacrifice generality for task-specific gains.

Claimed Contributions

Contribution 1: Log-linear attention mechanism

The authors introduce log-linear attention, which replaces the fixed-size hidden state of linear attention with a logarithmically growing set of hidden states. This mechanism achieves log-linear compute cost and logarithmic memory cost in sequence length while maintaining a matmul-rich parallel training form.

Retrieved papers compared: 10

Contribution 2: Chunkwise parallel training algorithm

The authors develop a chunkwise parallel training algorithm that exploits the hierarchical structure of log-linear attention. The algorithm achieves O(T log T) training complexity by decomposing computations into intra-chunk and inter-chunk stages, enabling efficient parallelization on modern accelerators.

Retrieved papers compared: 8 (one candidate can refute this contribution)

Contribution 3: Log-linear variants of Mamba-2 and Gated DeltaNet

The authors demonstrate that log-linear attention is a general framework by applying it to two existing architectures (Mamba-2 and Gated DeltaNet), creating log-linear variants that maintain the original transition matrix structures while incorporating hierarchical masking for improved performance.

Retrieved papers compared: 9

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Log-linear attention mechanism

The authors introduce log-linear attention, which replaces the fixed-size hidden state of linear attention with a logarithmically growing set of hidden states. This mechanism achieves log-linear compute cost and logarithmic memory cost in sequence length while maintaining a matmul-rich parallel training form.
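To make the claimed mechanism concrete for reviewers, the growing state set can be sketched as a toy recurrence. This is our simplified illustration, not the authors' implementation: we assume a Fenwick-style (binary-counter) bucketing of the prefix into power-of-two segments, one hidden state per bucket, and a hypothetical per-level weight vector `lam` standing in for whatever learned weighting the paper uses.

```python
import numpy as np

def log_linear_attention(q, k, v, lam):
    """Toy sketch: linear attention with O(log t) bucketed states.

    Vanilla linear attention keeps one running state S = sum_t k_t v_t^T.
    Here we instead keep one state per power-of-two bucket of the prefix
    (Fenwick-style), so at step t there are O(log t) states, and the
    output weights each bucket's contribution by a per-level scalar.
    """
    T, d = q.shape
    states = []              # list of (bucket_size, state), largest first
    out = np.zeros_like(v)
    for t in range(T):
        # write: push a rank-1 state for token t, then merge equal-size
        # buckets (a binary-counter update on the bucket sizes)
        states.append((1, np.outer(k[t], v[t])))
        while len(states) >= 2 and states[-1][0] == states[-2][0]:
            sz, S2 = states.pop()
            _, S1 = states.pop()
            states.append((2 * sz, S1 + S2))
        # read: weight each bucket state by its level
        # (level 0 = the bucket nearest to t)
        o = np.zeros(v.shape[1])
        for lvl, (_, S) in enumerate(reversed(states)):
            o += lam[lvl] * (S.T @ q[t])
        out[t] = o
    return out
```

With `lam` all ones, the bucket states sum to the single running state of vanilla linear attention, so the sketch reduces exactly to the linear-time baseline; any extra expressiveness comes from letting the weights differ across scales.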

Contribution 2: Chunkwise parallel training algorithm

The authors develop a chunkwise parallel training algorithm that exploits the hierarchical structure of log-linear attention. The algorithm achieves O(T log T) training complexity by decomposing computations into intra-chunk and inter-chunk stages, enabling efficient parallelization on modern accelerators.
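The intra-chunk/inter-chunk decomposition is easiest to see in the standard chunkwise form of linear attention, sketched below. This is a single-scale skeleton under our own naming, not the paper's kernel: within a chunk, outputs come from a masked QK^T matmul; across chunks, a running state carries the prefix summary. The log-linear algorithm, as we read the claim, applies this decomposition over a hierarchy of chunk scales to reach O(T log T) total work.

```python
import numpy as np

def chunkwise_linear_attention(q, k, v, chunk=4):
    """Chunkwise-parallel form of (unnormalized, causal) linear attention.

    Intra-chunk stage: a causally masked QK^T matmul inside each chunk.
    Inter-chunk stage: a running state S summarizing all earlier chunks.
    Both stages are matmul-rich, which is what makes the form efficient
    on modern accelerators.
    """
    T, d = q.shape
    assert T % chunk == 0
    out = np.empty_like(v)
    S = np.zeros((d, v.shape[1]))                # inter-chunk prefix state
    causal = np.tril(np.ones((chunk, chunk)))    # intra-chunk causal mask
    for s in range(0, T, chunk):
        Q, K, V = q[s:s+chunk], k[s:s+chunk], v[s:s+chunk]
        intra = (Q @ K.T * causal) @ V           # tokens inside the chunk
        inter = Q @ S                            # contribution of earlier chunks
        out[s:s+chunk] = intra + inter
        S = S + K.T @ V                          # update prefix state
    return out
```

The single-scale version above matches the recurrent form token-for-token; the hierarchical variant would run the inter-chunk stage once per level, giving the claimed log factor.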

Contribution 3: Log-linear variants of Mamba-2 and Gated DeltaNet

The authors demonstrate that log-linear attention is a general framework by applying it to two existing architectures (Mamba-2 and Gated DeltaNet), creating log-linear variants that maintain the original transition matrix structures while incorporating hierarchical masking for improved performance.
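One way to picture the hierarchical masking is as a causal mask whose entries carry per-level weights: each key position falls into one Fenwick-style bucket of the query's prefix, and the mask entry is the weight of that bucket's level. The sketch below is our illustrative reconstruction, not the authors' code; `lam` (one scalar per level) is a stand-in for learned weights, and in the actual variants this mask would be composed with the base architecture's unchanged transition-matrix structure.

```python
import numpy as np

def hierarchical_mask(T, lam):
    """Toy hierarchical mask M[t, s] = lam[level(t, s)].

    level(t, s) indexes the Fenwick-style bucket of the prefix [0, t]
    that contains key position s (level 0 = the bucket nearest t);
    entries with s > t stay zero, preserving causality.
    """
    M = np.zeros((T, T))
    for t in range(T):
        hi, lvl, n = t + 1, 0, t + 1
        # peel buckets from the most recent end; bucket sizes are the
        # set bits of t+1, smallest (most recent tokens) first
        while n > 0:
            size = n & -n            # lowest set bit
            M[t, hi - size:hi] = lam[lvl]
            hi -= size
            n -= size
            lvl += 1
    return M
```

With `lam` all ones this is just the standard lower-triangular causal mask, so (QK^T ∘ M)V falls back to vanilla linear attention; level-dependent weights are what distinguish the log-linear variants in this picture.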