Scaling Linear Attention with Sparse State Expansion

ICLR 2026 Conference Submission · Anonymous Authors
Linear Attention · Language Model
Abstract:

The Transformer architecture, despite its widespread success, struggles with long-context scenarios due to quadratic computation and linear memory growth. While various linear attention variants mitigate these efficiency constraints by compressing context into fixed-size states, they often degrade performance in tasks such as in-context retrieval and reasoning. To address this limitation and achieve more effective context compression, we propose two key innovations. First, we introduce a row-sparse update formulation for linear attention by conceptualizing state updating as information classification. This enables sparse state updates via softmax-based top-k hard classification, thereby extending receptive fields and reducing inter-class interference. Second, we present Sparse State Expansion (SSE) within the sparse framework, which expands the contextual state into multiple partitions, effectively decoupling parameter size from state capacity while maintaining the sparse classification paradigm. Supported by efficient parallelized implementations, our design achieves effective classification and highly discriminative state representations. We extensively validate SSE in both pure linear and hybrid (SSE-H) architectures across language modeling, in-context retrieval, and mathematical reasoning benchmarks. SSE demonstrates strong retrieval performance and scales favorably with state size. Moreover, after reinforcement learning (RL) training, our 2B SSE-H model achieves state-of-the-art mathematical reasoning performance among small reasoning models, scoring 64.5 on AIME24 and 50.2 on AIME25, significantly outperforming similarly sized open-source Transformers. These results highlight SSE as a promising and efficient architecture for long-context modeling.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes row-sparse updates and Sparse State Expansion (SSE) for linear attention, aiming to improve context compression and retrieval in long-sequence scenarios. It resides in the 'Sparse and Selective State Linear Attention' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy of fifty papers. This leaf focuses specifically on sparse updates and state expansion techniques, distinguishing it from denser areas like gated linear attention (four papers) or kernel-based methods (three papers). The small sibling count suggests this particular approach to sparsity and state management is less crowded than other linear attention variants.

The taxonomy reveals neighboring leaves exploring alternative strategies for efficient linear attention. Gated Linear Attention Variants (four papers) use gating mechanisms for selective information flow, while Kernel-Based and Low-Rank Linear Attention (three papers) approximates softmax via feature transformations. Hybrid Softmax-Linear Architectures (three papers) integrate both attention types, and Recurrent and State Space Formulations (three papers) cast linear attention as RNNs with constant-size states. The paper's focus on sparse classification and state partitioning diverges from these directions by explicitly decoupling parameter size from state capacity through expansion, rather than relying on gating, kernel approximations, or hybrid designs.

Among the twelve candidates examined, the Sparse State Expansion mechanism shows potential overlap with one prior work, while the row-sparse update formulation and the parallelized implementations appear more distinct. Specifically, the SSE contribution was compared against five candidates, one of which the system judged refutable, suggesting some precedent exists within the limited search scope. The row-sparse formulation was compared against only one candidate, and the implementation contribution against six, with no refutations in either case. These statistics reflect a targeted semantic search rather than an exhaustive survey, so the apparent novelty of the row-sparse updates and the implementations should be interpreted cautiously given the small candidate pool.

Based on the limited search of twelve semantically similar papers, the work appears to occupy a less-explored niche within linear attention research, though the SSE mechanism may have closer precedents than the other contributions. The taxonomy structure confirms that sparse and selective state methods remain a relatively small subfield compared to gated or kernel-based approaches. However, the restricted candidate pool means potentially relevant work outside the top semantic matches could exist, particularly in adjacent leaves like recurrent formulations or memory-augmented attention, which were not exhaustively examined.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 12
Refutable Papers: 1

Research Landscape Overview

Core task: efficient long-context modeling with linear attention. The field addresses the computational bottleneck of standard softmax attention by developing mechanisms that scale linearly rather than quadratically with sequence length. The taxonomy reveals five main branches:

- Linear Attention Mechanisms and Architectures explores novel attention formulations, including sparse and selective state designs like Mamba[18] and Mixture-of-Memories[47].
- Efficient Implementation and Hardware Optimization focuses on parallelization strategies such as Lightning Attention[11] and Tiled Flash Linear[20].
- Theoretical Analysis and Learning Dynamics investigates convergence properties and expressiveness through works like Linear Attention Asymptotics[2].
- Domain-Specific Applications and Adaptations tailors linear attention to vision, speech, and other modalities.
- Long-Context Language Modeling and Scaling examines how these methods perform when extended to extremely long sequences, as in Infini-Attention[3] and Kimi Linear[39].

Within the mechanisms branch, a particularly active line of work centers on sparse and selective state linear attention, where models dynamically choose which information to retain or compress. Sparse State Expansion[0] sits squarely in this cluster, emphasizing selective expansion of hidden states to balance expressiveness and efficiency. This contrasts with approaches like Mamba[18], which uses data-dependent gating to filter state updates, and Mixture-of-Memories[47], which employs multiple memory banks to capture diverse contextual patterns. A key trade-off across these methods involves the tension between maintaining rich representational capacity and preserving the linear-complexity guarantee: some designs introduce mild sparsity or selectivity to improve quality without sacrificing scalability, while others pursue more aggressive compression.

Open questions remain about how best to integrate these selective mechanisms with hardware-efficient implementations, and whether hybrid architectures combining linear and softmax components offer superior performance on real-world long-context tasks.

Claimed Contributions

Row-sparse update formulation for linear attention

The authors propose a novel framework that treats state rows as distinct latent categories and uses softmax-based top-k row selection to enable sparse state updates. This approach extends receptive fields and reduces information interference compared to vanilla linear attention.

1 retrieved paper
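As a rough illustration of the claim above, the sketch below treats each state row as a latent class: the key vector is read as classification logits over rows, and only the top-k rows receive the write, weighted by a softmax restricted to those rows. The function name, shapes, and restricted-softmax weighting are illustrative assumptions, not the authors' exact formulation.

```python
import numpy as np

def row_sparse_update(S, k, v, topk=2):
    """Hedged sketch of a row-sparse linear-attention state update.

    Each row of the state S (shape d_k x d_v) is treated as a latent
    class slot. The key vector k (shape (d_k,)) is read as classification
    logits over rows; only the top-k rows receive the write (hard
    classification), weighted by a softmax over the selected logits.
    """
    idx = np.argsort(k)[-topk:]        # indices of the top-k scoring rows
    logits = k[idx]
    w = np.exp(logits - logits.max())
    w = w / w.sum()                    # softmax over the selected rows only
    S = S.copy()
    S[idx] += np.outer(w, v)           # sparse outer-product write
    return S
```

Compared with a vanilla linear-attention update, which writes `outer(k, v)` into every row, this touches only `topk` rows per token, which is the mechanism the paper credits with reducing inter-class interference.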
Sparse State Expansion (SSE) mechanism

SSE expands the contextual state into N partitions with shared attention parameters and uses a write-read gate for partition selection. This design effectively decouples parameter size from state capacity while maintaining the sparse row-selection paradigm, enabling more discriminative state representations.

5 retrieved papers
Can Refute
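A minimal sketch of how partitioned states with a shared write-read gate might look, under stated assumptions: the gate projection `W_g` and the hard top-1 partition routing are our simplifications for illustration, not the paper's exact parameterization.

```python
import numpy as np

class SparseStateExpansionSketch:
    """Hedged sketch of SSE-style state expansion with a write-read gate.

    The contextual state is expanded into n_part partitions that share
    the same attention parameters; a gate selects one partition per token
    for both writing and reading, so state capacity grows with n_part
    while the parameter count stays fixed.
    """

    def __init__(self, d_k, d_v, n_part, seed=0):
        rng = np.random.default_rng(seed)
        self.states = np.zeros((n_part, d_k, d_v))
        self.W_g = 0.1 * rng.standard_normal((n_part, d_k))  # partition gate

    def _gate(self, x):
        return int(np.argmax(self.W_g @ x))   # hard partition selection

    def write(self, k, v):
        p = self._gate(k)
        self.states[p] += np.outer(k, v)      # ordinary linear-attention write,
        return p                              # confined to one partition

    def read(self, q):
        p = self._gate(q)
        return self.states[p].T @ q           # read only the gated partition
```

Because every partition is updated and read with the same projections, adding partitions expands state capacity without adding attention parameters, which is the decoupling the contribution describes.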
Efficient parallelized implementations of SSE

The authors develop parallelized implementations of SSE using masking and varlen techniques optimized for various training contexts. These implementations enable efficient execution while maintaining the benefits of sparse state expansion across different sequence lengths.

6 retrieved papers
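The varlen idea can be illustrated with a sequential reference implementation: sequences are packed along the token axis with cumulative boundaries (the `cu_seqlens` convention borrowed from common varlen attention APIs), and the recurrent state is reset at each boundary so updates never leak across packed documents. This is an unoptimized sketch of the packing semantics only, not the authors' parallel kernels.

```python
import numpy as np

def packed_linear_attention(q, k, v, cu_seqlens):
    """Sequential reference sketch of varlen-style packing.

    Several sequences are concatenated along the token axis; cu_seqlens
    holds the cumulative sequence boundaries. The state is re-initialized
    at every boundary, so no information crosses packed documents.
    q, k: (total_tokens, d_k); v: (total_tokens, d_v).
    """
    out = np.zeros((q.shape[0], v.shape[1]))
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        S = np.zeros((q.shape[1], v.shape[1]))  # fresh state per sequence
        for t in range(start, end):
            S += np.outer(k[t], v[t])           # causal state accumulation
            out[t] = S.T @ q[t]
    return out
```

An equivalent effect can be achieved with a boundary mask inside a single chunked scan, which is presumably what the masking variant mentioned above does in the parallel setting.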
