ProxyAttn: Guided Sparse Attention via Representative Heads

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Efficient LLM, Sparse Attention
Abstract:

The quadratic complexity of attention mechanisms limits the efficiency of Large Language Models (LLMs) on long-text tasks. Recently, methods that dynamically estimate block importance have enabled efficient block sparse attention, leading to significant acceleration in long-text pre-filling of LLMs. However, their block-level coarse-grained estimation inevitably leads to performance degradation at high sparsity ratios. In this work, we propose ProxyAttn, a training-free sparse attention algorithm that achieves token-level estimation by compressing the dimension of attention heads. Based on our observation of the similarity among multiple attention heads in long texts, we use the attention scores of pooled representative heads to approximate the scores for all heads. To account for the varying sparsity among heads, we also propose a block-aware dynamic budget estimation method. By combining the scores from a set of representative heads with a multi-head dynamic budget, we can achieve a more fine-grained block attention evaluation at a low computational cost. Experiments on a variety of mainstream models and extensive benchmarks confirm the underlying similarity among attention heads in long texts. Leveraging a token-level fine-grained estimation, the proposed method achieves substantial gains in performance and efficiency compared to existing methods. More precisely, ProxyAttn can achieve up to 10.3x attention acceleration and 2.4x prefilling acceleration without significant performance loss.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

ProxyAttn proposes a training-free sparse attention algorithm that achieves token-level importance estimation by compressing attention head dimensions and using pooled representative heads as proxies. The paper sits in the 'Mixture and Heterogeneous Sparse Attention' leaf, which contains only three papers total, including ProxyAttn itself and two siblings (MoA and MoBA). This is a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting the specific approach of heterogeneous head-level strategies remains less explored compared to uniform sparse patterns or adaptive selection methods found in neighboring leaves.

The taxonomy reveals that ProxyAttn's leaf is part of the 'Hybrid and Multi-Strategy Sparse Attention' branch, which sits alongside 'Fixed and Structured Sparse Patterns' and 'Adaptive and Dynamic Sparse Attention' within the broader 'Sparse Attention Pattern Design and Selection' category. Neighboring leaves include 'Dense-Sparse Switchable Attention' (two papers) and 'Multi-Stage and Hierarchical Hybrid Attention' (three papers), indicating that hybrid strategies collectively represent a moderately active area. The taxonomy's scope note clarifies that this leaf focuses on assigning different sparse patterns to different heads or layers, distinguishing it from uniform approaches in adjacent categories like 'Top-k and Scoring-Based Selection' (four papers) or 'Input-Dependent Dynamic Sparsity' (five papers).

Among fifteen candidates examined across three contributions, the block-aware dynamic budget estimation method shows the most substantial prior work overlap: ten candidates were examined, with one appearing to provide refutable prior work. The ProxyAttn algorithm itself examined only two candidates with no clear refutations, while the attention head similarity observation examined three candidates, also without refutations. These statistics reflect a limited semantic search scope rather than exhaustive coverage. The dynamic budget contribution appears to intersect with existing adaptive sparsity methods, whereas the proxy-based head pooling mechanism and the empirical similarity observation may represent more distinctive angles within the examined candidate set.

Based on the limited search of fifteen candidates, ProxyAttn appears to occupy a moderately novel position within a less-crowded research direction. The head-level heterogeneity strategy distinguishes it from more common uniform sparse patterns, though the dynamic budget component shows overlap with prior adaptive methods. The analysis does not cover the full landscape of sparse attention research, and a broader literature search might reveal additional related work, particularly in the adaptive sparsity and mixture-of-experts attention domains that neighbor this taxonomy leaf.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 15
Refutable Paper: 1

Research Landscape Overview

Core task: efficient sparse attention for long-context language models. The field has organized itself around several complementary directions. Sparse Attention Pattern Design and Selection explores how to choose which tokens to attend to, ranging from fixed patterns like sliding windows to adaptive strategies that predict importance on the fly. System-Level Acceleration and Implementation focuses on translating these patterns into fast kernels and memory-efficient serving systems, ensuring that theoretical sparsity gains materialize in practice. Alternative Attention Mechanisms and Extensions investigates fundamentally different architectures, such as linear attention or state-space models, that sidestep quadratic complexity altogether. Training and Adaptation for Long Contexts addresses how to extend pretrained models to longer sequences without prohibitive retraining costs, while Analysis, Benchmarking, and Theoretical Foundations provides the empirical testbeds and formal guarantees needed to compare methods rigorously. Finally, Domain-Specific and Application-Oriented Methods tailor sparse attention to particular use cases like code generation or retrieval-augmented generation, where task structure can guide sparsity choices. Within the pattern-design branch, a particularly active line of work explores hybrid and mixture-based strategies that combine multiple sparsity heuristics or dynamically route tokens to different attention modules. ProxyAttn[0] exemplifies this trend by pooling attention heads into representative proxies whose scores guide block selection for all heads, aiming to capture both local coherence and long-range dependencies without committing to a single fixed structure. This approach contrasts with simpler uniform strategies and aligns closely with recent mixture-of-attention frameworks like MoA[1] and MoBA[16], which also leverage heterogeneous attention heads to balance efficiency and expressiveness.
Meanwhile, works such as MInference[11] and SeerAttention[33] emphasize adaptive, query-driven sparsity that predicts important tokens at inference time, trading off some overhead for greater flexibility. The central tension across these methods is whether to rely on lightweight, static patterns that are easy to accelerate or to invest in more sophisticated selection mechanisms that can better preserve model quality as context lengths grow into the millions of tokens.

Claimed Contributions

ProxyAttn sparse attention algorithm

The authors introduce ProxyAttn, a training-free method that estimates block importance for sparse attention by compressing along the attention head dimension rather than the sequence dimension. It uses pooled representative proxy heads to approximate attention scores for all heads, enabling fine-grained block importance evaluation at low computational cost.

2 retrieved papers
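To make the head-compression idea concrete, here is a minimal NumPy sketch of proxy-head block importance estimation, assuming mean-pooling of head groups into proxies and a max over proxies as the shared block score. This is an illustrative reconstruction under those assumptions, not the authors' implementation; function and parameter names (`proxy_block_importance`, `n_proxy`) are hypothetical.

```python
import numpy as np

def proxy_block_importance(q, k, block_size=64, n_proxy=2):
    """Sketch: pool heads into proxies, then score key blocks per query token.

    q, k: (n_heads, seq_len, head_dim) query/key tensors.
    Returns (seq_len, n_blocks) importance scores shared by all heads.
    """
    n_heads, seq_len, d = q.shape
    group = n_heads // n_proxy
    # Mean-pool groups of heads into a few representative proxy heads.
    q_proxy = q.reshape(n_proxy, group, seq_len, d).mean(axis=1)
    k_proxy = k.reshape(n_proxy, group, seq_len, d).mean(axis=1)
    # Attention of the proxy heads only: n_proxy score maps instead of n_heads.
    scores = np.einsum("hqd,hkd->hqk", q_proxy, k_proxy) / np.sqrt(d)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    # Aggregate token-level scores into key blocks, then max over proxies.
    n_blocks = seq_len // block_size
    block = probs[:, :, : n_blocks * block_size].reshape(
        n_proxy, seq_len, n_blocks, block_size
    ).sum(-1)
    return block.max(axis=0)
```

The cost saving comes from computing only `n_proxy` score maps while retaining token-level granularity inside each block, which coarse block-pooling along the sequence dimension loses.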
Block-aware dynamic budget estimation method

The authors develop an online method to dynamically allocate different sparsity budgets to individual attention heads based on their varying sparsity characteristics. This allows diverse sparse attention patterns across heads while using unified importance scores from proxy heads.

10 retrieved papers (1 can refute)
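One plausible reading of a block-aware dynamic budget is to give each head the fewest top-scoring blocks whose cumulative score mass reaches a coverage threshold, so sparser heads receive smaller budgets. The sketch below implements that reading in NumPy; the 0.95 threshold and the name `dynamic_block_budget` are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def dynamic_block_budget(block_scores, coverage=0.95):
    """Sketch: per-head block budgets from score concentration.

    block_scores: (n_heads, n_blocks) nonnegative importance per head.
    Returns one integer budget per head: the fewest top blocks whose
    cumulative normalized score reaches `coverage`.
    """
    norm = block_scores / block_scores.sum(axis=1, keepdims=True)
    sorted_scores = np.sort(norm, axis=1)[:, ::-1]  # descending per head
    cum = np.cumsum(sorted_scores, axis=1)
    # First position where cumulative mass crosses the coverage threshold.
    return (cum < coverage).sum(axis=1) + 1
```

A head whose mass sits in one block gets a budget of 1, while a near-uniform head is assigned almost all blocks, matching the report's point that heads differ mainly in sparsity level.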
Observation of attention head similarity in long contexts

The authors empirically observe and establish that multiple attention heads exhibit consistency in which tokens they focus on in long-context scenarios, with the primary difference being their sparsity levels rather than token focus. This observation forms the theoretical foundation for using proxy heads.

3 retrieved papers
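The similarity observation can be probed with a simple diagnostic: compare which keys each head ranks highest for a given query. The sketch below measures mean pairwise Jaccard overlap of per-head top-k attended tokens; it is an assumed diagnostic for illustration, not the paper's measurement protocol.

```python
import numpy as np

def head_topk_overlap(attn, k=8):
    """Sketch: pairwise overlap of each head's top-k attended keys.

    attn: (n_heads, seq_len) attention weights one query assigns to each
    key, one row per head. Returns the mean Jaccard overlap across head
    pairs; values near 1 mean heads agree on *which* tokens matter even
    if their score concentration (sparsity) differs.
    """
    n_heads = attn.shape[0]
    topk = [set(np.argsort(row)[-k:]) for row in attn]
    overlaps = [
        len(topk[i] & topk[j]) / len(topk[i] | topk[j])
        for i in range(n_heads) for j in range(i + 1, n_heads)
    ]
    return float(np.mean(overlaps))
```

High overlap under this metric is exactly what would justify proxy heads: if heads select the same tokens, one pooled score map plus per-head budgets can stand in for all heads.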

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: ProxyAttn sparse attention algorithm

Contribution: Block-aware dynamic budget estimation method

Contribution: Observation of attention head similarity in long contexts