ProxyAttn: Guided Sparse Attention via Representative Heads

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Efficient LLM, Sparse Attention
Abstract:

The quadratic complexity of attention mechanisms limits the efficiency of Large Language Models (LLMs) on long-text tasks. Recently, methods that dynamically estimate block importance have enabled efficient block sparse attention, leading to significant acceleration in long-text pre-filling of LLMs. However, their block-level coarse-grained estimation inevitably leads to performance degradation at high sparsity ratios. In this work, we propose ProxyAttn, a training-free sparse attention algorithm that achieves token-level estimation by compressing the dimension of attention heads. Based on our observation of the similarity among multiple attention heads in long texts, we use the attention scores of pooled representative heads to approximate the scores for all heads. To account for the varying sparsity among heads, we also propose a block-aware dynamic budget estimation method. By combining the scores from a set of representative heads with a multi-head dynamic budget, we can achieve a more fine-grained block attention evaluation at a low computational cost. Experiments on a variety of mainstream models and extensive benchmarks confirm the underlying similarity among attention heads in long texts. Leveraging a token-level fine-grained estimation, the proposed method achieves substantial gains in performance and efficiency compared to existing methods. More precisely, ProxyAttn can achieve up to 10.3x attention acceleration and 2.4x prefilling acceleration without significant performance loss.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

ProxyAttn proposes a training-free sparse attention algorithm that achieves token-level importance estimation by compressing attention head dimensions and using pooled representative heads as proxies. The paper sits in the 'Mixture and Heterogeneous Sparse Attention' leaf, which contains only three papers total, including ProxyAttn itself and two siblings (MoA and MoBA). This is a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting the specific approach of heterogeneous head-level strategies remains less explored compared to uniform sparse patterns or adaptive selection methods found in neighboring leaves.

The taxonomy reveals that ProxyAttn's leaf is part of the 'Hybrid and Multi-Strategy Sparse Attention' branch, which sits alongside 'Fixed and Structured Sparse Patterns' and 'Adaptive and Dynamic Sparse Attention' within the broader 'Sparse Attention Pattern Design and Selection' category. Neighboring leaves include 'Dense-Sparse Switchable Attention' (two papers) and 'Multi-Stage and Hierarchical Hybrid Attention' (three papers), indicating that hybrid strategies collectively represent a moderately active area. The taxonomy's scope note clarifies that this leaf focuses on assigning different sparse patterns to different heads or layers, distinguishing it from uniform approaches in adjacent categories like 'Top-k and Scoring-Based Selection' (four papers) or 'Input-Dependent Dynamic Sparsity' (five papers).

Among fifteen candidates examined across three contributions, the block-aware dynamic budget estimation method shows the most substantial prior work overlap: ten candidates were examined, with one appearing to provide refutable prior work. The ProxyAttn algorithm itself examined only two candidates with no clear refutations, while the attention head similarity observation examined three candidates, also without refutations. These statistics reflect a limited semantic search scope rather than exhaustive coverage. The dynamic budget contribution appears to intersect with existing adaptive sparsity methods, whereas the proxy-based head pooling mechanism and the empirical similarity observation may represent more distinctive angles within the examined candidate set.

Based on the limited search of fifteen candidates, ProxyAttn appears to occupy a moderately novel position within a less-crowded research direction. The head-level heterogeneity strategy distinguishes it from more common uniform sparse patterns, though the dynamic budget component shows overlap with prior adaptive methods. The analysis does not cover the full landscape of sparse attention research, and a broader literature search might reveal additional related work, particularly in the adaptive sparsity and mixture-of-experts attention domains that neighbor this taxonomy leaf.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 15
Refutable Paper: 1

Research Landscape Overview

Core task: efficient sparse attention for long-context language models. The field has organized itself around several complementary directions. Sparse Attention Pattern Design and Selection explores how to choose which tokens to attend to, ranging from fixed patterns like sliding windows to adaptive strategies that predict importance on the fly. System-Level Acceleration and Implementation focuses on translating these patterns into fast kernels and memory-efficient serving systems, ensuring that theoretical sparsity gains materialize in practice. Alternative Attention Mechanisms and Extensions investigates fundamentally different architectures, such as linear attention or state-space models, that sidestep quadratic complexity altogether. Training and Adaptation for Long Contexts addresses how to extend pretrained models to longer sequences without prohibitive retraining costs, while Analysis, Benchmarking, and Theoretical Foundations provides the empirical testbeds and formal guarantees needed to compare methods rigorously. Finally, Domain-Specific and Application-Oriented Methods tailor sparse attention to particular use cases like code generation or retrieval-augmented generation, where task structure can guide sparsity choices. Within the pattern-design branch, a particularly active line of work explores hybrid and mixture-based strategies that combine multiple sparsity heuristics or dynamically route tokens to different attention modules. ProxyAttn[0] exemplifies this trend by pooling attention heads into representative proxies whose scores guide block selection for all heads, aiming to capture both local coherence and long-range dependencies without committing to a single fixed structure. This approach contrasts with simpler uniform strategies and aligns closely with recent mixture-of-attention frameworks like MoA[1] and MoBA[16], which also leverage heterogeneous attention heads to balance efficiency and expressiveness.
Meanwhile, works such as MInference[11] and SeerAttention[33] emphasize adaptive, query-driven sparsity that predicts important tokens at inference time, trading off some overhead for greater flexibility. The central tension across these methods is whether to rely on lightweight, static patterns that are easy to accelerate or to invest in more sophisticated selection mechanisms that can better preserve model quality as context lengths grow into the millions of tokens.

Claimed Contributions

ProxyAttn sparse attention algorithm

The authors introduce ProxyAttn, a training-free method that estimates block importance for sparse attention by compressing along the attention head dimension rather than the sequence dimension. It uses pooled representative proxy heads to approximate attention scores for all heads, enabling fine-grained block importance evaluation at low computational cost.

2 retrieved papers
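To make the head-compression idea concrete, here is a minimal NumPy sketch of proxy-head block importance estimation, assuming mean-pooling of head groups into proxies and a max over proxies as the shared block score. This is an illustrative reconstruction under those assumptions, not the authors' implementation; function and parameter names (`proxy_block_importance`, `n_proxy`) are hypothetical.

```python
import numpy as np

def proxy_block_importance(q, k, block_size=64, n_proxy=2):
    """Sketch: pool heads into proxies, then score key blocks per query token.

    q, k: (n_heads, seq_len, head_dim) query/key tensors.
    Returns (seq_len, n_blocks) importance scores shared by all heads.
    """
    n_heads, seq_len, d = q.shape
    group = n_heads // n_proxy
    # Mean-pool groups of heads into a few representative proxy heads.
    q_proxy = q.reshape(n_proxy, group, seq_len, d).mean(axis=1)
    k_proxy = k.reshape(n_proxy, group, seq_len, d).mean(axis=1)
    # Attention of the proxy heads only: n_proxy score maps instead of n_heads.
    scores = np.einsum("hqd,hkd->hqk", q_proxy, k_proxy) / np.sqrt(d)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    # Aggregate token-level scores into key blocks, then max over proxies.
    n_blocks = seq_len // block_size
    block = probs[:, :, : n_blocks * block_size].reshape(
        n_proxy, seq_len, n_blocks, block_size
    ).sum(-1)
    return block.max(axis=0)
```

The cost saving comes from computing only `n_proxy` score maps while retaining token-level granularity inside each block, which coarse block-pooling along the sequence dimension loses.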
Block-aware dynamic budget estimation method

The authors develop an online method to dynamically allocate different sparsity budgets to individual attention heads based on their varying sparsity characteristics. This allows diverse sparse attention patterns across heads while using unified importance scores from proxy heads.

10 retrieved papers (1 can refute)
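One plausible reading of a block-aware dynamic budget is to give each head the fewest top-scoring blocks whose cumulative score mass reaches a coverage threshold, so sparser heads receive smaller budgets. The sketch below implements that reading in NumPy; the 0.95 threshold and the name `dynamic_block_budget` are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def dynamic_block_budget(block_scores, coverage=0.95):
    """Sketch: per-head block budgets from score concentration.

    block_scores: (n_heads, n_blocks) nonnegative importance per head.
    Returns one integer budget per head: the fewest top blocks whose
    cumulative normalized score reaches `coverage`.
    """
    norm = block_scores / block_scores.sum(axis=1, keepdims=True)
    sorted_scores = np.sort(norm, axis=1)[:, ::-1]  # descending per head
    cum = np.cumsum(sorted_scores, axis=1)
    # First position where cumulative mass crosses the coverage threshold.
    return (cum < coverage).sum(axis=1) + 1
```

A head whose mass sits in one block gets a budget of 1, while a near-uniform head is assigned almost all blocks, matching the report's point that heads differ mainly in sparsity level.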
Observation of attention head similarity in long contexts

The authors empirically observe and establish that multiple attention heads exhibit consistency in which tokens they focus on in long-context scenarios, with the primary difference being their sparsity levels rather than token focus. This observation forms the theoretical foundation for using proxy heads.

3 retrieved papers
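The similarity observation can be probed with a simple diagnostic: compare which keys each head ranks highest for a given query. The sketch below measures mean pairwise Jaccard overlap of per-head top-k attended tokens; it is an assumed diagnostic for illustration, not the paper's measurement protocol.

```python
import numpy as np

def head_topk_overlap(attn, k=8):
    """Sketch: pairwise overlap of each head's top-k attended keys.

    attn: (n_heads, seq_len) attention weights one query assigns to each
    key, one row per head. Returns the mean Jaccard overlap across head
    pairs; values near 1 mean heads agree on *which* tokens matter even
    if their score concentration (sparsity) differs.
    """
    n_heads = attn.shape[0]
    topk = [set(np.argsort(row)[-k:]) for row in attn]
    overlaps = [
        len(topk[i] & topk[j]) / len(topk[i] | topk[j])
        for i in range(n_heads) for j in range(i + 1, n_heads)
    ]
    return float(np.mean(overlaps))
```

High overlap under this metric is exactly what would justify proxy heads: if heads select the same tokens, one pooled score map plus per-head budgets can stand in for all heads.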

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: ProxyAttn sparse attention algorithm

Contribution: Block-aware dynamic budget estimation method

Contribution: Observation of attention head similarity in long contexts