FASA: FREQUENCY-AWARE SPARSE ATTENTION

ICLR 2026 Conference Submission — Anonymous Authors
Functional sparsity of FC; KV cache
Abstract:

The deployment of Large Language Models (LLMs) faces a critical bottleneck when handling lengthy inputs: the prohibitive memory footprint of the Key-Value (KV) cache. To address this bottleneck, the token pruning paradigm leverages attention sparsity to selectively retain a small, critical subset of tokens. However, existing approaches fall short: static methods risk irreversible information loss, and dynamic strategies employ heuristics that insufficiently capture the query-dependent nature of token importance. We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance. FASA stems from a novel insight into RoPE: the discovery of functional sparsity at the frequency-chunk (FC) level. Our key finding is that a small, identifiable subset of "dominant" FCs consistently exhibits high contextual agreement with the full attention head. This provides a robust and computationally free proxy for identifying salient tokens. Building on this insight, FASA first identifies a critical set of tokens using dominant FCs, and then performs focused attention computation solely on this pruned subset. Across a spectrum of long-context tasks, from sequence modeling to complex CoT reasoning, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy, demonstrating remarkable robustness even under constrained budgets. Notably, on LongBench-V1, FASA reaches nearly 100% of full-KV performance while keeping only 256 tokens, and achieves a 2.56× speedup using just 18.9% of the cache on AIME24.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes FASA, a framework for query-aware token importance prediction in KV cache compression, leveraging frequency-domain analysis of RoPE positional encodings. It resides in the 'Frequency-Domain and Structural Attention Analysis' leaf, which contains only two papers including FASA itself. This leaf sits within the broader 'Attention-Based Importance Prediction' branch, which encompasses five papers across three leaves. The sparse population of this specific leaf suggests that frequency-domain approaches to attention analysis for KV cache compression represent a relatively unexplored research direction within the field.

The taxonomy reveals that FASA's parent branch, 'Attention-Based Importance Prediction', neighbors 'Alternative Importance Signals' (reconstruction-based and learned predictors) and sits within the larger 'Query-Aware Dynamic Token Selection' category. Sibling leaves include 'Direct Attention Score Utilization' (five papers) and 'Temporal Attention Pattern Modeling' (two papers), which analyze attention in time-domain or sequential contexts. The taxonomy's scope notes clarify that frequency-domain methods like FASA are explicitly distinguished from time-domain attention analysis, positioning the work at a structural intersection between attention mechanisms and positional encoding properties.

Among thirty candidates examined, the contribution-level analysis reveals mixed novelty signals. The discovery of functional sparsity at frequency-chunk level examined ten candidates with zero refutations, suggesting this insight may be relatively novel within the limited search scope. However, the FASA framework itself examined ten candidates and found one refutable overlap, indicating some prior work addresses similar query-aware prediction goals. The specialized variants contribution also examined ten candidates without refutation. These statistics reflect a focused literature search rather than exhaustive coverage, leaving open the possibility of additional relevant work beyond the top-thirty semantic matches.

Based on the limited search scope of thirty candidates, FASA appears to occupy a sparsely populated research direction within frequency-domain attention analysis, though the framework's core prediction mechanism shows some overlap with existing query-aware methods. The frequency-chunk insight seems less anticipated by prior work, but the analysis cannot rule out relevant contributions outside the examined candidate set. The taxonomy structure suggests this work bridges positional encoding theory and practical cache compression in a relatively underexplored way.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
30 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: query-aware token importance prediction for KV cache compression. The field addresses memory bottlenecks in large language models by selectively retaining or compressing key-value cache entries during inference.

The taxonomy reveals a rich landscape organized around several complementary strategies. Query-Aware Dynamic Token Selection methods, such as those employing attention-based importance prediction, adapt cache decisions to the current query context, enabling fine-grained control over which tokens matter most. In contrast, Query-Agnostic Compression and Static Heuristic-Based Selection rely on predetermined rules or structural patterns, trading adaptability for simplicity. Hybrid Compression Strategies and Quantization-Based Compression combine multiple techniques to balance memory savings with accuracy, while Layer-Adaptive and Head-Specific Allocation tailor compression policies to different model components. Emerging directions include Token Eviction and Retrieval mechanisms, Low-Rank Approximation, and Domain-Specific Compression for specialized tasks, alongside Hardware-Aware and System-Level Optimization that co-design algorithms with deployment constraints. Survey and Benchmark Studies provide empirical grounding across these diverse approaches.

A particularly active line of work centers on predicting token importance dynamically using attention patterns. FASA[0] exemplifies this by analyzing frequency-domain and structural attention characteristics to identify which tokens contribute most to future queries, situating itself within the Attention-Based Importance Prediction branch. This contrasts with simpler eviction heuristics like Scissorhands[1] or static policies, and complements recent query-aware systems such as Quest[2] and TokenButler[3], which also adapt cache decisions on-the-fly but may emphasize different scoring mechanisms or integration with retrieval. Nearby works like Keyformer[4] explore alternative attention architectures, while system-level surveys such as System-Aware KV Survey[5] and benchmarks like KV Compression Benchmark[8] highlight trade-offs between compression ratio, latency, and task-specific performance. The central tension across these branches involves balancing the overhead of dynamic prediction against the memory and quality gains it enables, with FASA[0] contributing a frequency-domain lens to this ongoing exploration.

Claimed Contributions

Discovery of functional sparsity at frequency-chunk level in RoPE

The authors identify that within Rotary Positional Encodings (RoPE), a small subset of frequency chunks (FCs), termed dominant FCs, consistently exhibits high contextual agreement with the full attention head, while the remaining FCs primarily encode positional patterns. This functional specialization is shown to be sparse, universal across architectures, and task-agnostic.

10 retrieved papers
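The agreement measurement underlying this claim can be illustrated with a minimal sketch: score each frequency chunk of a RoPE'd query/key pair by how much its chunk-only attention ranking overlaps with the full head's ranking. The function name, chunk size, and top-k overlap metric below are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def chunk_agreement(q, k, chunk_size=8, top=16):
    """Score each frequency chunk (a contiguous slice of head
    dimensions) by the top-`top` overlap between attention ranked
    with that chunk alone and attention ranked with the full head.
    Hypothetical sketch of the dominant-FC diagnostic."""
    n, d = k.shape
    full_top = np.argsort(q @ k.T)[-top:]          # top tokens, full scores
    scores = []
    for start in range(0, d, chunk_size):
        sub = q[start:start + chunk_size] @ k[:, start:start + chunk_size].T
        chunk_top = np.argsort(sub)[-top:]
        overlap = len(np.intersect1d(full_top, chunk_top)) / top
        scores.append(overlap)
    return np.array(scores)                        # one score per FC
```

Under the paper's finding, a few chunks would score near 1.0 (dominant FCs) while most score much lower.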
FASA framework for training-free query-aware token importance prediction

FASA is a two-stage framework that first uses dominant frequency chunks to predict token importance (Token Importance Prediction stage), then performs focused attention computation on the selected critical tokens (Focused Attention Computation stage). This approach achieves query-aware token eviction without requiring costly training.

10 retrieved papers
Can Refute
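The two-stage decode step described above can be sketched as follows. Only the dominant frequency-chunk dimensions are used to rank tokens cheaply (Stage 1); full softmax attention then runs over the selected subset (Stage 2). The function name, the `dominant` index array, and the scoring details are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def fasa_decode_step(q, K, V, dominant, budget=256):
    """Stage 1: rank tokens by a cheap proxy score using only the
    dominant-FC dimensions of q and K. Stage 2: compute full
    attention restricted to the top-`budget` tokens.
    Illustrative sketch only."""
    approx = q[dominant] @ K[:, dominant].T          # importance proxy
    keep = np.argsort(approx)[-min(budget, K.shape[0]):]
    logits = (q @ K[keep].T) / np.sqrt(K.shape[1])   # focused attention
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ V[keep]
```

Because Stage 1 touches only a small slice of the key cache, the proxy scoring is far cheaper than a full attention pass, which is where the training-free speedup comes from.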
Two specialized variants of FASA optimized for different constraints

The authors develop FASA-M, which minimizes GPU memory footprint by offloading the value cache and non-dominant key components to CPU memory, and FASA-C, which prioritizes inference speed by retaining the full cache on the GPU while accessing only a sparse subset. These variants offer different efficiency profiles while maintaining equivalent downstream task performance.

10 retrieved papers
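The FASA-M memory layout described above can be sketched as a split cache: the dominant key dimensions stay resident in fast (GPU) memory for Stage-1 scoring, while the value cache and the non-dominant key dimensions live in slow (CPU) memory and are fetched only for the tokens the importance stage selects. The class and attribute names are hypothetical; plain arrays stand in for device tensors.

```python
import numpy as np

class FasaMCache:
    """Sketch of the FASA-M split layout. `fast` holds what must stay
    GPU-resident (dominant key dims); `slow` holds the offloaded value
    cache and remaining key dims, gathered on demand."""
    def __init__(self, K, V, dominant):
        self.dominant = np.asarray(dominant)
        self.fast = {"k_dom": K[:, self.dominant]}              # always resident
        self.slow = {"k_rest": np.delete(K, self.dominant, axis=1),
                     "v": V}                                    # offloaded

    def fetch_selected(self, idx):
        # Simulates transferring only the pruned subset back to the GPU.
        return self.slow["k_rest"][idx], self.slow["v"][idx]
```

FASA-C would instead keep `K` and `V` whole on the GPU and use `idx` only to restrict computation, trading memory savings for transfer-free speed.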

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Discovery of functional sparsity at frequency-chunk level in RoPE

The authors identify that within Rotary Positional Encodings (RoPE), a small subset of frequency chunks (FCs), termed dominant FCs, consistently exhibits high contextual agreement with the full attention head, while the remaining FCs primarily encode positional patterns. This functional specialization is shown to be sparse, universal across architectures, and task-agnostic.

Contribution

FASA framework for training-free query-aware token importance prediction

FASA is a two-stage framework that first uses dominant frequency chunks to predict token importance (Token Importance Prediction stage), then performs focused attention computation on the selected critical tokens (Focused Attention Computation stage). This approach achieves query-aware token eviction without requiring costly training.

Contribution

Two specialized variants of FASA optimized for different constraints

The authors develop FASA-M, which minimizes GPU memory footprint by offloading the value cache and non-dominant key components to CPU memory, and FASA-C, which prioritizes inference speed by retaining the full cache on the GPU while accessing only a sparse subset. These variants offer different efficiency profiles while maintaining equivalent downstream task performance.