QuoKA: Query-Oriented KV Selection for Efficient LLM Prefill
Overview
Overall Novelty Assessment
QuoKA contributes a training-free sparse attention algorithm that selects key-value pairs based on query cosine dissimilarity during chunked prefill. The paper resides in the 'Query-Key Similarity-Based Selection' leaf, which contains only three papers in total, indicating a focused but not overcrowded research direction. This leaf sits within 'Content-Based Dynamic Sparsity Prediction', part of the broader 'Dynamic Sparse Attention Pattern Discovery' branch. The small sibling count suggests that this specific approach, prioritizing queries with low cosine similarity to the mean query, occupies a moderately explored niche rather than a saturated subfield.
The taxonomy tree reveals that query-key similarity-based selection is one of three parallel approaches under content-based dynamic sparsity prediction, alongside learnable sparse attention routing and hierarchical multi-stage sparsity methods. Neighboring leaves include block importance estimation techniques that operate at coarser granularity and context-adaptive methods that adjust sparsity budgets dynamically. QuoKA's focus on fine-grained query selection distinguishes it from block-level scoring methods like antidiagonal estimation while sharing the runtime adaptivity characteristic of context-adaptive approaches. The exclude notes clarify that QuoKA's training-free, content-driven selection differs fundamentally from static pattern methods or learnable routing strategies.
Among thirty candidates examined, the analysis found two papers that can refute the first contribution (query-oriented KV selection), while the second and third contributions (cosine dissimilarity observation and three-stage framework) showed no clear refutations across ten candidates each. The limited scope means these statistics reflect top-K semantic matches and citation expansion, not exhaustive coverage. The refutable candidates for the core contribution suggest that within this restricted search, some prior work addresses similar query-based selection mechanisms, though the specific cosine dissimilarity heuristic and three-stage design appear less directly anticipated by examined literature.
Based on the limited search scope of thirty candidates, QuoKA appears to introduce specific technical choices, particularly the cosine dissimilarity criterion for query prioritization, that are not clearly prefigured in the examined subset. However, the core idea of query-oriented KV selection has precedent in the two refutable candidates identified. The analysis cannot determine whether a broader literature search would reveal additional overlapping work, especially given the relatively small sibling count in this taxonomy leaf. The novelty assessment therefore remains provisional pending more comprehensive coverage.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce QuoKA, a training-free and hardware-agnostic sparse attention method designed specifically for chunked prefill. It accelerates attention by first retaining a small set of representative queries chosen by cosine dissimilarity to the mean query, then subselecting the keys most aligned with those queries via cosine-similarity scoring.
The authors observe and leverage the geometric property that queries with lower cosine similarity to the mean query interact strongly with a broader set of keys and contribute most to the final attention logits. This observation motivates prioritizing such queries to approximate full-attention behavior during prefill.
The authors develop a three-stage framework consisting of query subselection, cosine-similarity scoring for stable relevance estimation, and group-aware aggregation that maintains compatibility with grouped-query attention architectures while reducing computational cost through pre-aggregation.
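Taken together, the contributions describe a selection pipeline: keep the queries least cosine-similar to the mean query, score all keys against the kept queries, and restrict attention to the top-scoring KV pairs. The sketch below is a minimal NumPy illustration of that pipeline under stated assumptions; the function name, keep ratios, and max-aggregation over query scores are hypothetical choices, not the authors' implementation.

```python
import numpy as np

def quoka_select_kv(Q, K, q_keep=0.25, k_keep=0.25):
    """Sketch of query-oriented KV selection (illustrative, not the paper's code).

    Q: (n_queries, d) query vectors for one chunk; K: (n_keys, d) cached keys.
    Returns sorted indices of the keys to keep.
    """
    def l2norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    Qn, Kn = l2norm(Q), l2norm(K)
    q_mean = l2norm(Q.mean(axis=0, keepdims=True))    # (1, d) mean query

    # Stage 1: keep the queries *least* cosine-similar to the mean query
    # (per the paper's observation, they spread attention over more keys).
    sim_to_mean = (Qn @ q_mean.T).squeeze(-1)         # (n_queries,)
    n_q = max(1, int(q_keep * len(Q)))
    kept_q = np.argsort(sim_to_mean)[:n_q]            # least similar first

    # Stage 2: cosine-similarity scores between kept queries and all keys.
    scores = Qn[kept_q] @ Kn.T                        # (n_q, n_keys)

    # Stage 3: aggregate to one score per key and keep the top keys
    # (max over kept queries is an assumed aggregation rule).
    key_score = scores.max(axis=0)
    n_k = max(1, int(k_keep * len(K)))
    return np.sort(np.argsort(key_score)[-n_k:])
```

Downstream, attention for the chunk would then run only over `K[idx]` and the matching value rows, which is where the prefill speedup would come from in this sketch.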
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
[18] ProxyAttn: Guided Sparse Attention via Representative Heads
Contribution Analysis
Detailed comparisons for each claimed contribution
QuoKA: Query-oriented KV selection for efficient attention
The authors introduce QuoKA, a training-free and hardware-agnostic sparse attention method designed specifically for chunked prefill. It accelerates attention by first retaining a small set of representative queries chosen by cosine dissimilarity to the mean query, then subselecting the keys most aligned with those queries via cosine-similarity scoring.
[3] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
[62] SampleAttention: Near-Lossless Acceleration of Long-Context LLM Inference with Adaptive Structured Sparse Attention
[1] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
[10] XAttention: Block Sparse Attention with Antidiagonal Scoring
[11] Dynamic Sparse Attention for Scalable Transformer Acceleration
[60] Qwen2.5-1M Technical Report
[61] SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
[63] A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-Based Context Compression
[64] Energon: Toward Efficient Acceleration of Transformers Using Dynamic Sparse Attention
[65] SparseAccelerate: Efficient Long-Context Inference for Mid-Range GPUs
Query subselection based on cosine dissimilarity observation
The authors observe and leverage the geometric property that queries with lower cosine similarity to the mean query interact strongly with a broader set of keys and contribute most to the final attention logits. This observation motivates prioritizing such queries to approximate full-attention behavior during prefill.
[66] Is Cosine-Similarity of Embeddings Really About Similarity?
[67] Dissecting Query-Key Interaction in Vision Transformers
[68] CosPoint Transformer: Enhancing 3D Semantic Segmentation with Cosine Similarity Attention and Cross-Attention
[69] Enhancing Query Relevance: Leveraging SBERT and Cosine Similarity for Optimal Information Retrieval
[70] A Wind Power Forecasting Model Based on Data Decomposition and Cross-Attention Mechanism with Cosine Similarity
[71] Cottention: Linear Transformers with Cosine Attention
[72] An Optimized Cosine Similarity-Based Attention Gate for Temporal Sequence Patterns Recognition
[73] Improved Organs at Risk Segmentation Based on Modified U-Net with Self-Attention and Consistency Regularisation
[74] Anisotropy Is Inherent to Self-Attention in Transformers
[75] NaLaFormer: Norm-Aware Linear Attention for Transformer Models
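The observation behind this contribution reduces to ranking queries by cosine similarity to the mean query and keeping the least similar ones. A small NumPy sketch under that assumption follows; the helper name and keep ratio are hypothetical, not from the paper.

```python
import numpy as np

def low_similarity_queries(Q, keep_ratio=0.25):
    """Return indices of the queries least cosine-similar to the mean query."""
    # Normalize the mean query and every individual query.
    q_mean = Q.mean(axis=0)
    q_mean = q_mean / (np.linalg.norm(q_mean) + 1e-8)
    Qn = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + 1e-8)

    sim = Qn @ q_mean                        # cosine similarity per query
    n = max(1, int(keep_ratio * len(Q)))
    return np.argsort(sim)[:n]               # least-similar queries first
```

On a toy batch where seven queries point in one direction and a single outlier points the opposite way, the outlier is ranked first, matching the intuition that dissimilar queries are the ones worth keeping.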
Three-stage KV selection framework with group-aware aggregation
The authors develop a three-stage framework consisting of query subselection, cosine-similarity scoring for stable relevance estimation, and group-aware aggregation that maintains compatibility with grouped-query attention architectures while reducing computational cost through pre-aggregation.
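The group-aware aggregation stage can be pictured as pre-averaging the query heads that share a KV head in grouped-query attention, so each KV head scores every key once rather than once per query head. The shapes, mean pre-aggregation, and max reduction in this NumPy sketch are illustrative assumptions, not the authors' exact design.

```python
import numpy as np

def group_aware_key_scores(Q, K, n_kv_heads):
    """Per-key cosine scores shared across each GQA group (illustrative sketch).

    Q: (n_q_heads, n_queries, d) per-head queries.
    K: (n_kv_heads, n_keys, d) per-KV-head keys.
    Returns: (n_kv_heads, n_keys) one score per key per group.
    """
    def l2norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    n_q_heads = Q.shape[0]
    group = n_q_heads // n_kv_heads
    # Pre-aggregation: average the query heads that share one KV head,
    # shrinking the scoring work by the group factor.
    Qg = Q.reshape(n_kv_heads, group, Q.shape[1], Q.shape[2]).mean(axis=1)

    # Cosine-similarity scores, then max over queries -> one score per key.
    scores = np.einsum('hqd,hkd->hqk', l2norm(Qg), l2norm(K))
    return scores.max(axis=1)
```

Because every query head in a group sees the same per-key scores, the selected keys are shared group-wide, which is what keeps the selection compatible with a single KV cache per group.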