QuoKA: Query-Oriented KV Selection for Efficient LLM Prefill

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Efficient LLM Inference, LLM Prefill Acceleration, Sparse Attention, KV Cache Subselection, Training-Free
Abstract:

We present QuoKA (Query-oriented KV selection for efficient Attention), a training-free and hardware-agnostic sparse attention algorithm for accelerating transformer inference under chunked prefill. Although most queries concentrate their attention on a small group of keys, we observe that queries with low cosine similarity to the mean query interact strongly with a larger set of keys and contribute most to the final attention logits. By prioritizing these low-similarity queries, the behavior of full attention during the prefill stage can be closely approximated. QuoKA leverages this observation, accelerating attention by (1) retaining a small set of representative queries and (2) subselecting the keys most aligned with those queries. In experiments on Needle-In-A-Haystack, LongBench, RULER, and Math500, QuoKA achieves near-baseline accuracy while using 88% fewer key-value pairs per attention evaluation, realizing a 3× reduction in time-to-first-token, a 5× attention speedup on NVIDIA GPUs, and up to nearly a 7× speedup on Intel Xeon CPUs.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

QuoKA contributes a training-free sparse attention algorithm that selects key-value pairs based on query cosine dissimilarity during chunked prefill. The paper resides in the 'Query-Key Similarity-Based Selection' leaf, which contains only three papers total, indicating a relatively focused but not overcrowded research direction. This leaf sits within 'Content-Based Dynamic Sparsity Prediction,' part of the broader 'Dynamic Sparse Attention Pattern Discovery' branch. The small sibling count suggests this specific approach—prioritizing queries with low cosine similarity to the mean query—occupies a moderately explored niche rather than a saturated subfield.

The taxonomy tree reveals that query-key similarity-based selection is one of three parallel approaches under content-based dynamic sparsity prediction, alongside learnable sparse attention routing and hierarchical multi-stage sparsity methods. Neighboring leaves include block importance estimation techniques that operate at coarser granularity and context-adaptive methods that adjust sparsity budgets dynamically. QuoKA's focus on fine-grained query selection distinguishes it from block-level scoring methods like antidiagonal estimation while sharing the runtime adaptivity characteristic of context-adaptive approaches. The exclude notes clarify that QuoKA's training-free, content-driven selection differs fundamentally from static pattern methods or learnable routing strategies.

Among thirty candidates examined, the analysis found two papers that can refute the first contribution (query-oriented KV selection), while the second and third contributions (cosine dissimilarity observation and three-stage framework) showed no clear refutations across ten candidates each. The limited scope means these statistics reflect top-K semantic matches and citation expansion, not exhaustive coverage. The refutable candidates for the core contribution suggest that within this restricted search, some prior work addresses similar query-based selection mechanisms, though the specific cosine dissimilarity heuristic and three-stage design appear less directly anticipated by examined literature.

Based on the limited search scope of thirty candidates, QuoKA appears to introduce specific technical choices—particularly the cosine dissimilarity criterion for query prioritization—that are not clearly prefigured in the examined subset. However, the core idea of query-oriented KV selection has precedent among the two refutable candidates identified. The analysis cannot determine whether a broader literature search would reveal additional overlapping work, especially given the relatively small sibling count in this taxonomy leaf. The novelty assessment remains provisional pending more comprehensive coverage.

Taxonomy

49 Core-task Taxonomy Papers
3 Claimed Contributions
30 Contribution Candidate Papers Compared
2 Refutable Papers

Research Landscape Overview

Core task: efficient sparse attention for transformer prefill acceleration. The field has organized itself around several complementary strategies for reducing the quadratic cost of attention during prefill. Dynamic sparse attention pattern discovery methods adaptively identify which tokens to attend to based on runtime information, often through content-based prediction or query-key similarity metrics. Static sparse attention mechanisms impose fixed sparsity patterns such as block-diagonal or strided structures, while hybrid dense-sparse architectures blend full and sparse computation selectively. Parallel branches focus on computational optimizations such as kernel fusion and algorithmic improvements, hardware-accelerated implementations targeting GPUs or specialized accelerators, and domain-specific adaptations for vision or multimodal models. Additional lines of work address KV cache optimization to reduce memory overhead, training-free methods that apply sparsity without retraining, and theoretical foundations that analyze approximation quality and convergence guarantees.

Within dynamic sparse attention discovery, a particularly active cluster revolves around content-based sparsity prediction using query-key similarity. QuoKA[0] exemplifies this approach by selecting attention targets based on predicted relevance scores derived from query-key interactions, aiming to preserve accuracy while drastically reducing computation. Closely related works such as MInference[1] and ProxyAttn[18] similarly exploit similarity-based heuristics to prune less important tokens dynamically, though they differ in how aggressively they filter and in whether they incorporate auxiliary structures such as proxy tokens. A central trade-off across these methods is balancing the overhead of computing the selection criteria against the savings from reduced attention operations, with approaches such as FlexPrefill[3] exploring adaptive granularity to optimize this balance.
QuoKA[0] sits squarely in this query-key similarity-based selection cluster, sharing the core philosophy of runtime adaptivity with neighbors like ProxyAttn[18] but potentially differing in its specific scoring mechanism or integration with prefill pipelines.

Claimed Contributions

QuoKA: Query-oriented KV selection for efficient attention

The authors introduce QuoKA, a training-free and hardware-agnostic sparse attention method designed specifically for chunked prefill. It accelerates attention by first retaining a small set of representative queries based on cosine dissimilarity to the mean query, then subselecting the keys most aligned with those queries using cosine-similarity scoring.

10 retrieved papers (verdict: Can Refute)
Query subselection based on cosine dissimilarity observation

The authors observe and leverage the geometric property that queries with lower cosine similarity to the mean query interact more strongly with more keys and contribute most to final attention logits. This observation motivates prioritizing such queries to approximate full attention behavior during prefill.

10 retrieved papers
Three-stage KV selection framework with group-aware aggregation

The authors develop a three-stage framework consisting of query subselection, cosine-similarity scoring for stable relevance estimation, and group-aware aggregation that maintains compatibility with grouped-query attention architectures while reducing computational cost through pre-aggregation.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

QuoKA: Query-oriented KV selection for efficient attention

The authors introduce QuoKA, a training-free and hardware-agnostic sparse attention method designed specifically for chunked prefill. It accelerates attention by first retaining a small set of representative queries based on cosine dissimilarity to the mean query, then subselecting the keys most aligned with those queries using cosine-similarity scoring.
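The two-step selection described above can be sketched in NumPy as follows. This is an illustrative reconstruction from the description, not the authors' implementation: the function name, keep ratios (0.12 roughly matching the reported 88% KV reduction), and max-pooling of key scores over the retained queries are all assumptions.

```python
import numpy as np

def quoka_select(Q, K, q_keep=0.25, kv_keep=0.12):
    """Two-step selection sketch: keep the queries least similar to the
    mean query, then keep the keys best aligned with those queries."""
    # Step 1: query subselection by cosine dissimilarity to the mean query.
    q_mean = Q.mean(axis=0)
    q_cos = (Q @ q_mean) / (np.linalg.norm(Q, axis=1) * np.linalg.norm(q_mean) + 1e-8)
    n_q = max(1, int(q_keep * len(Q)))
    rep_idx = np.argsort(q_cos)[:n_q]  # lowest cosine similarity first
    # Step 2: key subselection by cosine similarity to the retained queries.
    Qn = Q[rep_idx] / (np.linalg.norm(Q[rep_idx], axis=1, keepdims=True) + 1e-8)
    Kn = K / (np.linalg.norm(K, axis=1, keepdims=True) + 1e-8)
    key_scores = (Qn @ Kn.T).max(axis=0)  # best alignment with any kept query
    n_k = max(1, int(kv_keep * len(K)))
    kv_idx = np.argsort(key_scores)[-n_k:]  # highest-scoring keys
    return rep_idx, kv_idx
```

Attention would then be evaluated only over `K[kv_idx]` (and the matching values), which is where the claimed compute savings come from.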

Contribution

Query subselection based on cosine dissimilarity observation

The authors observe and leverage the geometric property that queries with lower cosine similarity to the mean query interact more strongly with more keys and contribute most to final attention logits. This observation motivates prioritizing such queries to approximate full attention behavior during prefill.
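One way to probe this observation empirically is to compare each query's cosine similarity to the mean query against its attention entropy, a proxy for how many keys it interacts with. The sketch below is a hypothetical diagnostic, not from the paper; on the paper's claim, low-similarity queries should exhibit higher entropy (no such correlation is expected on random inputs).

```python
import numpy as np

def probe_query_dispersion(Q, K):
    """For each query row, return (cosine similarity to the mean query,
    attention entropy over keys) so the claimed correlation can be inspected."""
    q_mean = Q.mean(axis=0)
    cos = (Q @ q_mean) / (np.linalg.norm(Q, axis=1) * np.linalg.norm(q_mean) + 1e-8)
    # Standard scaled-dot-product attention weights.
    logits = Q @ K.T / np.sqrt(Q.shape[1])
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    # Higher entropy = attention mass spread over more keys.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return cos, entropy
```

Sorting queries by ascending `cos` and checking that `entropy` tends to decrease along that order would test the geometric property on real model activations.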

Contribution

Three-stage KV selection framework with group-aware aggregation

The authors develop a three-stage framework consisting of query subselection, cosine-similarity scoring for stable relevance estimation, and group-aware aggregation that maintains compatibility with grouped-query attention architectures while reducing computational cost through pre-aggregation.
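The group-aware aggregation stage can be illustrated for grouped-query attention (GQA) as follows: query heads sharing a KV head are pooled before key scoring, so selection cost scales with the number of KV heads rather than query heads. Mean-pooling within a group and max-pooling over query positions are assumptions here; the paper's exact aggregation operator is not specified in this summary.

```python
import numpy as np

def group_aggregated_key_scores(Q, K, n_kv_heads):
    """Q: (n_q_heads, L, d) queries; K: (n_kv_heads, L, d) keys.
    Returns one cosine relevance score per (KV head, key position)."""
    n_q_heads, L, d = Q.shape
    group = n_q_heads // n_kv_heads
    # Pre-aggregation: mean-pool the query heads in each GQA group,
    # so each KV head is scored once rather than `group` times.
    Qg = Q.reshape(n_kv_heads, group, L, d).mean(axis=1)
    Qn = Qg / (np.linalg.norm(Qg, axis=-1, keepdims=True) + 1e-8)
    Kn = K / (np.linalg.norm(K, axis=-1, keepdims=True) + 1e-8)
    # Cosine similarity per (kv head, query pos, key pos), then max over
    # query positions to get a per-key relevance score.
    scores = np.einsum('hqd,hkd->hqk', Qn, Kn).max(axis=1)
    return scores  # shape (n_kv_heads, L)
```

Top-k selection on `scores` per KV head would then pick the retained key-value pairs while remaining compatible with the shared KV layout of GQA models.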