QuoKA: Query-Oriented KV Selection for Efficient LLM Prefill
Overview
Overall Novelty Assessment
QuoKA contributes a training-free sparse attention algorithm that selects key-value pairs based on query cosine dissimilarity during chunked prefill. The paper resides in the 'Query-Key Similarity-Based Selection' leaf, which contains only three papers in total, indicating a focused but not overcrowded research direction. This leaf sits within 'Content-Based Dynamic Sparsity Prediction', part of the broader 'Dynamic Sparse Attention Pattern Discovery' branch. The small sibling count suggests that this specific approach, prioritizing queries with low cosine similarity to the mean query, occupies a moderately explored niche rather than a saturated subfield.
The taxonomy tree reveals that query-key similarity-based selection is one of three parallel approaches under content-based dynamic sparsity prediction, alongside learnable sparse attention routing and hierarchical multi-stage sparsity methods. Neighboring leaves include block importance estimation techniques that operate at coarser granularity and context-adaptive methods that adjust sparsity budgets dynamically. QuoKA's focus on fine-grained query selection distinguishes it from block-level scoring methods like antidiagonal estimation while sharing the runtime adaptivity characteristic of context-adaptive approaches. The exclude notes clarify that QuoKA's training-free, content-driven selection differs fundamentally from static pattern methods or learnable routing strategies.
Among thirty candidates examined, the analysis found two papers that can refute the first contribution (query-oriented KV selection), while the second and third contributions (cosine dissimilarity observation and three-stage framework) showed no clear refutations across ten candidates each. The limited scope means these statistics reflect top-K semantic matches and citation expansion, not exhaustive coverage. The refutable candidates for the core contribution suggest that within this restricted search, some prior work addresses similar query-based selection mechanisms, though the specific cosine dissimilarity heuristic and three-stage design appear less directly anticipated by examined literature.
Based on the limited search scope of thirty candidates, QuoKA appears to introduce specific technical choices, particularly the cosine dissimilarity criterion for query prioritization, that are not clearly prefigured in the examined subset. However, the core idea of query-oriented KV selection has precedent in the two refutable candidates identified. The analysis cannot determine whether a broader literature search would reveal additional overlapping work, especially given the relatively small sibling count in this taxonomy leaf. The novelty assessment therefore remains provisional pending more comprehensive coverage.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce QuoKA, a training-free and hardware-agnostic sparse attention method designed specifically for chunked prefill. It accelerates attention by first retaining a small set of representative queries chosen by cosine dissimilarity to the mean query, then subselecting the keys most aligned with those queries via cosine-similarity scoring.
The authors observe and leverage the geometric property that queries with lower cosine similarity to the mean query interact strongly with a broader set of keys and contribute most to the final attention logits. This observation motivates prioritizing such queries to approximate full-attention behavior during prefill.
The authors develop a three-stage framework consisting of query subselection, cosine-similarity scoring for stable relevance estimation, and group-aware aggregation that maintains compatibility with grouped-query attention architectures while reducing computational cost through pre-aggregation.
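Taken together, the contributions describe a selection pipeline: keep the queries least cosine-similar to the mean query, score all keys against the kept queries, and restrict attention to the top-scoring KV pairs. The sketch below is a minimal NumPy illustration of that pipeline under stated assumptions; the function name, keep ratios, and max-aggregation over query scores are hypothetical choices, not the authors' implementation.

```python
import numpy as np

def quoka_select_kv(Q, K, q_keep=0.25, k_keep=0.25):
    """Sketch of query-oriented KV selection (illustrative, not the paper's code).

    Q: (n_queries, d) query vectors for one chunk; K: (n_keys, d) cached keys.
    Returns sorted indices of the keys to keep.
    """
    def l2norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    Qn, Kn = l2norm(Q), l2norm(K)
    q_mean = l2norm(Q.mean(axis=0, keepdims=True))    # (1, d) mean query

    # Stage 1: keep the queries *least* cosine-similar to the mean query
    # (per the paper's observation, they spread attention over more keys).
    sim_to_mean = (Qn @ q_mean.T).squeeze(-1)         # (n_queries,)
    n_q = max(1, int(q_keep * len(Q)))
    kept_q = np.argsort(sim_to_mean)[:n_q]            # least similar first

    # Stage 2: cosine-similarity scores between kept queries and all keys.
    scores = Qn[kept_q] @ Kn.T                        # (n_q, n_keys)

    # Stage 3: aggregate to one score per key and keep the top keys
    # (max over kept queries is an assumed aggregation rule).
    key_score = scores.max(axis=0)
    n_k = max(1, int(k_keep * len(K)))
    return np.sort(np.argsort(key_score)[-n_k:])
```

Downstream, attention for the chunk would then run only over `K[idx]` and the matching value rows, which is where the prefill speedup would come from in this sketch.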
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
[18] ProxyAttn: Guided Sparse Attention via Representative Heads
Contribution Analysis
Detailed comparisons for each claimed contribution
QuoKA: Query-oriented KV selection for efficient attention
The authors introduce QuoKA, a training-free and hardware-agnostic sparse attention method designed specifically for chunked prefill. It accelerates attention by first retaining a small set of representative queries chosen by cosine dissimilarity to the mean query, then subselecting the keys most aligned with those queries via cosine-similarity scoring.
[3] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
[62] SampleAttention: Near-Lossless Acceleration of Long-Context LLM Inference with Adaptive Structured Sparse Attention
[1] MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
[10] XAttention: Block Sparse Attention with Antidiagonal Scoring
[11] Dynamic Sparse Attention for Scalable Transformer Acceleration
[60] Qwen2.5-1M Technical Report
[61] SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
[63] A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-Based Context Compression
[64] Energon: Toward Efficient Acceleration of Transformers Using Dynamic Sparse Attention
[65] SparseAccelerate: Efficient Long-Context Inference for Mid-Range GPUs
Query subselection based on cosine dissimilarity observation
The authors observe and leverage the geometric property that queries with lower cosine similarity to the mean query interact strongly with a broader set of keys and contribute most to the final attention logits. This observation motivates prioritizing such queries to approximate full-attention behavior during prefill.
[66] Is Cosine-Similarity of Embeddings Really About Similarity?
[67] Dissecting Query-Key Interaction in Vision Transformers
[68] CosPoint Transformer: Enhancing 3D Semantic Segmentation with Cosine Similarity Attention and Cross-Attention
[69] Enhancing Query Relevance: Leveraging SBERT and Cosine Similarity for Optimal Information Retrieval
[70] A Wind Power Forecasting Model Based on Data Decomposition and Cross-Attention Mechanism with Cosine Similarity
[71] Cottention: Linear Transformers with Cosine Attention
[72] An Optimized Cosine Similarity-Based Attention Gate for Temporal Sequence Patterns Recognition
[73] Improved Organs at Risk Segmentation Based on Modified U-Net with Self-Attention and Consistency Regularisation
[74] Anisotropy Is Inherent to Self-Attention in Transformers
[75] NaLaFormer: Norm-Aware Linear Attention for Transformer Models
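The observation behind this contribution reduces to ranking queries by cosine similarity to the mean query and keeping the least similar ones. A small NumPy sketch under that assumption follows; the helper name and keep ratio are hypothetical, not from the paper.

```python
import numpy as np

def low_similarity_queries(Q, keep_ratio=0.25):
    """Return indices of the queries least cosine-similar to the mean query."""
    # Normalize the mean query and every individual query.
    q_mean = Q.mean(axis=0)
    q_mean = q_mean / (np.linalg.norm(q_mean) + 1e-8)
    Qn = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + 1e-8)

    sim = Qn @ q_mean                        # cosine similarity per query
    n = max(1, int(keep_ratio * len(Q)))
    return np.argsort(sim)[:n]               # least-similar queries first
```

On a toy batch where seven queries point in one direction and a single outlier points the opposite way, the outlier is ranked first, matching the intuition that dissimilar queries are the ones worth keeping.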
Three-stage KV selection framework with group-aware aggregation
The authors develop a three-stage framework consisting of query subselection, cosine-similarity scoring for stable relevance estimation, and group-aware aggregation that maintains compatibility with grouped-query attention architectures while reducing computational cost through pre-aggregation.
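The group-aware aggregation stage can be pictured as pre-averaging the query heads that share a KV head in grouped-query attention, so each KV head scores every key once rather than once per query head. The shapes, mean pre-aggregation, and max reduction in this NumPy sketch are illustrative assumptions, not the authors' exact design.

```python
import numpy as np

def group_aware_key_scores(Q, K, n_kv_heads):
    """Per-key cosine scores shared across each GQA group (illustrative sketch).

    Q: (n_q_heads, n_queries, d) per-head queries.
    K: (n_kv_heads, n_keys, d) per-KV-head keys.
    Returns: (n_kv_heads, n_keys) one score per key per group.
    """
    def l2norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    n_q_heads = Q.shape[0]
    group = n_q_heads // n_kv_heads
    # Pre-aggregation: average the query heads that share one KV head,
    # shrinking the scoring work by the group factor.
    Qg = Q.reshape(n_kv_heads, group, Q.shape[1], Q.shape[2]).mean(axis=1)

    # Cosine-similarity scores, then max over queries -> one score per key.
    scores = np.einsum('hqd,hkd->hqk', l2norm(Qg), l2norm(K))
    return scores.max(axis=1)
```

Because every query head in a group sees the same per-key scores, the selected keys are shared group-wide, which is what keeps the selection compatible with a single KV cache per group.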