Taming the Fragility of KV Cache Eviction in LLM Inference
Overview
Overall Novelty Assessment
The paper proposes a defensive aggregation strategy for KV cache eviction, challenging the 'stability assumption' that underlies mean-aggregation approaches in prior work. It sits within the 'Defensive and Robust Aggregation Strategies' leaf of the taxonomy, which contains only two papers, including this one. This leaf is a specialized subcategory under 'Attention-Based Importance Scoring', itself part of the broader 'Eviction Policy Design and Optimization' branch. The sparse population of this leaf suggests that robust aggregation for cache eviction is an emerging rather than crowded research direction, with most prior attention-based methods focusing on scoring refinements rather than aggregation robustness.
The taxonomy reveals that most neighboring work concentrates on scoring mechanisms: the sibling 'Heavy-Hitter and Sparsity-Driven Eviction' leaf contains three papers exploiting attention sparsity, while 'Adaptive and Layer-Specific Eviction Policies' addresses layer-wise budget allocation. The paper's focus on aggregation strategy diverges from these directions, which largely accept mean aggregation as default. Nearby branches explore geometric importance metrics and learned eviction models, but none explicitly address worst-case risk control in aggregation. The taxonomy's scope notes clarify that this work excludes compression via merging or quantization, positioning it squarely within eviction policy design rather than alternative memory reduction paradigms.
Among 29 candidates examined across three contributions, no clearly refuting prior work was identified. The 'defensive aggregation strategy' contribution examined 9 candidates with 0 refutations, 'DefensiveKV method' examined 10 with 0 refutations, and 'fragility identification' examined 10 with 0 refutations. This suggests that within the limited search scope, the specific framing of aggregation robustness and worst-case risk control appears relatively unexplored. However, the modest search scale means the analysis captures top semantic matches and immediate citations rather than exhaustive field coverage, leaving open the possibility of related work in adjacent communities or under different terminology.
Based on the limited literature search of 29 candidates, the work appears to occupy a sparsely populated niche within KV cache eviction research. The taxonomy structure and contribution-level statistics suggest novelty in addressing aggregation fragility, though the search scope does not cover the entire field. The single sibling paper in the same leaf and absence of refuting candidates among examined works support the impression of a relatively underexplored direction, contingent on the semantic search boundaries employed.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce a novel aggregation method for KV cache eviction that replaces conventional mean aggregation with a worst-case risk control framework. This two-step approach consists of worst-case risk estimation and adaptive prior-risk correction, designed to handle the fragility of the stability assumption underlying cache eviction methods.
The authors develop DefensiveKV, a cache eviction method that integrates their defensive aggregation strategy into the traditional eviction workflow. They further extend it to Layer-DefensiveKV by incorporating layer-wise budget allocation for joint selection of risky entries across layers.
The authors reveal that the stability assumption, under which a fixed subset of cache entries remains consistently important, is fragile and can break down as generation proceeds. They demonstrate that this fragility leaves the widely used mean-aggregation strategy vulnerable to worst-case performance degradation.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[7] On the Efficacy of Eviction Policy for Key-Value Constrained Generative Language Model Inference
Contribution Analysis
Detailed comparisons for each claimed contribution
Defensive aggregation strategy for KV cache eviction
The authors introduce a novel aggregation method for KV cache eviction that replaces conventional mean aggregation with a worst-case risk control framework. This two-step approach consists of worst-case risk estimation and adaptive prior-risk correction, designed to handle the fragility of the stability assumption underlying cache eviction methods.
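The paper's exact estimator is not reproduced in this summary; the sketch below is an illustrative assumption of how a worst-case risk estimate with a prior correction might replace mean aggregation. The function names, the max-based worst-case estimate, and the `prior_weight` blend are all hypothetical stand-ins for the two steps described above.

```python
import numpy as np

def mean_aggregation(attn):
    """Baseline: score each cached entry by its average attention
    over the recent query window (attn: [num_queries, num_entries])."""
    return attn.mean(axis=0)

def defensive_aggregation(attn, prior_weight=0.5):
    """Illustrative two-step defensive score:
    1) worst-case risk estimate: the highest attention any recent query
       paid to the entry (evicting it would hurt that query the most);
    2) prior-risk correction: blend with the mean as an adaptive prior."""
    worst_case = attn.max(axis=0)
    prior = attn.mean(axis=0)
    return (1 - prior_weight) * worst_case + prior_weight * prior

rng = np.random.default_rng(0)
attn = rng.dirichlet(np.ones(8), size=4)   # 4 recent queries over 8 cached entries
scores = defensive_aggregation(attn)
keep = np.argsort(scores)[-4:]             # retain the top-4 entries under the budget
```

Because the max never falls below the mean, the blended score upper-bounds the mean score, which is the sense in which this sketch is "defensive": entries that any single query depends on heavily are protected even if their average attention is low.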
[51] Toward Robust and Efficient ML-Based GPU Caching for Modern Inference
[52] Queue Management for SLO-Oriented Large Language Model Serving
[53] Evolution of Development Priorities in Key-Value Stores Serving Large-Scale Applications: The RocksDB Experience
[54] RelayGR: Scaling Long-Sequence Generative Recommendation via Cross-Stage Relay-Race Inference
[55] Adaptive KV-Cache Compression without Manually Setting Budget
[56] Towards Programmable and Adaptable Caches
[57] AUV: Efficient KV Cache Eviction for LLMs via Attention Score Aggregation and Usage Count
[58] Don't Discard, but Keep It Small: Context-Preserving KV Cache Compression with Importance-Aware Adaptive Precision
[59] Enhancing Cloud Key-Value Store Efficiency Through Hierarchical Cache Optimization
DefensiveKV cache eviction method
The authors develop DefensiveKV, a cache eviction method that integrates their defensive aggregation strategy into the traditional eviction workflow. They further extend it to Layer-DefensiveKV by incorporating layer-wise budget allocation for joint selection of risky entries across layers.
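The layer-wise allocation in Layer-DefensiveKV is described only at a high level here; a plausible reading of "joint selection of risky entries across layers" is to rank all (layer, entry) risk scores in one pool and keep the global top-B, so per-layer budgets emerge from the scores rather than being fixed. The sketch below assumes exactly that pooling scheme, and its names are illustrative.

```python
import numpy as np

def joint_layer_selection(layer_scores, total_budget):
    """Hedged sketch of layer-wise budget allocation: rank all (layer, entry)
    risk scores jointly and keep the global top `total_budget`, so layers
    whose caches carry more risk automatically receive larger shares."""
    flat = np.concatenate(layer_scores)
    offsets = np.cumsum([0] + [len(s) for s in layer_scores])
    keep_flat = np.argsort(flat)[-total_budget:]      # global top-B entries
    budgets, keep_per_layer = [], []
    for layer in range(len(layer_scores)):
        mask = (keep_flat >= offsets[layer]) & (keep_flat < offsets[layer + 1])
        idx = np.sort(keep_flat[mask] - offsets[layer])
        keep_per_layer.append(idx)
        budgets.append(len(idx))
    return budgets, keep_per_layer

scores = [np.array([0.9, 0.1, 0.2]),    # layer 0: one risky entry
          np.array([0.8, 0.7, 0.05])]   # layer 1: two risky entries
budgets, kept = joint_layer_selection(scores, total_budget=3)
```

With a uniform split each layer would keep the same number of entries regardless of risk; here joint selection hands layer 1 two of the three slots because two of its entries score high.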
[6] Cost-Efficient Large Language Model Serving for Multi-Turn Conversations with CachedAttention
[24] CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences
[27] XKV: Personalized KV Cache Memory Reduction for Long-Context LLM Inference
[69] Mobile Edge Intelligence for Large Language Models: A Contemporary Survey
[70] EpiCache: Episodic KV Cache Management for Long Conversational Question Answering
[71] Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving
[72] LAVa: Layer-wise KV Cache Eviction with Dynamic Budget Allocation
[73] Mask Tokens as Prophet: Fine-Grained Cache Eviction for Efficient dLLM Inference
[74] FlowMM: Cross-Modal Information Flow Guided KV Cache Merging for Efficient Multimodal Context Inference
[75] LLMCache: Layer-Wise Caching Strategies for Accelerated Reuse in Transformer Inference
Identification of fragility in the stability assumption
The authors reveal that the stability assumption, under which a fixed subset of cache entries remains consistently important, is fragile and can break down as generation proceeds. They demonstrate that this fragility leaves the widely used mean-aggregation strategy vulnerable to worst-case performance degradation.
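One simple diagnostic consistent with this fragility claim (not taken from the paper, and with hypothetical example values) is to measure how much the top-k "important" set of cache entries changes between generation steps: if the stability assumption held, the overlap would stay near 1.0 throughout decoding.

```python
import numpy as np

def topk_overlap(scores_a, scores_b, k):
    """Jaccard overlap between the top-k highest-scoring cache entries at two
    generation steps; low overlap signals the stability assumption is breaking."""
    top_a = set(np.argsort(scores_a)[-k:].tolist())
    top_b = set(np.argsort(scores_b)[-k:].tolist())
    return len(top_a & top_b) / len(top_a | top_b)

# Illustrative importance scores over 5 cached entries at two decoding steps.
step_1  = np.array([0.05, 0.10, 0.60, 0.02, 0.23])   # entry 2 dominates early
step_50 = np.array([0.40, 0.02, 0.08, 0.45, 0.05])   # later queries favor 0 and 3

stable_overlap  = topk_overlap(step_1, step_1, k=2)   # 1.0: identical rankings
drifted_overlap = topk_overlap(step_1, step_50, k=2)  # 0.0: the important set flipped
```

Under mean aggregation, a cache pruned at step 1 would have discarded entries 0 and 3, exactly the ones the later queries need, which is the worst-case degradation the defensive strategy targets.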