Taming the Fragility of KV Cache Eviction in LLM Inference

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Efficient AI; Large Language Model; LLM Inference
Abstract:

Large language models have revolutionized natural language processing, yet their deployment remains hampered by the substantial memory and runtime overhead of the transformer’s Key-Value cache. To mitigate this, recent methods employ a scoring-aggregation framework to evict unimportant cache entries, based on the "stability assumption"—that a fixed subset of entries remains consistently important during generation. However, prior work has largely focused on refining importance indicators for scoring, while defaulting to mean aggregation out of implicit trust in the stability assumption. In this work, we argue that this underlying assumption is inherently fragile, making mean aggregation highly vulnerable in extreme cases. To counter this, we propose a simple yet elegant defensive aggregation strategy: a two-step, linear-time approach that controls worst-case risk, thereby defending against extreme cases with negligible computational overhead. Embodying this strategy, we propose a novel cache eviction method, DefensiveKV, and its extension, Layer-DefensiveKV, which incorporates layer-wise budget allocation. Across seven task domains (18 datasets), our methods reduce generation-quality loss by 2.3× and 4.3×, respectively, versus the strongest baseline at a 20% cache budget. These results set new performance benchmarks and pioneer a promising direction for optimizing cache eviction against underlying fragility through worst-case risk management.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a defensive aggregation strategy for KV cache eviction, challenging the 'stability assumption' that underlies mean-aggregation approaches in prior work. It sits within the 'Defensive and Robust Aggregation Strategies' leaf of the taxonomy, which contains only two papers including this one. This leaf is a specialized subcategory under 'Attention-Based Importance Scoring', itself part of the broader 'Eviction Policy Design and Optimization' branch. The sparse population of this leaf suggests that robust aggregation for cache eviction is an emerging rather than crowded research direction, with most prior attention-based methods focusing on scoring refinements rather than aggregation robustness.

The taxonomy reveals that most neighboring work concentrates on scoring mechanisms: the sibling 'Heavy-Hitter and Sparsity-Driven Eviction' leaf contains three papers exploiting attention sparsity, while 'Adaptive and Layer-Specific Eviction Policies' addresses layer-wise budget allocation. The paper's focus on aggregation strategy diverges from these directions, which largely accept mean aggregation as default. Nearby branches explore geometric importance metrics and learned eviction models, but none explicitly address worst-case risk control in aggregation. The taxonomy's scope notes clarify that this work excludes compression via merging or quantization, positioning it squarely within eviction policy design rather than alternative memory reduction paradigms.

Among 29 candidates examined across three contributions, no clearly refuting prior work was identified. The 'defensive aggregation strategy' contribution examined 9 candidates with 0 refutations, 'DefensiveKV method' examined 10 with 0 refutations, and 'fragility identification' examined 10 with 0 refutations. This suggests that within the limited search scope, the specific framing of aggregation robustness and worst-case risk control appears relatively unexplored. However, the modest search scale means the analysis captures top semantic matches and immediate citations rather than exhaustive field coverage, leaving open the possibility of related work in adjacent communities or under different terminology.

Based on the limited literature search of 29 candidates, the work appears to occupy a sparsely populated niche within KV cache eviction research. The taxonomy structure and contribution-level statistics suggest novelty in addressing aggregation fragility, though the search scope does not cover the entire field. The single sibling paper in the same leaf and absence of refuting candidates among examined works support the impression of a relatively underexplored direction, contingent on the semantic search boundaries employed.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: KV cache eviction in large language model inference. The field addresses memory bottlenecks during autoregressive generation by selectively retaining or discarding key-value pairs from the attention cache.

The taxonomy reveals several major branches. Eviction Policy Design focuses on scoring and selecting which tokens to evict based on attention patterns or importance metrics, exemplified by works like Heavy Hitter Oracle[3] and Model Tells Discard[20]. KV Cache Compression via Merging and Quantization explores reducing memory footprint through token merging (Cache Merging[15]) or precision reduction (Mixed Precision[38]). Multi-Stage and Hybrid Cache Management combines eviction with tiered storage or dynamic strategies, while Architectural and Layer-Level Cache Optimization targets specific model layers or structures. System-Level KV Cache Management (e.g., Lmcache[10], KV Caching Concurrency[4]) addresses infrastructure concerns like distributed caching and scheduling. Domain-Specific and Application-Driven branches tailor cache strategies to particular workloads, Hardware-Aware methods optimize for accelerators, and Personalized strategies adapt to user context. Surveys and comparative studies (System Aware Survey[12]) provide cross-cutting analysis.

A particularly active line of work centers on attention-based importance scoring, where methods compute token relevance from attention weights to guide eviction decisions. Within this space, Taming Fragility[0] sits among defensive and robust aggregation strategies, addressing the challenge that naive attention-based metrics can be brittle or adversarially manipulated. This contrasts with simpler heuristics like Heavy Hitter Oracle[3], which relies on cumulative attention scores without robustness guarantees, and with Eviction Policy Efficacy[7], which empirically evaluates various scoring functions.
The emphasis in Taming Fragility[0] on defensive aggregation highlights an emerging concern: as eviction policies become more sophisticated, ensuring their stability and resistance to pathological inputs becomes critical. This robustness theme intersects with broader questions about how to balance compression ratio, generation quality, and computational overhead across diverse workloads and threat models.

Claimed Contributions

Defensive aggregation strategy for KV cache eviction

The authors introduce a novel aggregation method for KV cache eviction that replaces conventional mean aggregation with a worst-case risk control framework. This two-step approach consists of worst-case risk estimation and adaptive prior-risk correction, designed to handle the fragility of the stability assumption underlying cache eviction methods.
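As a rough illustration of the idea (not the paper's exact formulation), the contrast between mean aggregation and a two-step worst-case aggregation can be sketched as follows. The function names, the max-based risk estimate, and the convex-blend "prior correction" with weight `alpha` are all assumptions made for this sketch.

```python
import numpy as np

def mean_aggregation(scores):
    """Conventional aggregation: average each entry's attention score
    over the observation window (rows = steps, cols = cache entries)."""
    return scores.mean(axis=0)

def defensive_aggregation(scores, prior=None, alpha=0.5):
    """Two-step sketch of worst-case-risk aggregation.

    Step 1: estimate each entry's worst-case importance as the maximum
    attention it received at any observed step, so an entry that spikes
    even once is not averaged away.
    Step 2: correct that estimate with a prior risk term (here a simple
    convex blend; the paper's adaptive correction may differ).
    """
    worst_case = scores.max(axis=0)                  # step 1
    if prior is None:
        prior = scores.mean(axis=0)                  # fallback prior
    return alpha * worst_case + (1 - alpha) * prior  # step 2

# Entry 0 spikes once, then goes quiet; entry 1 is steadily moderate.
scores = np.array([[0.9, 0.4],
                   [0.0, 0.4],
                   [0.0, 0.4],
                   [0.0, 0.4]])
# Under a one-entry budget, mean aggregation evicts the spiking entry;
# the defensive estimate keeps it, bounding the worst-case loss.
```

Both steps are linear in the number of observed steps and entries, consistent with the linear-time claim in the abstract.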

9 retrieved papers
DefensiveKV cache eviction method

The authors develop DefensiveKV, a cache eviction method that integrates their defensive aggregation strategy into the traditional eviction workflow. They further extend it to Layer-DefensiveKV by incorporating layer-wise budget allocation for joint selection of risky entries across layers.
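The difference between a uniform per-layer budget and the joint cross-layer selection attributed to Layer-DefensiveKV might be sketched as below. The function names and the greedy global top-k pooling are illustrative assumptions; the paper's allocation rule may be more sophisticated.

```python
def evict_per_layer(scores_by_layer, keep_per_layer):
    """DefensiveKV-style sketch: keep the top-k entries in each layer
    under a uniform per-layer budget."""
    keep = {}
    for layer, s in scores_by_layer.items():
        order = sorted(range(len(s)), key=lambda i: s[i])
        keep[layer] = set(order[-keep_per_layer:])
    return keep

def evict_jointly(scores_by_layer, total_keep):
    """Layer-DefensiveKV-style sketch: pool aggregated risk scores from
    all layers and keep the globally most important entries, which
    allocates the budget unevenly across layers."""
    pooled = [(score, layer, idx)
              for layer, s in scores_by_layer.items()
              for idx, score in enumerate(s)]
    pooled.sort(reverse=True)  # highest score = most important to keep
    keep = {layer: set() for layer in scores_by_layer}
    for _, layer, idx in pooled[:total_keep]:
        keep[layer].add(idx)
    return keep

# Layer 0 has three important entries, layer 1 only one: a uniform
# budget of 2 per layer wastes a slot on layer 1, while joint
# selection gives layer 0 three slots and layer 1 one.
layer_scores = {0: [0.9, 0.8, 0.7, 0.1], 1: [0.3, 0.1, 0.1, 0.1]}
```

The example shows why layer-wise budget allocation can matter: importance mass is rarely spread evenly across layers.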

10 retrieved papers
Identification of fragility in stability assumption

The authors reveal that the stability assumption—that a fixed subset of cache entries remains consistently important during generation—is fragile and can break down during generation. They demonstrate that this fragility makes the widely-used mean aggregation strategy vulnerable to worst-case performance degradation.
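One simple way to probe this fragility empirically (our illustration, not a procedure from the paper) is to track how much the top-k set of cache entries drifts across generation steps, e.g., via Jaccard overlap with the initial top-k set:

```python
def topk_set(scores, k):
    """Indices of the k highest-scoring cache entries."""
    return set(sorted(range(len(scores)), key=lambda i: scores[i])[-k:])

def stability(score_trace, k):
    """Jaccard overlap of each step's top-k set with the first step's.
    Values near 1 support the stability assumption; low values expose
    its fragility, since the 'important' subset is drifting."""
    ref = topk_set(score_trace[0], k)
    return [len(ref & topk_set(s, k)) / len(ref | topk_set(s, k))
            for s in score_trace[1:]]

# Per-step importance scores for 4 cache entries: the top-2 set drifts
# from {0, 1} to {2, 3}, so overlap with the initial set decays to 0.
trace = [[0.9, 0.8, 0.1, 0.1],
         [0.9, 0.1, 0.8, 0.1],
         [0.1, 0.1, 0.9, 0.8]]
```

When such overlap decays, any aggregation that averages over the trace will underweight entries whose importance arrives late, which is precisely the failure mode mean aggregation is vulnerable to.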

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
