Taming the Fragility of KV Cache Eviction in LLM Inference

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Efficient AI; Large Language Model; LLM Inference
Abstract:

Large language models have revolutionized natural language processing, yet their deployment remains hampered by the substantial memory and runtime overhead of the transformer’s Key-Value cache. To mitigate this, recent methods employ a scoring-aggregation framework to evict unimportant cache entries, based on the "stability assumption"—that a fixed subset of entries remains consistently important during generation. However, prior work has largely focused on refining importance indicators for scoring, while defaulting to mean aggregation out of implicit trust in the stability assumption. In this work, we argue that this underlying assumption is inherently fragile, making mean aggregation highly vulnerable in extreme cases. To counter this, we propose a simple yet elegant defensive aggregation strategy: a two-step, linear-time approach that controls worst-case risk, thereby defending against extreme cases with negligible computational overhead. Embodying this strategy, we propose a novel cache eviction method, DefensiveKV, and its extension, Layer-DefensiveKV, which incorporates layer-wise budget allocation. Across seven task domains (18 datasets), our methods reduce generation-quality loss by 2.3× and 4.3×, respectively, versus the strongest baseline at a 20% cache budget. These results set new performance benchmarks and pioneer a promising direction for optimizing cache eviction against underlying fragility through worst-case risk management.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a defensive aggregation strategy for KV cache eviction, challenging the 'stability assumption' that underlies mean-aggregation approaches in prior work. It sits within the 'Defensive and Robust Aggregation Strategies' leaf of the taxonomy, which contains only two papers including this one. This leaf is a specialized subcategory under 'Attention-Based Importance Scoring', itself part of the broader 'Eviction Policy Design and Optimization' branch. The sparse population of this leaf suggests that robust aggregation for cache eviction is an emerging rather than crowded research direction, with most prior attention-based methods focusing on scoring refinements rather than aggregation robustness.

The taxonomy reveals that most neighboring work concentrates on scoring mechanisms: the sibling 'Heavy-Hitter and Sparsity-Driven Eviction' leaf contains three papers exploiting attention sparsity, while 'Adaptive and Layer-Specific Eviction Policies' addresses layer-wise budget allocation. The paper's focus on aggregation strategy diverges from these directions, which largely accept mean aggregation as default. Nearby branches explore geometric importance metrics and learned eviction models, but none explicitly address worst-case risk control in aggregation. The taxonomy's scope notes clarify that this work excludes compression via merging or quantization, positioning it squarely within eviction policy design rather than alternative memory reduction paradigms.

Among 29 candidates examined across three contributions, no clearly refuting prior work was identified. The 'defensive aggregation strategy' contribution examined 9 candidates with 0 refutations, 'DefensiveKV method' examined 10 with 0 refutations, and 'fragility identification' examined 10 with 0 refutations. This suggests that within the limited search scope, the specific framing of aggregation robustness and worst-case risk control appears relatively unexplored. However, the modest search scale means the analysis captures top semantic matches and immediate citations rather than exhaustive field coverage, leaving open the possibility of related work in adjacent communities or under different terminology.

Based on the limited literature search of 29 candidates, the work appears to occupy a sparsely populated niche within KV cache eviction research. The taxonomy structure and contribution-level statistics suggest novelty in addressing aggregation fragility, though the search scope does not cover the entire field. The single sibling paper in the same leaf and absence of refuting candidates among examined works support the impression of a relatively underexplored direction, contingent on the semantic search boundaries employed.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: KV cache eviction in large language model inference. The field addresses memory bottlenecks during autoregressive generation by selectively retaining or discarding key-value pairs from the attention cache.

The taxonomy reveals several major branches. Eviction Policy Design focuses on scoring and selecting which tokens to evict based on attention patterns or importance metrics, exemplified by works like Heavy Hitter Oracle[3] and Model Tells Discard[20]. KV Cache Compression via Merging and Quantization explores reducing memory footprint through token merging (Cache Merging[15]) or precision reduction (Mixed Precision[38]). Multi-Stage and Hybrid Cache Management combines eviction with tiered storage or dynamic strategies, while Architectural and Layer-Level Cache Optimization targets specific model layers or structures. System-Level KV Cache Management (e.g., Lmcache[10], KV Caching Concurrency[4]) addresses infrastructure concerns like distributed caching and scheduling. Domain-Specific and Application-Driven branches tailor cache strategies to particular workloads, Hardware-Aware methods optimize for accelerators, and Personalized strategies adapt to user context. Surveys and comparative studies (System Aware Survey[12]) provide cross-cutting analysis.

A particularly active line of work centers on attention-based importance scoring, where methods compute token relevance from attention weights to guide eviction decisions. Within this space, Taming Fragility[0] sits among defensive and robust aggregation strategies, addressing the challenge that naive attention-based metrics can be brittle or adversarially manipulated. This contrasts with simpler heuristics like Heavy Hitter Oracle[3], which relies on cumulative attention scores without robustness guarantees, and with Eviction Policy Efficacy[7], which empirically evaluates various scoring functions.
The emphasis in Taming Fragility[0] on defensive aggregation highlights an emerging concern: as eviction policies become more sophisticated, ensuring their stability and resistance to pathological inputs becomes critical. This robustness theme intersects with broader questions about how to balance compression ratio, generation quality, and computational overhead across diverse workloads and threat models.

Claimed Contributions

Defensive aggregation strategy for KV cache eviction

The authors introduce a novel aggregation method for KV cache eviction that replaces conventional mean aggregation with a worst-case risk control framework. This two-step approach consists of worst-case risk estimation and adaptive prior-risk correction, designed to handle the fragility of the stability assumption underlying cache eviction methods.
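As a rough illustration of the idea (not the paper's exact formulation), the contrast between mean aggregation and a two-step worst-case aggregation can be sketched as follows. The function names, the max-based risk estimate, and the convex-blend "prior correction" with weight `alpha` are all assumptions made for this sketch.

```python
import numpy as np

def mean_aggregation(scores):
    """Conventional aggregation: average each entry's attention score
    over the observation window (rows = steps, cols = cache entries)."""
    return scores.mean(axis=0)

def defensive_aggregation(scores, prior=None, alpha=0.5):
    """Two-step sketch of worst-case-risk aggregation.

    Step 1: estimate each entry's worst-case importance as the maximum
    attention it received at any observed step, so an entry that spikes
    even once is not averaged away.
    Step 2: correct that estimate with a prior risk term (here a simple
    convex blend; the paper's adaptive correction may differ).
    """
    worst_case = scores.max(axis=0)                  # step 1
    if prior is None:
        prior = scores.mean(axis=0)                  # fallback prior
    return alpha * worst_case + (1 - alpha) * prior  # step 2

# Entry 0 spikes once, then goes quiet; entry 1 is steadily moderate.
scores = np.array([[0.9, 0.4],
                   [0.0, 0.4],
                   [0.0, 0.4],
                   [0.0, 0.4]])
# Under a one-entry budget, mean aggregation evicts the spiking entry;
# the defensive estimate keeps it, bounding the worst-case loss.
```

Both steps are linear in the number of observed steps and entries, consistent with the linear-time claim in the abstract.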

9 retrieved papers
DefensiveKV cache eviction method

The authors develop DefensiveKV, a cache eviction method that integrates their defensive aggregation strategy into the traditional eviction workflow. They further extend it to Layer-DefensiveKV by incorporating layer-wise budget allocation for joint selection of risky entries across layers.
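The difference between a uniform per-layer budget and the joint cross-layer selection attributed to Layer-DefensiveKV might be sketched as below. The function names and the greedy global top-k pooling are illustrative assumptions; the paper's allocation rule may be more sophisticated.

```python
def evict_per_layer(scores_by_layer, keep_per_layer):
    """DefensiveKV-style sketch: keep the top-k entries in each layer
    under a uniform per-layer budget."""
    keep = {}
    for layer, s in scores_by_layer.items():
        order = sorted(range(len(s)), key=lambda i: s[i])
        keep[layer] = set(order[-keep_per_layer:])
    return keep

def evict_jointly(scores_by_layer, total_keep):
    """Layer-DefensiveKV-style sketch: pool aggregated risk scores from
    all layers and keep the globally most important entries, which
    allocates the budget unevenly across layers."""
    pooled = [(score, layer, idx)
              for layer, s in scores_by_layer.items()
              for idx, score in enumerate(s)]
    pooled.sort(reverse=True)  # highest score = most important to keep
    keep = {layer: set() for layer in scores_by_layer}
    for _, layer, idx in pooled[:total_keep]:
        keep[layer].add(idx)
    return keep

# Layer 0 has three important entries, layer 1 only one: a uniform
# budget of 2 per layer wastes a slot on layer 1, while joint
# selection gives layer 0 three slots and layer 1 one.
layer_scores = {0: [0.9, 0.8, 0.7, 0.1], 1: [0.3, 0.1, 0.1, 0.1]}
```

The example shows why layer-wise budget allocation can matter: importance mass is rarely spread evenly across layers.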

10 retrieved papers
Identification of fragility in stability assumption

The authors reveal that the stability assumption—that a fixed subset of cache entries remains consistently important during generation—is fragile and can break down during generation. They demonstrate that this fragility makes the widely-used mean aggregation strategy vulnerable to worst-case performance degradation.
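One simple way to probe this fragility empirically (our illustration, not a procedure from the paper) is to track how much the top-k set of cache entries drifts across generation steps, e.g., via Jaccard overlap with the initial top-k set:

```python
def topk_set(scores, k):
    """Indices of the k highest-scoring cache entries."""
    return set(sorted(range(len(scores)), key=lambda i: scores[i])[-k:])

def stability(score_trace, k):
    """Jaccard overlap of each step's top-k set with the first step's.
    Values near 1 support the stability assumption; low values expose
    its fragility, since the 'important' subset is drifting."""
    ref = topk_set(score_trace[0], k)
    return [len(ref & topk_set(s, k)) / len(ref | topk_set(s, k))
            for s in score_trace[1:]]

# Per-step importance scores for 4 cache entries: the top-2 set drifts
# from {0, 1} to {2, 3}, so overlap with the initial set decays to 0.
trace = [[0.9, 0.8, 0.1, 0.1],
         [0.9, 0.1, 0.8, 0.1],
         [0.1, 0.1, 0.9, 0.8]]
```

When such overlap decays, any aggregation that averages over the trace will underweight entries whose importance arrives late, which is precisely the failure mode mean aggregation is vulnerable to.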

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
