Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: KV Cache, Prefix Sharing, LRU, Large Language Models, LLM Routing, KV Cache Eviction, Multi-LLM Serving
Abstract:

KV caching is a fundamental technique for accelerating Large Language Model (LLM) inference by reusing key-value (KV) pairs from previous queries, but its effectiveness under limited memory is highly sensitive to the eviction policy. The default Least Recently Used (LRU) eviction algorithm struggles with dynamic online query arrivals, especially in multi-LLM serving scenarios, where balancing query load across workers and maximizing the cache hit rate of each worker are inherently conflicting objectives. We give the first unified mathematical model that captures the core trade-offs between KV cache eviction and query routing. Our analysis reveals the theoretical limitations of existing methods and leads to principled algorithms that integrate provably competitive randomized KV cache eviction with learning-based methods that adaptively route queries with evolving patterns, thus balancing query load and cache hit rate. Our theoretical results are validated by extensive experiments across 4 benchmarks and 3 prefix-sharing settings, demonstrating improvements of up to 6.92× in cache hit rate, an 11.96× reduction in latency, a 14.06× reduction in time-to-first-token (TTFT), and a 77.4% increase in throughput over state-of-the-art methods.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a unified mathematical model integrating KV cache eviction with query routing in multi-LLM serving, alongside two algorithms: Randomized Leaf Token (RLT) eviction and Learning-Based Greedy Routing (LBGR). It resides in the 'Cache Eviction Policies and Optimization' leaf, which contains only one sibling paper (Queue Management SLO). This leaf sits within the broader 'KV Cache Management and Eviction Strategies' branch, indicating a moderately sparse research direction focused specifically on eviction policy design rather than storage formats or semantic reuse.

The taxonomy reveals neighboring leaves addressing complementary aspects: 'Cache Storage and Memory Hierarchy' explores tiered memory layouts (e.g., Strata Hierarchical Caching), 'Context and Semantic Cache Reuse' targets similarity-based retrieval, and 'Cache-Aware Request Routing' examines routing algorithms that consider cache states. The original work bridges eviction and routing—a cross-cutting concern—by jointly optimizing both dimensions. Its randomized eviction strategy contrasts with deterministic priority schemes in sibling work and hierarchical staging approaches in adjacent leaves, positioning it at the intersection of cache management and query distribution.

Among seven candidates examined, no contribution was clearly refuted. The unified model (one candidate examined) and LBGR routing (two candidates) show no overlapping prior work in the limited search. RLT eviction (four candidates examined) likewise encountered no refutations, suggesting that randomized token-level eviction combined with load-aware routing represents a relatively unexplored angle. The search scope—seven papers from top-K semantic matches—is narrow, so these findings reflect novelty within a constrained sample rather than exhaustive field coverage.

Given the limited search and sparse taxonomy leaf, the work appears to occupy a distinct niche by unifying eviction and routing under a single framework. The absence of refutations across all contributions, combined with the small sibling set, suggests the approach is not directly anticipated by the examined literature. However, the narrow candidate pool means adjacent or emerging work outside the top-seven matches could still present relevant comparisons.

Taxonomy

Core-task Taxonomy Papers: 25
Claimed Contributions: 3
Contribution Candidate Papers Compared: 7
Refutable Papers: 0

Research Landscape Overview

Core task: KV cache eviction and query routing in multi-LLM serving systems. The field addresses the dual challenge of managing memory-intensive key-value caches during inference and intelligently distributing queries across multiple large language model instances.

The taxonomy reveals several major branches. KV Cache Management and Eviction Strategies focuses on policies that decide which cached tokens to retain or discard, often trading memory footprint against recomputation cost. Query Distribution and Routing explores how to assign incoming requests to appropriate model replicas or endpoints. Distributed and Disaggregated LLM Serving Architectures examines system designs that separate compute from storage or span multiple nodes. Multi-Stage and Multi-Model Serving Pipelines considers workflows involving cascades of smaller and larger models. Scheduling and Resource Management tackles broader orchestration questions such as batching and GPU allocation. Surveys and Overviews provide synthetic perspectives on inference optimization.

Representative works like Strata Hierarchical Caching[4] and Infinite-LLM[5] illustrate memory-hierarchy techniques, while SkyLB[6] and Unified Gateway[13] exemplify routing and load-balancing approaches. A particularly active line of work centers on cache eviction policies that balance hit rates with service-level objectives, as seen in Queue Management SLO[1], which integrates eviction decisions with queueing dynamics. Another contrasting theme pits hierarchical or tiered caching (Strata Hierarchical Caching[4] and In-Context Caching[7] explore multi-level storage to exploit locality) against online adaptive schemes like Online Context Caching[10] that adjust eviction on the fly.

The original paper, Randomization KV Caching[0], sits squarely within the Cache Eviction Policies and Optimization cluster, proposing randomized selection mechanisms to reduce deterministic bias in token retention. Its emphasis on stochastic eviction distinguishes it from the deterministic priority schemes in Queue Management SLO[1] and from the hierarchical staging strategies in Strata Hierarchical Caching[4], offering a complementary angle on managing cache churn under variable workloads.

Claimed Contributions

Unified mathematical model for KV cache-aware load balancing

The authors present the first formal model that jointly captures the tensions between local KV cache eviction policies and global query routing across multiple LLMs. This model decomposes end-to-end latency into service time and queuing delay, enabling principled analysis of the makespan minimization problem.

1 retrieved paper

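
The report states this decomposition only in words. One plausible formalization is sketched below; all symbols (the service time S_i, queuing delay W_i, per-token recomputation cost c, fixed overhead s_0, and routing policy σ) are assumptions for illustration, not notation taken from the paper.

```latex
% End-to-end latency of query q assigned to LLM worker i:
% service time plus queuing delay.
T_i(q) = S_i(q) + W_i
% A cache-aware service time: recomputation cost for the tokens of q
% that miss in worker i's KV cache, plus a fixed overhead s_0.
S_i(q) = c \cdot \bigl( |q| - \mathrm{hit}_i(q) \bigr) + s_0
% The routing policy \sigma is chosen to minimize the makespan:
\min_{\sigma} \; \max_{i} \sum_{q \,:\, \sigma(q) = i} S_i(q)
```

Under this reading, eviction policy affects the hit term inside S_i while routing affects which worker's sum each query joins, which is exactly the coupling the contribution claims to capture.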
Randomized Leaf Token (RLT) eviction algorithm

The authors propose RLT, a randomized eviction algorithm that uniformly selects unmarked leaf tokens for eviction. This approach achieves an O(log n) competitive ratio, which is exponentially better than the O(n) ratio of existing LRU-based policies, and is proven optimal among randomized algorithms.

4 retrieved papers

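
The report describes RLT only at this level of detail. The sketch below fills in the gaps by analogy with classic randomized marking algorithms for paging: cached tokens form a shared prefix tree, and eviction samples uniformly among unmarked leaf tokens, unmarking everything when a phase ends. The `PrefixTreeCache` class, its phase logic, and all method names are assumptions, not the authors' implementation.

```python
import random


class Node:
    """A token in the shared prefix tree; leaves are evictable KV entries."""
    def __init__(self, token, parent=None):
        self.token = token
        self.parent = parent
        self.children = {}   # child token -> Node
        self.marked = False  # touched during the current phase


class PrefixTreeCache:
    def __init__(self, capacity):
        self.root = Node(None)
        self.capacity = capacity
        self.size = 0

    def _leaves(self):
        stack, leaves = [self.root], []
        while stack:
            n = stack.pop()
            if n.children:
                stack.extend(n.children.values())
            elif n is not self.root:
                leaves.append(n)
        return leaves

    def _evict_one(self):
        # RLT step: pick a leaf uniformly at random among *unmarked*
        # leaves; if every leaf is marked, start a new phase by
        # unmarking all of them (marking-algorithm style).
        leaves = self._leaves()
        unmarked = [l for l in leaves if not l.marked]
        if not unmarked:
            for l in leaves:
                l.marked = False
            unmarked = leaves
        victim = random.choice(unmarked)
        del victim.parent.children[victim.token]
        self.size -= 1

    def insert(self, tokens):
        """Walk/extend the tree along `tokens`, marking the touched path."""
        node = self.root
        for t in tokens:
            if t not in node.children:
                while self.size >= self.capacity:
                    self._evict_one()
                node.children[t] = Node(t, parent=node)
                self.size += 1
            node = node.children[t]
            node.marked = True
```

Evicting only leaves keeps every cached prefix contiguous from the root, so a shared prefix is never invalidated mid-path; the uniform random choice among unmarked leaves is what yields the O(log n)-competitive behavior claimed for RLT.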
Learning-Based Greedy Routing (LBGR) algorithm

The authors introduce LBGR, which uses online linear regression to estimate per-LLM end-to-end latency by combining service time estimation, exponentially decaying queue load tracking, and learned residual correction. Queries are then routed greedily to the LLM with minimal predicted latency, providing adaptability to evolving query patterns.

2 retrieved papers
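
The three ingredients named above (service-time estimation, exponentially decaying queue-load tracking, and a learned residual) can be sketched as a single online linear predictor per worker. Everything below is an illustrative reconstruction under those stated ingredients: the `LBGRRouter` class, its feature vector, and the SGD update are assumptions, not the authors' implementation.

```python
class LBGRRouter:
    """Sketch of Learning-Based Greedy Routing: send each query to the
    worker with the lowest predicted end-to-end latency, then refine
    the per-worker predictor online."""

    def __init__(self, n_workers, lr=0.01, decay=0.9):
        self.n = n_workers
        self.lr = lr
        self.decay = decay
        self.queue_load = [0.0] * n_workers  # exponentially decayed load
        # per-worker weights: [service-time coeff, queue-load coeff, bias];
        # the bias acts as the learned residual correction
        self.w = [[1.0, 1.0, 0.0] for _ in range(n_workers)]

    def _features(self, i, service_est):
        return [service_est, self.queue_load[i], 1.0]

    def predict(self, i, service_est):
        return sum(w * x for w, x in zip(self.w[i], self._features(i, service_est)))

    def route(self, service_est):
        """Greedy step: pick the worker with minimal predicted latency."""
        i = min(range(self.n), key=lambda j: self.predict(j, service_est))
        # decay every worker's tracked load, then charge the chosen one
        for j in range(self.n):
            self.queue_load[j] *= self.decay
        self.queue_load[i] += service_est
        return i

    def update(self, i, service_est, observed_latency):
        """Online linear-regression step (SGD on squared error)."""
        x = self._features(i, service_est)
        err = self.predict(i, service_est) - observed_latency
        self.w[i] = [w - self.lr * err * xj for w, xj in zip(self.w[i], x)]
```

In use, `route` is called per query (with a service-time estimate, e.g. derived from the expected prefix-cache hit) and `update` is called once the query's true latency is observed, so the predictor tracks evolving query patterns without retraining.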
