Randomization Boosts KV Caching, Learning Balances Query Load: A Joint Perspective
Overview
Overall Novelty Assessment
The paper proposes a unified mathematical model integrating KV cache eviction with query routing in multi-LLM serving, alongside two algorithms: Randomized Leaf Token (RLT) eviction and Learning-Based Greedy Routing (LBGR). It resides in the 'Cache Eviction Policies and Optimization' leaf, which contains only one sibling paper (Queue Management SLO). This leaf sits within the broader 'KV Cache Management and Eviction Strategies' branch, indicating a moderately sparse research direction focused specifically on eviction policy design rather than storage formats or semantic reuse.
The taxonomy reveals neighboring leaves addressing complementary aspects: 'Cache Storage and Memory Hierarchy' explores tiered memory layouts (e.g., Strata Hierarchical Caching), 'Context and Semantic Cache Reuse' targets similarity-based retrieval, and 'Cache-Aware Request Routing' examines routing algorithms that consider cache states. The paper under review bridges eviction and routing—a cross-cutting concern—by jointly optimizing both dimensions. Its randomized eviction strategy contrasts with deterministic priority schemes in sibling work and hierarchical staging approaches in adjacent leaves, positioning it at the intersection of cache management and query distribution.
Among seven candidates examined, no contribution was clearly refuted. The unified model (one candidate examined) and LBGR routing (two candidates) show no overlapping prior work in the limited search. RLT eviction (four candidates examined) likewise encountered no refutations, suggesting that randomized token-level eviction combined with load-aware routing represents a relatively unexplored angle. The search scope—seven papers from top-K semantic matches—is narrow, so these findings reflect novelty within a constrained sample rather than exhaustive field coverage.
Given the limited search and sparse taxonomy leaf, the work appears to occupy a distinct niche by unifying eviction and routing under a single framework. The absence of refutations across all contributions, combined with the small sibling set, suggests the approach is not directly anticipated by the examined literature. However, the narrow candidate pool means adjacent or emerging work outside the top-seven matches could still present relevant comparisons.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present the first formal model that jointly captures the tension between local KV cache eviction policies and global query routing across multiple LLMs. This model decomposes end-to-end latency into service time and queuing delay, enabling principled analysis of the makespan minimization problem.
The authors propose RLT, a randomized eviction algorithm that selects unmarked leaf tokens uniformly at random for eviction. This approach achieves an O(log n) competitive ratio, exponentially better than the O(n) ratio of existing LRU-based policies, and is proven optimal among randomized algorithms.
The authors introduce LBGR, which uses online linear regression to estimate per-LLM end-to-end latency by combining service time estimation, exponentially decaying queue load tracking, and learned residual correction. Queries are then routed greedily to the LLM with minimal predicted latency, providing adaptability to evolving query patterns.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] Queue Management for SLO-Oriented Large Language Model Serving
Contribution Analysis
Detailed comparisons for each claimed contribution
Unified mathematical model for KV cache-aware load balancing
The authors present the first formal model that jointly captures the tension between local KV cache eviction policies and global query routing across multiple LLMs. This model decomposes end-to-end latency into service time and queuing delay, enabling principled analysis of the makespan minimization problem.
[12] BanaServe: Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure
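The claimed latency decomposition is easy to make concrete. The sketch below models each LLM replica's end-to-end latency as queuing delay plus service time, with makespan as the routing objective; all class and field names are illustrative assumptions, not the paper's notation.

```python
from dataclasses import dataclass, field

@dataclass
class LLMServer:
    """One LLM replica in the serving pool (illustrative model)."""
    service_rate: float                        # tokens processed per second
    queue: list = field(default_factory=list)  # pending query sizes, in tokens

    def service_time(self, query_tokens: int) -> float:
        # Time to process the query itself once it reaches the head of the queue.
        return query_tokens / self.service_rate

    def queuing_delay(self) -> float:
        # Time a new arrival waits behind all work already queued.
        return sum(q / self.service_rate for q in self.queue)

    def end_to_end_latency(self, query_tokens: int) -> float:
        # The decomposition the model formalizes:
        # latency = queuing delay + service time.
        return self.queuing_delay() + self.service_time(query_tokens)

def makespan(servers: list) -> float:
    # Makespan = time until the most-loaded replica drains its queue,
    # the objective the routing side of the model minimizes.
    return max(s.queuing_delay() for s in servers)
```

A cache eviction policy affects this model through the service-rate term (cache misses force KV recomputation, slowing service), which is what couples the two subproblems.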
Randomized Leaf Token (RLT) eviction algorithm
The authors propose RLT, a randomized eviction algorithm that selects unmarked leaf tokens uniformly at random for eviction. This approach achieves an O(log n) competitive ratio, exponentially better than the O(n) ratio of existing LRU-based policies, and is proven optimal among randomized algorithms.
[26] Marconi: Prefix Caching for the Era of Hybrid LLMs
[27] Learned Prefix Caching for Efficient LLM Inference
[28] Prefix and Output Length-Aware Scheduling for Efficient Online LLM Inference
[29] Transcending Cost-Quality Tradeoff in Agent Serving via Session-Awareness
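The RLT policy as described reduces to: treat the KV cache as a token tree, exclude tokens marked by active requests, and evict a uniformly random unmarked leaf. A minimal sketch, assuming a simple parent/child tree representation (the data-structure details are not specified in the summary above):

```python
import random
from typing import Optional

class RLTCache:
    """Sketch of Randomized Leaf Token eviction over a token tree.
    Assumption: only unmarked leaves are eviction candidates."""

    def __init__(self, capacity: int, seed: Optional[int] = None):
        self.capacity = capacity
        self.parent = {}            # token -> parent token (or None for roots)
        self.children = {}          # token -> set of child tokens
        self.marked = set()         # tokens pinned by in-flight requests
        self.rng = random.Random(seed)

    def insert(self, token: str, parent: Optional[str] = None) -> None:
        while len(self.parent) >= self.capacity:
            self._evict_one()
        self.parent[token] = parent
        self.children.setdefault(token, set())
        if parent is not None:
            self.children.setdefault(parent, set()).add(token)

    def mark(self, token: str) -> None:
        self.marked.add(token)

    def _evict_one(self) -> None:
        # Candidates: cached tokens with no children that are not marked.
        leaves = [t for t in self.parent
                  if not self.children.get(t) and t not in self.marked]
        if not leaves:
            raise RuntimeError("no evictable leaf: cache pinned by marks")
        # Uniform random choice over leaves -- the randomization that
        # underlies the claimed O(log n) competitive ratio.
        victim = self.rng.choice(leaves)
        p = self.parent.pop(victim)
        self.children.pop(victim, None)
        if p is not None:
            self.children[p].discard(victim)
```

Evicting only leaves preserves the prefix property of the cached tree (no token survives without its ancestors), which is why the candidate set is restricted before randomizing.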
Learning-Based Greedy Routing (LBGR) algorithm
The authors introduce LBGR, which uses online linear regression to estimate per-LLM end-to-end latency by combining service time estimation, exponentially decaying queue load tracking, and learned residual correction. Queries are then routed greedily to the LLM with minimal predicted latency, providing adaptability to evolving query patterns.
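The three ingredients of LBGR named above (a service-time estimate, an exponentially decaying queue-load tracker, and an online-learned residual) compose into a simple greedy router. The sketch below is an assumed reconstruction: the feature set, SGD residual update, and decay placement are illustrative choices, not the authors' exact design.

```python
class LBGRRouter:
    """Sketch of Learning-Based Greedy Routing.
    Predicted latency = service-time estimate + decayed queue load
                        + residual from a per-LLM online linear model."""

    def __init__(self, n_llms: int, decay: float = 0.9, lr: float = 0.01):
        self.n = n_llms
        self.decay = decay                     # exponential decay of tracked load
        self.lr = lr                           # SGD step size for the residual model
        self.queue_load = [0.0] * n_llms       # decayed outstanding work per LLM
        self.weights = [[0.0, 0.0] for _ in range(n_llms)]

    def predict(self, llm: int, service_est: float) -> float:
        w = self.weights[llm]
        residual = w[0] * service_est + w[1] * self.queue_load[llm]
        return service_est + self.queue_load[llm] + residual

    def route(self, service_est: float) -> int:
        # Greedy step: send the query to the LLM with minimal predicted latency.
        choice = min(range(self.n), key=lambda i: self.predict(i, service_est))
        # Track load with exponential decay (a fuller implementation would
        # also decay all loads over time as queued work completes).
        self.queue_load[choice] = self.decay * self.queue_load[choice] + service_est
        return choice

    def observe(self, llm: int, service_est: float, actual_latency: float) -> None:
        # Online linear-regression update: move the residual model toward
        # the observed prediction error.
        err = self.predict(llm, service_est) - actual_latency
        x = (service_est, self.queue_load[llm])
        for j in range(2):
            self.weights[llm][j] -= self.lr * err * x[j]
    ```

The residual term is what gives the claimed adaptability: as query patterns drift, observed latencies pull the per-LLM corrections away from the static base estimate.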