DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Distributed LLM Serving, LLM Context Caching, Request Scheduling, Cache Affinity, Load Balancing
Abstract:

In large language model (LLM) serving, reusing the key-value (KV) cache of prompts across requests is a key technique for reducing time-to-first-token (TTFT) and lowering serving costs. Cache-affinity scheduling, which co-locates requests with the same prompt prefix to maximize KV cache reuse, often conflicts with load-balancing scheduling, which aims to distribute requests evenly across compute instances. Existing schedulers struggle to reconcile this trade-off, as they operate within a single mapping space, typically applying cache-affinity routing to a subset of requests and load-balanced routing to the rest, without a unified solution to achieve both goals. To overcome this limitation, we propose DualMap, a dual-mapping scheduling strategy for distributed LLM serving that simultaneously enables cache affinity and load balancing. The key idea of DualMap is to map each request to two candidate instances using two independent hash functions based on the request prompt, and then intelligently select the better candidate based on current system states. This design increases the likelihood that requests with shared prefixes are co-located, while evenly dispersing distinct prefixes across the cluster via "the power of two choices". To make DualMap robust under dynamic and skewed real-world workloads, we incorporate three techniques: 1) SLO-aware request routing, which prioritizes cache affinity but switches to load-aware scheduling when TTFT exceeds the SLO, enhancing load balance without sacrificing cache reuse; 2) hotspot-aware rebalancing, which dynamically migrates requests from overloaded to underloaded instances, mitigating hotspots and rebalancing the system; 3) lightweight dual-hash-ring scaling, which leverages a dual-hash-ring mapping to support fast and low-overhead instance scaling without costly global remapping.
Experiments on real-world workloads show that DualMap improves effective request capacity by up to 2.25× under the same TTFT SLO constraints, compared with state-of-the-art systems.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes DualMap, a dual-mapping scheduling strategy that simultaneously pursues cache affinity and load balancing in distributed LLM serving. It resides in the 'Cache-Affinity and Load-Balancing Routing' leaf, which contains four papers total (including this one). This leaf sits within the broader 'Request Scheduling and Routing Strategies' branch, indicating a moderately populated research direction. The taxonomy shows that cache-affinity routing is an active area, with sibling works like Preble and BanaServe addressing similar trade-offs between prefix reuse and server utilization.

The taxonomy reveals several neighboring research directions. Adjacent leaves include 'Preemptive and Priority-Based Scheduling' (focusing on fine-grained preemption) and 'SLO-Aware and Adaptive Scheduling' (emphasizing latency guarantees under varying workloads). The broader 'Request Scheduling and Routing Strategies' branch excludes KV cache storage mechanisms, which are handled under 'KV Cache Management and Storage Architectures'. DualMap's dual-mapping approach connects to load-balancing concerns in 'Multi-GPU and Distributed Memory Management' but remains distinct by focusing on request routing rather than memory allocation or hardware heterogeneity.

Among the 30 candidates examined, none clearly refute any of DualMap's three contributions: the dual-mapping strategy itself, the SLO-aware routing technique, and the hotspot-aware rebalancing strategy. Each contribution was assessed against 10 candidates, with zero refutable overlaps identified. This suggests that within the limited search scope, the specific combination of dual hash-based mapping with SLO-aware selection and rebalancing appears novel. However, the analysis does not claim exhaustive coverage; sibling papers like Preble and BanaServe address overlapping concerns (cache affinity, load balancing) using different mechanisms, indicating that the problem space is well-explored even if the exact solution differs.

Based on the top-30 semantic matches and taxonomy structure, DualMap appears to offer a fresh approach to a recognized challenge in distributed LLM serving. The dual-mapping mechanism distinguishes it from single-mapping strategies in sibling works, though the broader goal of reconciling cache reuse with load distribution is shared across the leaf. The limited search scope means that additional related work may exist beyond the examined candidates, particularly in adjacent scheduling or distributed systems literature not captured by the semantic search.

Taxonomy

Core-task Taxonomy Papers: 40
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: request scheduling for distributed LLM serving with KV cache reuse. The field has organized itself around several complementary dimensions. One branch focuses on KV cache management and storage architectures, exploring how to efficiently store and retrieve cached key-value tensors across memory hierarchies and distributed nodes. Another branch examines request scheduling and routing strategies, addressing how to direct incoming queries to servers in ways that maximize cache hits while balancing load. A third branch investigates multi-request and multi-turn KV cache reuse, tackling scenarios where prefixes or conversational context can be shared across users or sessions. Resource management and system optimization studies address broader questions of memory allocation, batching policies, and end-to-end throughput. Finally, surveys and benchmarks provide cross-cutting perspectives on the evolving landscape, as seen in works like KV Cache Survey[18] and Inference Scheduling Survey[31].

Within the scheduling and routing branch, a particularly active line of work balances cache affinity against load distribution. Some systems such as Preble[17] and Online Context Caching[26] prioritize routing requests to instances that already hold relevant cached prefixes, aiming to minimize redundant computation. Others like BanaServe[28] incorporate load-aware heuristics to prevent hotspots when popular prefixes concentrate traffic on a few servers. DualMap[0] sits squarely in this cache-affinity and load-balancing cluster, proposing mechanisms that jointly consider prefix overlap and server utilization when making routing decisions.

Compared to Preble[17], which emphasizes prompt-aware scheduling within a single datacenter, DualMap[0] extends the approach to multi-tier or geo-distributed settings where network latency and hierarchical cache placement become additional factors. Meanwhile, BanaServe[28] explores dynamic rebalancing under skewed workloads, a complementary concern that DualMap[0] addresses through its dual-objective mapping strategy. These works collectively illustrate the central trade-off: aggressive cache reuse can yield substantial speedups, but naive affinity routing risks load imbalance and queueing delays.

Claimed Contributions

DualMap dual-mapping scheduling strategy

A scheduling approach that maps each request to two candidate instances using two independent hash functions based on the request prompt, then intelligently selects between them. This design increases the likelihood that requests with shared prefixes are co-located while evenly dispersing distinct prefixes across the cluster.

10 retrieved papers
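As a concrete illustration, this dual mapping can be sketched with two independent consistent-hash rings plus a "power of two choices" selection. All names below (`HashRing`, `DualRing`, the ring salts, the load table) are hypothetical; this is a minimal sketch of the general technique, not the paper's implementation.

```python
import bisect
import hashlib

def _stable_hash(key: str, salt: str) -> int:
    # Deterministic hash; distinct salts make the two rings independent.
    return int(hashlib.sha256(f"{salt}:{key}".encode()).hexdigest(), 16)

class HashRing:
    """Consistent-hash ring: adding an instance remaps only the keys
    that fall between its points and their predecessors."""
    def __init__(self, salt: str, vnodes: int = 64):
        self.salt, self.vnodes = salt, vnodes
        self._points = []  # sorted list of (hash, instance)

    def add(self, instance: str) -> None:
        for v in range(self.vnodes):
            point = (_stable_hash(f"{instance}#{v}", self.salt), instance)
            bisect.insort(self._points, point)

    def lookup(self, key: str) -> str:
        i = bisect.bisect_left(self._points, (_stable_hash(key, self.salt), ""))
        return self._points[i % len(self._points)][1]

class DualRing:
    """Maps a prompt prefix to two candidate instances (one per ring) and
    routes to the less-loaded one: 'the power of two choices'."""
    def __init__(self, instances):
        self.rings = (HashRing("ring-a"), HashRing("ring-b"))
        for inst in instances:
            for ring in self.rings:
                ring.add(inst)

    def candidates(self, prefix: str):
        return tuple(ring.lookup(prefix) for ring in self.rings)

    def route(self, prefix: str, loads) -> str:
        a, b = self.candidates(prefix)
        return a if loads.get(a, 0) <= loads.get(b, 0) else b
```

Requests sharing a prefix always see the same two candidates, so cache affinity is preserved whichever candidate is chosen, while distinct prefixes spread across the cluster. Using rings rather than modular hashing also means that adding or removing an instance remaps only a fraction of prefixes, in the spirit of the dual-hash-ring scaling described in the abstract.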
SLO-aware request routing technique

A routing strategy that prioritizes prompt-aware scheduling to achieve cache affinity and minimize recomputation overhead, but dynamically shifts to load-aware scheduling only when expected TTFT exceeds the predefined SLO, enhancing load balance without sacrificing cache reuse.

10 retrieved papers
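A minimal sketch of this selection rule, assuming a hypothetical `predict_ttft` callback that estimates time-to-first-token on each candidate (the real system's estimator and thresholds are not specified here):

```python
def slo_aware_select(affinity_inst, alt_inst, predict_ttft, slo_ms):
    """Prefer the cache-affinity candidate; switch to load-aware
    selection only when its expected TTFT would violate the SLO."""
    if predict_ttft(affinity_inst) <= slo_ms:
        return affinity_inst  # cache reuse wins by default
    # SLO at risk: fall back to whichever candidate is expected faster.
    return min((affinity_inst, alt_inst), key=predict_ttft)
```

Note that the fallback compares both candidates rather than blindly switching, so the affinity instance is kept when the alternative would be even slower.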
Hotspot-aware rebalancing strategy

A rebalancing mechanism that selectively migrates requests from overloaded instances to their backup instances (the alternative instance from the initial dual mapping), mitigating hotspots and rebalancing the system under skewed prefix popularity workloads.

10 retrieved papers
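The rebalancing step can be sketched as a watermark-driven migration loop. The queue representation, watermark parameters, and `backup_of` table below are illustrative assumptions, not the paper's mechanism in detail:

```python
def rebalance(queues, backup_of, high_wm, low_wm):
    """Migrate waiting requests from overloaded instances to their backup
    (the other candidate from the initial dual mapping), stopping once the
    source drains to the high watermark or the backup fills to the low one."""
    migrated = []
    for inst, queue in queues.items():
        backup = backup_of.get(inst)
        if backup is None or backup == inst:
            continue
        while len(queue) > high_wm and len(queues[backup]) < low_wm:
            request = queue.pop()  # move the newest waiting request
            queues[backup].append(request)
            migrated.append((request, inst, backup))
    return migrated
```

Migrating to the backup instance, rather than an arbitrary lightly loaded one, keeps each request on one of its two hashed candidates, so any prefix already cached there can still be reused.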

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

DualMap dual-mapping scheduling strategy


Contribution

SLO-aware request routing technique


Contribution

Hotspot-aware rebalancing strategy
