DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving
Overview
Overall Novelty Assessment
The paper proposes DualMap, a dual-mapping scheduling strategy that simultaneously pursues cache affinity and load balancing in distributed LLM serving. It resides in the 'Cache-Affinity and Load-Balancing Routing' leaf, which contains four papers in total (including this one), indicating a moderately populated research direction. The leaf sits within the broader 'Request Scheduling and Routing Strategies' branch. The taxonomy shows that cache-affinity routing is an active area, with sibling works such as Preble and BanaServe addressing similar trade-offs between prefix reuse and server utilization.
The taxonomy reveals several neighboring research directions. Adjacent leaves include 'Preemptive and Priority-Based Scheduling' (focusing on fine-grained preemption) and 'SLO-Aware and Adaptive Scheduling' (emphasizing latency guarantees under varying workloads). The broader 'Request Scheduling and Routing Strategies' branch excludes KV cache storage mechanisms, which are handled under 'KV Cache Management and Storage Architectures'. DualMap's dual-mapping approach connects to load-balancing concerns in 'Multi-GPU and Distributed Memory Management' but remains distinct by focusing on request routing rather than memory allocation or hardware heterogeneity.
Among the 30 candidates examined, none clearly refute any of DualMap's three contributions: the dual-mapping strategy itself, the SLO-aware routing technique, and the hotspot-aware rebalancing strategy. Each contribution was assessed against 10 candidates, with zero refutable overlaps identified. This suggests that within the limited search scope, the specific combination of dual hash-based mapping with SLO-aware selection and rebalancing appears novel. However, the analysis does not claim exhaustive coverage; sibling papers like Preble and BanaServe address overlapping concerns (cache affinity, load balancing) using different mechanisms, indicating that the problem space is well-explored even if the exact solution differs.
Based on the top-30 semantic matches and taxonomy structure, DualMap appears to offer a fresh approach to a recognized challenge in distributed LLM serving. The dual-mapping mechanism distinguishes it from single-mapping strategies in sibling works, though the broader goal of reconciling cache reuse with load distribution is shared across the leaf. The limited search scope means that additional related work may exist beyond the examined candidates, particularly in adjacent scheduling or distributed systems literature not captured by the semantic search.
Taxonomy
Research Landscape Overview
Claimed Contributions
A scheduling approach that maps each request to two candidate instances using two independent hash functions computed over the request prompt, then selects between them at dispatch time. This design increases the likelihood that requests with shared prefixes are co-located while evenly dispersing distinct prefixes across the cluster.
A routing strategy that prioritizes prompt-aware scheduling to achieve cache affinity and minimize recomputation overhead, but shifts to load-aware scheduling only when the expected time-to-first-token (TTFT) exceeds the predefined service-level objective (SLO), improving load balance without sacrificing cache reuse.
A rebalancing mechanism that selectively migrates requests from overloaded instances to their backup instances (the alternative instance from the initial dual mapping), mitigating hotspots and rebalancing the system under skewed prefix popularity workloads.
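The dual-mapping idea in the first contribution can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the description does not name the two hash functions, so MD5 and SHA-1 digests stand in for them, and instances are assumed to be numbered 0..N-1.

```python
import hashlib

def dual_map(prompt_prefix: str, num_instances: int) -> tuple[int, int]:
    """Map a request to two candidate instances via two independent
    hash functions over its prompt prefix (hypothetical sketch)."""
    # Two independent hashes derived from different digests of the prefix.
    h1 = int.from_bytes(hashlib.md5(prompt_prefix.encode()).digest()[:8], "big")
    h2 = int.from_bytes(hashlib.sha1(prompt_prefix.encode()).digest()[:8], "big")
    primary = h1 % num_instances
    backup = h2 % num_instances
    # Ensure the two candidates are distinct instances.
    if backup == primary:
        backup = (backup + 1) % num_instances
    return primary, backup
```

Because both hashes depend only on the prompt prefix, repeated requests with a shared prefix deterministically map to the same candidate pair (preserving cache affinity), while unrelated prefixes land on near-uniformly distributed pairs across the cluster.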
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[17] Preble: Efficient Distributed Prompt Scheduling for LLM Serving
[26] Online Context Caching for Distributed Large Language Models Serving
[28] BanaServe: Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure
Contribution Analysis
Detailed comparisons for each claimed contribution
DualMap dual-mapping scheduling strategy
A scheduling approach that maps each request to two candidate instances using two independent hash functions computed over the request prompt, then selects between them at dispatch time. This design increases the likelihood that requests with shared prefixes are co-located while evenly dispersing distinct prefixes across the cluster.
[2] MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
[9] Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving
[12] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
[55] Large Language Model Partitioning for Low-Latency Inference at the Edge
[56] ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
[57] PLAIN: Leveraging High Internal Bandwidth in PIM for Accelerating Large Language Model Inference via Mixed-Precision Quantization
[58] Seesaw: High-Throughput LLM Inference via Model Re-sharding
[59] The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving
[60] SkyWalker: A Locality-Aware Cross-Region Load Balancer for LLM Inference
[61] Locality-Aware Fair Scheduling in LLM Serving
SLO-aware request routing technique
A routing strategy that prioritizes prompt-aware scheduling to achieve cache affinity and minimize recomputation overhead, but dynamically shifts to load-aware scheduling only when expected TTFT exceeds the predefined SLO, enhancing load balance without sacrificing cache reuse.
[7] A Scalable Approach to Distributed Large Language Model Inference
[9] Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving
[31] LLM Inference Scheduling: A Survey of Techniques, Frameworks, and Trade-offs
[46] SkyLB: A Locality-Aware Cross-Region Load Balancer for LLM Inference
[49] AIBrix: Towards Scalable, Cost-Effective Large Language Model Inference Infrastructure
[50] BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
[51] Serving Heterogeneous LoRA Adapters in Distributed LLM Inference Systems
[52] PATCHEDSERVE: A Patch Management Framework for SLO-Optimized Hybrid Resolution Diffusion Serving
[53] Designing Retrieval-Augmented Generation (RAG) Pipelines in Microservice Architectures
[54] AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications
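The selection rule behind this contribution can be illustrated with a small sketch. The per-instance TTFT predictor and the millisecond threshold are hypothetical stand-ins for whatever estimator the system uses; the point is only the control flow, affinity first, with a load-aware fallback on a predicted SLO violation.

```python
def choose_instance(primary: int, backup: int,
                    predicted_ttft_ms: dict[int, float],
                    slo_ms: float) -> int:
    """Pick between the two dual-mapped candidates (illustrative sketch).

    `primary` is the cache-affinity choice; `backup` is the alternative
    from the dual mapping. `predicted_ttft_ms` is a hypothetical
    per-instance TTFT estimate.
    """
    # Prompt-aware by default: keep the request on the instance that
    # most likely holds its prefix cache.
    if predicted_ttft_ms[primary] <= slo_ms:
        return primary
    # Predicted SLO violation: shift to load-aware scheduling, but only
    # if the backup actually looks faster than the primary.
    if predicted_ttft_ms[backup] < predicted_ttft_ms[primary]:
        return backup
    return primary
```

Note that the fallback is bounded by the dual mapping: the request can only move to its precomputed backup, never to an arbitrary instance, which keeps the set of instances holding any given prefix small.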
Hotspot-aware rebalancing strategy
A rebalancing mechanism that selectively migrates requests from overloaded instances to their backup instances (the alternative instance from the initial dual mapping), mitigating hotspots and rebalancing the system under skewed prefix popularity workloads.
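A minimal sketch of this rebalancing step, under two hypothetical simplifications: each queued request remembers the backup instance from its initial dual mapping, and per-instance load is measured as queue length.

```python
def rebalance(queues: dict[int, list[tuple[str, int]]],
              load_threshold: int) -> None:
    """Migrate excess requests from hotspots to their dual-mapped backups.

    `queues` maps instance id -> list of (request_id, backup_instance).
    Mutates `queues` in place; migration targets are restricted to each
    request's own backup, never an arbitrary instance.
    """
    for inst in list(queues):
        overflow = len(queues[inst]) - load_threshold
        if overflow <= 0:
            continue  # not a hotspot
        kept = []
        for req_id, backup in queues[inst]:
            # Migrate only while still over threshold and while the backup
            # has headroom, so migration cannot create a new hotspot.
            if overflow > 0 and backup != inst and len(queues[backup]) < load_threshold:
                # After migration, the original instance serves as the backup.
                queues[backup].append((req_id, inst))
                overflow -= 1
            else:
                kept.append((req_id, backup))
        queues[inst] = kept
```

Restricting migration to the dual-mapped backup is what distinguishes this from generic work stealing: under skewed prefix popularity, a hot prefix spreads to at most its two mapped instances rather than polluting caches cluster-wide.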