DualMap: Enabling Both Cache Affinity and Load Balancing for Distributed LLM Serving
Overview
Overall Novelty Assessment
The paper proposes DualMap, a dual-mapping scheduling strategy that simultaneously pursues cache affinity and load balancing in distributed LLM serving. It resides in the 'Cache-Affinity and Load-Balancing Routing' leaf, which contains four papers in total (including this one), indicating a moderately populated research direction. The leaf sits within the broader 'Request Scheduling and Routing Strategies' branch. The taxonomy shows that cache-affinity routing is an active area, with sibling works such as Preble and BanaServe addressing similar trade-offs between prefix reuse and server utilization.
The taxonomy reveals several neighboring research directions. Adjacent leaves include 'Preemptive and Priority-Based Scheduling' (focusing on fine-grained preemption) and 'SLO-Aware and Adaptive Scheduling' (emphasizing latency guarantees under varying workloads). The broader 'Request Scheduling and Routing Strategies' branch excludes KV cache storage mechanisms, which are handled under 'KV Cache Management and Storage Architectures'. DualMap's dual-mapping approach connects to load-balancing concerns in 'Multi-GPU and Distributed Memory Management' but remains distinct by focusing on request routing rather than memory allocation or hardware heterogeneity.
Among the 30 candidates examined, none clearly refute any of DualMap's three contributions: the dual-mapping strategy itself, the SLO-aware routing technique, and the hotspot-aware rebalancing strategy. Each contribution was assessed against 10 candidates, with zero refutable overlaps identified. This suggests that within the limited search scope, the specific combination of dual hash-based mapping with SLO-aware selection and rebalancing appears novel. However, the analysis does not claim exhaustive coverage; sibling papers like Preble and BanaServe address overlapping concerns (cache affinity, load balancing) using different mechanisms, indicating that the problem space is well-explored even if the exact solution differs.
Based on the top-30 semantic matches and taxonomy structure, DualMap appears to offer a fresh approach to a recognized challenge in distributed LLM serving. The dual-mapping mechanism distinguishes it from single-mapping strategies in sibling works, though the broader goal of reconciling cache reuse with load distribution is shared across the leaf. The limited search scope means that additional related work may exist beyond the examined candidates, particularly in adjacent scheduling or distributed systems literature not captured by the semantic search.
Taxonomy
Research Landscape Overview
Claimed Contributions
A scheduling approach that maps each request to two candidate instances using two independent hash functions computed over the request prompt, then selects between them at dispatch time. This design increases the likelihood that requests with shared prefixes are co-located while evenly dispersing distinct prefixes across the cluster.
A routing strategy that prioritizes prompt-aware scheduling to achieve cache affinity and minimize recomputation overhead, but shifts to load-aware scheduling only when the expected time-to-first-token (TTFT) exceeds the predefined service-level objective (SLO), improving load balance without sacrificing cache reuse.
A rebalancing mechanism that selectively migrates requests from overloaded instances to their backup instances (the alternative instance from the initial dual mapping), mitigating hotspots and rebalancing the system under skewed prefix popularity workloads.
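The dual-mapping idea in the first contribution can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the description does not name the two hash functions, so MD5 and SHA-1 digests stand in for them, and instances are assumed to be numbered 0..N-1.

```python
import hashlib

def dual_map(prompt_prefix: str, num_instances: int) -> tuple[int, int]:
    """Map a request to two candidate instances via two independent
    hash functions over its prompt prefix (hypothetical sketch)."""
    # Two independent hashes derived from different digests of the prefix.
    h1 = int.from_bytes(hashlib.md5(prompt_prefix.encode()).digest()[:8], "big")
    h2 = int.from_bytes(hashlib.sha1(prompt_prefix.encode()).digest()[:8], "big")
    primary = h1 % num_instances
    backup = h2 % num_instances
    # Ensure the two candidates are distinct instances.
    if backup == primary:
        backup = (backup + 1) % num_instances
    return primary, backup
```

Because both hashes depend only on the prompt prefix, repeated requests with a shared prefix deterministically map to the same candidate pair (preserving cache affinity), while unrelated prefixes land on near-uniformly distributed pairs across the cluster.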
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[17] Preble: Efficient Distributed Prompt Scheduling for LLM Serving
[26] Online Context Caching for Distributed Large Language Models Serving
[28] BanaServe: Unified KV Cache and Dynamic Module Migration for Balancing Disaggregated LLM Serving in AI Infrastructure
Contribution Analysis
Detailed comparisons for each claimed contribution
DualMap dual-mapping scheduling strategy
A scheduling approach that maps each request to two candidate instances using two independent hash functions computed over the request prompt, then selects between them at dispatch time. This design increases the likelihood that requests with shared prefixes are co-located while evenly dispersing distinct prefixes across the cluster.
[2] MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool
[9] Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving
[12] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
[55] Large Language Model Partitioning for Low-Latency Inference at the Edge
[56] ServerlessLLM: Low-Latency Serverless Inference for Large Language Models
[57] PLAIN: Leveraging High Internal Bandwidth in PIM for Accelerating Large Language Model Inference via Mixed-Precision Quantization
[58] Seesaw: High-Throughput LLM Inference via Model Re-sharding
[59] The Effect of Scheduling and Preemption on the Efficiency of LLM Inference Serving
[60] SkyWalker: A Locality-Aware Cross-Region Load Balancer for LLM Inference
[61] Locality-Aware Fair Scheduling in LLM Serving
SLO-aware request routing technique
A routing strategy that prioritizes prompt-aware scheduling to achieve cache affinity and minimize recomputation overhead, but dynamically shifts to load-aware scheduling only when expected TTFT exceeds the predefined SLO, enhancing load balance without sacrificing cache reuse.
[7] A Scalable Approach to Distributed Large Language Model Inference
[9] Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving
[31] LLM Inference Scheduling: A Survey of Techniques, Frameworks, and Trade-offs
[46] SkyLB: A Locality-Aware Cross-Region Load Balancer for LLM Inference
[49] AIBrix: Towards Scalable, Cost-Effective Large Language Model Inference Infrastructure
[50] BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models
[51] Serving Heterogeneous LoRA Adapters in Distributed LLM Inference Systems
[52] PATCHEDSERVE: A Patch Management Framework for SLO-Optimized Hybrid Resolution Diffusion Serving
[53] Designing Retrieval-Augmented Generation (RAG) Pipelines in Microservice Architectures
[54] AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications
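The selection rule behind this contribution can be illustrated with a small sketch. The per-instance TTFT predictor and the millisecond threshold are hypothetical stand-ins for whatever estimator the system uses; the point is only the control flow, affinity first, with a load-aware fallback on a predicted SLO violation.

```python
def choose_instance(primary: int, backup: int,
                    predicted_ttft_ms: dict[int, float],
                    slo_ms: float) -> int:
    """Pick between the two dual-mapped candidates (illustrative sketch).

    `primary` is the cache-affinity choice; `backup` is the alternative
    from the dual mapping. `predicted_ttft_ms` is a hypothetical
    per-instance TTFT estimate.
    """
    # Prompt-aware by default: keep the request on the instance that
    # most likely holds its prefix cache.
    if predicted_ttft_ms[primary] <= slo_ms:
        return primary
    # Predicted SLO violation: shift to load-aware scheduling, but only
    # if the backup actually looks faster than the primary.
    if predicted_ttft_ms[backup] < predicted_ttft_ms[primary]:
        return backup
    return primary
```

Note that the fallback is bounded by the dual mapping: the request can only move to its precomputed backup, never to an arbitrary instance, which keeps the set of instances holding any given prefix small.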
Hotspot-aware rebalancing strategy
A rebalancing mechanism that selectively migrates requests from overloaded instances to their backup instances (the alternative instance from the initial dual mapping), mitigating hotspots and rebalancing the system under skewed prefix popularity workloads.
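A minimal sketch of this rebalancing step, under two hypothetical simplifications: each queued request remembers the backup instance from its initial dual mapping, and per-instance load is measured as queue length.

```python
def rebalance(queues: dict[int, list[tuple[str, int]]],
              load_threshold: int) -> None:
    """Migrate excess requests from hotspots to their dual-mapped backups.

    `queues` maps instance id -> list of (request_id, backup_instance).
    Mutates `queues` in place; migration targets are restricted to each
    request's own backup, never an arbitrary instance.
    """
    for inst in list(queues):
        overflow = len(queues[inst]) - load_threshold
        if overflow <= 0:
            continue  # not a hotspot
        kept = []
        for req_id, backup in queues[inst]:
            # Migrate only while still over threshold and while the backup
            # has headroom, so migration cannot create a new hotspot.
            if overflow > 0 and backup != inst and len(queues[backup]) < load_threshold:
                # After migration, the original instance serves as the backup.
                queues[backup].append((req_id, inst))
                overflow -= 1
            else:
                kept.append((req_id, backup))
        queues[inst] = kept
```

Restricting migration to the dual-mapped backup is what distinguishes this from generic work stealing: under skewed prefix popularity, a hot prefix spreads to at most its two mapped instances rather than polluting caches cluster-wide.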