SERE: Similarity-based Expert Re-routing for Efficient Batch Decoding in MoE Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: MoE Model, Inference Acceleration, Batch Decoding, Expert Re-routing
Abstract:

Mixture-of-Experts (MoE) architectures employ sparse activation to deliver faster training and inference with higher accuracy than dense LLMs. However, in production serving, MoE models require batch inference to optimize hardware efficiency, which may cause excessive expert activation and thus slow the memory-bound decoding stage. To address the fundamental tension between batch decoding and expert sparsity, we present SERE, a Similarity-based Expert Re-routing method for Efficient batch decoding in MoE models. SERE dynamically reduces the number of active experts in an input-aware manner by re-routing tokens from secondary experts to their most similar primary counterparts. It also leverages similarity patterns to identify and preserve critical experts, thereby preventing capability loss. Notably, SERE avoids static expert pruning or merging, instead enabling dynamic expert skipping based on batch-level expert redundancy. Additionally, we provide an efficient custom CUDA kernel for SERE, enabling plug-and-play use in vLLM with only a single-line code change. Extensive experiments on various complex reasoning benchmarks demonstrate that SERE achieves up to 2.0× speedup with minimal quality loss, providing a practical solution for cost-efficient and latency-sensitive large-scale MoE deployment.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SERE, a similarity-based expert re-routing method that dynamically reduces active experts during batch decoding by redirecting tokens from secondary experts to similar primary counterparts. Within the taxonomy, SERE resides in the 'Similarity-Based Expert Re-routing' leaf under 'Dynamic Expert Routing and Selection'. This leaf contains only two papers total: SERE itself and one sibling work on opportunistic expert activation. This positioning suggests a relatively sparse research direction within the broader field of efficient MoE batch decoding, which encompasses 50 papers across 24 leaf nodes.

The taxonomy reveals that SERE's parent branch, 'Dynamic Expert Routing and Selection', contains three leaves: similarity-based re-routing, batch-aware expert selection, and task-level routing. Neighboring branches address complementary challenges: 'Expert Offloading and Memory Management' focuses on CPU-GPU orchestration and caching strategies, while 'Parallelism and Distributed Inference' tackles multi-device coordination. SERE's approach diverges from memory-centric offloading solutions by optimizing routing logic rather than expert availability, and differs from batch-aware selection methods by explicitly leveraging expert similarity metrics rather than batch-level characteristics alone.

Among 26 candidates examined through limited semantic search, the contribution-level analysis shows varied novelty signals. For the core SERE re-routing method, 6 candidates were examined with 0 refutations, suggesting limited direct overlap within the search scope. For the custom CUDA kernel contribution, 10 candidates were examined with 0 refutations, indicating potential implementation novelty. For the expert similarity estimation framework, however, 10 candidates were examined and 3 refutable pairs were found, suggesting that techniques for measuring expert similarity have more substantial prior work within the examined literature. These statistics reflect the bounded search scope, not exhaustive field coverage.

Based on the limited search of 26 candidates, SERE appears to occupy a sparsely populated research direction with only one sibling paper in its taxonomy leaf. The core re-routing mechanism shows no clear refutation among examined candidates, while the similarity estimation component encounters more prior work. The analysis cannot assess whether additional relevant work exists beyond the top-K semantic matches examined, particularly in adjacent areas like expert merging or pruning that may employ similar similarity metrics.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 3

Research Landscape Overview

Core task: efficient batch decoding in mixture-of-experts models. The field has organized itself around several complementary challenges that arise when serving large MoE architectures at scale. Expert Offloading and Memory Management addresses the fundamental problem of fitting massive expert parameters into limited device memory, with works like MoE-Infinity[7] and Ktransformers[8] exploring CPU-GPU offloading strategies. Dynamic Expert Routing and Selection focuses on how tokens are assigned to experts, including similarity-based re-routing and adaptive activation schemes. Parallelism and Distributed Inference tackles the coordination of experts across multiple devices, while Batching and Scheduling Strategies optimizes how requests are grouped and processed to maximize throughput. Model Architecture and Training considers design choices that improve inference efficiency from the ground up, and Speculative Decoding and Acceleration borrows techniques from autoregressive speedup to reduce latency. System Integration and Benchmarking provides holistic frameworks like Moesys[6] and MoE-Inference-Bench[30], and Domain-Specific Applications explores deployment in constrained environments such as edge devices or vehicular networks.

A particularly active tension exists between memory-centric offloading approaches and routing-centric optimization methods. While offloading systems prioritize keeping experts accessible despite hardware limits, routing strategies aim to reduce the number of active experts or redirect tokens to similar experts when capacity is exceeded. SERE[0] sits within the Dynamic Expert Routing and Selection branch, specifically under Similarity-Based Expert Re-routing, where it shares conceptual ground with Opportunistic Expert Activation[43]. Both works explore how to handle overflow or capacity constraints by leveraging expert similarity, but SERE[0] emphasizes re-routing tokens to semantically similar experts when primary choices are unavailable, whereas Opportunistic Expert Activation[43] focuses on activating additional experts opportunistically.

This contrasts with heavier offloading solutions like MoE-Infinity[7] or scheduling frameworks like Moesys[6], which manage expert availability through memory orchestration rather than routing logic. The interplay between these approaches highlights an open question: whether intelligent routing can reduce the need for complex memory management, or whether both layers of optimization are necessary for truly efficient batch decoding.

Claimed Contributions

SERE: Similarity-based Expert Re-routing Method

The authors propose SERE, a dynamic expert skipping method that reduces active experts during batch decoding by re-routing tokens from secondary experts to their most similar primary counterparts. The method leverages similarity patterns to identify and preserve critical experts, avoiding static pruning or merging while enabling input-aware expert reduction based on batch-level redundancy.
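The claimed mechanism can be sketched as follows. This is a minimal NumPy illustration under assumed conventions (a score-sorted top-k router output and a pre-computed expert similarity matrix), not the authors' implementation; the `tau` threshold is a hypothetical stand-in for the paper's critical-expert preservation, keeping a secondary expert whenever no sufficiently similar active expert exists.

```python
import numpy as np

def sere_reroute(topk_ids, sim, num_primary=1, tau=0.0):
    """Sketch of batch-level similarity-based expert re-routing.

    topk_ids:    (batch, k) expert ids from the router, sorted by score,
                 so column 0 holds each token's top-1 ("primary") expert.
    sim:         (E, E) pre-computed expert similarity matrix.
    num_primary: how many leading columns define the batch's active set.
    tau:         similarity floor; a secondary expert whose best match in
                 the active set falls below tau is kept ("critical").
    """
    # The batch activates only the union of all tokens' primary experts.
    active = np.unique(topk_ids[:, :num_primary])
    rerouted = topk_ids.copy()
    for b in range(topk_ids.shape[0]):
        for j in range(num_primary, topk_ids.shape[1]):
            e = topk_ids[b, j]
            if e in active:
                continue  # already served by an active expert
            sims = sim[e, active]
            if sims.max() >= tau:
                # Redirect to the most similar active expert.
                rerouted[b, j] = active[np.argmax(sims)]
    return rerouted
```

Every token still receives k expert outputs, but the number of distinct experts the batch must load shrinks toward the primary set, which is what makes the memory-bound decoding stage faster.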

6 retrieved papers
Efficient Custom CUDA Kernel for SERE

The authors develop a high-performance CUDA kernel implementation of SERE that is model-agnostic and can be seamlessly integrated into the vLLM inference framework. This implementation enables practical deployment with minimal code modification, requiring only a single line of code change.

10 retrieved papers
Expert Similarity Estimation Framework

The authors introduce a framework for computing layer-wise expert similarity matrices using calibration data and various similarity metrics (Frobenius, Cosine, CKA). This pre-computed similarity matrix guides the dynamic re-routing process without requiring retraining or task-specific tuning, and reveals patterns of redundancy and specialization across MoE layers.
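The three named metrics can be illustrated with a short NumPy sketch. The function name, the negative-distance convention for Frobenius, and the use of linear CKA on centered calibration activations are assumptions for illustration, not details from the paper.

```python
import numpy as np

def expert_similarity(weights, acts=None, metric="cosine"):
    """Sketch of a layer-wise expert similarity matrix.

    weights: list of E expert weight matrices (same shape).
    acts:    optional list of E activation matrices (n_samples, d)
             collected on calibration data; required for "cka".
    """
    E = len(weights)
    S = np.zeros((E, E))
    for i in range(E):
        for j in range(E):
            if metric == "frobenius":
                # Negative Frobenius distance: larger = more similar.
                S[i, j] = -np.linalg.norm(weights[i] - weights[j])
            elif metric == "cosine":
                a, b = weights[i].ravel(), weights[j].ravel()
                S[i, j] = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
            elif metric == "cka":
                # Linear CKA on centered calibration activations.
                X = acts[i] - acts[i].mean(0)
                Y = acts[j] - acts[j].mean(0)
                S[i, j] = (np.linalg.norm(Y.T @ X) ** 2
                           / (np.linalg.norm(X.T @ X)
                              * np.linalg.norm(Y.T @ Y)))
    return S
```

Because the matrix is computed once offline from calibration data, the decode-time re-routing step reduces to a lookup, consistent with the claim that no retraining or task-specific tuning is required.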

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SERE: Similarity-based Expert Re-routing Method

The authors propose SERE, a dynamic expert skipping method that reduces active experts during batch decoding by re-routing tokens from secondary experts to their most similar primary counterparts. The method leverages similarity patterns to identify and preserve critical experts, avoiding static pruning or merging while enabling input-aware expert reduction based on batch-level redundancy.

Contribution

Efficient Custom CUDA Kernel for SERE

The authors develop a high-performance CUDA kernel implementation of SERE that is model-agnostic and can be seamlessly integrated into the vLLM inference framework. This implementation enables practical deployment with minimal code modification, requiring only a single line of code change.

Contribution

Expert Similarity Estimation Framework

The authors introduce a framework for computing layer-wise expert similarity matrices using calibration data and various similarity metrics (Frobenius, Cosine, CKA). This pre-computed similarity matrix guides the dynamic re-routing process without requiring retraining or task-specific tuning, and reveals patterns of redundancy and specialization across MoE layers.