SERE: Similarity-based Expert Re-routing for Efficient Batch Decoding in MoE Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: MoE Model, Inference Acceleration, Batch Decoding, Expert Re-routing
Abstract:

Mixture-of-Experts (MoE) architectures employ sparse activation to deliver faster training and inference with higher accuracy than dense LLMs. However, in production serving, MoE models require batch inference to optimize hardware efficiency, which may cause excessive expert activation and thus slow the memory-bound decoding stage. To address the fundamental tension between batch decoding and expert sparsity, we present SERE, a Similarity-based Expert Re-routing method for Efficient batch decoding in MoE models. SERE dynamically reduces the number of active experts in an input-aware manner by re-routing tokens from secondary experts to their most similar primary counterparts. It also leverages similarity patterns to identify and preserve critical experts, thereby preventing capability loss. Notably, SERE avoids static expert pruning or merging, instead enabling dynamic expert skipping based on batch-level expert redundancy. Additionally, we provide an efficient custom CUDA kernel for SERE, enabling plug-and-play use in vLLM with only a single-line code change. Extensive experiments on various complex reasoning benchmarks demonstrate that SERE achieves up to 2.0× speedup with minimal quality loss, providing a practical solution for cost-efficient and latency-sensitive large-scale MoE deployment.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SERE, a similarity-based expert re-routing method that dynamically reduces active experts during batch decoding by redirecting tokens from secondary experts to similar primary counterparts. Within the taxonomy, SERE resides in the 'Similarity-Based Expert Re-routing' leaf under 'Dynamic Expert Routing and Selection'. This leaf contains only two papers total: SERE itself and one sibling work on opportunistic expert activation. This positioning suggests a relatively sparse research direction within the broader field of efficient MoE batch decoding, which encompasses 50 papers across 24 leaf nodes.

The taxonomy reveals that SERE's parent branch, 'Dynamic Expert Routing and Selection', contains three leaves: similarity-based re-routing, batch-aware expert selection, and task-level routing. Neighboring branches address complementary challenges: 'Expert Offloading and Memory Management' focuses on CPU-GPU orchestration and caching strategies, while 'Parallelism and Distributed Inference' tackles multi-device coordination. SERE's approach diverges from memory-centric offloading solutions by optimizing routing logic rather than expert availability, and differs from batch-aware selection methods by explicitly leveraging expert similarity metrics rather than batch-level characteristics alone.

Among 26 candidates examined through limited semantic search, the contribution-level analysis shows varied novelty signals. For the core SERE re-routing method, 6 candidates were examined with 0 refutations, suggesting limited direct overlap within the search scope. For the custom CUDA kernel contribution, 10 candidates were examined with 0 refutations, indicating potential implementation novelty. For the expert similarity estimation framework, however, 10 candidates were examined and 3 refutable pairs were found, suggesting that techniques for measuring expert similarity have more substantial prior work within the examined literature. These statistics reflect the bounded search scope, not exhaustive field coverage.

Based on the limited search of 26 candidates, SERE appears to occupy a sparsely populated research direction with only one sibling paper in its taxonomy leaf. The core re-routing mechanism shows no clear refutation among examined candidates, while the similarity estimation component encounters more prior work. The analysis cannot assess whether additional relevant work exists beyond the top-K semantic matches examined, particularly in adjacent areas like expert merging or pruning that may employ similar similarity metrics.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 3

Research Landscape Overview

Core task: efficient batch decoding in mixture-of-experts models. The field has organized itself around several complementary challenges that arise when serving large MoE architectures at scale. Expert Offloading and Memory Management addresses the fundamental problem of fitting massive expert parameters into limited device memory, with works like MoE-Infinity[7] and Ktransformers[8] exploring CPU-GPU offloading strategies. Dynamic Expert Routing and Selection focuses on how tokens are assigned to experts, including similarity-based re-routing and adaptive activation schemes. Parallelism and Distributed Inference tackles the coordination of experts across multiple devices, while Batching and Scheduling Strategies optimizes how requests are grouped and processed to maximize throughput. Model Architecture and Training considers design choices that improve inference efficiency from the ground up, and Speculative Decoding and Acceleration borrows techniques from autoregressive speedup to reduce latency. System Integration and Benchmarking provides holistic frameworks like Moesys[6] and MoE-Inference-Bench[30], and Domain-Specific Applications explores deployment in constrained environments such as edge devices or vehicular networks.

A particularly active tension exists between memory-centric offloading approaches and routing-centric optimization methods. While offloading systems prioritize keeping experts accessible despite hardware limits, routing strategies aim to reduce the number of active experts or redirect tokens to similar experts when capacity is exceeded. SERE[0] sits within the Dynamic Expert Routing and Selection branch, specifically under Similarity-Based Expert Re-routing, where it shares conceptual ground with Opportunistic Expert Activation[43]. Both works explore how to handle overflow or capacity constraints by leveraging expert similarity, but SERE[0] emphasizes re-routing tokens to semantically similar experts when primary choices are unavailable, whereas Opportunistic Expert Activation[43] focuses on activating additional experts opportunistically.

This contrasts with heavier offloading solutions like MoE-Infinity[7] or scheduling frameworks like Moesys[6], which manage expert availability through memory orchestration rather than routing logic. The interplay between these approaches highlights an open question: whether intelligent routing can reduce the need for complex memory management, or whether both layers of optimization are necessary for truly efficient batch decoding.

Claimed Contributions

SERE: Similarity-based Expert Re-routing Method

The authors propose SERE, a dynamic expert skipping method that reduces active experts during batch decoding by re-routing tokens from secondary experts to their most similar primary counterparts. The method leverages similarity patterns to identify and preserve critical experts, avoiding static pruning or merging while enabling input-aware expert reduction based on batch-level redundancy.
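The claimed mechanism can be sketched as follows. This is a minimal NumPy illustration under assumed conventions (a score-sorted top-k router output and a pre-computed expert similarity matrix), not the authors' implementation; the `tau` threshold is a hypothetical stand-in for the paper's critical-expert preservation, keeping a secondary expert whenever no sufficiently similar active expert exists.

```python
import numpy as np

def sere_reroute(topk_ids, sim, num_primary=1, tau=0.0):
    """Sketch of batch-level similarity-based expert re-routing.

    topk_ids:    (batch, k) expert ids from the router, sorted by score,
                 so column 0 holds each token's top-1 ("primary") expert.
    sim:         (E, E) pre-computed expert similarity matrix.
    num_primary: how many leading columns define the batch's active set.
    tau:         similarity floor; a secondary expert whose best match in
                 the active set falls below tau is kept ("critical").
    """
    # The batch activates only the union of all tokens' primary experts.
    active = np.unique(topk_ids[:, :num_primary])
    rerouted = topk_ids.copy()
    for b in range(topk_ids.shape[0]):
        for j in range(num_primary, topk_ids.shape[1]):
            e = topk_ids[b, j]
            if e in active:
                continue  # already served by an active expert
            sims = sim[e, active]
            if sims.max() >= tau:
                # Redirect to the most similar active expert.
                rerouted[b, j] = active[np.argmax(sims)]
    return rerouted
```

Every token still receives k expert outputs, but the number of distinct experts the batch must load shrinks toward the primary set, which is what makes the memory-bound decoding stage faster.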

6 retrieved papers
Efficient Custom CUDA Kernel for SERE

The authors develop a high-performance CUDA kernel implementation of SERE that is model-agnostic and can be seamlessly integrated into the vLLM inference framework. This implementation enables practical deployment with minimal code modification, requiring only a single line of code change.

10 retrieved papers
Expert Similarity Estimation Framework

The authors introduce a framework for computing layer-wise expert similarity matrices using calibration data and various similarity metrics (Frobenius, Cosine, CKA). This pre-computed similarity matrix guides the dynamic re-routing process without requiring retraining or task-specific tuning, and reveals patterns of redundancy and specialization across MoE layers.
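The three named metrics can be illustrated with a short NumPy sketch. The function name, the negative-distance convention for Frobenius, and the use of linear CKA on centered calibration activations are assumptions for illustration, not details from the paper.

```python
import numpy as np

def expert_similarity(weights, acts=None, metric="cosine"):
    """Sketch of a layer-wise expert similarity matrix.

    weights: list of E expert weight matrices (same shape).
    acts:    optional list of E activation matrices (n_samples, d)
             collected on calibration data; required for "cka".
    """
    E = len(weights)
    S = np.zeros((E, E))
    for i in range(E):
        for j in range(E):
            if metric == "frobenius":
                # Negative Frobenius distance: larger = more similar.
                S[i, j] = -np.linalg.norm(weights[i] - weights[j])
            elif metric == "cosine":
                a, b = weights[i].ravel(), weights[j].ravel()
                S[i, j] = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
            elif metric == "cka":
                # Linear CKA on centered calibration activations.
                X = acts[i] - acts[i].mean(0)
                Y = acts[j] - acts[j].mean(0)
                S[i, j] = (np.linalg.norm(Y.T @ X) ** 2
                           / (np.linalg.norm(X.T @ X)
                              * np.linalg.norm(Y.T @ Y)))
    return S
```

Because the matrix is computed once offline from calibration data, the decode-time re-routing step reduces to a lookup, consistent with the claim that no retraining or task-specific tuning is required.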

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SERE: Similarity-based Expert Re-routing Method

The authors propose SERE, a dynamic expert skipping method that reduces active experts during batch decoding by re-routing tokens from secondary experts to their most similar primary counterparts. The method leverages similarity patterns to identify and preserve critical experts, avoiding static pruning or merging while enabling input-aware expert reduction based on batch-level redundancy.

Contribution

Efficient Custom CUDA Kernel for SERE

The authors develop a high-performance CUDA kernel implementation of SERE that is model-agnostic and can be seamlessly integrated into the vLLM inference framework. This implementation enables practical deployment with minimal code modification, requiring only a single line of code change.

Contribution

Expert Similarity Estimation Framework

The authors introduce a framework for computing layer-wise expert similarity matrices using calibration data and various similarity metrics (Frobenius, Cosine, CKA). This pre-computed similarity matrix guides the dynamic re-routing process without requiring retraining or task-specific tuning, and reveals patterns of redundancy and specialization across MoE layers.