MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: large language models, model compression, structured pruning
Abstract:

The Mixture-of-Experts (MoE) architecture has become a predominant paradigm for scaling large language models (LLMs). Despite offering strong performance and computational efficiency, large MoE-based LLMs such as DeepSeek-V3-0324 and Kimi-K2-Instruct pose serious deployment challenges due to their substantial memory requirements. While recent works have explored MoE compression to address this issue, existing methods often suffer from considerable accuracy drops (e.g., 7%-14% relative) even at modest compression rates. This paper introduces a novel Mixture-of-Basis-Experts (MoBE) method that achieves model compression with minimal accuracy loss. Specifically, each up/gate matrix in an expert is decomposed via a rank decomposition as W = AB, where matrix A is unique to each expert. The relatively larger matrix B is further reparameterized as a linear combination of basis matrices {Bi} shared across all experts within a given MoE layer. The factorization is learned by minimizing the reconstruction error relative to the original weight matrices. Experiments demonstrate that MoBE achieves notably smaller accuracy drops than prior works. For instance, MoBE can reduce the parameter counts of Qwen3-235B-A22B-2507, DeepSeek-V3-0324 (671B), and Kimi-K2-Instruct (1T) by 24%-30% with only a 1%-2% absolute accuracy drop (about 2% relative).
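As a rough illustration of where the savings come from, the sketch below counts the up/gate parameters of a standard MoE layer against a MoBE layer in which each expert keeps only a small factor A plus mixing coefficients, while the basis matrices are stored once per layer. All dimensions (`d_model`, `d_ff`, `rank`, `num_bases`) are hypothetical toy values, not the paper's model configurations, and the count covers only the up/gate expert weights, not the whole model.

```python
def moe_expert_params(num_experts, d_model, d_ff):
    # Standard MoE: each expert stores full up and gate matrices,
    # each of shape (d_ff, d_model).
    return num_experts * 2 * d_ff * d_model

def mobe_expert_params(num_experts, d_model, d_ff, rank, num_bases):
    # MoBE: per expert, an A factor (d_ff x rank) and num_bases mixing
    # coefficients for each of the up and gate matrices; the basis
    # matrices B_i (rank x d_model) are shared across all experts.
    per_expert = 2 * (d_ff * rank + num_bases)
    shared = num_bases * rank * d_model
    return num_experts * per_expert + shared

# Hypothetical dimensions for illustration only.
dense = moe_expert_params(128, 4096, 1536)
compressed = mobe_expert_params(128, 4096, 1536, rank=512, num_bases=32)
print(f"up/gate weight reduction: {1 - compressed / dense:.1%}")
```

Note that the reduction of the expert weights alone can be much larger than the 24%-30% whole-model figure quoted above, since attention and embedding parameters are untouched.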

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 5

Research Landscape Overview

Core task: Compressing Mixture-of-Experts based large language models. The field has organized itself around several complementary strategies for reducing the computational and memory footprint of MoE architectures. Expert-Level Structural Compression focuses on pruning, merging, or reorganizing experts themselves, ranging from clustering similar experts (Cluster Expert Pruning[17]) to decomposing expert weights via low-rank factorization (FactorLLM[4], Structured MoE SVD[8]). Quantization and Mixed-Precision Techniques apply bit-width reduction and adaptive precision schemes (QMoE[29], MoEQuant[44]) to shrink model size while preserving accuracy. Inference Optimization and Deployment addresses runtime challenges such as expert offloading (Fast MoE Offloading[12], SwapMoE[25]) and efficient scheduling (Faster MoE Inference[15]). Training Efficiency and Scaling investigates how to grow or initialize MoE models cost-effectively (MoE Scaling Laws[5], Orthogonal Growth MoE[46]), while Domain-Specific and Multimodal Extensions adapt MoE compression to vision-language tasks (MoE-LLaVA[33], DeepSeek-VL2[1]). Parameter-Efficient Fine-Tuning with MoE explores lightweight adaptation methods (X-LoRA[23]), and Benchmarking and Analysis provides empirical insights into trade-offs across these approaches.

Within Expert-Level Structural Compression, a particularly active line of work centers on weight decomposition and factorization, where researchers seek to represent expert parameters more compactly without discarding entire experts. MoBE[0] exemplifies this direction by applying block-wise low-rank decomposition to expert weights, aiming to balance compression ratio and task performance. Nearby efforts include Structured MoE SVD[8], which enforces structured singular value decomposition for hardware-friendly compression, and ResMoE[24], which introduces residual connections to preserve information during factorization. Delta Decompression[35] and CAMERA[39] further explore how to efficiently reconstruct or adapt compressed representations at inference time.

These factorization-based methods contrast with pruning-centric approaches (Efficient Expert Pruning[19], Mosaic Pruning[26]) that remove redundant experts entirely, and with merging strategies (Super Experts[21]) that consolidate multiple experts into fewer units. MoBE[0] sits squarely in the factorization cluster, sharing the goal of retaining all experts in compressed form rather than eliminating them, and its block-wise strategy offers a middle ground between fine-grained decomposition and coarse expert-level operations.

Claimed Contributions

Mixture-of-Basis-Experts (MoBE) architecture for MoE compression

The authors propose a novel architecture that factorizes expert weight matrices using rank decomposition combined with shared basis matrices. Each expert's up/gate matrix is decomposed as W = AB, where A is expert-specific and B is re-parameterized as a linear combination of basis matrices shared across all experts within a layer.
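The decomposition described above can be sketched numerically: each expert's weight is reconstructed as W_e = A_e (Σ_i α_{e,i} B_i), with the bases {B_i} shared layer-wide. The NumPy sketch below uses toy dimensions chosen purely for illustration; variable names (`bases`, `alpha`) are our own, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, rank, num_bases, num_experts = 64, 32, 16, 4, 8  # toy sizes

# Shared basis matrices {B_i}, one set per MoE layer.
bases = rng.normal(size=(num_bases, rank, d_model))
# Expert-specific factors A_e and mixing coefficients alpha_e.
A = rng.normal(size=(num_experts, d_ff, rank))
alpha = rng.normal(size=(num_experts, num_bases))

def expert_weight(e):
    # B_e = sum_i alpha[e, i] * bases[i]; then W_e = A_e @ B_e.
    B_e = np.tensordot(alpha[e], bases, axes=1)  # (rank, d_model)
    return A[e] @ B_e                            # (d_ff, d_model)

W0 = expert_weight(0)
print(W0.shape)  # (32, 64)
```

Because only `A` and `alpha` are per-expert while `bases` is stored once, storage grows slowly with the number of experts.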

10 retrieved papers
Can Refute
Optimization method for converting pretrained MoE to MoBE

The authors develop an algorithm (Algorithm 1) that converts standard pretrained MoE models into the MoBE formulation by optimizing factorized components through gradient-based methods like Adam, minimizing reconstruction error between original and factorized weight matrices.
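The conversion objective, minimizing ||W − A(Σ_i α_i B_i)||² over the factorized components, can be sketched on a toy single-expert problem. The paper uses Adam (Algorithm 1); the sketch below substitutes plain gradient descent with hand-derived gradients and fixed random bases, so it is a simplified stand-in, not the authors' algorithm. All dimensions and learning-rate choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_ff, d_model, rank, num_bases = 16, 24, 8, 3  # toy sizes, not the paper's

W = rng.normal(size=(d_ff, d_model))           # original expert weight
bases = rng.normal(size=(num_bases, rank, d_model)) / np.sqrt(d_model)
A = rng.normal(size=(d_ff, rank)) * 0.1        # learned expert factor
alpha = rng.normal(size=(num_bases,)) * 0.1    # learned mixing coefficients

def loss():
    B = np.tensordot(alpha, bases, axes=1)
    return np.sum((A @ B - W) ** 2)

lr = 1e-3
losses = [loss()]
for _ in range(500):
    B = np.tensordot(alpha, bases, axes=1)      # (rank, d_model)
    R = A @ B - W                               # reconstruction residual
    grad_A = 2 * R @ B.T
    # dL/d alpha_b = 2 * sum_{f,m,r} R[f,m] * A[f,r] * bases[b,r,m]
    grad_alpha = 2 * np.einsum('fm,fr,brm->b', R, A, bases)
    A -= lr * grad_A
    alpha -= lr * grad_alpha
    losses.append(loss())

print(f"reconstruction loss: {losses[0]:.2f} -> {losses[-1]:.2f}")
```

In the full method the bases themselves are also optimized jointly across all experts of a layer; fixing them here just keeps the toy example short.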

8 retrieved papers
Demonstration of superior compression with minimal accuracy loss

Through comprehensive experiments on models including Qwen3-235B-A22B-2507, DeepSeek-V3-0324, and Kimi-K2-Instruct, the authors show that MoBE achieves significantly lower reconstruction error and better downstream task performance compared to existing methods like MoLAE and D2-MoE at similar or higher compression rates.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Mixture-of-Basis-Experts (MoBE) architecture for MoE compression

The authors propose a novel architecture that factorizes expert weight matrices using rank decomposition combined with shared basis matrices. Each expert's up/gate matrix is decomposed as W = AB, where A is expert-specific and B is re-parameterized as a linear combination of basis matrices shared across all experts within a layer.

Contribution

Optimization method for converting pretrained MoE to MoBE

The authors develop an algorithm (Algorithm 1) that converts standard pretrained MoE models into the MoBE formulation by optimizing factorized components through gradient-based methods like Adam, minimizing reconstruction error between original and factorized weight matrices.

Contribution

Demonstration of superior compression with minimal accuracy loss

Through comprehensive experiments on models including Qwen3-235B-A22B-2507, DeepSeek-V3-0324, and Kimi-K2-Instruct, the authors show that MoBE achieves significantly lower reconstruction error and better downstream task performance compared to existing methods like MoLAE and D2-MoE at similar or higher compression rates.
