MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: large language models, model compression, structured pruning
Abstract:

The Mixture-of-Experts (MoE) architecture has become a predominant paradigm for scaling large language models (LLMs). Despite offering strong performance and computational efficiency, large MoE-based LLMs such as DeepSeek-V3-0324 and Kimi-K2-Instruct pose serious deployment challenges due to their substantial memory requirements. While recent works have explored MoE compression to address this issue, existing methods often suffer from considerable accuracy drops (e.g., 7%-14% relative) even at modest compression rates. This paper introduces a novel Mixture-of-Basis-Experts (MoBE) method that achieves model compression with minimal accuracy loss. Specifically, each up/gate matrix in an expert is decomposed via a rank decomposition as W = AB, where matrix A is unique to each expert. The relatively larger matrix B is further reparameterized as a linear combination of basis matrices {Bi} shared across all experts within a given MoE layer. The factorization is learned by minimizing the reconstruction error relative to the original weight matrices. Experiments demonstrate that MoBE achieves notably smaller accuracy drops than prior works. For instance, MoBE can reduce the parameter counts of Qwen3-235B-A22B-2507, DeepSeek-V3-0324 (671B), and Kimi-K2-Instruct (1T) by 24%-30% with only a 1%-2% absolute accuracy drop (about 2% relative).
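As a rough illustration of where the savings come from, the sketch below counts the up/gate parameters of a standard MoE layer against a MoBE layer in which each expert keeps only a small factor A plus mixing coefficients, while the basis matrices are stored once per layer. All dimensions (`d_model`, `d_ff`, `rank`, `num_bases`) are hypothetical toy values, not the paper's model configurations, and the count covers only the up/gate expert weights, not the whole model.

```python
def moe_expert_params(num_experts, d_model, d_ff):
    # Standard MoE: each expert stores full up and gate matrices,
    # each of shape (d_ff, d_model).
    return num_experts * 2 * d_ff * d_model

def mobe_expert_params(num_experts, d_model, d_ff, rank, num_bases):
    # MoBE: per expert, an A factor (d_ff x rank) and num_bases mixing
    # coefficients for each of the up and gate matrices; the basis
    # matrices B_i (rank x d_model) are shared across all experts.
    per_expert = 2 * (d_ff * rank + num_bases)
    shared = num_bases * rank * d_model
    return num_experts * per_expert + shared

# Hypothetical dimensions for illustration only.
dense = moe_expert_params(128, 4096, 1536)
compressed = mobe_expert_params(128, 4096, 1536, rank=512, num_bases=32)
print(f"up/gate weight reduction: {1 - compressed / dense:.1%}")
```

Note that the reduction of the expert weights alone can be much larger than the 24%-30% whole-model figure quoted above, since attention and embedding parameters are untouched.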

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 5

Research Landscape Overview

Core task: Compressing Mixture-of-Experts based large language models. The field has organized itself around several complementary strategies for reducing the computational and memory footprint of MoE architectures. Expert-Level Structural Compression focuses on pruning, merging, or reorganizing experts themselves, ranging from clustering similar experts (Cluster Expert Pruning[17]) to decomposing expert weights via low-rank factorization (FactorLLM[4], Structured MoE SVD[8]). Quantization and Mixed-Precision Techniques apply bit-width reduction and adaptive precision schemes (QMoE[29], MoEQuant[44]) to shrink model size while preserving accuracy. Inference Optimization and Deployment addresses runtime challenges such as expert offloading (Fast MoE Offloading[12], SwapMoE[25]) and efficient scheduling (Faster MoE Inference[15]). Training Efficiency and Scaling investigates how to grow or initialize MoE models cost-effectively (MoE Scaling Laws[5], Orthogonal Growth MoE[46]), while Domain-Specific and Multimodal Extensions adapt MoE compression to vision-language tasks (MoE-LLaVA[33], DeepSeek-VL2[1]). Parameter-Efficient Fine-Tuning with MoE explores lightweight adaptation methods (X-LoRA[23]), and Benchmarking and Analysis provides empirical insights into trade-offs across these approaches.

Within Expert-Level Structural Compression, a particularly active line of work centers on weight decomposition and factorization, where researchers seek to represent expert parameters more compactly without discarding entire experts. MoBE[0] exemplifies this direction by applying block-wise low-rank decomposition to expert weights, aiming to balance compression ratio and task performance. Nearby efforts include Structured MoE SVD[8], which enforces structured singular value decomposition for hardware-friendly compression, and ResMoE[24], which introduces residual connections to preserve information during factorization. Delta Decompression[35] and CAMERA[39] further explore how to efficiently reconstruct or adapt compressed representations at inference time.

These factorization-based methods contrast with pruning-centric approaches (Efficient Expert Pruning[19], Mosaic Pruning[26]) that remove redundant experts entirely, and with merging strategies (Super Experts[21]) that consolidate multiple experts into fewer units. MoBE[0] sits squarely in the factorization cluster, sharing the goal of retaining all experts in compressed form rather than eliminating them, and its block-wise strategy offers a middle ground between fine-grained decomposition and coarse expert-level operations.

Claimed Contributions

Mixture-of-Basis-Experts (MoBE) architecture for MoE compression

The authors propose a novel architecture that factorizes expert weight matrices using rank decomposition combined with shared basis matrices. Each expert's up/gate matrix is decomposed as W = AB, where A is expert-specific and B is re-parameterized as a linear combination of basis matrices shared across all experts within a layer.
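The decomposition described above can be sketched numerically: each expert's weight is reconstructed as W_e = A_e (Σ_i α_{e,i} B_i), with the bases {B_i} shared layer-wide. The NumPy sketch below uses toy dimensions chosen purely for illustration; variable names (`bases`, `alpha`) are our own, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, rank, num_bases, num_experts = 64, 32, 16, 4, 8  # toy sizes

# Shared basis matrices {B_i}, one set per MoE layer.
bases = rng.normal(size=(num_bases, rank, d_model))
# Expert-specific factors A_e and mixing coefficients alpha_e.
A = rng.normal(size=(num_experts, d_ff, rank))
alpha = rng.normal(size=(num_experts, num_bases))

def expert_weight(e):
    # B_e = sum_i alpha[e, i] * bases[i]; then W_e = A_e @ B_e.
    B_e = np.tensordot(alpha[e], bases, axes=1)  # (rank, d_model)
    return A[e] @ B_e                            # (d_ff, d_model)

W0 = expert_weight(0)
print(W0.shape)  # (32, 64)
```

Because only `A` and `alpha` are per-expert while `bases` is stored once, storage grows slowly with the number of experts.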

10 retrieved papers
Can Refute
Optimization method for converting pretrained MoE to MoBE

The authors develop an algorithm (Algorithm 1) that converts standard pretrained MoE models into the MoBE formulation by optimizing factorized components through gradient-based methods like Adam, minimizing reconstruction error between original and factorized weight matrices.
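The conversion objective, minimizing ||W − A(Σ_i α_i B_i)||² over the factorized components, can be sketched on a toy single-expert problem. The paper uses Adam (Algorithm 1); the sketch below substitutes plain gradient descent with hand-derived gradients and fixed random bases, so it is a simplified stand-in, not the authors' algorithm. All dimensions and learning-rate choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_ff, d_model, rank, num_bases = 16, 24, 8, 3  # toy sizes, not the paper's

W = rng.normal(size=(d_ff, d_model))           # original expert weight
bases = rng.normal(size=(num_bases, rank, d_model)) / np.sqrt(d_model)
A = rng.normal(size=(d_ff, rank)) * 0.1        # learned expert factor
alpha = rng.normal(size=(num_bases,)) * 0.1    # learned mixing coefficients

def loss():
    B = np.tensordot(alpha, bases, axes=1)
    return np.sum((A @ B - W) ** 2)

lr = 1e-3
losses = [loss()]
for _ in range(500):
    B = np.tensordot(alpha, bases, axes=1)      # (rank, d_model)
    R = A @ B - W                               # reconstruction residual
    grad_A = 2 * R @ B.T
    # dL/d alpha_b = 2 * sum_{f,m,r} R[f,m] * A[f,r] * bases[b,r,m]
    grad_alpha = 2 * np.einsum('fm,fr,brm->b', R, A, bases)
    A -= lr * grad_A
    alpha -= lr * grad_alpha
    losses.append(loss())

print(f"reconstruction loss: {losses[0]:.2f} -> {losses[-1]:.2f}")
```

In the full method the bases themselves are also optimized jointly across all experts of a layer; fixing them here just keeps the toy example short.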

8 retrieved papers
Demonstration of superior compression with minimal accuracy loss

Through comprehensive experiments on models including Qwen3-235B-A22B-2507, DeepSeek-V3-0324, and Kimi-K2-Instruct, the authors show that MoBE achieves significantly lower reconstruction error and better downstream task performance compared to existing methods like MoLAE and D2-MoE at similar or higher compression rates.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Mixture-of-Basis-Experts (MoBE) architecture for MoE compression

The authors propose a novel architecture that factorizes expert weight matrices using rank decomposition combined with shared basis matrices. Each expert's up/gate matrix is decomposed as W = AB, where A is expert-specific and B is re-parameterized as a linear combination of basis matrices shared across all experts within a layer.

Contribution

Optimization method for converting pretrained MoE to MoBE

The authors develop an algorithm (Algorithm 1) that converts standard pretrained MoE models into the MoBE formulation by optimizing factorized components through gradient-based methods like Adam, minimizing reconstruction error between original and factorized weight matrices.

Contribution

Demonstration of superior compression with minimal accuracy loss

Through comprehensive experiments on models including Qwen3-235B-A22B-2507, DeepSeek-V3-0324, and Kimi-K2-Instruct, the authors show that MoBE achieves significantly lower reconstruction error and better downstream task performance compared to existing methods like MoLAE and D2-MoE at similar or higher compression rates.
