MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs
Overview
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose a novel architecture that factorizes expert weight matrices via low-rank decomposition combined with shared basis matrices. Each expert's up/gate matrix is decomposed as W = AB, where A is expert-specific and B is re-parameterized as a linear combination of basis matrices shared across all experts within a layer.
The authors develop an algorithm (Algorithm 1) that converts standard pretrained MoE models into the MoBE formulation by optimizing the factorized components with a gradient-based optimizer such as Adam to minimize the reconstruction error between the original and factorized weight matrices.
Through comprehensive experiments on models including Qwen3-235B-A22B-2507, DeepSeek-V3-0324, and Kimi-K2-Instruct, the authors show that MoBE achieves significantly lower reconstruction error and better downstream task performance than existing methods such as MoLAE and D2-MoE at similar or higher compression rates.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[8] Structured Mixture-of-Experts LLMs Compression via Singular Value Decomposition
[16] MoE-I: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition
[24] ResMoE: Space-Efficient Compression of Mixture of Experts LLMs via Residual Restoration
[35] Delta Decompression for MoE-based LLMs Compression
[39] CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis
Contribution Analysis
Detailed comparisons for each claimed contribution
Mixture-of-Basis-Experts (MoBE) architecture for MoE compression
The authors propose a novel architecture that factorizes expert weight matrices via low-rank decomposition combined with shared basis matrices. Each expert's up/gate matrix is decomposed as W = AB, where A is expert-specific and B is re-parameterized as a linear combination of basis matrices shared across all experts within a layer.
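The factorization can be sketched in a few lines of NumPy. All sizes and variable names below are illustrative assumptions for the sketch, not values from the paper; the point is that each expert stores only a thin factor A_i and M mixing coefficients, while the basis matrices are stored once per layer:

```python
import numpy as np

# Illustrative sizes (assumptions, not the paper's): hidden size d,
# intermediate size f, rank r, E experts, M shared basis matrices.
d, f, r, E, M = 64, 128, 16, 4, 3

rng = np.random.default_rng(0)

# Expert-specific low-rank factors A_i and mixing coefficients alpha_i.
A = rng.standard_normal((E, f, r)) * 0.1
alpha = rng.standard_normal((E, M)) * 0.1

# Basis matrices B_m, shared by every expert in the layer.
basis = rng.standard_normal((M, r, d)) * 0.1

def expert_weight(i):
    """Reconstruct expert i's up/gate matrix: W_i = A_i (sum_m alpha_im B_m)."""
    B_i = np.tensordot(alpha[i], basis, axes=(0, 0))  # (r, d)
    return A[i] @ B_i                                  # (f, d)

# The shared basis amortizes the B factor across all experts, so the
# factorized layer stores fewer parameters than E full f x d matrices.
dense_params = E * f * d
mobe_params = E * (f * r + M) + M * r * d
print(dense_params, mobe_params)  # 32768 11276
```

Because B_m is shared, growing the expert count E adds only f·r + M parameters per expert instead of f·d, which is where the compression comes from.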
[10] Sub-MoE: Efficient Mixture-of-Expert LLMs Compression via Subspace Expert Merging
[35] Delta Decompression for MoE-based LLMs Compression
[56] DeRS: Towards Extremely Efficient Upcycled Mixture-of-Experts Models
[58] MoLAE: Mixture of Latent Experts for Parameter-Efficient Language Models
[57] Multi-Task Dense Prediction via Mixture of Low-Rank Experts
[59] BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts
[60] TT-LoRA MoE: Using Parameter-Efficient Fine-Tuning and Sparse Mixture-of-Experts
[61] Adaptive Shared Experts with LoRA-Based Mixture of Experts for Multi-Task Learning
[62] PT-MoE: An Efficient Finetuning Framework for Integrating Mixture-of-Experts into Prompt Tuning
[63] Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression
Optimization method for converting pretrained MoE to MoBE
The authors develop an algorithm (Algorithm 1) that converts standard pretrained MoE models into the MoBE formulation by optimizing the factorized components with a gradient-based optimizer such as Adam to minimize the reconstruction error between the original and factorized weight matrices.
[4] FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models
[8] Structured Mixture-of-Experts LLMs Compression via Singular Value Decomposition
[39] CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis
[51] MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators
[52] Monet: Mixture of Monosemantic Experts for Transformers
[53] Enhancing RT-DETR Efficiency with Mixture of Experts Approach and Matrix Decomposition
[54] MoE-SVD: Structured Mixture-of-Experts LLMs Compression via Singular Value Decomposition
[55] Residual Mixture of Experts
Demonstration of superior compression with minimal accuracy loss
Through comprehensive experiments on models including Qwen3-235B-A22B-2507, DeepSeek-V3-0324, and Kimi-K2-Instruct, the authors show that MoBE achieves significantly lower reconstruction error and better downstream task performance than existing methods such as MoLAE and D2-MoE at similar or higher compression rates.