MoNE: Replacing Redundant Experts with Lightweight Novices for Structured Pruning of MoE

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Model Compression, Mixture-of-Experts, Structured Pruning, Expert Pruning
Abstract:

Mixture-of-Experts (MoE) enables efficient scaling of large language models by activating only a subset of experts per input token. However, deploying MoE-based models incurs significant memory overhead because all experts must be kept in memory. While structured pruning is a promising way to reduce memory costs, existing methods often show suboptimal performance and unstable degradation along three dimensions: model architectures, calibration data sources, and calibration sample sizes. This paper proposes Mixture-of-Novices-and-Experts (MoNE), a novel expert pruning method that replaces redundant experts with lightweight novices to achieve effective and robust model compression. MoNE evaluates expert redundancy with two metrics: access frequency and output variance. Experts exhibiting low usage and stable outputs are pruned and replaced with lightweight novices, unbiased estimates of their original outputs, minimizing performance degradation. Extensive experiments demonstrate that MoNE consistently outperforms baseline methods with minimal accuracy degradation across all three dimensions, confirming its effectiveness and robustness. Notably, it outperforms baselines by up to 2.72 points in average zero-shot accuracy across nine downstream tasks at a 25% pruning ratio, with only a 0.14-point performance drop for Qwen2-57B-A14B. The code is available at https://anonymous.4open.science/r/AnonymizedMoNE.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes MoNE, a framework that replaces redundant experts with lightweight 'novices' to compress MoE models. It resides in the 'Output Stability-Based Pruning' leaf under 'Usage-Guided Expert Selection', which contains only two papers including this one. This leaf is part of the broader 'Expert-Level Structured Pruning Methods' branch, which encompasses multiple pruning strategies across six leaves. The sparse population of this specific leaf suggests that output stability as a primary pruning criterion remains relatively underexplored compared to frequency-based or clustering-driven approaches.

The taxonomy reveals that MoNE's immediate neighbors include 'Activation Frequency and Routing Analysis' (three papers) and 'Redundancy-Based Expert Removal' (five papers across two sub-leaves). The broader 'Expert-Level Structured Pruning Methods' branch contains eleven papers total, while sibling branches like 'Intra-Expert Compression Techniques' and 'Hybrid Compression Frameworks' offer complementary compression strategies. MoNE's focus on output variance distinguishes it from frequency-only methods in the neighboring leaf, yet it shares conceptual overlap with redundancy detection approaches that measure expert similarity through output comparisons rather than weight-space metrics.

Among the twenty-two candidates examined, the contribution-level analysis shows mixed novelty signals. For the core MoNE framework (Contribution A), ten candidates were examined and one potentially refuting prior work was found, suggesting some overlap within the limited search scope. For the dual-metric redundancy evaluation using access frequency and output variance (Contribution B), two candidates were examined with no clear refutations, indicating this specific combination may be less explored. For the robustness claim across architectures, data sources, and sample sizes (Contribution C), ten candidates were examined without refutation, though this may reflect the limited scope rather than definitive novelty.

Based on the top-22 semantic matches examined, MoNE appears to occupy a moderately explored niche within usage-guided pruning. The sparse leaf population and limited refutations for two of three contributions suggest potential novelty, though the single refutation for the core framework warrants careful examination of overlapping prior work. The analysis does not cover exhaustive citation networks or domain-specific venues, leaving open questions about related work in specialized MoE compression literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 1

Research Landscape Overview

Core task: structured pruning of mixture-of-experts models. The field organizes around several complementary strategies for compressing MoE architectures while preserving their conditional computation benefits. Expert-Level Structured Pruning Methods focus on removing or merging entire experts based on usage patterns, output stability, or clustering criteria, as seen in works like Cluster Expert Pruning[1] and Not All Experts[6]. Intra-Expert Compression Techniques instead target redundancy within individual experts through weight pruning or low-rank decomposition, while Hybrid Compression Frameworks combine expert-level and intra-expert approaches for greater efficiency gains. Additional branches address Adaptive and Learnable Compression (where pruning decisions evolve during training), One-Shot Pruning Strategies (enabling rapid post-training compression), and MoE Construction and Conversion (transforming dense models into sparse MoE variants). Theoretical Foundations and Analysis provide formal guarantees, and Surveys and Systematic Reviews like MoE Inference Survey[4] synthesize emerging trends across these diverse methodologies.

A particularly active line of work explores usage-guided expert selection, where pruning decisions rely on tracking which experts contribute meaningfully across different inputs or domains. MoNE[0] exemplifies this direction by emphasizing output stability-based pruning, ensuring that removed experts minimally disrupt model predictions. This approach contrasts with simpler frequency-based methods and aligns closely with Mixture Compressor[5], which also prioritizes preserving critical expert contributions during compression. Meanwhile, domain-specific strategies like Domain Specific Pruning[3] tailor expert removal to particular task distributions, and clustering-based methods such as Cluster Expert Pruning[1] group redundant experts before merging.
MoNE[0] sits within the usage-guided cluster, distinguished by its focus on output stability rather than purely activation frequency, offering a middle ground between aggressive one-shot pruning and computationally intensive adaptive retraining schemes.

Claimed Contributions

MoNE: Mixture-of-Novices-and-Experts framework

MoNE is a novel expert pruning method that compresses MoE models by replacing redundant experts with lightweight novices (unbiased estimates of expert outputs) rather than simply removing them, thereby minimizing performance degradation while reducing memory overhead.
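The "novice" idea lends itself to a short sketch. This report does not specify the exact construction, so the code below assumes one plausible reading: a novice is simply the cached mean of the pruned expert's outputs over the calibration set, which is an unbiased estimate of the expert's output under the calibration distribution.

```python
from statistics import mean

class Novice:
    """Lightweight stand-in for a pruned expert (illustrative sketch).

    Assumption: the novice stores the mean of the expert's outputs over
    calibration data and returns it for every token. The paper's exact
    parameterization may differ.
    """

    def __init__(self, calib_outputs):
        # calib_outputs: list of recorded output vectors (lists of floats)
        dim = len(calib_outputs[0])
        self.estimate = [mean(v[d] for v in calib_outputs) for d in range(dim)]

    def __call__(self, token):
        # A novice ignores its input and emits the stored estimate,
        # costing one cached vector instead of full expert weights.
        return self.estimate

# An expert whose calibration outputs vary little is a good candidate:
calib = [[1.0, 2.0], [1.25, 2.25], [0.75, 1.75]]
novice = Novice(calib)
print(novice([0.25, 0.75]))  # -> [1.0, 2.0]
```

Because the novice is constant, it preserves the expert's average contribution to the residual stream while dropping its parameters entirely, which is consistent with the claim that low-variance experts can be replaced with little performance loss.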

10 retrieved papers (1 flagged as potentially refuting)
Expert redundancy evaluation using access frequency and output variance

The method introduces a dual-metric approach to evaluate expert redundancy by combining expert access frequency (how often experts are selected) and output variance (stability of expert outputs across calibration data), enabling more accurate identification of redundant experts compared to frequency-based methods alone.
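The dual-metric evaluation can be sketched in a few lines. The aggregation rule below (redundancy score = frequency x variance) is an illustrative assumption, not the paper's formula, and scalar outputs stand in for full expert output vectors:

```python
from statistics import pvariance

def redundancy_scores(expert_outputs, num_tokens_total):
    """Score each expert by combining access frequency and output variance.

    expert_outputs maps expert id -> list of scalar outputs recorded for
    the tokens routed to that expert over a calibration set. Rarely used
    experts with stable (low-variance) outputs score lowest and would be
    pruned first under this hypothetical aggregation.
    """
    scores = {}
    for eid, outs in expert_outputs.items():
        freq = len(outs) / num_tokens_total              # access frequency
        var = pvariance(outs) if len(outs) > 1 else 0.0  # output stability
        scores[eid] = freq * var
    return scores

# Toy calibration trace: 10 tokens routed among 3 experts
trace = {
    0: [0.9, 1.1, 1.0, 1.0, 0.9, 1.1],  # frequent, stable
    1: [0.5, 2.5],                       # rare, unstable
    2: [1.0, 1.0],                       # rare, stable -> most redundant
}
scores = redundancy_scores(trace, num_tokens_total=10)
print(min(scores, key=scores.get))  # -> 2
```

Note how the two metrics disagree on expert 1: it is rarely accessed but its outputs vary widely, so a frequency-only criterion would prune it while the combined score spares it, which is the motivation for using both signals.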

2 retrieved papers
Robust structured pruning across multiple dimensions

The method demonstrates robust and effective compression across three critical dimensions (model architectures, calibration data sources, and calibration sample sizes) where existing structured pruning methods show suboptimal performance and unstable degradation, achieving up to a 2.72-point improvement in average zero-shot accuracy.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
