MoNE: Replacing Redundant Experts with Lightweight Novices for Structured Pruning of MoE

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Model Compression, Mixture-of-Experts, Structured Pruning, Expert Pruning
Abstract:

Mixture-of-Experts (MoE) enables efficient scaling of large language models by activating only a subset of experts per input token. However, deploying MoE-based models incurs significant memory overhead because all experts must be kept in memory. While structured pruning is a promising way to reduce memory costs, existing methods often show suboptimal performance and unstable degradation along three dimensions: model architectures, calibration data sources, and calibration sample sizes. This paper proposes Mixture-of-Novices-and-Experts (MoNE), a novel expert pruning method that replaces redundant experts with lightweight novices to achieve effective and robust model compression. MoNE evaluates expert redundancy with two metrics: access frequency and output variance. Experts exhibiting low usage and stable outputs are pruned and replaced with lightweight novices, unbiased estimates of their original outputs, minimizing performance degradation. Extensive experiments demonstrate that MoNE consistently outperforms baseline methods with minimal accuracy degradation across all three dimensions, confirming its effectiveness and robustness. Notably, it outperforms baselines by up to 2.72 points in average zero-shot accuracy across nine downstream tasks at a 25% pruning ratio, with only a 0.14-point performance drop for Qwen2-57B-A14B. The code is available at https://anonymous.4open.science/r/AnonymizedMoNE.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes MoNE, a framework that replaces redundant experts with lightweight 'novices' to compress MoE models. It resides in the 'Output Stability-Based Pruning' leaf under 'Usage-Guided Expert Selection', which contains only two papers including this one. This leaf is part of the broader 'Expert-Level Structured Pruning Methods' branch, which encompasses multiple pruning strategies across six leaves. The sparse population of this specific leaf suggests that output stability as a primary pruning criterion remains relatively underexplored compared to frequency-based or clustering-driven approaches.

The taxonomy reveals that MoNE's immediate neighbors include 'Activation Frequency and Routing Analysis' (three papers) and 'Redundancy-Based Expert Removal' (five papers across two sub-leaves). The broader 'Expert-Level Structured Pruning Methods' branch contains eleven papers total, while sibling branches like 'Intra-Expert Compression Techniques' and 'Hybrid Compression Frameworks' offer complementary compression strategies. MoNE's focus on output variance distinguishes it from frequency-only methods in the neighboring leaf, yet it shares conceptual overlap with redundancy detection approaches that measure expert similarity through output comparisons rather than weight-space metrics.

Among the twenty-two candidates examined, the contribution-level analysis shows mixed novelty signals. For the core MoNE framework (Contribution A), ten candidates were examined and one potentially refuting prior work was found, suggesting some overlap within the limited search scope. For the dual-metric redundancy evaluation using access frequency and output variance (Contribution B), two candidates were examined with no clear refutations, indicating this specific combination may be less explored. For the robustness claim across architectures, data sources, and sample sizes (Contribution C), ten candidates were examined without refutation, though this may reflect the limited scope rather than definitive novelty.

Based on the top-22 semantic matches examined, MoNE appears to occupy a moderately explored niche within usage-guided pruning. The sparse leaf population and limited refutations for two of three contributions suggest potential novelty, though the single refutation for the core framework warrants careful examination of overlapping prior work. The analysis does not cover exhaustive citation networks or domain-specific venues, leaving open questions about related work in specialized MoE compression literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 1

Research Landscape Overview

Core task: structured pruning of mixture-of-experts models. The field organizes around several complementary strategies for compressing MoE architectures while preserving their conditional computation benefits. Expert-Level Structured Pruning Methods focus on removing or merging entire experts based on usage patterns, output stability, or clustering criteria, as seen in works like Cluster Expert Pruning[1] and Not All Experts[6]. Intra-Expert Compression Techniques instead target redundancy within individual experts through weight pruning or low-rank decomposition, while Hybrid Compression Frameworks combine expert-level and intra-expert approaches for greater efficiency gains. Additional branches address Adaptive and Learnable Compression (where pruning decisions evolve during training), One-Shot Pruning Strategies (enabling rapid post-training compression), and MoE Construction and Conversion (transforming dense models into sparse MoE variants). Theoretical Foundations and Analysis provide formal guarantees, and Surveys and Systematic Reviews like MoE Inference Survey[4] synthesize emerging trends across these diverse methodologies.

A particularly active line of work explores usage-guided expert selection, where pruning decisions rely on tracking which experts contribute meaningfully across different inputs or domains. MoNE[0] exemplifies this direction by emphasizing output stability-based pruning, ensuring that removed experts minimally disrupt model predictions. This approach contrasts with simpler frequency-based methods and aligns closely with Mixture Compressor[5], which also prioritizes preserving critical expert contributions during compression. Meanwhile, domain-specific strategies like Domain Specific Pruning[3] tailor expert removal to particular task distributions, and clustering-based methods such as Cluster Expert Pruning[1] group redundant experts before merging.
MoNE[0] sits within the usage-guided cluster, distinguished by its focus on output stability rather than purely activation frequency, offering a middle ground between aggressive one-shot pruning and computationally intensive adaptive retraining schemes.

Claimed Contributions

MoNE: Mixture-of-Novices-and-Experts framework

MoNE is a novel expert pruning method that compresses MoE models by replacing redundant experts with lightweight novices (unbiased estimates of expert outputs) rather than simply removing them, thereby minimizing performance degradation while reducing memory overhead.
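The "novice" idea lends itself to a short sketch. This report does not specify the exact construction, so the code below assumes one plausible reading: a novice is simply the cached mean of the pruned expert's outputs over the calibration set, which is an unbiased estimate of the expert's output under the calibration distribution.

```python
from statistics import mean

class Novice:
    """Lightweight stand-in for a pruned expert (illustrative sketch).

    Assumption: the novice stores the mean of the expert's outputs over
    calibration data and returns it for every token. The paper's exact
    parameterization may differ.
    """

    def __init__(self, calib_outputs):
        # calib_outputs: list of recorded output vectors (lists of floats)
        dim = len(calib_outputs[0])
        self.estimate = [mean(v[d] for v in calib_outputs) for d in range(dim)]

    def __call__(self, token):
        # A novice ignores its input and emits the stored estimate,
        # costing one cached vector instead of full expert weights.
        return self.estimate

# An expert whose calibration outputs vary little is a good candidate:
calib = [[1.0, 2.0], [1.25, 2.25], [0.75, 1.75]]
novice = Novice(calib)
print(novice([0.25, 0.75]))  # -> [1.0, 2.0]
```

Because the novice is constant, it preserves the expert's average contribution to the residual stream while dropping its parameters entirely, which is consistent with the claim that low-variance experts can be replaced with little performance loss.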

10 retrieved papers (1 flagged as potentially refuting)
Expert redundancy evaluation using access frequency and output variance

The method introduces a dual-metric approach to evaluate expert redundancy by combining expert access frequency (how often experts are selected) and output variance (stability of expert outputs across calibration data), enabling more accurate identification of redundant experts compared to frequency-based methods alone.
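The dual-metric evaluation can be sketched in a few lines. The aggregation rule below (redundancy score = frequency x variance) is an illustrative assumption, not the paper's formula, and scalar outputs stand in for full expert output vectors:

```python
from statistics import pvariance

def redundancy_scores(expert_outputs, num_tokens_total):
    """Score each expert by combining access frequency and output variance.

    expert_outputs maps expert id -> list of scalar outputs recorded for
    the tokens routed to that expert over a calibration set. Rarely used
    experts with stable (low-variance) outputs score lowest and would be
    pruned first under this hypothetical aggregation.
    """
    scores = {}
    for eid, outs in expert_outputs.items():
        freq = len(outs) / num_tokens_total              # access frequency
        var = pvariance(outs) if len(outs) > 1 else 0.0  # output stability
        scores[eid] = freq * var
    return scores

# Toy calibration trace: 10 tokens routed among 3 experts
trace = {
    0: [0.9, 1.1, 1.0, 1.0, 0.9, 1.1],  # frequent, stable
    1: [0.5, 2.5],                       # rare, unstable
    2: [1.0, 1.0],                       # rare, stable -> most redundant
}
scores = redundancy_scores(trace, num_tokens_total=10)
print(min(scores, key=scores.get))  # -> 2
```

Note how the two metrics disagree on expert 1: it is rarely accessed but its outputs vary widely, so a frequency-only criterion would prune it while the combined score spares it, which is the motivation for using both signals.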

2 retrieved papers
Robust structured pruning across multiple dimensions

The method demonstrates robust and effective compression across three critical dimensions (model architectures, calibration data sources, and calibration sample sizes) where existing structured pruning methods show suboptimal performance and unstable degradation, achieving up to a 2.72-point improvement in average zero-shot accuracy.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
