TD-MoE: Tensor Decomposition for MoE Models
Overview
Overall Novelty Assessment
The paper proposes TD-MoE, a framework that compresses MoE models by jointly factorizing all expert weights within a layer using 3D Tucker decomposition. It resides in the Multi-Dimensional Tensor Factorization leaf, which contains only two papers, including this one. This leaf sits under Decomposition-Based Expert Weight Compression, a moderately populated branch whose three sub-categories total seven papers. The sparse population of this leaf suggests that higher-order tensor methods for MoE compression remain relatively underexplored compared with simpler SVD-based approaches, whose neighboring leaf holds three papers.
The taxonomy reveals that Multi-Dimensional Tensor Factorization is distinct from Singular Value Decomposition for Expert Compression, which applies matrix-level SVD per expert, and from Delta-Based Decomposition, which separates shared base weights from expert-specific deltas. Neighboring branches include Inter-Expert Structural Optimization, which combines pruning with low-rank methods, and Expert Consolidation and Merging, which reduces redundancy by combining experts rather than factorizing them. The scope note explicitly excludes SVD-only and parameter-efficient fine-tuning methods, positioning this work as a higher-order alternative to matrix decomposition that captures cross-expert structure.
Among the three contributions analyzed, the 3D rank allocation mechanism was checked against ten candidates, one of which potentially refutes its novelty; the cross-expert tensorization and multi-linear whitening strategies were checked against three and six candidates, respectively, with no clear refutations. The limited search scope (nineteen candidates in total across all contributions) means these statistics reflect top-K semantic matches rather than exhaustive coverage. Within this restricted sample, the tensorization and whitening contributions appear more novel, whereas the rank allocation mechanism overlaps with at least one examined prior approach.
Based on the limited literature search, the work appears to occupy a relatively sparse research direction within MoE compression, with only one sibling paper in its taxonomy leaf. The cross-expert tensorization and whitening strategies show no clear prior overlap among examined candidates, while the rank allocation mechanism has at least one potentially overlapping work. These findings are constrained by the top-19 candidate scope and do not constitute an exhaustive novelty assessment.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose stacking all expert weight matrices in an MoE layer into a three-dimensional tensor and applying Tucker decomposition jointly across experts. This approach captures cross-expert redundancies and shared structure that per-expert decomposition methods overlook, enabling more efficient compression.
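As a rough illustration of this joint factorization, the sketch below stacks per-expert weight matrices into a third-order tensor and computes a Tucker decomposition via truncated HOSVD (a standard, non-iterative method). All function names and shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def stack_experts(expert_weights):
    """Stack per-expert weight matrices [d_out, d_in] into a
    3D tensor of shape (n_experts, d_out, d_in)."""
    return np.stack(expert_weights, axis=0)

def unfold(T, mode):
    """Mode-n unfolding of a tensor into a matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def tucker_hosvd(T, ranks):
    """Truncated HOSVD: T ~ G x_0 U_0 x_1 U_1 x_2 U_2.
    Factor U_m holds the top singular vectors of the mode-m
    unfolding; the core G is T contracted with each U_m^T."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    G = T
    for mode, U in enumerate(factors):
        G = np.moveaxis(
            np.tensordot(U.T, np.moveaxis(G, mode, 0), axes=1), 0, mode)
    return G, factors

def reconstruct(G, factors):
    """Multiply the core back by each factor matrix."""
    T = G
    for mode, U in enumerate(factors):
        T = np.moveaxis(
            np.tensordot(U, np.moveaxis(T, mode, 0), axes=1), 0, mode)
    return T
```

Because all experts share the factor matrices of the feature modes, cross-expert redundancy is captured in a single joint basis rather than in per-expert SVDs.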
The method introduces a whitening transformation applied to the tensorized expert weights using activation statistics. This decorrelates feature dimensions across input and output modes, producing a well-conditioned tensor that improves decomposition quality and enables more effective low-rank approximations.
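The general idea can be sketched in plain NumPy: a Cholesky factor of the activation covariance is folded into the input mode of the stacked expert tensor before decomposition and inverted afterwards. This is a hedged sketch of activation-aware whitening in general; the paper's exact transformation (e.g. whether both input and output modes are whitened) may differ.

```python
import numpy as np

def whitening_factor(X, eps=1e-6):
    """From activation samples X [n_samples, d_in], compute
    S = chol(X^T X / n + eps*I)^T, so that cov = S^T S."""
    cov = X.T @ X / X.shape[0] + eps * np.eye(X.shape[1])
    return np.linalg.cholesky(cov).T  # upper-triangular

def whiten_expert_tensor(T, S):
    """Fold S into the input-feature mode of the stacked expert
    tensor T [n_experts, d_out, d_in]: T_w[e] = T[e] @ S^T."""
    return T @ S.T

def unwhiten(T_w, S):
    """Invert the whitening: T[e] = T_w[e] @ inv(S)^T,
    via a triangular solve instead of an explicit inverse."""
    return np.linalg.solve(S, T_w.swapaxes(-1, -2)).swapaxes(-1, -2)
```

Decomposing the whitened tensor weights the reconstruction error by the activation statistics, so directions that matter for typical inputs are preserved preferentially.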
The authors develop an adaptive scheme that automatically distributes Tucker decomposition ranks across the three tensor dimensions (experts, input features, output features) to satisfy a specified compression budget. This mechanism enables efficient exploration of the rank space while maintaining reconstruction fidelity.
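One simple way to realize such a budgeted allocation is a greedy heuristic: start from the full multilinear ranks and repeatedly shrink the mode whose trailing singular value (the energy that truncation would discard) is smallest, until the Tucker parameter count fits the budget. The sketch below is an illustrative heuristic under that assumption, not the paper's actual scheme.

```python
import numpy as np

def tucker_params(shape, ranks):
    """Parameter count of a Tucker factorization:
    core (r0*r1*r2) plus the three factor matrices (n_m * r_m)."""
    return int(np.prod(ranks)) + sum(n * r for n, r in zip(shape, ranks))

def allocate_ranks(T, budget_ratio):
    """Greedily pick 3D Tucker ranks so that the factorized
    parameter count is at most budget_ratio * T.size."""
    budget = budget_ratio * T.size
    # Singular values of each mode unfolding measure energy per rank.
    svals = [
        np.linalg.svd(np.moveaxis(T, m, 0).reshape(T.shape[m], -1),
                      compute_uv=False)
        for m in range(3)
    ]
    ranks = [len(s) for s in svals]  # full multilinear ranks
    while tucker_params(T.shape, ranks) > budget:
        candidates = [m for m in range(3) if ranks[m] > 1]
        if not candidates:
            break  # budget unreachable even at rank 1 per mode
        # Shrink the mode whose trailing singular value is smallest.
        mode = min(candidates, key=lambda m: svals[m][ranks[m] - 1])
        ranks[mode] -= 1
    return tuple(ranks)
```

A design note on this heuristic: because it compares singular values across modes, a mode with little cross-expert or cross-feature structure sheds ranks first, which is the behavior an adaptive allocator should exhibit.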
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[7] Parameter-efficient mixture-of-experts architecture for pre-trained language models
Contribution Analysis
Detailed comparisons for each claimed contribution
Cross-expert tensorization with joint 3D Tucker decomposition
The authors propose stacking all expert weight matrices in an MoE layer into a three-dimensional tensor and applying Tucker decomposition jointly across experts. This approach captures cross-expert redundancies and shared structure that per-expert decomposition methods overlook, enabling more efficient compression.
[31] RTE-GMoE: A Model-agnostic Approach for Relation Triplet Extraction via Graph-based Mixture-of-Expert Mutual Learning
[32] TuckA: Hierarchical Compact Tensor Experts for Efficient Fine-Tuning
[33] Language Navigation with Tucker Adaptation
Multi-linear whitening strategy for tensor decomposition
The method introduces a whitening transformation applied to the tensorized expert weights using activation statistics. This decorrelates feature dimensions across input and output modes, producing a well-conditioned tensor that improves decomposition quality and enables more effective low-rank approximations.
[1] Structured Mixture-of-Experts LLMs Compression via Singular Value Decomposition
[13] MoE-SVD: Structured Mixture-of-Experts LLMs Compression via Singular Value Decomposition
[27] Fast and guaranteed tensor decomposition via sketching
[28] Whitening Spherical Gaussian Mixtures in the Large-Dimensional Regime
[29] Learning Fine-grained Parameter Sharing via Sparse Tensor Decomposition
[30] LoRDQ: activation-aware Low-Rank Decomposition and Quantization for Large Language Model Compression
3D rank allocation mechanism for compression budget
The authors develop an adaptive scheme that automatically distributes Tucker decomposition ranks across the three tensor dimensions (experts, input features, output features) to satisfy a specified compression budget. This mechanism enables efficient exploration of the rank space while maintaining reconstruction fidelity.