TD-MoE: Tensor Decomposition for MoE Models
Overview
Overall Novelty Assessment
The paper proposes TD-MoE, a framework that compresses MoE models by jointly factorizing all expert weights within a layer using 3D Tucker decomposition. It resides in the Multi-Dimensional Tensor Factorization leaf, which contains only two papers, including this one. This leaf sits under Decomposition-Based Expert Weight Compression, a moderately populated branch whose three sub-categories total seven papers. The sparse population of this leaf suggests that higher-order tensor methods for MoE compression remain relatively underexplored compared with simpler SVD-based approaches, whose neighboring leaf holds three papers.
The taxonomy reveals that Multi-Dimensional Tensor Factorization is distinct from Singular Value Decomposition for Expert Compression, which applies matrix-level SVD per expert, and from Delta-Based Decomposition, which separates shared base weights from expert-specific deltas. Neighboring branches include Inter-Expert Structural Optimization, which combines pruning with low-rank methods, and Expert Consolidation and Merging, which reduces redundancy by combining experts rather than factorizing them. The scope note explicitly excludes SVD-only and parameter-efficient fine-tuning methods, positioning this work as a higher-order alternative to matrix decomposition that captures cross-expert structure.
Among the three contributions analyzed, the 3D rank allocation mechanism was checked against ten candidates, one of which potentially refutes its novelty; the cross-expert tensorization and multi-linear whitening strategies were checked against three and six candidates, respectively, with no clear refutations. The limited search scope (nineteen candidates in total across all contributions) means these statistics reflect top-K semantic matches rather than exhaustive coverage. Within this restricted sample, the tensorization and whitening contributions appear more novel, whereas the rank allocation mechanism overlaps with at least one examined prior approach.
Based on the limited literature search, the work appears to occupy a relatively sparse research direction within MoE compression, with only one sibling paper in its taxonomy leaf. The cross-expert tensorization and whitening strategies show no clear prior overlap among examined candidates, while the rank allocation mechanism has at least one potentially overlapping work. These findings are constrained by the top-19 candidate scope and do not constitute an exhaustive novelty assessment.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose stacking all expert weight matrices in an MoE layer into a three-dimensional tensor and applying Tucker decomposition jointly across experts. This approach captures cross-expert redundancies and shared structure that per-expert decomposition methods overlook, enabling more efficient compression.
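As a rough illustration of this joint factorization, the sketch below stacks per-expert weight matrices into a third-order tensor and computes a Tucker decomposition via truncated HOSVD (a standard, non-iterative method). All function names and shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def stack_experts(expert_weights):
    """Stack per-expert weight matrices [d_out, d_in] into a
    3D tensor of shape (n_experts, d_out, d_in)."""
    return np.stack(expert_weights, axis=0)

def unfold(T, mode):
    """Mode-n unfolding of a tensor into a matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def tucker_hosvd(T, ranks):
    """Truncated HOSVD: T ~ G x_0 U_0 x_1 U_1 x_2 U_2.
    Factor U_m holds the top singular vectors of the mode-m
    unfolding; the core G is T contracted with each U_m^T."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    G = T
    for mode, U in enumerate(factors):
        G = np.moveaxis(
            np.tensordot(U.T, np.moveaxis(G, mode, 0), axes=1), 0, mode)
    return G, factors

def reconstruct(G, factors):
    """Multiply the core back by each factor matrix."""
    T = G
    for mode, U in enumerate(factors):
        T = np.moveaxis(
            np.tensordot(U, np.moveaxis(T, mode, 0), axes=1), 0, mode)
    return T
```

Because all experts share the factor matrices of the feature modes, cross-expert redundancy is captured in a single joint basis rather than in per-expert SVDs.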
The method introduces a whitening transformation applied to the tensorized expert weights using activation statistics. This decorrelates feature dimensions across input and output modes, producing a well-conditioned tensor that improves decomposition quality and enables more effective low-rank approximations.
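The general idea can be sketched in plain NumPy: a Cholesky factor of the activation covariance is folded into the input mode of the stacked expert tensor before decomposition and inverted afterwards. This is a hedged sketch of activation-aware whitening in general; the paper's exact transformation (e.g. whether both input and output modes are whitened) may differ.

```python
import numpy as np

def whitening_factor(X, eps=1e-6):
    """From activation samples X [n_samples, d_in], compute
    S = chol(X^T X / n + eps*I)^T, so that cov = S^T S."""
    cov = X.T @ X / X.shape[0] + eps * np.eye(X.shape[1])
    return np.linalg.cholesky(cov).T  # upper-triangular

def whiten_expert_tensor(T, S):
    """Fold S into the input-feature mode of the stacked expert
    tensor T [n_experts, d_out, d_in]: T_w[e] = T[e] @ S^T."""
    return T @ S.T

def unwhiten(T_w, S):
    """Invert the whitening: T[e] = T_w[e] @ inv(S)^T,
    via a triangular solve instead of an explicit inverse."""
    return np.linalg.solve(S, T_w.swapaxes(-1, -2)).swapaxes(-1, -2)
```

Decomposing the whitened tensor weights the reconstruction error by the activation statistics, so directions that matter for typical inputs are preserved preferentially.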
The authors develop an adaptive scheme that automatically distributes Tucker decomposition ranks across the three tensor dimensions (experts, input features, output features) to satisfy a specified compression budget. This mechanism enables efficient exploration of the rank space while maintaining reconstruction fidelity.
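One simple way to realize such a budgeted allocation is a greedy heuristic: start from the full multilinear ranks and repeatedly shrink the mode whose trailing singular value (the energy that truncation would discard) is smallest, until the Tucker parameter count fits the budget. The sketch below is an illustrative heuristic under that assumption, not the paper's actual scheme.

```python
import numpy as np

def tucker_params(shape, ranks):
    """Parameter count of a Tucker factorization:
    core (r0*r1*r2) plus the three factor matrices (n_m * r_m)."""
    return int(np.prod(ranks)) + sum(n * r for n, r in zip(shape, ranks))

def allocate_ranks(T, budget_ratio):
    """Greedily pick 3D Tucker ranks so that the factorized
    parameter count is at most budget_ratio * T.size."""
    budget = budget_ratio * T.size
    # Singular values of each mode unfolding measure energy per rank.
    svals = [
        np.linalg.svd(np.moveaxis(T, m, 0).reshape(T.shape[m], -1),
                      compute_uv=False)
        for m in range(3)
    ]
    ranks = [len(s) for s in svals]  # full multilinear ranks
    while tucker_params(T.shape, ranks) > budget:
        candidates = [m for m in range(3) if ranks[m] > 1]
        if not candidates:
            break  # budget unreachable even at rank 1 per mode
        # Shrink the mode whose trailing singular value is smallest.
        mode = min(candidates, key=lambda m: svals[m][ranks[m] - 1])
        ranks[mode] -= 1
    return tuple(ranks)
```

A design note on this heuristic: because it compares singular values across modes, a mode with little cross-expert or cross-feature structure sheds ranks first, which is the behavior an adaptive allocator should exhibit.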
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[7] Parameter-efficient mixture-of-experts architecture for pre-trained language models
Contribution Analysis
Detailed comparisons for each claimed contribution
Cross-expert tensorization with joint 3D Tucker decomposition
The authors propose stacking all expert weight matrices in an MoE layer into a three-dimensional tensor and applying Tucker decomposition jointly across experts. This approach captures cross-expert redundancies and shared structure that per-expert decomposition methods overlook, enabling more efficient compression.
[31] RTE-GMoE: A Model-agnostic Approach for Relation Triplet Extraction via Graph-based Mixture-of-Expert Mutual Learning
[32] TuckA: Hierarchical Compact Tensor Experts for Efficient Fine-Tuning
[33] Language Navigation with Tucker Adaptation
Multi-linear whitening strategy for tensor decomposition
The method introduces a whitening transformation applied to the tensorized expert weights using activation statistics. This decorrelates feature dimensions across input and output modes, producing a well-conditioned tensor that improves decomposition quality and enables more effective low-rank approximations.
[1] Structured Mixture-of-Experts LLMs Compression via Singular Value Decomposition
[13] MoE-SVD: Structured Mixture-of-Experts LLMs Compression via Singular Value Decomposition
[27] Fast and guaranteed tensor decomposition via sketching
[28] Whitening Spherical Gaussian Mixtures in the Large-Dimensional Regime
[29] Learning Fine-grained Parameter Sharing via Sparse Tensor Decomposition
[30] LoRDQ: activation-aware Low-Rank Decomposition and Quantization for Large Language Model Compression
3D rank allocation mechanism for compression budget
The authors develop an adaptive scheme that automatically distributes Tucker decomposition ranks across the three tensor dimensions (experts, input features, output features) to satisfy a specified compression budget. This mechanism enables efficient exploration of the rank space while maintaining reconstruction fidelity.