LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Mixture of Experts, Mixture of LoRA Experts, Dynamic routing, Fully differentiable, LoRA, MoE
Abstract:

Recent studies have shown that combining parameter-efficient fine-tuning (PEFT) with mixture-of-experts (MoE) is an effective strategy for adapting large language models (LLMs) to downstream tasks. However, most existing approaches rely on conventional TopK routing, which requires careful hyperparameter tuning and assigns a fixed number of experts to each token. In this work, we propose LD-MoLE, a Learnable Dynamic routing mechanism for Mixture of LoRA Experts that enables adaptive, token-dependent, and layer-wise expert allocation. Our method replaces the non-differentiable TopK selection with a differentiable routing function that admits a closed-form solution, allowing the model to adaptively determine the number of experts to activate for each token at different layers. In addition, we introduce an analytical sparsity control objective to regularize the number of activated experts. Extensive experiments on the Qwen3-1.7B and Llama-3.2-3B models show that LD-MoLE achieves the highest average scores across a diverse set of benchmarks compared to state-of-the-art baselines. Beyond superior performance, our method demonstrates the ability to learn token-dependent and layer-wise expert allocation.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes LD-MoLE, a learnable dynamic routing mechanism for mixture of LoRA experts that enables adaptive, token-dependent, and layer-wise expert allocation. It resides in the 'Token-Level and Layer-Wise Dynamic Routing' leaf of the taxonomy, which contains four papers total. This leaf represents a moderately active research direction within the broader 'Routing Mechanisms and Expert Selection Strategies' branch, focusing specifically on methods that route experts adaptively per token and vary selection across network layers, distinguishing it from task-level or static routing approaches.

The taxonomy reveals several neighboring research directions. The sibling leaf 'Task-Aware and Hierarchical Routing' contains five papers exploring task identifiers and multi-level routing structures, while 'Hybrid and Prompt-Aware Routing' (three papers) combines multiple signals for expert selection. The 'Retrieval-Augmented and Bandit-Based Routing' leaf (two papers) employs reinforcement learning strategies. LD-MoLE's token-level focus differentiates it from these task-oriented or retrieval-based approaches, positioning it within a specific niche that emphasizes fine-grained, input-dependent routing rather than coarser task or prompt-based mechanisms.

Among thirty candidates examined through semantic search, none clearly refute the three main contributions. The 'Learnable Dynamic Routing Mechanism' contribution examined ten candidates with zero refutable matches, as did the 'Analytical Sparsity Control Objective' and 'Differentiable Routing with Guaranteed Expert Activation' contributions. This suggests that within the limited search scope, the specific combination of differentiable routing, closed-form solutions, and analytical sparsity control appears relatively distinct. However, the analysis covers only top-K semantic matches and does not constitute an exhaustive literature review across all possible routing mechanisms or sparsity regularization techniques.

Based on the limited search scope of thirty candidates, the work appears to occupy a recognizable position within token-level dynamic routing research. The taxonomy structure indicates this is a moderately populated area with established prior work, yet the specific technical approach combining differentiable routing and sparsity objectives shows no clear overlap among examined candidates. The assessment remains constrained by the search methodology and cannot definitively characterize novelty beyond the analyzed sample.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Learnable dynamic routing for mixture of LoRA experts. This field centers on combining low-rank adaptation (LoRA) modules with mixture-of-experts (MoE) architectures, where a routing mechanism selectively activates subsets of expert adapters to improve parameter efficiency and task specialization. The taxonomy reveals several major branches: Routing Mechanisms and Expert Selection Strategies explores how to assign tokens or inputs to experts, ranging from token-level dynamic routing (e.g., X-LoRA[21], LoRA-Mixer[27]) to more sophisticated layer-wise or hierarchical schemes. Architecture Design and Expert Composition addresses how experts are structured and combined, including hierarchical designs like HDMoLE[10] and tensor-train decompositions such as TT-LoRA MoE[8]. Training Strategies and Optimization focuses on learning objectives and curriculum-based approaches (e.g., Curriculum LoRA Experts[17]), while Multi-Task and Continual Learning Applications and Domain-Specific Applications demonstrate how MoLE frameworks adapt to diverse settings, from medical LLMs (MoE LLMs Medical[6]) to graph tasks (GRAPHMOE[13]). System Optimization and Deployment and Theoretical Foundations round out the landscape with practical efficiency concerns and novel paradigms.

A particularly active line of work involves token-level and layer-wise dynamic routing, where methods like X-LoRA[21] and its protein-focused variant X-LoRA Protein[22] learn to scale and combine multiple LoRA experts per layer based on input features. LD-MoLE[0] sits squarely within this branch, emphasizing learnable dynamic routing that adapts expert selection at fine granularity. Compared to neighbors such as LoRA-Mixer[27], which also performs token-level mixing, LD-MoLE[0] likely explores distinct routing parameterizations or training regimes to balance specialization and generalization.
Meanwhile, works like Mixture of Routers[3] and AT-MoE[5] investigate alternative routing architectures and attention-based selection, highlighting ongoing debates about the optimal granularity and complexity of routing decisions. These contrasts underscore open questions around scalability, interpretability, and the trade-offs between fine-grained token-level control and coarser expert assignment strategies.

Claimed Contributions

LD-MoLE: Learnable Dynamic Routing Mechanism for Mixture of LoRA Experts

The authors introduce LD-MoLE, a framework that replaces conventional TopK routing with a differentiable Sparsegen-based routing function. A lightweight shared MLP predicts token-dependent sparsity parameters, enabling adaptive and layer-wise expert allocation in a Mixture of LoRA Experts setting.

10 retrieved papers
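The Sparsegen-based routing described above can be sketched concretely. The following is a minimal NumPy illustration of the sparsegen-lin projection (Laha et al., 2018), which rescales logits by a sparsity parameter λ and applies the sparsemax closed form; it is an assumption-laden sketch, not the authors' implementation, and omits the lightweight shared MLP that predicts a token-dependent λ:

```python
import numpy as np

def sparsemax(z):
    """Sparsemax projection of logits z onto the probability simplex
    (Martins & Astudillo, 2016). Returns a sparse probability vector."""
    z_sorted = np.sort(z)[::-1]                 # logits in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum         # which ranks remain active
    k_z = k[support][-1]                        # size of the support set
    tau = (cumsum[k_z - 1] - 1.0) / k_z         # closed-form threshold
    return np.maximum(z - tau, 0.0)

def sparsegen_lin(z, lam):
    """Sparsegen-lin: sparsemax applied to rescaled logits.
    lam < 1; larger lam yields sparser routing weights."""
    return sparsemax(z / (1.0 - lam))

# Example: the same expert logits routed at two sparsity levels.
z = np.array([1.0, 0.8, 0.1])
p_dense = sparsegen_lin(z, 0.0)   # two experts active: [0.6, 0.4, 0.0]
p_sparse = sparsegen_lin(z, 0.9)  # one expert active:  [1.0, 0.0, 0.0]
```

Because the projection is piecewise linear in the (rescaled) logits, gradients flow through both the router logits and λ, which is what makes the expert count learnable end to end.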
Analytical Sparsity Control Objective

The authors propose a sparsity loss function that leverages the closed-form Sparsegen solution to directly control the number of activated experts. This loss regularizes the predicted sparsity factor toward values corresponding to a target expert activation level.

10 retrieved papers
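One plausible reading of this objective: the sparsemax support condition yields, in closed form, the interval of λ values for which exactly k experts are active, so the predicted λ can be regularized toward that interval. The helper names below are hypothetical and this is a sketch of the idea, not the paper's exact loss:

```python
import numpy as np

def lambda_interval_for_k(z, k):
    """Closed-form interval [lo, hi) of sparsity parameters lam for which
    sparsegen-lin(z, lam) activates exactly k experts. Derived from the
    sparsemax support condition 1 + j*z_(j)/(1-lam) > cumsum_j/(1-lam).
    Hypothetical helper, not from the paper."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    # lam must stay strictly below hi for rank k to be in the support:
    hi = 1.0 - (cumsum[k - 1] - k * z_sorted[k - 1])
    if k < len(z):
        # ...and at or above lo so that rank k+1 drops out:
        lo = 1.0 - (cumsum[k] - (k + 1) * z_sorted[k])
    else:
        lo = -np.inf                     # any lam below hi keeps all experts
    return lo, hi

def sparsity_loss(lam_pred, z, k_target):
    """Pull the predicted lam toward the midpoint of the interval that
    yields k_target active experts (a sketch of the regularizer's idea)."""
    lo, hi = lambda_interval_for_k(z, k_target)
    if not np.isfinite(lo):
        lo = hi - 1.0                    # arbitrary finite anchor below hi
    lam_star = 0.5 * (lo + hi)
    return (lam_pred - lam_star) ** 2
```

For logits `[1.0, 0.8, 0.1]` the interval for exactly one active expert is `[0.8, 1.0)`, so a predicted λ of 0.9 incurs zero loss under this target.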
Differentiable Routing with Guaranteed Expert Activation

The method introduces a fully differentiable routing mechanism based on Sparsegen projection that guarantees at least one expert is activated per token, avoiding the zero-activation problem while maintaining well-defined gradients for end-to-end optimization.

10 retrieved papers
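The at-least-one-expert guarantee follows directly from the sparsemax support condition; the short derivation below uses the standard sparsegen-lin closed form and is our reconstruction, not reproduced from the paper:

```latex
p = \operatorname{sparsemax}\!\big(z / (1-\lambda)\big), \qquad \lambda < 1.
% An expert at sorted rank k lies in the support iff
1 + k\,\frac{z_{(k)}}{1-\lambda} \;>\; \sum_{j \le k} \frac{z_{(j)}}{1-\lambda}.
% For k = 1 this reduces to
1 + \frac{z_{(1)}}{1-\lambda} \;>\; \frac{z_{(1)}}{1-\lambda},
```

which holds for every λ < 1, so the highest-scoring expert always receives nonzero weight regardless of the predicted sparsity parameter.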

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
