LD-MoLE: Learnable Dynamic Routing for Mixture of LoRA Experts

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Mixture of Experts, Mixture of LoRA Experts, Dynamic routing, Fully differentiable, LoRA, MoE
Abstract:

Recent studies have shown that combining parameter-efficient fine-tuning (PEFT) with mixture-of-experts (MoE) is an effective strategy for adapting large language models (LLMs) to downstream tasks. However, most existing approaches rely on conventional TopK routing, which requires careful hyperparameter tuning and assigns a fixed number of experts to each token. In this work, we propose LD-MoLE, a Learnable Dynamic routing mechanism for Mixture of LoRA Experts that enables adaptive, token-dependent, and layer-wise expert allocation. Our method replaces the non-differentiable TopK selection with a differentiable routing function that admits a closed-form solution, allowing the model to adaptively determine the number of experts to activate for each token at different layers. In addition, we introduce an analytical sparsity control objective to regularize the number of activated experts. Extensive experiments on the Qwen3-1.7B and Llama-3.2-3B models show that LD-MoLE achieves the highest average scores across a diverse set of benchmarks compared to state-of-the-art baselines. Beyond superior performance, our method demonstrates the ability to learn token-dependent and layer-wise expert allocation.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes LD-MoLE, a learnable dynamic routing mechanism for mixture of LoRA experts that enables adaptive, token-dependent, and layer-wise expert allocation. It resides in the 'Token-Level and Layer-Wise Dynamic Routing' leaf of the taxonomy, which contains four papers total. This leaf represents a moderately active research direction within the broader 'Routing Mechanisms and Expert Selection Strategies' branch, focusing specifically on methods that route experts adaptively per token and vary selection across network layers, distinguishing it from task-level or static routing approaches.

The taxonomy reveals several neighboring research directions. The sibling leaf 'Task-Aware and Hierarchical Routing' contains five papers exploring task identifiers and multi-level routing structures, while 'Hybrid and Prompt-Aware Routing' (three papers) combines multiple signals for expert selection. The 'Retrieval-Augmented and Bandit-Based Routing' leaf (two papers) employs reinforcement learning strategies. LD-MoLE's token-level focus differentiates it from these task-oriented or retrieval-based approaches, positioning it within a specific niche that emphasizes fine-grained, input-dependent routing rather than coarser task or prompt-based mechanisms.

Among thirty candidates examined through semantic search, none clearly refute the three main contributions. The 'Learnable Dynamic Routing Mechanism' contribution examined ten candidates with zero refutable matches, as did the 'Analytical Sparsity Control Objective' and 'Differentiable Routing with Guaranteed Expert Activation' contributions. This suggests that within the limited search scope, the specific combination of differentiable routing, closed-form solutions, and analytical sparsity control appears relatively distinct. However, the analysis covers only top-K semantic matches and does not constitute an exhaustive literature review across all possible routing mechanisms or sparsity regularization techniques.

Based on the limited search scope of thirty candidates, the work appears to occupy a recognizable position within token-level dynamic routing research. The taxonomy structure indicates this is a moderately populated area with established prior work, yet the specific technical approach combining differentiable routing and sparsity objectives shows no clear overlap among examined candidates. The assessment remains constrained by the search methodology and cannot definitively characterize novelty beyond the analyzed sample.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Learnable dynamic routing for mixture of LoRA experts. This field centers on combining low-rank adaptation (LoRA) modules with mixture-of-experts (MoE) architectures, where a routing mechanism selectively activates subsets of expert adapters to improve parameter efficiency and task specialization. The taxonomy reveals several major branches: Routing Mechanisms and Expert Selection Strategies explores how to assign tokens or inputs to experts, ranging from token-level dynamic routing (e.g., X-LoRA[21], LoRA-Mixer[27]) to more sophisticated layer-wise or hierarchical schemes. Architecture Design and Expert Composition addresses how experts are structured and combined, including hierarchical designs like HDMoLE[10] and tensor-train decompositions such as TT-LoRA MoE[8]. Training Strategies and Optimization focuses on learning objectives and curriculum-based approaches (e.g., Curriculum LoRA Experts[17]), while Multi-Task and Continual Learning Applications and Domain-Specific Applications demonstrate how MoLE frameworks adapt to diverse settings, from medical LLMs (MoE LLMs Medical[6]) to graph tasks (GRAPHMOE[13]). System Optimization and Deployment and Theoretical Foundations round out the landscape with practical efficiency concerns and novel paradigms.

A particularly active line of work involves token-level and layer-wise dynamic routing, where methods like X-LoRA[21] and its protein-focused variant X-LoRA Protein[22] learn to scale and combine multiple LoRA experts per layer based on input features. LD-MoLE[0] sits squarely within this branch, emphasizing learnable dynamic routing that adapts expert selection at fine granularity. Compared to neighbors such as LoRA-Mixer[27], which also performs token-level mixing, LD-MoLE[0] likely explores distinct routing parameterizations or training regimes to balance specialization and generalization.
Meanwhile, works like Mixture of Routers[3] and AT-MoE[5] investigate alternative routing architectures and attention-based selection, highlighting ongoing debates about the optimal granularity and complexity of routing decisions. These contrasts underscore open questions around scalability, interpretability, and the trade-offs between fine-grained token-level control and coarser expert assignment strategies.

Claimed Contributions

LD-MoLE: Learnable Dynamic Routing Mechanism for Mixture of LoRA Experts

The authors introduce LD-MoLE, a framework that replaces conventional TopK routing with a differentiable Sparsegen-based routing function. A lightweight shared MLP predicts token-dependent sparsity parameters, enabling adaptive and layer-wise expert allocation in a Mixture of LoRA Experts setting.

10 retrieved papers
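The Sparsegen-based routing described above can be sketched concretely. The following is a minimal NumPy illustration of the sparsegen-lin projection (Laha et al., 2018), which rescales logits by a sparsity parameter λ and applies the sparsemax closed form; it is an assumption-laden sketch, not the authors' implementation, and omits the lightweight shared MLP that predicts a token-dependent λ:

```python
import numpy as np

def sparsemax(z):
    """Sparsemax projection of logits z onto the probability simplex
    (Martins & Astudillo, 2016). Returns a sparse probability vector."""
    z_sorted = np.sort(z)[::-1]                 # logits in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    support = 1 + k * z_sorted > cumsum         # which ranks remain active
    k_z = k[support][-1]                        # size of the support set
    tau = (cumsum[k_z - 1] - 1.0) / k_z         # closed-form threshold
    return np.maximum(z - tau, 0.0)

def sparsegen_lin(z, lam):
    """Sparsegen-lin: sparsemax applied to rescaled logits.
    lam < 1; larger lam yields sparser routing weights."""
    return sparsemax(z / (1.0 - lam))

# Example: the same expert logits routed at two sparsity levels.
z = np.array([1.0, 0.8, 0.1])
p_dense = sparsegen_lin(z, 0.0)   # two experts active: [0.6, 0.4, 0.0]
p_sparse = sparsegen_lin(z, 0.9)  # one expert active:  [1.0, 0.0, 0.0]
```

Because the projection is piecewise linear in the (rescaled) logits, gradients flow through both the router logits and λ, which is what makes the expert count learnable end to end.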
Analytical Sparsity Control Objective

The authors propose a sparsity loss function that leverages the closed-form Sparsegen solution to directly control the number of activated experts. This loss regularizes the predicted sparsity factor toward values corresponding to a target expert activation level.

10 retrieved papers
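One plausible reading of this objective: the sparsemax support condition yields, in closed form, the interval of λ values for which exactly k experts are active, so the predicted λ can be regularized toward that interval. The helper names below are hypothetical and this is a sketch of the idea, not the paper's exact loss:

```python
import numpy as np

def lambda_interval_for_k(z, k):
    """Closed-form interval [lo, hi) of sparsity parameters lam for which
    sparsegen-lin(z, lam) activates exactly k experts. Derived from the
    sparsemax support condition 1 + j*z_(j)/(1-lam) > cumsum_j/(1-lam).
    Hypothetical helper, not from the paper."""
    z_sorted = np.sort(z)[::-1]
    cumsum = np.cumsum(z_sorted)
    # lam must stay strictly below hi for rank k to be in the support:
    hi = 1.0 - (cumsum[k - 1] - k * z_sorted[k - 1])
    if k < len(z):
        # ...and at or above lo so that rank k+1 drops out:
        lo = 1.0 - (cumsum[k] - (k + 1) * z_sorted[k])
    else:
        lo = -np.inf                     # any lam below hi keeps all experts
    return lo, hi

def sparsity_loss(lam_pred, z, k_target):
    """Pull the predicted lam toward the midpoint of the interval that
    yields k_target active experts (a sketch of the regularizer's idea)."""
    lo, hi = lambda_interval_for_k(z, k_target)
    if not np.isfinite(lo):
        lo = hi - 1.0                    # arbitrary finite anchor below hi
    lam_star = 0.5 * (lo + hi)
    return (lam_pred - lam_star) ** 2
```

For logits `[1.0, 0.8, 0.1]` the interval for exactly one active expert is `[0.8, 1.0)`, so a predicted λ of 0.9 incurs zero loss under this target.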
Differentiable Routing with Guaranteed Expert Activation

The method introduces a fully differentiable routing mechanism based on Sparsegen projection that guarantees at least one expert is activated per token, avoiding the zero-activation problem while maintaining well-defined gradients for end-to-end optimization.

10 retrieved papers
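The at-least-one-expert guarantee follows directly from the sparsemax support condition; the short derivation below uses the standard sparsegen-lin closed form and is our reconstruction, not reproduced from the paper:

```latex
p = \operatorname{sparsemax}\!\big(z / (1-\lambda)\big), \qquad \lambda < 1.
% An expert at sorted rank k lies in the support iff
1 + k\,\frac{z_{(k)}}{1-\lambda} \;>\; \sum_{j \le k} \frac{z_{(j)}}{1-\lambda}.
% For k = 1 this reduces to
1 + \frac{z_{(1)}}{1-\lambda} \;>\; \frac{z_{(1)}}{1-\lambda},
```

which holds for every λ < 1, so the highest-scoring expert always receives nonzero weight regardless of the predicted sparsity parameter.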

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
