Universal Model Routing for Efficient LLM Inference

ICLR 2026 Conference Submission · Anonymous Authors
model routing · adaptive computation · learning to defer · efficient inference
Abstract:

Model routing is a simple technique for reducing the inference cost of large language models (LLMs), wherein one maintains a pool of candidate LLMs, and learns to route each prompt to the smallest feasible LLM. Existing works focus on learning a router for a fixed pool of LLMs. In this paper, we consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time. We propose UniRoute, a new approach to this problem that relies on representing each LLM as a feature vector, derived based on predictions on a set of representative prompts. Based on this, we detail two effective instantiations of UniRoute, relying on cluster-based routing and a learned cluster map respectively. We show that these are estimates of a theoretically optimal routing rule, and quantify their errors via an excess risk bound. Experiments on a range of public benchmarks show the effectiveness of UniRoute in routing amongst more than 30 unseen LLMs.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces UniRoute, a framework for dynamic model routing that handles previously unseen LLMs at test time by representing each model as a feature vector derived from predictions on representative prompts. This work sits within the 'Universal and Cross-Model Routing Frameworks' leaf of the taxonomy, which contains only three papers total including this one. The leaf focuses specifically on routing systems designed to generalize across heterogeneous or unseen LLM pools, distinguishing it from confidence-based or category-specific routing methods. This represents a relatively sparse research direction within the broader query-adaptive model selection landscape.

The taxonomy reveals that UniRoute's immediate neighbors include confidence-aware routing methods and lookahead-based approaches, which rely on different signals for routing decisions. The broader 'Query-Adaptive Model Selection and Routing' branch encompasses seven distinct sub-areas, from multi-objective optimization to reasoning-aware routing, suggesting a fragmented field with multiple competing paradigms. UniRoute's emphasis on cross-model generalization through learned representations positions it at the intersection of universal routing and representation learning, diverging from methods that require per-model training or task-specific tuning. The scope note explicitly excludes confidence-based methods, clarifying that UniRoute's feature-vector approach represents a distinct technical strategy.

Among the three contributions analyzed, the formalization of the dynamic LLM pool routing problem shows the most substantial prior work overlap: one refutable candidate was identified among ten examined papers. The UniRoute framework itself and the cluster-based instantiations appear more novel, with zero refutable candidates found among four and ten examined papers respectively. However, the literature search examined only 24 total candidates through top-K semantic search and citation expansion, representing a limited sample of the field. The single refutable case suggests that aspects of the problem formalization may have been explored previously, though the specific instantiations and theoretical guarantees appear less anticipated by prior work.

Based on this limited search scope covering 24 candidates across three contributions, UniRoute appears to occupy a relatively under-explored niche within dynamic model routing. The sparse population of its taxonomy leaf and the low refutation rate suggest meaningful novelty in the cross-model generalization approach, though the analysis cannot rule out relevant work outside the top-K semantic matches examined. The field structure indicates active parallel development in related but distinct routing paradigms, positioning UniRoute as one of several competing frameworks rather than a definitive solution.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 1

Research Landscape Overview

Core task: dynamic model routing for large language models. The field addresses how to intelligently select, switch, or coordinate among multiple LLMs or model variants to balance quality, latency, cost, and resource constraints.

The taxonomy reveals several major branches. Query-Adaptive Model Selection and Routing focuses on matching individual queries to appropriate models based on difficulty or domain; Internal Model Architecture and Execution Optimization examines within-model mechanisms such as mixture-of-experts and layer-skipping; Model Merging and Multi-Model Integration explores combining parameters or outputs from diverse models; Inference Scheduling and Resource Management tackles system-level orchestration and load balancing; Multi-Agent Coordination and Collaboration studies how multiple LLM agents can work together; Continual Learning and Model Adaptation considers evolving model capabilities over time; and Application-Specific Routing tailors routing strategies to domains like code generation or video understanding.

Works such as Tryage[10] and MixLLM[2] illustrate early query-adaptive approaches, while Llumnix[5] exemplifies scheduling and resource management at scale. Particularly active lines of work center on learning universal routing policies that generalize across diverse model pools and query distributions, trading off the need for task-specific tuning against the desire for broad applicability. Universal Model Routing[0] sits squarely in this universal and cross-model routing cluster, aiming to develop routing frameworks that adapt to heterogeneous LLM ensembles without extensive retraining. Nearby efforts like Universal LLM Routing[46] share this ambition of generality, while Tryage[10] represents an earlier, more heuristic approach to cascading models by difficulty.

A central open question is how to efficiently learn routing policies that remain robust as new models enter the pool or as query distributions shift, with some works exploring online learning (e.g., contextual bandits) and others leveraging distillation or meta-learning. Universal Model Routing[0] contributes to this landscape by emphasizing cross-model generalization, positioning itself as a step toward routing systems that require minimal per-model customization.

Claimed Contributions

UniRoute framework for dynamic model routing

The authors introduce UniRoute, a novel routing framework that represents each LLM as a feature vector based on its prediction errors on representative prompts. This enables routing among previously unseen LLMs without retraining the router.

4 retrieved papers
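The core idea of this contribution, as described above, can be sketched in a few lines: an LLM is characterized by which representative prompts it gets right or wrong. The function and data names below, the 0/1 error encoding, and the toy prompts are all illustrative assumptions on our part, not the paper's actual implementation.

```python
import numpy as np

def llm_feature_vector(model_predict, rep_prompts, rep_answers):
    """Represent an LLM as a vector of per-prompt error indicators.

    `model_predict` is any callable mapping a prompt to an answer; the
    representative prompt set is assumed to be fixed and labelled. A new,
    previously unseen LLM only needs to be run on these prompts to obtain
    a feature vector usable by an already-trained router.
    """
    errors = []
    for prompt, answer in zip(rep_prompts, rep_answers):
        errors.append(0.0 if model_predict(prompt) == answer else 1.0)
    return np.array(errors)

# Toy demo: two "models" that disagree on one representative prompt.
prompts = ["2+2?", "capital of France?"]
answers = ["4", "Paris"]
small = lambda p: {"2+2?": "4"}.get(p, "unknown")
large = lambda p: {"2+2?": "4", "capital of France?": "Paris"}[p]
print(llm_feature_vector(small, prompts, answers))  # [0. 1.]
print(llm_feature_vector(large, prompts, answers))  # [0. 0.]
```

The key property this buys is that the router's input space (error profiles over a fixed prompt set) is independent of which particular LLMs are in the pool.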
Cluster-based routing instantiations with theoretical guarantees

The authors propose two concrete implementations of UniRoute using unsupervised and supervised prompt clustering. They provide theoretical analysis showing these methods estimate the optimal routing rule and derive an excess risk bound quantifying their approximation error.

10 retrieved papers
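A minimal sketch of what cluster-based routing could look like, under our own assumptions (nearest-centroid cluster assignment, per-cluster error rates aggregated from each model's feature vector, and a cheapest-model-under-tolerance decision rule): none of these choices are taken from the paper itself.

```python
import numpy as np

def route(prompt_emb, centroids, cluster_err, costs, tol=0.2):
    """Route a prompt to the cheapest model whose estimated error on the
    prompt's cluster is below `tol`; fall back to the most accurate model.

    `cluster_err[m, k]` is model m's observed error rate on cluster k,
    estimated from its feature vector over representative prompts. The
    threshold rule and all names here are illustrative assumptions.
    """
    # Assign the prompt to its nearest cluster centroid.
    k = int(np.argmin(np.linalg.norm(centroids - prompt_emb, axis=1)))
    for m in np.argsort(costs):  # try cheap models first
        if cluster_err[m, k] <= tol:
            return int(m)
    return int(np.argmin(cluster_err[:, k]))  # no model meets tol

# Toy example: 2 clusters; a cheap model that is weak on cluster 1.
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])
cluster_err = np.array([[0.05, 0.60],   # cheap model
                        [0.04, 0.10]])  # expensive model
costs = np.array([1.0, 10.0])
print(route(np.array([0.1, 0.0]), centroids, cluster_err, costs))  # 0
print(route(np.array([0.9, 1.1]), centroids, cluster_err, costs))  # 1
```

Because `cluster_err` is computed from feature vectors alone, a model added at test time slots into this rule without retraining anything else, which is the property the report attributes to UniRoute.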
Formalization of dynamic LLM pool routing problem

The authors formally define the problem of routing when the set of available LLMs can change dynamically at test time, extending beyond the standard static pool assumption in prior work. This includes a meta-distribution over LLM pools and characterization of the Bayes-optimal routing rule.

10 retrieved papers
Can Refute
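One plausible way to read this formalization in symbols (the notation below is our reconstruction from the summary, not the paper's): with a meta-distribution $\mathcal{M}$ over LLM pools $P$, a per-model deployment cost $c_m$, a cost weight $\lambda \ge 0$, and a per-prompt error $\operatorname{err}_m(x)$, the router $h$ would minimize

$$\min_h \; \mathbb{E}_{P \sim \mathcal{M}} \, \mathbb{E}_{x} \big[ \operatorname{err}_{h(x,P)}(x) + \lambda \, c_{h(x,P)} \big],$$

whose Bayes-optimal rule decomposes per prompt and pool as

$$h^*(x, P) = \operatorname*{arg\,min}_{m \in P} \; \operatorname{err}_m(x) + \lambda \, c_m.$$

Under this reading, the cluster-based instantiations would be estimators of $\operatorname{err}_m(x)$ that remain well-defined for models outside the training pool.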

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: UniRoute framework for dynamic model routing
Contribution: Cluster-based routing instantiations with theoretical guarantees
Contribution: Formalization of dynamic LLM pool routing problem