Universal Model Routing for Efficient LLM Inference

ICLR 2026 Conference Submission · Anonymous Authors
model routing · adaptive computation · learning to defer · efficient inference
Abstract:

Model routing is a simple technique for reducing the inference cost of large language models (LLMs), wherein one maintains a pool of candidate LLMs, and learns to route each prompt to the smallest feasible LLM. Existing works focus on learning a router for a fixed pool of LLMs. In this paper, we consider the problem of dynamic routing, where new, previously unobserved LLMs are available at test time. We propose UniRoute, a new approach to this problem that relies on representing each LLM as a feature vector, derived based on predictions on a set of representative prompts. Based on this, we detail two effective instantiations of UniRoute, relying on cluster-based routing and a learned cluster map respectively. We show that these are estimates of a theoretically optimal routing rule, and quantify their errors via an excess risk bound. Experiments on a range of public benchmarks show the effectiveness of UniRoute in routing amongst more than 30 unseen LLMs.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces UniRoute, a framework for dynamic model routing that handles previously unseen LLMs at test time by representing each model as a feature vector derived from predictions on representative prompts. This work sits within the 'Universal and Cross-Model Routing Frameworks' leaf of the taxonomy, which contains only three papers total including this one. The leaf focuses specifically on routing systems designed to generalize across heterogeneous or unseen LLM pools, distinguishing it from confidence-based or category-specific routing methods. This represents a relatively sparse research direction within the broader query-adaptive model selection landscape.

The taxonomy reveals that UniRoute's immediate neighbors include confidence-aware routing methods and lookahead-based approaches, which rely on different signals for routing decisions. The broader 'Query-Adaptive Model Selection and Routing' branch encompasses seven distinct sub-areas, from multi-objective optimization to reasoning-aware routing, suggesting a fragmented field with multiple competing paradigms. UniRoute's emphasis on cross-model generalization through learned representations positions it at the intersection of universal routing and representation learning, diverging from methods that require per-model training or task-specific tuning. The scope note explicitly excludes confidence-based methods, clarifying that UniRoute's feature-vector approach represents a distinct technical strategy.

Among the three contributions analyzed, the formalization of the dynamic LLM pool routing problem shows the most substantial prior work overlap: one refutable candidate was identified among ten examined papers. The UniRoute framework itself and the cluster-based instantiations appear more novel, with zero refutable candidates found among four and ten examined papers respectively. However, the literature search examined only 24 total candidates through top-K semantic search and citation expansion, representing a limited sample of the field. The single refutable case suggests that aspects of the problem formalization may have been explored previously, though the specific instantiations and theoretical guarantees appear less anticipated by prior work.

Based on this limited search scope covering 24 candidates across three contributions, UniRoute appears to occupy a relatively under-explored niche within dynamic model routing. The sparse population of its taxonomy leaf and the low refutation rate suggest meaningful novelty in the cross-model generalization approach, though the analysis cannot rule out relevant work outside the top-K semantic matches examined. The field structure indicates active parallel development in related but distinct routing paradigms, positioning UniRoute as one of several competing frameworks rather than a definitive solution.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 1

Research Landscape Overview

Core task: dynamic model routing for large language models. The field addresses how to intelligently select, switch, or coordinate among multiple LLMs or model variants to balance quality, latency, cost, and resource constraints.

The taxonomy reveals several major branches. Query-Adaptive Model Selection and Routing focuses on matching individual queries to appropriate models based on difficulty or domain; Internal Model Architecture and Execution Optimization examines within-model mechanisms such as mixture-of-experts and layer-skipping; Model Merging and Multi-Model Integration explores combining parameters or outputs from diverse models; Inference Scheduling and Resource Management tackles system-level orchestration and load balancing; Multi-Agent Coordination and Collaboration studies how multiple LLM agents can work together; Continual Learning and Model Adaptation considers evolving model capabilities over time; and Application-Specific Routing tailors routing strategies to domains like code generation or video understanding.

Works such as Tryage[10] and MixLLM[2] illustrate early query-adaptive approaches, while Llumnix[5] exemplifies scheduling and resource management at scale. Particularly active lines of work center on learning universal routing policies that generalize across diverse model pools and query distributions, trading off the need for task-specific tuning against the desire for broad applicability. Universal Model Routing[0] sits squarely in this universal and cross-model routing cluster, aiming to develop routing frameworks that adapt to heterogeneous LLM ensembles without extensive retraining. Nearby efforts like Universal LLM Routing[46] share this ambition of generality, while Tryage[10] represents an earlier, more heuristic approach to cascading models by difficulty.

A central open question is how to efficiently learn routing policies that remain robust as new models enter the pool or as query distributions shift, with some works exploring online learning (e.g., contextual bandits) and others leveraging distillation or meta-learning. Universal Model Routing[0] contributes to this landscape by emphasizing cross-model generalization, positioning itself as a step toward routing systems that require minimal per-model customization.

Claimed Contributions

UniRoute framework for dynamic model routing

The authors introduce UniRoute, a novel routing framework that represents each LLM as a feature vector based on its prediction errors on representative prompts. This enables routing among previously unseen LLMs without retraining the router.

4 retrieved papers
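The core idea of this contribution, as described above, can be sketched in a few lines: an LLM is characterized by which representative prompts it gets right or wrong. The function and data names below, the 0/1 error encoding, and the toy prompts are all illustrative assumptions on our part, not the paper's actual implementation.

```python
import numpy as np

def llm_feature_vector(model_predict, rep_prompts, rep_answers):
    """Represent an LLM as a vector of per-prompt error indicators.

    `model_predict` is any callable mapping a prompt to an answer; the
    representative prompt set is assumed to be fixed and labelled. A new,
    previously unseen LLM only needs to be run on these prompts to obtain
    a feature vector usable by an already-trained router.
    """
    errors = []
    for prompt, answer in zip(rep_prompts, rep_answers):
        errors.append(0.0 if model_predict(prompt) == answer else 1.0)
    return np.array(errors)

# Toy demo: two "models" that disagree on one representative prompt.
prompts = ["2+2?", "capital of France?"]
answers = ["4", "Paris"]
small = lambda p: {"2+2?": "4"}.get(p, "unknown")
large = lambda p: {"2+2?": "4", "capital of France?": "Paris"}[p]
print(llm_feature_vector(small, prompts, answers))  # [0. 1.]
print(llm_feature_vector(large, prompts, answers))  # [0. 0.]
```

The key property this buys is that the router's input space (error profiles over a fixed prompt set) is independent of which particular LLMs are in the pool.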
Cluster-based routing instantiations with theoretical guarantees

The authors propose two concrete implementations of UniRoute using unsupervised and supervised prompt clustering. They provide theoretical analysis showing these methods estimate the optimal routing rule and derive an excess risk bound quantifying their approximation error.

10 retrieved papers
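A minimal sketch of what cluster-based routing could look like, under our own assumptions (nearest-centroid cluster assignment, per-cluster error rates aggregated from each model's feature vector, and a cheapest-model-under-tolerance decision rule): none of these choices are taken from the paper itself.

```python
import numpy as np

def route(prompt_emb, centroids, cluster_err, costs, tol=0.2):
    """Route a prompt to the cheapest model whose estimated error on the
    prompt's cluster is below `tol`; fall back to the most accurate model.

    `cluster_err[m, k]` is model m's observed error rate on cluster k,
    estimated from its feature vector over representative prompts. The
    threshold rule and all names here are illustrative assumptions.
    """
    # Assign the prompt to its nearest cluster centroid.
    k = int(np.argmin(np.linalg.norm(centroids - prompt_emb, axis=1)))
    for m in np.argsort(costs):  # try cheap models first
        if cluster_err[m, k] <= tol:
            return int(m)
    return int(np.argmin(cluster_err[:, k]))  # no model meets tol

# Toy example: 2 clusters; a cheap model that is weak on cluster 1.
centroids = np.array([[0.0, 0.0], [1.0, 1.0]])
cluster_err = np.array([[0.05, 0.60],   # cheap model
                        [0.04, 0.10]])  # expensive model
costs = np.array([1.0, 10.0])
print(route(np.array([0.1, 0.0]), centroids, cluster_err, costs))  # 0
print(route(np.array([0.9, 1.1]), centroids, cluster_err, costs))  # 1
```

Because `cluster_err` is computed from feature vectors alone, a model added at test time slots into this rule without retraining anything else, which is the property the report attributes to UniRoute.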
Formalization of dynamic LLM pool routing problem

The authors formally define the problem of routing when the set of available LLMs can change dynamically at test time, extending beyond the standard static pool assumption in prior work. This includes a meta-distribution over LLM pools and characterization of the Bayes-optimal routing rule.

10 retrieved papers
Can Refute
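One plausible way to read this formalization in symbols (the notation below is our reconstruction from the summary, not the paper's): with a meta-distribution $\mathcal{M}$ over LLM pools $P$, a per-model deployment cost $c_m$, a cost weight $\lambda \ge 0$, and a per-prompt error $\operatorname{err}_m(x)$, the router $h$ would minimize

$$\min_h \; \mathbb{E}_{P \sim \mathcal{M}} \, \mathbb{E}_{x} \big[ \operatorname{err}_{h(x,P)}(x) + \lambda \, c_{h(x,P)} \big],$$

whose Bayes-optimal rule decomposes per prompt and pool as

$$h^*(x, P) = \operatorname*{arg\,min}_{m \in P} \; \operatorname{err}_m(x) + \lambda \, c_m.$$

Under this reading, the cluster-based instantiations would be estimators of $\operatorname{err}_m(x)$ that remain well-defined for models outside the training pool.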

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: UniRoute framework for dynamic model routing
Contribution: Cluster-based routing instantiations with theoretical guarantees
Contribution: Formalization of dynamic LLM pool routing problem