Queue Length Regret Bounds for Contextual Queueing Bandits

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: queueing bandits, contextual bandits, logistic bandits
Abstract:

We introduce contextual queueing bandits, a new context-aware framework for scheduling while simultaneously learning unknown service rates. Individual jobs carry heterogeneous contextual features, based on which the agent chooses a job and matches it with a server to maximize the departure rate. The service/departure rate is governed by a logistic model of the contextual feature with an unknown server-specific parameter. To evaluate the performance of a policy, we consider queue length regret, defined as the difference in queue length between the policy and the optimal policy. The main challenge in the analysis is that the lists of remaining job features in the queue may differ under our policy versus the optimal policy for a given time step, since they may process jobs in different orders. To address this, we propose the idea of policy-switching queues equipped with a sophisticated coupling argument. This leads to a novel queue length regret decomposition framework, allowing us to understand the short-term effect of choosing a suboptimal job-server pair and its long-term effect on queue state differences. We show that our algorithm, CQB-ε, achieves a regret upper bound of Õ(T^{-1/4}). We also consider the setting of adversarially chosen contexts, for which our second algorithm, CQB-Opt, achieves a regret upper bound of O(log² T). Lastly, we provide experimental results that validate our theoretical findings.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a contextual queueing bandits framework that learns unknown service rates while making job-server matching decisions to maximize departure rates. According to the taxonomy, this work resides in the 'Contextual Bandit Approaches for Queueing' leaf under 'Adaptive Learning and Control in Queueing Systems'. Notably, this leaf contains only the original paper itself—no sibling papers are present. This positioning suggests the work occupies a relatively sparse research direction within the broader field of learning unknown service rates in context-aware queueing systems, which comprises seven total papers across six leaf nodes.

The taxonomy reveals that neighboring research directions include 'Service Rate Estimation and Parameter Inference' (covering classical statistical and Bayesian methods) and 'State-Dependent and Stochastic Service Rate Models' (addressing load-dependent and Markov-modulated service processes). The original paper's leaf explicitly excludes game-theoretic customer arrival strategies, distinguishing it from 'Strategic Customer Behavior and Equilibria'. The framework bridges online learning with queueing dynamics, contrasting with offline parameter estimation approaches and differing from state-dependent models that assume known functional forms. This positioning highlights the work's focus on adaptive decision-making under contextual uncertainty rather than pure parameter inference or equilibrium analysis.

Among twenty-one candidates examined through semantic search and citation expansion, none were found to clearly refute any of the three main contributions. For the 'Contextual queueing bandits framework' contribution, ten candidates were examined with zero refutable matches, suggesting limited prior work directly addressing context-aware job-server matching with unknown service rates. For 'Policy-switching queues with coupling-based regret decomposition', only one candidate was examined, indicating a novel analytical technique. For the 'CQB-ε and CQB-Opt algorithms', ten candidates were examined without refutation. These statistics reflect a focused but not exhaustive search scope, suggesting the contributions appear novel within the examined literature.

Based on the limited search scope of twenty-one semantically related candidates, the work appears to introduce genuinely new elements to the intersection of contextual bandits and queueing theory. The absence of sibling papers in its taxonomy leaf and zero refutable pairs across contributions support this impression. However, the analysis covers top-K semantic matches rather than an exhaustive survey, leaving open the possibility of relevant work outside this search radius. The novelty assessment is therefore conditional on the examined candidate set.

Taxonomy

Core-task Taxonomy Papers: 7
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 0

Research Landscape Overview

Core task: learning unknown service rates in context-aware queueing systems. The field addresses how to estimate, adapt to, and control queueing systems when service rates are not known a priori and may depend on contextual information or system state. The taxonomy reveals several main branches:

- Service Rate Estimation and Parameter Inference: statistical methods for identifying unknown parameters, often drawing on classical inference techniques.
- Adaptive Learning and Control in Queueing Systems: online algorithms that simultaneously learn service characteristics and make control decisions.
- State-Dependent and Stochastic Service Rate Models: settings where service rates vary with queue length or evolve stochastically, as seen in early foundational work like Stochastic Service Rates[5].
- Application-Specific Queueing System Modeling: adaptations of these ideas to domains such as airport operations (Airport Departure Queueing[1]) or tandem networks (Tandem Load Dependent[3]).
- Strategic Customer Behavior and Equilibria: how rational agents respond to queue information, illustrated by Game Two Queue[6].

These branches are interconnected: parameter inference feeds into adaptive control, state-dependent models inform both estimation and application design, and strategic behavior introduces additional complexity requiring robust learning. A particularly active line of work lies at the intersection of adaptive learning and contextual information, where algorithms must balance exploration and exploitation while accounting for system dynamics. Contextual Queueing Bandits[0] sits squarely in this space, combining bandit-style learning with queueing constraints to handle context-dependent service rates.

This contrasts with more classical approaches like Bayesian Parameter Tuning[4], which emphasizes posterior inference over parameters, and with purely state-dependent models such as State Dependent Service[2], which may assume known functional forms. The main trade-offs revolve around model flexibility versus tractability: contextual bandit methods offer adaptive online learning but must manage regret under queueing stability constraints, while parametric inference can be more sample-efficient when the model is well-specified. Open questions include how to efficiently handle high-dimensional contexts, ensure stability during learning, and extend these ideas to multi-server or network settings.

Claimed Contributions

Contextual queueing bandits framework

The authors propose a new framework where jobs carry heterogeneous contextual features, and the agent selects job-server pairs to maximize departure rates governed by a logistic model with unknown server-specific parameters. This is the first work to establish a provable decay rate for queue length regret under contextual queueing bandit settings.

10 retrieved papers
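The model described in this contribution can be made concrete with a small simulation sketch. Only the logistic form P(departure) = σ(⟨θ_k, x_j⟩) comes from the abstract; all identifiers here (`service_prob`, `step`, the policy signature) are illustrative assumptions, not the paper's notation.

```python
import math
import random

def service_prob(theta, x):
    """Logistic service model: P(departure) = sigma(<theta_k, x_j>)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def step(queue, servers, policy, rng):
    """One scheduling round under an assumed one-job-per-round protocol.

    queue   : list of job context vectors currently waiting
    servers : list of (unknown) server parameter vectors theta_k
    policy  : maps (queue, n_servers) -> (job_index, server_index)
    Returns the queue after the service attempt; the chosen job departs
    with probability service_prob(servers[k], queue[j]).
    """
    if not queue:
        return queue
    j, k = policy(queue, len(servers))
    if rng.random() < service_prob(servers[k], queue[j]):
        queue = queue[:j] + queue[j + 1:]   # job departs
    return queue
```

An oracle policy with known parameters would pick the pair maximizing `service_prob`; the learning problem is to approach that choice when the θ_k must be estimated online.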
Policy-switching queues with coupling-based regret decomposition

The authors introduce policy-switching queues that follow the learning policy up to a certain round and then switch to the optimal policy. Combined with a coupling argument, this technique decomposes queue length regret into a telescoping sum, enabling analysis of short-term effects of suboptimal choices and long-term effects on queue state differences despite queue state misalignment.

1 retrieved paper
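The telescoping decomposition described above can be written schematically. Assuming Q_T^π denotes the queue length at horizon T under policy π, and π^(s) denotes the policy-switching queue that follows the learning policy π for the first s rounds and the optimal policy π* thereafter (so π^(T) = π and π^(0) = π*), one natural form of the decomposition is the following; the paper's exact definitions may differ, and this is only a reconstruction from the summary:

```latex
\mathbb{E}\bigl[Q_T^{\pi}\bigr] - \mathbb{E}\bigl[Q_T^{\pi^\ast}\bigr]
  \;=\; \sum_{s=1}^{T}
  \Bigl( \mathbb{E}\bigl[Q_T^{\pi^{(s)}}\bigr]
       - \mathbb{E}\bigl[Q_T^{\pi^{(s-1)}}\bigr] \Bigr).
```

Each summand isolates the effect of the single decision at round s: the two coupled queues agree through round s−1 and differ only in that round's job-server choice, after which both follow π*, which is how the short-term cost of one suboptimal choice and its long-term effect on queue state differences can be analyzed separately.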
CQB-ε and CQB-Opt algorithms with regret bounds

The authors develop two algorithms: CQB-ε for stochastic contexts achieving a decaying regret bound of Õ(T^{-1/4}), and CQB-Opt for adversarial contexts achieving O(log² T) regret. These bounds are established using the novel regret decomposition framework and monotonicity properties of instantaneous regret and long-term queue state differences.

10 retrieved papers
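The name CQB-ε suggests ε-greedy exploration over estimated parameters. The sketch below shows a generic selection rule of that shape, not the paper's algorithm: the decay schedule `eps0 / sqrt(t + 1)` and all identifiers are illustrative assumptions (the paper's schedule, presumably tied to its Õ(T^{-1/4}) bound, may differ).

```python
import math
import random

def sigmoid(z):
    """Logistic link sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def choose_pair(queue, thetas, t, rng, eps0=1.0):
    """Generic epsilon-greedy job-server selection.

    With probability eps_t (decaying in the round index t), explore a
    uniformly random (job, server) pair; otherwise exploit the pair with
    the highest estimated departure probability sigma(<theta_hat_k, x_j>).
    """
    eps = min(1.0, eps0 / math.sqrt(t + 1))
    pairs = [(j, k) for j in range(len(queue)) for k in range(len(thetas))]
    if rng.random() < eps:
        return rng.choice(pairs)
    return max(pairs, key=lambda p: sigmoid(
        sum(a * b for a, b in zip(thetas[p[1]], queue[p[0]]))))
```

In a full algorithm the estimates `thetas` would be refit after each observed success/failure (e.g. by online logistic regression), which is the learning component this sketch leaves out.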

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Contextual queueing bandits framework — 10 candidate papers compared; none refuted the claim.

Contribution 2: Policy-switching queues with coupling-based regret decomposition — 1 candidate paper compared; no refutation found.

Contribution 3: CQB-ε and CQB-Opt algorithms with regret bounds — 10 candidate papers compared; none refuted the claim.
