Efficient Simple Regret Algorithms for Stochastic Contextual Bandits

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: contextual bandits, logistic bandits, simple regret, Thompson sampling
Abstract:

We study stochastic contextual logistic bandits under the simple regret objective. While simple regret guarantees are known for the linear case, no such results existed for the logistic setting. Building on ideas from contextual linear bandits and self-concordant analysis, we propose the first algorithm that achieves simple regret Õ(d/√T). Notably, the leading term of our regret bound is free of κ = O(exp(S)), where S is a bound on the magnitude of the unknown parameter vector, while the algorithm remains computationally tractable for finite action sets. We also introduce a new variant of Thompson Sampling adapted to the simple-regret setting, which yields the first simple regret guarantee for randomized algorithms in stochastic contextual linear bandits. Extending these tools to the logistic case, we obtain a Thompson Sampling variant with regret Õ(d^(3/2)/√T), again free of κ in the leading term. The randomized algorithms, as expected, are cheaper to run than their deterministic counterparts. Finally, we conduct a series of experiments to empirically validate these theoretical guarantees.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes the first simple regret algorithm for stochastic contextual logistic bandits, achieving Õ(d/√T) regret with a leading term free of the exponential parameter κ. It also introduces a Thompson Sampling variant for simple regret in contextual linear bandits, extending to the logistic case with Õ(d^(3/2)/√T) regret. Within the taxonomy, the work sits in the 'Deterministic Simple Regret Algorithms' leaf alongside one sibling paper (fc11b0c3a4fe1e4e7e3fff4a08148db3), indicating a relatively sparse research direction focused on provable simple regret bounds for contextual bandits under stochastic rewards.

The taxonomy reveals that the broader 'Core Simple Regret Algorithms for Contextual Bandits' branch contains three leaves: deterministic methods, randomized approaches, and Bayesian experimental design. The paper bridges deterministic and randomized categories by proposing both types of algorithms. Neighboring branches address structural assumptions (e.g., Lipschitz rewards, adaptive contexts) and dual-objective frameworks balancing simple and cumulative regret. The scope notes clarify that this work excludes cumulative regret methods and non-contextual settings, positioning it firmly within pure simple regret minimization for contextual bandits with stochastic feedback.

Among seventeen candidates examined across three contributions, none were found to clearly refute the claimed novelty. The first contribution (logistic contextual bandits) examined one candidate with no refutation. The second contribution (randomized simple regret for linear contextual bandits) examined six candidates, all non-refutable or unclear. The third contribution (novel analysis paradigm) examined ten candidates, again with no refutations. This suggests that within the limited search scope—top-K semantic matches plus citation expansion—the paper's contributions appear distinct from existing work, though the small candidate pool (especially one candidate for the logistic claim) limits the strength of this signal.

Based on the limited literature search, the work appears to occupy a genuinely sparse area: simple regret for logistic contextual bandits and randomized simple regret for linear contextual bandits both lack direct prior results among the examined candidates. The taxonomy structure confirms that simple regret minimization itself is less crowded than cumulative regret research. However, the analysis covers only seventeen candidates, not an exhaustive survey, so stronger novelty claims would require broader verification beyond top-K semantic retrieval.

Taxonomy

Core-task Taxonomy Papers: 14
Claimed Contributions: 3
Contribution Candidate Papers Compared: 17
Refutable Papers: 0

Research Landscape Overview

Core task: simple regret minimization in stochastic contextual bandits. The field centers on identifying the best arm or policy after a fixed exploration budget, rather than minimizing cumulative regret during learning. The taxonomy reveals several main branches: core algorithmic approaches that directly tackle simple regret (including deterministic and randomized strategies), structural assumptions and problem variants that exploit specific reward or context structures, dual-objective frameworks that balance simple regret with cumulative regret or other constraints, extensions to infinite or continuous arm spaces, connections to active learning and policy optimization, and broader bandit theory surveys.

Representative works such as Efficient Simple Regret Algorithms[0] and Simple Regret Minimization for[1] illustrate how deterministic elimination-style methods can achieve strong guarantees, while other branches like Minimal Exploration in Structured[3] or Causal Contextual Bandits with[4] show how additional structure (e.g., causal graphs or low-dimensional embeddings) can reduce sample complexity. A particularly active line of research contrasts pure simple regret methods with hybrid approaches that must also control cumulative regret or satisfy resource constraints, as seen in works like Bandits with Knapsacks beyond[14] or Multi-Armed Bandit With Several[12]. Another theme involves extending finite-arm guarantees to infinite or continuous settings, exemplified by Simple regret for infinitely[2], which raises open questions about discretization and adaptive sampling.

Within this landscape, Efficient Simple Regret Algorithms[0] sits squarely in the deterministic core branch alongside Simple Regret Minimization for[1], emphasizing tight sample complexity bounds and elimination-based exploration. Compared to neighbors like Minimal Exploration in Structured[3], which leverages problem structure, or randomized approaches such as OSOM[10], the deterministic focus of Efficient Simple Regret Algorithms[0] prioritizes worst-case guarantees and algorithmic simplicity, reflecting a classical yet foundational perspective on the simple regret objective.
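As a reference point for the discussion above, the simple regret objective can be written down as follows. This is a standard textbook formulation, not notation taken from the paper under review: the learner explores for a budget of T rounds, outputs a decision rule π̂, and is judged only on the quality of that final output.

```latex
% Simple regret of the recommended rule \hat{\pi} after T exploration rounds,
% in expectation over the context x. The link \mu is the identity in the
% linear case and the sigmoid \mu(z) = 1/(1+e^{-z}) in the logistic case.
\mathrm{SReg}(T) \;=\; \mathbb{E}_{x}\Big[\max_{a \in \mathcal{A}}
    \mu\big(x_a^{\top}\theta^{*}\big)
    \;-\; \mu\big(x_{\hat{\pi}(x)}^{\top}\theta^{*}\big)\Big],
\qquad
\mathrm{CReg}(T) \;=\; \sum_{t=1}^{T}\Big[\max_{a \in \mathcal{A}}
    \mu\big(x_{t,a}^{\top}\theta^{*}\big)
    \;-\; \mu\big(x_{t,a_t}^{\top}\theta^{*}\big)\Big].
```

The contrast with the cumulative regret CReg(T) explains why exploration can be far more aggressive under the simple regret objective: suboptimal plays during the budget carry no cost in themselves.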

Claimed Contributions

First simple regret algorithm for logistic contextual bandits

The authors introduce the first algorithm for stochastic contextual logistic bandits under the simple regret objective, achieving Õ(d/√T) regret with the leading term independent of the potentially exponentially large constant κ. The algorithm uses uncertainty-based action selection with self-concordant analysis.

1 retrieved paper
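The contribution above mentions uncertainty-based action selection under a self-concordant (logistic) model. A minimal sketch of that generic pattern, not the paper's actual algorithm, combines a regularized logistic MLE with selection of the action whose curvature-weighted uncertainty width is largest; all function names and parameter values here are our own illustrative choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_mle(X, y, iters=50, lam=1.0):
    """Regularized logistic MLE via Newton's method.
    X: (n, d) played feature vectors; y: (n,) binary rewards."""
    d = X.shape[1]
    theta = np.zeros(d)
    for _ in range(iters):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y) + lam * theta
        W = p * (1 - p)                              # mu'(x^T theta) per sample
        H = (X * W[:, None]).T @ X + lam * np.eye(d) # curvature-weighted design
        theta -= np.linalg.solve(H, grad)
    return theta

def select_most_uncertain(actions, X_hist, theta, lam=1.0):
    """Pick the action with the largest width ||x||_{H^{-1}},
    where H is the curvature-weighted design matrix at theta."""
    d = actions.shape[1]
    if len(X_hist) == 0:
        H = lam * np.eye(d)
    else:
        Xh = np.asarray(X_hist)
        W = sigmoid(Xh @ theta) * (1 - sigmoid(Xh @ theta))
        H = (Xh * W[:, None]).T @ Xh + lam * np.eye(d)
    Hinv = np.linalg.inv(H)
    widths = np.einsum('ad,dk,ak->a', actions, Hinv, actions)  # x^T H^-1 x per action
    return int(np.argmax(widths))
```

Because the curvature weights μ'(xᵀθ) appear in H, directions where the sigmoid is flat contribute less information, which is the mechanism self-concordant analyses exploit to keep κ out of the leading regret term.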
First simple regret guarantee for randomized algorithms in contextual linear bandits

The authors propose a Thompson Sampling variant specifically designed for simple regret in stochastic contextual linear bandits, providing the first theoretical guarantee for randomized algorithms in this setting with Õ(d^(3/2)/√T) regret bound.

6 retrieved papers
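For the randomized contribution above, the explore-then-recommend pattern behind a simple-regret Thompson Sampling variant can be sketched as follows. This is a hedged illustration of the generic idea (posterior sampling during the budget, greedy recommendation from the posterior mean at the end), not the authors' algorithm; `scale`, `lam`, and the Gaussian posterior are illustrative assumptions:

```python
import numpy as np

def ts_simple_regret(actions, theta_star, T, noise=0.1, lam=1.0,
                     scale=1.0, rng=None):
    """Thompson-sampling exploration for T rounds, then a greedy
    recommendation from the posterior mean (simple-regret setting:
    only the final recommendation is scored)."""
    rng = np.random.default_rng(rng)
    K, d = actions.shape
    V = lam * np.eye(d)              # regularized design matrix
    b = np.zeros(d)
    for _ in range(T):
        theta_hat = np.linalg.solve(V, b)
        cov = scale**2 * np.linalg.inv(V)
        theta_tilde = rng.multivariate_normal(theta_hat, cov)  # posterior sample
        a = int(np.argmax(actions @ theta_tilde))              # randomized play
        x = actions[a]
        r = x @ theta_star + noise * rng.normal()              # stochastic reward
        V += np.outer(x, x)
        b += r * x
    theta_hat = np.linalg.solve(V, b)
    return int(np.argmax(actions @ theta_hat))  # final recommendation
```

Each round costs one d×d solve and one Gaussian sample, which is why randomized methods of this kind are cheaper to run than deterministic counterparts that must optimize an uncertainty width over the whole action set.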
Novel analysis paradigm for randomized simple regret algorithms

The authors develop a new analytical framework for studying randomized algorithms under simple regret objectives in contextual bandits, which differs fundamentally from cumulative regret analysis and could serve as a foundation for analyzing related problems.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

First simple regret algorithm for logistic contextual bandits

The authors introduce the first algorithm for stochastic contextual logistic bandits under the simple regret objective, achieving Õ(d/√T) regret with the leading term independent of the potentially exponentially large constant κ. The algorithm uses uncertainty-based action selection with self-concordant analysis.

Contribution

First simple regret guarantee for randomized algorithms in contextual linear bandits

The authors propose a Thompson Sampling variant specifically designed for simple regret in stochastic contextual linear bandits, providing the first theoretical guarantee for randomized algorithms in this setting with Õ(d^(3/2)/√T) regret bound.

Contribution

Novel analysis paradigm for randomized simple regret algorithms

The authors develop a new analytical framework for studying randomized algorithms under simple regret objectives in contextual bandits, which differs fundamentally from cumulative regret analysis and could serve as a foundation for analyzing related problems.