Towards a Sharp Analysis of Learning Offline f-Divergence-Regularized Contextual Bandits

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: contextual bandits, policy optimization, f-divergence, regularization
Abstract:

Many offline reinforcement learning algorithms are underpinned by f-divergence regularization, but their sample complexity, defined with respect to regularized objectives, still lacks tight analyses, especially in terms of concrete data-coverage conditions. In this paper, we study the exact concentrability requirements for achieving the Θ̃(ε⁻¹) sample complexity of offline f-divergence-regularized contextual bandits. For reverse Kullback–Leibler (KL) divergence, arguably the most commonly used one, we achieve an Õ(ε⁻¹) sample complexity under single-policy concentrability for the first time via a novel pessimism-based analysis, surpassing the existing Õ(ε⁻¹) bound under all-policy concentrability and the Õ(ε⁻²) bound under single-policy concentrability. We also propose a near-matching lower bound, demonstrating that a multiplicative dependence on single-policy concentrability is necessary to maximally exploit the curvature property of reverse KL. Moreover, for f-divergences with strongly convex f, to which reverse KL does not belong, we show that the sharp sample complexity Θ̃(ε⁻¹) is achievable even without pessimistic estimation or single-policy concentrability. We further corroborate our theoretical insights with numerical experiments and extend our analysis to contextual dueling bandits. We believe these results take a significant step towards a comprehensive understanding of objectives with f-divergence regularization.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes sharp sample complexity bounds for offline f-divergence-regularized contextual bandits, focusing on reverse KL divergence under single-policy concentrability and strongly convex f-divergences without coverage assumptions. It resides in the 'Single-Policy Concentrability Analysis' leaf, which contains only three papers total (including this one). This indicates a relatively sparse research direction within the broader taxonomy of offline f-divergence regularization, suggesting the specific combination of single-policy coverage and tight complexity analysis remains underexplored compared to more general frameworks.

The taxonomy reveals two main branches: 'Concentrability Assumptions and Coverage Conditions' (where this paper sits) and 'Structured Regularization and Robustness' (focusing on latent structure and d-rectangular frameworks). The sibling leaf 'Strongly Convex f-Divergence Without Pessimism' addresses a complementary setting where strong convexity eliminates the need for pessimistic estimation. The paper bridges these directions by analyzing both reverse KL (requiring pessimism) and strongly convex f-divergences (coverage-free), positioning itself at the intersection of coverage-dependent and coverage-independent regimes within the concentrability branch.

Among fourteen candidates examined, Contribution A (sharp KL complexity under single-policy concentrability) shows one refutable candidate out of ten examined, suggesting some prior overlap in this specific setting. Contribution B (coverage-free bounds for strongly convex f) examined only one candidate with no refutation, indicating limited prior work in this direction. Contribution C (moment-based pessimism analysis) examined three candidates with no refutations, suggesting the technical approach may be relatively novel. The limited search scope (fourteen total candidates) means these findings reflect top-K semantic matches rather than exhaustive coverage of the field.

Given the sparse taxonomy structure and limited search scope, the work appears to address a relatively underexplored intersection of single-policy concentrability and f-divergence regularization. The analysis covers top-tier semantic matches but does not claim exhaustive literature review. The presence of one refutable candidate for the main KL contribution suggests some incremental overlap, while the technical approach and strongly convex analysis show fewer direct precedents within the examined set.

Taxonomy

- Core-task Taxonomy Papers: 4
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 14
- Refutable Papers: 1

Research Landscape Overview

Core task: sample complexity analysis of offline f-divergence-regularized contextual bandits. The field centers on understanding how much offline data is needed to learn effective policies when regularization based on f-divergences is applied.

The taxonomy divides into two main branches: 'Concentrability Assumptions and Coverage Conditions' examines how the distribution mismatch between behavior and target policies affects sample requirements, while 'Structured Regularization and Robustness' focuses on how different regularization schemes and model structures influence learning guarantees. Within the first branch, works typically analyze single-policy or multi-policy concentrability coefficients that quantify coverage quality, whereas the second branch explores how specific f-divergence choices (such as KL or chi-squared) and linear or nonlinear function classes shape robustness and complexity bounds. Representative studies like KL-Regularized Bandits[2] and Sharp Offline f-Divergence[4] illustrate these complementary perspectives on offline learning.

A particularly active line of work contrasts the tightness of bounds under varying concentrability conditions with the flexibility offered by different regularization families. Some studies pursue sharp, instance-dependent rates that adapt to the actual coverage provided by logged data, while others emphasize worst-case robustness across broader f-divergence classes. Sharp f-Divergence Bandits[0] sits within the single-policy concentrability analysis cluster, closely related to Sharp Offline f-Divergence[4] and KL-Regularized Bandits[2]. Compared to these neighbors, Sharp f-Divergence Bandits[0] emphasizes deriving tight sample complexity guarantees under a single behavior policy's coverage, whereas KL-Regularized Bandits[2] focuses specifically on KL regularization and Sharp Offline f-Divergence[4] explores sharpness across a broader f-divergence family. This positioning highlights ongoing questions about whether instance-dependent analysis or general robustness frameworks yield more practical offline learning guarantees.

Claimed Contributions

Sharp sample complexity for KL-regularized contextual bandits under single-policy concentrability

The authors propose a pessimism-based algorithm (KL-PCB) that achieves Õ(ε⁻¹) sample complexity for offline KL-regularized contextual bandits under single-policy concentrability, improving upon prior work that required all-policy concentrability or achieved only Õ(ε⁻²) rates. They also provide a near-matching lower bound demonstrating that multiplicative dependency on single-policy concentrability is necessary.

10 retrieved papers
Can Refute
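The closed-form structure behind this kind of result is standard: for the objective max_π E_π[r] − η·KL(π‖π_ref), the maximizer is the Gibbs policy π(a) ∝ π_ref(a)·exp(r(a)/η), and pessimism substitutes a lower confidence bound for the estimated reward. The sketch below illustrates that general pattern only; the Hoeffding-style bonus is an assumed generic choice, not the paper's KL-PCB construction, and the function name is hypothetical.

```python
import numpy as np

def pessimistic_kl_gibbs_policy(r_hat, counts, pi_ref, eta, delta=0.05):
    """Minimal sketch: pessimistic Gibbs policy for a KL-regularized objective.

    For  max_pi  E_pi[r] - eta * KL(pi || pi_ref),  the maximizer is
        pi(a)  proportional to  pi_ref(a) * exp(r(a) / eta).
    Pessimism replaces r_hat with a lower confidence bound (LCB).
    """
    # Generic Hoeffding-style bonus (an assumption; not the paper's exact bonus).
    bonus = np.sqrt(np.log(2.0 / delta) / (2.0 * np.maximum(counts, 1)))
    lcb = r_hat - bonus                  # pessimistic reward estimate
    logits = np.log(pi_ref) + lcb / eta  # Gibbs policy in log space
    logits -= logits.max()               # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum()

# Toy usage: 3 actions, uniform reference policy; the under-sampled
# action (1 observation) is penalized most heavily by the LCB.
r_hat = np.array([0.9, 0.5, 0.2])
counts = np.array([100, 10, 1])
pi = pessimistic_kl_gibbs_policy(r_hat, counts, np.ones(3) / 3, eta=0.5)
```

Shrinking η pushes the policy toward the greedy LCB action, while a large η keeps it near π_ref; this is exactly the trade-off the KL-regularized objective encodes.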
Coverage-free sample complexity for strongly convex f-divergence regularization

The authors design a lightweight algorithm (f-CB) for f-divergence-regularized contextual bandits with strongly convex f that achieves Θ̃(ε⁻¹) sample complexity without requiring pessimistic estimation or any coverage conditions. This result is certified by matching upper and lower bounds.

1 retrieved paper
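For strongly convex f, the regularized objective is itself well-conditioned, which is the intuition for why pessimism and coverage assumptions can be dropped. As a concrete instance (chosen here for illustration; the paper covers the general strongly convex family), take the chi-squared divergence with f(t) = (t − 1)²: the KKT conditions yield a water-filling-style closed form, with the normalizing multiplier found by bisection. The function name is hypothetical.

```python
import numpy as np

def chi2_regularized_policy(r_hat, pi_ref, eta, tol=1e-10):
    """Sketch: maximizer of  E_pi[r] - eta * Chi2(pi || pi_ref).

    With f(t) = (t - 1)^2 (a strongly convex f), the KKT conditions give
        pi(a) = pi_ref(a) * max(0, 1 + (r(a) - lam) / (2 * eta)),
    where the multiplier lam normalizes pi; it is found by bisection,
    since the total mass is decreasing in lam.
    """
    def mass(lam):
        return np.sum(pi_ref * np.maximum(0.0, 1.0 + (r_hat - lam) / (2 * eta)))

    # Bracket lam so that mass(lo) >= 1 >= mass(hi), then bisect.
    lo, hi = r_hat.min() - 2 * eta, r_hat.max() + 2 * eta
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mass(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return pi_ref * np.maximum(0.0, 1.0 + (r_hat - lam) / (2 * eta))

# Toy usage: 2 actions, uniform reference; lam solves to 0.5 here,
# so pi = [0.625, 0.375].
pi = chi2_regularized_policy(np.array([1.0, 0.0]), np.ones(2) / 2, eta=1.0)
```

Unlike the reverse-KL case, no confidence bonus appears anywhere: the plug-in reward estimate is used directly, mirroring the report's claim that pessimistic estimation is unnecessary under strong convexity.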
Novel moment-based analysis technique coupling pessimism with KL curvature

The authors develop a new analytical technique that combines algorithmic pessimism with the curvature properties of KL-regularized objectives through a moment-based machinery. This approach refines mean-value-type arguments and avoids requiring uniform control over function class discrepancies, which may be of independent interest beyond this specific application.

3 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
