Towards a Sharp Analysis of Learning Offline f-Divergence-Regularized Contextual Bandits

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: contextual bandits, policy optimization, f-divergence, regularization
Abstract:

Many offline reinforcement learning algorithms are underpinned by f-divergence regularization, but their sample complexity, defined with respect to regularized objectives, still lacks tight analyses, especially in terms of concrete data-coverage conditions. In this paper, we study the exact concentrability requirements for achieving the Θ̃(ε⁻¹) sample complexity of offline f-divergence-regularized contextual bandits. For reverse Kullback–Leibler (KL) divergence, arguably the most commonly used one, we achieve an Õ(ε⁻¹) sample complexity under single-policy concentrability for the first time via a novel pessimism-based analysis, surpassing the existing Õ(ε⁻¹) bound under all-policy concentrability and the Õ(ε⁻²) bound under single-policy concentrability. We also propose a near-matching lower bound, demonstrating that a multiplicative dependence on single-policy concentrability is necessary to maximally exploit the curvature property of reverse KL. Moreover, for f-divergences with strongly convex f, to which reverse KL does not belong, we show that the sharp sample complexity Θ̃(ε⁻¹) is achievable even without pessimistic estimation or single-policy concentrability. We further corroborate our theoretical insights with numerical experiments and extend our analysis to contextual dueling bandits. We believe these results take a significant step towards a comprehensive understanding of objectives with f-divergence regularization.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes sharp sample complexity bounds for offline f-divergence-regularized contextual bandits, focusing on reverse KL divergence under single-policy concentrability and strongly convex f-divergences without coverage assumptions. It resides in the 'Single-Policy Concentrability Analysis' leaf, which contains only three papers total (including this one). This indicates a relatively sparse research direction within the broader taxonomy of offline f-divergence regularization, suggesting the specific combination of single-policy coverage and tight complexity analysis remains underexplored compared to more general frameworks.

The taxonomy reveals two main branches: 'Concentrability Assumptions and Coverage Conditions' (where this paper sits) and 'Structured Regularization and Robustness' (focusing on latent structure and d-rectangular frameworks). The sibling leaf 'Strongly Convex f-Divergence Without Pessimism' addresses a complementary setting where strong convexity eliminates the need for pessimistic estimation. The paper bridges these directions by analyzing both reverse KL (requiring pessimism) and strongly convex f-divergences (coverage-free), positioning itself at the intersection of coverage-dependent and coverage-independent regimes within the concentrability branch.

Among fourteen candidates examined, Contribution A (sharp KL complexity under single-policy concentrability) shows one refutable candidate out of ten examined, suggesting some prior overlap in this specific setting. Contribution B (coverage-free bounds for strongly convex f) examined only one candidate with no refutation, indicating limited prior work in this direction. Contribution C (moment-based pessimism analysis) examined three candidates with no refutations, suggesting the technical approach may be relatively novel. The limited search scope (fourteen total candidates) means these findings reflect top-K semantic matches rather than exhaustive coverage of the field.

Given the sparse taxonomy structure and limited search scope, the work appears to address a relatively underexplored intersection of single-policy concentrability and f-divergence regularization. The analysis covers top-tier semantic matches but does not claim exhaustive literature review. The presence of one refutable candidate for the main KL contribution suggests some incremental overlap, while the technical approach and strongly convex analysis show fewer direct precedents within the examined set.

Taxonomy

- Core-task Taxonomy Papers: 4
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 14
- Refutable Papers: 1

Research Landscape Overview

Core task: sample complexity analysis of offline f-divergence-regularized contextual bandits. The field centers on understanding how much offline data is needed to learn effective policies when regularization based on f-divergences is applied.

The taxonomy divides into two main branches: 'Concentrability Assumptions and Coverage Conditions' examines how the distribution mismatch between behavior and target policies affects sample requirements, while 'Structured Regularization and Robustness' focuses on how different regularization schemes and model structures influence learning guarantees. Within the first branch, works typically analyze single-policy or multi-policy concentrability coefficients that quantify coverage quality, whereas the second branch explores how specific f-divergence choices (such as KL or chi-squared) and linear or nonlinear function classes shape robustness and complexity bounds. Representative studies like KL-Regularized Bandits[2] and Sharp Offline f-Divergence[4] illustrate these complementary perspectives on offline learning.

A particularly active line of work contrasts the tightness of bounds under varying concentrability conditions with the flexibility offered by different regularization families. Some studies pursue sharp, instance-dependent rates that adapt to the actual coverage provided by logged data, while others emphasize worst-case robustness across broader f-divergence classes. Sharp f-Divergence Bandits[0] sits within the single-policy concentrability analysis cluster, closely related to Sharp Offline f-Divergence[4] and KL-Regularized Bandits[2]. Compared to these neighbors, Sharp f-Divergence Bandits[0] emphasizes deriving tight sample complexity guarantees under a single behavior policy's coverage, whereas KL-Regularized Bandits[2] focuses specifically on KL regularization and Sharp Offline f-Divergence[4] explores sharpness across a broader f-divergence family. This positioning highlights ongoing questions about whether instance-dependent analysis or general robustness frameworks yield more practical offline learning guarantees.

Claimed Contributions

Sharp sample complexity for KL-regularized contextual bandits under single-policy concentrability

The authors propose a pessimism-based algorithm (KL-PCB) that achieves Õ(ε⁻¹) sample complexity for offline KL-regularized contextual bandits under single-policy concentrability, improving upon prior work that required all-policy concentrability or achieved only Õ(ε⁻²) rates. They also provide a near-matching lower bound demonstrating that multiplicative dependency on single-policy concentrability is necessary.

10 retrieved papers
Can Refute
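The closed-form structure behind this kind of result is standard: for the objective max_π E_π[r] − η·KL(π‖π_ref), the maximizer is the Gibbs policy π(a) ∝ π_ref(a)·exp(r(a)/η), and pessimism substitutes a lower confidence bound for the estimated reward. The sketch below illustrates that general pattern only; the Hoeffding-style bonus is an assumed generic choice, not the paper's KL-PCB construction, and the function name is hypothetical.

```python
import numpy as np

def pessimistic_kl_gibbs_policy(r_hat, counts, pi_ref, eta, delta=0.05):
    """Minimal sketch: pessimistic Gibbs policy for a KL-regularized objective.

    For  max_pi  E_pi[r] - eta * KL(pi || pi_ref),  the maximizer is
        pi(a)  proportional to  pi_ref(a) * exp(r(a) / eta).
    Pessimism replaces r_hat with a lower confidence bound (LCB).
    """
    # Generic Hoeffding-style bonus (an assumption; not the paper's exact bonus).
    bonus = np.sqrt(np.log(2.0 / delta) / (2.0 * np.maximum(counts, 1)))
    lcb = r_hat - bonus                  # pessimistic reward estimate
    logits = np.log(pi_ref) + lcb / eta  # Gibbs policy in log space
    logits -= logits.max()               # numerical stability
    pi = np.exp(logits)
    return pi / pi.sum()

# Toy usage: 3 actions, uniform reference policy; the under-sampled
# action (1 observation) is penalized most heavily by the LCB.
r_hat = np.array([0.9, 0.5, 0.2])
counts = np.array([100, 10, 1])
pi = pessimistic_kl_gibbs_policy(r_hat, counts, np.ones(3) / 3, eta=0.5)
```

Shrinking η pushes the policy toward the greedy LCB action, while a large η keeps it near π_ref; this is exactly the trade-off the KL-regularized objective encodes.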
Coverage-free sample complexity for strongly convex f-divergence regularization

The authors design a lightweight algorithm (f-CB) for f-divergence-regularized contextual bandits with strongly convex f that achieves Θ̃(ε⁻¹) sample complexity without requiring pessimistic estimation or any coverage conditions. This result is certified by matching upper and lower bounds.

1 retrieved paper
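For strongly convex f, the regularized objective is itself well-conditioned, which is the intuition for why pessimism and coverage assumptions can be dropped. As a concrete instance (chosen here for illustration; the paper covers the general strongly convex family), take the chi-squared divergence with f(t) = (t − 1)²: the KKT conditions yield a water-filling-style closed form, with the normalizing multiplier found by bisection. The function name is hypothetical.

```python
import numpy as np

def chi2_regularized_policy(r_hat, pi_ref, eta, tol=1e-10):
    """Sketch: maximizer of  E_pi[r] - eta * Chi2(pi || pi_ref).

    With f(t) = (t - 1)^2 (a strongly convex f), the KKT conditions give
        pi(a) = pi_ref(a) * max(0, 1 + (r(a) - lam) / (2 * eta)),
    where the multiplier lam normalizes pi; it is found by bisection,
    since the total mass is decreasing in lam.
    """
    def mass(lam):
        return np.sum(pi_ref * np.maximum(0.0, 1.0 + (r_hat - lam) / (2 * eta)))

    # Bracket lam so that mass(lo) >= 1 >= mass(hi), then bisect.
    lo, hi = r_hat.min() - 2 * eta, r_hat.max() + 2 * eta
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mass(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return pi_ref * np.maximum(0.0, 1.0 + (r_hat - lam) / (2 * eta))

# Toy usage: 2 actions, uniform reference; lam solves to 0.5 here,
# so pi = [0.625, 0.375].
pi = chi2_regularized_policy(np.array([1.0, 0.0]), np.ones(2) / 2, eta=1.0)
```

Unlike the reverse-KL case, no confidence bonus appears anywhere: the plug-in reward estimate is used directly, mirroring the report's claim that pessimistic estimation is unnecessary under strong convexity.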
Novel moment-based analysis technique coupling pessimism with KL curvature

The authors develop a new analytical technique that combines algorithmic pessimism with the curvature properties of KL-regularized objectives through a moment-based machinery. This approach refines mean-value-type arguments and avoids requiring uniform control over function class discrepancies, which may be of independent interest beyond this specific application.

3 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
