KL-Regularization Is Sufficient in Contextual Bandits and RLHF

ICLR 2026 Conference Withdrawn Submission
Authors: Joongkyu Lee, Min-hwan Oh
Keywords: bandit, RLHF, dueling bandit, KL regularization
Abstract:

Recently, reinforcement learning from human feedback (RLHF) has demonstrated remarkable efficiency in fine-tuning large language models (LLMs), fueling a surge of interest in KL regularization. Yet, the theoretical foundations of KL regularization remain underexplored. Many prior works employ either explicit online exploration strategies—such as UCB, Thompson sampling, and forced sampling—or optimism-embedded optimization techniques (e.g., Xie et al. 2024) in addition to KL regularization to achieve sublinear regret in online RLHF. In this paper, we show, for the first time to the best of our knowledge, that such additional exploration strategies are unnecessary if KL regularization is already included. That is, KL regularization alone suffices to guarantee sublinear regret. To handle general function classes, we assume access to an online regression oracle and propose KL-EXP (and its RLHF variant, OEPO), which achieves logarithmic KL-regularized regret—the standard objective in KL-regularized contextual bandits and RLHF—while also attaining an unregularized regret of $\tilde{\mathcal{O}}(\sqrt{\log N \cdot T \cdot \text{Reg}_{\text{Sq}}(T)})$, where $N$ is the number of actions, $T$ is the total number of rounds, and $\text{Reg}_{\text{Sq}}(T)$ is the online regression oracle bound. To the best of our knowledge, this is the first result to achieve regret with only logarithmic dependence on $N$ in oracle-based contextual bandits. As a special case, in linear contextual bandits, we establish an $\tilde{\mathcal{O}}(\sqrt{dT \log N})$ bound on the unregularized regret, where $d$ is the feature dimension. To the best of our knowledge, this is the first $\tilde{\mathcal{O}}(\sqrt{dT \log N})$-type regret bound achieved without resorting to supLin-type algorithms, making it substantially more practical.
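For readers unfamiliar with the objective the abstract refers to, the KL-regularized target in contextual bandits and RLHF has a well-known closed-form per-context solution: a softmax tilting of the reference policy. The sketch below is illustrative only (it is not the paper's KL-EXP or OEPO algorithm); the function and variable names are our own.

```python
import numpy as np

def kl_regularized_optimal_policy(rewards, pi_ref, beta):
    """Closed-form maximizer of E_pi[r] - beta * KL(pi || pi_ref)
    for a single context: pi*(a) ∝ pi_ref(a) * exp(r(a) / beta).
    rewards: per-action rewards r(x, a); pi_ref: reference action
    distribution; beta: KL-regularization strength."""
    logits = np.log(pi_ref) + rewards / beta
    logits -= logits.max()          # shift for numerical stability
    pi = np.exp(logits)
    return pi / pi.sum()

# Toy illustration: 4 actions, uniform reference policy.
r = np.array([1.0, 0.5, 0.0, -0.5])
pi_ref = np.full(4, 0.25)
for beta in (10.0, 0.1):
    pi = kl_regularized_optimal_policy(r, pi_ref, beta)
    # Large beta keeps pi close to pi_ref; small beta concentrates
    # probability mass on the highest-reward action.
    print(beta, np.round(pi, 3))
```

The regularization strength beta interpolates between pure imitation of the reference policy (beta large) and greedy reward maximization (beta small), which is the trade-off the KL-regularized regret objective quantifies.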

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper claims that KL regularization alone suffices for sublinear regret in contextual bandits and RLHF, eliminating the need for explicit exploration strategies like UCB or Thompson sampling. It sits in the 'Regret Bounds and Sample Complexity' leaf alongside three sibling papers within the 'Theoretical Foundations and Regret Analysis' branch. This leaf represents a focused but active research direction within a 27-paper taxonomy, suggesting moderate competition in establishing tight theoretical guarantees for KL-regularized learning. The taxonomy structure reveals this is a core theoretical question rather than a crowded algorithmic design space.

The taxonomy places this work within a broader theoretical ecosystem that includes 'Offline, Online, and Hybrid Learning Settings' as a sibling leaf, and 'Policy Optimization and Gradient Methods' in a neighboring branch. The scope note for the parent branch emphasizes formal regret analysis distinct from algorithmic design, while the exclude note clarifies that empirical methods belong elsewhere. Nearby leaves address iterative training frameworks and regularization strategies, indicating the field distinguishes between proving what KL regularization achieves theoretically versus how to implement it efficiently. The paper's position suggests it bridges foundational theory with the practical question of whether additional exploration mechanisms are necessary.

Among 18 candidates examined, the contribution 'KL-regularization alone suffices for sublinear regret' shows no clear refutation across 5 candidates, while 'KL-EXP algorithm with logarithmic KL-regularized regret' similarly faces no refutation among 3 candidates. However, the claim of 'first regret bound with logarithmic dependence on number of actions' examined 10 candidates and found 4 potentially refutable prior works. This pattern suggests the core sufficiency claim appears more novel within the limited search scope, whereas the specific logarithmic-in-actions bound may have more substantial prior work. The analysis explicitly covers top-K semantic matches plus citation expansion, not an exhaustive literature review.

Based on examining 18 candidates from semantic search, the paper's central theoretical claim about KL regularization sufficiency appears relatively unexplored, while the specific regret bound formulation encounters more prior work. The limited search scope means these findings reflect proximity in embedding space and citation networks rather than comprehensive field coverage. The taxonomy context suggests this work addresses a recognized gap in understanding when additional exploration is theoretically necessary beyond KL constraints.

Taxonomy

- Core-task taxonomy papers: 27
- Claimed contributions: 3
- Contribution candidate papers compared: 18
- Refutable papers: 4

Research Landscape Overview

Core task: KL-regularized contextual bandits and reinforcement learning from human feedback. The field organizes around five main branches that together address the challenge of aligning models with human preferences while maintaining theoretical rigor and practical safety.

Theoretical Foundations and Regret Analysis establishes formal guarantees for KL-regularized policies, examining sample complexity and convergence rates in settings where divergence from a reference policy must be controlled. Algorithmic Methods and Optimization Techniques develops practical training procedures—ranging from online iterative schemes like Online Iterative RLHF[6] to offline approaches such as Offline Policy Learning[3]—that balance exploration, exploitation, and computational efficiency. Reward and Preference Modeling focuses on learning accurate human preference signals, with works like Helpful Harmless Assistant[1] and Iterative Preference Learning[2] demonstrating how to elicit and refine feedback at scale. Applications and Domain-Specific Extensions adapt these methods to diverse settings including dialogue systems, video generation (Video Generation Feedback[4]), and recommendation (RewardRec[25]), while Privacy and Safety Considerations addresses robustness concerns through differential privacy (KL Differential Privacy[9]) and guardrails against reward overoptimization (Mitigating Overoptimization[15]).

Within the theoretical branch, a particularly active line of work investigates tight regret bounds and sample complexity under KL constraints. KL Regularization Sufficiency[0] sits squarely in this cluster, examining when and why KL regularization alone suffices for efficient learning. Nearby, Logarithmic Regret KL[7] establishes logarithmic regret guarantees in specific bandit settings, while Sharp KL Analysis[13] provides refined characterizations of the regret-divergence trade-off.
These works collectively address a central question: how much regularization is necessary to prevent catastrophic deviation from a safe reference policy without sacrificing learning speed? The interplay between KL Q Learning[23] and Maximum Entropy Failures[24] further highlights tensions between entropy-based exploration and strict KL control, suggesting that different problem structures may demand tailored regularization strategies rather than a one-size-fits-all approach.

Claimed Contributions

KL-regularization alone suffices for sublinear regret

The authors demonstrate that KL-regularization by itself is sufficient to achieve sublinear regret in contextual bandits and RLHF, without requiring additional exploration mechanisms such as UCB, Thompson sampling, or forced sampling that prior works employed.

5 retrieved papers
KL-EXP algorithm with logarithmic KL-regularized regret

The authors introduce the KL-EXP algorithm (and its RLHF variant OEPO) that achieves logarithmic KL-regularized regret by relying solely on KL-regularization, using an online regression oracle to handle general function classes.

3 retrieved papers
First regret bound with logarithmic dependence on number of actions

The authors achieve an unregularized regret bound of order square root of log N times T times the regression oracle bound, improving the dependence on the number of actions N from square root of N to square root of log N in oracle-efficient contextual bandits.

10 retrieved papers
Can refute: 4 of the 10 retrieved candidates potentially refute this claim
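To see why improving the dependence on the number of actions from $\sqrt{N}$ to $\sqrt{\log N}$ matters, a quick numeric comparison helps (illustrative values only, not figures from the paper):

```python
import math

# Illustrative only: ratio of a sqrt(N)-type to a sqrt(log N)-type
# dependence on the number of actions N, with all other factors equal.
for N in (10, 1_000, 100_000):
    blowup = math.sqrt(N) / math.sqrt(math.log(N))
    print(f"N={N:>7}: sqrt(N)/sqrt(log N) ≈ {blowup:.1f}")
```

For large action spaces (e.g., a language model's vocabulary), the $\sqrt{N}$-type bound is larger by roughly two orders of magnitude at $N = 10^5$, which is why the logarithmic dependence is the headline of this contribution.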

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

KL-regularization alone suffices for sublinear regret

Contribution

KL-EXP algorithm with logarithmic KL-regularized regret

Contribution

First regret bound with logarithmic dependence on number of actions