From Curiosity to Caution: Mitigating Reward Hacking for Best-of-N with Pessimism

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Reward Hacking, Reward Models, Pessimism, Inference-time Scaling, Large Language Models
Abstract:

Inference-time compute scaling has emerged as a powerful paradigm for improving language model performance on a wide range of tasks, but the question of how best to use the additional compute remains open. A popular approach is Best-of-N (BoN) sampling, where N candidate responses are generated and scored by a reward model, and the highest-scoring response is selected. While this approach can improve performance, it is vulnerable to reward hacking, where performance degrades as N increases because the selected responses exploit imperfections in the reward model rather than genuinely improving generation quality. Prior attempts to mitigate reward hacking, whether via stronger reward models or heavy-handed distributional regularization, either fail to fully address over-optimization or are too conservative to exploit the additional compute. In this work, we explore the principle of pessimism in reinforcement learning (RL), which uses lower confidence bounds on value estimates to avoid out-of-distribution (OOD) actions with uncertain reward estimates. Our approach, termed caution, can be seen as the reverse of curiosity: where curiosity (e.g., via Random Network Distillation, RND) rewards prediction error as a signal of novelty, caution penalizes prediction error as a signal of distributional uncertainty. Practically, caution trains an error model on typical responses and uses its prediction error to lower the reward estimates of atypical ones. Our extensive empirical evaluation demonstrates that caution is a simple, computationally efficient approach that substantially mitigates reward hacking in BoN sampling. We also provide a theoretical analysis in a simplified linear setting, which shows that caution provably improves over the standard BoN approach. Together, our results not only establish caution as a practical solution to reward hacking, but also provide evidence that curiosity-based approaches can be a general OOD detection technique in LLM settings.
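As described above, caution subtracts an error-model prediction error from each candidate's reward score before the Best-of-N argmax. A minimal self-contained sketch of that selection rule, with toy stand-ins for both the reward model and the error model (all function bodies here are hypothetical, not the paper's models):

```python
import numpy as np

rng = np.random.default_rng(0)

def reward_model(response):
    # Stand-in for a learned reward model score (hypothetical).
    return float(response.sum())

def prediction_error(response):
    # Stand-in for the trained error model's prediction error, used
    # as a proxy for how atypical (OOD) the response is (hypothetical).
    return float(np.abs(response - response.mean()).sum())

def best_of_n(candidates, beta=0.0):
    # beta = 0 recovers standard BoN; beta > 0 applies caution by
    # subtracting the per-response uncertainty proxy from the reward.
    scores = [reward_model(c) - beta * prediction_error(c) for c in candidates]
    return candidates[int(np.argmax(scores))]

# N candidate "responses" represented as toy feature vectors.
candidates = [rng.normal(size=4) for _ in range(16)]
standard_pick = best_of_n(candidates, beta=0.0)
cautious_pick = best_of_n(candidates, beta=1.0)
```

With beta = 0 the highest-reward candidate wins outright; with beta > 0, a high-reward but atypical candidate can lose to a slightly lower-reward candidate that the error model finds typical.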

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes 'caution,' a pessimistic reward estimation method for Best-of-N sampling that penalizes responses with uncertain reward estimates to mitigate reward hacking. It resides in the 'Pessimism via Prediction Error Penalization' leaf, which contains only one sibling paper ('Learning a Pessimistic Reward'). This represents a relatively sparse research direction within the broader taxonomy of thirteen papers across multiple mitigation strategies, suggesting the specific approach of penalizing prediction error as a proxy for distributional uncertainty is less explored than ensemble-based or training-time interventions.

The taxonomy reveals several neighboring approaches to the same core problem. The sibling branch 'Conservative Bounds and Exploration Constraints' applies lower confidence bounds during inference scaling but does not focus on prediction error as the uncertainty signal. Adjacent branches include 'Ensemble-Based Overoptimization Mitigation' and 'Bayesian Reward Modeling,' which aggregate multiple models or quantify uncertainty probabilistically rather than penalizing single-model prediction variance. The paper's position suggests it offers a computationally lighter alternative to ensemble methods while remaining distinct from architectural improvements like hidden state regularization.

Across the thirty candidates examined (ten per contribution), each of the three contributions has at least one candidate that appears to provide overlapping prior work: one of the ten candidates retrieved for the 'caution' mechanism itself, one for the dual curiosity-caution relationship for out-of-distribution detection, and one for the theoretical analysis. Within the limited search scope, some aspects of the approach therefore have precedent, though nine of the ten candidates per contribution did not clearly refute the claims. These statistics suggest moderate novelty: the core ideas are not entirely unprecedented, but substantial gaps remain in the examined literature.

Given the limited search scope of thirty semantically similar papers, the analysis captures nearby work but cannot claim exhaustive coverage. The sparse taxonomy leaf and the nine non-refuting candidates per contribution suggest that pessimism via prediction error penalization may offer an incremental advance over existing methods. However, the presence of refutable candidates indicates that key conceptual elements (pessimistic reward adjustment, uncertainty-based penalization, or theoretical guarantees) have appeared in prior work, warranting careful positioning relative to the sibling paper and the related ensemble and Bayesian approaches.

Taxonomy

Core-task Taxonomy Papers: 13
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: Mitigating reward hacking in Best-of-N sampling with pessimistic reward estimation. The field addresses a fundamental challenge in aligning large language models at inference time: when selecting the best response from N candidates using a learned reward model, overoptimization can lead to reward hacking, where high-scoring outputs exploit the reward model's weaknesses rather than genuinely satisfying user intent.

The taxonomy organizes solutions into several main branches. Pessimistic and Conservative Reward Estimation Methods focus on downweighting uncertain or potentially overestimated rewards, often by penalizing prediction errors or incorporating uncertainty estimates. Reward Model Ensemble and Aggregation Techniques combine multiple reward signals to reduce reliance on any single model's biases, as seen in works like Reward model ensembles help[2] and Bayesian reward models for[4]. Reward Model Architecture and Training Improvements seek to build more robust reward functions from the ground up, including approaches like Regularizing hidden states enables[1] and Agentic Reward Modeling[9]. Theoretical Analysis and Comparative Evaluation of Inference-Time Alignment provides formal understanding and empirical comparisons, exemplified by Is best-of-n the best[3] and Optimal Stopping vs Best-of-[10]. Finally, Analogous Decision-Making Models from Other Domains draws on related frameworks such as Model of the best-of-[8].

Within this landscape, a particularly active line of work explores how to inject conservatism directly into reward scoring to counteract overoptimization. From Curiosity to Caution[0] sits squarely in the Pessimism via Prediction Error Penalization cluster, alongside Learning a Pessimistic Reward[12]; both penalize outputs where the reward model exhibits high uncertainty or prediction variance. This contrasts with ensemble-based strategies like those in Reward model ensembles help[2], which aggregate multiple models rather than modifying a single reward signal. Another nearby direction, represented by SAFFRON-1[5] and related work, emphasizes architectural or training-time interventions to improve reward model calibration.

The central trade-off across these branches is between computational overhead (ensembles and sophisticated architectures can be expensive) and the degree of conservatism introduced, with pessimistic methods offering a lightweight alternative that directly targets the regions of reward space most prone to hacking.

Claimed Contributions

Caution: a pessimistic reward estimation approach for Best-of-N sampling

The authors introduce caution, an inference-time method that applies pessimism to reward estimation by penalizing out-of-distribution responses using prediction error from a trained error model. This approach mitigates reward hacking in Best-of-N sampling by subtracting per-response uncertainty estimates from reward model scores.

10 retrieved papers
Can Refute
Dual relationship between curiosity and caution for OOD detection

The authors establish that caution is conceptually dual to curiosity-based exploration methods. While curiosity rewards prediction error to encourage exploration, caution penalizes prediction error to avoid uncertain out-of-distribution responses, providing a new perspective on using curiosity-style techniques for pessimistic policy learning.

10 retrieved papers
Can Refute
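The claimed curiosity/caution duality can be illustrated with a small RND-style example: a predictor is fit to a frozen random network on typical inputs only, so its prediction error stays low in-distribution and grows out-of-distribution. Curiosity would add this error to the reward as a novelty bonus; caution subtracts it as an uncertainty penalty. A sketch under these simplifying assumptions (linear predictor, tanh target network; all details hypothetical, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 8, 4  # input and embedding dimensions (toy sizes)

# Frozen random nonlinear "target" network, as in RND; never trained.
W_target = rng.normal(size=(K, D)) / np.sqrt(D)

def target(x):
    return np.tanh(W_target @ x)

# Fit a linear predictor to the target on typical (in-distribution) inputs.
X_train = rng.normal(size=(256, D))
Y_train = np.tanh(X_train @ W_target.T)
B, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)
W_pred = B.T

def prediction_error(x):
    # Curiosity would ADD this quantity to a reward (novelty bonus);
    # caution SUBTRACTS it (distributional-uncertainty penalty).
    return float(np.sum((target(x) - W_pred @ x) ** 2))

# Average error on typical inputs vs. inputs far from the training distribution.
err_typical = np.mean([prediction_error(x) for x in rng.normal(size=(100, D))])
err_ood = np.mean([prediction_error(10.0 * x) for x in rng.normal(size=(100, D))])
```

Here `err_ood` exceeds `err_typical`: the same prediction-error signal that curiosity treats as novelty serves caution as an OOD detector.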
Theoretical analysis proving caution improves over standard Best-of-N

The authors provide a theoretical guarantee in a simplified linear setting demonstrating that their caution-regularized reward estimate leads to provably better performance than standard Best-of-N sampling, while also establishing the first theoretical validation of curiosity-style methods for out-of-distribution detection.

10 retrieved papers
Can Refute
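In simplified linear settings, pessimistic reward estimates of this kind typically take a lower-confidence-bound form. The sketch below is the standard construction from the linear-bandit/offline-RL literature, shown for orientation only; it is not necessarily the paper's exact statement:

```latex
% Linear reward model with features \phi and estimate \hat\theta fit on
% data whose regularized feature covariance is \Sigma_n:
\hat r(x) = \hat\theta^{\top}\phi(x), \qquad
\hat r_{\mathrm{pess}}(x) = \hat\theta^{\top}\phi(x)
  - \beta \sqrt{\phi(x)^{\top} \Sigma_n^{-1} \phi(x)}.
% BoN then selects \arg\max over the N candidates of \hat r_{\mathrm{pess}},
% so candidates with poorly covered features are penalized.
```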

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Caution: a pessimistic reward estimation approach for Best-of-N sampling


Contribution

Dual relationship between curiosity and caution for OOD detection


Contribution

Theoretical analysis proving caution improves over standard Best-of-N
