From Curiosity to Caution: Mitigating Reward Hacking for Best-of-N with Pessimism

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Reward Hacking, Reward Models, Pessimism, Inference-time Scaling, Large Language Models
Abstract:

Inference-time compute scaling has emerged as a powerful paradigm for improving language model performance on a wide range of tasks, but the question of how best to use the additional compute remains open. A popular approach is Best-of-N (BoN) sampling, where N candidate responses are generated and scored by a reward model, and the highest-scoring response is selected. While this approach can improve performance, it is vulnerable to reward hacking, where performance degrades as N increases because the selected responses exploit imperfections in the reward model rather than genuinely improving generation quality. Prior attempts to mitigate reward hacking, whether via stronger reward models or heavy-handed distributional regularization, either fail to fully address over-optimization or are too conservative to exploit the additional compute. In this work, we explore the principle of pessimism in reinforcement learning (RL), which uses lower confidence bounds on value estimates to avoid out-of-distribution (OOD) actions with uncertain reward estimates. Our approach, termed caution, can be seen as the reverse of curiosity: where curiosity (e.g., via Random Network Distillation, RND) rewards prediction error as a signal of novelty, caution penalizes prediction error as a signal of distributional uncertainty. Practically, caution trains an error model on typical responses and uses its prediction error to lower the reward estimates of atypical ones. Our extensive empirical evaluation demonstrates that caution is a simple, computationally efficient approach that substantially mitigates reward hacking in BoN sampling. We also provide a theoretical analysis in a simplified linear setting, which shows that caution provably improves over the standard BoN approach. Together, our results not only establish caution as a practical solution to reward hacking, but also provide evidence that curiosity-based approaches can be a general OOD detection technique in LLM settings.
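As described above, caution subtracts an error-model prediction error from each candidate's reward score before the Best-of-N argmax. A minimal self-contained sketch of that selection rule, with toy stand-ins for both the reward model and the error model (all function bodies here are hypothetical, not the paper's models):

```python
import numpy as np

rng = np.random.default_rng(0)

def reward_model(response):
    # Stand-in for a learned reward model score (hypothetical).
    return float(response.sum())

def prediction_error(response):
    # Stand-in for the trained error model's prediction error, used
    # as a proxy for how atypical (OOD) the response is (hypothetical).
    return float(np.abs(response - response.mean()).sum())

def best_of_n(candidates, beta=0.0):
    # beta = 0 recovers standard BoN; beta > 0 applies caution by
    # subtracting the per-response uncertainty proxy from the reward.
    scores = [reward_model(c) - beta * prediction_error(c) for c in candidates]
    return candidates[int(np.argmax(scores))]

# N candidate "responses" represented as toy feature vectors.
candidates = [rng.normal(size=4) for _ in range(16)]
standard_pick = best_of_n(candidates, beta=0.0)
cautious_pick = best_of_n(candidates, beta=1.0)
```

With beta = 0 the highest-reward candidate wins outright; with beta > 0, a high-reward but atypical candidate can lose to a slightly lower-reward candidate that the error model finds typical.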

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes 'caution,' a pessimistic reward estimation method for Best-of-N sampling that penalizes responses with uncertain reward estimates to mitigate reward hacking. It resides in the 'Pessimism via Prediction Error Penalization' leaf, which contains only one sibling paper ('Learning a Pessimistic Reward'). This represents a relatively sparse research direction within the broader taxonomy of thirteen papers across multiple mitigation strategies, suggesting the specific approach of penalizing prediction error as a proxy for distributional uncertainty is less explored than ensemble-based or training-time interventions.

The taxonomy reveals several neighboring approaches to the same core problem. The sibling branch 'Conservative Bounds and Exploration Constraints' applies lower confidence bounds during inference scaling but does not focus on prediction error as the uncertainty signal. Adjacent branches include 'Ensemble-Based Overoptimization Mitigation' and 'Bayesian Reward Modeling,' which aggregate multiple models or quantify uncertainty probabilistically rather than penalizing single-model prediction variance. The paper's position suggests it offers a computationally lighter alternative to ensemble methods while remaining distinct from architectural improvements like hidden state regularization.

Across the thirty candidates examined (ten per contribution), each of the three contributions has at least one candidate that appears to provide overlapping prior work: one of the ten candidates retrieved for the 'caution' mechanism itself, one for the dual curiosity-caution relationship for out-of-distribution detection, and one for the theoretical analysis. Within the limited search scope, some aspects of the approach therefore have precedent, though nine of the ten candidates per contribution did not clearly refute the claims. These statistics suggest moderate novelty: the core ideas are not entirely unprecedented, but substantial gaps remain in the examined literature.

Given the limited search scope of thirty semantically similar papers, the analysis captures nearby work but cannot claim exhaustive coverage. The sparse taxonomy leaf and the nine non-refuting candidates per contribution suggest that pessimism via prediction error penalization may offer an incremental advance over existing methods. However, the presence of refutable candidates indicates that key conceptual elements (pessimistic reward adjustment, uncertainty-based penalization, or theoretical guarantees) have appeared in prior work, warranting careful positioning relative to the sibling paper and the related ensemble and Bayesian approaches.

Taxonomy

Core-task Taxonomy Papers: 13
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: Mitigating reward hacking in Best-of-N sampling with pessimistic reward estimation. The field addresses a fundamental challenge in aligning large language models at inference time: when selecting the best response from N candidates using a learned reward model, overoptimization can lead to reward hacking, where high-scoring outputs exploit the reward model's weaknesses rather than genuinely satisfying user intent.

The taxonomy organizes solutions into several main branches. Pessimistic and Conservative Reward Estimation Methods focus on downweighting uncertain or potentially overestimated rewards, often by penalizing prediction errors or incorporating uncertainty estimates. Reward Model Ensemble and Aggregation Techniques combine multiple reward signals to reduce reliance on any single model's biases, as seen in works like Reward model ensembles help[2] and Bayesian reward models for[4]. Reward Model Architecture and Training Improvements seek to build more robust reward functions from the ground up, including approaches like Regularizing hidden states enables[1] and Agentic Reward Modeling[9]. Theoretical Analysis and Comparative Evaluation of Inference-Time Alignment provides formal understanding and empirical comparisons, exemplified by Is best-of-n the best[3] and Optimal Stopping vs Best-of-[10]. Finally, Analogous Decision-Making Models from Other Domains draws on related frameworks such as Model of the best-of-[8].

Within this landscape, a particularly active line of work explores how to inject conservatism directly into reward scoring to counteract overoptimization. From Curiosity to Caution[0] sits squarely in the Pessimism via Prediction Error Penalization cluster, alongside Learning a Pessimistic Reward[12]; both penalize outputs where the reward model exhibits high uncertainty or prediction variance. This contrasts with ensemble-based strategies like those in Reward model ensembles help[2], which aggregate multiple models rather than modifying a single reward signal. Another nearby direction, represented by SAFFRON-1[5] and related work, emphasizes architectural or training-time interventions to improve reward model calibration.

The central trade-off across these branches is between computational overhead (ensembles and sophisticated architectures can be expensive) and the degree of conservatism introduced, with pessimistic methods offering a lightweight alternative that directly targets the regions of reward space most prone to hacking.

Claimed Contributions

Caution: a pessimistic reward estimation approach for Best-of-N sampling

The authors introduce caution, an inference-time method that applies pessimism to reward estimation by penalizing out-of-distribution responses using prediction error from a trained error model. This approach mitigates reward hacking in Best-of-N sampling by subtracting per-response uncertainty estimates from reward model scores.

10 retrieved papers
Can Refute
Dual relationship between curiosity and caution for OOD detection

The authors establish that caution is conceptually dual to curiosity-based exploration methods. While curiosity rewards prediction error to encourage exploration, caution penalizes prediction error to avoid uncertain out-of-distribution responses, providing a new perspective on using curiosity-style techniques for pessimistic policy learning.

10 retrieved papers
Can Refute
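The claimed curiosity/caution duality can be illustrated with a small RND-style example: a predictor is fit to a frozen random network on typical inputs only, so its prediction error stays low in-distribution and grows out-of-distribution. Curiosity would add this error to the reward as a novelty bonus; caution subtracts it as an uncertainty penalty. A sketch under these simplifying assumptions (linear predictor, tanh target network; all details hypothetical, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
D, K = 8, 4  # input and embedding dimensions (toy sizes)

# Frozen random nonlinear "target" network, as in RND; never trained.
W_target = rng.normal(size=(K, D)) / np.sqrt(D)

def target(x):
    return np.tanh(W_target @ x)

# Fit a linear predictor to the target on typical (in-distribution) inputs.
X_train = rng.normal(size=(256, D))
Y_train = np.tanh(X_train @ W_target.T)
B, *_ = np.linalg.lstsq(X_train, Y_train, rcond=None)
W_pred = B.T

def prediction_error(x):
    # Curiosity would ADD this quantity to a reward (novelty bonus);
    # caution SUBTRACTS it (distributional-uncertainty penalty).
    return float(np.sum((target(x) - W_pred @ x) ** 2))

# Average error on typical inputs vs. inputs far from the training distribution.
err_typical = np.mean([prediction_error(x) for x in rng.normal(size=(100, D))])
err_ood = np.mean([prediction_error(10.0 * x) for x in rng.normal(size=(100, D))])
```

Here `err_ood` exceeds `err_typical`: the same prediction-error signal that curiosity treats as novelty serves caution as an OOD detector.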
Theoretical analysis proving caution improves over standard Best-of-N

The authors provide a theoretical guarantee in a simplified linear setting demonstrating that their caution-regularized reward estimate leads to provably better performance than standard Best-of-N sampling, while also establishing the first theoretical validation of curiosity-style methods for out-of-distribution detection.

10 retrieved papers
Can Refute
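In simplified linear settings, pessimistic reward estimates of this kind typically take a lower-confidence-bound form. The sketch below is the standard construction from the linear-bandit/offline-RL literature, shown for orientation only; it is not necessarily the paper's exact statement:

```latex
% Linear reward model with features \phi and estimate \hat\theta fit on
% data whose regularized feature covariance is \Sigma_n:
\hat r(x) = \hat\theta^{\top}\phi(x), \qquad
\hat r_{\mathrm{pess}}(x) = \hat\theta^{\top}\phi(x)
  - \beta \sqrt{\phi(x)^{\top} \Sigma_n^{-1} \phi(x)}.
% BoN then selects \arg\max over the N candidates of \hat r_{\mathrm{pess}},
% so candidates with poorly covered features are penalized.
```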

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Caution: a pessimistic reward estimation approach for Best-of-N sampling


Contribution

Dual relationship between curiosity and caution for OOD detection


Contribution

Theoretical analysis proving caution improves over standard Best-of-N
