Rewarding Doubt: A Reinforcement Learning Approach to Calibrated Confidence Expression of Large Language Models

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Confidence Calibration, Uncertainty Estimation, Large Language Models
Abstract:

A safe and trustworthy use of Large Language Models (LLMs) requires an accurate expression of confidence in their answers. We propose a novel Reinforcement Learning approach that directly fine-tunes LLMs to express calibrated confidence estimates alongside their answers to factual questions. Our method optimizes a reward based on the logarithmic scoring rule, explicitly penalizing both over- and under-confidence, which encourages the model to align its confidence estimates with its actual predictive accuracy. The optimal policy under our reward design yields perfectly calibrated confidence expressions. Unlike prior approaches that decouple confidence estimation from response generation, our method integrates confidence calibration seamlessly into the generative process of the LLM. Empirically, we demonstrate that models trained with our approach exhibit substantially improved calibration and generalize to unseen tasks without further fine-tuning, suggesting the emergence of a general confidence awareness. We provide our training and evaluation code in the supplementary material and will make it publicly available upon acceptance.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a reinforcement learning method to fine-tune LLMs for calibrated confidence expression using a logarithmic scoring rule reward. It resides in the 'Reinforcement Learning-Based Calibration' leaf, which contains only two papers: the paper under review and one sibling (Rewarding Doubt RL). This is a sparse direction within the broader taxonomy of 50 papers across 36 topics, suggesting that RL-based calibration remains relatively underexplored compared to supervised fine-tuning or post-hoc methods.

The taxonomy reveals neighboring approaches in sibling leaves: 'Fine-Tuning for Calibration' (four papers using supervised methods), 'Post-Hoc Calibration Methods' (three papers applying rescaling without parameter updates), and 'Prompting-Based Calibration' (three papers using strategic prompt design). The paper diverges from these by integrating confidence calibration directly into the generative process via RL rather than decoupling estimation from generation. Its position under 'Calibration Techniques and Training Methods' distinguishes it from elicitation-focused work in other branches that extract confidence without training interventions.

Among 30 candidates examined, the core RL approach shows substantial prior work: five of ten candidates can refute Contribution A. The logarithmic scoring rule reward (Contribution B) appears more distinctive, with only one refutable candidate among ten examined. Generalization to unseen tasks (Contribution C) faces moderate overlap, with four of ten candidates providing relevant prior work. The limited search scope means these statistics reflect top-K semantic matches and citation expansion, not exhaustive coverage of the field.

The analysis suggests mixed novelty: the RL framing has precedent in the sparse two-paper leaf, while the specific reward design and generalization claims show less direct overlap among examined candidates. The small leaf size indicates an emerging research direction, though the presence of a closely related sibling paper and multiple refutable candidates for the core contribution tempers claims of fundamental novelty within the limited search scope.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 30
- Refutable Papers: 10

Research Landscape Overview

Core task: Calibrated confidence expression in large language model responses. The field addresses how to ensure that when LLMs express uncertainty or confidence, these expressions align with actual correctness rates. The taxonomy organizes work into several major branches:

- Confidence Elicitation Methods explore how to extract uncertainty signals from models, e.g., verbalized probabilities (Verbalized Confidence Scores[25]) or consistency-based measures.
- Calibration Techniques and Training Methods develop approaches to improve alignment between expressed and true confidence, including supervised fine-tuning (Simple Supervised Uncertainty[12]), reinforcement learning strategies (Rewarding Doubt RL[34]), and post-hoc adjustments (Calibration Tuning[14]).
- Evaluation and Benchmarking establish metrics and datasets to measure calibration quality across domains such as question answering (QA Calibration[47]) and clinical settings (Clinical Confidence Benchmarking[32]).
- Theoretical Foundations examine the mathematical underpinnings and failure modes.
- Specialized Applications extend calibration to retrieval-augmented generation (RAG Uncertainty[33]) and biomedical contexts (Biomedical Calibration[20]).
- Foundational Uncertainty Quantification provides broader surveys and conceptual frameworks (Uncertainty Quantification Survey[1]).

Recent work reveals contrasting philosophies: some methods elicit natural language confidence statements and then calibrate them through training (Express Uncertainty[3], Just Ask Calibration[6]), while others leverage internal model signals such as token probabilities or sampling consistency (Self-Consistency Confidence[29]). A particularly active line uses reinforcement learning to directly reward well-calibrated doubt, as seen in Rewarding Doubt[0] and its closely related predecessor Rewarding Doubt RL[34], both of which frame calibration as an RL objective rather than a supervised regression problem.
This RL-based approach contrasts with simpler supervised methods (Simple Supervised Uncertainty[12]) that train on labeled correctness data, and with post-hoc recalibration techniques (Calibration Tuning[14]) that adjust outputs without retraining. Rewarding Doubt[0] sits squarely within the reinforcement learning calibration cluster, emphasizing adaptive learning of doubt expressions through reward signals, distinguishing it from static elicitation or one-shot tuning strategies prevalent elsewhere in the taxonomy.

Claimed Contributions

Reinforcement learning approach for calibrated confidence expression in LLMs

The authors introduce a reinforcement learning method that trains large language models to generate calibrated numerical confidence scores together with their answers. Unlike prior work that decouples confidence estimation from generation, this approach integrates confidence calibration seamlessly into the LLM's generative process.
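To make the integration concrete, the generate-then-score loop can be sketched as a REINFORCE-style update. This is an illustrative sketch, not the paper's implementation: the `Confidence: 0.8` output format, the `sample_fn`/`logprob_grad_fn`/`is_correct_fn` hooks, and the plain policy-gradient step are all assumptions (the actual training setup may use a different RL algorithm, e.g. PPO).

```python
import math

def parse_confidence(generated: str) -> float:
    """Extract the trailing numeric confidence from an output such as
    'Paris. Confidence: 0.8' (a hypothetical output format)."""
    return float(generated.rsplit("Confidence:", 1)[1].strip())

def log_score_reward(confidence: float, correct: bool, eps: float = 1e-6) -> float:
    """Logarithmic scoring rule: log(c) if correct, log(1 - c) otherwise."""
    c = min(max(confidence, eps), 1.0 - eps)  # clip away from 0 and 1
    return math.log(c) if correct else math.log(1.0 - c)

def reinforce_step(sample_fn, logprob_grad_fn, is_correct_fn, question):
    """One REINFORCE-style update: sample a response carrying both an answer
    and a confidence, score it, and scale the gradient of the sequence
    log-likelihood by the reward (a stochastic policy-gradient estimate)."""
    generated = sample_fn(question)
    r = log_score_reward(parse_confidence(generated),
                         is_correct_fn(question, generated))
    return [r * g for g in logprob_grad_fn(question, generated)]

# Stubbed-out model for illustration: the "gradient" is a fixed vector.
grads = reinforce_step(
    sample_fn=lambda q: "Paris. Confidence: 0.8",
    logprob_grad_fn=lambda q, g: [1.0, -0.5],  # stand-in for grad log pi(g | q)
    is_correct_fn=lambda q, g: g.startswith("Paris"),
    question="What is the capital of France?",
)
```

Because answer and confidence are emitted in a single generated sequence, the same gradient signal shapes both, which is what distinguishes this from pipelines that train a separate confidence head.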

Retrieved papers: 10 · Verdict: Can Refute
Reward function based on logarithmic scoring rule

The authors design a reward function using the logarithmic scoring rule that penalizes overconfidence and underconfidence. This proper scoring rule ensures that the optimal policy results in perfectly calibrated confidence expressions, as the expected reward is maximized when predicted confidence equals the true epistemic probability.
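The propriety claim above can be checked numerically. A minimal sketch (the function names are illustrative, not the paper's): the expected log-score reward over a confidence grid peaks exactly where the stated confidence equals the true accuracy.

```python
import math

def log_score_reward(confidence: float, correct: bool, eps: float = 1e-6) -> float:
    """log(c) on a correct answer, log(1 - c) on an incorrect one."""
    c = min(max(confidence, eps), 1.0 - eps)  # clip to avoid log(0)
    return math.log(c) if correct else math.log(1.0 - c)

def expected_reward(confidence: float, p_correct: float) -> float:
    """Expected reward when the answer is correct with true probability p_correct."""
    return (p_correct * log_score_reward(confidence, True)
            + (1.0 - p_correct) * log_score_reward(confidence, False))

# Propriety of the log score: with true accuracy 0.7, the grid maximizer
# of the expected reward is the calibrated confidence 0.7.
p_true = 0.7
grid = [i / 100 for i in range(1, 100)]
best = max(grid, key=lambda c: expected_reward(c, p_true))  # best == 0.7
```

Setting the derivative of p*log(c) + (1-p)*log(1-c) to zero gives c = p, which is why the optimal policy under this reward is perfectly calibrated.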

Retrieved papers: 10 · Verdict: Can Refute
Generalization of confidence awareness to unseen tasks

The authors demonstrate that their method enables models to generalize their learned confidence calibration abilities to out-of-domain datasets without additional fine-tuning. This suggests the model develops a general awareness of its own uncertainty rather than task-specific calibration patterns.

Retrieved papers: 10 · Verdict: Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Reinforcement learning approach for calibrated confidence expression in LLMs

Contribution 2: Reward function based on logarithmic scoring rule

Contribution 3: Generalization of confidence awareness to unseen tasks
