Learn More with Less: Uncertainty Consistency Guided Query Selection for RLVR

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Reinforcement learning; Large Language Model; Active Learning; Reasoning
Abstract:

Large Language Models (LLMs) have recently improved mathematical reasoning through Reinforcement Learning with Verifiable Reward (RLVR). However, existing RLVR algorithms require large query budgets, making annotation costly. We investigate whether fewer but more informative queries can yield similar or superior performance, introducing active learning (AL) into RLVR. We identify that classic AL sampling strategies fail to outperform random selection in this setting because they select only by subjective uncertainty and ignore objective uncertainty. This work proposes an uncertainty consistency metric to evaluate how well subjective uncertainty aligns with objective uncertainty. In the offline setting, this alignment is measured using the Point-Biserial Correlation Coefficient (PBC). For online training, PBC is difficult to estimate because of limited sampling and dynamically shifting output distributions; we therefore introduce an online variant computed from normalized advantage and subjective uncertainty. Theoretically, we prove that the online variant is strictly negatively correlated with offline PBC and supports better sample selection. Experiments show our method consistently outperforms random and classic AL baselines, achieving full-dataset performance while training on only 30% of the data, effectively reducing the cost of RLVR for reasoning tasks. The code is available at https://anonymous.4open.science/r/uncertainty-consistency-235C.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces an uncertainty consistency metric to guide query selection in reinforcement learning with verifiable rewards, specifically targeting mathematical reasoning tasks. Within the taxonomy, it occupies the 'Active Learning and Uncertainty-Based Query Selection' leaf under 'Data Selection and Query Strategies for RLVR'. Notably, this leaf contains only the original paper itself—no sibling papers appear in this category. This positioning suggests the work addresses a relatively sparse research direction within the broader RLVR landscape, which comprises fifty papers across approximately thirty-six distinct topics.

The taxonomy reveals that neighboring research directions focus on complementary aspects of data efficiency. The sibling leaf 'Data Curation and Sample Filtering Strategies' contains three papers addressing diversity and difficulty-based filtering, while the parent branch connects to 'Core RLVR Frameworks' with five papers on policy optimization and three on exploration mechanisms. The scope note explicitly distinguishes active learning approaches from random or static selection methods and from exploration during policy rollout. This structural context indicates the paper bridges a gap between general RLVR training dynamics and principled data selection, occupying territory that existing work has not extensively explored.

Among thirty candidates examined through semantic search and citation expansion, none were identified as clearly refuting any of the three main contributions. For the uncertainty consistency metric, ten candidates were examined with zero refutable matches; the same held for the theoretical analysis of the online variant and for the overall active learning framework. This absence of overlapping prior work across all contributions suggests that the specific combination of uncertainty alignment measurement, online adaptation via normalized advantage, and theoretical correlation guarantees represents a novel synthesis. However, this assessment reflects the limited search scope rather than exhaustive coverage of all potentially relevant literature.

The analysis indicates the work introduces genuinely new concepts within its immediate research area, particularly given the unpopulated taxonomy leaf and lack of refuting candidates among thirty examined papers. The theoretical grounding connecting offline and online uncertainty metrics appears distinctive. Nonetheless, the limited search scale and the broader RLVR field's rapid evolution mean that related ideas in adjacent domains—such as uncertainty quantification in general active learning or reward modeling—may exist outside the examined candidate set.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: query selection for reinforcement learning with verifiable reward. This emerging field addresses how agents can efficiently learn from environments where outcomes can be automatically verified—such as code execution, mathematical proofs, or game rules—by strategically choosing which queries or training examples to explore.

The taxonomy reveals a rich structure spanning ten major branches. Core RLVR Frameworks and Algorithmic Foundations establish baseline methods for leveraging verifiable signals, while Data Selection and Query Strategies for RLVR focus on intelligent sampling and active learning to maximize training efficiency. Adaptive and Procedural Training Environments generate diverse problem instances, and Process-Level Supervision and Intermediate Reasoning inject finer-grained feedback beyond terminal rewards. Extensions Beyond Standard Verifiable Domains and Domain-Specific RLVR Applications explore how these ideas generalize to less structured tasks and specialized settings like embodied agents or visual reasoning. Verification and Safety Mechanisms ensure robustness, Theoretical Analysis and Empirical Evaluation provide principled understanding, Reward Function Structure and Exploitation studies how to best use verifiable signals, and Self-Improvement and Self-Reward Mechanisms enable agents to bootstrap their own learning.

Several active lines of work highlight key trade-offs in the field. One central question is how to balance exploration breadth with sample efficiency: methods like Reasoning Gym[1] and Trust But Verify[3] emphasize procedurally generated curricula and verification-driven filtering, while others such as Divergence Choice[4] and Annotation-Free Query Rewriting[5] focus on selecting high-value queries without extensive annotation. Another contrast appears between approaches that rely purely on outcome verification versus those incorporating intermediate process supervision, as seen in works like RLVE[6] and RLVMR[8].
Within this landscape, Uncertainty Consistency Query[0] sits naturally among active learning and uncertainty-based query selection strategies, emphasizing principled uncertainty estimation to guide which queries merit exploration. Compared to nearby efforts like Divergence Choice[4], which prioritizes distributional divergence, or Trust But Verify[3], which leans on verification as a filter, Uncertainty Consistency Query[0] offers a complementary angle by leveraging consistency across model predictions to identify informative training instances.

Claimed Contributions

Uncertainty consistency metric for query selection in RLVR

The authors introduce a metric that measures the alignment between subjective uncertainty (model perplexity) and objective uncertainty (correctness). In the offline setting, this is measured using the Point-Biserial Correlation Coefficient (PBC), while in the online setting, a new variant is computed from normalized advantage and subjective uncertainty.

10 retrieved papers
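The offline alignment described above can be sketched in a few lines. This is a minimal illustration, assuming per-query perplexity as the subjective uncertainty score and a 0/1 correctness label as the objective one; the function name and the degenerate-case handling are ours, not the paper's.

```python
import numpy as np

def point_biserial(uncertainty, correct):
    """Point-biserial correlation between a continuous score
    (subjective uncertainty, e.g. perplexity) and a binary outcome
    (objective correctness). Equivalent to Pearson correlation with
    the binary variable coded as 0/1."""
    u = np.asarray(uncertainty, dtype=float)
    c = np.asarray(correct, dtype=bool)
    p = c.mean()          # fraction of correct answers
    q = 1.0 - p
    s = u.std()           # population std of the continuous score
    if s == 0 or p in (0.0, 1.0):
        return 0.0        # degenerate: no variation to correlate
    # mean uncertainty of correct vs. incorrect groups
    return float((u[c].mean() - u[~c].mean()) / s * np.sqrt(p * q))
```

A well-calibrated model (high uncertainty on questions it gets wrong) yields a strongly negative PBC, e.g. `point_biserial([1, 2, 3, 4], [1, 1, 0, 0])` is about -0.894.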
Theoretical analysis of online uncertainty consistency metric

The authors provide theoretical proofs showing that their online uncertainty consistency metric is strictly negatively correlated with the offline PBC metric and that maximizing the online metric is equivalent to maximizing the decrease in sample uncertainty under mild conditions.

10 retrieved papers
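The paper's exact online formula is not reproduced in this report. As a purely illustrative pairing of the two ingredients named above — GRPO-style normalized advantage over a group of rollouts and per-rollout subjective uncertainty — one could write the following sketch; the aggregation by mean product is our assumption, not the paper's definition.

```python
import numpy as np

def online_consistency(perplexities, rewards):
    """Illustrative combination of the two named ingredients for one
    query: `rewards` are verifiable 0/1 outcomes of a group of
    rollouts, `perplexities` their subjective uncertainties. The
    advantage is normalized within the group; the returned value is
    the mean product of normalized advantage and uncertainty."""
    r = np.asarray(rewards, dtype=float)
    u = np.asarray(perplexities, dtype=float)
    std = r.std()
    if std == 0:
        return 0.0                      # all rollouts agree: no signal
    adv = (r - r.mean()) / std          # normalized advantage
    return float(np.mean(adv * u))
```

Under this sketch, high uncertainty on failing rollouts drives the value negative, consistent with the claimed strict negative correlation to the offline PBC.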
Active learning framework for RLVR with reduced data requirements

The authors introduce an active learning approach to RLVR that achieves full-dataset performance while training on only 30% of the data by selecting queries based on uncertainty consistency, effectively reducing the cost of RLVR for reasoning tasks.

10 retrieved papers
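The budgeted selection step implied by the 30% figure can be sketched as a simple top-k filter over per-query scores. The function below is a hypothetical helper under the assumption that higher uncertainty-consistency scores mark more informative queries; the 0.3 default mirrors the paper's reported budget.

```python
import numpy as np

def select_queries(scores, budget_frac=0.3):
    """Return the (sorted) indices of the highest-scoring queries,
    keeping only a `budget_frac` fraction of the candidate pool."""
    scores = np.asarray(scores, dtype=float)
    k = max(1, int(len(scores) * budget_frac))
    # argsort ascending, then take the last k entries (top-k scores)
    return np.sort(np.argsort(scores)[-k:])
```

For example, with ten candidates and the default 30% budget, the three highest-scoring queries are kept and the rest are never annotated or rolled out.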

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, though still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Uncertainty consistency metric for query selection in RLVR


Contribution

Theoretical analysis of online uncertainty consistency metric


Contribution

Active learning framework for RLVR with reduced data requirements

