Learn More with Less: Uncertainty Consistency Guided Query Selection for RLVR

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Reinforcement learning; Large Language Model; Active Learning; Reasoning
Abstract:

Large Language Models (LLMs) have recently improved mathematical reasoning through Reinforcement Learning with Verifiable Reward (RLVR). However, existing RLVR algorithms require large query budgets, making annotation costly. We investigate whether fewer but more informative queries can yield similar or superior performance, introducing active learning (AL) into RLVR. We identify that classic AL sampling strategies fail to outperform random selection in this setting because they select only by subjective uncertainty and ignore objective uncertainty. This work proposes an uncertainty consistency metric to evaluate how well subjective uncertainty aligns with objective uncertainty. In the offline setting, this alignment is measured using the Point-Biserial Correlation Coefficient (PBC). For online training, PBC is difficult to estimate because of limited sampling and dynamically shifting output distributions; we therefore introduce an online variant computed from normalized advantage and subjective uncertainty. Theoretically, we prove that the online variant is strictly negatively correlated with offline PBC and supports better sample selection. Experiments show our method consistently outperforms random and classic AL baselines, achieving full-dataset performance while training on only 30% of the data, effectively reducing the cost of RLVR for reasoning tasks. The code is available at https://anonymous.4open.science/r/uncertainty-consistency-235C.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces an uncertainty consistency metric to guide query selection in reinforcement learning with verifiable rewards, specifically targeting mathematical reasoning tasks. Within the taxonomy, it occupies the 'Active Learning and Uncertainty-Based Query Selection' leaf under 'Data Selection and Query Strategies for RLVR'. Notably, this leaf contains only the original paper itself—no sibling papers appear in this category. This positioning suggests the work addresses a relatively sparse research direction within the broader RLVR landscape, which comprises fifty papers across approximately thirty-six distinct topics.

The taxonomy reveals that neighboring research directions focus on complementary aspects of data efficiency. The sibling leaf 'Data Curation and Sample Filtering Strategies' contains three papers addressing diversity and difficulty-based filtering, while the parent branch connects to 'Core RLVR Frameworks' with five papers on policy optimization and three on exploration mechanisms. The scope note explicitly distinguishes active learning approaches from random or static selection methods and from exploration during policy rollout. This structural context indicates the paper bridges a gap between general RLVR training dynamics and principled data selection, occupying territory that existing work has not extensively explored.

Among thirty candidates examined through semantic search and citation expansion, none were identified as clearly refuting any of the three main contributions. For the uncertainty consistency metric, ten candidates were examined with zero refutable matches; the same held for the theoretical analysis of the online variant and for the overall active learning framework. This absence of overlapping prior work across all contributions suggests that the specific combination of uncertainty alignment measurement, online adaptation via normalized advantage, and theoretical correlation guarantees represents a novel synthesis. However, this assessment reflects the limited search scope rather than exhaustive coverage of all potentially relevant literature.

The analysis indicates the work introduces genuinely new concepts within its immediate research area, particularly given the unpopulated taxonomy leaf and lack of refuting candidates among thirty examined papers. The theoretical grounding connecting offline and online uncertainty metrics appears distinctive. Nonetheless, the limited search scale and the broader RLVR field's rapid evolution mean that related ideas in adjacent domains—such as uncertainty quantification in general active learning or reward modeling—may exist outside the examined candidate set.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: query selection for reinforcement learning with verifiable reward. This emerging field addresses how agents can efficiently learn from environments where outcomes can be automatically verified—such as code execution, mathematical proofs, or game rules—by strategically choosing which queries or training examples to explore.

The taxonomy reveals a rich structure spanning ten major branches. Core RLVR Frameworks and Algorithmic Foundations establish baseline methods for leveraging verifiable signals, while Data Selection and Query Strategies for RLVR focus on intelligent sampling and active learning to maximize training efficiency. Adaptive and Procedural Training Environments generate diverse problem instances, and Process-Level Supervision and Intermediate Reasoning inject finer-grained feedback beyond terminal rewards. Extensions Beyond Standard Verifiable Domains and Domain-Specific RLVR Applications explore how these ideas generalize to less structured tasks and specialized settings like embodied agents or visual reasoning. Verification and Safety Mechanisms ensure robustness, Theoretical Analysis and Empirical Evaluation provide principled understanding, Reward Function Structure and Exploitation studies how to best use verifiable signals, and Self-Improvement and Self-Reward Mechanisms enable agents to bootstrap their own learning.

Several active lines of work highlight key trade-offs in the field. One central question is how to balance exploration breadth with sample efficiency: methods like Reasoning Gym[1] and Trust But Verify[3] emphasize procedurally generated curricula and verification-driven filtering, while others such as Divergence Choice[4] and Annotation-Free Query Rewriting[5] focus on selecting high-value queries without extensive annotation. Another contrast appears between approaches that rely purely on outcome verification versus those incorporating intermediate process supervision, as seen in works like RLVE[6] and RLVMR[8].
Within this landscape, Uncertainty Consistency Query[0] sits naturally among active learning and uncertainty-based query selection strategies, emphasizing principled uncertainty estimation to guide which queries merit exploration. Compared to nearby efforts like Divergence Choice[4], which prioritizes distributional divergence, or Trust But Verify[3], which leans on verification as a filter, Uncertainty Consistency Query[0] offers a complementary angle by leveraging consistency across model predictions to identify informative training instances.

Claimed Contributions

Uncertainty consistency metric for query selection in RLVR

The authors introduce a metric that measures the alignment between subjective uncertainty (model perplexity) and objective uncertainty (correctness). In the offline setting, this is measured using the Point-Biserial Correlation Coefficient (PBC), while in the online setting, a new variant is computed from normalized advantage and subjective uncertainty.

10 retrieved papers
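The offline alignment described above can be sketched in a few lines. This is a minimal illustration, assuming per-query perplexity as the subjective uncertainty score and a 0/1 correctness label as the objective one; the function name and the degenerate-case handling are ours, not the paper's.

```python
import numpy as np

def point_biserial(uncertainty, correct):
    """Point-biserial correlation between a continuous score
    (subjective uncertainty, e.g. perplexity) and a binary outcome
    (objective correctness). Equivalent to Pearson correlation with
    the binary variable coded as 0/1."""
    u = np.asarray(uncertainty, dtype=float)
    c = np.asarray(correct, dtype=bool)
    p = c.mean()          # fraction of correct answers
    q = 1.0 - p
    s = u.std()           # population std of the continuous score
    if s == 0 or p in (0.0, 1.0):
        return 0.0        # degenerate: no variation to correlate
    # mean uncertainty of correct vs. incorrect groups
    return float((u[c].mean() - u[~c].mean()) / s * np.sqrt(p * q))
```

A well-calibrated model (high uncertainty on questions it gets wrong) yields a strongly negative PBC, e.g. `point_biserial([1, 2, 3, 4], [1, 1, 0, 0])` is about -0.894.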
Theoretical analysis of online uncertainty consistency metric

The authors provide theoretical proofs showing that their online uncertainty consistency metric is strictly negatively correlated with the offline PBC metric and that maximizing the online metric is equivalent to maximizing the decrease in sample uncertainty under mild conditions.

10 retrieved papers
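The paper's exact online formula is not reproduced in this report. As a purely illustrative pairing of the two ingredients named above — GRPO-style normalized advantage over a group of rollouts and per-rollout subjective uncertainty — one could write the following sketch; the aggregation by mean product is our assumption, not the paper's definition.

```python
import numpy as np

def online_consistency(perplexities, rewards):
    """Illustrative combination of the two named ingredients for one
    query: `rewards` are verifiable 0/1 outcomes of a group of
    rollouts, `perplexities` their subjective uncertainties. The
    advantage is normalized within the group; the returned value is
    the mean product of normalized advantage and uncertainty."""
    r = np.asarray(rewards, dtype=float)
    u = np.asarray(perplexities, dtype=float)
    std = r.std()
    if std == 0:
        return 0.0                      # all rollouts agree: no signal
    adv = (r - r.mean()) / std          # normalized advantage
    return float(np.mean(adv * u))
```

Under this sketch, high uncertainty on failing rollouts drives the value negative, consistent with the claimed strict negative correlation to the offline PBC.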
Active learning framework for RLVR with reduced data requirements

The authors introduce an active learning approach to RLVR that achieves full-dataset performance while training on only 30% of the data by selecting queries based on uncertainty consistency, effectively reducing the cost of RLVR for reasoning tasks.

10 retrieved papers
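The budgeted selection step implied by the 30% figure can be sketched as a simple top-k filter over per-query scores. The function below is a hypothetical helper under the assumption that higher uncertainty-consistency scores mark more informative queries; the 0.3 default mirrors the paper's reported budget.

```python
import numpy as np

def select_queries(scores, budget_frac=0.3):
    """Return the (sorted) indices of the highest-scoring queries,
    keeping only a `budget_frac` fraction of the candidate pool."""
    scores = np.asarray(scores, dtype=float)
    k = max(1, int(len(scores) * budget_frac))
    # argsort ascending, then take the last k entries (top-k scores)
    return np.sort(np.argsort(scores)[-k:])
```

For example, with ten candidates and the default 30% budget, the three highest-scoring queries are kept and the rest are never annotated or rolled out.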

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, though still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Uncertainty consistency metric for query selection in RLVR


Contribution

Theoretical analysis of online uncertainty consistency metric


Contribution

Active learning framework for RLVR with reduced data requirements

