Sample Lottery: Unsupervised Discovery of Critical Instances for LLM Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Large Language Model; Reinforcement Learning with Verifiable Reward
Abstract:

Reinforcement Learning with Verifiable Reward (RLVR) has equipped large language models (LLMs) with the capability of reasoning over complicated logical problems through policy optimization. However, conventional methods require complete annotation of the entire dataset and allocate computation uniformly over all samples. We articulate the lottery sample hypothesis in policy optimization of LLMs: a large training set contains a small subset that, when trained alone, yields performance comparable to that of the full dataset. This paper therefore explores the following question: How can we identify these lottery-winning samples from the original dataset without access to answers? Unlike prior efforts that analyze the effect of different samples in the training set with complete annotation, this paper focuses on the unsupervised discovery of critical instances for LLM reasoning and proposes a novel framework termed Complementary Conformal Selection (CONST). Specifically, CONST evaluates the importance of samples by considering two complementary components: procedural volatility and outcome volatility. Procedural volatility measures the potential variations during the LLM’s reasoning process, while outcome volatility captures inconsistencies in the final answer. Subsequently, conformal prediction is used to obtain a prediction set whose cardinality serves as the criterion for selecting the lottery-winning samples for annotation. We also provide a theoretical analysis, showing that CONST can effectively approximate the optimal policy. Extensive experiments on various LLMs across different datasets demonstrate the effectiveness of CONST.
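As a rough illustration of the two volatility signals described in the abstract, the following is a minimal sketch under assumptions: the paper's exact definitions are not reproduced here, and the entropy/standard-deviation choices, the function names, and the equal-weight combination are all hypothetical stand-ins.

```python
import math
from collections import Counter

def outcome_volatility(answers):
    """Shannon entropy (in nats) of the final-answer distribution across
    sampled generations; higher means the answers disagree more."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def procedural_volatility(trace_lengths):
    """Standard deviation of reasoning-trace lengths across generations,
    a crude proxy for variation in the reasoning process itself."""
    n = len(trace_lengths)
    mean = sum(trace_lengths) / n
    return math.sqrt(sum((x - mean) ** 2 for x in trace_lengths) / n)

def volatility_score(answers, trace_lengths, alpha=0.5):
    """Combine the two complementary signals into one importance score."""
    return alpha * outcome_volatility(answers) + (1 - alpha) * procedural_volatility(trace_lengths)
```

Under this sketch, a question whose sampled generations all agree on the final answer and follow similarly sized reasoning traces scores near zero, i.e. it is a poor candidate for the lottery-winning subset.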

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces the lottery sample hypothesis for reinforcement learning with verifiable reward in LLMs, proposing that a small subset of training instances can yield performance comparable to the full dataset. It resides in the 'Critical Instance Discovery for Policy Optimization' leaf, which contains only two papers total. This leaf sits within the broader 'Instance Selection and Sample Efficiency' branch, indicating a relatively sparse research direction focused specifically on unsupervised identification of high-value training samples through volatility or uncertainty measures. The small population suggests this is an emerging rather than saturated area.

The taxonomy reveals that neighboring work primarily explores alternative strategies for improving LLM efficiency. The sibling leaf 'In-Context Example Mining' addresses demonstration selection for few-shot learning, while adjacent branches tackle knowledge integration through retrieval-augmented inference and internal representation manipulation via activation-based intervention. The scope note explicitly excludes supervised selection methods and general active learning, positioning this work at the intersection of policy optimization and unsupervised sample prioritization. This boundary clarifies that the contribution targets training-time instance discovery rather than inference-time retrieval or supervised curation.

Among the three contributions analyzed, the lottery sample hypothesis examined ten candidates and found one potentially refutable prior work, suggesting some overlap in the conceptual foundation. The CONST framework examined only two candidates with no clear refutations, indicating limited direct precedent within the search scope. The theoretical analysis component examined ten candidates without refutations. These statistics reflect a search of twenty-two total candidates, not an exhaustive literature review. The framework and theoretical components appear more novel within this limited examination, while the core hypothesis shows measurable prior exploration.

Based on the top-22 semantic matches examined, the work appears to occupy a sparsely populated research direction with modest prior overlap on its foundational hypothesis but less precedent for its specific framework. The limited search scope means potentially relevant work outside these candidates remains unexamined. The taxonomy structure suggests the paper addresses an underexplored intersection of policy optimization and unsupervised instance selection, though the single refutable finding indicates the core motivation has some established grounding.

Taxonomy

Core-task taxonomy papers: 20
Claimed contributions: 3
Contribution candidate papers compared: 22
Refutable papers: 1

Research Landscape Overview

Core task: unsupervised discovery of critical instances for LLM reasoning. The field organizes around five main branches that reflect different strategies for improving LLM performance without extensive labeled data. Instance Selection and Sample Efficiency focuses on identifying high-value training or inference examples, often through methods that prioritize informative samples or optimize policy learning. Knowledge Integration and Retrieval emphasizes augmenting models with external information sources, such as retrieval-based approaches like Rethinking with Retrieval[1] that dynamically incorporate relevant context. Internal Representation Analysis and Manipulation examines how models encode and process information internally, enabling interventions such as Inference Time Intervention[4] or probing latent structures as in Bias in Latent Space[10]. Unsupervised Structure and Pattern Extraction targets the automatic discovery of regularities, relations, or templates from unstructured data, exemplified by works like API Relations Discovery[5] and Unsupervised Log Parser[7]. Finally, Domain-Specific Unsupervised Applications adapt these techniques to specialized settings, ranging from creative tasks like Creative Analogy Mining[11] to trajectory analysis in HiCoTraj[19].

A particularly active line of work within Instance Selection explores how to identify critical training instances that disproportionately influence model behavior, balancing sample efficiency with policy optimization. Sample Lottery[0] sits squarely in this space, proposing mechanisms to discover instances that act as pivotal learning signals for reasoning tasks. This emphasis contrasts with nearby efforts like Confidence Dilution[15], which examines how instance selection interacts with model uncertainty, and RLZero[8], which integrates reinforcement learning to refine instance prioritization.

Meanwhile, works in Unsupervised Structure and Pattern Extraction, such as Unsupervised Distractor Generation[3], highlight complementary challenges in mining structural patterns without supervision. The central tension across these branches revolves around whether to focus on curating better training data, leveraging external knowledge, or directly manipulating internal model states, each offering distinct trade-offs in interpretability, scalability, and domain adaptability.

Claimed Contributions

Lottery sample hypothesis for RLVR on LLMs

The authors formalize a hypothesis stating that a small subset of training samples can achieve performance comparable to the full dataset when used alone for reinforcement learning with verifiable reward on large language models. This hypothesis motivates the unsupervised discovery of critical instances.

10 retrieved papers
Can Refute
Complementary Conformal Selection (CONST) framework

The authors propose CONST, a novel framework that identifies critical training instances without requiring ground truth annotations. CONST evaluates sample importance using procedural volatility and outcome volatility, then applies conformal prediction to select lottery-winning samples for annotation and optimization.

2 retrieved papers
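The conformal-prediction step of CONST can be sketched as a standard split-conformal procedure, where prediction-set cardinality serves as the selection criterion. This is a sketch under assumptions: the nonconformity scores, the candidate-answer dictionaries, and all function names here are hypothetical, and the paper's actual calibration scheme may differ.

```python
import math

def conformal_quantile(cal_scores, alpha=0.1):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest
    nonconformity score from a held-out calibration set."""
    n = len(cal_scores)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    return sorted(cal_scores)[k - 1]

def prediction_set(candidate_scores, qhat):
    """All candidate answers whose nonconformity score falls below the
    calibrated threshold; a large set signals an uncertain question."""
    return [ans for ans, s in candidate_scores.items() if s <= qhat]

def select_lottery_samples(candidates_by_question, cal_scores, budget, alpha=0.1):
    """Rank unlabeled questions by prediction-set cardinality and keep
    the `budget` most uncertain ones for annotation."""
    qhat = conformal_quantile(cal_scores, alpha)
    sizes = {q: len(prediction_set(scores, qhat))
             for q, scores in candidates_by_question.items()}
    return sorted(sizes, key=sizes.get, reverse=True)[:budget]
```

In this sketch, a question whose prediction set retains many candidate answers is treated as harder to resolve without supervision, and is therefore prioritized for labeling.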
Theoretical analysis of CONST approximating optimal policy

The authors establish a generalization bound showing that, under the lottery sample hypothesis and certain standard assumptions, CONST effectively approximates the optimal policy parameters given a sufficiently large question dataset and verifiable rewards.

10 retrieved papers
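The report does not reproduce the bound itself. Purely as an illustration of the shape such a generalization guarantee typically takes, and emphatically not the paper's actual theorem, one might write something of the following form, where $J(\cdot)$, $\varepsilon_{\mathrm{sel}}$, $C$, $n$, and $\delta$ are all placeholder symbols:

```latex
% Illustrative shape only -- not the paper's actual theorem.
% J(.): expected verifiable reward of a policy; \pi^{*}: the optimal policy;
% \hat{\pi}: the policy trained on the CONST-selected subset of n questions.
J(\pi^{*}) - J(\hat{\pi})
  \;\le\; \varepsilon_{\mathrm{sel}}
  \;+\; C \sqrt{\frac{\log(1/\delta)}{n}}
  \qquad \text{with probability at least } 1 - \delta
```

Here $\varepsilon_{\mathrm{sel}}$ would account for the bias of training only on the selected subset, and the second term is the usual finite-sample deviation that vanishes as the question dataset grows, consistent with the paper's claim that the approximation holds for sufficiently large datasets.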

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Lottery sample hypothesis for RLVR on LLMs

The authors formalize a hypothesis stating that a small subset of training samples can achieve performance comparable to the full dataset when used alone for reinforcement learning with verifiable reward on large language models. This hypothesis motivates the unsupervised discovery of critical instances.

Contribution

Complementary Conformal Selection (CONST) framework

The authors propose CONST, a novel framework that identifies critical training instances without requiring ground truth annotations. CONST evaluates sample importance using procedural volatility and outcome volatility, then applies conformal prediction to select lottery-winning samples for annotation and optimization.

Contribution

Theoretical analysis of CONST approximating optimal policy

The authors establish a generalization bound showing that, under the lottery sample hypothesis and certain standard assumptions, CONST effectively approximates the optimal policy parameters given a sufficiently large question dataset and verifiable rewards.