Sample Lottery: Unsupervised Discovery of Critical Instances for LLM Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Large Language Model; Reinforcement Learning with Verifiable Reward
Abstract:

Reinforcement Learning with Verifiable Reward (RLVR) has equipped large language models (LLMs) with the capability of reasoning over complicated logical problems through policy optimization. However, conventional methods require complete annotation of the entire dataset and allocate computation uniformly over all samples. We articulate the lottery sample hypothesis in policy optimization of LLMs: a large training set contains a small subset that, when trained alone, yields performance comparable to that of the full dataset. This paper therefore explores the following question: How can we identify these lottery-winning samples from the original dataset without access to answers? Unlike prior efforts that analyze the effect of different samples in the training set with complete annotation, this paper focuses on the unsupervised discovery of critical instances for LLM reasoning and proposes a novel framework termed Complementary Conformal Selection (CONST). Specifically, CONST evaluates the importance of samples by considering two complementary components: procedural volatility and outcome volatility. Procedural volatility measures the potential variations during the LLM’s reasoning process, while outcome volatility captures inconsistencies in the final answer. Subsequently, conformal prediction is used to obtain a prediction set whose cardinality serves as the criterion for selecting the lottery-winning samples for annotation. We also provide a theoretical analysis, showing that CONST can effectively approximate the optimal policy. Extensive experiments on various LLMs across different datasets demonstrate the effectiveness of CONST.
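As a rough illustration of the two volatility signals described in the abstract, the following is a minimal sketch under assumptions: the paper's exact definitions are not reproduced here, and the entropy/standard-deviation choices, the function names, and the equal-weight combination are all hypothetical stand-ins.

```python
import math
from collections import Counter

def outcome_volatility(answers):
    """Shannon entropy (in nats) of the final-answer distribution across
    sampled generations; higher means the answers disagree more."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def procedural_volatility(trace_lengths):
    """Standard deviation of reasoning-trace lengths across generations,
    a crude proxy for variation in the reasoning process itself."""
    n = len(trace_lengths)
    mean = sum(trace_lengths) / n
    return math.sqrt(sum((x - mean) ** 2 for x in trace_lengths) / n)

def volatility_score(answers, trace_lengths, alpha=0.5):
    """Combine the two complementary signals into one importance score."""
    return alpha * outcome_volatility(answers) + (1 - alpha) * procedural_volatility(trace_lengths)
```

Under this sketch, a question whose sampled generations all agree on the final answer and follow similarly sized reasoning traces scores near zero, i.e. it is a poor candidate for the lottery-winning subset.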

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces the lottery sample hypothesis for reinforcement learning with verifiable reward in LLMs, proposing that a small subset of training instances can yield performance comparable to the full dataset. It resides in the 'Critical Instance Discovery for Policy Optimization' leaf, which contains only two papers total. This leaf sits within the broader 'Instance Selection and Sample Efficiency' branch, indicating a relatively sparse research direction focused specifically on unsupervised identification of high-value training samples through volatility or uncertainty measures. The small population suggests this is an emerging rather than saturated area.

The taxonomy reveals that neighboring work primarily explores alternative strategies for improving LLM efficiency. The sibling leaf 'In-Context Example Mining' addresses demonstration selection for few-shot learning, while adjacent branches tackle knowledge integration through retrieval-augmented inference and internal representation manipulation via activation-based intervention. The scope note explicitly excludes supervised selection methods and general active learning, positioning this work at the intersection of policy optimization and unsupervised sample prioritization. This boundary clarifies that the contribution targets training-time instance discovery rather than inference-time retrieval or supervised curation.

Among the three contributions analyzed, the lottery sample hypothesis examined ten candidates and found one potentially refutable prior work, suggesting some overlap in the conceptual foundation. The CONST framework examined only two candidates with no clear refutations, indicating limited direct precedent within the search scope. The theoretical analysis component examined ten candidates without refutations. These statistics reflect a search of twenty-two total candidates, not an exhaustive literature review. The framework and theoretical components appear more novel within this limited examination, while the core hypothesis shows measurable prior exploration.

Based on the top-22 semantic matches examined, the work appears to occupy a sparsely populated research direction with modest prior overlap on its foundational hypothesis but less precedent for its specific framework. The limited search scope means potentially relevant work outside these candidates remains unexamined. The taxonomy structure suggests the paper addresses an underexplored intersection of policy optimization and unsupervised instance selection, though the single refutable finding indicates the core motivation has some established grounding.

Taxonomy

Core-task taxonomy papers: 20
Claimed contributions: 3
Contribution candidate papers compared: 22
Refutable papers: 1

Research Landscape Overview

Core task: unsupervised discovery of critical instances for LLM reasoning. The field organizes around five main branches that reflect different strategies for improving LLM performance without extensive labeled data. Instance Selection and Sample Efficiency focuses on identifying high-value training or inference examples, often through methods that prioritize informative samples or optimize policy learning. Knowledge Integration and Retrieval emphasizes augmenting models with external information sources, such as retrieval-based approaches like Rethinking with Retrieval[1] that dynamically incorporate relevant context. Internal Representation Analysis and Manipulation examines how models encode and process information internally, enabling interventions such as Inference Time Intervention[4] or probing latent structures as in Bias in Latent Space[10]. Unsupervised Structure and Pattern Extraction targets the automatic discovery of regularities, relations, or templates from unstructured data, exemplified by works like API Relations Discovery[5] and Unsupervised Log Parser[7]. Finally, Domain-Specific Unsupervised Applications adapt these techniques to specialized settings, ranging from creative tasks like Creative Analogy Mining[11] to trajectory analysis in HiCoTraj[19].

A particularly active line of work within Instance Selection explores how to identify critical training instances that disproportionately influence model behavior, balancing sample efficiency with policy optimization. Sample Lottery[0] sits squarely in this space, proposing mechanisms to discover instances that act as pivotal learning signals for reasoning tasks. This emphasis contrasts with nearby efforts like Confidence Dilution[15], which examines how instance selection interacts with model uncertainty, and RLZero[8], which integrates reinforcement learning to refine instance prioritization.

Meanwhile, works in Unsupervised Structure and Pattern Extraction, such as Unsupervised Distractor Generation[3], highlight complementary challenges in mining structural patterns without supervision. The central tension across these branches revolves around whether to focus on curating better training data, leveraging external knowledge, or directly manipulating internal model states, each offering distinct trade-offs in interpretability, scalability, and domain adaptability.

Claimed Contributions

Lottery sample hypothesis for RLVR on LLMs

The authors formalize a hypothesis stating that a small subset of training samples can achieve performance comparable to the full dataset when used alone for reinforcement learning with verifiable reward on large language models. This hypothesis motivates the unsupervised discovery of critical instances.

10 retrieved papers
Can Refute
Complementary Conformal Selection (CONST) framework

The authors propose CONST, a novel framework that identifies critical training instances without requiring ground truth annotations. CONST evaluates sample importance using procedural volatility and outcome volatility, then applies conformal prediction to select lottery-winning samples for annotation and optimization.

2 retrieved papers
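The conformal-prediction step of CONST can be sketched as a standard split-conformal procedure, where prediction-set cardinality serves as the selection criterion. This is a sketch under assumptions: the nonconformity scores, the candidate-answer dictionaries, and all function names here are hypothetical, and the paper's actual calibration scheme may differ.

```python
import math

def conformal_quantile(cal_scores, alpha=0.1):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest
    nonconformity score from a held-out calibration set."""
    n = len(cal_scores)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    return sorted(cal_scores)[k - 1]

def prediction_set(candidate_scores, qhat):
    """All candidate answers whose nonconformity score falls below the
    calibrated threshold; a large set signals an uncertain question."""
    return [ans for ans, s in candidate_scores.items() if s <= qhat]

def select_lottery_samples(candidates_by_question, cal_scores, budget, alpha=0.1):
    """Rank unlabeled questions by prediction-set cardinality and keep
    the `budget` most uncertain ones for annotation."""
    qhat = conformal_quantile(cal_scores, alpha)
    sizes = {q: len(prediction_set(scores, qhat))
             for q, scores in candidates_by_question.items()}
    return sorted(sizes, key=sizes.get, reverse=True)[:budget]
```

In this sketch, a question whose prediction set retains many candidate answers is treated as harder to resolve without supervision, and is therefore prioritized for labeling.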
Theoretical analysis of CONST approximating optimal policy

The authors establish a generalization bound showing that, under the lottery sample hypothesis and certain standard assumptions, CONST effectively approximates the optimal policy parameters given a sufficiently large question dataset and verifiable rewards.

10 retrieved papers
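The report does not reproduce the bound itself. Purely as an illustration of the shape such a generalization guarantee typically takes, and emphatically not the paper's actual theorem, one might write something of the following form, where $J(\cdot)$, $\varepsilon_{\mathrm{sel}}$, $C$, $n$, and $\delta$ are all placeholder symbols:

```latex
% Illustrative shape only -- not the paper's actual theorem.
% J(.): expected verifiable reward of a policy; \pi^{*}: the optimal policy;
% \hat{\pi}: the policy trained on the CONST-selected subset of n questions.
J(\pi^{*}) - J(\hat{\pi})
  \;\le\; \varepsilon_{\mathrm{sel}}
  \;+\; C \sqrt{\frac{\log(1/\delta)}{n}}
  \qquad \text{with probability at least } 1 - \delta
```

Here $\varepsilon_{\mathrm{sel}}$ would account for the bias of training only on the selected subset, and the second term is the usual finite-sample deviation that vanishes as the question dataset grows, consistent with the paper's claim that the approximation holds for sufficiently large datasets.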

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Lottery sample hypothesis for RLVR on LLMs

The authors formalize a hypothesis stating that a small subset of training samples can achieve performance comparable to the full dataset when used alone for reinforcement learning with verifiable reward on large language models. This hypothesis motivates the unsupervised discovery of critical instances.

Contribution

Complementary Conformal Selection (CONST) framework

The authors propose CONST, a novel framework that identifies critical training instances without requiring ground truth annotations. CONST evaluates sample importance using procedural volatility and outcome volatility, then applies conformal prediction to select lottery-winning samples for annotation and optimization.

Contribution

Theoretical analysis of CONST approximating optimal policy

The authors establish a generalization bound showing that, under the lottery sample hypothesis and certain standard assumptions, CONST effectively approximates the optimal policy parameters given a sufficiently large question dataset and verifiable rewards.