Sample Lottery: Unsupervised Discovery of Critical Instances for LLM Reasoning
Overview
Overall Novelty Assessment
The paper introduces the lottery sample hypothesis for reinforcement learning with verifiable rewards (RLVR) in LLMs, proposing that a small subset of training instances can yield performance comparable to the full dataset. It resides in the 'Critical Instance Discovery for Policy Optimization' leaf, which contains only two papers in total. This leaf sits within the broader 'Instance Selection and Sample Efficiency' branch, indicating a relatively sparse research direction focused on unsupervised identification of high-value training samples through volatility or uncertainty measures. The small population suggests an emerging rather than saturated area.
The taxonomy reveals that neighboring work primarily explores alternative strategies for improving LLM efficiency. The sibling leaf 'In-Context Example Mining' addresses demonstration selection for few-shot learning, while adjacent branches tackle knowledge integration through retrieval-augmented inference and internal representation manipulation via activation-based intervention. The scope note explicitly excludes supervised selection methods and general active learning, positioning this work at the intersection of policy optimization and unsupervised sample prioritization. This boundary clarifies that the contribution targets training-time instance discovery rather than inference-time retrieval or supervised curation.
Among the three contributions analyzed, the lottery sample hypothesis was compared against ten candidates, one of which potentially refutes its novelty, suggesting some overlap in the conceptual foundation. The CONST framework was compared against only two candidates with no clear refutations, indicating limited direct precedent within the search scope. The theoretical analysis component was compared against ten candidates without refutations. These statistics reflect a search over twenty-two candidates in total, not an exhaustive literature review. The framework and theoretical components appear more novel within this limited examination, while the core hypothesis shows measurable prior exploration.
Based on the top-22 semantic matches examined, the work appears to occupy a sparsely populated research direction, with modest prior overlap on its foundational hypothesis but less precedent for its specific framework. The limited search scope means potentially relevant work outside these candidates remains unexamined. The taxonomy structure suggests the paper addresses an underexplored intersection of policy optimization and unsupervised instance selection, though the single potentially refuting match indicates the core motivation has some established grounding.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors formalize a hypothesis stating that a small subset of training samples can achieve performance comparable to the full dataset when used alone for reinforcement learning with verifiable reward on large language models. This hypothesis motivates the unsupervised discovery of critical instances.
The authors propose CONST, a novel framework that identifies critical training instances without requiring ground truth annotations. CONST evaluates sample importance using procedural volatility and outcome volatility, then applies conformal prediction to select lottery-winning samples for annotation and optimization.
The authors establish a theoretical generalization bound demonstrating that, under the lottery sample hypothesis and certain standard assumptions, CONST can effectively approximate the optimal policy parameters given a sufficiently large question dataset and verified rewards.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[15] Reasoning under Uncertainty: Efficient LLM Inference via Unsupervised Confidence Dilution and Convergent Adaptive Sampling
Contribution Analysis
Detailed comparisons for each claimed contribution
Lottery sample hypothesis for RLVR on LLMs
The authors formalize a hypothesis stating that a small subset of training samples can achieve performance comparable to the full dataset when used alone for reinforcement learning with verifiable reward on large language models. This hypothesis motivates the unsupervised discovery of critical instances.
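The hypothesis admits a compact formalization. The following is an illustrative rendering, not the paper's exact statement; the symbols J (expected verified reward), D (full question set), and the tolerance epsilon are assumptions introduced here for clarity.

```latex
% Illustrative formalization of the lottery sample hypothesis
% (not the paper's exact statement): there exists a small "winning"
% subset S of the question set D such that RLVR training on S alone
% nearly matches RLVR training on all of D.
\exists\, S \subseteq D,\quad |S| \ll |D|,\quad
J\!\big(\pi_{\theta(S)}\big) \;\ge\; J\!\big(\pi_{\theta(D)}\big) - \epsilon
```

Here theta(S) denotes the policy parameters obtained by running RLVR on subset S, so the inequality says the small-subset policy is within epsilon of the full-data policy in expected verified reward.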
[25] Reinforcement learning for reasoning in large language models with one training example
[21] Sample trajectory selection method based on large language model in reinforcement learning
[22] Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning
[23] Prompt optimization with EASE? efficient ordering-aware automated selection of exemplars
[24] Sample efficient preference alignment in LLMs via active exploration
[26] GENEREIT: generating multi-talented reinforcement learning agents
[27] Inference-aware fine-tuning for best-of-n sampling in large language models
[28] Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay
[29] Advances in Statistical Inference and Policy Optimization for Reinforcement Learning
[30] Tempera: Test-time prompting via reinforcement learning
Complementary Conformal Selection (CONST) framework
The authors propose CONST, a novel framework that identifies critical training instances without requiring ground truth annotations. CONST evaluates sample importance using procedural volatility and outcome volatility, then applies conformal prediction to select lottery-winning samples for annotation and optimization.
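The two-stage mechanics described here (volatility scoring over rollouts, then conformal selection against a calibration set) can be sketched in code. This is an illustrative reconstruction, not the authors' implementation: the specific volatility proxies (answer disagreement and trace-length dispersion), the function names, and the alpha parameter are all assumptions introduced for the sketch.

```python
import statistics

def outcome_volatility(answers):
    # Disagreement across rollout answers for one question:
    # 0.0 when all rollouts agree, approaching 1.0 as answers scatter.
    # No ground truth needed -- only self-consistency among rollouts.
    counts = {}
    for a in answers:
        counts[a] = counts.get(a, 0) + 1
    majority = max(counts.values()) / len(answers)
    return 1.0 - majority

def procedural_volatility(trace_lengths):
    # Dispersion of reasoning-trace lengths as a crude proxy for how
    # unstable the solution *procedure* is across rollouts.
    if len(trace_lengths) < 2:
        return 0.0
    return statistics.pstdev(trace_lengths) / (statistics.mean(trace_lengths) + 1e-9)

def conformal_select(scores, calibration_scores, alpha=0.1):
    # Split-conformal step: keep samples whose volatility score exceeds
    # the (1 - alpha) empirical quantile of a held-out calibration set,
    # so selection inherits a coverage-style guarantee.
    n = len(calibration_scores)
    k = min(n - 1, int((1 - alpha) * (n + 1)))
    threshold = sorted(calibration_scores)[k]
    return [i for i, s in enumerate(scores) if s > threshold]
```

In this sketch a question's final score could combine both volatilities (e.g., their sum), and the indices returned by `conformal_select` would be the "lottery-winning" samples forwarded for annotation and policy optimization.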
Theoretical analysis of CONST approximating optimal policy
The authors establish a theoretical generalization bound demonstrating that, under the lottery sample hypothesis and certain standard assumptions, CONST can effectively approximate the optimal policy parameters given a sufficiently large question dataset and verified rewards.
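The bound itself is not reproduced in this report. Results of this kind typically take the following shape, shown only as an illustrative template with hypothetical symbols (n for the number of sampled questions, delta for the failure probability, and a selection-error term that would come from the lottery sample hypothesis):

```latex
% Typical shape of such a generalization bound (illustrative template,
% not the paper's actual result): with probability at least 1 - \delta
% over n sampled questions with verified rewards,
J(\pi^{*}) - J(\hat{\pi})
  \;\le\; \mathcal{O}\!\left(\sqrt{\tfrac{\log(1/\delta)}{n}}\right)
  \;+\; \epsilon_{\mathrm{sel}}
```

where epsilon_sel would capture the approximation error incurred by optimizing on the selected subset rather than the full dataset, and the first term vanishes as n grows.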