RESTRAIN: From Spurious Votes to Signals — Self-Training RL with Self-Penalization

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: LLM, self-training, RL, unsupervised learning, self-penalization
Abstract:

Reinforcement learning with human-annotated data has boosted chain-of-thought reasoning in large reasoning models, but these gains come at a high cost in labeled data and falter on harder tasks. A natural next step is experience-driven learning, where models improve without curated labels by adapting to unlabeled data. We introduce REinforcement learning with Self-resTRAINt training (RESTRAIN), a self-penalizing RL framework that converts the absence of gold labels into a useful learning signal. Instead of overcommitting to spurious majority votes, RESTRAIN exploits signals from the model’s entire answer distribution: it penalizes overconfident rollouts and low-consistency examples while preserving promising reasoning chains. This self-penalization mechanism integrates seamlessly into policy optimization methods such as GRPO, enabling continual self-improvement without supervision. On challenging reasoning benchmarks, RESTRAIN delivers large gains using only unlabeled data. With Qwen3-4B-Base and OctoThinker Hybrid-8B-Base, it boosts Pass@1 by up to +140.7% on AIME25, +36.2% on MMLU STEM, and +19.6% on GPQA-Diamond, nearly matching gold-label training while using no gold labels. These results establish RESTRAIN as a scalable path toward stronger reasoning without gold labels.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces RESTRAIN, a self-penalizing reinforcement learning framework that enables models to improve on unlabeled data by exploiting signals from the entire answer distribution rather than relying solely on majority votes. It resides in the 'Majority Voting with Self-Penalization' leaf of the taxonomy, which contains only two papers total. This leaf sits within the broader 'Majority Voting and Distribution-Based Learning' branch, indicating a relatively sparse but emerging research direction focused on avoiding spurious convergence in self-improvement settings.

The taxonomy reveals that RESTRAIN's immediate neighbors include evolutionary and consistency-based selection methods in a sibling leaf, as well as broader self-reward generation approaches (e.g., confidence-based rewards, AI-generated feedback) in adjacent branches. While the field contains substantial work on self-feedback and synthetic preference generation (seven papers in Self-Feedback alone), the specific combination of majority voting with explicit penalization mechanisms remains less explored. The taxonomy's scope notes clarify that RESTRAIN's penalization focus distinguishes it from pure majority voting or evolutionary novelty promotion methods.

Among the 24 candidates examined across three contributions, none were found to clearly refute any aspect of RESTRAIN. The core framework (10 candidates examined, 0 refutable), pseudo-label weighting scheme (5 candidates, 0 refutable), and negative rollout penalization (9 candidates, 0 refutable) all appear to lack direct prior work within this limited search scope. This suggests that while the broader field of self-improving RL is active, the specific integration of distribution-based penalization with policy optimization methods like GRPO represents a relatively unexplored combination.

Based on the top-24 semantic matches examined, RESTRAIN appears to occupy a novel position at the intersection of majority voting and self-penalization. However, this assessment is constrained by the limited search scope and the taxonomy's focus on self-improving RL without gold labels. The analysis does not cover exhaustive prior work in supervised learning, traditional RL with external rewards, or related fields where similar penalization ideas might exist under different framing.

Taxonomy

Core-task Taxonomy Papers: 49
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: Self-improving reinforcement learning without gold labels. This field explores how agents can bootstrap their own learning signals in the absence of external supervision, relying instead on self-generated rewards, peer comparisons, or iterative refinement. The taxonomy reveals several major branches: Self-Reward Generation and Preference Learning focuses on methods that let models judge their own outputs or learn from synthetic preferences (e.g., Self Rewarding[17], RLAIF vs RLHF[2]); Majority Voting and Distribution-Based Learning emphasizes aggregating multiple candidate solutions to identify high-quality trajectories (e.g., RESTRAIN[0], RESTRAIN Self-Driven[31]); Curriculum and Online Learning Frameworks address how to schedule tasks or adapt policies over time (e.g., TTRL[1], SSRL[5]); Theoretical Foundations and Cognitive Frameworks provide conceptual underpinnings (e.g., Cognitive Behaviors STaRs[3]); and Domain-Specific Applications, Robotic and Embodied RL, and Specialized Learning Paradigms tackle concrete settings ranging from code generation (RLCoder[10]) to robotic manipulation (SERL[21]) and web navigation (WebRL[19]).

A particularly active line of work centers on majority voting and self-penalization strategies, where agents sample multiple rollouts and use agreement or distributional properties to filter or reweight training data. RESTRAIN[0] sits squarely in this cluster, proposing a self-penalization mechanism that discourages overconfident majority votes and encourages exploration of diverse high-quality solutions. This contrasts with simpler majority-voting schemes and aligns closely with RESTRAIN Self-Driven[31], which extends the idea to fully autonomous settings. Meanwhile, works like SSRL[5] and Cognitive Behaviors STaRs[3] explore complementary angles: SSRL[5] emphasizes staged self-improvement with curriculum design, while Cognitive Behaviors STaRs[3] integrates cognitive reasoning steps into the self-training loop.
Across these branches, a central tension emerges between exploiting strong majority signals and maintaining sufficient exploration to avoid reward hacking or premature convergence, a challenge that RESTRAIN[0] addresses through its penalization framework.

Claimed Contributions

RESTRAIN framework for self-driven RL with self-penalization

The authors propose RESTRAIN, a reinforcement learning framework that enables models to self-improve on unlabeled data by penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains, without requiring gold labels or external supervision.

9 retrieved papers
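To make the framework concrete, here is a minimal Python sketch of how frequency-based pseudo-rewards could plug into GRPO-style group-relative advantage normalization. The function names, the simple relative-frequency reward, and the constants are illustrative assumptions, not the paper's exact formulation:

```python
from collections import Counter

def frequency_rewards(answers):
    """Reward each rollout by the relative frequency of its final answer
    within the group -- a soft pseudo-label signal that needs no gold
    labels. Illustrative sketch, not the paper's exact reward."""
    counts = Counter(answers)
    n = len(answers)
    return [counts[a] / n for a in answers]

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages in the style of GRPO: mean-center the
    group's rewards and divide by the group standard deviation, so the
    self-generated rewards drop into standard policy optimization."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

For a group of four rollouts with final answers ["42", "42", "7", "42"], the majority answer earns reward 0.75 and the outlier 0.25; after normalization the outlier receives a negative advantage, so no external verifier is needed.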
Pseudo-label weighting scheme based on answer frequency

The authors develop a weighting mechanism that assigns weights to pseudo-labels based on their frequency across multiple rollouts, using a monotonic shaping function to down-weight spurious low-frequency answers while avoiding the brittleness of strict majority voting.

5 retrieved papers
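The weighting scheme described above can be sketched as a monotonic shaping function applied to each answer's empirical frequency. The power-law form, the exponent `gamma`, and the cutoff `floor` below are illustrative assumptions, not the paper's exact choices:

```python
def frequency_weight(freq, total, gamma=2.0, floor=0.1):
    """Monotonically shape an answer's empirical frequency into a
    pseudo-label weight. A convex map (gamma > 1) suppresses rare,
    likely-spurious answers while keeping the weighting smooth,
    avoiding the all-or-nothing brittleness of strict majority voting.
    All constants here are illustrative."""
    p = freq / total          # relative frequency of this answer
    w = p ** gamma            # smooth down-weighting of rare answers
    return w if w >= floor else 0.0
```

With 16 rollouts, an answer seen 12 times keeps weight 0.5625, while one seen twice maps to about 0.016 and is zeroed by the floor, so low-frequency answers contribute nothing while near-majority answers retain graded credit.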
Negative rollout penalization mechanism

The authors introduce a penalization mechanism that explicitly penalizes all rollouts when majority consensus is very low, encouraging the model to explore alternative reasoning paths in unreliable supervision scenarios where no answer can be confidently trusted.

8 retrieved papers
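A minimal sketch of this consensus gate, assuming a simple agreement-rate threshold; the names `consensus_gate`, `penalty`, and `tau` are hypothetical, not the paper's notation:

```python
from collections import Counter

def consensus_gate(answers, penalty=-1.0, tau=0.25):
    """Negative rollout penalization: when the most common answer's
    agreement rate falls below the threshold `tau`, no pseudo-label can
    be trusted, so every rollout in the group receives the same negative
    reward, pushing the policy away from its current unreliable
    reasoning paths. Returns None when consensus is adequate, signaling
    that the usual frequency-based rewards should be used instead."""
    counts = Counter(answers)
    top_frac = counts.most_common(1)[0][1] / len(answers)
    if top_frac < tau:
        return [penalty] * len(answers)
    return None
```

For example, five rollouts that all disagree (agreement rate 0.2) are uniformly penalized, whereas a group where two of three rollouts agree passes the gate untouched.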

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though the signal remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1

RESTRAIN framework for self-driven RL with self-penalization

The authors propose RESTRAIN, a reinforcement learning framework that enables models to self-improve on unlabeled data by penalizing overconfident rollouts and low-consistency examples while preserving promising reasoning chains, without requiring gold labels or external supervision.

Contribution 2

Pseudo-label weighting scheme based on answer frequency

The authors develop a weighting mechanism that assigns weights to pseudo-labels based on their frequency across multiple rollouts, using a monotonic shaping function to down-weight spurious low-frequency answers while avoiding the brittleness of strict majority voting.

Contribution 3

Negative rollout penalization mechanism

The authors introduce a penalization mechanism that explicitly penalizes all rollouts when majority consensus is very low, encouraging the model to explore alternative reasoning paths in unreliable supervision scenarios where no answer can be confidently trusted.