Large Reasoning Models Learn Better Alignment from Flawed Thinking

ICLR 2026 Conference SubmissionAnonymous Authors
large reasoning modelsafety alignmentrobustnessrlhf
Abstract:

Large reasoning models (LRMs) “think” by generating structured chain-of-thought (CoT) before producing a final answer, yet they still lack the ability to reason critically about safety alignment and are easily biased when a flawed premise is injected into their thought process. We propose RECAP (Robust Safety Alignment via Counter-Aligned Prefilling), a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories and reroute to safe and helpful responses. RECAP trains on a mixture of synthetically generated counter-aligned CoT prefills and standard prompts, requires no additional training cost or modifications beyond vanilla reinforcement learning from human feedback (RLHF), and substantially improves safety and jailbreak robustness, reduces overrefusal, and preserves core reasoning capability — all while maintaining inference token budget. Extensive analysis shows that RECAP-trained models engage in self-reflection more frequently and remain robust under adaptive attacks, preserving safety even after repeated attempts to override their reasoning.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes RECAP, a reinforcement learning method that trains models to override flawed reasoning trajectories using counter-aligned chain-of-thought prefills. According to the taxonomy, this work occupies the 'Reinforcement Learning-Based Counter-Alignment' leaf under 'Counter-Aligned Reasoning Trajectory Methods'. Notably, this leaf contains only the original paper itself—no sibling papers are present. This positioning suggests the paper addresses a relatively sparse research direction within the broader field of safety alignment for reasoning models, though the taxonomy includes only two papers total across both major branches.

The taxonomy reveals two primary methodological branches: Counter-Aligned Reasoning Trajectory Methods (where this paper resides) and Adversarial Chain-of-Thought Tuning Methods. The latter includes work on 'Snowball Effect Mitigation Techniques' that prevent progressive amplification of reasoning deviations. The taxonomy's scope notes clarify that counter-aligned prefilling methods (like RECAP) differ from adversarial training approaches by focusing on overriding flawed premises rather than preventing reasoning deviation amplification. This structural separation indicates the paper explores a distinct intervention point—teaching models to self-correct during reasoning rather than hardening them against adversarial inputs during training.

Among 30 candidate papers examined, the contribution-level analysis shows mixed novelty signals. The core RECAP method (Contribution 1) examined 10 candidates with zero refutable matches, suggesting limited direct prior work on RL-based counter-aligned prefilling within the search scope. However, the claim of simultaneous improvement across safety, helpfulness, and reasoning (Contribution 2) found one refutable candidate among 10 examined, indicating some overlap with existing multi-objective alignment work. The robustness claim under adaptive attacks (Contribution 3) showed no refutations across 10 candidates, though this reflects the limited search scale rather than exhaustive coverage.

Given the constrained literature search (30 candidates from semantic search), the paper appears to introduce a relatively novel training paradigm within its specific methodological niche. The absence of sibling papers in the taxonomy leaf and the low refutation rate for the core method suggest meaningful differentiation from prior work, though the single-paper taxonomy structure limits confidence in assessing field saturation. The analysis captures top-K semantic matches but does not guarantee comprehensive coverage of all relevant safety alignment or reasoning model literature.

Taxonomy

Core-task Taxonomy Papers
1
3
Claimed Contributions
30
Contribution Candidate Papers Compared
1
Refutable Paper

Research Landscape Overview

Core task: improving safety alignment in large reasoning models through counter-aligned chain-of-thought prefilling. This emerging field addresses a critical challenge in modern AI systems: ensuring that models capable of sophisticated reasoning do not produce harmful outputs even when prompted adversarially. The taxonomy reveals two primary branches that capture distinct methodological approaches. Counter-Aligned Reasoning Trajectory Methods focus on generating and leveraging reasoning paths that deliberately expose or counteract misaligned behavior, often through reinforcement learning or other trajectory-optimization techniques. Adversarial Chain-of-Thought Tuning Methods, by contrast, emphasize adversarial training regimes that directly manipulate intermediate reasoning steps to stress-test and harden model safety. Together, these branches reflect a shared recognition that safety alignment must extend beyond surface-level output filtering to the internal reasoning processes themselves. Within Counter-Aligned Reasoning Trajectory Methods, a particularly active line of work explores reinforcement learning-based counter-alignment, where models learn to recognize and avoid flawed reasoning patterns through iterative feedback. Flawed Thinking[0] exemplifies this direction by using counter-aligned chain-of-thought prefilling to guide models away from unsafe trajectories during inference. This approach contrasts with adversarial tuning strategies such as AdvChain[1], which instead injects adversarial reasoning chains during training to preemptively expose vulnerabilities. The central trade-off across these branches involves balancing the computational cost of generating diverse counter-aligned trajectories against the robustness gains achieved. Flawed Thinking[0] sits squarely within the reinforcement learning-based counter-alignment cluster, emphasizing inference-time intervention rather than adversarial pre-training, and thus offers a complementary perspective to methods that rely on adversarial chain manipulation during the tuning phase.

Claimed Contributions

RECAP: Robust Safety Alignment via Counter-Aligned Prefilling

RECAP is a reinforcement learning method that trains large reasoning models on a mixture of counter-aligned chain-of-thought prefills and standard prompts. By exposing models to deliberately flawed reasoning traces during training, RECAP teaches them to recover from misleading trajectories without requiring additional training cost or modifications beyond standard RLHF.

10 retrieved papers
Simultaneous improvement of safety, helpfulness, and reasoning capability

RECAP delivers substantial gains across multiple dimensions: improved safety on direct harmful and jailbreaking benchmarks, reduced overrefusal on benign queries, and enhanced mathematical reasoning performance. These improvements are achieved while maintaining similar inference-time token budgets and are supported by theoretical analysis demonstrating higher expected reward under both prefilled and non-prefilled evaluation.

10 retrieved papers
Can Refute
Persistent robustness under adaptive attacks via increased self-reflection

RECAP-trained models demonstrate sustained safety even when subjected to adaptive attacks designed to bypass their self-reflection mechanisms, including full CoT hijacking and iterative prefill reset attacks. Analysis reveals that these models engage in self-reflection significantly more frequently than vanilla RLHF models, actively revising unsafe or mistaken reasoning during generation.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

RECAP: Robust Safety Alignment via Counter-Aligned Prefilling

RECAP is a reinforcement learning method that trains large reasoning models on a mixture of counter-aligned chain-of-thought prefills and standard prompts. By exposing models to deliberately flawed reasoning traces during training, RECAP teaches them to recover from misleading trajectories without requiring additional training cost or modifications beyond standard RLHF.

Contribution

Simultaneous improvement of safety, helpfulness, and reasoning capability

RECAP delivers substantial gains across multiple dimensions: improved safety on direct harmful and jailbreaking benchmarks, reduced overrefusal on benign queries, and enhanced mathematical reasoning performance. These improvements are achieved while maintaining similar inference-time token budgets and are supported by theoretical analysis demonstrating higher expected reward under both prefilled and non-prefilled evaluation.

Contribution

Persistent robustness under adaptive attacks via increased self-reflection

RECAP-trained models demonstrate sustained safety even when subjected to adaptive attacks designed to bypass their self-reflection mechanisms, including full CoT hijacking and iterative prefill reset attacks. Analysis reveals that these models engage in self-reflection significantly more frequently than vanilla RLHF models, actively revising unsafe or mistaken reasoning during generation.