Large Reasoning Models Learn Better Alignment from Flawed Thinking

ICLR 2026 Conference SubmissionAnonymous Authors

OpenReview Score: 6.5 Download Report PDF

large reasoning modelsafety alignmentrobustnessrlhf

Large reasoning models (LRMs) “think” by generating structured chain-of-thought (CoT) before producing a final answer, yet they still lack the ability to reason critically about safety alignment and are easily biased when a flawed premise is injected into their thought process. We propose RECAP (Robust Safety Alignment via Counter-Aligned Prefilling), a principled reinforcement learning (RL) method for post-training that explicitly teaches models to override flawed reasoning trajectories and reroute to safe and helpful responses. RECAP trains on a mixture of synthetically generated counter-aligned CoT prefills and standard prompts, requires no additional training cost or modifications beyond vanilla reinforcement learning from human feedback (RLHF), and substantially improves safety and jailbreak robustness, reduces overrefusal, and preserves core reasoning capability — all while maintaining inference token budget. Extensive analysis shows that RECAP-trained models engage in self-reflection more frequently and remain robust under adaptive attacks, preserving safety even after repeated attempts to override their reasoning.

Abstract:

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes RECAP, a reinforcement learning method that trains models to override flawed reasoning trajectories using counter-aligned chain-of-thought prefills. According to the taxonomy, this work occupies the 'Reinforcement Learning-Based Counter-Alignment' leaf under 'Counter-Aligned Reasoning Trajectory Methods'. Notably, this leaf contains only the original paper itself—no sibling papers are present. This positioning suggests the paper addresses a relatively sparse research direction within the broader field of safety alignment for reasoning models, though the taxonomy includes only two papers total across both major branches.

The taxonomy reveals two primary methodological branches: Counter-Aligned Reasoning Trajectory Methods (where this paper resides) and Adversarial Chain-of-Thought Tuning Methods. The latter includes work on 'Snowball Effect Mitigation Techniques' that prevent progressive amplification of reasoning deviations. The taxonomy's scope notes clarify that counter-aligned prefilling methods (like RECAP) differ from adversarial training approaches by focusing on overriding flawed premises rather than preventing reasoning deviation amplification. This structural separation indicates the paper explores a distinct intervention point—teaching models to self-correct during reasoning rather than hardening them against adversarial inputs during training.

Among 30 candidate papers examined, the contribution-level analysis shows mixed novelty signals. The core RECAP method (Contribution 1) examined 10 candidates with zero refutable matches, suggesting limited direct prior work on RL-based counter-aligned prefilling within the search scope. However, the claim of simultaneous improvement across safety, helpfulness, and reasoning (Contribution 2) found one refutable candidate among 10 examined, indicating some overlap with existing multi-objective alignment work. The robustness claim under adaptive attacks (Contribution 3) showed no refutations across 10 candidates, though this reflects the limited search scale rather than exhaustive coverage.

Given the constrained literature search (30 candidates from semantic search), the paper appears to introduce a relatively novel training paradigm within its specific methodological niche. The absence of sibling papers in the taxonomy leaf and the low refutation rate for the core method suggest meaningful differentiation from prior work, though the single-paper taxonomy structure limits confidence in assessing field saturation. The analysis captures top-K semantic matches but does not guarantee comprehensive coverage of all relevant safety alignment or reasoning model literature.

Taxonomy

Core-task Taxonomy Papers

Claimed Contributions

Contribution Candidate Papers Compared

Refutable Paper

Research Landscape Overview

Core task: improving safety alignment in large reasoning models through counter-aligned chain-of-thought prefilling. This emerging field addresses a critical challenge in modern AI systems: ensuring that models capable of sophisticated reasoning do not produce harmful outputs even when prompted adversarially. The taxonomy reveals two primary branches that capture distinct methodological approaches. Counter-Aligned Reasoning Trajectory Methods focus on generating and leveraging reasoning paths that deliberately expose or counteract misaligned behavior, often through reinforcement learning or other trajectory-optimization techniques. Adversarial Chain-of-Thought Tuning Methods, by contrast, emphasize adversarial training regimes that directly manipulate intermediate reasoning steps to stress-test and harden model safety. Together, these branches reflect a shared recognition that safety alignment must extend beyond surface-level output filtering to the internal reasoning processes themselves. Within Counter-Aligned Reasoning Trajectory Methods, a particularly active line of work explores reinforcement learning-based counter-alignment, where models learn to recognize and avoid flawed reasoning patterns through iterative feedback. Flawed Thinking[0] exemplifies this direction by using counter-aligned chain-of-thought prefilling to guide models away from unsafe trajectories during inference. This approach contrasts with adversarial tuning strategies such as AdvChain[1], which instead injects adversarial reasoning chains during training to preemptively expose vulnerabilities. The central trade-off across these branches involves balancing the computational cost of generating diverse counter-aligned trajectories against the robustness gains achieved. Flawed Thinking[0] sits squarely within the reinforcement learning-based counter-alignment cluster, emphasizing inference-time intervention rather than adversarial pre-training, and thus offers a complementary perspective to methods that rely on adversarial chain manipulation during the tuning phase.

Claimed Contributions

RECAP: Robust Safety Alignment via Counter-Aligned Prefilling

10 retrieved papers

RECAP is a reinforcement learning method that trains large reasoning models on a mixture of counter-aligned chain-of-thought prefills and standard prompts. By exposing models to deliberately flawed reasoning traces during training, RECAP teaches them to recover from misleading trajectories without requiring additional training cost or modifications beyond standard RLHF.

10 retrieved papers

Simultaneous improvement of safety, helpfulness, and reasoning capability

Can Refute

10 retrieved papers

RECAP delivers substantial gains across multiple dimensions: improved safety on direct harmful and jailbreaking benchmarks, reduced overrefusal on benign queries, and enhanced mathematical reasoning performance. These improvements are achieved while maintaining similar inference-time token budgets and are supported by theoretical analysis demonstrating higher expected reward under both prefilled and non-prefilled evaluation.

10 retrieved papers

Can Refute

Persistent robustness under adaptive attacks via increased self-reflection

10 retrieved papers

RECAP-trained models demonstrate sustained safety even when subjected to adaptive attacks designed to bypass their self-reflection mechanisms, including full CoT hijacking and iterative prefill reset attacks. Analysis reveals that these models engage in self-reflection significantly more frequently than vanilla RLHF models, actively revising unsafe or mistaken reasoning during generation.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

RECAP: Robust Safety Alignment via Counter-Aligned Prefilling

[22] Demystifying Long Chain-of-Thought Reasoning in LLMs PDF

Cannot Refute

[23] Training language models to self-correct via reinforcement learning PDF

Cannot Refute

[24] SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning PDF

Cannot Refute

[25] ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search PDF

Cannot Refute

[26] Spectral policy optimization: Coloring your incorrect reasoning in grpo PDF

Cannot Refute

[27] Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms PDF

Cannot Refute

[28] Reinforcement Learning for Reasoning in Large Language Models with One Training Example PDF

Cannot Refute

[29] The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning PDF

Cannot Refute

[30] MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL PDF

Cannot Refute

[31] Self-rewarding correction for mathematical reasoning PDF

Cannot Refute

Contribution

Simultaneous improvement of safety, helpfulness, and reasoning capability

[6] ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization PDF

Can Refute

[2] Deliberative alignment: Reasoning enables safer language models PDF

Cannot Refute

[3] ChatGPT for good? On opportunities and challenges of large language models for education PDF

Cannot Refute

[4] Multi-expert prompting improves reliability, safety, and usefulness of large language models PDF

Cannot Refute

[5] Responsible AI in construction safety: Systematic evaluation of large language models and prompt engineering PDF

Cannot Refute

[7] Reasoning-to-defend: Safety-aware reasoning can defend large language models from jailbreaking PDF

Cannot Refute

[8] Intentionreasoner: Facilitating adaptive llm safeguards through intent reasoning and selective query refinement PDF

Cannot Refute

[9] SAFER: Advancing Safety Alignment via Efficient Ex-Ante Reasoning PDF

Cannot Refute

[10] Superficial safety alignment hypothesis PDF

Cannot Refute

[11] Enhancing AI Trustworthiness Through Automated Reasoning: A Novel Method for Explaining Deep Learning and LLM Reasoning PDF

Cannot Refute

Contribution

Persistent robustness under adaptive attacks via increased self-reflection

[12] LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs PDF

Cannot Refute

[13] Chain-of-scrutiny: Detecting backdoor attacks for large language models PDF

Cannot Refute

[14] AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security PDF

Cannot Refute

[15] Breaking agents: Compromising autonomous llm agents through malfunction amplification PDF

Cannot Refute

[16] When to Trust Context: Self-Reflective Debates for Context Reliability PDF

Cannot Refute

[17] Latent syntax weaving in large language model representations: A novel mechanism for self-referential consistency in neural architectures PDF

Cannot Refute

[18] Stochastic constraint self-reflective syntax reconstruction in large language model internal representational spaces PDF

Cannot Refute

[19] Quantitative self-reflection protocols for self-replicating memory chains in large language models: A technical investigation PDF

Cannot Refute

[20] Multi-agent LLM debate unveils the premise left unsaid PDF

Cannot Refute

[21] ShadowCoT: Cognitive Hijacking for Stealthy Reasoning Backdoors in LLMs PDF

Cannot Refute

Large Reasoning Models Learn Better Alignment from Flawed Thinking

Overview

Overall Novelty Assessment

Taxonomy

Research Landscape Overview

Claimed Contributions

Core Task Comparisons

Contribution Analysis

RECAP: Robust Safety Alignment via Counter-Aligned Prefilling

[22] Demystifying Long Chain-of-Thought Reasoning in LLMs PDF

[23] Training language models to self-correct via reinforcement learning PDF

[24] SimpleTIR: End-to-End Reinforcement Learning for Multi-Turn Tool-Integrated Reasoning PDF

[25] ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search PDF

[26] Spectral policy optimization: Coloring your incorrect reasoning in grpo PDF

[27] Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms PDF

[28] Reinforcement Learning for Reasoning in Large Language Models with One Training Example PDF

[29] The Surprising Effectiveness of Negative Reinforcement in LLM Reasoning PDF

[30] MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL PDF

[31] Self-rewarding correction for mathematical reasoning PDF

Simultaneous improvement of safety, helpfulness, and reasoning capability

[6] ERPO: Advancing Safety Alignment via Ex-Ante Reasoning Preference Optimization PDF

[2] Deliberative alignment: Reasoning enables safer language models PDF

[3] ChatGPT for good? On opportunities and challenges of large language models for education PDF

[4] Multi-expert prompting improves reliability, safety, and usefulness of large language models PDF

[5] Responsible AI in construction safety: Systematic evaluation of large language models and prompt engineering PDF

[7] Reasoning-to-defend: Safety-aware reasoning can defend large language models from jailbreaking PDF

[8] Intentionreasoner: Facilitating adaptive llm safeguards through intent reasoning and selective query refinement PDF

[9] SAFER: Advancing Safety Alignment via Efficient Ex-Ante Reasoning PDF

[10] Superficial safety alignment hypothesis PDF

[11] Enhancing AI Trustworthiness Through Automated Reasoning: A Novel Method for Explaining Deep Learning and LLM Reasoning PDF

Persistent robustness under adaptive attacks via increased self-reflection

[12] LARGO: Latent Adversarial Reflection through Gradient Optimization for Jailbreaking LLMs PDF

[13] Chain-of-scrutiny: Detecting backdoor attacks for large language models PDF

[14] AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security PDF

[15] Breaking agents: Compromising autonomous llm agents through malfunction amplification PDF

[16] When to Trust Context: Self-Reflective Debates for Context Reliability PDF

[17] Latent syntax weaving in large language model representations: A novel mechanism for self-referential consistency in neural architectures PDF

[18] Stochastic constraint self-reflective syntax reconstruction in large language model internal representational spaces PDF

[19] Quantitative self-reflection protocols for self-replicating memory chains in large language models: A technical investigation PDF

[20] Multi-agent LLM debate unveils the premise left unsaid PDF

[21] ShadowCoT: Cognitive Hijacking for Stealthy Reasoning Backdoors in LLMs PDF

Table of Contents