Discern Truth from Falsehood: Reducing Over-Refusal via Contrastive Refinement

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Large Language Model, LLM Safety, Over-refusal, Safety Alignment
Abstract:

Large language models (LLMs) aligned for safety often suffer from over-refusal: the tendency to reject benign prompts that merely appear toxic, misclassifying them as harmful. This behavior undermines models' helpfulness and restricts their usability in sensitive or nuanced contexts. While prior work has proposed mitigation strategies such as data augmentation and activation steering, these approaches often face a trade-off: reducing over-refusal typically degrades the model's ability to reject genuinely harmful content. We argue that this issue arises from the ambiguous influence of toxic and seemingly toxic prompts on the model's learning dynamics. To address it, we introduce a preceding alignment stage, DCR: Discernment via Contrastive Refinement. Both theoretically and empirically, we demonstrate that contrastive refinement improves an LLM's capacity to distinguish truly toxic prompts from superficially toxic ones. Evaluations across diverse benchmarks show that our method effectively reduces over-refusal while preserving the safety benefits of alignment. Importantly, it achieves this with minimal degradation of general capabilities, offering a more principled and robust direction for safety alignment.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes the paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces DCR (Discernment via Contrastive Refinement), a training-based method that uses contrastive learning to help models distinguish truly toxic prompts from superficially toxic ones. It resides in the 'Contrastive and Discriminative Training Approaches' leaf, which contains only three papers in total, including this one. This represents a relatively sparse research direction within the broader taxonomy of 50 papers across 36 topics, suggesting that contrastive refinement for over-refusal mitigation is not yet heavily explored.

The taxonomy reveals that DCR sits within the 'Training-Based Safety Alignment and Mitigation' branch, which includes neighboring leaves such as 'Reasoning-Enhanced Safety Alignment' (3 papers), 'Preference Optimization and Refusal Training' (3 papers), and 'Foundational Safety Alignment' (3 papers). These adjacent directions pursue different training paradigms—reasoning-based methods incorporate explicit thought processes, while preference optimization uses techniques like DPO. The contrastive approach diverges by focusing on explicit discrimination between toxic and benign prompts during training, rather than reasoning triggers or preference signals, occupying a distinct methodological niche within the training-based landscape.

Among 15 total candidates examined across three contributions, no clearly refuting prior work was identified. The core DCR method examined 10 candidates with zero refutations, the theoretical analysis examined 3 with zero refutations, and the empirical characterization examined 2 with zero refutations. This limited search scope—15 candidates from semantic search and citation expansion—suggests the analysis captures nearby work but cannot claim exhaustive coverage. The absence of refutations among examined candidates indicates that within this bounded search, the specific combination of contrastive refinement for over-refusal appears relatively unexplored, though the small candidate pool limits confidence in this assessment.

Based on the limited literature search, the work appears to occupy a sparsely populated methodological niche within over-refusal mitigation. The taxonomy structure shows active research in adjacent training approaches, but the specific contrastive refinement strategy has fewer direct precedents among the 15 candidates examined. The analysis covers top semantic matches and citations but does not represent an exhaustive field survey, leaving open the possibility of relevant work outside this search scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 15
Refutable Papers: 0

Research Landscape Overview

Core task: Reducing over-refusal in safety-aligned large language models. The field addresses the challenge that models trained to avoid harmful outputs often refuse benign requests, limiting their utility. The taxonomy reveals several complementary research directions: Mechanistic Understanding and Representation Analysis explores how safety behaviors emerge in model internals, while Training-Based Safety Alignment and Mitigation develops methods to improve alignment during model training. Inference-Time Mitigation Strategies focus on post-hoc interventions that adjust model behavior without retraining, and Evaluation Frameworks and Benchmarks provide systematic ways to measure over-refusal alongside safety. Additional branches examine Adversarial Robustness, Domain-Specific challenges, and Alternative Refusal Strategies, reflecting the multifaceted nature of balancing safety with helpfulness.

Works like Navigating OverKill[3] and OR Bench[47] have helped establish evaluation standards, while training approaches range from contrastive methods to representation-based interventions. Within the training-based branch, contrastive and discriminative approaches have emerged as a particularly active line of work, aiming to teach models finer-grained distinctions between harmful and safe requests. Contrastive Refinement[0] exemplifies this direction by using contrastive learning to sharpen the boundary between legitimate refusals and over-cautious rejections, positioning itself alongside methods like Curated Malicious Data[33] that carefully construct training sets to improve discrimination. This contrasts with broader training strategies such as Safety Patching[4] or decoupled approaches like Decoupled Refusal Training[10], which separate safety mechanisms from general capabilities.
A central tension across these methods involves maintaining robustness against adversarial attacks while reducing false positives on benign inputs—a trade-off that contrastive techniques address by emphasizing explicit negative examples. The original work fits naturally within this discriminative training cluster, sharing with nearby efforts a focus on refining decision boundaries through carefully designed training signals rather than relying solely on inference-time corrections or architectural modifications.

Claimed Contributions

DCR: Discernment via Contrastive Refinement

The authors propose a two-stage safety alignment framework where the first stage applies contrastive learning on intermediate representations to help LLMs distinguish truly toxic prompts from seemingly toxic ones, thereby reducing over-refusal while preserving safety.

10 retrieved papers
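As a rough illustration of what such a first-stage objective could look like, the sketch below implements a generic InfoNCE-style contrastive loss over intermediate activations, treating same-group pairs (toxic with toxic, seemingly toxic with seemingly toxic) as positives and cross-group pairs as negatives, so the two groups separate in activation space. The function name, loss form, and temperature are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a contrastive-refinement stage: an InfoNCE-style
# loss on intermediate LLM activations. All names and hyperparameters
# here are assumptions for illustration.
import torch
import torch.nn.functional as F

def contrastive_refinement_loss(h_toxic, h_pseudo, temperature=0.1):
    """h_toxic:  (N, d) hidden states of truly toxic prompts
    h_pseudo: (N, d) hidden states of seemingly toxic (benign) prompts
    Same-group pairs act as positives, cross-group pairs as negatives."""
    n = h_toxic.shape[0]
    z = F.normalize(torch.cat([h_toxic, h_pseudo]), dim=-1)   # (2N, d)
    labels = torch.cat([torch.zeros(n), torch.ones(n)]).long()
    sim = z @ z.T / temperature                               # (2N, 2N)
    eye = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(eye, float("-inf"))                 # drop self-pairs
    pos_mask = (labels[:, None] == labels[None, :]) & ~eye
    # average negative log-probability assigned to same-group pairs
    log_p = sim.log_softmax(dim=-1)
    return -log_p[pos_mask].mean()
```

Minimizing this loss pulls each group's activations together and pushes the two groups apart, which is the discrimination behavior the contribution describes.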
Theoretical analysis linking contrastive learning to gradient similarity reduction

The authors establish a theoretical connection (Proposition 1) showing that contrastive learning on intermediate activations reduces the kernel similarity between prompts in gradient space, which is the root cause of over-refusal.

3 retrieved papers
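The quantity this analysis concerns can be made concrete with a small probe: the cosine similarity between two prompts' per-example loss gradients, i.e. a normalized entry of the empirical NTK. The toy helper below is an illustrative assumption, not the paper's exact kernel definition.

```python
# Hypothetical probe for gradient-space kernel similarity. A value near 1
# between a toxic and a seemingly toxic prompt means a refusal update on
# one also pushes the model toward refusing the other.
import torch
import torch.nn as nn

def grad_kernel_similarity(model, loss_fn, x1, y1, x2, y2):
    """Cosine similarity of the two per-example gradient vectors
    (a normalized empirical-NTK entry)."""
    def flat_grad(x, y):
        model.zero_grad()
        loss_fn(model(x), y).backward()
        return torch.cat([p.grad.detach().flatten()
                          for p in model.parameters()])
    g1 = flat_grad(x1, y1)
    g2 = flat_grad(x2, y2)
    return torch.dot(g1, g2) / (g1.norm() * g2.norm() + 1e-12)
```

Under this reading, Proposition 1 would say that training with the contrastive objective drives this similarity down for toxic vs. seemingly toxic pairs.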
Empirical characterization of over-refusal via learning dynamics

The authors provide the first explicit empirical study demonstrating that over-refusal stems from high gradient similarity between toxic and seemingly toxic prompts, tracked through kernel similarity measures during fine-tuning.

2 retrieved papers
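A minimal version of such a learning-dynamics probe, under invented toy settings (a small MLP, random vectors standing in for prompt features, SGD fine-tuning on a refusal label), could log this gradient similarity over training steps as follows; none of the specifics are the paper's.

```python
# Toy sketch of tracking gradient-space kernel similarity between one
# "toxic" and one "seemingly toxic" example during refusal fine-tuning.
# Model, data, and optimizer settings are invented for illustration.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x_toxic = torch.randn(1, 8)    # stand-in features for a truly toxic prompt
x_pseudo = torch.randn(1, 8)   # stand-in for a seemingly toxic (benign) one
y_refuse = torch.tensor([1])   # "refuse" class label

def flat_grad(x, y):
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.detach().flatten() for p in model.parameters()])

history = []  # kernel-similarity trajectory across fine-tuning steps
for step in range(20):
    g_t = flat_grad(x_toxic, y_refuse)
    g_p = flat_grad(x_pseudo, y_refuse)
    history.append(((g_t @ g_p) / (g_t.norm() * g_p.norm() + 1e-12)).item())
    # one fine-tuning step on the toxic prompt's refusal target only
    model.zero_grad()
    loss_fn(model(x_toxic), y_refuse).backward()
    opt.step()
```

Plotting `history` against `step` gives the kind of trajectory the contribution says it tracks: if the two gradients stay highly aligned, refusal training on the toxic example drags the benign-looking one toward refusal as well.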

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

DCR: Discernment via Contrastive Refinement

The authors propose a two-stage safety alignment framework where the first stage applies contrastive learning on intermediate representations to help LLMs distinguish truly toxic prompts from seemingly toxic ones, thereby reducing over-refusal while preserving safety.

Contribution

Theoretical analysis linking contrastive learning to gradient similarity reduction

The authors establish a theoretical connection (Proposition 1) showing that contrastive learning on intermediate activations reduces the kernel similarity between prompts in gradient space, which is the root cause of over-refusal.

Contribution

Empirical characterization of over-refusal via learning dynamics

The authors provide the first explicit empirical study demonstrating that over-refusal stems from high gradient similarity between toxic and seemingly toxic prompts, tracked through kernel similarity measures during fine-tuning.
