Discern Truth from Falsehood: Reducing Over-Refusal via Contrastive Refinement

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Large Language Model, LLM Safety, Over-refusal, Safety Alignment
Abstract:

Large language models (LLMs) aligned for safety often suffer from over-refusal: the tendency to reject benign prompts that merely appear toxic, misclassifying them as harmful. This behavior undermines models' helpfulness and restricts their usability in sensitive or nuanced contexts. While prior work has proposed mitigation strategies such as data augmentation and activation steering, these approaches often face a trade-off: reducing over-refusal typically degrades the model's ability to reject genuinely harmful content. We argue that this issue arises from the ambiguous influence of toxic and seemingly toxic prompts on the model's learning dynamics. To address it, we introduce a preceding alignment stage, DCR: Discernment via Contrastive Refinement. Both theoretically and empirically, we demonstrate that contrastive refinement improves an LLM's capacity to distinguish truly toxic prompts from superficially toxic ones. Evaluations across diverse benchmarks show that our method effectively reduces over-refusal while preserving the safety benefits of alignment. Importantly, it achieves this with minimal degradation of general capabilities, offering a more principled and robust direction for safety alignment.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes the paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces DCR (Discernment via Contrastive Refinement), a training-based method that uses contrastive learning to help models distinguish truly toxic prompts from superficially toxic ones. It resides in the 'Contrastive and Discriminative Training Approaches' leaf, which contains only three papers in total, including this one. This represents a relatively sparse research direction within the broader taxonomy of 50 papers across 36 topics, suggesting that contrastive refinement for over-refusal mitigation is not yet heavily explored.

The taxonomy reveals that DCR sits within the 'Training-Based Safety Alignment and Mitigation' branch, which includes neighboring leaves such as 'Reasoning-Enhanced Safety Alignment' (3 papers), 'Preference Optimization and Refusal Training' (3 papers), and 'Foundational Safety Alignment' (3 papers). These adjacent directions pursue different training paradigms—reasoning-based methods incorporate explicit thought processes, while preference optimization uses techniques like DPO. The contrastive approach diverges by focusing on explicit discrimination between toxic and benign prompts during training, rather than reasoning triggers or preference signals, occupying a distinct methodological niche within the training-based landscape.

Among 15 total candidates examined across three contributions, no clearly refuting prior work was identified. The core DCR method examined 10 candidates with zero refutations, the theoretical analysis examined 3 with zero refutations, and the empirical characterization examined 2 with zero refutations. This limited search scope—15 candidates from semantic search and citation expansion—suggests the analysis captures nearby work but cannot claim exhaustive coverage. The absence of refutations among examined candidates indicates that within this bounded search, the specific combination of contrastive refinement for over-refusal appears relatively unexplored, though the small candidate pool limits confidence in this assessment.

Based on the limited literature search, the work appears to occupy a sparsely populated methodological niche within over-refusal mitigation. The taxonomy structure shows active research in adjacent training approaches, but the specific contrastive refinement strategy has fewer direct precedents among the 15 candidates examined. The analysis covers top semantic matches and citations but does not represent an exhaustive field survey, leaving open the possibility of relevant work outside this search scope.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 15
Refutable Papers: 0

Research Landscape Overview

Core task: Reducing over-refusal in safety-aligned large language models. The field addresses the challenge that models trained to avoid harmful outputs often refuse benign requests, limiting their utility. The taxonomy reveals several complementary research directions: Mechanistic Understanding and Representation Analysis explores how safety behaviors emerge in model internals, while Training-Based Safety Alignment and Mitigation develops methods to improve alignment during model training. Inference-Time Mitigation Strategies focus on post-hoc interventions that adjust model behavior without retraining, and Evaluation Frameworks and Benchmarks provide systematic ways to measure over-refusal alongside safety. Additional branches examine Adversarial Robustness, Domain-Specific challenges, and Alternative Refusal Strategies, reflecting the multifaceted nature of balancing safety with helpfulness.

Works like Navigating OverKill[3] and OR Bench[47] have helped establish evaluation standards, while training approaches range from contrastive methods to representation-based interventions. Within the training-based branch, contrastive and discriminative approaches have emerged as a particularly active line of work, aiming to teach models finer-grained distinctions between harmful and safe requests. Contrastive Refinement[0] exemplifies this direction by using contrastive learning to sharpen the boundary between legitimate refusals and over-cautious rejections, positioning itself alongside methods like Curated Malicious Data[33] that carefully construct training sets to improve discrimination. This contrasts with broader training strategies such as Safety Patching[4] or decoupled approaches like Decoupled Refusal Training[10], which separate safety mechanisms from general capabilities.
A central tension across these methods involves maintaining robustness against adversarial attacks while reducing false positives on benign inputs—a trade-off that contrastive techniques address by emphasizing explicit negative examples. The original work fits naturally within this discriminative training cluster, sharing with nearby efforts a focus on refining decision boundaries through carefully designed training signals rather than relying solely on inference-time corrections or architectural modifications.

Claimed Contributions

DCR: Discernment via Contrastive Refinement

The authors propose a two-stage safety alignment framework where the first stage applies contrastive learning on intermediate representations to help LLMs distinguish truly toxic prompts from seemingly toxic ones, thereby reducing over-refusal while preserving safety.

10 retrieved papers
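As a rough illustration of what such a first-stage objective could look like, the sketch below implements a generic InfoNCE-style contrastive loss over intermediate activations, treating same-group pairs (toxic with toxic, seemingly toxic with seemingly toxic) as positives and cross-group pairs as negatives, so the two groups separate in activation space. The function name, loss form, and temperature are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a contrastive-refinement stage: an InfoNCE-style
# loss on intermediate LLM activations. All names and hyperparameters
# here are assumptions for illustration.
import torch
import torch.nn.functional as F

def contrastive_refinement_loss(h_toxic, h_pseudo, temperature=0.1):
    """h_toxic:  (N, d) hidden states of truly toxic prompts
    h_pseudo: (N, d) hidden states of seemingly toxic (benign) prompts
    Same-group pairs act as positives, cross-group pairs as negatives."""
    n = h_toxic.shape[0]
    z = F.normalize(torch.cat([h_toxic, h_pseudo]), dim=-1)   # (2N, d)
    labels = torch.cat([torch.zeros(n), torch.ones(n)]).long()
    sim = z @ z.T / temperature                               # (2N, 2N)
    eye = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(eye, float("-inf"))                 # drop self-pairs
    pos_mask = (labels[:, None] == labels[None, :]) & ~eye
    # average negative log-probability assigned to same-group pairs
    log_p = sim.log_softmax(dim=-1)
    return -log_p[pos_mask].mean()
```

Minimizing this loss pulls each group's activations together and pushes the two groups apart, which is the discrimination behavior the contribution describes.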
Theoretical analysis linking contrastive learning to gradient similarity reduction

The authors establish a theoretical connection (Proposition 1) showing that contrastive learning on intermediate activations reduces the kernel similarity between prompts in gradient space, which is the root cause of over-refusal.

3 retrieved papers
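The quantity this analysis concerns can be made concrete with a small probe: the cosine similarity between two prompts' per-example loss gradients, i.e. a normalized entry of the empirical NTK. The toy helper below is an illustrative assumption, not the paper's exact kernel definition.

```python
# Hypothetical probe for gradient-space kernel similarity. A value near 1
# between a toxic and a seemingly toxic prompt means a refusal update on
# one also pushes the model toward refusing the other.
import torch
import torch.nn as nn

def grad_kernel_similarity(model, loss_fn, x1, y1, x2, y2):
    """Cosine similarity of the two per-example gradient vectors
    (a normalized empirical-NTK entry)."""
    def flat_grad(x, y):
        model.zero_grad()
        loss_fn(model(x), y).backward()
        return torch.cat([p.grad.detach().flatten()
                          for p in model.parameters()])
    g1 = flat_grad(x1, y1)
    g2 = flat_grad(x2, y2)
    return torch.dot(g1, g2) / (g1.norm() * g2.norm() + 1e-12)
```

Under this reading, Proposition 1 would say that training with the contrastive objective drives this similarity down for toxic vs. seemingly toxic pairs.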
Empirical characterization of over-refusal via learning dynamics

The authors provide the first explicit empirical study demonstrating that over-refusal stems from high gradient similarity between toxic and seemingly toxic prompts, tracked through kernel similarity measures during fine-tuning.

2 retrieved papers
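A minimal version of such a learning-dynamics probe, under invented toy settings (a small MLP, random vectors standing in for prompt features, SGD fine-tuning on a refusal label), could log this gradient similarity over training steps as follows; none of the specifics are the paper's.

```python
# Toy sketch of tracking gradient-space kernel similarity between one
# "toxic" and one "seemingly toxic" example during refusal fine-tuning.
# Model, data, and optimizer settings are invented for illustration.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x_toxic = torch.randn(1, 8)    # stand-in features for a truly toxic prompt
x_pseudo = torch.randn(1, 8)   # stand-in for a seemingly toxic (benign) one
y_refuse = torch.tensor([1])   # "refuse" class label

def flat_grad(x, y):
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.detach().flatten() for p in model.parameters()])

history = []  # kernel-similarity trajectory across fine-tuning steps
for step in range(20):
    g_t = flat_grad(x_toxic, y_refuse)
    g_p = flat_grad(x_pseudo, y_refuse)
    history.append(((g_t @ g_p) / (g_t.norm() * g_p.norm() + 1e-12)).item())
    # one fine-tuning step on the toxic prompt's refusal target only
    model.zero_grad()
    loss_fn(model(x_toxic), y_refuse).backward()
    opt.step()
```

Plotting `history` against `step` gives the kind of trajectory the contribution says it tracks: if the two gradients stay highly aligned, refusal training on the toxic example drags the benign-looking one toward refusal as well.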

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

DCR: Discernment via Contrastive Refinement

The authors propose a two-stage safety alignment framework where the first stage applies contrastive learning on intermediate representations to help LLMs distinguish truly toxic prompts from seemingly toxic ones, thereby reducing over-refusal while preserving safety.

Contribution

Theoretical analysis linking contrastive learning to gradient similarity reduction

The authors establish a theoretical connection (Proposition 1) showing that contrastive learning on intermediate activations reduces the kernel similarity between prompts in gradient space, which is the root cause of over-refusal.

Contribution

Empirical characterization of over-refusal via learning dynamics

The authors provide the first explicit empirical study demonstrating that over-refusal stems from high gradient similarity between toxic and seemingly toxic prompts, tracked through kernel similarity measures during fine-tuning.
