SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Safety Alignment, LLM Fine-tuning, Preferences, Large Language Models, AI Safety
Abstract:

As Large Language Models (LLMs) are increasingly deployed in real-world applications, balancing helpfulness and safety has become a central challenge. A natural approach is to incorporate safety constraints into Reinforcement Learning from Human Feedback (RLHF), where recent studies have shown promising progress. However, these methods often rely on auxiliary networks or multi-stage pipelines, thereby increasing complexity. In this work, we revisit the safety alignment objective itself and demonstrate that it admits a closed-form solution, yielding a theoretically grounded and provably equivalent reformulation that enables a direct and tractable optimization procedure. Building on this insight, we propose SafeDPO, a lightweight method derived from this formulation, which preserves the optimality of the underlying safety-constrained objective while requiring only one additional hyperparameter and minimal modifications to existing preference-based training methods. At the same time, it eliminates the need for reward models, cost models, and online sampling. Despite its simplicity, SafeDPO matches or surpasses state-of-the-art safety alignment methods in both theoretical soundness and empirical performance. Experiments on the PKU-SafeRLHF-30K benchmark show that SafeDPO consistently improves safety while maintaining competitive helpfulness. Ablation studies further show that the additional hyperparameter provides a flexible mechanism to enhance safety without altering the theoretical optimum, and confirm that SafeDPO scales reliably to LLMs with up to 13B parameters. Overall, our results highlight that a simple, theory-driven objective can provide a lightweight yet effective solution for safety alignment in practice.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes SafeDPO, a method for safety-constrained alignment in large language models through a closed-form reformulation of the safety objective. It resides in the 'Direct Preference Optimization Variants' leaf of the taxonomy, which contains only two papers total (including this one). This places the work in a relatively sparse but active research direction within preference-based alignment methods. The sibling paper in this leaf focuses on general DPO improvements, while SafeDPO specifically targets safety-constrained formulations, suggesting a specialized niche within an emerging subfield.

The taxonomy reveals that SafeDPO's parent branch, 'Preference-Based Alignment Methods', contains three distinct approaches: DPO variants, RLHF/reward-based methods, and human preference datasets. Neighboring leaves include RLHF approaches like PKU-SafeRLHF and datasets like BeaverTails that separate helpfulness from harmlessness. The scope note for DPO variants explicitly excludes reward-based methods using critic networks, positioning SafeDPO as part of the reward-free optimization paradigm. Adjacent branches address fine-tuning stage safety and inference-time methods, indicating that SafeDPO operates at the initial alignment stage rather than post-deployment adaptation.

Among the three identified contributions, nine candidate papers were examined for the closed-form reformulation and ten for the SafeDPO algorithm, with no refutable prior work found in either case. The safety-aware preference transformation was not examined against any candidates. The limited search scope (19 total candidates examined across all contributions) suggests these findings reflect top-K semantic matches rather than exhaustive coverage. Given the sparse population of the DPO variants leaf and the absence of clear refutations among examined candidates, the core algorithmic contributions appear relatively novel within the constrained search space, though the theoretical reformulation's novelty depends on how it relates to broader optimization literature not captured here.

Based on the limited literature search covering 19 candidates, SafeDPO appears to occupy a distinct position within the emerging DPO-based safety alignment space. The analysis captures semantic neighbors and direct citations but does not exhaustively cover all theoretical optimization work or parallel developments in constrained preference learning. The sparse taxonomy leaf and zero refutations among examined candidates suggest potential novelty, though definitive assessment would require broader coverage of optimization theory and concurrent safety-constrained methods.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Papers: 0

Research Landscape Overview

Core task: safety alignment in large language models. The field has evolved into a rich ecosystem organized around several major branches. Theoretical Foundations and Limitations explores fundamental questions about what alignment can and cannot achieve, as exemplified by works like Fundamental Limitations Alignment[1]. Safety Alignment Techniques encompasses the methodological core, including preference-based methods such as direct preference optimization variants and reinforcement learning from human feedback approaches like PKU-SafeRLHF[40]. Safety Evaluation and Benchmarking provides the measurement infrastructure through datasets like BeaverTails[20] and domain-specific benchmarks such as MedSafetyBench[17] and SafeLawBench[4]. Robustness and Defense Against Safety Attacks addresses adversarial concerns including multilingual jailbreaks, while Safety Risks and Vulnerability Analysis investigates how fine-tuning can compromise safety, as shown in Fine-Tuning Compromises Safety[26]. Alignment Beyond Safety broadens the scope to moral alignment and other objectives, and Domain-Specific Applications tailors methods to specialized contexts like medicine and traffic safety.

A central tension runs through many branches: the trade-off between model capability and safety, explored in Safety-Capability Trade-offs[30], and the challenge of maintaining alignment during post-training adaptation, as studied in Post-Fine-Tuning Alignment[5]. Within preference-based alignment methods, researchers debate the relative merits of different optimization strategies, with some work (DPO Superior PPO[29]) suggesting that DPO can outperform PPO in certain settings. SafeDPO[0] sits squarely in this active subfield of direct preference optimization variants, proposing refinements to balance safety objectives more effectively during alignment. Its emphasis contrasts with broader approaches like Safety-Aware Fine-Tuning[3], which addresses safety degradation across diverse fine-tuning scenarios, and complements neighboring work on optimizing the preference learning process itself. The landscape reveals ongoing efforts to make alignment both more robust and more practical across deployment contexts.

Claimed Contributions

Closed-form reformulation of safety alignment objective

The authors derive a tractable, closed-form formulation of the constrained safety alignment problem by reformulating it as an unconstrained optimization problem with a modified reward function. This reformulation eliminates the need for surrogate relaxations or auxiliary models while preserving optimality.

9 retrieved papers
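The paper's exact derivation is not reproduced in this report, but under standard RLHF notation (reward r, cost c, reference policy π_ref, KL coefficient β; all symbols here are assumptions, not taken from the paper) a constrained objective of this kind admits a closed form via a Lagrange multiplier:

```latex
% Standard safety-constrained, KL-regularized objective (assumed notation):
\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi}\big[ r(x, y) \big]
  - \beta\, \mathbb{D}_{\mathrm{KL}}\big[ \pi(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]
  \quad \text{s.t.} \quad \mathbb{E}_{x,\, y \sim \pi}\big[ c(x, y) \big] \le 0.

% A multiplier \lambda \ge 0 folds the constraint into a modified reward
% \tilde{r}(x, y) = r(x, y) - \lambda\, c(x, y), giving an unconstrained
% problem whose maximizer has the familiar closed form
\pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\big( \tilde{r}(x, y) / \beta \big).
```

This is the generic route by which a constraint becomes a modified reward; how SafeDPO instantiates it (in particular, how λ relates to its single extra hyperparameter) would need to be checked against the paper itself.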
SafeDPO algorithm

The authors introduce SafeDPO, a simple training method that incorporates binary safety indicators into preference optimization through a safety-aware preference transformation and an optional safety margin. It enables direct, single-stage policy updates without requiring reward models, cost models, or online sampling.

10 retrieved papers
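As a rough illustration of what a single-stage update of this kind can look like, the sketch below implements a per-pair DPO-style loss with an optional safety margin. The function name, the log-probability inputs, and the exact placement of the margin in the logit are assumptions for illustration; only the general DPO loss form is standard.

```python
import math

def safedpo_loss(beta, logp_w, logp_l, ref_logp_w, ref_logp_l, margin=0.0):
    """Hypothetical single-pair SafeDPO-style loss (illustrative only).

    The logit is the standard DPO quantity: beta times the difference of
    winner/loser log-probability ratios against the reference policy. An
    optional safety margin is subtracted from the logit, demanding a larger
    gap in favor of the (safe) winner; where the paper places its margin is
    an assumption here.
    """
    logit = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)) - margin
    # -log(sigmoid(logit)), computed via log1p for numerical stability
    return math.log1p(math.exp(-logit))
```

Note that no reward or cost model appears anywhere: the loss depends only on policy and reference log-probabilities plus the margin hyperparameter, which is what makes a direct, single-stage policy update possible.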
Safety-aware preference transformation

The authors propose a transformation function that reorders preference pairs based on safety indicators: safe winners remain unchanged, unsafe winners are swapped with safe losers, and unsafe-unsafe pairs are discarded. This transformation provides an unbiased estimator of the cost-augmented distribution without requiring access to the latent cost function.

0 retrieved papers
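The reordering rule described above is mechanical enough to state directly in code. The sketch below is a minimal rendering of that rule; the tuple layout and function name are illustrative, not the paper's API.

```python
def transform_preferences(pairs):
    """Safety-aware preference transformation as described in the paper.

    Each pair is (winner, loser, winner_is_safe, loser_is_safe):
    - safe winner: pair kept unchanged;
    - unsafe winner, safe loser: responses swapped;
    - both unsafe: pair discarded.
    Returns the transformed (winner, loser) pairs.
    """
    out = []
    for winner, loser, w_safe, l_safe in pairs:
        if w_safe:
            out.append((winner, loser))       # safe winner: keep as-is
        elif l_safe:
            out.append((loser, winner))       # unsafe winner: swap in safe loser
        # both unsafe: drop the pair entirely
    return out
```

Because the rule only reads binary safety labels, it can be applied once as a dataset preprocessing pass, after which any standard DPO-style trainer consumes the transformed pairs.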

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Closed-form reformulation of safety alignment objective

The authors derive a tractable, closed-form formulation of the constrained safety alignment problem by reformulating it as an unconstrained optimization problem with a modified reward function. This reformulation eliminates the need for surrogate relaxations or auxiliary models while preserving optimality.

Contribution

SafeDPO algorithm

The authors introduce SafeDPO, a simple training method that incorporates binary safety indicators into preference optimization through a safety-aware preference transformation and an optional safety margin. It enables direct, single-stage policy updates without requiring reward models, cost models, or online sampling.

Contribution

Safety-aware preference transformation

The authors propose a transformation function that reorders preference pairs based on safety indicators: safe winners remain unchanged, unsafe winners are swapped with safe losers, and unsafe-unsafe pairs are discarded. This transformation provides an unbiased estimator of the cost-augmented distribution without requiring access to the latent cost function.