SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Safety Alignment, LLM Fine-tuning, Preferences, Large Language Models, AI Safety
Abstract:

As Large Language Models (LLMs) are increasingly deployed in real-world applications, balancing helpfulness and safety has become a central challenge. A natural approach is to incorporate safety constraints into Reinforcement Learning from Human Feedback (RLHF), where recent studies have shown promising progress. However, these methods often rely on auxiliary networks or multi-stage pipelines, thereby increasing complexity. In this work, we revisit the safety alignment objective itself and demonstrate that it admits a closed-form solution, yielding a theoretically grounded and provably equivalent reformulation that enables a direct and tractable optimization procedure. Building on this insight, we propose SafeDPO, a lightweight method derived from this formulation, which preserves the optimality of the underlying safety-constrained objective while requiring only one additional hyperparameter and minimal modifications to existing preference-based training methods. At the same time, it eliminates the need for reward models, cost models, and online sampling. Despite its simplicity, SafeDPO matches or surpasses state-of-the-art safety alignment methods in both theoretical soundness and empirical performance. Experiments on the PKU-SafeRLHF-30K benchmark show that SafeDPO consistently improves safety while maintaining competitive helpfulness. Ablation studies further show that the additional hyperparameter provides a flexible mechanism to enhance safety without altering the theoretical optimum, and confirm that SafeDPO scales reliably to LLMs with up to 13B parameters. Overall, our results highlight that a simple, theory-driven objective can provide a lightweight yet effective solution for safety alignment in practice.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes SafeDPO, a method for safety-constrained alignment in large language models through a closed-form reformulation of the safety objective. It resides in the 'Direct Preference Optimization Variants' leaf of the taxonomy, which contains only two papers total (including this one). This places the work in a relatively sparse but active research direction within preference-based alignment methods. The sibling paper in this leaf focuses on general DPO improvements, while SafeDPO specifically targets safety-constrained formulations, suggesting a specialized niche within an emerging subfield.

The taxonomy reveals that SafeDPO's parent branch, 'Preference-Based Alignment Methods', contains three distinct approaches: DPO variants, RLHF/reward-based methods, and human preference datasets. Neighboring leaves include RLHF approaches like PKU-SafeRLHF and datasets like BeaverTails that separate helpfulness from harmlessness. The scope note for DPO variants explicitly excludes reward-based methods using critic networks, positioning SafeDPO as part of the reward-free optimization paradigm. Adjacent branches address fine-tuning stage safety and inference-time methods, indicating that SafeDPO operates at the initial alignment stage rather than post-deployment adaptation.

Among the three identified contributions, nine candidate papers were examined for the closed-form reformulation and ten for the SafeDPO algorithm, with no refutable prior work found in either case. The safety-aware preference transformation was not examined against any candidates. The limited search scope (19 total candidates examined across all contributions) suggests these findings reflect top-K semantic matches rather than exhaustive coverage. Given the sparse population of the DPO variants leaf and the absence of clear refutations among examined candidates, the core algorithmic contributions appear relatively novel within the constrained search space, though the theoretical reformulation's novelty depends on how it relates to broader optimization literature not captured here.

Based on the limited literature search covering 19 candidates, SafeDPO appears to occupy a distinct position within the emerging DPO-based safety alignment space. The analysis captures semantic neighbors and direct citations but does not exhaustively cover all theoretical optimization work or parallel developments in constrained preference learning. The sparse taxonomy leaf and zero refutations among examined candidates suggest potential novelty, though definitive assessment would require broader coverage of optimization theory and concurrent safety-constrained methods.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 19
Refutable Papers: 0

Research Landscape Overview

Core task: safety alignment in large language models. The field has evolved into a rich ecosystem organized around several major branches. Theoretical Foundations and Limitations explores fundamental questions about what alignment can and cannot achieve, as exemplified by works like Fundamental Limitations Alignment[1]. Safety Alignment Techniques encompasses the methodological core, including preference-based methods such as direct preference optimization variants and reinforcement learning from human feedback approaches like PKU-SafeRLHF[40]. Safety Evaluation and Benchmarking provides the measurement infrastructure through datasets like BeaverTails[20] and domain-specific benchmarks such as MedSafetyBench[17] and SafeLawBench[4]. Robustness and Defense Against Safety Attacks addresses adversarial concerns including multilingual jailbreaks, while Safety Risks and Vulnerability Analysis investigates how fine-tuning can compromise safety, as shown in Fine-Tuning Compromises Safety[26]. Alignment Beyond Safety broadens the scope to moral alignment and other objectives, and Domain-Specific Applications tailors methods to specialized contexts like medicine and traffic safety.

A central tension runs through many branches: the trade-off between model capability and safety, explored in Safety-Capability Trade-offs[30], and the challenge of maintaining alignment during post-training adaptation, as studied in Post-Fine-Tuning Alignment[5]. Within preference-based alignment methods, researchers debate the relative merits of different optimization strategies, with some work (DPO Superior PPO[29]) suggesting that DPO can outperform PPO in certain settings. SafeDPO[0] sits squarely in this active subfield of direct preference optimization variants, proposing refinements to balance safety objectives more effectively during alignment. Its emphasis contrasts with broader approaches like Safety-Aware Fine-Tuning[3], which addresses safety degradation across diverse fine-tuning scenarios, and complements neighboring work on optimizing the preference learning process itself. The landscape reveals ongoing efforts to make alignment both more robust and more practical across deployment contexts.

Claimed Contributions

Closed-form reformulation of safety alignment objective

The authors derive a tractable, closed-form formulation of the constrained safety alignment problem by reformulating it as an unconstrained optimization problem with a modified reward function. This reformulation eliminates the need for surrogate relaxations or auxiliary models while preserving optimality.

9 retrieved papers
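The paper's exact derivation is not reproduced in this report, but under standard RLHF notation (reward r, cost c, reference policy π_ref, KL coefficient β; all symbols here are assumptions, not taken from the paper) a constrained objective of this kind admits a closed form via a Lagrange multiplier:

```latex
% Standard safety-constrained, KL-regularized objective (assumed notation):
\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi}\big[ r(x, y) \big]
  - \beta\, \mathbb{D}_{\mathrm{KL}}\big[ \pi(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]
  \quad \text{s.t.} \quad \mathbb{E}_{x,\, y \sim \pi}\big[ c(x, y) \big] \le 0.

% A multiplier \lambda \ge 0 folds the constraint into a modified reward
% \tilde{r}(x, y) = r(x, y) - \lambda\, c(x, y), giving an unconstrained
% problem whose maximizer has the familiar closed form
\pi^{*}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\big( \tilde{r}(x, y) / \beta \big).
```

This is the generic route by which a constraint becomes a modified reward; how SafeDPO instantiates it (in particular, how λ relates to its single extra hyperparameter) would need to be checked against the paper itself.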
SafeDPO algorithm

The authors introduce SafeDPO, a simple training method that incorporates binary safety indicators into preference optimization through a safety-aware preference transformation and an optional safety margin. It enables direct, single-stage policy updates without requiring reward models, cost models, or online sampling.

10 retrieved papers
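As a rough illustration of what a single-stage update of this kind can look like, the sketch below implements a per-pair DPO-style loss with an optional safety margin. The function name, the log-probability inputs, and the exact placement of the margin in the logit are assumptions for illustration; only the general DPO loss form is standard.

```python
import math

def safedpo_loss(beta, logp_w, logp_l, ref_logp_w, ref_logp_l, margin=0.0):
    """Hypothetical single-pair SafeDPO-style loss (illustrative only).

    The logit is the standard DPO quantity: beta times the difference of
    winner/loser log-probability ratios against the reference policy. An
    optional safety margin is subtracted from the logit, demanding a larger
    gap in favor of the (safe) winner; where the paper places its margin is
    an assumption here.
    """
    logit = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)) - margin
    # -log(sigmoid(logit)), computed via log1p for numerical stability
    return math.log1p(math.exp(-logit))
```

Note that no reward or cost model appears anywhere: the loss depends only on policy and reference log-probabilities plus the margin hyperparameter, which is what makes a direct, single-stage policy update possible.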
Safety-aware preference transformation

The authors propose a transformation function that reorders preference pairs based on safety indicators: safe winners remain unchanged, unsafe winners are swapped with safe losers, and unsafe-unsafe pairs are discarded. This transformation provides an unbiased estimator of the cost-augmented distribution without requiring access to the latent cost function.

0 retrieved papers
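The reordering rule described above is mechanical enough to state directly in code. The sketch below is a minimal rendering of that rule; the tuple layout and function name are illustrative, not the paper's API.

```python
def transform_preferences(pairs):
    """Safety-aware preference transformation as described in the paper.

    Each pair is (winner, loser, winner_is_safe, loser_is_safe):
    - safe winner: pair kept unchanged;
    - unsafe winner, safe loser: responses swapped;
    - both unsafe: pair discarded.
    Returns the transformed (winner, loser) pairs.
    """
    out = []
    for winner, loser, w_safe, l_safe in pairs:
        if w_safe:
            out.append((winner, loser))       # safe winner: keep as-is
        elif l_safe:
            out.append((loser, winner))       # unsafe winner: swap in safe loser
        # both unsafe: drop the pair entirely
    return out
```

Because the rule only reads binary safety labels, it can be applied once as a dataset preprocessing pass, after which any standard DPO-style trainer consumes the transformed pairs.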

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Closed-form reformulation of safety alignment objective

The authors derive a tractable, closed-form formulation of the constrained safety alignment problem by reformulating it as an unconstrained optimization problem with a modified reward function. This reformulation eliminates the need for surrogate relaxations or auxiliary models while preserving optimality.

Contribution

SafeDPO algorithm

The authors introduce SafeDPO, a simple training method that incorporates binary safety indicators into preference optimization through a safety-aware preference transformation and an optional safety margin. It enables direct, single-stage policy updates without requiring reward models, cost models, or online sampling.

Contribution

Safety-aware preference transformation

The authors propose a transformation function that reorders preference pairs based on safety indicators: safe winners remain unchanged, unsafe winners are swapped with safe losers, and unsafe-unsafe pairs are discarded. This transformation provides an unbiased estimator of the cost-augmented distribution without requiring access to the latent cost function.