Toward Safer Diffusion Language Models: Discovery and Mitigation of Priming Vulnerability
Overview
Overall Novelty Assessment
The paper identifies a priming vulnerability in diffusion language models (DLMs) where affirmative tokens appearing at intermediate denoising steps can steer aligned models toward harmful outputs, and proposes Recovery Alignment (RA) to train models to generate safe responses from contaminated intermediate states. Within the taxonomy, it resides in the 'Diffusion Model-Specific Vulnerabilities' leaf alongside three sibling papers examining diffusion-specific attack surfaces. This leaf is relatively sparse compared to the broader 'General Jailbreak Attack Strategies' category, suggesting that diffusion-specific safety research remains an emerging area with fewer established works.
The taxonomy tree reveals that diffusion-specific vulnerabilities form a distinct branch separate from general jailbreak attacks and multimodal exploits. Neighboring leaves include 'Diffusion Model-Specific Alignment' (containing one defense paper) and 'General Jailbreak Attack Strategies' (five papers on optimization-based attacks). The paper bridges attack analysis and defense: it characterizes a vulnerability mechanism while proposing a training-based countermeasure. This positions it at the intersection of vulnerability discovery and alignment methods, connecting to both 'Training-Based Safety Alignment' and 'Adversarial Training Approaches' branches through its RA technique.
Among the 22 candidates examined, none clearly refutes the three core contributions. For the priming vulnerability discovery, 3 candidates were examined with no refutations, suggesting limited prior work explicitly characterizing this temporal attack surface in DLMs. For the Recovery Alignment method, 10 candidates were examined without refutation, indicating that training models to recover from contaminated intermediate states appears novel within the search scope. For the theoretical lower bound contribution, 9 candidates were examined, also without refutation. These counts reflect a focused literature search rather than exhaustive coverage, but they suggest the work addresses gaps in understanding diffusion-specific safety dynamics.
Given the limited search scope of 22 candidates and the sparse population of the diffusion-specific vulnerability leaf, the work appears to contribute novel insights into temporal attack surfaces unique to iterative denoising architectures. The analysis does not cover broader alignment literature or non-diffusion safety methods, so the assessment is constrained to the examined semantic neighborhood. The combination of vulnerability characterization and tailored defense within a relatively underexplored research direction suggests substantive originality within the scope analyzed.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors identify and systematically analyze a critical safety vulnerability specific to Masked Diffusion Language Models (MDLMs) where affirmative tokens appearing during intermediate denoising steps can bias subsequent generation toward harmful outputs, even in safety-aligned models. They design controlled attacks to quantify this vulnerability and demonstrate its severity through experiments.
The authors propose Recovery Alignment (RA), a safety alignment framework tailored to MDLMs that explicitly trains models to generate safe responses from contaminated intermediate states containing affirmative tokens. This approach addresses the priming vulnerability by teaching models recovery trajectories from harmful intermediate states back to safety.
The authors derive a tractable theoretical lower bound (Theorem 4.1) that enables optimization-based jailbreak attacks to exploit the priming vulnerability without requiring direct intervention in the denoising process. This demonstrates that realistic attackers who can only modify prompts can still leverage this vulnerability through gradient-based optimization.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[4] Where to start alignment? diffusion large language model may demand a distinct position
[15] Diffuguard: How intrinsic safety is lost and found in diffusion large language models
[20] The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs
Contribution Analysis
Detailed comparisons for each claimed contribution
Discovery and characterization of priming vulnerability in MDLMs
The authors identify and systematically analyze a critical safety vulnerability specific to Masked Diffusion Language Models where affirmative tokens appearing during intermediate denoising steps can bias subsequent generation toward harmful outputs, even in safety-aligned models. They design controlled attacks to quantify this vulnerability and demonstrate its severity through experiments.
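The contamination mechanism described above can be illustrated with a toy sketch. This is not the paper's attack implementation; the `MASK` sentinel, the function name, and the specific affirmative tokens are all illustrative assumptions. The sketch shows the core idea: at an intermediate denoising step, some response positions are still masked, and an attacker-controlled process plants affirmative tokens into those positions so that later denoising steps condition on them.

```python
# Hypothetical sketch of priming contamination in a masked diffusion LM.
# MASK, inject_affirmative_tokens, and the token choices are illustrative,
# not taken from the paper's implementation.

MASK = "[MASK]"

def inject_affirmative_tokens(intermediate_state, affirmative=("Sure", ",", "here")):
    """Overwrite still-masked positions in an intermediate denoising state
    with affirmative tokens, simulating contamination at a mid-denoising
    step; subsequent steps would then condition on these planted tokens."""
    state = list(intermediate_state)
    tokens = iter(affirmative)
    for i, tok in enumerate(state):
        if tok == MASK:
            try:
                state[i] = next(tokens)
            except StopIteration:
                break  # no more affirmative tokens to plant
    return state

# Example: a partially denoised response with several positions still masked.
step_t_state = ["I", MASK, MASK, MASK, "how", "to", MASK]
contaminated = inject_affirmative_tokens(step_t_state)
print(contaminated)  # ['I', 'Sure', ',', 'here', 'how', 'to', '[MASK]']
```

In the paper's setting the contaminated state then biases the remaining denoising steps toward harmful completions, which is what the controlled attacks quantify.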
Recovery Alignment (RA) method for MDLM safety
The authors propose a novel safety alignment framework tailored to MDLMs that explicitly trains models to generate safe responses from contaminated intermediate states containing affirmative tokens. This approach addresses the priming vulnerability by teaching models recovery trajectories from harmful intermediate states back to safety.
[3] Safety Alignment Should Be Made More Than Just a Few Tokens Deep
[29] How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
[41] Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
[42] Targeted vaccine: Safety alignment for large language models against harmful fine-tuning via layer-wise perturbation
[43] MixAT: Combining Continuous and Discrete Adversarial Training for LLMs
[44] Safety misalignment against large language models
[45] PRM-Free Security Alignment of Large Models via Red Teaming and Adversarial Training
[46] Gameplay filters: Robust zero-shot safety through adversarial imagination
[47] Advancing LLM Safe Alignment with Safety Representation Ranking
[48] Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks
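The recovery-training idea can be sketched as data construction: build intermediate states whose unmasked positions hold the priming tokens, and supervise the model to produce the safe response at every still-masked position. This is a minimal illustration under stated assumptions, not the paper's RA objective; `make_recovery_example` and the specific tokens are hypothetical.

```python
# Hedged sketch of Recovery Alignment (RA) style training-pair construction.
# The function name and tokens are illustrative; the real method trains an
# MDLM denoiser, whereas this only builds the (state, targets) supervision.

MASK = "[MASK]"

def make_recovery_example(safe_target, affirmative=("Sure", ",", "here")):
    """Build one RA-style training pair: a contaminated intermediate state
    whose unmasked prefix holds affirmative (priming) tokens, plus the safe
    tokens the model should recover at every remaining masked position."""
    state = [MASK] * len(safe_target)
    # Plant the affirmative prefix as if it had been denoised early.
    for i, tok in enumerate(affirmative):
        if i < len(state):
            state[i] = tok
    # Supervision: predict the safe target at all still-masked positions.
    targets = {i: safe_target[i] for i, s in enumerate(state) if s == MASK}
    return state, targets

safe = ["I", "cannot", "help", "with", "that", "request", "."]
state, targets = make_recovery_example(safe)
print(state)    # ['Sure', ',', 'here', '[MASK]', '[MASK]', '[MASK]', '[MASK]']
print(targets)  # {3: 'with', 4: 'that', 5: 'request', 6: '.'}
```

Training on such pairs would penalize the model for continuing the affirmative prefix, teaching it a recovery trajectory back to a refusal, which is the stated goal of RA.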
Theoretical lower bound for exploiting priming vulnerability without intervention
The authors derive a tractable theoretical lower bound (Theorem 4.1) that enables optimization-based jailbreak attacks to exploit the priming vulnerability without requiring direct intervention in the denoising process. This demonstrates that realistic attackers who can only modify prompts can still leverage this vulnerability through gradient-based optimization.
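The attacker model above (prompt-only access, no intervention in denoising) can be illustrated with a toy discrete search. This is not Theorem 4.1: `surrogate_score` is a stand-in for the paper's tractable lower bound on affirmative-token probability mid-denoising, and the greedy coordinate search stands in for gradient-guided token swaps; all names and weights are hypothetical.

```python
# Illustrative sketch (not the paper's Theorem 4.1): a prompt-only attacker
# maximizes a tractable surrogate objective by coordinate-wise token swaps,
# mirroring gradient-based discrete prompt optimization.

def surrogate_score(prompt_tokens):
    """Toy stand-in for the lower-bound objective; a real attack would score
    candidates with the model's denoiser rather than a fixed weight table."""
    trigger_weight = {"please": 1.0, "sure": 2.0, "step": 1.5}
    return sum(trigger_weight.get(t, 0.0) for t in prompt_tokens)

def greedy_prompt_attack(prompt_tokens, vocab, max_rounds=10):
    """Greedy coordinate search: at each position, keep the vocabulary token
    that most increases the surrogate objective; stop when no swap helps."""
    best = list(prompt_tokens)
    for _ in range(max_rounds):
        improved = False
        for i in range(len(best)):
            for cand in vocab:
                trial = best[:i] + [cand] + best[i + 1:]
                if surrogate_score(trial) > surrogate_score(best):
                    best, improved = trial, True
        if not improved:
            break
    return best

vocab = ["please", "sure", "step", "tell"]
adv = greedy_prompt_attack(["tell", "me", "how"], vocab)
print(adv)  # ['sure', 'sure', 'sure']
```

The point of the contribution is that such an optimization needs only prompt access: the lower bound makes the intermediate-step priming objective differentiable and tractable, so the attacker never has to touch the denoising trajectory directly.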