Diffusion Alignment as Variational Expectation-Maximization

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Diffusion Model, Alignment, RLHF, Test-time Search
Abstract:

Diffusion alignment aims to optimize diffusion models for downstream objectives. While existing methods based on reinforcement learning or direct backpropagation achieve considerable success in maximizing rewards, they often suffer from reward over-optimization and mode collapse. We introduce Diffusion Alignment as Variational Expectation-Maximization (DAV), a framework that formulates diffusion alignment as an iterative process alternating between two complementary phases: the E-step and the M-step. In the E-step, we employ test-time search to generate diverse and reward-aligned samples. In the M-step, we refine the diffusion model using the samples discovered by the E-step. We demonstrate that DAV can optimize reward while preserving diversity on both continuous and discrete tasks: text-to-image synthesis and DNA sequence design.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a variational expectation-maximization framework for diffusion alignment, alternating between test-time search (E-step) and model refinement (M-step). It resides in the 'Expectation-Maximization Formulations' leaf under 'Variational and Probabilistic Alignment Frameworks', which contains only one sibling paper among the 50 total papers surveyed. This sparse population suggests the EM-based approach represents a relatively underexplored direction within the broader alignment landscape, where most work concentrates on RL-based fine-tuning or test-time guidance methods.

The taxonomy reveals neighboring branches pursuing related goals through different mechanisms. 'GFlowNet-Guided Alignment' offers an alternative probabilistic framework using flow networks, while 'Test-Time Alignment Without Training' (including SMC-based methods) achieves alignment without parameter updates. The 'Reward-Based Alignment via Reinforcement Learning' branch, containing multiple leaves addressing sparse rewards and diversity-oriented training, represents a more crowded research direction. The paper's variational formulation bridges these areas by combining test-time search with iterative model refinement, positioning it at the intersection of probabilistic inference and training-based alignment.

Among 28 candidates examined across three contributions, the analysis found 5 refutable pairs. The DAV framework itself (10 candidates examined, 2 refutable) and the E-step test-time search (10 candidates, 2 refutable) show moderate prior overlap, while the M-step forward-KL distillation (8 candidates, 1 refutable) appears less contested. These statistics indicate that within the limited search scope, some aspects of the approach have precedent in the examined literature, though the specific EM formulation combining both phases may offer a novel integration. The relatively small candidate pool means substantial prior work could exist beyond the top-30 semantic matches.

Based on the limited literature search, the work appears to occupy a sparsely populated methodological niche, with only one sibling paper in its taxonomy leaf. The contribution-level statistics suggest partial novelty: while individual components (test-time search, KL distillation) have some precedent among examined candidates, the integrated EM framework may represent a distinctive synthesis. However, the analysis covers only 28 candidates from semantic search, leaving open the possibility of relevant work outside this scope.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 28
Refutable papers: 5

Research Landscape Overview

Core task: aligning diffusion models with downstream objectives while preserving diversity. The field has organized itself around several complementary strategies for steering generative models toward desired outcomes without collapsing sample variety. Reward-Based Alignment via Reinforcement Learning encompasses methods that treat diffusion sampling as a sequential decision process, optimizing for task-specific rewards through policy-gradient techniques or large-scale RL frameworks such as Large-scale RL[8]. Test-Time Alignment Without Training and Preference-Based Alignment Frameworks offer alternative pathways: the former adjusts sampling dynamics on the fly (e.g., Test-time Alignment[3]), while the latter leverages human or automated feedback to refine model behavior.

Variational and Probabilistic Alignment Frameworks, including Expectation-Maximization formulations, provide principled probabilistic tools for balancing alignment and diversity, often drawing on divergence measures as in f-Divergence Alignment[9]. Meanwhile, branches such as Diversity Enhancement Techniques, Controllable Diversity Management, and Reward-Diversity Trade-Off Analysis explicitly address the tension between optimizing for performance and maintaining output variety, with works like Reward-Diversity Tradeoffs[24] analyzing this balance. Domain-Specific Applications and Conditional Generation branches demonstrate how these ideas translate to concrete settings, from motion synthesis (DiverseMotion[38]) to data augmentation and discrete text diffusion.

A particularly active line of inquiry centers on variational and probabilistic methods that cast alignment as an inference problem, enabling rigorous control over the reward-diversity trade-off. Diffusion Alignment VEM[0] exemplifies this approach by formulating alignment through a variational expectation-maximization lens, offering a structured way to incorporate downstream objectives while retaining distributional richness.
This contrasts with purely RL-driven strategies like Sparse Reward Alignment[1], which may struggle with exploration in sparse feedback regimes, and with test-time methods such as Training-Free Alignment[25], which avoid retraining but can be less stable. Diffusion Alignment VEM[0] sits naturally alongside other probabilistic frameworks like f-Divergence Alignment[9], sharing a focus on principled divergence minimization, yet it distinguishes itself by leveraging EM-style iterative refinement to balance alignment strength and sample diversity more explicitly than gradient-based RL approaches.

Claimed Contributions

Diffusion Alignment as Variational Expectation-Maximization (DAV) framework

The authors propose a novel framework that formulates diffusion model alignment as a variational EM algorithm. The framework alternates between an E-step that uses test-time search to discover diverse, high-reward samples and an M-step that refines the diffusion model by distilling knowledge from discovered samples using forward-KL minimization.
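To make the claimed alternation concrete, the E-step/M-step loop can be sketched on a toy one-dimensional problem. This is a minimal illustration of the general EM-style structure described above, not the authors' implementation: the Gaussian "model", the greedy top-k search stand-in, and all names (`e_step`, `m_step`, `dav_loop`, `reward`) are hypothetical.

```python
import random
import statistics

def reward(x):
    # Toy reward peaked at x = 2.0 (stands in for a downstream objective).
    return -(x - 2.0) ** 2

def e_step(mu, sigma, n=256, k=64):
    # E-step stand-in: sample from the current model and keep the
    # top-k samples by reward (a crude proxy for test-time search).
    samples = [random.gauss(mu, sigma) for _ in range(n)]
    samples.sort(key=reward, reverse=True)
    return samples[:k]

def m_step(selected):
    # M-step stand-in: refit the model to the discovered samples by
    # maximum likelihood, i.e., forward-KL minimization over Gaussians.
    mu = statistics.fmean(selected)
    sigma = max(statistics.pstdev(selected), 1e-3)  # avoid collapse to zero
    return mu, sigma

def dav_loop(mu=0.0, sigma=1.0, iters=10):
    # Alternate E-step search and M-step refitting.
    for _ in range(iters):
        mu, sigma = m_step(e_step(mu, sigma))
    return mu, sigma
```

Running the loop drives the toy model's mean toward the reward peak while the M-step keeps it fit to the full set of discovered samples rather than a single best point.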

10 retrieved papers · Can Refute
E-step test-time search for posterior inference

The authors introduce an E-step that performs test-time search guided by a soft Q-function to effectively discover high-reward, multi-modal trajectories from the variational posterior distribution, enabling thorough exploration of promising regions while preserving diversity.
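The contribution statement does not specify how the soft Q-function guides search, but the general idea of soft (softmax-weighted) selection, as opposed to greedy argmax, can be sketched as follows. This is an assumed, simplified reading: `soft_q_select`, the temperature parameter, and the candidate list are all illustrative, not the paper's method.

```python
import math
import random

def soft_q_select(candidates, q_fn, temperature=1.0, k=4):
    # Softmax over Q-values: higher-Q candidates are favored, but
    # lower-Q candidates retain nonzero probability, so the selection
    # explores multiple modes instead of collapsing onto the argmax.
    qs = [q_fn(c) for c in candidates]
    m = max(qs)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in qs]
    return random.choices(candidates, weights=weights, k=k)
```

Lowering the temperature makes the selection greedier (closer to pure reward maximization); raising it spreads probability mass across more candidates, which is the diversity-preserving behavior the E-step description emphasizes.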

10 retrieved papers · Can Refute
M-step forward-KL distillation for model refinement

The authors propose an M-step that updates the diffusion model by minimizing forward-KL divergence rather than reverse-KL, which is a mode-covering objective that encourages the model to cover all diverse modes discovered through the E-step, preventing mode collapse.
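The mode-covering claim can be illustrated with a standard textbook fact: minimizing the forward KL divergence KL(q || p_theta) over a parametric family, with q represented by samples, reduces to maximum likelihood on those samples. The toy fit below (hypothetical `forward_kl_fit`, Gaussian family only) shows why this covers modes rather than picking one.

```python
import statistics

def forward_kl_fit(samples):
    # Minimizing forward KL(q || p_theta) over a Gaussian family is
    # equivalent to maximum likelihood on samples from q: match the
    # sample mean and standard deviation (moment matching).
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    return mu, sigma

# Samples from a bimodal target with modes near -3 and +3.
bimodal = [-3.1, -2.9, -3.0, 2.9, 3.1, 3.0]
mu, sigma = forward_kl_fit(bimodal)
# The fitted mean lands between the modes and sigma is wide enough to
# cover both: the mode-covering behavior described above. A reverse-KL
# (mode-seeking) fit would instead lock onto a single mode.
```

This is the intuition behind preferring forward KL in the M-step: a model trained this way is pushed to place mass on every mode the E-step discovered, rather than concentrating on the single highest-reward one.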

8 retrieved papers · Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Diffusion Alignment as Variational Expectation-Maximization (DAV) framework

The authors propose a novel framework that formulates diffusion model alignment as a variational EM algorithm. The framework alternates between an E-step that uses test-time search to discover diverse, high-reward samples and an M-step that refines the diffusion model by distilling knowledge from discovered samples using forward-KL minimization.

Contribution

E-step test-time search for posterior inference

The authors introduce an E-step that performs test-time search guided by a soft Q-function to effectively discover high-reward, multi-modal trajectories from the variational posterior distribution, enabling thorough exploration of promising regions while preserving diversity.

Contribution

M-step forward-KL distillation for model refinement

The authors propose an M-step that updates the diffusion model by minimizing forward-KL divergence rather than reverse-KL, which is a mode-covering objective that encourages the model to cover all diverse modes discovered through the E-step, preventing mode collapse.

Diffusion Alignment as Variational Expectation-Maximization | Novelty Validation