Diffusion Alignment as Variational Expectation-Maximization

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Diffusion Model, Alignment, RLHF, Test-time Search
Abstract:

Diffusion alignment aims to optimize diffusion models for downstream objectives. While existing methods based on reinforcement learning or direct backpropagation achieve considerable success in maximizing rewards, they often suffer from reward over-optimization and mode collapse. We introduce Diffusion Alignment as Variational Expectation-Maximization (DAV), a framework that formulates diffusion alignment as an iterative process alternating between two complementary phases: the E-step and the M-step. In the E-step, we employ test-time search to generate diverse and reward-aligned samples. In the M-step, we refine the diffusion model using the samples discovered by the E-step. We demonstrate that DAV can optimize reward while preserving diversity on both continuous and discrete tasks: text-to-image synthesis and DNA sequence design.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a variational expectation-maximization framework for diffusion alignment, alternating between test-time search (E-step) and model refinement (M-step). It resides in the 'Expectation-Maximization Formulations' leaf under 'Variational and Probabilistic Alignment Frameworks', which contains only one sibling paper among the 50 total papers surveyed. This sparse population suggests the EM-based approach represents a relatively underexplored direction within the broader alignment landscape, where most work concentrates on RL-based fine-tuning or test-time guidance methods.

The taxonomy reveals neighboring branches pursuing related goals through different mechanisms. 'GFlowNet-Guided Alignment' offers an alternative probabilistic framework using flow networks, while 'Test-Time Alignment Without Training' (including SMC-based methods) achieves alignment without parameter updates. The 'Reward-Based Alignment via Reinforcement Learning' branch, containing multiple leaves addressing sparse rewards and diversity-oriented training, represents a more crowded research direction. The paper's variational formulation bridges these areas by combining test-time search with iterative model refinement, positioning it at the intersection of probabilistic inference and training-based alignment.

Among 28 candidates examined across three contributions, the analysis found 5 refutable pairs. The DAV framework itself (10 candidates examined, 2 refutable) and the E-step test-time search (10 candidates, 2 refutable) show moderate prior overlap, while the M-step forward-KL distillation (8 candidates, 1 refutable) appears less contested. These statistics indicate that within the limited search scope, some aspects of the approach have precedent in the examined literature, though the specific EM formulation combining both phases may offer a novel integration. The relatively small candidate pool means substantial prior work could exist beyond the top-30 semantic matches.

Based on the limited literature search, the work appears to occupy a sparsely populated methodological niche, with only one sibling paper in its taxonomy leaf. The contribution-level statistics suggest partial novelty: while individual components (test-time search, KL distillation) have some precedent among examined candidates, the integrated EM framework may represent a distinctive synthesis. However, the analysis covers only 28 candidates from semantic search, leaving open the possibility of relevant work outside this scope.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 28
Refutable papers: 5

Research Landscape Overview

Core task: aligning diffusion models with downstream objectives while preserving diversity. The field has organized itself around several complementary strategies for steering generative models toward desired outcomes without collapsing sample variety. Reward-Based Alignment via Reinforcement Learning encompasses methods that treat diffusion sampling as a sequential decision process, optimizing for task-specific rewards through policy-gradient techniques or large-scale RL frameworks such as Large-scale RL[8]. Test-Time Alignment Without Training and Preference-Based Alignment Frameworks offer alternative pathways: the former adjusts sampling dynamics on the fly (e.g., Test-time Alignment[3]), while the latter leverages human or automated feedback to refine model behavior.

Variational and Probabilistic Alignment Frameworks, including Expectation-Maximization formulations, provide principled probabilistic tools for balancing alignment and diversity, often drawing on divergence measures as in f-Divergence Alignment[9]. Meanwhile, branches such as Diversity Enhancement Techniques, Controllable Diversity Management, and Reward-Diversity Trade-Off Analysis explicitly address the tension between optimizing for performance and maintaining output variety, with works like Reward-Diversity Tradeoffs[24] analyzing this balance. Domain-Specific Applications and Conditional Generation branches demonstrate how these ideas translate to concrete settings, from motion synthesis (DiverseMotion[38]) to data augmentation and discrete text diffusion.

A particularly active line of inquiry centers on variational and probabilistic methods that cast alignment as an inference problem, enabling rigorous control over the reward-diversity trade-off. Diffusion Alignment VEM[0] exemplifies this approach by formulating alignment through a variational expectation-maximization lens, offering a structured way to incorporate downstream objectives while retaining distributional richness.
This contrasts with purely RL-driven strategies like Sparse Reward Alignment[1], which may struggle with exploration in sparse feedback regimes, and with test-time methods such as Training-Free Alignment[25], which avoid retraining but can be less stable. Diffusion Alignment VEM[0] sits naturally alongside other probabilistic frameworks like f-Divergence Alignment[9], sharing a focus on principled divergence minimization, yet it distinguishes itself by leveraging EM-style iterative refinement to balance alignment strength and sample diversity more explicitly than gradient-based RL approaches.

Claimed Contributions

Diffusion Alignment as Variational Expectation-Maximization (DAV) framework

The authors propose a novel framework that formulates diffusion model alignment as a variational EM algorithm. The framework alternates between an E-step that uses test-time search to discover diverse, high-reward samples and an M-step that refines the diffusion model by distilling knowledge from discovered samples using forward-KL minimization.
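To make the claimed alternation concrete, the E-step/M-step loop can be sketched on a toy one-dimensional problem. This is a minimal illustration of the general EM-style structure described above, not the authors' implementation: the Gaussian "model", the greedy top-k search stand-in, and all names (`e_step`, `m_step`, `dav_loop`, `reward`) are hypothetical.

```python
import random
import statistics

def reward(x):
    # Toy reward peaked at x = 2.0 (stands in for a downstream objective).
    return -(x - 2.0) ** 2

def e_step(mu, sigma, n=256, k=64):
    # E-step stand-in: sample from the current model and keep the
    # top-k samples by reward (a crude proxy for test-time search).
    samples = [random.gauss(mu, sigma) for _ in range(n)]
    samples.sort(key=reward, reverse=True)
    return samples[:k]

def m_step(selected):
    # M-step stand-in: refit the model to the discovered samples by
    # maximum likelihood, i.e., forward-KL minimization over Gaussians.
    mu = statistics.fmean(selected)
    sigma = max(statistics.pstdev(selected), 1e-3)  # avoid collapse to zero
    return mu, sigma

def dav_loop(mu=0.0, sigma=1.0, iters=10):
    # Alternate E-step search and M-step refitting.
    for _ in range(iters):
        mu, sigma = m_step(e_step(mu, sigma))
    return mu, sigma
```

Running the loop drives the toy model's mean toward the reward peak while the M-step keeps it fit to the full set of discovered samples rather than a single best point.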

10 retrieved papers · Can Refute
E-step test-time search for posterior inference

The authors introduce an E-step that performs test-time search guided by a soft Q-function to effectively discover high-reward, multi-modal trajectories from the variational posterior distribution, enabling thorough exploration of promising regions while preserving diversity.
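The contribution statement does not specify how the soft Q-function guides search, but the general idea of soft (softmax-weighted) selection, as opposed to greedy argmax, can be sketched as follows. This is an assumed, simplified reading: `soft_q_select`, the temperature parameter, and the candidate list are all illustrative, not the paper's method.

```python
import math
import random

def soft_q_select(candidates, q_fn, temperature=1.0, k=4):
    # Softmax over Q-values: higher-Q candidates are favored, but
    # lower-Q candidates retain nonzero probability, so the selection
    # explores multiple modes instead of collapsing onto the argmax.
    qs = [q_fn(c) for c in candidates]
    m = max(qs)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in qs]
    return random.choices(candidates, weights=weights, k=k)
```

Lowering the temperature makes the selection greedier (closer to pure reward maximization); raising it spreads probability mass across more candidates, which is the diversity-preserving behavior the E-step description emphasizes.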

10 retrieved papers · Can Refute
M-step forward-KL distillation for model refinement

The authors propose an M-step that updates the diffusion model by minimizing forward-KL divergence rather than reverse-KL, which is a mode-covering objective that encourages the model to cover all diverse modes discovered through the E-step, preventing mode collapse.
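The mode-covering claim can be illustrated with a standard textbook fact: minimizing the forward KL divergence KL(q || p_theta) over a parametric family, with q represented by samples, reduces to maximum likelihood on those samples. The toy fit below (hypothetical `forward_kl_fit`, Gaussian family only) shows why this covers modes rather than picking one.

```python
import statistics

def forward_kl_fit(samples):
    # Minimizing forward KL(q || p_theta) over a Gaussian family is
    # equivalent to maximum likelihood on samples from q: match the
    # sample mean and standard deviation (moment matching).
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    return mu, sigma

# Samples from a bimodal target with modes near -3 and +3.
bimodal = [-3.1, -2.9, -3.0, 2.9, 3.1, 3.0]
mu, sigma = forward_kl_fit(bimodal)
# The fitted mean lands between the modes and sigma is wide enough to
# cover both: the mode-covering behavior described above. A reverse-KL
# (mode-seeking) fit would instead lock onto a single mode.
```

This is the intuition behind preferring forward KL in the M-step: a model trained this way is pushed to place mass on every mode the E-step discovered, rather than concentrating on the single highest-reward one.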

8 retrieved papers · Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Diffusion Alignment as Variational Expectation-Maximization (DAV) framework

The authors propose a novel framework that formulates diffusion model alignment as a variational EM algorithm. The framework alternates between an E-step that uses test-time search to discover diverse, high-reward samples and an M-step that refines the diffusion model by distilling knowledge from discovered samples using forward-KL minimization.

Contribution

E-step test-time search for posterior inference

The authors introduce an E-step that performs test-time search guided by a soft Q-function to effectively discover high-reward, multi-modal trajectories from the variational posterior distribution, enabling thorough exploration of promising regions while preserving diversity.

Contribution

M-step forward-KL distillation for model refinement

The authors propose an M-step that updates the diffusion model by minimizing forward-KL divergence rather than reverse-KL, which is a mode-covering objective that encourages the model to cover all diverse modes discovered through the E-step, preventing mode collapse.

Diffusion Alignment as Variational Expectation-Maximization | Novelty Validation