Reinforcing Diffusion Models by Direct Group Preference Optimization

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Diffusion Models; Reinforcement Learning
Abstract:

While reinforcement learning methods such as Group Relative Policy Optimization (GRPO) have significantly enhanced Large Language Models, adapting them to diffusion models remains challenging. In particular, GRPO demands a stochastic policy, yet the most cost-effective diffusion samplers are based on deterministic ODEs. Recent work addresses this issue by using inefficient SDE-based samplers to induce stochasticity, but this reliance on model-agnostic Gaussian noise leads to slow convergence. To resolve this conflict, we propose Direct Group Preference Optimization (DGPO), a new online RL algorithm that dispenses with the policy-gradient framework entirely. DGPO learns directly from group-level preferences, which exploit the relative information of samples within groups. This design eliminates the need for inefficient stochastic policies, unlocking the use of efficient deterministic ODE samplers and faster training. Extensive results show that DGPO trains around 20 times faster than existing state-of-the-art methods and achieves superior performance on both in-domain and out-of-domain reward metrics.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Direct Group Preference Optimization (DGPO), an online reinforcement learning algorithm that learns from group-level preferences without relying on policy gradients. It resides in the 'Group Relative Policy Optimization' leaf of the taxonomy, which contains only three papers total (including this work and two siblings: Mask-GRPO and DanceGRPO). This represents a relatively sparse research direction within the broader policy gradient family, suggesting the group-relative optimization paradigm is still emerging compared to more established branches like discrete-time policy gradient approaches or direct preference optimization methods.

The taxonomy reveals that DGPO's immediate neighbors include discrete-time policy gradient methods (DPOK, Diffusion RL Training) and continuous-time stochastic control formulations, both of which rely on traditional RL frameworks. Nearby branches such as Direct Preference Optimization Methods (Diffusion DPO, self-play approaches) and Variance Reduction techniques address overlapping challenges—sample efficiency and training stability—but through fundamentally different mechanisms. The group relative optimization leaf explicitly excludes pairwise-only methods and non-group-based approaches, positioning DGPO as a distinct alternative that aggregates feedback across sample batches rather than individual comparisons or gradient-based updates.

Among the three contributions analyzed, the core DGPO algorithm examined ten candidates with zero refutations, suggesting novelty within the limited search scope. The advantage-based weighting strategy examined two candidates and found one refutable match, indicating some overlap with prior work on group preference weighting. The timestep clip strategy also examined ten candidates with no refutations. Overall, the analysis covered twenty-two total candidates from semantic search and citation expansion, not an exhaustive literature review. The advantage weighting component appears to have the most substantial prior work among the contributions examined.

Based on the limited search scope of twenty-two candidates, DGPO appears to introduce a novel algorithmic framework within the sparse group-relative optimization direction. The core algorithm and timestep clipping show no clear refutations among examined papers, while the advantage weighting has identifiable precedent. The taxonomy context suggests this work addresses a recognized gap—efficient deterministic sampling for group preferences—in a relatively underdeveloped research area, though the search scope does not cover all potentially relevant prior work in broader RL or diffusion literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 1

Research Landscape Overview

Core task: Reinforcement learning for aligning diffusion models with human preferences. The field has organized itself into several major branches that reflect different methodological philosophies and application contexts. Policy Gradient and Online RL Methods (including works like DPOK[3] and Diffusion RL Training[1]) emphasize iterative policy updates through gradient-based optimization, often requiring online interaction with reward models. Direct Preference Optimization Methods (such as Diffusion DPO[15]) bypass explicit reward modeling by learning directly from pairwise comparisons, trading sample efficiency for computational simplicity. Reward Modeling and Feedback Integration focuses on constructing and refining reward signals from human annotations, while Multi-Objective and Personalized Alignment addresses the challenge of balancing diverse user preferences. Domain-Specific Applications tailor these techniques to particular modalities like text-to-image generation or speech synthesis, and Architectural and Component-Level Optimization explores modifications to model components such as text encoders. Offline and Data-Driven Approaches leverage pre-collected datasets, One-Step and Fast Sampling Alignment targets inference efficiency, and Theoretical Foundations and Analysis provides formal guarantees and convergence properties.

Within the policy gradient family, a particularly active line of work centers on group-relative optimization strategies that aggregate feedback across batches of samples to stabilize training. Direct Group Preference[0] sits squarely in this cluster alongside Mask-GRPO[19] and DanceGRPO[23], all of which refine how group-level comparisons are constructed and weighted. These methods contrast with approaches like Fine-tune Diffusion Preference[2] or Inference-Time Alignment[4], which either operate in a supervised fine-tuning regime or defer alignment adjustments to the sampling phase.
A recurring theme across branches is the tension between sample efficiency and computational overhead: online RL methods can adapt quickly but demand substantial interaction, while offline and direct preference techniques reduce this burden at the cost of potential distribution mismatch. Direct Group Preference[0] emphasizes leveraging group statistics to improve robustness, positioning itself as a middle ground that retains online adaptability while mitigating variance through collective feedback signals.

Claimed Contributions

Direct Group Preference Optimization (DGPO) algorithm

DGPO is a novel online reinforcement learning method for diffusion models that learns directly from group-level preferences without requiring a stochastic policy. This eliminates the need for inefficient SDE-based samplers and enables the use of efficient deterministic ODE samplers, resulting in significantly faster training while leveraging fine-grained relative preference information within groups.

Retrieved papers: 10

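The group-relative learning signal described above can be sketched in a few lines. This is an illustrative reading, not the paper's exact objective: `group_advantages` and `dgpo_objective` are hypothetical names, and the per-sample `losses` stand in for a denoising loss computed on rollouts from a deterministic ODE sampler, which this sketch does not model.

```python
# Hedged sketch of a DGPO-style group-relative objective (illustrative only,
# not the paper's actual formulation).
import statistics

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each reward is normalized against its own
    group's mean and standard deviation, so no learned critic is needed."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def dgpo_objective(rewards, losses):
    """Advantage-weighted objective over one group. Minimizing it lowers the
    denoising loss on high-advantage samples (pulling the model toward them)
    and raises it on low-advantage ones (pushing the model away), using only
    relative information within the group."""
    adv = group_advantages(rewards)
    return sum(a * l for a, l in zip(adv, losses)) / len(rewards)
```

Because the advantages are zero-mean within each group, a uniformly rewarded group contributes no net update, which matches the intuition that only relative preference information drives learning.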
Advantage-based weighting strategy for group preferences

The method introduces a weighting scheme based on normalized advantages that assigns larger weights to samples deviating more from the group average. This design ensures the sum of weights in positive and negative groups are equal, eliminating the intractable partition function while enabling the model to effectively learn relative preference relationships.

Retrieved papers: 2 (can refute)

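The balancing property described above can be illustrated with a small sketch (our reading of the description, not the paper's exact formula; `balanced_weights` is a hypothetical name): weights grow with the magnitude of the normalized advantage, and each side is rescaled so the positive and negative groups carry equal total weight, removing the dependence on a global normalizing constant.

```python
# Hedged sketch of an advantage-based weighting with balanced group masses
# (illustrative; the paper's actual weighting may differ).
def balanced_weights(advantages, eps=1e-12):
    """Split normalized advantages into positive and negative groups and
    rescale each side to unit total mass, so both groups contribute equally
    and no intractable partition function is needed."""
    pos = [max(a, 0.0) for a in advantages]    # weight of preferred samples
    neg = [max(-a, 0.0) for a in advantages]   # weight of dispreferred samples
    pos_sum, neg_sum = sum(pos), sum(neg)
    w_pos = [p / (pos_sum + eps) for p in pos]
    w_neg = [n / (neg_sum + eps) for n in neg]
    return w_pos, w_neg
```

Samples deviating more from the group average receive proportionally larger weights on their side, while the equal-mass rescaling keeps the positive and negative contributions in balance.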
Timestep Clip Strategy

A training technique that restricts timestep sampling to a range excluding very small timesteps, preventing the model from overfitting to artifacts such as blurriness in samples generated with few steps. This strategy enables effective training even when using computationally efficient few-step generation for online rollouts.

Retrieved papers: 10
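The timestep clip strategy amounts to restricting the sampling range of training timesteps. The sketch below assumes timesteps normalized to [0, 1] and an illustrative cutoff `t_min`; the paper's exact range and parameterization may differ.

```python
# Minimal sketch of a timestep clip: draw training timesteps uniformly from
# [t_min, 1] instead of [0, 1], skipping the very small timesteps where
# few-step rollouts exhibit artifacts such as blurriness.
# `t_min` is a hypothetical hyperparameter, not a value from the paper.
import random

def sample_clipped_timestep(t_min=0.1, rng=random):
    """Uniform timestep in [t_min, 1], excluding t < t_min."""
    return t_min + (1.0 - t_min) * rng.random()
```

Excluding the smallest timesteps keeps the training signal away from the regime where few-step generation is least faithful, which is what allows cheap few-step rollouts to be used for online training.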

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Direct Group Preference Optimization (DGPO) algorithm


Contribution

Advantage-based weighting strategy for group preferences


Contribution

Timestep Clip Strategy

