Reinforcing Diffusion Models by Direct Group Preference Optimization

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Diffusion Models; Reinforcement Learning
Abstract:

While reinforcement learning methods such as Group Relative Policy Optimization (GRPO) have significantly enhanced Large Language Models, adapting them to diffusion models remains challenging. In particular, GRPO demands a stochastic policy, yet the most cost-effective diffusion samplers are based on deterministic ODEs. Recent work addresses this issue by using inefficient SDE-based samplers to induce stochasticity, but this reliance on model-agnostic Gaussian noise leads to slow convergence. To resolve this conflict, we propose Direct Group Preference Optimization (DGPO), a new online RL algorithm that dispenses with the policy-gradient framework entirely. DGPO learns directly from group-level preferences, which exploit the relative information of samples within groups. This design eliminates the need for inefficient stochastic policies, unlocking the use of efficient deterministic ODE samplers and faster training. Extensive results show that DGPO trains around 20 times faster than existing state-of-the-art methods and achieves superior performance on both in-domain and out-of-domain reward metrics.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Direct Group Preference Optimization (DGPO), an online reinforcement learning algorithm that learns from group-level preferences without relying on policy gradients. It resides in the 'Group Relative Policy Optimization' leaf of the taxonomy, which contains only three papers total (including this work and two siblings: Mask-GRPO and DanceGRPO). This represents a relatively sparse research direction within the broader policy gradient family, suggesting the group-relative optimization paradigm is still emerging compared to more established branches like discrete-time policy gradient approaches or direct preference optimization methods.

The taxonomy reveals that DGPO's immediate neighbors include discrete-time policy gradient methods (DPOK, Diffusion RL Training) and continuous-time stochastic control formulations, both of which rely on traditional RL frameworks. Nearby branches such as Direct Preference Optimization Methods (Diffusion DPO, self-play approaches) and Variance Reduction techniques address overlapping challenges—sample efficiency and training stability—but through fundamentally different mechanisms. The group relative optimization leaf explicitly excludes pairwise-only methods and non-group-based approaches, positioning DGPO as a distinct alternative that aggregates feedback across sample batches rather than individual comparisons or gradient-based updates.

Among the three contributions analyzed, the core DGPO algorithm examined ten candidates with zero refutations, suggesting novelty within the limited search scope. The advantage-based weighting strategy examined two candidates and found one refutable match, indicating some overlap with prior work on group preference weighting. The timestep clip strategy also examined ten candidates with no refutations. Overall, the analysis covered twenty-two total candidates from semantic search and citation expansion, not an exhaustive literature review. The advantage weighting component appears to have the most substantial prior work among the contributions examined.

Based on the limited search scope of twenty-two candidates, DGPO appears to introduce a novel algorithmic framework within the sparse group-relative optimization direction. The core algorithm and timestep clipping show no clear refutations among examined papers, while the advantage weighting has identifiable precedent. The taxonomy context suggests this work addresses a recognized gap—efficient deterministic sampling for group preferences—in a relatively underdeveloped research area, though the search scope does not cover all potentially relevant prior work in broader RL or diffusion literature.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 1

Research Landscape Overview

Core task: Reinforcement learning for aligning diffusion models with human preferences. The field has organized itself into several major branches that reflect different methodological philosophies and application contexts. Policy Gradient and Online RL Methods (including works like DPOK[3] and Diffusion RL Training[1]) emphasize iterative policy updates through gradient-based optimization, often requiring online interaction with reward models. Direct Preference Optimization Methods (such as Diffusion DPO[15]) bypass explicit reward modeling by learning directly from pairwise comparisons, trading sample efficiency for computational simplicity. Reward Modeling and Feedback Integration focuses on constructing and refining reward signals from human annotations, while Multi-Objective and Personalized Alignment addresses the challenge of balancing diverse user preferences. Domain-Specific Applications tailor these techniques to particular modalities like text-to-image generation or speech synthesis, and Architectural and Component-Level Optimization explores modifications to model components such as text encoders. Offline and Data-Driven Approaches leverage pre-collected datasets, One-Step and Fast Sampling Alignment targets inference efficiency, and Theoretical Foundations and Analysis provides formal guarantees and convergence properties.

Within the policy gradient family, a particularly active line of work centers on group-relative optimization strategies that aggregate feedback across batches of samples to stabilize training. Direct Group Preference[0] sits squarely in this cluster alongside Mask-GRPO[19] and DanceGRPO[23], all of which refine how group-level comparisons are constructed and weighted. These methods contrast with approaches like Fine-tune Diffusion Preference[2] or Inference-Time Alignment[4], which either operate in a supervised fine-tuning regime or defer alignment adjustments to the sampling phase.
A recurring theme across branches is the tension between sample efficiency and computational overhead: online RL methods can adapt quickly but demand substantial interaction, while offline and direct preference techniques reduce this burden at the cost of potential distribution mismatch. Direct Group Preference[0] emphasizes leveraging group statistics to improve robustness, positioning itself as a middle ground that retains online adaptability while mitigating variance through collective feedback signals.

Claimed Contributions

Direct Group Preference Optimization (DGPO) algorithm

DGPO is a novel online reinforcement learning method for diffusion models that learns directly from group-level preferences without requiring a stochastic policy. This eliminates the need for inefficient SDE-based samplers and enables the use of efficient deterministic ODE samplers, resulting in significantly faster training while leveraging fine-grained relative preference information within groups.

Retrieved papers: 10

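The group-relative learning signal described above can be sketched in a few lines. This is an illustrative reading, not the paper's exact objective: `group_advantages` and `dgpo_objective` are hypothetical names, and the per-sample `losses` stand in for a denoising loss computed on rollouts from a deterministic ODE sampler, which this sketch does not model.

```python
# Hedged sketch of a DGPO-style group-relative objective (illustrative only,
# not the paper's actual formulation).
import statistics

def group_advantages(rewards, eps=1e-8):
    """Group-relative advantages: each reward is normalized against its own
    group's mean and standard deviation, so no learned critic is needed."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def dgpo_objective(rewards, losses):
    """Advantage-weighted objective over one group. Minimizing it lowers the
    denoising loss on high-advantage samples (pulling the model toward them)
    and raises it on low-advantage ones (pushing the model away), using only
    relative information within the group."""
    adv = group_advantages(rewards)
    return sum(a * l for a, l in zip(adv, losses)) / len(rewards)
```

Because the advantages are zero-mean within each group, a uniformly rewarded group contributes no net update, which matches the intuition that only relative preference information drives learning.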
Advantage-based weighting strategy for group preferences

The method introduces a weighting scheme based on normalized advantages that assigns larger weights to samples deviating more from the group average. This design ensures the sum of weights in positive and negative groups are equal, eliminating the intractable partition function while enabling the model to effectively learn relative preference relationships.

Retrieved papers: 2 (can refute)

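The balancing property described above can be illustrated with a small sketch (our reading of the description, not the paper's exact formula; `balanced_weights` is a hypothetical name): weights grow with the magnitude of the normalized advantage, and each side is rescaled so the positive and negative groups carry equal total weight, removing the dependence on a global normalizing constant.

```python
# Hedged sketch of an advantage-based weighting with balanced group masses
# (illustrative; the paper's actual weighting may differ).
def balanced_weights(advantages, eps=1e-12):
    """Split normalized advantages into positive and negative groups and
    rescale each side to unit total mass, so both groups contribute equally
    and no intractable partition function is needed."""
    pos = [max(a, 0.0) for a in advantages]    # weight of preferred samples
    neg = [max(-a, 0.0) for a in advantages]   # weight of dispreferred samples
    pos_sum, neg_sum = sum(pos), sum(neg)
    w_pos = [p / (pos_sum + eps) for p in pos]
    w_neg = [n / (neg_sum + eps) for n in neg]
    return w_pos, w_neg
```

Samples deviating more from the group average receive proportionally larger weights on their side, while the equal-mass rescaling keeps the positive and negative contributions in balance.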
Timestep Clip Strategy

A training technique that restricts timestep sampling to a range excluding very small timesteps, preventing the model from overfitting to artifacts such as blurriness in samples generated with few steps. This strategy enables effective training even when using computationally efficient few-step generation for online rollouts.

Retrieved papers: 10
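The timestep clip strategy amounts to restricting the sampling range of training timesteps. The sketch below assumes timesteps normalized to [0, 1] and an illustrative cutoff `t_min`; the paper's exact range and parameterization may differ.

```python
# Minimal sketch of a timestep clip: draw training timesteps uniformly from
# [t_min, 1] instead of [0, 1], skipping the very small timesteps where
# few-step rollouts exhibit artifacts such as blurriness.
# `t_min` is a hypothetical hyperparameter, not a value from the paper.
import random

def sample_clipped_timestep(t_min=0.1, rng=random):
    """Uniform timestep in [t_min, 1], excluding t < t_min."""
    return t_min + (1.0 - t_min) * rng.random()
```

Excluding the smallest timesteps keeps the training signal away from the regime where few-step generation is least faithful, which is what allows cheap few-step rollouts to be used for online training.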

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Direct Group Preference Optimization (DGPO) algorithm


Contribution

Advantage-based weighting strategy for group preferences


Contribution

Timestep Clip Strategy

