DiffusionNFT: Online Diffusion Reinforcement with Forward Process

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Diffusion Models, Reinforcement Learning, Flow Matching
Abstract:

Online reinforcement learning (RL) has been central to post-training language models, but its extension to diffusion models remains challenging due to intractable likelihoods. Recent works discretize the reverse sampling process to enable GRPO-style training, yet they inherit fundamental drawbacks. These include solver restrictions, forward–reverse inconsistency, and complicated integration with classifier-free guidance (CFG). We introduce Diffusion Negative-aware FineTuning (DiffusionNFT), a new online RL paradigm that optimizes diffusion models directly on the forward process via flow matching. DiffusionNFT contrasts positive and negative generations to define an implicit policy improvement direction, naturally incorporating reinforcement signals into the supervised learning objective. This formulation enables training with arbitrary black-box solvers, eliminates the need for likelihood estimation, and requires only clean images rather than sampling trajectories for policy optimization. DiffusionNFT is up to 25× more efficient than FlowGRPO in head-to-head comparisons, while being CFG-free. For instance, DiffusionNFT improves the GenEval score from 0.24 to 0.98 within 1k steps, while FlowGRPO achieves 0.95 with over 5k steps and the additional use of CFG. By leveraging multiple reward models, DiffusionNFT significantly boosts the performance of SD3.5-Medium in every benchmark tested.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and claimed contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. The results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 1

Research Landscape Overview

Core task: online reinforcement learning for diffusion models. This emerging field explores how to adapt and optimize diffusion-based generative models through direct interaction with reward signals or feedback mechanisms. The taxonomy reveals several major branches. Policy Gradient Methods for Diffusion Model Fine-Tuning focuses on adapting standard RL algorithms such as PPO to the unique structure of diffusion processes, as in DPOK[3] and Large-scale RL Diffusion[4]. Flow Matching and Forward Process Optimization investigates alternative parameterizations and training objectives that simplify or accelerate learning, while Diffusion as Generative Components in RL Systems examines how diffusion models serve as policy representations or world models within broader RL architectures. Application-Specific Diffusion RL targets domains such as text-to-image generation (RL Text-to-Image[2]), robotics, and autonomous systems, whereas Theoretical Foundations and Algorithmic Innovations addresses convergence guarantees, sample efficiency, and novel algorithmic designs. Finally, Offline-to-Online and Hybrid Learning Paradigms bridges pre-trained diffusion models with online fine-tuning strategies, balancing data efficiency and exploration.

A particularly active line of work centers on sample-efficient fine-tuning: methods such as Feedback Efficient Finetuning[5] and Human-Feedback Efficient[8] aim to minimize the number of reward queries needed to align diffusion outputs with human preferences or task objectives. A contrasting direction emphasizes scalability and robustness, with studies such as Efficient Online Diffusion[1] and RL Diffusion Tutorial[6] providing practical frameworks for large-scale deployment.

DiffusionNFT[0] sits within the Flow Matching and Forward Process Optimization branch, applying reinforcement learning directly to the forward diffusion process rather than solely to the reverse denoising steps. This distinguishes it from reverse-process methods such as DPOK[3] and aligns it more closely with forward-process innovations, offering a complementary perspective on where and how RL signals can be injected into the diffusion pipeline to improve generation quality and task alignment.

Claimed Contributions

Diffusion Negative-aware FineTuning (DiffusionNFT) paradigm

The authors propose DiffusionNFT, a novel online reinforcement learning approach for diffusion models that operates on the forward diffusion process rather than the reverse process. It contrasts positive and negative generations to define an implicit policy improvement direction, naturally incorporating reinforcement signals into the supervised learning objective without requiring likelihood estimation.

10 retrieved papers
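As a concrete illustration of the contrastive idea, the sketch below shows one plausible shape of a negative-aware flow-matching loss under a rectified-flow parameterization. This is a hypothetical reconstruction, not the paper's actual objective: the function name, the hinge-with-margin treatment of negatives, and the `beta` weight are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def nft_flow_matching_loss(v_theta, x0, is_positive, beta=1.0, margin=1.0):
    """Hypothetical negative-aware flow-matching loss (illustrative sketch only).

    x0:          clean samples from the current policy, shape (B, C, H, W)
    is_positive: boolean mask, True where the reward model preferred the sample
    """
    B = x0.shape[0]
    t = torch.rand(B, 1, 1, 1, device=x0.device)   # random flow times in (0, 1)
    noise = torch.randn_like(x0)
    x_t = (1 - t) * x0 + t * noise                 # forward (noising) process
    v_target = noise - x0                          # rectified-flow velocity target
    err = ((v_theta(x_t, t) - v_target) ** 2).mean(dim=(1, 2, 3))  # per-sample error
    # Positives: ordinary flow matching toward their velocity target.
    pos_loss = err[is_positive].mean() if is_positive.any() else err.sum() * 0
    # Negatives: push the model's velocity away from the negative samples'
    # target, but only up to a margin so the objective stays bounded below.
    neg = ~is_positive
    neg_loss = F.relu(margin - err[neg]).mean() if neg.any() else err.sum() * 0
    return pos_loss + beta * neg_loss
```

Note that only clean images and a per-sample preference signal enter the loss: no trajectories, no likelihoods, matching the supervised character of the formulation described above.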
Forward-process RL formulation with practical benefits

The forward-process formulation enables training with any black-box solvers (not restricted to first-order SDE samplers), requires only clean images rather than full sampling trajectories for optimization, maintains compatibility with standard diffusion training pipelines, and naturally supports off-policy learning without importance sampling.

4 retrieved papers
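The practical consequence is that data collection looks like ordinary supervised dataset construction. The sketch below is a hypothetical rollout-collection step (`sample_fn`, `reward_fn`, and the group-normalized advantage are illustrative assumptions, not the paper's code): the sampler is treated as a black box, and only final clean images with scalar rewards are retained.

```python
import torch

def collect_round(sample_fn, reward_fn, buffer, prompts):
    """One data-collection round for forward-process RL (illustrative sketch).

    The sampler is a black box: any solver, any order, with or without CFG.
    Only final clean images and scalar rewards are stored -- never sampling
    trajectories or per-step likelihoods -- so stale samples remain usable
    (off-policy) and the buffer looks like an ordinary supervised dataset.
    """
    with torch.no_grad():
        images = sample_fn(prompts)               # final clean samples only
        rewards = reward_fn(images, prompts)      # one scalar reward per image
    # Group-normalized reward signal (an assumption here, echoing GRPO-style
    # normalization) that a training step can consume like a sample weight.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    buffer.extend(zip(images, adv, prompts))      # plain (image, signal, prompt) triples
    return buffer
```

Because nothing solver-specific is stored, swapping in a higher-order ODE sampler or an entirely different backend changes only `sample_fn`, not the training side.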
Implicit parameterization technique for reinforcement guidance

Instead of learning a separate guidance model and employing guided sampling at inference, the method uses an implicit parameterization that directly integrates reinforcement guidance into a single policy model. This allows continuous RL on one model and eliminates the need for combining multiple models during sampling.

10 retrieved papers
Can Refute
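To see what the implicit parameterization buys at inference time, the toy sketch below contrasts two-model guided sampling with a single folded velocity field. It is a schematic illustration under a plain Euler ODE solver, not the paper's implementation; the point is that once a single network `v_theta` is trained to equal the guided field, sampling needs no model mixing.

```python
import torch

def euler_sample(v, x, steps=10):
    # Integrate dx/dt = v(x, t) from t = 1 (noise) back toward t = 0 (data).
    dt = 1.0 / steps
    t = 1.0
    for _ in range(steps):
        x = x - dt * v(x, torch.full((x.shape[0],), t))
        t -= dt
    return x

def euler_sample_guided(v_base, v_guide, x, w=2.0, steps=10):
    # Conventional guided sampling: two networks combined at every solver step.
    combo = lambda xx, tt: v_base(xx, tt) + w * (v_guide(xx, tt) - v_base(xx, tt))
    return euler_sample(combo, x, steps)

# With an implicit parameterization, a single fine-tuned model v_theta is
# trained so that its output already equals the guided field; inference is
# then just euler_sample(v_theta, x) -- one network, no mixing at sampling
# time, and the same model can keep being fine-tuned in further RL rounds.
```

By construction, sampling with the folded single field reproduces the two-model guided trajectory exactly, which is what makes "continuous RL on one model" possible.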

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated: a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution
