DiffusionNFT: Online Diffusion Reinforcement with Forward Process

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Diffusion Models, Reinforcement Learning, Flow Matching
Abstract:

Online reinforcement learning (RL) has been central to post-training language models, but its extension to diffusion models remains challenging due to intractable likelihoods. Recent works discretize the reverse sampling process to enable GRPO-style training, yet they inherit fundamental drawbacks. These include solver restrictions, forward–reverse inconsistency, and complicated integration with classifier-free guidance (CFG). We introduce Diffusion Negative-aware FineTuning (DiffusionNFT), a new online RL paradigm that optimizes diffusion models directly on the forward process via flow matching. DiffusionNFT contrasts positive and negative generations to define an implicit policy improvement direction, naturally incorporating reinforcement signals into the supervised learning objective. This formulation enables training with arbitrary black-box solvers, eliminates the need for likelihood estimation, and requires only clean images rather than sampling trajectories for policy optimization. DiffusionNFT is up to 25× more efficient than FlowGRPO in head-to-head comparisons, while being CFG-free. For instance, DiffusionNFT improves the GenEval score from 0.24 to 0.98 within 1k steps, while FlowGRPO achieves 0.95 with over 5k steps and the additional use of CFG. By leveraging multiple reward models, DiffusionNFT significantly boosts the performance of SD3.5-Medium in every benchmark tested.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and claimed contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. The results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 1

Research Landscape Overview

Core task: online reinforcement learning for diffusion models. This emerging field explores how to adapt and optimize diffusion-based generative models through direct interaction with reward signals or feedback mechanisms. The taxonomy reveals several major branches. Policy Gradient Methods for Diffusion Model Fine-Tuning focuses on adapting standard RL algorithms such as PPO to the unique structure of diffusion processes, as in DPOK[3] and Large-scale RL Diffusion[4]. Flow Matching and Forward Process Optimization investigates alternative parameterizations and training objectives that simplify or accelerate learning, while Diffusion as Generative Components in RL Systems examines how diffusion models serve as policy representations or world models within broader RL architectures. Application-Specific Diffusion RL targets domains such as text-to-image generation (RL Text-to-Image[2]), robotics, and autonomous systems, whereas Theoretical Foundations and Algorithmic Innovations addresses convergence guarantees, sample efficiency, and novel algorithmic designs. Finally, Offline-to-Online and Hybrid Learning Paradigms bridges pre-trained diffusion models with online fine-tuning strategies, balancing data efficiency and exploration.

A particularly active line of work centers on sample-efficient fine-tuning: methods such as Feedback Efficient Finetuning[5] and Human-Feedback Efficient[8] aim to minimize the number of reward queries needed to align diffusion outputs with human preferences or task objectives. A contrasting direction emphasizes scalability and robustness, with studies such as Efficient Online Diffusion[1] and RL Diffusion Tutorial[6] providing practical frameworks for large-scale deployment.

DiffusionNFT[0] sits within the Flow Matching and Forward Process Optimization branch, applying reinforcement learning directly to the forward diffusion process rather than solely to the reverse denoising steps. This distinguishes it from reverse-process methods such as DPOK[3] and aligns it more closely with forward-process innovations, offering a complementary perspective on where and how RL signals can be injected into the diffusion pipeline to improve generation quality and task alignment.

Claimed Contributions

Diffusion Negative-aware FineTuning (DiffusionNFT) paradigm

The authors propose DiffusionNFT, a novel online reinforcement learning approach for diffusion models that operates on the forward diffusion process rather than the reverse process. It contrasts positive and negative generations to define an implicit policy improvement direction, naturally incorporating reinforcement signals into the supervised learning objective without requiring likelihood estimation.

10 retrieved papers
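As a concrete illustration of the contrastive idea, the sketch below shows one plausible shape of a negative-aware flow-matching loss under a rectified-flow parameterization. This is a hypothetical reconstruction, not the paper's actual objective: the function name, the hinge-with-margin treatment of negatives, and the `beta` weight are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def nft_flow_matching_loss(v_theta, x0, is_positive, beta=1.0, margin=1.0):
    """Hypothetical negative-aware flow-matching loss (illustrative sketch only).

    x0:          clean samples from the current policy, shape (B, C, H, W)
    is_positive: boolean mask, True where the reward model preferred the sample
    """
    B = x0.shape[0]
    t = torch.rand(B, 1, 1, 1, device=x0.device)   # random flow times in (0, 1)
    noise = torch.randn_like(x0)
    x_t = (1 - t) * x0 + t * noise                 # forward (noising) process
    v_target = noise - x0                          # rectified-flow velocity target
    err = ((v_theta(x_t, t) - v_target) ** 2).mean(dim=(1, 2, 3))  # per-sample error
    # Positives: ordinary flow matching toward their velocity target.
    pos_loss = err[is_positive].mean() if is_positive.any() else err.sum() * 0
    # Negatives: push the model's velocity away from the negative samples'
    # target, but only up to a margin so the objective stays bounded below.
    neg = ~is_positive
    neg_loss = F.relu(margin - err[neg]).mean() if neg.any() else err.sum() * 0
    return pos_loss + beta * neg_loss
```

Note that only clean images and a per-sample preference signal enter the loss: no trajectories, no likelihoods, matching the supervised character of the formulation described above.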
Forward-process RL formulation with practical benefits

The forward-process formulation enables training with any black-box solvers (not restricted to first-order SDE samplers), requires only clean images rather than full sampling trajectories for optimization, maintains compatibility with standard diffusion training pipelines, and naturally supports off-policy learning without importance sampling.

4 retrieved papers
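The practical consequence is that data collection looks like ordinary supervised dataset construction. The sketch below is a hypothetical rollout-collection step (`sample_fn`, `reward_fn`, and the group-normalized advantage are illustrative assumptions, not the paper's code): the sampler is treated as a black box, and only final clean images with scalar rewards are retained.

```python
import torch

def collect_round(sample_fn, reward_fn, buffer, prompts):
    """One data-collection round for forward-process RL (illustrative sketch).

    The sampler is a black box: any solver, any order, with or without CFG.
    Only final clean images and scalar rewards are stored -- never sampling
    trajectories or per-step likelihoods -- so stale samples remain usable
    (off-policy) and the buffer looks like an ordinary supervised dataset.
    """
    with torch.no_grad():
        images = sample_fn(prompts)               # final clean samples only
        rewards = reward_fn(images, prompts)      # one scalar reward per image
    # Group-normalized reward signal (an assumption here, echoing GRPO-style
    # normalization) that a training step can consume like a sample weight.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    buffer.extend(zip(images, adv, prompts))      # plain (image, signal, prompt) triples
    return buffer
```

Because nothing solver-specific is stored, swapping in a higher-order ODE sampler or an entirely different backend changes only `sample_fn`, not the training side.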
Implicit parameterization technique for reinforcement guidance

Instead of learning a separate guidance model and employing guided sampling at inference, the method uses an implicit parameterization that directly integrates reinforcement guidance into a single policy model. This allows continuous RL on one model and eliminates the need for combining multiple models during sampling.

10 retrieved papers
Can Refute
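To see what the implicit parameterization buys at inference time, the toy sketch below contrasts two-model guided sampling with a single folded velocity field. It is a schematic illustration under a plain Euler ODE solver, not the paper's implementation; the point is that once a single network `v_theta` is trained to equal the guided field, sampling needs no model mixing.

```python
import torch

def euler_sample(v, x, steps=10):
    # Integrate dx/dt = v(x, t) from t = 1 (noise) back toward t = 0 (data).
    dt = 1.0 / steps
    t = 1.0
    for _ in range(steps):
        x = x - dt * v(x, torch.full((x.shape[0],), t))
        t -= dt
    return x

def euler_sample_guided(v_base, v_guide, x, w=2.0, steps=10):
    # Conventional guided sampling: two networks combined at every solver step.
    combo = lambda xx, tt: v_base(xx, tt) + w * (v_guide(xx, tt) - v_base(xx, tt))
    return euler_sample(combo, x, steps)

# With an implicit parameterization, a single fine-tuned model v_theta is
# trained so that its output already equals the guided field; inference is
# then just euler_sample(v_theta, x) -- one network, no mixing at sampling
# time, and the same model can keep being fine-tuned in further RL rounds.
```

By construction, sampling with the folded single field reproduces the two-model guided trajectory exactly, which is what makes "continuous RL on one model" possible.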

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated: a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution
