Flow Matching Policy Gradients

ICLR 2026 Conference Submission
Anonymous Authors
Abstract:

Flow-based generative models, including diffusion models, excel at modeling continuous distributions in high-dimensional spaces. In this work, we introduce Flow Policy Optimization (FPO), a simple on-policy reinforcement learning algorithm that brings flow matching into the policy gradient framework. FPO casts policy optimization as maximizing an advantage-weighted ratio computed from the conditional flow matching loss, in a manner compatible with the popular PPO-clip framework. It sidesteps the need for exact likelihood computation while preserving the generative capabilities of flow-based models. Unlike prior approaches for diffusion-based reinforcement learning that bind training to a specific sampling method, FPO is agnostic to the choice of diffusion or flow integration at both training and inference time. We show that FPO can train diffusion-style policies from scratch in a variety of continuous control tasks. We find that flow-based models can capture multimodal action distributions and achieve higher performance than Gaussian policies, particularly in under-conditioned settings.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Flow Policy Optimization (FPO), an on-policy reinforcement learning algorithm that integrates flow matching into the policy gradient framework through advantage-weighted ratio objectives compatible with PPO-clip. It resides in the 'Policy Gradient Methods for Flow Matching' leaf, which contains five papers total including this work. This leaf sits within the broader 'Online RL Fine-Tuning of Flow-Based Policies' branch, indicating a moderately active research direction focused on adapting flow-based generative models through environment interaction. The taxonomy reveals this is neither a highly crowded nor sparse area, with sibling leaves exploring actor-critic frameworks, noise injection strategies, and vision-language-action model fine-tuning.

The taxonomy structure shows neighboring research directions that contextualize FPO's positioning. Adjacent leaves include actor-critic frameworks that employ separate value networks for temporal-difference learning, and noise injection strategies that convert deterministic flows to stochastic processes for exploration. The broader taxonomy also reveals parallel offline RL branches using reward-weighting or Q-learning with flow policies, and imitation learning approaches that bypass explicit reward optimization. FPO's scope note emphasizes advantage-weighted objectives and direct gradient estimation, explicitly excluding actor-critic methods with separate value networks and reward-weighted offline approaches, suggesting it occupies a distinct methodological niche within the online RL landscape.

Among the twenty-six candidates examined across three contributions, the analysis reveals mixed novelty signals. For the core FPO algorithm, ten candidates were examined and one appears to provide overlapping prior work, suggesting some methodological precedent exists within this limited search scope. For the advantage-weighted flow matching ratio, six candidates were examined and two could potentially refute the claim, indicating that this specific formulation may have closer antecedents. For the sampling-agnostic framework, ten candidates were examined and none clearly refute it, suggesting this aspect may represent the more distinctive contribution. These statistics reflect a targeted semantic search rather than exhaustive coverage, meaning additional relevant work may exist beyond the examined set.

Based on the limited search scope of twenty-six semantically similar papers, FPO appears to offer incremental refinements within an established research direction rather than opening entirely new territory. The sampling-agnostic training and inference framework shows the strongest novelty signal, while the core algorithm and advantage-weighting formulation have identifiable precedents among examined candidates. The taxonomy positioning in a moderately populated leaf with four sibling papers suggests the work contributes to an active but not saturated research area, though definitive assessment would require broader literature coverage beyond top-K semantic matches.

Taxonomy

Core-task Taxonomy Papers: 46
Claimed Contributions: 3
Contribution Candidate Papers Compared: 26
Refutable Papers: 3

Research Landscape Overview

Core task: Reinforcement learning for flow-based generative policies. This emerging field combines continuous normalizing flows and diffusion-style generative models with reinforcement learning to produce expressive, multimodal policies. The taxonomy reveals several major branches: online RL fine-tuning methods that adapt pre-trained flow policies via policy gradients or actor-critic schemes; offline RL approaches leveraging flow models for conservative or distributional value learning; imitation learning and behavior cloning frameworks that use flows to capture expert demonstrations; generative flow networks (GFlowNets) designed for compositional or discrete generation tasks; domain-specific applications ranging from robotics to molecular design; and architectural innovations addressing training stability and efficiency.

Works such as RL Flow Matching[1] and ReinFlow[2] illustrate how policy gradient techniques can be tailored to flow matching objectives, while GFlowNet-focused studies like GFlowNet Entropy[14] and GFlowNet Policy Gradients[23] explore credit assignment in structured generation. Meanwhile, offline methods such as Diffusion Q Learning[38] and Conservative Latent Flow[29] emphasize safe policy improvement from static datasets.

A particularly active line of research centers on online fine-tuning of flow-based policies using policy gradient methods, where the challenge is to balance exploration with the computational cost of sampling from iterative flow models. Flow Matching Policy[0] sits squarely in this branch, focusing on policy gradient optimization for flow matching architectures. It shares methodological kinship with RL Flow Matching[1] and Flow Policy Online[3], which similarly adapt gradient-based RL to continuous generative policies, but differs in its specific treatment of the flow matching objective and reward integration.
Nearby works like Flow GRPO[11] and Fine Grained GRPO[18] explore group-relative policy optimization variants, trading off sample efficiency against gradient variance. In contrast, single-step distillation approaches such as Flow Single Step[12] and One Step MeanFlow[36] sacrifice some expressiveness for faster inference, highlighting a recurring trade-off between policy flexibility and computational overhead. Understanding where Flow Matching Policy[0] falls within this spectrum—balancing iterative refinement with gradient stability—helps clarify its contributions relative to these closely related efforts.

Claimed Contributions

Flow Policy Optimization (FPO) algorithm

FPO is a policy gradient method that optimizes flow-based generative models by maximizing an advantage-weighted ratio computed from the conditional flow matching loss. It avoids exact likelihood computation while remaining compatible with PPO-style training and is agnostic to the choice of sampling method during both training and inference.

10 retrieved papers · Can Refute
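The FPO update described above can be sketched as a minimal NumPy example. The one-dimensional toy velocity model, the linear interpolation path, and all names (`cfm_loss`, `fpo_surrogate`, `clip_eps`) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def cfm_loss(theta, x1, noise, t):
    """Per-sample conditional flow matching loss for a toy linear model.

    The model predicts velocity v_theta(x_t) = theta * x_t; for the linear
    interpolation path x_t = (1 - t) * noise + t * x1, the conditional
    velocity target is (x1 - noise).
    """
    x_t = (1.0 - t) * noise + t * x1
    target = x1 - noise
    return (theta * x_t - target) ** 2

def fpo_surrogate(theta, theta_old, x1, noise, t, advantage, clip_eps=0.2):
    """Clipped FPO-style surrogate: advantage-weighted CFM-loss ratio.

    PPO's likelihood ratio is replaced by the proxy
    r = exp(L_cfm(theta_old) - L_cfm(theta)), so decreasing the CFM loss
    on high-advantage samples increases the ratio, as in PPO-clip.
    """
    r = np.exp(cfm_loss(theta_old, x1, noise, t) - cfm_loss(theta, x1, noise, t))
    clipped = np.clip(r, 1.0 - clip_eps, 1.0 + clip_eps)
    return float(np.minimum(r * advantage, clipped * advantage).mean())
```

At `theta == theta_old` the proxy ratio is exactly 1 and the surrogate reduces to the mean advantage, mirroring PPO-clip's behavior at the start of each update epoch.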
Advantage-weighted flow matching ratio for policy updates

The method replaces the standard PPO likelihood ratio with a proxy ratio derived from flow matching losses, enabling policy updates without computing exact likelihoods. This ratio is shown to correspond to the ratio of evidence lower bounds under current and old policies.

6 retrieved papers · Can Refute
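The stated ELBO correspondence can be sketched as follows, with notation assumed for illustration ($\hat{\ell}_{\theta}$ a per-sample conditional flow matching loss estimate, $\hat{A}$ the advantage, $\epsilon$ the clip range):

```latex
r(\theta) = \exp\!\Big(\hat{\ell}_{\theta_{\text{old}}}(a \mid s) - \hat{\ell}_{\theta}(a \mid s)\Big)
\;\approx\; \frac{\exp\big(\mathrm{ELBO}_{\theta}(a \mid s)\big)}{\exp\big(\mathrm{ELBO}_{\theta_{\text{old}}}(a \mid s)\big)},
\qquad
J(\theta) = \mathbb{E}\Big[\min\big(r(\theta)\,\hat{A},\; \operatorname{clip}\!\big(r(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}\big)\Big].
```

Because the conditional flow matching loss is, up to $\theta$-independent constants, an estimator of the negative evidence lower bound, exponentiating the loss difference yields (approximately) the ratio of ELBOs that stands in for PPO's exact likelihood ratio.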
Sampling-agnostic training and inference framework

FPO treats the sampling procedure as a black box during rollouts, allowing flexible integration with any deterministic or stochastic sampling approach and any number of denoising steps. This contrasts with denoising MDP methods that require specific stochastic samplers.

10 retrieved papers
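The black-box view of sampling can be illustrated with a toy one-dimensional sampler; the function name, the Euler integrator, and the noise model are assumptions for illustration, not the paper's method:

```python
import numpy as np

def sample_action(velocity, n_steps=10, stochastic=False, noise_scale=0.1, seed=0):
    """Sample an action from a 1-D flow policy with an arbitrary integrator.

    `velocity(x, t)` is the learned velocity field (here, any callable).
    Step count and the deterministic/stochastic choice are free parameters:
    the policy update only consumes the final action, never the sampler's
    likelihood.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal()                          # x_0 drawn from the base Gaussian
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + velocity(x, i * dt) * dt      # deterministic Euler step
        if stochastic:                        # optional SDE-style perturbation
            x = x + noise_scale * np.sqrt(dt) * rng.normal()
    return x
```

Changing `n_steps` or toggling `stochastic` alters only the rollout; because the training objective never queries the sampler's likelihood, it is unchanged, which is the contrast with denoising-MDP methods that require a specific stochastic sampler.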

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
