One-Step Flow Q-Learning: Addressing the Diffusion Policy Bottleneck in Offline Reinforcement Learning

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Reinforcement Learning, Diffusion Model, Flow Matching, Offline Reinforcement Learning
Abstract:

Diffusion Q-Learning (DQL) has established diffusion policies as a high-performing paradigm for offline reinforcement learning, but its reliance on multi-step denoising for action generation renders both training and inference slow and fragile. Existing efforts to accelerate DQL toward one-step denoising typically rely on auxiliary modules or policy distillation, sacrificing either simplicity or performance. It remains unclear whether a one-step policy can be trained directly without such trade-offs. To this end, we introduce One-Step Flow Q-Learning (OFQL), a novel framework that enables effective one-step action generation during both training and inference, without auxiliary modules or distillation. OFQL reformulates the DQL policy within the Flow Matching (FM) paradigm but departs from conventional FM by learning an average velocity field that directly supports accurate one-step action generation. This design removes the need for multi-step denoising and backpropagation-through-time updates, resulting in substantially faster and more robust learning. Extensive experiments on the D4RL benchmark show that OFQL, despite generating actions in a single step, not only significantly reduces computation during both training and inference but also outperforms multi-step DQL by a large margin. Furthermore, OFQL surpasses all other baselines, achieving state-of-the-art performance in D4RL.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes One-Step Flow Q-Learning (OFQL), which reformulates diffusion Q-learning within the flow matching paradigm to enable single-step action generation without auxiliary modules or distillation. It resides in the 'Flow Matching-Based One-Step Policies' leaf, which contains five papers including the original work. This leaf is part of the broader 'One-Step Action Generation Methods' branch, indicating a moderately active research direction focused on eliminating iterative denoising. The taxonomy shows twenty-seven total papers across multiple branches, suggesting that one-step generation is a significant but not dominant theme within the field.

The taxonomy reveals that OFQL's leaf sits alongside 'Consistency Distillation for Acceleration' (three papers) and 'Unified Generative Policy Frameworks' (one paper) within the one-step generation category. Neighboring branches include 'Multi-Step Diffusion Policy Methods' with sub-areas for guidance-based optimization and modular training, as well as 'World Model and Latent Space Methods' that integrate diffusion with learned dynamics. The scope note for OFQL's leaf explicitly excludes diffusion-based methods and consistency distillation, positioning flow matching as a distinct mathematical approach. This structural separation suggests the paper targets a specific methodological niche rather than competing directly with the larger multi-step diffusion community.

Among twenty-one candidates examined, seven refutable pairs were identified across the three contributions. For the core OFQL framework, nine candidates were examined, three of which appear to provide overlapping prior work; for the average velocity field learning contribution, ten candidates were examined, with two potential refutations. The elimination of multi-step denoising was checked against only two candidates, both flagged as refutable. These statistics indicate that, within the limited search scope, each contribution faces at least some prior-work overlap, though the majority of examined candidates (fourteen of twenty-one) were non-refutable or unclear. The relatively small candidate pool means the analysis captures top semantic matches rather than exhaustive coverage.

Given the limited search scope of twenty-one candidates, the analysis suggests moderate novelty concerns primarily around the elimination of multi-step denoising, where both examined papers appeared relevant. The flow matching framework and velocity field learning show more mixed signals, with most candidates non-refutable. The taxonomy context indicates OFQL occupies a recognized but not overcrowded research direction, though the sibling papers in the same leaf warrant careful comparison to establish incremental contributions beyond existing flow-based one-step approaches.

Taxonomy

Core-task Taxonomy Papers: 27
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 7

Research Landscape Overview

Core task: Accelerating diffusion-based offline reinforcement learning with one-step action generation. The field has evolved around a central tension between expressive multi-step diffusion policies and the computational cost of iterative sampling at deployment.

The taxonomy reflects this divide through several main branches. One-Step Action Generation Methods seek to distill or directly learn policies that produce actions in a single forward pass, often via consistency models, flow matching, or reward-aware distillation techniques such as Reward-Aware Consistency Distillation[1] and Flow Q-Learning[10]. Multi-Step Diffusion Policy Methods retain the iterative denoising framework but explore efficiency gains through streaming architectures like Streaming Diffusion Policy[19], modular training schemes, or trust-region constraints as in Diffusion Trust Region[12]. Specialized Application Domains address settings such as multi-agent coordination (Multi-agent Flow Matching[23]) or safety-critical tasks (Safe Feasibility-Guided Diffusion[6]), while World Model and Latent Space Methods integrate diffusion with learned dynamics or hierarchical skill representations. Theoretical and Survey Contributions provide broader perspectives on the landscape, as seen in Diffusion Model-Based RL Review[26].

A particularly active line of work centers on flow matching-based one-step policies, which leverage continuous normalizing flows to bypass multi-step sampling while preserving expressiveness. One-Step Flow Q-Learning[0] sits squarely within this cluster, aiming to combine the benefits of flow-based generation with value-guided action selection in a single inference step. This contrasts with nearby approaches like Flow-Based Single-Step Completion[8], which may emphasize trajectory completion over action-level Q-learning, and One-Step Generative MeanFlow[24], which explores mean-field approximations for generative policies. Compared to consistency-based methods such as One-step Diffusion Policy[5] or reward-aware distillation schemes like Reward-Aware Consistency Distillation[1], flow matching offers a distinct mathematical framework that can simplify training dynamics.

The central open question across these branches remains how to balance sample quality, computational speed, and the ability to incorporate value functions or safety constraints without sacrificing the multimodal expressiveness that originally motivated diffusion models in offline RL.

Claimed Contributions

One-Step Flow Q-Learning (OFQL) framework

The authors propose OFQL, a new offline RL framework that reformulates Diffusion Q-Learning within the Flow Matching paradigm. Unlike prior methods, OFQL achieves efficient one-step action generation without requiring auxiliary models, policy distillation, or multi-stage training procedures.

9 retrieved papers
Can Refute
Average velocity field learning for one-step generation

The authors introduce a novel approach that learns an average velocity field instead of the conventional marginal velocity field used in Flow Matching. This design enables accurate direct action prediction from a single step, eliminating the need for iterative denoising and curved trajectory approximations.

10 retrieved papers
Can Refute
Elimination of multi-step denoising and BPTT in policy learning

By adopting the average velocity field formulation, OFQL removes the computational bottlenecks of multi-step denoising chains and backpropagation through time (BPTT) that plague diffusion-based policies. This results in faster training, more stable optimization, and improved inference efficiency.

2 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

One-Step Flow Q-Learning (OFQL) framework

The authors propose OFQL, a new offline RL framework that reformulates Diffusion Q-Learning within the Flow Matching paradigm. Unlike prior methods, OFQL achieves efficient one-step action generation without requiring auxiliary models, policy distillation, or multi-stage training procedures.
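As a rough illustration of how such a framework could couple flow-matching regression with value maximization in a DQL-style objective, here is a minimal NumPy sketch. All names, shapes, and the toy linear "networks" (u_theta, q_phi, alpha) are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and linear "networks"; every name here is illustrative.
obs_dim, act_dim = 4, 2
W_u = rng.normal(size=(obs_dim + act_dim + 1, act_dim)) * 0.1  # velocity net
W_q = rng.normal(size=(obs_dim + act_dim, 1)) * 0.1            # toy critic

def u_theta(state, a_t, t):
    """Average-velocity network: (state, noisy action, time) -> velocity."""
    return np.concatenate([state, a_t, [t]]) @ W_u

def q_phi(state, action):
    """Toy critic Q(s, a)."""
    return (np.concatenate([state, action]) @ W_q).item()

def ofql_style_loss(state, action, alpha=1.0):
    """DQL-style objective rebuilt around a one-step flow policy:
    flow-matching regression on the dataset action, plus Q maximization."""
    noise = rng.normal(size=act_dim)          # a_1 ~ N(0, I)
    t = rng.uniform()                         # random interpolation time
    a_t = (1.0 - t) * action + t * noise      # linear FM path
    target_v = noise - action                 # straight-line velocity target
    bc_loss = np.mean((u_theta(state, a_t, t) - target_v) ** 2)

    # One-step action generation: a_hat = a_1 - u(a_1, t=1); no denoising loop.
    a_hat = noise - u_theta(state, noise, 1.0)
    return bc_loss - alpha * q_phi(state, a_hat)

state = rng.normal(size=obs_dim)
action = rng.normal(size=act_dim)
loss = ofql_style_loss(state, action)
print(loss)  # scalar training loss for one transition
```

The sketch shows the structural point the contribution makes: because the policy's action comes from a single network call, the Q-maximization term does not require unrolling a sampling chain.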

Contribution

Average velocity field learning for one-step generation

The authors introduce a novel approach that learns an average velocity field instead of the conventional marginal velocity field used in Flow Matching. This design enables accurate direct action prediction from a single step, eliminating the need for iterative denoising and curved trajectory approximations.
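The marginal-versus-average distinction can be made precise with a short derivation. The symbols and the noise-at-t=1 convention below are illustrative (conventions vary across papers); the average-velocity identity follows the MeanFlow-style formulation the contribution describes:

```latex
% Linear interpolation path (noise at t = 1, data at t = 0; conventions vary):
%   a_t = (1 - t)\, a_0 + t\, a_1, \qquad
%   v(a_t, t) = \mathbb{E}[\, a_1 - a_0 \mid a_t \,]  % marginal velocity
% Average velocity over an interval [r, t]:
u(a_t, r, t) \;=\; \frac{1}{t - r} \int_r^{t} v(a_s, s)\, ds
% One-step generation from pure noise a_1, taking r = 0, t = 1:
a_0 \;=\; a_1 - (1 - 0)\, u(a_1, 0, 1) \;=\; a_1 - u(a_1, 0, 1)
```

Because u already integrates the instantaneous velocity over the whole interval, a single evaluation yields the full displacement from noise to action, regardless of how curved the underlying marginal trajectory is.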

Contribution

Elimination of multi-step denoising and BPTT in policy learning

By adopting the average velocity field formulation, OFQL removes the computational bottlenecks of multi-step denoising chains and backpropagation through time (BPTT) that plague diffusion-based policies. This results in faster training, more stable optimization, and improved inference efficiency.
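The computational contrast this contribution claims can be sketched concretely: iterative samplers make one network call per denoising step, and a Q-maximizing policy trained through that loop must backpropagate through every step, whereas an average-velocity policy needs a single call. The linear `velocity` map below is a purely illustrative stand-in for a learned network.

```python
import numpy as np

rng = np.random.default_rng(1)
act_dim = 2
# Stand-in for a learned velocity network (purely illustrative linear map).
W = rng.normal(size=(act_dim + 1, act_dim)) * 0.1

def velocity(a, t):
    return np.concatenate([a, [t]]) @ W

def multi_step_sample(n_steps=10):
    """Conventional FM/diffusion-style sampling: n_steps network calls.
    Training a Q-maximizing policy through this loop requires gradients
    through every step (backpropagation through time)."""
    a = rng.normal(size=act_dim)        # start from noise at t = 1
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = 1.0 - k * dt
        a = a - dt * velocity(a, t)     # Euler step toward t = 0
    return a

def one_step_sample():
    """Average-velocity sampling: one network call, no unrolled chain."""
    a1 = rng.normal(size=act_dim)
    return a1 - velocity(a1, 1.0)       # a_0 = a_1 - u(a_1, 0, 1)

print(multi_step_sample().shape, one_step_sample().shape)
```

Both samplers return an action of the same shape; the difference is the depth of the computation graph the critic's gradient must traverse during training.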