One-Step Flow Q-Learning: Addressing the Diffusion Policy Bottleneck in Offline Reinforcement Learning

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: Reinforcement Learning, Diffusion Model, Flow Matching, Offline Reinforcement Learning
Abstract:

Diffusion Q-Learning (DQL) has established diffusion policies as a high-performing paradigm for offline reinforcement learning, but its reliance on multi-step denoising for action generation renders both training and inference slow and fragile. Existing efforts to accelerate DQL toward one-step denoising typically rely on auxiliary modules or policy distillation, sacrificing either simplicity or performance. It remains unclear whether a one-step policy can be trained directly without such trade-offs. To this end, we introduce One-Step Flow Q-Learning (OFQL), a novel framework that enables effective one-step action generation during both training and inference, without auxiliary modules or distillation. OFQL reformulates the DQL policy within the Flow Matching (FM) paradigm but departs from conventional FM by learning an average velocity field that directly supports accurate one-step action generation. This design removes the need for multi-step denoising and backpropagation-through-time updates, resulting in substantially faster and more robust learning. Extensive experiments on the D4RL benchmark show that OFQL, despite generating actions in a single step, not only significantly reduces computation during both training and inference but also outperforms multi-step DQL by a large margin. Furthermore, OFQL surpasses all other baselines, achieving state-of-the-art performance in D4RL.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes One-Step Flow Q-Learning (OFQL), which reformulates diffusion Q-learning within the flow matching paradigm to enable single-step action generation without auxiliary modules or distillation. It resides in the 'Flow Matching-Based One-Step Policies' leaf, which contains five papers including the original work. This leaf is part of the broader 'One-Step Action Generation Methods' branch, indicating a moderately active research direction focused on eliminating iterative denoising. The taxonomy shows twenty-seven total papers across multiple branches, suggesting that one-step generation is a significant but not dominant theme within the field.

The taxonomy reveals that OFQL's leaf sits alongside 'Consistency Distillation for Acceleration' (three papers) and 'Unified Generative Policy Frameworks' (one paper) within the one-step generation category. Neighboring branches include 'Multi-Step Diffusion Policy Methods' with sub-areas for guidance-based optimization and modular training, as well as 'World Model and Latent Space Methods' that integrate diffusion with learned dynamics. The scope note for OFQL's leaf explicitly excludes diffusion-based methods and consistency distillation, positioning flow matching as a distinct mathematical approach. This structural separation suggests the paper targets a specific methodological niche rather than competing directly with the larger multi-step diffusion community.

Among twenty-one candidates examined, seven refutable pairs were identified across the three contributions. For the core OFQL framework, nine candidates were examined, three of which appear to provide overlapping prior work; for the average velocity field learning contribution, ten candidates were examined, with two potential refutations. The elimination of multi-step denoising was checked against only two candidates, both flagged as refutable. These statistics indicate that, within the limited search scope, each contribution faces at least some prior-work overlap, though the majority of examined candidates (fourteen of twenty-one) were non-refutable or unclear. The relatively small candidate pool means the analysis captures top semantic matches rather than exhaustive coverage.

Given the limited search scope of twenty-one candidates, the analysis suggests moderate novelty concerns primarily around the elimination of multi-step denoising, where both examined papers appeared relevant. The flow matching framework and velocity field learning show more mixed signals, with most candidates non-refutable. The taxonomy context indicates OFQL occupies a recognized but not overcrowded research direction, though the sibling papers in the same leaf warrant careful comparison to establish incremental contributions beyond existing flow-based one-step approaches.

Taxonomy

Core-task Taxonomy Papers: 27
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 7

Research Landscape Overview

Core task: Accelerating diffusion-based offline reinforcement learning with one-step action generation. The field has evolved around a central tension between expressive multi-step diffusion policies and the computational cost of iterative sampling at deployment.

The taxonomy reflects this divide through several main branches. One-Step Action Generation Methods seek to distill or directly learn policies that produce actions in a single forward pass, often via consistency models, flow matching, or reward-aware distillation techniques such as Reward-Aware Consistency Distillation[1] and Flow Q-Learning[10]. Multi-Step Diffusion Policy Methods retain the iterative denoising framework but explore efficiency gains through streaming architectures like Streaming Diffusion Policy[19], modular training schemes, or trust-region constraints as in Diffusion Trust Region[12]. Specialized Application Domains address settings such as multi-agent coordination (Multi-agent Flow Matching[23]) or safety-critical tasks (Safe Feasibility-Guided Diffusion[6]), while World Model and Latent Space Methods integrate diffusion with learned dynamics or hierarchical skill representations. Theoretical and Survey Contributions provide broader perspectives on the landscape, as seen in Diffusion Model-Based RL Review[26].

A particularly active line of work centers on flow matching-based one-step policies, which leverage continuous normalizing flows to bypass multi-step sampling while preserving expressiveness. One-Step Flow Q-Learning[0] sits squarely within this cluster, aiming to combine the benefits of flow-based generation with value-guided action selection in a single inference step. This contrasts with nearby approaches like Flow-Based Single-Step Completion[8], which may emphasize trajectory completion over action-level Q-learning, and One-Step Generative MeanFlow[24], which explores mean-field approximations for generative policies. Compared to consistency-based methods such as One-step Diffusion Policy[5] or reward-aware distillation schemes like Reward-Aware Consistency Distillation[1], flow matching offers a distinct mathematical framework that can simplify training dynamics.

The central open question across these branches remains how to balance sample quality, computational speed, and the ability to incorporate value functions or safety constraints without sacrificing the multimodal expressiveness that originally motivated diffusion models in offline RL.

Claimed Contributions

One-Step Flow Q-Learning (OFQL) framework

The authors propose OFQL, a new offline RL framework that reformulates Diffusion Q-Learning within the Flow Matching paradigm. Unlike prior methods, OFQL achieves efficient one-step action generation without requiring auxiliary models, policy distillation, or multi-stage training procedures.

9 retrieved papers
Can Refute
Average velocity field learning for one-step generation

The authors introduce a novel approach that learns an average velocity field instead of the conventional marginal velocity field used in Flow Matching. This design enables accurate direct action prediction from a single step, eliminating the need for iterative denoising and curved trajectory approximations.

10 retrieved papers
Can Refute
Elimination of multi-step denoising and BPTT in policy learning

By adopting the average velocity field formulation, OFQL removes the computational bottlenecks of multi-step denoising chains and backpropagation through time (BPTT) that plague diffusion-based policies. This results in faster training, more stable optimization, and improved inference efficiency.

2 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

One-Step Flow Q-Learning (OFQL) framework

The authors propose OFQL, a new offline RL framework that reformulates Diffusion Q-Learning within the Flow Matching paradigm. Unlike prior methods, OFQL achieves efficient one-step action generation without requiring auxiliary models, policy distillation, or multi-stage training procedures.
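As a rough illustration of how such a framework could couple flow-matching regression with value maximization in a DQL-style objective, here is a minimal NumPy sketch. All names, shapes, and the toy linear "networks" (u_theta, q_phi, alpha) are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and linear "networks"; every name here is illustrative.
obs_dim, act_dim = 4, 2
W_u = rng.normal(size=(obs_dim + act_dim + 1, act_dim)) * 0.1  # velocity net
W_q = rng.normal(size=(obs_dim + act_dim, 1)) * 0.1            # toy critic

def u_theta(state, a_t, t):
    """Average-velocity network: (state, noisy action, time) -> velocity."""
    return np.concatenate([state, a_t, [t]]) @ W_u

def q_phi(state, action):
    """Toy critic Q(s, a)."""
    return (np.concatenate([state, action]) @ W_q).item()

def ofql_style_loss(state, action, alpha=1.0):
    """DQL-style objective rebuilt around a one-step flow policy:
    flow-matching regression on the dataset action, plus Q maximization."""
    noise = rng.normal(size=act_dim)          # a_1 ~ N(0, I)
    t = rng.uniform()                         # random interpolation time
    a_t = (1.0 - t) * action + t * noise      # linear FM path
    target_v = noise - action                 # straight-line velocity target
    bc_loss = np.mean((u_theta(state, a_t, t) - target_v) ** 2)

    # One-step action generation: a_hat = a_1 - u(a_1, t=1); no denoising loop.
    a_hat = noise - u_theta(state, noise, 1.0)
    return bc_loss - alpha * q_phi(state, a_hat)

state = rng.normal(size=obs_dim)
action = rng.normal(size=act_dim)
loss = ofql_style_loss(state, action)
print(loss)  # scalar training loss for one transition
```

The sketch shows the structural point the contribution makes: because the policy's action comes from a single network call, the Q-maximization term does not require unrolling a sampling chain.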

Contribution

Average velocity field learning for one-step generation

The authors introduce a novel approach that learns an average velocity field instead of the conventional marginal velocity field used in Flow Matching. This design enables accurate direct action prediction from a single step, eliminating the need for iterative denoising and curved trajectory approximations.
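The marginal-versus-average distinction can be made precise with a short derivation. The symbols and the noise-at-t=1 convention below are illustrative (conventions vary across papers); the average-velocity identity follows the MeanFlow-style formulation the contribution describes:

```latex
% Linear interpolation path (noise at t = 1, data at t = 0; conventions vary):
%   a_t = (1 - t)\, a_0 + t\, a_1, \qquad
%   v(a_t, t) = \mathbb{E}[\, a_1 - a_0 \mid a_t \,]  % marginal velocity
% Average velocity over an interval [r, t]:
u(a_t, r, t) \;=\; \frac{1}{t - r} \int_r^{t} v(a_s, s)\, ds
% One-step generation from pure noise a_1, taking r = 0, t = 1:
a_0 \;=\; a_1 - (1 - 0)\, u(a_1, 0, 1) \;=\; a_1 - u(a_1, 0, 1)
```

Because u already integrates the instantaneous velocity over the whole interval, a single evaluation yields the full displacement from noise to action, regardless of how curved the underlying marginal trajectory is.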

Contribution

Elimination of multi-step denoising and BPTT in policy learning

By adopting the average velocity field formulation, OFQL removes the computational bottlenecks of multi-step denoising chains and backpropagation through time (BPTT) that plague diffusion-based policies. This results in faster training, more stable optimization, and improved inference efficiency.
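The computational contrast this contribution claims can be sketched concretely: iterative samplers make one network call per denoising step, and a Q-maximizing policy trained through that loop must backpropagate through every step, whereas an average-velocity policy needs a single call. The linear `velocity` map below is a purely illustrative stand-in for a learned network.

```python
import numpy as np

rng = np.random.default_rng(1)
act_dim = 2
# Stand-in for a learned velocity network (purely illustrative linear map).
W = rng.normal(size=(act_dim + 1, act_dim)) * 0.1

def velocity(a, t):
    return np.concatenate([a, [t]]) @ W

def multi_step_sample(n_steps=10):
    """Conventional FM/diffusion-style sampling: n_steps network calls.
    Training a Q-maximizing policy through this loop requires gradients
    through every step (backpropagation through time)."""
    a = rng.normal(size=act_dim)        # start from noise at t = 1
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = 1.0 - k * dt
        a = a - dt * velocity(a, t)     # Euler step toward t = 0
    return a

def one_step_sample():
    """Average-velocity sampling: one network call, no unrolled chain."""
    a1 = rng.normal(size=act_dim)
    return a1 - velocity(a1, 1.0)       # a_0 = a_1 - u(a_1, 0, 1)

print(multi_step_sample().shape, one_step_sample().shape)
```

Both samplers return an action of the same shape; the difference is the depth of the computation graph the critic's gradient must traverse during training.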