Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Offline Reinforcement Learning, Generative Models, Flow Matching
Abstract:

Generative models such as diffusion and flow matching offer expressive policies for offline reinforcement learning (RL) by capturing rich, multimodal action distributions, but their iterative sampling introduces high inference costs and training instability due to gradient propagation across sampling steps. We propose the Single-Step Completion Policy (SSCP), a generative policy trained with an augmented flow-matching objective to predict direct completion vectors from intermediate flow samples, enabling accurate, one-shot action generation. In an off-policy actor-critic framework, SSCP combines the expressiveness of generative models with the training and inference efficiency of unimodal policies, without requiring long backpropagation chains. Our method scales effectively to offline, offline-to-online, and online RL settings, offering substantial gains in speed and adaptability over diffusion-based baselines. We further extend SSCP to goal-conditioned RL (GCRL), enabling flat policies to exploit subgoal structures without explicit hierarchical inference. SSCP achieves strong results across standard offline RL and GCRL benchmarks, positioning it as a versatile, expressive, and efficient framework for deep RL and sequential decision-making.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Single-Step Completion Policy (SSCP), a flow-based generative policy that predicts direct completion vectors for one-shot action generation. It resides in the 'MeanFlow and Direct Velocity Prediction' leaf, which contains four papers total (including this one). This leaf sits within the broader 'Single-Step and Fast Inference Flow Policies' branch, indicating a moderately active research direction focused on reducing iterative sampling overhead. The taxonomy shows this is a well-defined niche within the larger flow-matching policy landscape, neither overcrowded nor entirely sparse.

The taxonomy reveals several neighboring research directions. The sibling leaf 'Consistency and Reflow-Based Acceleration' (three papers) pursues similar inference speedups through distillation rather than direct prediction. The parent branch connects to 'Reinforcement Learning Integration with Flow Matching', which houses multiple subcategories for policy gradients, actor-critic methods, and reward weighting. The 'Robotic Manipulation with Flow-Based Policies' branch explores application domains, while 'Flow Matching Foundations and Training Methods' provides theoretical underpinnings. SSCP bridges fast inference techniques with RL integration, positioning itself at the intersection of efficiency and expressiveness.

Among the 24 candidates examined, the analysis identified limited overlap with prior work. Contribution A (single-step completion) was compared against 10 candidates, of which 1 appears to refute it, suggesting some existing work on direct velocity prediction but no comprehensive coverage. Contribution B (off-policy actor-critic framework) was compared against 10 candidates, of which 2 are refutable, indicating moderate prior exploration of critic-based training for flow policies. Contribution C (goal-conditioned extension) was compared against 4 candidates, of which 1 is refutable, reflecting sparser prior work on hierarchical-to-flat distillation. The search scope of 24 papers represents a focused but not exhaustive literature review.

Based on the limited search scope, SSCP appears to occupy a recognizable position within an active research area. The taxonomy structure and sibling papers suggest the core idea of single-step flow completion has precedents, though the specific combination with actor-critic training and goal-conditioned extensions may offer incremental novelty. The analysis does not cover the full breadth of flow-matching policy literature, leaving open questions about how SSCP compares to methods outside the top-24 semantic matches.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
24 Contribution Candidate Papers Compared
4 Refutable Papers

Research Landscape Overview

Core task: efficient policy learning using flow-based single-step completion models. The field has rapidly organized around several complementary directions. Flow Matching Foundations and Training Methods establishes the mathematical and algorithmic underpinnings, including Local Flow Matching[1] and Riemannian Flow Matching[20]. Reinforcement Learning Integration with Flow Matching explores how to combine flow models with RL objectives, yielding methods such as Flow-GRPO[2], ReinFlow[6], and Reward-Weighted Flow Matching[9] that adapt flow training to reward signals. Single-Step and Fast Inference Flow Policies focuses on reducing computational overhead by predicting velocities or actions directly, as seen in FlowPolicy[3] and MP1[14]. Robotic Manipulation with Flow-Based Policies applies these techniques to dexterous control tasks, while Offline and Imitation Learning with Flow Matching addresses data-driven settings where expert demonstrations guide flow construction. Multi-Agent and Compositional Flow Policies extends the framework to coordinated behaviors, and Comparative Studies and Unified Frameworks synthesizes insights across these branches.

A particularly active line of work centers on accelerating inference without sacrificing expressiveness. Single-Step Flow Policy[0] exemplifies this trend by learning to predict completions in one forward pass, in contrast to multi-step integration approaches such as Multi-Step Integration Policy[36]. Nearby works such as DM1[23] and OMP[47] similarly emphasize direct velocity prediction or mean-flow estimation to bypass iterative sampling. These methods trade the flexibility of full diffusion trajectories for speed and simplicity, a trade-off also explored in VFP[8] and Streaming Flow Policy[7].

Single-Step Flow Policy[0] sits squarely within this cluster of fast-inference techniques, sharing the goal of efficient action generation with MP1[14] and DM1[23], yet it is distinguished by its specific approach to single-step completion. Open questions remain about how best to balance sample quality, computational cost, and the ability to incorporate online feedback across these diverse strategies.

Claimed Contributions

Single-Step Completion Policy (SSCP) for efficient generative policy learning

The authors propose SSCP, a generative policy trained with an augmented flow-matching objective to predict completion vectors from intermediate flow samples. This enables one-shot action generation, combining the expressiveness of generative models with the efficiency of unimodal policies without requiring iterative sampling or long backpropagation chains.

10 retrieved papers · Can Refute
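To make the claimed mechanism concrete, the sketch below gives one minimal, hypothetical reading of the completion idea (not the authors' implementation): a network conditioned on the state, an intermediate flow sample, and time regresses the vector that completes the sample to a dataset action along a linear interpolation path, and one-shot sampling applies that predicted vector once starting from pure noise. The names `CompletionPolicy`, `completion_loss`, and `sample_action`, and the MLP architecture, are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CompletionPolicy(nn.Module):
    """Hypothetical completion-vector network: given a state, an
    intermediate flow sample x_t, and time t, predict the vector that
    completes x_t to a clean action in a single step."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, x_t, t):
        return self.net(torch.cat([state, x_t, t], dim=-1))

def completion_loss(policy, state, action):
    """One possible 'augmented flow-matching' regression target: the
    completion vector (action - x_t) along a linear interpolation path
    between Gaussian noise and the dataset action."""
    noise = torch.randn_like(action)
    t = torch.rand(action.shape[0], 1)
    x_t = (1 - t) * noise + t * action   # intermediate flow sample
    target = action - x_t                # vector completing x_t to the action
    pred = policy(state, x_t, t)
    return ((pred - target) ** 2).mean()

def sample_action(policy, state, action_dim):
    """One-shot generation: start from pure noise at t = 0 and apply
    the predicted completion vector once (no iterative integration)."""
    x0 = torch.randn(state.shape[0], action_dim)
    t0 = torch.zeros(state.shape[0], 1)
    return x0 + policy(state, x0, t0)
```

Because the completion target points directly at the data endpoint rather than giving an instantaneous velocity, a single forward pass suffices at inference time under this reading.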
Off-policy actor-critic framework compatible with SSCP

The authors develop an off-policy actor-critic framework (SSCQL) that integrates SSCP with behavior-constrained policy gradient methods. This framework avoids backpropagation through time by using single-step completion, enabling stable training and efficient offline-to-online adaptation.

10 retrieved papers · Can Refute
Framework for distilling hierarchical behavior into flat policies (GC-SSCP)

The authors extend the single-step completion principle to goal-conditioned RL, creating GC-SSCP. This method distills hierarchical subgoal-exploiting behavior into a flat inference policy that uses shared architectures across reasoning levels, enabling efficient goal-reaching without explicit hierarchical inference.

4 retrieved papers · Can Refute
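The hierarchical-to-flat distillation described above can be sketched as follows, again as a hedged interpretation rather than the authors' method: a single goal-conditioned completion network (the shared architecture across reasoning levels) regresses completion targets implied by a subgoal-exploiting teacher's actions, while being conditioned only on the final goal, so no subgoal is produced at inference time. `GoalConditionedCompletion` and `distill_step` are hypothetical names.

```python
import torch
import torch.nn as nn

class GoalConditionedCompletion(nn.Module):
    """Hypothetical flat goal-conditioned completion network. The same
    architecture can be conditioned on a distant final goal or on a
    nearby subgoal, one way to realize a shared architecture across
    reasoning levels."""
    def __init__(self, state_dim, goal_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, goal, x_t, t):
        return self.net(torch.cat([state, goal, x_t, t], dim=-1))

def distill_step(student, state, goal, teacher_action):
    """Distillation sketch: the flat student regresses the completion
    vector toward an action produced by a subgoal-exploiting teacher,
    but is conditioned only on the final goal, so explicit hierarchical
    inference is not needed at deployment."""
    noise = torch.randn_like(teacher_action)
    t = torch.rand(teacher_action.shape[0], 1)
    x_t = (1 - t) * noise + t * teacher_action
    target = teacher_action - x_t
    pred = student(state, goal, x_t, t)
    return ((pred - target) ** 2).mean()
```

The teacher's actions could come from any subgoal-conditioned policy; the sketch only assumes they are available as regression targets.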

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Single-Step Completion Policy (SSCP) for efficient generative policy learning


Contribution

Off-policy actor-critic framework compatible with SSCP


Contribution

Framework for distilling hierarchical behavior into flat policies (GC-SSCP)
