Flow-Based Single-Step Completion for Efficient and Expressive Policy Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Offline Reinforcement Learning, Generative Models, Flow Matching
Abstract:

Generative models such as diffusion and flow matching offer expressive policies for offline reinforcement learning (RL) by capturing rich, multimodal action distributions, but their iterative sampling introduces high inference costs and training instability due to gradient propagation across sampling steps. We propose the Single-Step Completion Policy (SSCP), a generative policy trained with an augmented flow-matching objective to predict direct completion vectors from intermediate flow samples, enabling accurate, one-shot action generation. In an off-policy actor-critic framework, SSCP combines the expressiveness of generative models with the training and inference efficiency of unimodal policies, without requiring long backpropagation chains. Our method scales effectively to offline, offline-to-online, and online RL settings, offering substantial gains in speed and adaptability over diffusion-based baselines. We further extend SSCP to goal-conditioned RL (GCRL), enabling flat policies to exploit subgoal structures without explicit hierarchical inference. SSCP achieves strong results across standard offline RL and GCRL benchmarks, positioning it as a versatile, expressive, and efficient framework for deep RL and sequential decision-making.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Single-Step Completion Policy (SSCP), a flow-based generative policy that predicts direct completion vectors for one-shot action generation. It resides in the 'MeanFlow and Direct Velocity Prediction' leaf, which contains four papers total (including this one). This leaf sits within the broader 'Single-Step and Fast Inference Flow Policies' branch, indicating a moderately active research direction focused on reducing iterative sampling overhead. The taxonomy shows this is a well-defined niche within the larger flow-matching policy landscape, neither overcrowded nor entirely sparse.

The taxonomy reveals several neighboring research directions. The sibling leaf 'Consistency and Reflow-Based Acceleration' (three papers) pursues similar inference speedups through distillation rather than direct prediction. The parent branch connects to 'Reinforcement Learning Integration with Flow Matching', which houses multiple subcategories for policy gradients, actor-critic methods, and reward weighting. The 'Robotic Manipulation with Flow-Based Policies' branch explores application domains, while 'Flow Matching Foundations and Training Methods' provides theoretical underpinnings. SSCP bridges fast inference techniques with RL integration, positioning itself at the intersection of efficiency and expressiveness.

Among the 24 candidates examined, the analysis identified limited overlap with prior work. Contribution A (single-step completion) was compared against 10 candidates, of which 1 appears to refute it, suggesting some existing work on direct velocity prediction but no comprehensive coverage. Contribution B (off-policy actor-critic framework) was compared against 10 candidates, of which 2 are refutable, indicating moderate prior exploration of critic-based training for flow policies. Contribution C (goal-conditioned extension) was compared against 4 candidates, of which 1 is refutable, reflecting sparser prior work on hierarchical-to-flat distillation. The search scope of 24 papers represents a focused but not exhaustive literature review.

Based on the limited search scope, SSCP appears to occupy a recognizable position within an active research area. The taxonomy structure and sibling papers suggest the core idea of single-step flow completion has precedents, though the specific combination with actor-critic training and goal-conditioned extensions may offer incremental novelty. The analysis does not cover the full breadth of flow-matching policy literature, leaving open questions about how SSCP compares to methods outside the top-24 semantic matches.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
24 Contribution Candidate Papers Compared
4 Refutable Papers

Research Landscape Overview

Core task: efficient policy learning using flow-based single-step completion models. The field has rapidly organized around several complementary directions. Flow Matching Foundations and Training Methods establishes the mathematical and algorithmic underpinnings, including Local Flow Matching[1] and Riemannian Flow Matching[20]. Reinforcement Learning Integration with Flow Matching explores how to combine flow models with RL objectives, yielding methods such as Flow-GRPO[2], ReinFlow[6], and Reward-Weighted Flow Matching[9] that adapt flow training to reward signals. Single-Step and Fast Inference Flow Policies focuses on reducing computational overhead by predicting velocities or actions directly, as seen in FlowPolicy[3] and MP1[14]. Robotic Manipulation with Flow-Based Policies applies these techniques to dexterous control tasks, while Offline and Imitation Learning with Flow Matching addresses data-driven settings where expert demonstrations guide flow construction. Multi-Agent and Compositional Flow Policies extends the framework to coordinated behaviors, and Comparative Studies and Unified Frameworks synthesizes insights across these branches.

A particularly active line of work centers on accelerating inference without sacrificing expressiveness. Single-Step Flow Policy[0] exemplifies this trend by learning to predict completions in one forward pass, in contrast to multi-step integration approaches such as Multi-Step Integration Policy[36]. Nearby works such as DM1[23] and OMP[47] similarly emphasize direct velocity prediction or mean-flow estimation to bypass iterative sampling. These methods trade the flexibility of full diffusion trajectories for speed and simplicity, a trade-off also explored in VFP[8] and Streaming Flow Policy[7].

Single-Step Flow Policy[0] sits squarely within this cluster of fast-inference techniques, sharing the goal of efficient action generation with MP1[14] and DM1[23], yet it is distinguished by its specific approach to single-step completion. Open questions remain about how best to balance sample quality, computational cost, and the ability to incorporate online feedback across these diverse strategies.

Claimed Contributions

Single-Step Completion Policy (SSCP) for efficient generative policy learning

The authors propose SSCP, a generative policy trained with an augmented flow-matching objective to predict completion vectors from intermediate flow samples. This enables one-shot action generation, combining the expressiveness of generative models with the efficiency of unimodal policies without requiring iterative sampling or long backpropagation chains.

10 retrieved papers · Can Refute
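To make the claimed mechanism concrete, the sketch below gives one minimal, hypothetical reading of the completion idea (not the authors' implementation): a network conditioned on the state, an intermediate flow sample, and time regresses the vector that completes the sample to a dataset action along a linear interpolation path, and one-shot sampling applies that predicted vector once starting from pure noise. The names `CompletionPolicy`, `completion_loss`, and `sample_action`, and the MLP architecture, are assumptions for illustration.

```python
import torch
import torch.nn as nn

class CompletionPolicy(nn.Module):
    """Hypothetical completion-vector network: given a state, an
    intermediate flow sample x_t, and time t, predict the vector that
    completes x_t to a clean action in a single step."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, x_t, t):
        return self.net(torch.cat([state, x_t, t], dim=-1))

def completion_loss(policy, state, action):
    """One possible 'augmented flow-matching' regression target: the
    completion vector (action - x_t) along a linear interpolation path
    between Gaussian noise and the dataset action."""
    noise = torch.randn_like(action)
    t = torch.rand(action.shape[0], 1)
    x_t = (1 - t) * noise + t * action   # intermediate flow sample
    target = action - x_t                # vector completing x_t to the action
    pred = policy(state, x_t, t)
    return ((pred - target) ** 2).mean()

def sample_action(policy, state, action_dim):
    """One-shot generation: start from pure noise at t = 0 and apply
    the predicted completion vector once (no iterative integration)."""
    x0 = torch.randn(state.shape[0], action_dim)
    t0 = torch.zeros(state.shape[0], 1)
    return x0 + policy(state, x0, t0)
```

Because the completion target points directly at the data endpoint rather than giving an instantaneous velocity, a single forward pass suffices at inference time under this reading.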
Off-policy actor-critic framework compatible with SSCP

The authors develop an off-policy actor-critic framework (SSCQL) that integrates SSCP with behavior-constrained policy gradient methods. This framework avoids backpropagation through time by using single-step completion, enabling stable training and efficient offline-to-online adaptation.

10 retrieved papers · Can Refute
Framework for distilling hierarchical behavior into flat policies (GC-SSCP)

The authors extend the single-step completion principle to goal-conditioned RL, creating GC-SSCP. This method distills hierarchical subgoal-exploiting behavior into a flat inference policy that uses shared architectures across reasoning levels, enabling efficient goal-reaching without explicit hierarchical inference.

4 retrieved papers · Can Refute
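The hierarchical-to-flat distillation described above can be sketched as follows, again as a hedged interpretation rather than the authors' method: a single goal-conditioned completion network (the shared architecture across reasoning levels) regresses completion targets implied by a subgoal-exploiting teacher's actions, while being conditioned only on the final goal, so no subgoal is produced at inference time. `GoalConditionedCompletion` and `distill_step` are hypothetical names.

```python
import torch
import torch.nn as nn

class GoalConditionedCompletion(nn.Module):
    """Hypothetical flat goal-conditioned completion network. The same
    architecture can be conditioned on a distant final goal or on a
    nearby subgoal, one way to realize a shared architecture across
    reasoning levels."""
    def __init__(self, state_dim, goal_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, goal, x_t, t):
        return self.net(torch.cat([state, goal, x_t, t], dim=-1))

def distill_step(student, state, goal, teacher_action):
    """Distillation sketch: the flat student regresses the completion
    vector toward an action produced by a subgoal-exploiting teacher,
    but is conditioned only on the final goal, so explicit hierarchical
    inference is not needed at deployment."""
    noise = torch.randn_like(teacher_action)
    t = torch.rand(teacher_action.shape[0], 1)
    x_t = (1 - t) * noise + t * teacher_action
    target = teacher_action - x_t
    pred = student(state, goal, x_t, t)
    return ((pred - target) ** 2).mean()
```

The teacher's actions could come from any subgoal-conditioned policy; the sketch only assumes they are available as regression targets.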

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Single-Step Completion Policy (SSCP) for efficient generative policy learning


Contribution

Off-policy actor-critic framework compatible with SSCP


Contribution

Framework for distilling hierarchical behavior into flat policies (GC-SSCP)
