Flow Actor-Critic for Offline Reinforcement Learning

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: reinforcement learning, offline reinforcement learning, flow actor-critic, flow policies, flow matching
Abstract:

Dataset distributions in offline reinforcement learning (RL) are often complex and multi-modal, necessitating policies expressive enough to capture them, beyond the widely used Gaussian policies. To handle such complex, multi-modal datasets, we propose Flow Actor-Critic, a new actor-critic method for offline RL built on recent flow policies. The proposed method not only uses the flow model for the actor, as in previous flow policies, but also exploits the expressive flow model to obtain a conservative critic that prevents Q-value explosion in out-of-data regions. To this end, we propose a new form of critic regularizer based on an accurate proxy behavior model obtained as a byproduct of the flow-based actor design. Leveraging the flow model in this joint way, we achieve new state-of-the-art performance on standard offline RL test suites, including D4RL and the recent OGBench benchmarks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Flow Actor-Critic, which integrates flow-based policies with conservative critic learning for offline reinforcement learning. Within the taxonomy, it resides in the 'Conservative Critics with Flow-Based Regularization' leaf under 'Flow-Based Actor-Critic and Value Learning'. This leaf contains only two papers in total: the paper under review and one sibling (Diffusion Actor-Critic). This positioning suggests a relatively sparse research direction focused specifically on coupling expressive generative policies with pessimistic value estimation, rather than the broader actor-critic landscape.

The taxonomy reveals neighboring leaves addressing related but distinct approaches: 'Distributional and Energy-Guided Flow Critics' explores full return distributions or energy-based guidance, while 'Q-Guided Flow Policies and Expressive Value Learning' emphasizes Q-function guidance for policy training. Sibling branches include 'Behavior Regularization and Constraint Formulation', which enforces constraints via explicit density estimation or diffusion-based regularization, and 'Efficient One-Step and Accelerated Flow Policies', which prioritizes computational efficiency over conservative value learning. The paper's focus on joint flow-based actor and critic design distinguishes it from purely policy-centric or efficiency-driven methods.

Among the 24 candidates examined, the contribution-level analysis shows mixed novelty signals. For 'Flow-based critic penalization using behavior proxy density', 10 candidates were examined and 1 refutable match was found, suggesting some prior overlap in using flow models for conservative critics. For 'Flow Actor-Critic method for offline RL', 10 candidates were examined with no refutations, indicating potential novelty in the joint framework design. For 'Confidence-weighted critic penalization operator', 4 candidates were examined and 1 refutable match was found, pointing to possible precedent in weighted penalization schemes. The limited search scope (24 papers) means these findings reflect the top semantic matches rather than exhaustive coverage.

Given the sparse taxonomy leaf and limited search scope, the work appears to occupy a relatively underexplored niche combining flow-based actors with conservative critics. The joint exploitation of flow models for both policy expressiveness and critic regularization distinguishes it from sibling work focused solely on diffusion-based actors. However, the refutable matches for two contributions suggest that individual components may have precedent, even if the integrated framework is novel. A broader literature search beyond 24 candidates would clarify whether the joint design represents a substantive advance or incremental synthesis.

Taxonomy

- Core-task taxonomy papers: 40
- Claimed contributions: 3
- Contribution candidate papers compared: 24
- Refutable papers: 2

Research Landscape Overview

Core task: offline reinforcement learning with expressive flow-based policies. The field has evolved around leveraging normalizing flows and diffusion models to represent complex, multimodal action distributions from static datasets.

The taxonomy reveals several major branches: some focus on architectural innovations and training stability for flow-based policies, while others pursue efficient one-step or accelerated sampling to reduce computational overhead. A substantial cluster addresses actor-critic frameworks that pair flow-based actors with conservative value learning, and another examines behavior regularization strategies to prevent distributional shift. Parallel lines explore diffusion policies for offline RL, specialized applications in robotics or continuous control, and extensions to online exploration or out-of-distribution adaptation. Works like Flow to Control[2] and MaxEnt Flow[3] illustrate early efforts to integrate flow matching with policy optimization, whereas Diffusion Policies Offline[6] and Efficient Diffusion Policies[20] highlight the diffusion-based counterpart. Recent activity centers on balancing expressiveness with computational efficiency and ensuring conservative value estimates under distribution shift.

Flow Actor-Critic[0] sits within the actor-critic and value learning branch, specifically under conservative critics with flow-based regularization, closely neighboring Diffusion Actor-Critic[18]. Both emphasize coupling expressive generative policies with pessimistic Q-functions to mitigate overestimation in offline settings. Compared to approaches like SAC Flow[4] or OM2P[5], which also integrate flow-based actors with soft actor-critic frameworks, Flow Actor-Critic[0] appears to place stronger emphasis on explicit conservatism in the critic. Meanwhile, works such as Flow Q-Learning[8] and FlowQ[19] explore alternative value-learning formulations, and Shortcut Models[7] or Flow Single-Step[9] pursue faster inference. The landscape reflects an ongoing trade-off between model expressiveness, sample efficiency, and the need for robust off-policy evaluation.

Claimed Contributions

Flow-based critic penalization using behavior proxy density

The authors introduce a novel critic penalization method that directly identifies out-of-distribution regions using the tractable density estimates from a flow behavior proxy policy. This penalization preserves the original Bellman operator in confident in-distribution regions while suppressing Q-value overestimation for out-of-distribution actions.

10 retrieved papers · can refute
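As a rough illustration of the mechanism claimed above, the sketch below applies a density-based penalty to the Bellman target: actions whose behavior-proxy log-density exceeds a support threshold are left untouched (preserving the original operator), while lower-density actions have their targets suppressed. The function names, the threshold, and the penalty scale are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def ood_penalty(log_p_behavior, threshold=-4.0, alpha=1.0):
    """Penalty that is zero for in-distribution actions (behavior-proxy
    log-density above `threshold`) and grows as the density falls below it.
    `threshold` and `alpha` are illustrative hyperparameters."""
    return alpha * np.maximum(0.0, threshold - log_p_behavior)

def penalized_target(reward, gamma, q_next, log_p_behavior):
    """Standard Bellman target minus the density-based penalty: unchanged
    in confident in-distribution regions, suppressed out of distribution."""
    return reward + gamma * q_next - ood_penalty(log_p_behavior)
```

In a confident region (e.g. log-density 0.0) the target reduces to the ordinary `r + gamma * Q'`, so no conservatism bias is introduced where the data support is strong.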
Flow Actor-Critic method for offline RL

The authors present Flow Actor-Critic (FAC), which jointly exploits the flow model for both actor design and conservative critic acquisition. Unlike previous flow policies that only use flow models for the actor, FAC leverages the expressive flow model in both components to handle complex and multi-modal dataset distributions.

10 retrieved papers · no refutation found
Confidence-weighted critic penalization operator

The authors define a weight function based on flow behavior proxy density that vanishes in well-supported regions and increases linearly as density decreases. This yields a new Bellman operator that maintains unbiased Q-values in confident in-distribution regions while gradually suppressing values in low-confidence areas.

4 retrieved papers · can refute
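The weight function described above can be sketched as follows: a weight that is exactly zero above an assumed support threshold and grows linearly toward 1 as the behavior-proxy density falls, blending the standard Bellman target toward a pessimistic floor. The threshold `d0` and floor `q_min` are hypothetical placeholders, not values from the paper.

```python
import numpy as np

def confidence_weight(density, d0=1.0):
    """Weight that is exactly zero when the behavior-proxy density exceeds
    the support threshold `d0` (well-supported region) and rises linearly
    toward 1 as the density falls to zero. `d0` is an assumed value."""
    return np.clip(1.0 - density / d0, 0.0, 1.0)

def weighted_bellman_target(reward, gamma, q_next, density, q_min=0.0):
    """Hypothetical weighted operator: the unbiased Bellman target where
    the weight is zero, blended toward a pessimistic floor `q_min` as the
    weight approaches 1 in low-confidence regions."""
    w = confidence_weight(density)
    target = reward + gamma * q_next
    return (1.0 - w) * target + w * q_min
```

At densities at or above `d0` the weight vanishes and the operator coincides with the ordinary Bellman operator, matching the claim that Q-values stay unbiased in confident in-distribution regions.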

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Flow-based critic penalization using behavior proxy density

Contribution: Flow Actor-Critic method for offline RL

Contribution: Confidence-weighted critic penalization operator