Flow Actor-Critic for Offline Reinforcement Learning

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: reinforcement learning, offline reinforcement learning, flow actor-critic, flow policies, flow matching
Abstract:

Dataset distributions in offline reinforcement learning (RL) are often complex and multi-modal, necessitating policies expressive enough to capture them, beyond the widely used Gaussian policies. To handle such complex, multi-modal datasets, we propose Flow Actor-Critic, a new actor-critic method for offline RL built on recent flow policies. The proposed method not only uses the flow model for the actor, as in previous flow policies, but also exploits the expressive flow model to obtain a conservative critic that prevents Q-value explosion in out-of-data regions. To this end, we propose a new form of critic regularizer based on an accurate proxy behavior model obtained as a byproduct of the flow-based actor design. Leveraging the flow model in this joint way, we achieve new state-of-the-art performance on standard offline RL test suites, including D4RL and the recent OGBench benchmarks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Flow Actor-Critic, which integrates flow-based policies with conservative critic learning for offline reinforcement learning. Within the taxonomy, it resides in the 'Conservative Critics with Flow-Based Regularization' leaf under 'Flow-Based Actor-Critic and Value Learning'. This leaf contains only two papers in total: the paper under review and one sibling (Diffusion Actor-Critic). This positioning suggests a relatively sparse research direction focused specifically on coupling expressive generative policies with pessimistic value estimation, rather than the broader actor-critic landscape.

The taxonomy reveals neighboring leaves addressing related but distinct approaches: 'Distributional and Energy-Guided Flow Critics' explores full return distributions or energy-based guidance, while 'Q-Guided Flow Policies and Expressive Value Learning' emphasizes Q-function guidance for policy training. Sibling branches include 'Behavior Regularization and Constraint Formulation', which enforces constraints via explicit density estimation or diffusion-based regularization, and 'Efficient One-Step and Accelerated Flow Policies', which prioritizes computational efficiency over conservative value learning. The paper's focus on joint flow-based actor and critic design distinguishes it from purely policy-centric or efficiency-driven methods.

Among the 24 candidates examined, the contribution-level analysis shows mixed novelty signals. For 'Flow-based critic penalization using behavior proxy density', 10 candidates were examined and 1 refutable match was found, suggesting some prior overlap in using flow models for conservative critics. For 'Flow Actor-Critic method for offline RL', 10 candidates were examined with no refutations, indicating potential novelty in the joint framework design. For 'Confidence-weighted critic penalization operator', 4 candidates were examined and 1 refutable match was found, pointing to possible precedent in weighted penalization schemes. The limited search scope (24 papers) means these findings reflect the top semantic matches rather than exhaustive coverage.

Given the sparse taxonomy leaf and limited search scope, the work appears to occupy a relatively underexplored niche combining flow-based actors with conservative critics. The joint exploitation of flow models for both policy expressiveness and critic regularization distinguishes it from sibling work focused solely on diffusion-based actors. However, the refutable matches for two contributions suggest that individual components may have precedent, even if the integrated framework is novel. A broader literature search beyond 24 candidates would clarify whether the joint design represents a substantive advance or incremental synthesis.

Taxonomy

- Core-task taxonomy papers: 40
- Claimed contributions: 3
- Contribution candidate papers compared: 24
- Refutable papers: 2

Research Landscape Overview

Core task: offline reinforcement learning with expressive flow-based policies. The field has evolved around leveraging normalizing flows and diffusion models to represent complex, multimodal action distributions from static datasets.

The taxonomy reveals several major branches: some focus on architectural innovations and training stability for flow-based policies, while others pursue efficient one-step or accelerated sampling to reduce computational overhead. A substantial cluster addresses actor-critic frameworks that pair flow-based actors with conservative value learning, and another examines behavior regularization strategies to prevent distributional shift. Parallel lines explore diffusion policies for offline RL, specialized applications in robotics or continuous control, and extensions to online exploration or out-of-distribution adaptation. Works like Flow to Control[2] and MaxEnt Flow[3] illustrate early efforts to integrate flow matching with policy optimization, whereas Diffusion Policies Offline[6] and Efficient Diffusion Policies[20] highlight the diffusion-based counterpart. Recent activity centers on balancing expressiveness with computational efficiency and ensuring conservative value estimates under distribution shift.

Flow Actor-Critic[0] sits within the actor-critic and value learning branch, specifically under conservative critics with flow-based regularization, closely neighboring Diffusion Actor-Critic[18]. Both emphasize coupling expressive generative policies with pessimistic Q-functions to mitigate overestimation in offline settings. Compared to approaches like SAC Flow[4] or OM2P[5], which also integrate flow-based actors with soft actor-critic frameworks, Flow Actor-Critic[0] appears to place stronger emphasis on explicit conservatism in the critic. Meanwhile, works such as Flow Q-Learning[8] and FlowQ[19] explore alternative value-learning formulations, and Shortcut Models[7] or Flow Single-Step[9] pursue faster inference. The landscape reflects an ongoing trade-off between model expressiveness, sample efficiency, and the need for robust off-policy evaluation.

Claimed Contributions

Flow-based critic penalization using behavior proxy density

The authors introduce a novel critic penalization method that directly identifies out-of-distribution regions using the tractable density estimates from a flow behavior proxy policy. This penalization preserves the original Bellman operator in confident in-distribution regions while suppressing Q-value overestimation for out-of-distribution actions.

10 retrieved papers · can refute
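As a rough illustration of the mechanism claimed above, the sketch below applies a density-based penalty to the Bellman target: actions whose behavior-proxy log-density exceeds a support threshold are left untouched (preserving the original operator), while lower-density actions have their targets suppressed. The function names, the threshold, and the penalty scale are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def ood_penalty(log_p_behavior, threshold=-4.0, alpha=1.0):
    """Penalty that is zero for in-distribution actions (behavior-proxy
    log-density above `threshold`) and grows as the density falls below it.
    `threshold` and `alpha` are illustrative hyperparameters."""
    return alpha * np.maximum(0.0, threshold - log_p_behavior)

def penalized_target(reward, gamma, q_next, log_p_behavior):
    """Standard Bellman target minus the density-based penalty: unchanged
    in confident in-distribution regions, suppressed out of distribution."""
    return reward + gamma * q_next - ood_penalty(log_p_behavior)
```

In a confident region (e.g. log-density 0.0) the target reduces to the ordinary `r + gamma * Q'`, so no conservatism bias is introduced where the data support is strong.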
Flow Actor-Critic method for offline RL

The authors present Flow Actor-Critic (FAC), which jointly exploits the flow model for both actor design and conservative critic acquisition. Unlike previous flow policies that only use flow models for the actor, FAC leverages the expressive flow model in both components to handle complex and multi-modal dataset distributions.

10 retrieved papers · no refutation found
Confidence-weighted critic penalization operator

The authors define a weight function based on flow behavior proxy density that vanishes in well-supported regions and increases linearly as density decreases. This yields a new Bellman operator that maintains unbiased Q-values in confident in-distribution regions while gradually suppressing values in low-confidence areas.

4 retrieved papers · can refute
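The weight function described above can be sketched as follows: a weight that is exactly zero above an assumed support threshold and grows linearly toward 1 as the behavior-proxy density falls, blending the standard Bellman target toward a pessimistic floor. The threshold `d0` and floor `q_min` are hypothetical placeholders, not values from the paper.

```python
import numpy as np

def confidence_weight(density, d0=1.0):
    """Weight that is exactly zero when the behavior-proxy density exceeds
    the support threshold `d0` (well-supported region) and rises linearly
    toward 1 as the density falls to zero. `d0` is an assumed value."""
    return np.clip(1.0 - density / d0, 0.0, 1.0)

def weighted_bellman_target(reward, gamma, q_next, density, q_min=0.0):
    """Hypothetical weighted operator: the unbiased Bellman target where
    the weight is zero, blended toward a pessimistic floor `q_min` as the
    weight approaches 1 in low-confidence regions."""
    w = confidence_weight(density)
    target = reward + gamma * q_next
    return (1.0 - w) * target + w * q_min
```

At densities at or above `d0` the weight vanishes and the operator coincides with the ordinary Bellman operator, matching the claim that Q-values stay unbiased in confident in-distribution regions.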

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Flow-based critic penalization using behavior proxy density

Contribution: Flow Actor-Critic method for offline RL

Contribution: Confidence-weighted critic penalization operator