ReFORM: Reflected Flows for On-support Offline RL via Noise Manipulation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: offline reinforcement learning, support constraint, flow model
Abstract:

Offline reinforcement learning (RL) aims to learn the optimal policy from a fixed dataset collected by a behavior policy, without additional environment interaction. One common challenge in this setting is out-of-distribution (OOD) error, which occurs when the policy leaves the training distribution. Prior methods penalize a statistical distance term to keep the policy close to the behavior policy, but this constrains policy improvement and may not completely prevent OOD actions. Another challenge is that the optimal policy distribution can be multimodal and difficult to represent. Recent works apply diffusion or flow policies to address this problem, but it remains unclear how to avoid OOD errors while retaining policy expressiveness. We propose ReFORM, an offline RL method based on flow policies that enforces the less restrictive support constraint by construction. ReFORM first learns a behavior-cloning (BC) flow policy with a bounded source distribution to capture the support of the action distribution, then optimizes a reflected flow that generates bounded noise for the BC flow, preserving that support while maximizing performance. Across 40 challenging tasks from the OGBench benchmark with datasets of varying quality, and using a single fixed set of hyperparameters for all tasks, ReFORM dominates all baselines with hand-tuned hyperparameters on performance profile curves.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes ReFORM, an offline RL method using flow-based policies with reflected noise to enforce support constraints while optimizing performance. It resides in the 'Flow-Based Policies with Support Constraints' leaf, which currently contains only this paper within the broader 'Expressive Policy Representations' branch. This places the work in a relatively sparse research direction focused on using normalizing flows or reflected flows to guarantee that learned policies remain within the data support by construction, rather than through statistical penalties.

The taxonomy reveals that neighboring approaches tackle OOD avoidance through different mechanisms: 'Diffusion-Based Policies' use diffusion models for multimodal action distributions, 'Behavior Policy Proximity Constraints' enforce explicit support matching, and 'Conservative Q-Function Estimation' penalizes unseen action values. ReFORM's flow-based construction with reflected noise sits at the intersection of expressive policy representations and geometric support guarantees, diverging from penalty-based methods (e.g., Conservative Q-Learning) and diffusion guidance approaches (e.g., Diffusion OOD) by building constraints directly into the generative process.

Among 21 candidates examined, the contribution-level analysis shows mixed novelty signals. The core ReFORM framework examined 10 candidates with 1 refutable match, suggesting moderate prior overlap. The reflected flow mechanism examined only 1 candidate with no refutations, indicating less explored territory. However, the support-constrained optimization framework examined 10 candidates with 6 refutable matches, pointing to substantial existing work on support constraints in offline RL. The limited search scope (21 papers from semantic search) means these findings reflect proximity to known work rather than exhaustive coverage.

Based on the top-21 semantic matches, ReFORM appears to occupy a niche combining flow-based generative modeling with explicit support constraints, an area with sparse direct precedents but surrounded by related constraint mechanisms. The reflected flow component shows fewer overlaps, while the broader support-constraint framing connects to established methods. The analysis captures local novelty within the examined neighborhood but cannot assess whether deeper literature contains closer antecedents.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 21
Refutable papers: 7

Research Landscape Overview

Core task: Learning optimal policies from offline datasets while avoiding out-of-distribution actions. The field has organized itself around several complementary strategies for handling the fundamental challenge of extrapolation error when learning from fixed data. Value Function Regularization and Constraint Methods (e.g., Conservative Q-Learning[21], MOPO[18]) penalize or bound Q-values to prevent overestimation on unseen actions, while Policy Constraint and Regularization Approaches enforce behavioral similarity to the dataset through explicit regularization or support constraints. Model-Based Offline RL leverages learned dynamics models with uncertainty penalties, and Distributional and Stationary Distribution Methods (such as OptiDICE[9]) reframe the problem through occupancy measures. Meanwhile, Expressive Policy Representations explore richer parameterizations, including diffusion models (Diffusion Policies[13]) and flow-based architectures, that can naturally respect data support, and specialized branches address representation learning, offline-to-online transfer, multi-agent settings, and theoretical foundations.

Recent work has increasingly focused on how expressive policy classes can implicitly enforce distributional constraints without heavy regularization overhead. ReFORM[0] exemplifies this direction by using flow-based policies with support constraints, ensuring that generated actions remain within the convex hull of observed data while maintaining expressiveness for multimodal behaviors. This contrasts with simpler behavioral cloning or explicit penalty methods like Conservative Q-Learning[21], which may be overly conservative, and with diffusion-based approaches (Diffusion OOD[25]) that require careful guidance mechanisms.

Nearby efforts such as Support Constraint[34] and Supported Policy[36] similarly emphasize geometric or probabilistic guarantees on action support, while works like Implicit Q-Learning[1] and Anti-exploration[5] tackle OOD avoidance through implicit value learning or pessimistic suppression of exploration. The interplay between policy expressiveness and strict in-distribution guarantees remains an active frontier, with ReFORM[0] positioned among methods that prioritize provable support constraints alongside flexible generative modeling.

Claimed Contributions

ReFORM: support-constrained offline RL via reflected flow noise manipulation

The authors introduce ReFORM, a two-stage flow policy method that learns a BC flow policy with bounded source distribution and optimizes a reflected flow noise generator. This approach enforces support constraints by construction without requiring regularization hyperparameters, avoiding OOD actions while maintaining policy expressiveness.

10 retrieved papers (1 refutable match found)
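As a concrete sketch of the two-stage sampler described above, the following Python code is illustrative only: the velocity fields `noise_velocity` and `bc_velocity` are hypothetical names standing in for learned networks, integration is plain Euler, and the reflection step of the actual method is crudely approximated here by clipping to the bounded box.

```python
import numpy as np

def integrate_flow(velocity, x, n_steps=10):
    """Euler integration of a velocity field v(x, t) from t = 0 to t = 1."""
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x

def sample_action(state, noise_velocity, bc_velocity, action_dim, rng):
    """Two-stage sampling: an optimized noise flow produces a bounded source
    sample z in [-1, 1]^d, which the frozen BC flow maps to an action."""
    # Stage 1: optimized noise generator, kept inside the bounded box.
    # (A proper reflected flow reflects at the boundary; clip is a stand-in.)
    z0 = rng.uniform(-1.0, 1.0, size=action_dim)
    z = integrate_flow(lambda x, t: noise_velocity(state, x, t), z0)
    z = np.clip(z, -1.0, 1.0)
    # Stage 2: the frozen BC flow maps bounded noise to an action that stays
    # on the action support captured during behavior-cloning training.
    return integrate_flow(lambda x, t: bc_velocity(state, x, t), z)
```

Because the BC flow is trained once and then frozen, all policy improvement happens in noise space, which is how the construction avoids any explicit regularization hyperparameter.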
Reflected flow for generating constrained multimodal noise

The authors propose using reflected flow to generate multimodal noise that stays within the bounded support of the BC flow policy's source distribution. This enables the policy to capture complex multimodal action distributions while provably avoiding out-of-distribution actions.

1 retrieved paper
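The boundary behavior that keeps noise inside a bounded support can be illustrated with the standard triangle-wave folding map used for reflected processes. This is a minimal sketch over the box [-1, 1]^d, not the paper's implementation.

```python
import numpy as np

def reflect_into_box(x):
    """Fold arbitrary real coordinates into [-1, 1] by repeated reflection
    at the boundaries (a period-4 triangle-wave map)."""
    t = np.mod(x + 1.0, 4.0)              # shift so that 0 maps to 1
    return np.where(t <= 2.0, t - 1.0, 3.0 - t)

def reflected_euler_step(x, v, dt):
    """One Euler step of a reflected flow: move along the velocity field,
    then reflect any coordinate that left the box back inside."""
    return reflect_into_box(x + dt * v)
```

For example, a coordinate at 1.2 reflects to 0.8 and one at -1.2 reflects to -0.8, so repeated steps can never leave the box regardless of the magnitude of the velocity field.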
Support-constrained policy optimization framework

The authors formalize offline RL as a support-constrained optimization problem (Eq. 5) where the learned policy's support must be contained within the behavior policy's support. They prove this is less restrictive than KL divergence constraints yet more reliable than Wasserstein distance constraints for preventing OOD actions.

10 retrieved papers (6 refutable matches found)
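The report cites the paper's Eq. 5 without reproducing it. A standard formulation consistent with the description, in notation introduced here for illustration ($\pi_\theta$ the learned policy, $\beta$ the behavior policy, $Q$ the critic, $\mathcal{D}$ the dataset), would be:

```latex
\max_{\theta}\;
\mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\theta(\cdot \mid s)}\!\left[ Q(s, a) \right]
\quad \text{s.t.} \quad
\operatorname{supp}\!\left(\pi_\theta(\cdot \mid s)\right) \subseteq
\operatorname{supp}\!\left(\beta(\cdot \mid s)\right)
\quad \text{for all } s \in \mathcal{D}.
```

Under such a constraint the policy may redistribute probability mass arbitrarily within the behavior support, whereas a KL constraint penalizes any shift in density ratio; this is consistent with the claim that the support constraint is the less restrictive of the two.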

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the currently retrieved top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though the finding remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ReFORM: support-constrained offline RL via reflected flow noise manipulation

Contribution

Reflected flow for generating constrained multimodal noise

Contribution

Support-constrained policy optimization framework

(Each contribution is described in full under Claimed Contributions above.)
