ReFORM: Reflected Flows for On-support Offline RL via Noise Manipulation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: offline reinforcement learning, support constraint, flow model
Abstract:

Offline reinforcement learning (RL) aims to learn the optimal policy from a fixed dataset collected by a behavior policy, without additional environment interaction. One common challenge in this setting is out-of-distribution (OOD) error, which occurs when the policy leaves the training distribution. Prior methods penalize a statistical distance term to keep the policy close to the behavior policy, but this constrains policy improvement and may not completely prevent OOD actions. Another challenge is that the optimal policy distribution can be multimodal and difficult to represent. Recent works apply diffusion or flow policies to address this problem, but it remains unclear how to avoid OOD errors while retaining policy expressiveness. We propose ReFORM, an offline RL method based on flow policies that enforces the less restrictive support constraint by construction. ReFORM first learns a behavior-cloning (BC) flow policy with a bounded source distribution to capture the support of the action distribution, then optimizes a reflected flow that generates bounded noise for the BC flow, preserving that support while maximizing performance. Across 40 challenging tasks from the OGBench benchmark with datasets of varying quality, and using a single fixed set of hyperparameters for all tasks, ReFORM dominates all baselines with hand-tuned hyperparameters on performance profile curves.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes ReFORM, an offline RL method using flow-based policies with reflected noise to enforce support constraints while optimizing performance. It resides in the 'Flow-Based Policies with Support Constraints' leaf, which currently contains only this paper within the broader 'Expressive Policy Representations' branch. This places the work in a relatively sparse research direction focused on using normalizing flows or reflected flows to guarantee that learned policies remain within the data support by construction, rather than through statistical penalties.

The taxonomy reveals that neighboring approaches tackle OOD avoidance through different mechanisms: 'Diffusion-Based Policies' use diffusion models for multimodal action distributions, 'Behavior Policy Proximity Constraints' enforce explicit support matching, and 'Conservative Q-Function Estimation' penalizes unseen action values. ReFORM's flow-based construction with reflected noise sits at the intersection of expressive policy representations and geometric support guarantees, diverging from penalty-based methods (e.g., Conservative Q-Learning) and diffusion guidance approaches (e.g., Diffusion OOD) by building constraints directly into the generative process.

Among 21 candidates examined, the contribution-level analysis shows mixed novelty signals. The core ReFORM framework examined 10 candidates with 1 refutable match, suggesting moderate prior overlap. The reflected flow mechanism examined only 1 candidate with no refutations, indicating less explored territory. However, the support-constrained optimization framework examined 10 candidates with 6 refutable matches, pointing to substantial existing work on support constraints in offline RL. The limited search scope (21 papers from semantic search) means these findings reflect proximity to known work rather than exhaustive coverage.

Based on the top-21 semantic matches, ReFORM appears to occupy a niche combining flow-based generative modeling with explicit support constraints, an area with sparse direct precedents but surrounded by related constraint mechanisms. The reflected flow component shows fewer overlaps, while the broader support-constraint framing connects to established methods. The analysis captures local novelty within the examined neighborhood but cannot assess whether deeper literature contains closer antecedents.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 21
Refutable papers: 7

Research Landscape Overview

Core task: Learning optimal policies from offline datasets while avoiding out-of-distribution actions. The field has organized itself around several complementary strategies for handling the fundamental challenge of extrapolation error when learning from fixed data. Value Function Regularization and Constraint Methods (e.g., Conservative Q-Learning[21], MOPO[18]) penalize or bound Q-values to prevent overestimation on unseen actions, while Policy Constraint and Regularization Approaches enforce behavioral similarity to the dataset through explicit regularization or support constraints. Model-Based Offline RL leverages learned dynamics models with uncertainty penalties, and Distributional and Stationary Distribution Methods (such as OptiDICE[9]) reframe the problem through occupancy measures. Meanwhile, Expressive Policy Representations explore richer parameterizations, including diffusion models (Diffusion Policies[13]) and flow-based architectures, that can naturally respect data support, and specialized branches address representation learning, offline-to-online transfer, multi-agent settings, and theoretical foundations.

Recent work has increasingly focused on how expressive policy classes can implicitly enforce distributional constraints without heavy regularization overhead. ReFORM[0] exemplifies this direction by using flow-based policies with support constraints, ensuring that generated actions remain within the convex hull of observed data while maintaining expressiveness for multimodal behaviors. This contrasts with simpler behavioral cloning or explicit penalty methods like Conservative Q-Learning[21], which may be overly conservative, and with diffusion-based approaches (Diffusion OOD[25]) that require careful guidance mechanisms.

Nearby efforts such as Support Constraint[34] and Supported Policy[36] similarly emphasize geometric or probabilistic guarantees on action support, while works like Implicit Q-Learning[1] and Anti-exploration[5] tackle OOD avoidance through implicit value learning or pessimistic suppression of exploration. The interplay between policy expressiveness and strict in-distribution guarantees remains an active frontier, with ReFORM[0] positioned among methods that prioritize provable support constraints alongside flexible generative modeling.

Claimed Contributions

ReFORM: support-constrained offline RL via reflected flow noise manipulation

The authors introduce ReFORM, a two-stage flow policy method that learns a BC flow policy with bounded source distribution and optimizes a reflected flow noise generator. This approach enforces support constraints by construction without requiring regularization hyperparameters, avoiding OOD actions while maintaining policy expressiveness.

10 retrieved papers (1 refutable match found)
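As a concrete sketch of the two-stage sampler described above, the following Python code is illustrative only: the velocity fields `noise_velocity` and `bc_velocity` are hypothetical names standing in for learned networks, integration is plain Euler, and the reflection step of the actual method is crudely approximated here by clipping to the bounded box.

```python
import numpy as np

def integrate_flow(velocity, x, n_steps=10):
    """Euler integration of a velocity field v(x, t) from t = 0 to t = 1."""
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x

def sample_action(state, noise_velocity, bc_velocity, action_dim, rng):
    """Two-stage sampling: an optimized noise flow produces a bounded source
    sample z in [-1, 1]^d, which the frozen BC flow maps to an action."""
    # Stage 1: optimized noise generator, kept inside the bounded box.
    # (A proper reflected flow reflects at the boundary; clip is a stand-in.)
    z0 = rng.uniform(-1.0, 1.0, size=action_dim)
    z = integrate_flow(lambda x, t: noise_velocity(state, x, t), z0)
    z = np.clip(z, -1.0, 1.0)
    # Stage 2: the frozen BC flow maps bounded noise to an action that stays
    # on the action support captured during behavior-cloning training.
    return integrate_flow(lambda x, t: bc_velocity(state, x, t), z)
```

Because the BC flow is trained once and then frozen, all policy improvement happens in noise space, which is how the construction avoids any explicit regularization hyperparameter.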
Reflected flow for generating constrained multimodal noise

The authors propose using reflected flow to generate multimodal noise that stays within the bounded support of the BC flow policy's source distribution. This enables the policy to capture complex multimodal action distributions while provably avoiding out-of-distribution actions.

1 retrieved paper
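The boundary behavior that keeps noise inside a bounded support can be illustrated with the standard triangle-wave folding map used for reflected processes. This is a minimal sketch over the box [-1, 1]^d, not the paper's implementation.

```python
import numpy as np

def reflect_into_box(x):
    """Fold arbitrary real coordinates into [-1, 1] by repeated reflection
    at the boundaries (a period-4 triangle-wave map)."""
    t = np.mod(x + 1.0, 4.0)              # shift so that 0 maps to 1
    return np.where(t <= 2.0, t - 1.0, 3.0 - t)

def reflected_euler_step(x, v, dt):
    """One Euler step of a reflected flow: move along the velocity field,
    then reflect any coordinate that left the box back inside."""
    return reflect_into_box(x + dt * v)
```

For example, a coordinate at 1.2 reflects to 0.8 and one at -1.2 reflects to -0.8, so repeated steps can never leave the box regardless of the magnitude of the velocity field.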
Support-constrained policy optimization framework

The authors formalize offline RL as a support-constrained optimization problem (Eq. 5) where the learned policy's support must be contained within the behavior policy's support. They prove this is less restrictive than KL divergence constraints yet more reliable than Wasserstein distance constraints for preventing OOD actions.

10 retrieved papers (6 refutable matches found)
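The report cites the paper's Eq. 5 without reproducing it. A standard formulation consistent with the description, in notation introduced here for illustration ($\pi_\theta$ the learned policy, $\beta$ the behavior policy, $Q$ the critic, $\mathcal{D}$ the dataset), would be:

```latex
\max_{\theta}\;
\mathbb{E}_{s \sim \mathcal{D},\, a \sim \pi_\theta(\cdot \mid s)}\!\left[ Q(s, a) \right]
\quad \text{s.t.} \quad
\operatorname{supp}\!\left(\pi_\theta(\cdot \mid s)\right) \subseteq
\operatorname{supp}\!\left(\beta(\cdot \mid s)\right)
\quad \text{for all } s \in \mathcal{D}.
```

Under such a constraint the policy may redistribute probability mass arbitrarily within the behavior support, whereas a KL constraint penalizes any shift in density ratio; this is consistent with the claim that the support constraint is the less restrictive of the two.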

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the currently retrieved top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though the finding remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

ReFORM: support-constrained offline RL via reflected flow noise manipulation

Contribution

Reflected flow for generating constrained multimodal noise

Contribution

Support-constrained policy optimization framework

(Each contribution is described in full under Claimed Contributions above.)
