Flow Matching with Injected Noise for Offline-to-Online Reinforcement Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Reinforcement Learning, Offline-to-Online Reinforcement Learning, Flow Matching, Noise Injection
Abstract:

Generative models have recently demonstrated remarkable success across diverse domains, motivating their adoption as expressive policies in reinforcement learning (RL). While they have shown strong performance in offline RL, particularly where the target distribution is well defined, their extension to online fine-tuning has largely been treated as a direct continuation of offline pre-training, leaving key challenges unaddressed. In this paper, we propose Flow Matching with Injected Noise for Offline-to-Online RL (FINO), a novel method that leverages flow matching-based policies to enhance sample efficiency for offline-to-online RL. FINO facilitates effective exploration by injecting noise into policy training, thereby encouraging a broader range of actions beyond those observed in the offline dataset. Alongside exploration-enhanced flow policy training, FINO incorporates an entropy-guided sampling mechanism that balances exploration and exploitation, allowing the policy to adapt its behavior throughout online fine-tuning. Experiments across diverse, challenging tasks demonstrate that FINO consistently achieves superior performance under limited online budgets.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes FINO, a method combining noise injection and entropy-guided sampling to improve offline-to-online reinforcement learning with flow matching policies. According to the taxonomy, this work resides in the 'Noise Injection and Entropy-Guided Exploration' leaf under 'Offline-to-Online Transition and Exploration'. Notably, this leaf contains only the original paper itself—no sibling papers are listed—suggesting this specific combination of noise injection and entropy guidance represents a relatively sparse research direction within the broader offline-to-online flow matching landscape.

The taxonomy reveals that FINO's parent branch ('Offline-to-Online Transition and Exploration') contains three other leaves: adaptive post-training for vision-language-action models, action chunking for sample-efficient fine-tuning, and unified online-offline learning via implicit regularization. These neighboring directions address similar offline-to-online challenges but through different mechanisms—VLA-specific objectives, temporally extended actions, or implicit value regularization. The broader taxonomy shows related work in energy-guided training and critic design, but these focus on training-time objectives or value learning rather than exploration strategies during online fine-tuning, clarifying FINO's distinct positioning.

Among 29 candidates examined, the noise-injected training scheme (Contribution 2) shows one refutable candidate from 10 examined, indicating some prior work on noise-based exploration exists within the limited search scope. The overall FINO framework (Contribution 1) and entropy-guided sampling (Contribution 3) show no refutable candidates among 9 and 10 examined respectively, suggesting these specific combinations appear less directly addressed in the top-30 semantic matches. The statistics indicate moderate prior work overlap for the noise injection component, while the integrated framework and entropy mechanism appear more distinctive within this limited candidate pool.

Based on the top-29 semantic matches examined, FINO appears to occupy a relatively unexplored niche combining noise injection with entropy-guided sampling for flow-based offline-to-online RL. The taxonomy structure confirms this is a sparse leaf with no listed siblings, though the limited search scope means potentially relevant work outside the top-K matches remains unexamined. The analysis covers semantic proximity and citation-based expansion but does not constitute an exhaustive field survey.

Taxonomy

Core-task Taxonomy Papers: 22
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 1

Research Landscape Overview

Core task: offline-to-online reinforcement learning with flow matching policies. This emerging field combines flow-based generative models with reinforcement learning to enable smooth transitions from offline datasets to online fine-tuning.

The taxonomy reveals several interconnected research directions. Flow-Based Policy Architecture and Training Stability addresses the foundational design of flow matching networks and their convergence properties. Energy-Guided and Reward-Conditioned Flow Training explores how to steer generative policies toward high-reward regions using energy functions or explicit reward conditioning, as seen in works like Extremum Flow Matching[13] and Energy-Weighted Flow Matching[14]. Critic Design and Value Function Learning focuses on learning robust value estimates that can guide flow policies, with approaches ranging from distributional critics to expressive value representations. Offline Behavior Modeling and Constraint Enforcement tackles how to leverage offline data while avoiding out-of-distribution actions, often through behavioral cloning constraints or density-based regularization. Domain Transfer and Distribution Shift Handling examines robustness across different environments, while Inference-Time Guidance and Goal-Conditioned Learning investigates steering pre-trained flows at test time.

The Offline-to-Online Transition and Exploration branch specifically addresses the challenge of moving from static datasets to active learning. Within the offline-to-online transition landscape, a handful of works explore different mechanisms for balancing exploitation of learned policies with exploration of new behaviors. Some methods like EXPO[1] and floq[5] emphasize principled exploration strategies that build on flow-based policy representations, while others such as SAC Flow[3] adapt entropy-regularized frameworks to the flow setting.
Flow Matching Injected Noise[0] sits within the Noise Injection and Entropy-Guided Exploration cluster, focusing on how stochastic perturbations can facilitate exploration during the online phase. This contrasts with approaches like FlowQ[2] or Flow Q-Learning[4], which may prioritize tighter integration between flow generation and Q-function learning. The central tension across these branches involves maintaining the expressiveness of flow models while ensuring stable online improvement and adequate exploration, a challenge that Flow Matching Injected Noise[0] addresses through its noise injection mechanism.

Claimed Contributions

Contribution 1: Flow Matching with Injected Noise for Offline-to-Online RL (FINO)

The authors introduce FINO, a method that injects noise into flow matching during offline pre-training to expand the action space beyond the dataset, enabling more effective exploration during subsequent online fine-tuning in reinforcement learning.

9 retrieved papers (none refutable)
Contribution 2: Noise-injected training scheme for flow matching

The authors propose a training objective that injects controlled noise into the flow matching formulation, encouraging the policy to explore a broader range of actions beyond those present in the offline dataset while maintaining valid continuous normalizing flows.

10 retrieved papers (1 can refute)
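The paper itself is not quoted here, so the following is only a minimal sketch of what such a noise-injected conditional flow matching loss could look like. It assumes a linear interpolation path between base noise and dataset actions, and assumes the injection takes the form of Gaussian perturbations added to the target actions before constructing that path; the names `vel_net` and `sigma` are illustrative, and FINO's actual objective may differ.

```python
import torch

def noisy_flow_matching_loss(vel_net, states, actions, sigma=0.1):
    """Conditional flow matching loss with Gaussian noise injected into
    the dataset actions (hypothetical variant, not the paper's exact scheme).

    vel_net: callable (states, x_t, t) -> predicted velocity.
    sigma:   scale of the injected perturbation; sigma=0 recovers
             standard conditional flow matching.
    """
    # Perturb target actions so the learned flow covers a wider
    # action distribution than the raw offline dataset.
    x1 = actions + sigma * torch.randn_like(actions)
    x0 = torch.randn_like(actions)            # base Gaussian sample
    t = torch.rand(actions.shape[0], 1)       # interpolation time in [0, 1)
    xt = (1 - t) * x0 + t * x1                # linear interpolation path
    v_target = x1 - x0                        # target velocity along the path
    v_pred = vel_net(states, xt, t)
    return ((v_pred - v_target) ** 2).mean()
```

With `sigma = 0` this reduces to the standard conditional flow matching regression target, which makes the perturbation scale a natural knob for trading dataset fidelity against exploration coverage.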
Contribution 3: Entropy-guided sampling mechanism

The authors introduce a sampling mechanism that constructs a distribution over candidate actions based on their action-values and dynamically adjusts a temperature parameter using policy entropy, enabling adaptive balancing between exploration and exploitation during online fine-tuning.

10 retrieved papers (none refutable)
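The description above (a softmax over candidate action-values with a temperature adapted by policy entropy) can be sketched as follows. The temperature update rule, learning rate, and function name `entropy_guided_select` are assumptions for illustration; the paper's concrete mechanism is not reproduced here.

```python
import numpy as np

def entropy_guided_select(q_values, tau, target_entropy, lr=0.05):
    """Sample one of N candidate actions via a softmax over their
    Q-values, then nudge the temperature toward a target entropy
    (a hypothetical update rule, not quoted from the paper)."""
    # Numerically stable softmax over Q-values at temperature tau.
    z = (q_values - q_values.max()) / max(tau, 1e-6)
    probs = np.exp(z) / np.exp(z).sum()
    # Entropy of the current candidate distribution.
    entropy = -(probs * np.log(probs + 1e-12)).sum()
    # If entropy is below target, raise tau (more exploration);
    # if above, lower it (more exploitation). Keep tau positive.
    new_tau = max(tau + lr * (target_entropy - entropy), 1e-3)
    idx = np.random.choice(len(q_values), p=probs)
    return idx, new_tau
```

Called once per environment step with the critic's Q-values for the candidate actions, this keeps the sampling distribution's entropy hovering near the target, which matches the claimed adaptive exploration-exploitation balance.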

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, a partial signal of novelty that remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Flow Matching with Injected Noise for Offline-to-Online RL (FINO)

The authors introduce FINO, a method that injects noise into flow matching during offline pre-training to expand the action space beyond the dataset, enabling more effective exploration during subsequent online fine-tuning in reinforcement learning.

Contribution 2: Noise-injected training scheme for flow matching

The authors propose a training objective that injects controlled noise into the flow matching formulation, encouraging the policy to explore a broader range of actions beyond those present in the offline dataset while maintaining valid continuous normalizing flows.

Contribution 3: Entropy-guided sampling mechanism

The authors introduce a sampling mechanism that constructs a distribution over candidate actions based on their action-values and dynamically adjusts a temperature parameter using policy entropy, enabling adaptive balancing between exploration and exploitation during online fine-tuning.