Flow Matching with Injected Noise for Offline-to-Online Reinforcement Learning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Reinforcement Learning, Offline-to-Online Reinforcement Learning, Flow Matching, Noise Injection
Abstract:

Generative models have recently demonstrated remarkable success across diverse domains, motivating their adoption as expressive policies in reinforcement learning (RL). While they have shown strong performance in offline RL, particularly where the target distribution is well defined, their extension to online fine-tuning has largely been treated as a direct continuation of offline pre-training, leaving key challenges unaddressed. In this paper, we propose Flow Matching with Injected Noise for Offline-to-Online RL (FINO), a novel method that leverages flow matching-based policies to enhance sample efficiency for offline-to-online RL. FINO facilitates effective exploration by injecting noise into policy training, thereby encouraging a broader range of actions beyond those observed in the offline dataset. Alongside exploration-enhanced flow policy training, FINO incorporates an entropy-guided sampling mechanism that balances exploration and exploitation, allowing the policy to adapt its behavior throughout online fine-tuning. Experiments across diverse, challenging tasks demonstrate that FINO consistently achieves superior performance under limited online budgets.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes FINO, a method combining noise injection and entropy-guided sampling to improve offline-to-online reinforcement learning with flow matching policies. According to the taxonomy, this work resides in the 'Noise Injection and Entropy-Guided Exploration' leaf under 'Offline-to-Online Transition and Exploration'. Notably, this leaf contains only the original paper itself—no sibling papers are listed—suggesting this specific combination of noise injection and entropy guidance represents a relatively sparse research direction within the broader offline-to-online flow matching landscape.

The taxonomy reveals that FINO's parent branch ('Offline-to-Online Transition and Exploration') contains three other leaves: adaptive post-training for vision-language-action models, action chunking for sample-efficient fine-tuning, and unified online-offline learning via implicit regularization. These neighboring directions address similar offline-to-online challenges but through different mechanisms—VLA-specific objectives, temporally extended actions, or implicit value regularization. The broader taxonomy shows related work in energy-guided training and critic design, but these focus on training-time objectives or value learning rather than exploration strategies during online fine-tuning, clarifying FINO's distinct positioning.

Among 29 candidates examined, the noise-injected training scheme (Contribution 2) shows one refutable candidate from 10 examined, indicating some prior work on noise-based exploration exists within the limited search scope. The overall FINO framework (Contribution 1) and entropy-guided sampling (Contribution 3) show no refutable candidates among 9 and 10 examined respectively, suggesting these specific combinations appear less directly addressed in the top-30 semantic matches. The statistics indicate moderate prior work overlap for the noise injection component, while the integrated framework and entropy mechanism appear more distinctive within this limited candidate pool.

Based on the top-29 semantic matches examined, FINO appears to occupy a relatively unexplored niche combining noise injection with entropy-guided sampling for flow-based offline-to-online RL. The taxonomy structure confirms this is a sparse leaf with no listed siblings, though the limited search scope means potentially relevant work outside the top-K matches remains unexamined. The analysis covers semantic proximity and citation-based expansion but does not constitute an exhaustive field survey.

Taxonomy

Core-task Taxonomy Papers: 22
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 1

Research Landscape Overview

Core task: offline-to-online reinforcement learning with flow matching policies. This emerging field combines flow-based generative models with reinforcement learning to enable smooth transitions from offline datasets to online fine-tuning.

The taxonomy reveals several interconnected research directions. Flow-Based Policy Architecture and Training Stability addresses the foundational design of flow matching networks and their convergence properties. Energy-Guided and Reward-Conditioned Flow Training explores how to steer generative policies toward high-reward regions using energy functions or explicit reward conditioning, as seen in works like Extremum Flow Matching[13] and Energy-Weighted Flow Matching[14]. Critic Design and Value Function Learning focuses on learning robust value estimates that can guide flow policies, with approaches ranging from distributional critics to expressive value representations. Offline Behavior Modeling and Constraint Enforcement tackles how to leverage offline data while avoiding out-of-distribution actions, often through behavioral cloning constraints or density-based regularization. Domain Transfer and Distribution Shift Handling examines robustness across different environments, while Inference-Time Guidance and Goal-Conditioned Learning investigates steering pre-trained flows at test time.

The Offline-to-Online Transition and Exploration branch specifically addresses the challenge of moving from static datasets to active learning. Within the offline-to-online transition landscape, a handful of works explore different mechanisms for balancing exploitation of learned policies with exploration of new behaviors. Some methods like EXPO[1] and floq[5] emphasize principled exploration strategies that build on flow-based policy representations, while others such as SAC Flow[3] adapt entropy-regularized frameworks to the flow setting.
Flow Matching Injected Noise[0] sits within the Noise Injection and Entropy-Guided Exploration cluster, focusing on how stochastic perturbations can facilitate exploration during the online phase. This contrasts with approaches like FlowQ[2] or Flow Q-Learning[4], which may prioritize tighter integration between flow generation and Q-function learning. The central tension across these branches involves maintaining the expressiveness of flow models while ensuring stable online improvement and adequate exploration, a challenge that Flow Matching Injected Noise[0] addresses through its noise injection mechanism.

Claimed Contributions

Contribution 1: Flow Matching with Injected Noise for Offline-to-Online RL (FINO)

The authors introduce FINO, a method that injects noise into flow matching during offline pre-training to expand the action space beyond the dataset, enabling more effective exploration during subsequent online fine-tuning in reinforcement learning.

9 retrieved papers (none refutable)
Contribution 2: Noise-injected training scheme for flow matching

The authors propose a training objective that injects controlled noise into the flow matching formulation, encouraging the policy to explore a broader range of actions beyond those present in the offline dataset while maintaining valid continuous normalizing flows.

10 retrieved papers (1 can refute)
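The paper itself is not quoted here, so the following is only a minimal sketch of what such a noise-injected conditional flow matching loss could look like. It assumes a linear interpolation path between base noise and dataset actions, and assumes the injection takes the form of Gaussian perturbations added to the target actions before constructing that path; the names `vel_net` and `sigma` are illustrative, and FINO's actual objective may differ.

```python
import torch

def noisy_flow_matching_loss(vel_net, states, actions, sigma=0.1):
    """Conditional flow matching loss with Gaussian noise injected into
    the dataset actions (hypothetical variant, not the paper's exact scheme).

    vel_net: callable (states, x_t, t) -> predicted velocity.
    sigma:   scale of the injected perturbation; sigma=0 recovers
             standard conditional flow matching.
    """
    # Perturb target actions so the learned flow covers a wider
    # action distribution than the raw offline dataset.
    x1 = actions + sigma * torch.randn_like(actions)
    x0 = torch.randn_like(actions)            # base Gaussian sample
    t = torch.rand(actions.shape[0], 1)       # interpolation time in [0, 1)
    xt = (1 - t) * x0 + t * x1                # linear interpolation path
    v_target = x1 - x0                        # target velocity along the path
    v_pred = vel_net(states, xt, t)
    return ((v_pred - v_target) ** 2).mean()
```

With `sigma = 0` this reduces to the standard conditional flow matching regression target, which makes the perturbation scale a natural knob for trading dataset fidelity against exploration coverage.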
Contribution 3: Entropy-guided sampling mechanism

The authors introduce a sampling mechanism that constructs a distribution over candidate actions based on their action-values and dynamically adjusts a temperature parameter using policy entropy, enabling adaptive balancing between exploration and exploitation during online fine-tuning.

10 retrieved papers (none refutable)
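The description above (a softmax over candidate action-values with a temperature adapted by policy entropy) can be sketched as follows. The temperature update rule, learning rate, and function name `entropy_guided_select` are assumptions for illustration; the paper's concrete mechanism is not reproduced here.

```python
import numpy as np

def entropy_guided_select(q_values, tau, target_entropy, lr=0.05):
    """Sample one of N candidate actions via a softmax over their
    Q-values, then nudge the temperature toward a target entropy
    (a hypothetical update rule, not quoted from the paper)."""
    # Numerically stable softmax over Q-values at temperature tau.
    z = (q_values - q_values.max()) / max(tau, 1e-6)
    probs = np.exp(z) / np.exp(z).sum()
    # Entropy of the current candidate distribution.
    entropy = -(probs * np.log(probs + 1e-12)).sum()
    # If entropy is below target, raise tau (more exploration);
    # if above, lower it (more exploitation). Keep tau positive.
    new_tau = max(tau + lr * (target_entropy - entropy), 1e-3)
    idx = np.random.choice(len(q_values), p=probs)
    return idx, new_tau
```

Called once per environment step with the critic's Q-values for the candidate actions, this keeps the sampling distribution's entropy hovering near the target, which matches the claimed adaptive exploration-exploitation balance.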

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, a partial signal of novelty that remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Flow Matching with Injected Noise for Offline-to-Online RL (FINO)

The authors introduce FINO, a method that injects noise into flow matching during offline pre-training to expand the action space beyond the dataset, enabling more effective exploration during subsequent online fine-tuning in reinforcement learning.

Contribution 2: Noise-injected training scheme for flow matching

The authors propose a training objective that injects controlled noise into the flow matching formulation, encouraging the policy to explore a broader range of actions beyond those present in the offline dataset while maintaining valid continuous normalizing flows.

Contribution 3: Entropy-guided sampling mechanism

The authors introduce a sampling mechanism that constructs a distribution over candidate actions based on their action-values and dynamically adjusts a temperature parameter using policy entropy, enabling adaptive balancing between exploration and exploitation during online fine-tuning.