floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: offline RL, online fine-tuning, flow-matching, TD-learning
Abstract:

A hallmark of modern large-scale machine learning techniques is the use of training objectives that provide dense supervision to intermediate computations, such as teacher forcing the next token in language models or denoising step-by-step in diffusion models. This enables models to learn complex functions in a generalizable manner. Motivated by this observation, we investigate the benefits of iterative computation for temporal difference (TD) methods in reinforcement learning (RL). Typically, they represent value functions in a monolithic fashion, without iterative compute. We introduce floq (flow-matching Q-functions), an approach that parameterizes the Q-function using a velocity field and trains it with techniques from flow-matching, typically used in generative modeling. This velocity field underneath the flow is trained using a TD-learning objective, which bootstraps from values produced by a target velocity field, computed by running multiple steps of numerical integration. Crucially, floq allows for more fine-grained control and scaling of the Q-function capacity than monolithic architectures, by appropriately setting the number of integration steps. Across a suite of challenging offline RL benchmarks and online fine-tuning tasks, floq improves performance by nearly 1.8x. floq scales capacity far better than standard TD-learning architectures, highlighting the potential of iterative computation for value learning.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces floq, which parameterizes Q-functions using velocity fields trained via flow-matching techniques from generative modeling, enabling iterative computation through numerical integration steps. According to the taxonomy, this work occupies the 'Flow-Matching for Q-Function Learning' leaf under 'Iterative Q-Function Computation and Flow-Based Methods'. Notably, this leaf contains only the original paper itself with no sibling papers, indicating this is a sparse and emerging research direction. The broader parent category includes one sibling leaf on 'Search-Augmented Value Estimation', suggesting limited prior exploration of iterative Q-function computation paradigms.

The taxonomy reveals that neighboring research directions focus on compute-optimal scaling through resource allocation, enhanced Q-learning architectures using ensembles or auxiliary value functions, and offline RL with expressive generative models. The 'Compute-Optimal Training Recipes' and 'Massively Parallel Simulation Scaling' leaves address scaling through parallelism and training efficiency rather than test-time iterative refinement. The 'Offline RL and Expressive Value Learning' category uses generative models for value learning but in offline settings without the iterative flow-matching mechanism. This positioning suggests floq bridges generative modeling and value-based RL in a manner distinct from existing compute scaling approaches.

Among the thirteen candidates examined, the analysis found one refutable pair for the 'TD-learning objective with bootstrapped flow-matching targets' contribution; the core floq architecture was compared against ten candidates with zero refutations, and the design choices against two candidates with zero refutations. The limited search scope means these statistics reflect top-K semantic matches rather than exhaustive coverage. The single refutation for the TD-learning component suggests some overlap with existing bootstrapping techniques, while the architecture and design choices appear more novel within the examined candidate set. The small candidate pool and sparse taxonomy leaf together indicate this work explores relatively uncharted territory.

Based on the limited thirteen-candidate search scope and the taxonomy structure showing an isolated leaf position, the work appears to occupy a genuinely sparse research direction. However, the analysis cannot rule out relevant prior work outside the semantic search radius or in adjacent fields like generative modeling. The single refutation for the TD-learning objective warrants closer examination of how bootstrapped flow-matching targets relate to existing temporal difference methods, though the core architectural contribution shows no clear overlap within the examined literature.

Taxonomy

Core-task Taxonomy Papers: 25
Claimed Contributions: 3
Contribution Candidate Papers Compared: 13
Refutable Papers: 1

Research Landscape Overview

Core task: scaling compute in value-based reinforcement learning through iterative Q-function computation. The field encompasses diverse strategies for improving the efficiency, scalability, and expressiveness of Q-learning and related value-based methods. At the highest level, the taxonomy organizes work into branches addressing iterative computation and flow-based techniques, compute-optimal resource allocation, multi-agent value factorization, decentralized learning, enhanced architectures and training dynamics, constrained and safe RL, efficient state-action representations, offline RL with expressive value functions, focused and topological value iteration, adaptive stopping mechanisms, domain-specific applications, and foundational tutorials.

Some branches, such as multi-agent value factorization and decentralized learning, emphasize coordination and scalability across agents, while others like compute-optimal scaling and enhanced architectures focus on single-agent efficiency and representational power. Works on constrained RL and safe learning address risk-sensitive settings, and offline RL methods tackle data-driven scenarios without online interaction.

Within this landscape, a particularly active line of research explores novel computational paradigms for Q-function learning, including flow-matching and iterative refinement techniques that leverage additional compute at inference time. floq[0] sits squarely in this emerging direction, proposing flow-matching as a mechanism to iteratively improve Q-value estimates by treating the computation as a generative process. This contrasts with more traditional approaches like parallel Q-learning methods or ensemble-based exploration strategies, which distribute computation across multiple learners or use ensembles for uncertainty quantification.
Compared to compute-optimal scaling studies that focus on balancing model size and training resources, floq[0] emphasizes test-time computation, aligning with recent trends in expressive value learning and adaptive stopping mechanisms. The work also differs from multi-agent factorization methods, which decompose joint Q-functions for coordination, by concentrating on single-agent iterative refinement. Overall, floq[0] represents a novel intersection of generative modeling and value-based RL, opening questions about how flow-based computation scales relative to classical iterative methods and ensemble techniques.

Claimed Contributions

floq: Flow-matching Q-function architecture with iterative computation

The authors propose a novel Q-function architecture that represents value functions as a velocity field trained via flow-matching. This velocity field transforms sampled noise into Q-value estimates through numerical integration, enabling iterative computation with dense supervision at each integration step.

10 retrieved papers
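As a rough illustration of this iterative computation, the sketch below Euler-integrates a velocity field from sampled noise to a scalar Q-value estimate. The function name, the `velocity_field(x, t, state, action)` signature, and the fixed-step Euler scheme are assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def q_value_via_flow(velocity_field, state, action, num_steps=8, rng=None):
    """Estimate a Q-value by Euler-integrating a velocity field from noise.

    `velocity_field(x, t, state, action)` is a hypothetical stand-in for the
    learned model; increasing `num_steps` spends more iterative compute on
    the same estimate, which is the capacity knob described above.
    """
    rng = rng or np.random.default_rng()
    x = rng.normal()          # sample the noise the flow starts from
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_field(x, t, state, action)  # one Euler step
    return x                  # the integrated sample at t = 1 is the estimate
```

For instance, a constant velocity field `lambda x, t, s, a: 1.0` simply shifts the initial noise by 1.0 over the unit time interval, regardless of the number of steps.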
TD-learning objective with bootstrapped flow-matching targets

The authors develop a training procedure that combines temporal difference learning with flow-matching by supervising the velocity field to match evolving TD-targets at each step of the iterative process, providing dense supervision throughout the flow.

1 retrieved paper
Can Refute
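A minimal sketch of this idea, assuming a linear (rectified-flow style) interpolation path between noise and the bootstrapped target: the velocity field is regressed toward the constant ground-truth velocity of that path. Here `velocity_field` and `td_target` (standing in for r + gamma * Q_target(s', a')) are illustrative assumptions, not the paper's exact objective.

```python
import numpy as np

def flow_td_loss(velocity_field, td_target, rng=None):
    """One-sample flow-matching regression toward a bootstrapped TD target.

    Under the linear interpolation x_t = (1 - t) * x0 + t * y, the
    ground-truth velocity is the constant y - x0, so the loss reduces to
    a squared regression error at a randomly sampled time t.
    """
    rng = rng or np.random.default_rng()
    x0 = rng.normal()                    # noise endpoint of the flow
    t = rng.uniform()                    # random time along the flow
    x_t = (1.0 - t) * x0 + t * td_target
    v_star = td_target - x0              # target velocity on the linear path
    v_pred = velocity_field(x_t, t)
    return (v_pred - v_star) ** 2        # squared flow-matching error
```

Because the TD target itself comes from integrating a target velocity field, this loss supervises every point along the flow, which is the "dense supervision" the contribution refers to.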
Design choices for stable flow-matching in value-based RL

The authors identify and address key challenges in applying flow-matching to scalar Q-values, introducing specific architectural choices including noise distribution tuning, categorical input representations, and Fourier time embeddings to prevent flow collapse and enable effective scaling.

2 retrieved papers
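Of these design choices, the Fourier time embedding is the easiest to sketch: the scalar flow time t is lifted into sin/cos features so the velocity network can resolve fine-grained time dependence. The log-spaced frequency schedule and the `max_freq` default below are assumptions for illustration, not the paper's reported configuration.

```python
import numpy as np

def fourier_time_embedding(t, dim=8, max_freq=16.0):
    """Map a scalar flow time t in [0, 1] to sin/cos Fourier features.

    Uses dim // 2 log-spaced frequencies up to `max_freq` (an assumed
    choice) and returns the concatenated sine and cosine responses.
    """
    freqs = np.exp(np.linspace(0.0, np.log(max_freq), dim // 2))
    angles = 2.0 * np.pi * freqs * t
    return np.concatenate([np.sin(angles), np.cos(angles)])
```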

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

floq: Flow-matching Q-function architecture with iterative computation


Contribution

TD-learning objective with bootstrapped flow-matching targets


Contribution

Design choices for stable flow-matching in value-based RL
