FlowRL: Matching Reward Distributions for LLM Reasoning
Overview
Overall Novelty Assessment
The paper proposes FlowRL, a method that matches reward distributions via flow balancing rather than maximizing scalar rewards in LLM reinforcement learning. It resides in the Flow-Based and Distribution-Matching Methods leaf, which contains only three papers including this one. This is a relatively sparse research direction within the broader taxonomy of 50 papers across nine major branches, suggesting that distribution-matching approaches remain less explored compared to traditional reward maximization methods that dominate neighboring branches.
The taxonomy reveals that FlowRL sits within Reward Distribution Modeling and Optimization, adjacent to Reward Maximization Approaches containing outcome-based RL and policy optimization algorithms. The sibling papers in the same leaf (Distribution Matching Policy and one other) share the conceptual foundation of treating reasoning as a probabilistic flow problem. Neighboring branches like Process and Dense Reward Models and Self-Rewarding frameworks pursue different supervision strategies—step-level feedback versus self-generated rewards—highlighting how FlowRL's distributional objective diverges from both sparse outcome signals and dense process supervision paradigms.
Among 18 candidates examined across three contributions, the FlowRL algorithm contribution shows 2 refutable candidates out of 10 examined, while the theoretical equivalence between KL minimization and trajectory balance shows 4 refutable candidates out of 7 examined. For the length-normalization contribution, only one candidate was examined, with no refutations. These statistics indicate that the core algorithmic and theoretical contributions face more substantial prior-work overlap within the limited search scope, while the technical implementation details appear less contested. The search scale of 18 candidates suggests this analysis captures prominent related work but may not be exhaustive.
Based on the limited literature search of 18 candidates, FlowRL appears to occupy a relatively sparse research direction with meaningful but not overwhelming prior work overlap. The taxonomy structure confirms that distribution-matching methods remain a minority approach compared to scalar reward maximization, though the contribution-level statistics reveal that specific technical elements—particularly the flow balancing formulation and KL-trajectory balance equivalence—have notable precedents among the examined candidates. A broader search might uncover additional related work in adjacent optimization or probabilistic inference communities.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce FlowRL, a policy optimization algorithm that shifts from reward maximization to reward distribution matching. It transforms scalar rewards into normalized target distributions using a learnable partition function and minimizes reverse KL divergence between the policy and target distribution, promoting diverse exploration and generalizable reasoning trajectories.
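The distribution-matching objective can be sketched numerically. Assuming the target distribution is p(y|x) ∝ exp(β·r(x,y)) with a learned scalar log-partition term, a minimal illustration (function and variable names are ours, not the authors' code) is:

```python
import numpy as np

def flowrl_style_loss(log_pi, rewards, log_z, beta=1.0):
    """Squared residual between policy and reward-shaped target:
    (log Z + log pi(y|x) - beta * r(x, y))^2, averaged over rollouts.
    Driving this residual to zero makes pi proportional to exp(beta * r),
    i.e., the policy matches the reward distribution rather than
    concentrating on the single highest-reward trajectory."""
    residual = log_z + np.asarray(log_pi, dtype=float) - beta * np.asarray(rewards, dtype=float)
    return float(np.mean(residual ** 2))

# Toy rollouts: sequence-level log-probabilities and scalar rewards.
log_pi = np.array([-2.0, -1.5, -3.0])
rewards = np.array([0.8, 1.0, 0.2])
loss = flowrl_style_loss(log_pi, rewards, log_z=2.3)
```

In practice log Z would be produced by a learnable partition-function estimator conditioned on the prompt; here it is a fixed scalar purely for illustration.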
The authors establish theoretical equivalence (Proposition 1) showing that minimizing the KL objective is equivalent to minimizing the trajectory balance loss from GFlowNets. This provides a practical surrogate for reward-guided KL minimization that can be integrated into existing RL frameworks.
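One common way such an equivalence is argued (a sketch under the notation above, not necessarily the paper's exact derivation): with target $p(y\mid x)=\exp(\beta r(x,y))/Z(x)$, both objectives share the same policy gradient.

```latex
% Reverse KL objective:
\mathrm{KL}(\pi_\theta \,\|\, p)
  = \mathbb{E}_{y\sim\pi_\theta}\!\left[\log \pi_\theta(y\mid x) - \beta r(x,y) + \log Z(x)\right]

% Trajectory balance loss (GFlowNets), with learnable partition Z_\phi:
\mathcal{L}_{\mathrm{TB}}(\theta,\phi)
  = \mathbb{E}\!\left[\big(\log Z_\phi(x) + \log \pi_\theta(y\mid x) - \beta r(x,y)\big)^2\right]

% Gradient match: since \mathbb{E}_{\pi_\theta}[\nabla_\theta \log \pi_\theta] = 0,
\nabla_\theta\, \mathrm{KL}(\pi_\theta \,\|\, p)
  = \mathbb{E}_{\pi_\theta}\!\left[\big(\log \pi_\theta - \beta r + \log Z\big)\,
      \nabla_\theta \log \pi_\theta\right]
  = \tfrac{1}{2}\,\nabla_\theta\, \mathcal{L}_{\mathrm{TB}}(\theta,\phi)
```

The last equality treats the sampling distribution as fixed (stop-gradient through the rollouts) and assumes $\log Z_\phi$ is at its optimum, which is the hedged sense in which the squared trajectory balance loss serves as a practical surrogate for the KL objective.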
The authors develop two technical solutions for long chain-of-thought training: length normalization to prevent gradient explosion from variable-length sequences, and importance sampling to correct distribution mismatch between generated rollouts and the current policy.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[16] Cal-dpo: Calibrated direct preference optimization for language model alignment
[31] Enhancing reasoning for diffusion LLMs via distribution matching policy optimization
Contribution Analysis
Detailed comparisons for each claimed contribution
FlowRL algorithm for reward distribution matching
The authors introduce FlowRL, a policy optimization algorithm that shifts from reward maximization to reward distribution matching. It transforms scalar rewards into normalized target distributions using a learnable partition function and minimizes reverse KL divergence between the policy and target distribution, promoting diverse exploration and generalizable reasoning trajectories.
[6] Amortizing intractable inference in large language models
[55] On reinforcement learning and distribution matching for fine-tuning language models with no catastrophic forgetting
[1] Self-rewarding language models
[22] R3hf: Reward redistribution for enhancing reinforcement learning from human feedback
[52] Guiding pretraining in reinforcement learning with large language models
[53] Human-centric reward optimization for reinforcement learning-based automated driving using large language models
[54] Transforming and combining rewards for aligning large language models
[56] Generalist reward models: Found inside large language models
[57] Direct preference optimization: Your language model is secretly a reward model
[58] Reward collapse in aligning large language models
Theoretical equivalence between KL minimization and trajectory balance
The authors establish theoretical equivalence (Proposition 1) showing that minimizing the KL objective is equivalent to minimizing the trajectory balance loss from GFlowNets. This provides a practical surrogate for reward-guided KL minimization that can be integrated into existing RL frameworks.
[59] On divergence measures for training GFlowNets
[60] Amortizing intractable inference in diffusion models for vision, language, and control
[61] A variational perspective on generative flow networks
[63] Relative trajectory balance is equivalent to Trust-PCL
[62] Streaming Bayes GFlowNets
[64] FlowHF: Generative flow networks for RLHF
[65] KL divergence optimization with entropy-ratio estimation for stochastic GFlowNets
Length normalization and importance sampling techniques
The authors develop two technical solutions for long chain-of-thought training: length normalization to prevent gradient explosion from variable-length sequences, and importance sampling to correct distribution mismatch between generated rollouts and the current policy.