FlowRL: Matching Reward Distributions for LLM Reasoning

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Reward Distribution Matching, Flow Balance, LLM Reasoning
Abstract:

We propose FlowRL: matching the full reward distribution via flow balancing instead of solely maximizing rewards in large language model (LLM) reinforcement learning (RL). Recent advanced reasoning models adopt reward-maximizing methods (e.g., PPO and GRPO), which tend to over-optimize dominant reward signals while neglecting less frequent but valid reasoning paths, thus reducing diversity. In contrast, we transform scalar rewards into a normalized target distribution using a learnable partition function, and then minimize the reverse KL divergence between the policy and the target distribution. We implement this idea as a flow-balanced optimization method that promotes diverse exploration and generalizable reasoning trajectories. We conduct experiments on both math and code reasoning tasks: FlowRL achieves a significant average improvement of 10.0% over GRPO and 5.1% over PPO on math benchmarks, and performs consistently better on code reasoning tasks. These results highlight reward distribution matching as a key step toward efficient exploration and diverse reasoning in LLM reinforcement learning.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes FlowRL, a method that matches reward distributions via flow balancing rather than maximizing scalar rewards in LLM reinforcement learning. It resides in the Flow-Based and Distribution-Matching Methods leaf, which contains only three papers including this one. This is a relatively sparse research direction within the broader taxonomy of 50 papers across nine major branches, suggesting that distribution-matching approaches remain less explored compared to traditional reward maximization methods that dominate neighboring branches.

The taxonomy reveals that FlowRL sits within Reward Distribution Modeling and Optimization, adjacent to Reward Maximization Approaches containing outcome-based RL and policy optimization algorithms. The sibling papers in the same leaf (Distribution Matching Policy and one other) share the conceptual foundation of treating reasoning as a probabilistic flow problem. Neighboring branches like Process and Dense Reward Models and Self-Rewarding frameworks pursue different supervision strategies—step-level feedback versus self-generated rewards—highlighting how FlowRL's distributional objective diverges from both sparse outcome signals and dense process supervision paradigms.

Among 18 candidates examined across three contributions, the FlowRL algorithm contribution shows 2 refutable candidates out of 10 examined, while the theoretical equivalence between KL minimization and trajectory balance shows 4 refutable candidates out of 7 examined. The length normalization contribution examined only 1 candidate with no refutations. These statistics indicate that the core algorithmic and theoretical contributions face more substantial prior work overlap within the limited search scope, while the technical implementation details appear less contested. The search scale of 18 candidates suggests this analysis captures prominent related work but may not be exhaustive.

Based on the limited literature search of 18 candidates, FlowRL appears to occupy a relatively sparse research direction with meaningful but not overwhelming prior work overlap. The taxonomy structure confirms that distribution-matching methods remain a minority approach compared to scalar reward maximization, though the contribution-level statistics reveal that specific technical elements—particularly the flow balancing formulation and KL-trajectory balance equivalence—have notable precedents among the examined candidates. A broader search might uncover additional related work in adjacent optimization or probabilistic inference communities.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
18 Contribution Candidate Papers Compared
6 Refutable Papers

Research Landscape Overview

Core task: Matching reward distributions for large language model reasoning. The field addresses how to design, learn, and optimize reward signals that guide LLMs toward improved reasoning capabilities.

The taxonomy reveals a rich structure spanning nine major branches. Reward Distribution Modeling and Optimization explores flow-based and distribution-matching techniques, exemplified by FlowRL[0] and Distribution Matching Policy[31], which aim to align policy outputs with target reward distributions rather than simply maximizing scalar rewards. Reward Maximization Approaches and Process and Dense Reward Models focus on outcome-based versus step-level feedback, with works like Outcome Reward Limit[5] and Dense Reasoning Reward[28] investigating the trade-offs between coarse and fine-grained supervision. Self-Rewarding and Inverse RL Frameworks, including Self-rewarding[1] and Meta-rewarding[14], enable models to generate their own training signals, while Open-Ended and Domain-Specific Reasoning branches address generalization across tasks. Additional branches cover theoretical foundations, multi-agent systems, inference techniques, and practical alignment considerations, forming a comprehensive landscape of reward-driven reasoning research.

A particularly active line of work contrasts distribution-matching methods with traditional reward maximization. While many studies pursue direct optimization of scalar rewards, a smaller cluster emphasizes matching entire distributions to avoid reward collapse and mode-seeking behavior, as highlighted by Cal-DPO[16] and FlowRL[0]. FlowRL[0] sits squarely within the Flow-Based and Distribution-Matching Methods branch, sharing conceptual ground with Distribution Matching Policy[31] in treating reasoning as a probabilistic flow problem. Compared to outcome-focused approaches like Outcome Reward Limit[5], FlowRL[0] emphasizes aligning the generative process itself rather than merely optimizing terminal rewards.
This distinction reflects broader tensions in the field: whether to rely on sparse outcome signals, dense process supervision, or distributional objectives that preserve diversity. Open questions remain about scalability, sample efficiency, and how these distribution-matching techniques interact with self-rewarding frameworks and inference-time search methods.

Claimed Contributions

FlowRL algorithm for reward distribution matching

The authors introduce FlowRL, a policy optimization algorithm that shifts from reward maximization to reward distribution matching. It transforms scalar rewards into normalized target distributions using a learnable partition function and minimizes reverse KL divergence between the policy and target distribution, promoting diverse exploration and generalizable reasoning trajectories.

10 retrieved papers
Can Refute
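The shift from reward maximization to distribution matching can be illustrated with a minimal numeric sketch. The snippet below is our illustration, not the paper's exact formulation: the temperature `beta`, the name `log_z`, and the squared-error surrogate are assumptions. It treats the target distribution as p(y|x) ∝ exp(β·r(x,y)) with a learnable log-partition estimate, and penalizes the policy when log π(y|x) + log Z deviates from β·r(x,y).

```python
import math

def flowrl_style_loss(log_pi, rewards, log_z, beta=1.0):
    """Squared-error surrogate for matching the policy to the
    reward-induced target distribution p(y) ∝ exp(beta * r(y)).

    log_pi  : per-trajectory policy log-probabilities log π(y|x)
    rewards : per-trajectory scalar rewards r(x, y)
    log_z   : learnable log-partition estimate log Z(x)
    """
    residuals = [log_z + lp - beta * r for lp, r in zip(log_pi, rewards)]
    return sum(d * d for d in residuals) / len(residuals)

# At the optimum, log π(y) = beta * r(y) - log Z for every trajectory,
# i.e. the policy matches the full reward distribution rather than
# concentrating on the single highest-reward mode.
log_pi = [math.log(0.5), math.log(0.3), math.log(0.2)]
rewards = [1.0, 0.6, 0.4]
loss = flowrl_style_loss(log_pi, rewards, log_z=0.0, beta=1.0)
print(round(loss, 4))
```

Note that `log_z` is trained jointly with the policy in this style of objective; here it is fixed only to keep the example self-contained.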
Theoretical equivalence between KL minimization and trajectory balance

The authors establish theoretical equivalence (Proposition 1) showing that minimizing the KL objective is equivalent to minimizing the trajectory balance loss from GFlowNets. This provides a practical surrogate for reward-guided KL minimization that can be integrated into existing RL frameworks.

7 retrieved papers
Can Refute
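The claimed equivalence admits a short sketch. The derivation below is our reconstruction from the standard GFlowNet trajectory balance condition, not the paper's exact Proposition 1; the symbols β and Z_φ follow the usual notation.

```latex
% Target distribution induced by the scalar reward:
p(y \mid x) = \frac{\exp\big(\beta\, r(x, y)\big)}{Z(x)}

% Trajectory balance loss with a learnable partition estimate Z_\phi:
\mathcal{L}_{\mathrm{TB}}(\theta, \phi)
  = \Big(\log Z_\phi(x) + \log \pi_\theta(y \mid x) - \beta\, r(x, y)\Big)^2

% At a stationary point the residual vanishes for every trajectory:
\log \pi_\theta(y \mid x) = \beta\, r(x, y) - \log Z_\phi(x)
  \;\Longleftrightarrow\; \pi_\theta(\cdot \mid x) = p(\cdot \mid x),

% which is exactly the minimizer of the reverse KL objective
\mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, p(\cdot \mid x)\big).
```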
Length normalization and importance sampling techniques

The authors develop two technical solutions for long chain-of-thought training: length normalization to prevent gradient explosion from variable-length sequences, and importance sampling to correct distribution mismatch between generated rollouts and the current policy.

1 retrieved paper
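The two techniques can be sketched as follows. This is a minimal illustration under our own assumptions (the function names and the clipping constant are ours): per-token averaging of the sequence log-probability so loss magnitude does not grow with chain-of-thought length, and a clipped per-sequence importance ratio to reweight rollouts generated by a stale policy.

```python
import math

def length_normalized_logprob(token_logprobs):
    """Average per-token log-probability: keeps gradient magnitudes
    comparable across short and very long chains of thought."""
    return sum(token_logprobs) / len(token_logprobs)

def importance_weight(logp_current, logp_behavior, clip=10.0):
    """Per-sequence ratio π_current(y|x) / π_behavior(y|x), clipped to
    bound the variance introduced by off-policy rollouts."""
    ratio = math.exp(logp_current - logp_behavior)
    return min(ratio, clip)

# A 1000-token rollout and a 10-token rollout yield the same scale:
long_seq = [-0.5] * 1000
short_seq = [-0.5] * 10
assert length_normalized_logprob(long_seq) == length_normalized_logprob(short_seq)

# A stale rollout that is slightly less likely under the current
# policy is down-weighted rather than discarded:
w = importance_weight(logp_current=-12.0, logp_behavior=-11.5)
print(round(w, 4))
```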

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

FlowRL algorithm for reward distribution matching


Contribution

Theoretical equivalence between KL minimization and trajectory balance


Contribution

Length normalization and importance sampling techniques

