Translate Policy to Language: Flow Matching Generated Rewards for LLM Explanations

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: LLM, Continuous Normalizing Flow, Diffusion Model, RLAIF, Explainable AI
Abstract:

As humans increasingly share environments with diverse agents powered by RL, LLMs, and beyond, the ability to explain agent policies in natural language is vital for reliable coexistence. We introduce a general-purpose framework that trains explanation-generating LLMs via reinforcement learning from AI feedback, with distributional rewards generated by generative continuous normalizing flows (CNFs). CNFs capture the pluralistic and probabilistic nature of human judgments about explanations. Moreover, under mild assumptions, CNFs provably bound deviations from true human reward distributions when trained on noisy proxy rewards from LLMs. We design a specialized CNF architecture that selectively attends to linguistic cues in decision context and explanations when generating rewards. Human and LLM evaluators find that our method delivers explanations that enable more accurate predictions of true agent decisions, exhibit greater logical soundness and actionability, and impose lower cognitive load than explanations trained with proxy LLM rewards or state-of-the-art RLHF and RLAIF baselines.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a framework for training explanation-generating LLMs using reinforcement learning from AI feedback, with distributional rewards modeled by continuous normalizing flows. It resides in the Large Language Model-Based Explanation Frameworks leaf, which contains only three papers total, including the original work. This leaf sits within the broader Explanation Generation Approaches and Architectures branch, suggesting the paper targets a relatively sparse but emerging research direction focused on leveraging pre-trained language models for policy explanation rather than traditional symbolic or model-agnostic methods.

The taxonomy reveals neighboring approaches in sibling leaves: Model-Agnostic and Surrogate-Based Explanation uses observed behaviors without model internals, Rationale Generation via Neural Translation employs sequence-to-sequence architectures, and Symbolic and Argumentation-Based Explanation relies on formal logic frameworks. The original work diverges from these by emphasizing LLM fine-tuning with distributional reward modeling rather than translation or surrogate modeling. Its sibling papers in the same leaf address mental model alignment and recommendation-specific explanation, indicating the LLM-based explanation space spans diverse application contexts but shares common architectural foundations.

Among the three contributions analyzed, the literature search examined twenty-four candidates total. The general-purpose RLAIF framework with CNF rewards was assessed against ten candidates with zero refutations found. The theoretical guarantee on CNF deviation bounds also examined ten candidates with no clear prior work identified. The specialized CNF architecture with cross-attention reviewed four candidates, again with no refutations. These statistics reflect a limited semantic search scope rather than exhaustive coverage, suggesting the contributions may occupy relatively unexplored territory within the examined candidate set.

Based on the limited search scope of twenty-four candidates, the work appears to introduce novel technical components, particularly the CNF-based distributional reward modeling and the theoretical guarantees, that were not clearly anticipated in the examined prior work. However, the analysis does not cover the full breadth of the reinforcement learning from human feedback or reward modeling literature, leaving open the possibility that relevant work exists outside the top-K semantic matches and citation expansion that were performed.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 0

Research Landscape Overview

Core task: Generating natural language explanations for agent policy decisions. The field encompasses diverse approaches to making autonomous agent behavior interpretable through textual descriptions.

The taxonomy reveals several major branches: Explanation Generation Approaches and Architectures focuses on technical methods for producing explanations, including traditional model-agnostic techniques and newer large language model-based frameworks; Multi-Agent and Collaborative Explanation addresses scenarios where multiple agents must coordinate or justify collective decisions; Explanation Types and Modalities examines different forms explanations can take, from contrastive and causal accounts to visual and interactive formats; Application-Specific Explanation Systems tailors methods to domains like healthcare, robotics, and recommendation systems; Human Factors and Evaluation investigates how users perceive and benefit from explanations; Theoretical Foundations draws on philosophy, cognitive science, and legal theory; and LLM Agent Decision-Making explores both capabilities and limitations of language-model-driven agents.

Representative works span from early symbolic approaches to modern neural methods, with State2explanation[8] exemplifying template-based generation and Model-Agnostic Policy Explanations[7] demonstrating domain-independent techniques. Particularly active lines of work contrast traditional interpretability methods with emerging LLM-based frameworks. The former often rely on policy summarization, saliency mapping, or causal reasoning to extract explanations from trained policies, while LLM-based approaches leverage pre-trained language models to generate more fluent, contextually rich rationales.

The original paper sits squarely within the Large Language Model-Based Explanation Frameworks cluster, sharing this branch with Mental Model Alignment[6] and EdgeX-MMFRec[13].
Compared to Mental Model Alignment[6], which emphasizes aligning agent explanations with human mental models, and EdgeX-MMFRec[13], which applies LLM explanation to recommendation systems, the original work appears to focus on foundational architectures for LLM-driven explanation generation. Key open questions across these branches include balancing explanation fidelity with comprehensibility, ensuring explanations remain faithful to actual agent reasoning rather than post-hoc rationalizations, and determining which explanation modalities best serve different user needs and application contexts.

Claimed Contributions

General-purpose framework for training explanation-generating LLMs via RLAIF with CNF-generated distributional rewards

The authors propose a task-agnostic framework that trains LLMs to generate natural language explanations of agent policies (both RL agents and LLMs) using reinforcement learning. The framework uses continuous normalizing flows to generate distributional rewards that capture pluralistic and probabilistic human judgments, replacing costly human feedback with AI-generated proxy rewards.
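The sampling side of such a CNF reward model can be illustrated with a toy sketch. Below, the paper's learned, text-conditioned velocity network is replaced by a hypothetical closed-form field for a Gaussian-to-Gaussian rectified flow (the `shift` value, step count, and deterministic pairing x1 = x0 + shift are illustrative assumptions, not the paper's design). The point is the mechanism: a reward is drawn by integrating the flow ODE from a base sample, so repeated draws yield a distribution over rewards rather than a single scalar.

```python
import random

def velocity(x, t, shift=2.0):
    """Toy closed-form velocity field for a rectified flow.

    Hypothetical stand-in for a learned, text-conditioned network: with
    a standard Gaussian base, a target N(shift, 1), and the deterministic
    pairing x1 = x0 + shift, the straight-line rectified-flow velocity
    is the constant `shift` everywhere.
    """
    return shift

def sample_reward(n_steps=10, shift=2.0):
    """Draw one reward by Euler-integrating the flow ODE from t=0 to t=1."""
    x = random.gauss(0.0, 1.0)           # sample from the base distribution
    dt = 1.0 / n_steps
    t = 0.0
    for _ in range(n_steps):
        x += velocity(x, t, shift) * dt  # Euler step: dx/dt = v(x, t)
        t += dt
    return x

# Repeated draws give a reward *distribution* for one (context, explanation)
# pair; an RLAIF policy update would consume samples like these.
random.seed(0)
rewards = [sample_reward() for _ in range(1000)]
mean = sum(rewards) / len(rewards)
```

In a real system the velocity field would be a neural network conditioned on the decision context and candidate explanation, but the integration loop looks the same.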

10 retrieved papers
Theoretical guarantee that CNFs bound deviations from true human reward distributions under noisy proxy rewards

The authors provide theoretical analysis showing that when the noise in proxy LLM rewards has the same functional form as the CNF's base distribution (e.g., Gaussian), the CNF provably bounds its deviation from the underlying human reward distribution. This theoretical contribution distinguishes their approach from prior RLAIF methods that lack formal guarantees on managing proxy errors.
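The shape of such a guarantee can be written schematically. The display below is a hedged schematic only, not the paper's theorem: the symbols (the proxy reward, the noise scale, the divergence D, the training error, and the function g) are placeholders introduced here to show how the assumption and conclusion fit together.

```latex
% Schematic only; notation is introduced here, not taken from the paper.
% Assumption: proxy LLM rewards equal the true human reward plus noise
% whose law matches the CNF's Gaussian base distribution:
\tilde{r} \;=\; r^{\ast} + \varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0, \sigma^{2}).
% Conclusion (shape of the bound): the distance between the CNF's
% generated reward distribution and the human reward distribution is
% controlled by the flow-matching training error plus a noise term:
D\!\left(p_{\mathrm{CNF}},\, p_{\mathrm{human}}\right)
\;\le\; \epsilon_{\mathrm{train}} \;+\; g(\sigma).
```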

10 retrieved papers
Specialized CNF architecture with cross-attention for linguistic cue integration in reward generation

The authors develop a novel rectified flow architecture that embeds the flow model into an LLM backbone using cross-attention mechanisms. This allows the reward model to selectively integrate linguistic and contextual information from decision contexts and explanations, enabling it to understand natural language inputs and generalize to unseen negative samples.
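The core mechanism here, the flow state querying LLM hidden states, is ordinary cross-attention. The dependency-free sketch below shows single-head scaled dot-product cross-attention on tiny hand-made vectors; the queries, keys, and values are hypothetical stand-ins for the flow-state tokens and backbone activations, not the paper's actual tensors or dimensions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: flow-state tokens (list of d-dim vectors)
    keys/values: hidden states for the decision context and explanation
    (hypothetical stand-ins for LLM backbone activations).
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this flow token to each context token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        # Attention output: convex combination of the value vectors.
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# A flow-state token aligned with the first context key attends almost
# entirely to the first value vector.
queries = [[1.0, 0.0]]
keys    = [[10.0, 0.0], [0.0, 10.0]]
values  = [[1.0, 0.0], [0.0, 1.0]]
out = cross_attention(queries, keys, values)
```

In the described architecture this operation would sit inside the velocity network, letting each integration step of the flow selectively read linguistic cues from the decision context and explanation.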

4 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

General-purpose framework for training explanation-generating LLMs via RLAIF with CNF-generated distributional rewards


Contribution

Theoretical guarantee that CNFs bound deviations from true human reward distributions under noisy proxy rewards


Contribution

Specialized CNF architecture with cross-attention for linguistic cue integration in reward generation
