Translate Policy to Language: Flow Matching Generated Rewards for LLM Explanations

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: LLM, Continuous Normalizing Flow, Diffusion Model, RLAIF, Explainable AI
Abstract:

As humans increasingly share environments with diverse agents powered by RL, LLMs, and beyond, the ability to explain agent policies in natural language is vital for reliable coexistence. We introduce a general-purpose framework that trains explanation-generating LLMs via reinforcement learning from AI feedback, with distributional rewards generated by generative continuous normalizing flows (CNFs). CNFs capture the pluralistic and probabilistic nature of human judgments about explanations. Moreover, under mild assumptions, CNFs provably bound deviations from true human reward distributions when trained on noisy proxy rewards from LLMs. We design a specialized CNF architecture that selectively attends to linguistic cues in decision context and explanations when generating rewards. Human and LLM evaluators find that our method delivers explanations that enable more accurate predictions of true agent decisions, exhibit greater logical soundness and actionability, and impose lower cognitive load than explanations trained with proxy LLM rewards or state-of-the-art RLHF and RLAIF baselines.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a framework for training explanation-generating LLMs using reinforcement learning from AI feedback, with distributional rewards modeled by continuous normalizing flows. It resides in the Large Language Model-Based Explanation Frameworks leaf, which contains only three papers total, including the original work. This leaf sits within the broader Explanation Generation Approaches and Architectures branch, suggesting the paper targets a relatively sparse but emerging research direction focused on leveraging pre-trained language models for policy explanation rather than traditional symbolic or model-agnostic methods.

The taxonomy reveals neighboring approaches in sibling leaves: Model-Agnostic and Surrogate-Based Explanation uses observed behaviors without model internals, Rationale Generation via Neural Translation employs sequence-to-sequence architectures, and Symbolic and Argumentation-Based Explanation relies on formal logic frameworks. The original work diverges from these by emphasizing LLM fine-tuning with distributional reward modeling rather than translation or surrogate modeling. Its sibling papers in the same leaf address mental model alignment and recommendation-specific explanation, indicating the LLM-based explanation space spans diverse application contexts but shares common architectural foundations.

Among the three contributions analyzed, the literature search examined twenty-four candidates total. The general-purpose RLAIF framework with CNF rewards was assessed against ten candidates with zero refutations found. The theoretical guarantee on CNF deviation bounds also examined ten candidates with no clear prior work identified. The specialized CNF architecture with cross-attention reviewed four candidates, again with no refutations. These statistics reflect a limited semantic search scope rather than exhaustive coverage, suggesting the contributions may occupy relatively unexplored territory within the examined candidate set.

Based on the limited search scope of twenty-four candidates, the work appears to introduce novel technical components, particularly the CNF-based distributional reward modeling and the theoretical guarantees, that were not clearly anticipated in the examined prior work. However, the analysis does not cover the full breadth of the reinforcement learning from human feedback or reward modeling literature, leaving open the possibility that relevant work exists outside the top-K semantic matches and citation expansion that were performed.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 24
Refutable Papers: 0

Research Landscape Overview

Core task: Generating natural language explanations for agent policy decisions. The field encompasses diverse approaches to making autonomous agent behavior interpretable through textual descriptions.

The taxonomy reveals several major branches: Explanation Generation Approaches and Architectures focuses on technical methods for producing explanations, including traditional model-agnostic techniques and newer large language model-based frameworks; Multi-Agent and Collaborative Explanation addresses scenarios where multiple agents must coordinate or justify collective decisions; Explanation Types and Modalities examines different forms explanations can take, from contrastive and causal accounts to visual and interactive formats; Application-Specific Explanation Systems tailors methods to domains like healthcare, robotics, and recommendation systems; Human Factors and Evaluation investigates how users perceive and benefit from explanations; Theoretical Foundations draws on philosophy, cognitive science, and legal theory; and LLM Agent Decision-Making explores both capabilities and limitations of language-model-driven agents.

Representative works span from early symbolic approaches to modern neural methods, with State2explanation[8] exemplifying template-based generation and Model-Agnostic Policy Explanations[7] demonstrating domain-independent techniques. Particularly active lines of work contrast traditional interpretability methods with emerging LLM-based frameworks. The former often rely on policy summarization, saliency mapping, or causal reasoning to extract explanations from trained policies, while LLM-based approaches leverage pre-trained language models to generate more fluent, contextually rich rationales.

The original paper sits squarely within the Large Language Model-Based Explanation Frameworks cluster, sharing this branch with Mental Model Alignment[6] and EdgeX-MMFRec[13].
Compared to Mental Model Alignment[6], which emphasizes aligning agent explanations with human mental models, and EdgeX-MMFRec[13], which applies LLM explanation to recommendation systems, the original work appears to focus on foundational architectures for LLM-driven explanation generation. Key open questions across these branches include balancing explanation fidelity with comprehensibility, ensuring explanations remain faithful to actual agent reasoning rather than post-hoc rationalizations, and determining which explanation modalities best serve different user needs and application contexts.

Claimed Contributions

General-purpose framework for training explanation-generating LLMs via RLAIF with CNF-generated distributional rewards

The authors propose a task-agnostic framework that trains LLMs to generate natural language explanations of agent policies (both RL agents and LLMs) using reinforcement learning. The framework uses continuous normalizing flows to generate distributional rewards that capture pluralistic and probabilistic human judgments, replacing costly human feedback with AI-generated proxy rewards.
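The sampling side of such a CNF reward model can be illustrated with a toy sketch. Below, the paper's learned, text-conditioned velocity network is replaced by a hypothetical closed-form field for a Gaussian-to-Gaussian rectified flow (the `shift` value, step count, and deterministic pairing x1 = x0 + shift are illustrative assumptions, not the paper's design). The point is the mechanism: a reward is drawn by integrating the flow ODE from a base sample, so repeated draws yield a distribution over rewards rather than a single scalar.

```python
import random

def velocity(x, t, shift=2.0):
    """Toy closed-form velocity field for a rectified flow.

    Hypothetical stand-in for a learned, text-conditioned network: with
    a standard Gaussian base, a target N(shift, 1), and the deterministic
    pairing x1 = x0 + shift, the straight-line rectified-flow velocity
    is the constant `shift` everywhere.
    """
    return shift

def sample_reward(n_steps=10, shift=2.0):
    """Draw one reward by Euler-integrating the flow ODE from t=0 to t=1."""
    x = random.gauss(0.0, 1.0)           # sample from the base distribution
    dt = 1.0 / n_steps
    t = 0.0
    for _ in range(n_steps):
        x += velocity(x, t, shift) * dt  # Euler step: dx/dt = v(x, t)
        t += dt
    return x

# Repeated draws give a reward *distribution* for one (context, explanation)
# pair; an RLAIF policy update would consume samples like these.
random.seed(0)
rewards = [sample_reward() for _ in range(1000)]
mean = sum(rewards) / len(rewards)
```

In a real system the velocity field would be a neural network conditioned on the decision context and candidate explanation, but the integration loop looks the same.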

10 retrieved papers
Theoretical guarantee that CNFs bound deviations from true human reward distributions under noisy proxy rewards

The authors provide theoretical analysis showing that when the noise in proxy LLM rewards has the same functional form as the CNF's base distribution (e.g., Gaussian), the CNF provably bounds its deviation from the underlying human reward distribution. This theoretical contribution distinguishes their approach from prior RLAIF methods that lack formal guarantees on managing proxy errors.
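The shape of such a guarantee can be written schematically. The display below is a hedged schematic only, not the paper's theorem: the symbols (the proxy reward, the noise scale, the divergence D, the training error, and the function g) are placeholders introduced here to show how the assumption and conclusion fit together.

```latex
% Schematic only; notation is introduced here, not taken from the paper.
% Assumption: proxy LLM rewards equal the true human reward plus noise
% whose law matches the CNF's Gaussian base distribution:
\tilde{r} \;=\; r^{\ast} + \varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0, \sigma^{2}).
% Conclusion (shape of the bound): the distance between the CNF's
% generated reward distribution and the human reward distribution is
% controlled by the flow-matching training error plus a noise term:
D\!\left(p_{\mathrm{CNF}},\, p_{\mathrm{human}}\right)
\;\le\; \epsilon_{\mathrm{train}} \;+\; g(\sigma).
```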

10 retrieved papers
Specialized CNF architecture with cross-attention for linguistic cue integration in reward generation

The authors develop a novel rectified flow architecture that embeds the flow model into an LLM backbone using cross-attention mechanisms. This allows the reward model to selectively integrate linguistic and contextual information from decision contexts and explanations, enabling it to understand natural language inputs and generalize to unseen negative samples.
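The core mechanism here, the flow state querying LLM hidden states, is ordinary cross-attention. The dependency-free sketch below shows single-head scaled dot-product cross-attention on tiny hand-made vectors; the queries, keys, and values are hypothetical stand-ins for the flow-state tokens and backbone activations, not the paper's actual tensors or dimensions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: flow-state tokens (list of d-dim vectors)
    keys/values: hidden states for the decision context and explanation
    (hypothetical stand-ins for LLM backbone activations).
    """
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of this flow token to each context token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        # Attention output: convex combination of the value vectors.
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

# A flow-state token aligned with the first context key attends almost
# entirely to the first value vector.
queries = [[1.0, 0.0]]
keys    = [[10.0, 0.0], [0.0, 10.0]]
values  = [[1.0, 0.0], [0.0, 1.0]]
out = cross_attention(queries, keys, values)
```

In the described architecture this operation would sit inside the velocity network, letting each integration step of the flow selectively read linguistic cues from the decision context and explanation.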

4 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

General-purpose framework for training explanation-generating LLMs via RLAIF with CNF-generated distributional rewards


Contribution

Theoretical guarantee that CNFs bound deviations from true human reward distributions under noisy proxy rewards


Contribution

Specialized CNF architecture with cross-attention for linguistic cue integration in reward generation
