Imitation Learning as Return Distribution Matching

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Imitation Learning · Behavioral Cloning · Risk · Theory
Abstract:

We study the problem of training a risk-sensitive reinforcement learning (RL) agent through imitation learning (IL). Unlike standard IL, our goal is not only to train an agent that matches the expert’s expected return (i.e., its average performance) but also its risk attitude (i.e., other features of the return distribution, such as variance). We propose a general formulation of the risk-sensitive IL problem in which the objective is to match the expert’s return distribution in Wasserstein distance. We focus on the tabular setting and assume the expert’s reward is known. After demonstrating the limited expressivity of Markovian policies for this task, we introduce an efficient and sufficiently expressive subclass of non-Markovian policies tailored to it. Building on this subclass, we develop two provably efficient algorithms—RS-BC and RS-KT—for solving the problem when the transition model is unknown and known, respectively. We show that RS-KT achieves substantially lower sample complexity than RS-BC by exploiting dynamics information. We further demonstrate the sample efficiency of return distribution matching in the setting where the expert’s reward is unknown by designing an oracle-based variant of RS-KT. Finally, we complement our theoretical analysis of RS-KT and RS-BC with numerical simulations, highlighting both their sample efficiency and the advantages of non-Markovian policies over standard sample-efficient IL algorithms.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a risk-sensitive imitation learning framework that matches expert return distributions using Wasserstein distance, introducing non-Markovian policies and two algorithms (RS-BC and RS-KT) with provable sample complexity guarantees. Within the taxonomy, it occupies the sole position in the 'Non-Markovian Policy Learning for Risk-Sensitive IL' leaf under 'Return Distribution Matching for Imitation Learning'. This leaf contains only the original paper itself, indicating a relatively sparse research direction focused specifically on non-Markovian approaches to distributional matching in imitation learning.

The taxonomy reveals neighboring work in adjacent branches: 'Distributional Inverse Reinforcement Learning' focuses on recovering reward distributions rather than direct policy learning, while 'Risk-Sensitive Distributional RL' addresses return distribution modeling in standard RL settings without expert demonstrations. The 'Behavioral Cloning and Policy Pretraining' branch handles policy initialization but excludes risk-sensitive distributional objectives. The original paper bridges these areas by combining distributional matching (from RL) with imitation learning, while explicitly using non-Markovian policies to capture temporal risk dependencies that Markovian approaches cannot express.

Among 20 candidates examined across three contributions, the return distribution matching formulation shows one refutable candidate among the 10 examined, suggesting some conceptual overlap within the limited search scope. The non-Markovian policy class and algorithms show no refutable candidates, though only a single paper was examined, which provides minimal evidence either way. The unknown-reward oracle-based result examined 9 candidates with none refutable, indicating potential novelty in this specific technical direction. These statistics reflect a constrained semantic search rather than comprehensive field coverage, leaving open questions about broader prior work in distributional imitation learning.

Based on the limited 20-candidate search, the work appears to occupy a relatively unexplored intersection of risk-sensitive RL and imitation learning, particularly in its non-Markovian policy formulation. The sparse taxonomy structure around this leaf and the modest refutation rate suggest potential novelty, though the small search scope means substantial related work may exist outside the examined candidates. A more exhaustive literature review would be needed to definitively assess originality across the broader distributional RL and imitation learning communities.

Taxonomy

- Core-task Taxonomy Papers: 5
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 20
- Refutable Papers: 1

Research Landscape Overview

Core task: risk-sensitive imitation learning through return distribution matching. The field addresses how agents can learn from demonstrations while accounting for uncertainty and risk preferences, moving beyond traditional imitation learning that focuses solely on expected performance. The taxonomy reveals four main branches: Distributional Inverse Reinforcement Learning seeks to recover reward distributions from expert behavior, enabling richer characterizations of expert preferences; Risk-Sensitive Distributional Reinforcement Learning develops algorithms that optimize over return distributions rather than expected returns; Return Distribution Matching for Imitation Learning directly aligns learner and expert return distributions; and Behavioral Cloning and Policy Pretraining provides foundational methods for policy initialization. These branches collectively span the spectrum from reward inference to direct policy learning, with varying degrees of model assumptions and risk awareness.

Recent work has explored several contrasting approaches to incorporating distributional information. Distributional IRL [1] focuses on inferring entire reward distributions from demonstrations, while methods like Posterior Behavioral Cloning [3] emphasize conditioning policies on desired outcomes or return levels. The original paper, Return Distribution Matching [0], sits within the direct distribution-matching branch, emphasizing non-Markovian policy structures that can capture richer temporal dependencies in risk-sensitive settings. This contrasts with approaches like Model Risk-sensitive Offline [4] that rely on learned dynamics models, and differs from Posterior Behavioral Cloning [3] by matching full return distributions rather than conditioning on specific return targets.
A key open question across these directions is how to balance expressiveness in capturing risk preferences against sample efficiency and computational tractability, particularly when expert demonstrations are limited or noisy.

Claimed Contributions

Return distribution matching formulation for risk-sensitive imitation learning

The authors propose a general formulation of risk-sensitive imitation learning where the objective is to match the entire expert return distribution in Wasserstein distance, rather than only matching expected return or a single risk measure like CVaR. This formulation captures the expert's full risk attitude encoded in the return distribution shape.

Retrieved papers: 10 (one refutable candidate)
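As a concrete illustration of this objective (a sketch, not the paper's algorithm): for two empirical return distributions with equally many samples, the 1-Wasserstein distance reduces to the mean absolute difference of the sorted samples. The toy numbers below are hypothetical and show how two agents with identical expected return can still differ sharply in risk.

```python
def wasserstein_1(returns_a, returns_b):
    """1-Wasserstein distance between two equal-size empirical distributions:
    mean absolute difference of the sorted samples."""
    assert len(returns_a) == len(returns_b)
    a, b = sorted(returns_a), sorted(returns_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Hypothetical return samples: same mean (5.0), different variance.
expert = [4.0, 5.0, 5.0, 6.0]   # low-variance expert returns
learner = [1.0, 5.0, 5.0, 9.0]  # high-variance learner returns

print(wasserstein_1(expert, expert))   # 0.0: identical distributions
print(wasserstein_1(expert, learner))  # 1.5: same mean, different spread
```

An expected-return objective would score these two agents identically; the Wasserstein objective separates them, which is exactly the risk information the formulation is designed to preserve.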
Efficient non-Markovian policy class and two provably efficient algorithms

The authors introduce a parameterized subclass of non-Markovian policies that balances expressivity and efficiency for return distribution matching. Building on this class, they develop RS-BC for the unknown transition setting and RS-KT for the known transition setting, both with provable sample complexity guarantees.

Retrieved papers: 1 (none refutable)
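The paper's exact parameterized subclass is not reproduced in this report. As a hedged illustration of why non-Markovian structure matters for this task, a common construction conditions the action on the reward accumulated so far, not the state alone, so the policy can react to how the return is unfolding. All names and numbers below are hypothetical.

```python
import random

class ReturnConditionedPolicy:
    """A simple non-Markovian policy: actions are keyed by
    (state, timestep, accumulated reward) rather than state alone."""

    def __init__(self, action_probs):
        # action_probs: dict mapping (state, t, acc_reward) -> {action: prob}
        self.action_probs = action_probs

    def act(self, state, t, acc_reward):
        probs = self.action_probs[(state, t, acc_reward)]
        actions, weights = zip(*probs.items())
        return random.choices(actions, weights=weights)[0]

# Hypothetical 2-step example: the same state "s1" at time 1 gets different
# actions depending on the first-step reward, which a Markovian policy
# (a function of "s1" alone) cannot express.
policy = ReturnConditionedPolicy({
    ("s0", 0, 0): {"risky": 0.5, "safe": 0.5},
    ("s1", 1, 1): {"safe": 1.0},    # good start: lock in the return
    ("s1", 1, 0): {"risky": 1.0},   # bad start: gamble to recover
})
```

The two `("s1", 1, ·)` entries are the point: matching a return distribution can require different behavior after different partial returns, and augmenting the conditioning set this way keeps the policy class tractable while restoring that expressivity.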
Sample efficiency result for unknown-reward setting with oracle-based algorithm

The authors demonstrate that the robust return distribution matching problem with unknown expert reward remains statistically tractable when the transition model is known. They prove that a polynomial number of expert demonstrations suffices to accurately estimate the expert's return distribution under any reward, enabling an oracle-based solution approach.

Retrieved papers: 9 (none refutable)
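A minimal sketch of the estimation idea behind this result, under illustrative assumptions (function and variable names below are not from the paper): each demonstration yields one return sample under any candidate reward function, and the empirical distribution of those samples estimates the expert's return distribution, with standard concentration arguments giving the polynomial sample bound.

```python
from collections import Counter

def empirical_return_distribution(trajectories, reward_fn):
    """Map each trajectory (a list of (state, action) pairs) to its return
    under reward_fn; return the empirical distribution {return: probability}."""
    returns = [
        sum(reward_fn(s, a) for s, a in trajectory)
        for trajectory in trajectories
    ]
    counts = Counter(returns)
    n = len(returns)
    return {g: c / n for g, c in counts.items()}

# Hypothetical expert demonstrations in a tiny 2-step MDP.
demos = [
    [("s0", "go"), ("s1", "stop")],
    [("s0", "go"), ("s2", "stop")],
    [("s0", "go"), ("s1", "stop")],
    [("s0", "go"), ("s1", "stop")],
]
# A candidate reward: 1 for reaching s1, 0 elsewhere.
reward = lambda s, a: {"s0": 0.0, "s1": 1.0, "s2": 0.0}[s]

print(empirical_return_distribution(demos, reward))  # {1.0: 0.75, 0.0: 0.25}
```

The same demonstration set can be re-scored under every candidate reward, which is what makes the oracle-based approach to the unknown-reward setting statistically cheap once the transition model is known.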

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In the retrieved landscape it appears structurally isolated, a partial signal of novelty that is nonetheless constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

The per-contribution comparison entries repeat the contribution descriptions given under Claimed Contributions above; the three contributions compared are:

- Return distribution matching formulation for risk-sensitive imitation learning
- Efficient non-Markovian policy class and two provably efficient algorithms
- Sample efficiency result for unknown-reward setting with oracle-based algorithm