Imitation Learning as Return Distribution Matching

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Imitation Learning · Behavioral Cloning · Risk · Theory
Abstract:

We study the problem of training a risk-sensitive reinforcement learning (RL) agent through imitation learning (IL). Unlike standard IL, our goal is not only to train an agent that matches the expert’s expected return (i.e., its average performance) but also its risk attitude (i.e., other features of the return distribution, such as variance). We propose a general formulation of the risk-sensitive IL problem in which the objective is to match the expert’s return distribution in Wasserstein distance. We focus on the tabular setting and assume the expert’s reward is known. After demonstrating the limited expressivity of Markovian policies for this task, we introduce an efficient and sufficiently expressive subclass of non-Markovian policies tailored to it. Building on this subclass, we develop two provably efficient algorithms—RS-BC and RS-KT—for solving the problem when the transition model is unknown and known, respectively. We show that RS-KT achieves substantially lower sample complexity than RS-BC by exploiting dynamics information. We further demonstrate the sample efficiency of return distribution matching in the setting where the expert’s reward is unknown by designing an oracle-based variant of RS-KT. Finally, we complement our theoretical analysis of RS-KT and RS-BC with numerical simulations, highlighting both their sample efficiency and the advantages of non-Markovian policies over standard sample-efficient IL algorithms.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a risk-sensitive imitation learning framework that matches expert return distributions using Wasserstein distance, introducing non-Markovian policies and two algorithms (RS-BC and RS-KT) with provable sample complexity guarantees. Within the taxonomy, it occupies the sole position in the 'Non-Markovian Policy Learning for Risk-Sensitive IL' leaf under 'Return Distribution Matching for Imitation Learning'. This leaf contains only the original paper itself, indicating a relatively sparse research direction focused specifically on non-Markovian approaches to distributional matching in imitation learning.

The taxonomy reveals neighboring work in adjacent branches: 'Distributional Inverse Reinforcement Learning' focuses on recovering reward distributions rather than direct policy learning, while 'Risk-Sensitive Distributional RL' addresses return distribution modeling in standard RL settings without expert demonstrations. The 'Behavioral Cloning and Policy Pretraining' branch handles policy initialization but excludes risk-sensitive distributional objectives. The original paper bridges these areas by combining distributional matching (from RL) with imitation learning, while explicitly using non-Markovian policies to capture temporal risk dependencies that Markovian approaches cannot express.

Among 20 candidates examined across three contributions, the return distribution matching formulation shows one refutable candidate among the 10 examined, suggesting some conceptual overlap within the limited search scope. The non-Markovian policy class and algorithms show no refutable candidates, though only a single paper was examined, which provides minimal evidence either way. The unknown-reward oracle-based result examined 9 candidates with none refutable, indicating potential novelty in this specific technical direction. These statistics reflect a constrained semantic search rather than comprehensive field coverage, leaving open questions about broader prior work in distributional imitation learning.

Based on the limited 20-candidate search, the work appears to occupy a relatively unexplored intersection of risk-sensitive RL and imitation learning, particularly in its non-Markovian policy formulation. The sparse taxonomy structure around this leaf and the modest refutation rate suggest potential novelty, though the small search scope means substantial related work may exist outside the examined candidates. A more exhaustive literature review would be needed to definitively assess originality across the broader distributional RL and imitation learning communities.

Taxonomy

- Core-task Taxonomy Papers: 5
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 20
- Refutable Papers: 1

Research Landscape Overview

Core task: risk-sensitive imitation learning through return distribution matching. The field addresses how agents can learn from demonstrations while accounting for uncertainty and risk preferences, moving beyond traditional imitation learning that focuses solely on expected performance. The taxonomy reveals four main branches: Distributional Inverse Reinforcement Learning seeks to recover reward distributions from expert behavior, enabling richer characterizations of expert preferences; Risk-Sensitive Distributional Reinforcement Learning develops algorithms that optimize over return distributions rather than expected returns; Return Distribution Matching for Imitation Learning directly aligns learner and expert return distributions; and Behavioral Cloning and Policy Pretraining provides foundational methods for policy initialization. These branches collectively span the spectrum from reward inference to direct policy learning, with varying degrees of model assumptions and risk awareness.

Recent work has explored several contrasting approaches to incorporating distributional information. Distributional IRL [1] focuses on inferring entire reward distributions from demonstrations, while methods like Posterior Behavioral Cloning [3] emphasize conditioning policies on desired outcomes or return levels. The original paper, Return Distribution Matching [0], sits within the direct distribution-matching branch, emphasizing non-Markovian policy structures that can capture richer temporal dependencies in risk-sensitive settings. This contrasts with approaches like Model Risk-sensitive Offline [4] that rely on learned dynamics models, and differs from Posterior Behavioral Cloning [3] by matching full return distributions rather than conditioning on specific return targets.
A key open question across these directions is how to balance expressiveness in capturing risk preferences against sample efficiency and computational tractability, particularly when expert demonstrations are limited or noisy.

Claimed Contributions

Return distribution matching formulation for risk-sensitive imitation learning

The authors propose a general formulation of risk-sensitive imitation learning where the objective is to match the entire expert return distribution in Wasserstein distance, rather than only matching expected return or a single risk measure like CVaR. This formulation captures the expert's full risk attitude encoded in the return distribution shape.

Retrieved papers: 10 (one refutable candidate)
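As a concrete illustration of this objective (a sketch, not the paper's algorithm): for two empirical return distributions with equally many samples, the 1-Wasserstein distance reduces to the mean absolute difference of the sorted samples. The toy numbers below are hypothetical and show how two agents with identical expected return can still differ sharply in risk.

```python
def wasserstein_1(returns_a, returns_b):
    """1-Wasserstein distance between two equal-size empirical distributions:
    mean absolute difference of the sorted samples."""
    assert len(returns_a) == len(returns_b)
    a, b = sorted(returns_a), sorted(returns_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Hypothetical return samples: same mean (5.0), different variance.
expert = [4.0, 5.0, 5.0, 6.0]   # low-variance expert returns
learner = [1.0, 5.0, 5.0, 9.0]  # high-variance learner returns

print(wasserstein_1(expert, expert))   # 0.0: identical distributions
print(wasserstein_1(expert, learner))  # 1.5: same mean, different spread
```

An expected-return objective would score these two agents identically; the Wasserstein objective separates them, which is exactly the risk information the formulation is designed to preserve.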
Efficient non-Markovian policy class and two provably efficient algorithms

The authors introduce a parameterized subclass of non-Markovian policies that balances expressivity and efficiency for return distribution matching. Building on this class, they develop RS-BC for the unknown transition setting and RS-KT for the known transition setting, both with provable sample complexity guarantees.

Retrieved papers: 1 (none refutable)
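The paper's exact parameterized subclass is not reproduced in this report. As a hedged illustration of why non-Markovian structure matters for this task, a common construction conditions the action on the reward accumulated so far, not the state alone, so the policy can react to how the return is unfolding. All names and numbers below are hypothetical.

```python
import random

class ReturnConditionedPolicy:
    """A simple non-Markovian policy: actions are keyed by
    (state, timestep, accumulated reward) rather than state alone."""

    def __init__(self, action_probs):
        # action_probs: dict mapping (state, t, acc_reward) -> {action: prob}
        self.action_probs = action_probs

    def act(self, state, t, acc_reward):
        probs = self.action_probs[(state, t, acc_reward)]
        actions, weights = zip(*probs.items())
        return random.choices(actions, weights=weights)[0]

# Hypothetical 2-step example: the same state "s1" at time 1 gets different
# actions depending on the first-step reward, which a Markovian policy
# (a function of "s1" alone) cannot express.
policy = ReturnConditionedPolicy({
    ("s0", 0, 0): {"risky": 0.5, "safe": 0.5},
    ("s1", 1, 1): {"safe": 1.0},    # good start: lock in the return
    ("s1", 1, 0): {"risky": 1.0},   # bad start: gamble to recover
})
```

The two `("s1", 1, ·)` entries are the point: matching a return distribution can require different behavior after different partial returns, and augmenting the conditioning set this way keeps the policy class tractable while restoring that expressivity.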
Sample efficiency result for unknown-reward setting with oracle-based algorithm

The authors demonstrate that the robust return distribution matching problem with unknown expert reward remains statistically tractable when the transition model is known. They prove that a polynomial number of expert demonstrations suffices to accurately estimate the expert's return distribution under any reward, enabling an oracle-based solution approach.

Retrieved papers: 9 (none refutable)
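A minimal sketch of the estimation idea behind this result, under illustrative assumptions (function and variable names below are not from the paper): each demonstration yields one return sample under any candidate reward function, and the empirical distribution of those samples estimates the expert's return distribution, with standard concentration arguments giving the polynomial sample bound.

```python
from collections import Counter

def empirical_return_distribution(trajectories, reward_fn):
    """Map each trajectory (a list of (state, action) pairs) to its return
    under reward_fn; return the empirical distribution {return: probability}."""
    returns = [
        sum(reward_fn(s, a) for s, a in trajectory)
        for trajectory in trajectories
    ]
    counts = Counter(returns)
    n = len(returns)
    return {g: c / n for g, c in counts.items()}

# Hypothetical expert demonstrations in a tiny 2-step MDP.
demos = [
    [("s0", "go"), ("s1", "stop")],
    [("s0", "go"), ("s2", "stop")],
    [("s0", "go"), ("s1", "stop")],
    [("s0", "go"), ("s1", "stop")],
]
# A candidate reward: 1 for reaching s1, 0 elsewhere.
reward = lambda s, a: {"s0": 0.0, "s1": 1.0, "s2": 0.0}[s]

print(empirical_return_distribution(demos, reward))  # {1.0: 0.75, 0.0: 0.25}
```

The same demonstration set can be re-scored under every candidate reward, which is what makes the oracle-based approach to the unknown-reward setting statistically cheap once the transition model is known.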

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In the retrieved landscape it appears structurally isolated, a partial signal of novelty that is nonetheless constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

The per-contribution comparison entries repeat the contribution descriptions given under Claimed Contributions above; the three contributions compared are:

- Return distribution matching formulation for risk-sensitive imitation learning
- Efficient non-Markovian policy class and two provably efficient algorithms
- Sample efficiency result for unknown-reward setting with oracle-based algorithm