Imitation Learning as Return Distribution Matching
Overview
Overall Novelty Assessment
The paper proposes a risk-sensitive imitation learning framework that matches expert return distributions in Wasserstein distance, introducing non-Markovian policies and two algorithms (RS-BC and RS-KT) with provable sample complexity guarantees. Within the taxonomy, it is the sole occupant of the 'Non-Markovian Policy Learning for Risk-Sensitive IL' leaf under 'Return Distribution Matching for Imitation Learning', indicating a sparse research direction focused specifically on non-Markovian approaches to distributional matching in imitation learning.
The taxonomy reveals neighboring work in adjacent branches: 'Distributional Inverse Reinforcement Learning' focuses on recovering reward distributions rather than learning policies directly, while 'Risk-Sensitive Distributional RL' addresses return distribution modeling in standard RL settings without expert demonstrations. The 'Behavioral Cloning and Policy Pretraining' branch handles policy initialization but excludes risk-sensitive distributional objectives. The original paper bridges these areas by combining distributional matching (from distributional RL) with imitation learning, using non-Markovian policies to capture temporal risk dependencies that Markovian policies cannot express.
Among the 20 candidates examined across the three contributions, the return distribution matching formulation had one refutable candidate among the 10 examined, suggesting some conceptual overlap within the limited search scope. The non-Markovian policy class and algorithms had no refutable candidates, but only a single candidate was examined, so this provides minimal evidence either way. The unknown-reward oracle-based result was checked against 9 candidates with none refutable, indicating potential novelty in this specific technical direction. These statistics reflect a constrained semantic search rather than comprehensive field coverage, leaving open questions about broader prior work in distributional imitation learning.
Based on the limited 20-candidate search, the work appears to occupy a relatively unexplored intersection of risk-sensitive RL and imitation learning, particularly in its non-Markovian policy formulation. The sparse taxonomy structure around this leaf and the low refutation rate (one candidate in twenty) suggest potential novelty, though the narrow search scope means substantial related work may exist outside the examined candidates. A more exhaustive literature review would be needed to definitively assess originality across the broader distributional RL and imitation learning communities.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose a general formulation of risk-sensitive imitation learning where the objective is to match the entire expert return distribution in Wasserstein distance, rather than only matching expected return or a single risk measure like CVaR. This formulation captures the expert's full risk attitude encoded in the return distribution shape.
The authors introduce a parameterized subclass of non-Markovian policies that balances expressivity and efficiency for return distribution matching. Building on this class, they develop RS-BC for the unknown transition setting and RS-KT for the known transition setting, both with provable sample complexity guarantees.
The authors demonstrate that the robust return distribution matching problem with unknown expert reward remains statistically tractable when the transition model is known. They prove that a polynomial number of expert demonstrations suffices to accurately estimate the expert's return distribution under any reward, enabling an oracle-based solution approach.
Core Task Comparisons
Comparisons with papers in the same taxonomy category. Because the paper's taxonomy leaf contains no other entries, there are no same-category papers to compare against.
Contribution Analysis
Detailed comparisons for each claimed contribution
Return distribution matching formulation for risk-sensitive imitation learning
The authors propose a general formulation of risk-sensitive imitation learning where the objective is to match the entire expert return distribution in Wasserstein distance, rather than only matching expected return or a single risk measure like CVaR. This formulation captures the expert's full risk attitude encoded in the return distribution shape.
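To make the formulation concrete, a schematic rendering of the objective follows; the notation (return distribution $\eta$, policy class $\Pi$, horizon $H$, reward $r$) is assumed here for illustration rather than quoted from the paper:

\[
\min_{\pi \in \Pi} \; W_1\!\left(\eta^{\pi}, \eta^{E}\right),
\qquad
\eta^{\pi} = \mathrm{Law}\!\Big(\textstyle\sum_{t=1}^{H} r(s_t, a_t)\Big) \ \text{under } \pi,
\]

where $\eta^{E}$ is the expert's return distribution and $W_1$ the 1-Wasserstein distance. Since the mean and Lipschitz risk functionals (e.g., CVaR$_\alpha$, up to a $1/\alpha$ factor) are controlled by $W_1$, driving $W_1(\eta^{\pi}, \eta^{E})$ to zero matches all such statistics simultaneously, which is the sense in which this objective subsumes single-risk-measure formulations. The candidate papers examined against this contribution were: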
[23] Risk-Sensitive Generative Adversarial Imitation Learning
[17] Adapt Your Body: Mitigating Proprioception Shifts in Imitation Learning
[18] Optimal Transport for Offline Imitation Learning
[19] HumanMimic: Learning Natural Locomotion and Transitions for Humanoid Robot via Wasserstein Adversarial Imitation
[20] Align Your Intents: Offline Imitation Learning via Optimal Transport
[21] Imitation Learning from Observation through Optimal Transport
[22] Cross-Domain Imitation Learning via Optimal Transport
[24] Wasserstein Adversarial Imitation Learning
[25] Primal Wasserstein Imitation Learning
[26] Inverse Reinforcement Learning via Matching of Optimality Profiles
Efficient non-Markovian policy class and two provably efficient algorithms
The authors introduce a parameterized subclass of non-Markovian policies that balances expressivity and efficiency for return distribution matching. Building on this class, they develop RS-BC for the unknown transition setting and RS-KT for the known transition setting, both with provable sample complexity guarantees.
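As an illustration of why such a policy class can be both expressive and compact, the sketch below conditions actions on the timestep and a discretized running return instead of the full history. This is a hypothetical parameterization for intuition only, not the paper's RS-BC/RS-KT construction; the class name, binning scheme, and all parameters are assumptions.

```python
import numpy as np

# Illustrative sketch, not the paper's parameterization: the assumption is
# that for return distribution matching, a policy can condition on a compact
# history summary -- the timestep and the (discretized) reward accumulated
# so far -- rather than the full trajectory, trading expressivity for
# sample efficiency.

class NonMarkovianPolicy:
    """Tabular policy indexed by (timestep, cumulative-reward bin, state)."""

    def __init__(self, n_states, n_actions, horizon, n_reward_bins, seed=0):
        self.n_actions = n_actions
        self.n_reward_bins = n_reward_bins
        self.rng = np.random.default_rng(seed)
        # One categorical action distribution per (t, reward bin, state),
        # initialized uniform; learning would update these probabilities.
        self.probs = np.full(
            (horizon, n_reward_bins, n_states, n_actions), 1.0 / n_actions
        )

    def _bin(self, cum_reward, max_return):
        # Discretize the running return into one of n_reward_bins bins.
        frac = np.clip(cum_reward / max(max_return, 1e-8), 0.0, 1.0)
        return min(int(frac * self.n_reward_bins), self.n_reward_bins - 1)

    def act(self, t, state, cum_reward, max_return):
        p = self.probs[t, self._bin(cum_reward, max_return), state]
        return self.rng.choice(self.n_actions, p=p)

# Usage: the agent passes in the reward accumulated so far, so identical
# states can yield different action distributions depending on how the
# trajectory has gone -- something no Markovian policy can express.
policy = NonMarkovianPolicy(n_states=5, n_actions=2, horizon=10, n_reward_bins=4)
a = policy.act(t=3, state=2, cum_reward=1.5, max_return=10.0)
```

The running return is a natural summary statistic for this objective: two trajectories reaching the same state with different accumulated reward call for different continuations if the goal is to shape the final return distribution, which is precisely the dependence a Markovian policy cannot capture. The single candidate examined against this contribution was: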
[6] On the Occupancy Measure of Non-Markovian Policies in Continuous MDPs
Sample efficiency result for the unknown-reward setting with an oracle-based algorithm
The authors demonstrate that the robust return distribution matching problem with unknown expert reward remains statistically tractable when the transition model is known. They prove that a polynomial number of expert demonstrations suffices to accurately estimate the expert's return distribution under any reward, enabling an oracle-based solution approach.
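The flavor of the guarantee can be sketched as follows; the reward class $\mathcal{R}$ and the exact polynomial dependence are assumptions of this rendering, not quoted from the paper. Given $N$ expert demonstrations, one asks that the empirical expert return distribution be accurate uniformly over candidate rewards:

\[
N = \mathrm{poly}\!\left(H, \tfrac{1}{\varepsilon}, \log\tfrac{1}{\delta}\right)
\quad \Longrightarrow \quad
\Pr\!\Big[\, \sup_{r \in \mathcal{R}} W_1\big(\hat{\eta}^{E}_{r}, \eta^{E}_{r}\big) \le \varepsilon \,\Big] \ge 1 - \delta,
\]

where $\eta^{E}_{r}$ is the expert's return distribution under reward $r$ and $\hat{\eta}^{E}_{r}$ its estimate from the demonstrations. With the transition model known, an oracle that solves the known-transition matching problem for a given target distribution can then be invoked on the estimate, which is the sense in which the unknown-reward problem remains statistically tractable.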