EXPO: Stable Reinforcement Learning with Expressive Policies

ICLR 2026 Conference Submission
Anonymous Authors
Reinforcement Learning, Imitation Learning
Abstract:

We study the problem of training and fine-tuning expressive policies with online reinforcement learning (RL) given an offline dataset. Training expressive policy classes with online RL presents a unique challenge of stable value maximization. Unlike the simpler Gaussian policies commonly used in online RL, expressive policies such as diffusion and flow-matching policies are parameterized by a long denoising chain, which hinders stable gradient propagation from actions to policy parameters when optimizing against a value function. Our key insight is that stable value maximization can be achieved by avoiding direct value optimization with the expressive policy, instead constructing an on-the-fly RL policy to maximize the Q-value. We propose Expressive Policy Optimization (EXPO), a sample-efficient online RL algorithm that maximizes value with an on-the-fly policy built from two parameterized policies -- a larger expressive base policy trained with a stable imitation-learning objective, and a lightweight Gaussian edit policy that edits the actions sampled from the base policy toward a higher-value distribution. The on-the-fly policy refines the actions from the base policy with the learned edit policy and chooses the value-maximizing action among the base and edited actions for both sampling and temporal-difference (TD) backups. Our approach yields up to a 2-3x average improvement in sample efficiency over prior methods, both when fine-tuning a pretrained policy given offline data and when leveraging offline data for online training.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes EXPO, an algorithm for training expressive policies (specifically diffusion and flow-matching models) in online reinforcement learning settings. It resides in the 'Online Diffusion Policy Training and Optimization' leaf, which contains five papers total including the original work. This leaf sits within the broader 'Diffusion-Based Policy Learning' branch, which itself is part of 'Expressive Policy Architectures and Representations'. The taxonomy reveals this is a moderately populated research direction: the immediate leaf has four sibling papers, and the parent branch includes an additional three papers on offline and theoretical diffusion policy work, indicating sustained but not overwhelming activity in diffusion-based RL methods.

The taxonomy structure shows that EXPO's leaf is adjacent to 'Offline and Theoretical Diffusion Policy Work' (three papers) and sits alongside other expressive policy branches including 'Flow-Based and Generative Policy Methods' (two papers) and 'Maximum Entropy and Multimodal Policy Learning' (two papers). The scope note for EXPO's leaf emphasizes 'stable online RL training of diffusion policies addressing gradient pathologies and sample efficiency', while explicitly excluding offline-only methods. This boundary clarifies that EXPO targets the specific challenge of interactive learning with diffusion models, distinguishing it from offline behavioral cloning approaches and from flow-matching methods that use different generative architectures.

Among the three contributions analyzed, the literature search examined twenty-eight candidate papers total. For the core EXPO algorithm contribution, nine candidates were examined with zero refutable matches. The on-the-fly policy parameterization examined nine candidates (zero refutable), and the edit policy with distance constraint examined ten candidates (zero refutable). These statistics indicate that within the limited search scope—roughly thirty semantically similar papers—no prior work was identified that clearly overlaps with EXPO's specific combination of techniques. The absence of refutable candidates across all three contributions suggests the approach may occupy a relatively unexplored niche, though the search scale means this assessment is necessarily provisional.

Based on the limited literature search of approximately thirty candidates, EXPO appears to introduce a novel combination of mechanisms for online diffusion policy training. The taxonomy context reveals this work sits in an active but not saturated research area, with roughly a dozen papers across diffusion-based policy learning. The zero-refutation finding across all contributions should be interpreted cautiously given the search scope: it indicates no obvious prior work among top semantic matches, but does not constitute exhaustive verification of novelty across the entire field.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 0

Research Landscape Overview

Core task: online reinforcement learning with expressive policy classes. The field encompasses a broad spectrum of approaches for learning sequential decision-making policies that can represent complex behaviors while training directly from interaction.

The taxonomy organizes this landscape into several major branches: Expressive Policy Architectures and Representations explores rich function approximators such as diffusion models and flow-based policies; Policy Optimization Algorithms and Training Frameworks addresses the core algorithmic machinery for updating these policies; State Representation and Observation Learning tackles how agents encode and process environmental information; Multi-Agent and Interactive Decision Making considers settings with multiple learners or dynamic environments; and branches like Exploration and Sample Efficiency, Robustness and Adaptation, and Human-Centered Learning address complementary challenges in data collection, generalization, and alignment with human preferences. Application-Specific Decision Making and Hierarchical Policy Representations further refine methods for particular domains or structured action spaces, while Offline Reinforcement Learning provides a contrasting paradigm that learns from fixed datasets rather than online interaction.

Within the Expressive Policy Architectures branch, diffusion-based policy learning has emerged as a particularly active area, leveraging generative modeling techniques to represent multimodal action distributions. Works such as Q-Weighted Diffusion[3] and Efficient Diffusion[28] explore how to integrate value-based guidance and computational efficiency into diffusion policy training, while MaxEnt Diffusion[2] and Discrete Diffusion[5] extend these ideas to maximum-entropy frameworks and discrete action spaces.
EXPO[0] sits squarely in this cluster, focusing on online diffusion policy training and optimization—a setting that contrasts with many offline diffusion methods by emphasizing direct interaction and iterative policy updates. Compared to Q-Weighted Diffusion[3], which emphasizes value-function weighting for action selection, EXPO[0] appears to prioritize the online training dynamics and optimization strategies needed to make diffusion policies practical in interactive environments. This positioning highlights ongoing questions about how to balance the expressive power of diffusion models with the sample efficiency and stability requirements of online RL, a trade-off that remains central across the broader taxonomy.

Claimed Contributions

EXPO algorithm for stable online RL with expressive policies

EXPO is a novel online reinforcement learning algorithm designed to train and fine-tune expressive policy classes (such as diffusion or flow-matching policies) in a stable manner. The method avoids direct value optimization of the expressive policy by using a base policy trained with imitation learning and a lightweight Gaussian edit policy that refines actions toward higher Q-values, combined with an on-the-fly action selection mechanism.

9 retrieved papers
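As a rough illustration of the two-policy structure described above, the sketch below mocks it up in plain NumPy. The Q-function, base policy, and edit policy here are toy stand-ins (a quadratic critic and random proposals), not the paper's learned models; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_value(state, action):
    # Toy stand-in critic (quadratic bowl); the real method learns Q with TD.
    return float(-np.sum((action - 0.5) ** 2))

def base_policy(state, n=4):
    # Stand-in for the expressive base policy (e.g. a diffusion model
    # trained with an imitation objective); here, random 2-D proposals.
    return rng.normal(0.0, 1.0, size=(n, 2))

def edit_policy(state, action):
    # Stand-in Gaussian edit policy: a small stochastic perturbation.
    return action + rng.normal(0.0, 0.1, size=action.shape)

def select_action(state):
    # On-the-fly policy: pool base actions with their edited versions
    # and keep the Q-maximizing candidate.
    base_actions = base_policy(state)
    edited = np.array([edit_policy(state, a) for a in base_actions])
    candidates = np.concatenate([base_actions, edited], axis=0)
    scores = np.array([q_value(state, a) for a in candidates])
    return candidates[int(np.argmax(scores))]

state = np.zeros(3)
action = select_action(state)
```

The point of the construction is that gradients never flow through the expressive base policy's denoising chain: only the small edit policy and the critic are optimized against Q.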
On-the-fly policy parameterization for value maximization

The authors introduce an on-the-fly policy construction that performs value maximization without directly optimizing the expressive policy parameters. This policy samples actions from both the base expressive policy and an edit policy, then selects the highest Q-value action for both environment interaction and temporal-difference backup, enabling more stable and immediate reflection of Q-function changes.

9 retrieved papers
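The use of the selected action in the TD backup can be sketched as follows; the linear critic, candidate generation, and names such as `GAMMA` and `td_target` are placeholder assumptions for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
GAMMA = 0.99  # illustrative discount factor

def q_value(state, action, theta):
    # Toy linear critic over a fixed feature map; illustration only.
    feats = np.concatenate([state, action, state * action])
    return float(feats @ theta)

def on_the_fly_action(state, theta):
    # Candidates come from a stand-in base policy plus Gaussian edits;
    # the on-the-fly policy keeps the Q-maximizing one.
    base = rng.normal(0.0, 1.0, size=(4, 3))
    edited = base + rng.normal(0.0, 0.1, size=base.shape)
    candidates = np.vstack([base, edited])
    scores = [q_value(state, a, theta) for a in candidates]
    return candidates[int(np.argmax(scores))]

def td_target(reward, next_state, theta):
    # The TD backup bootstraps from the on-the-fly policy's action at the
    # next state, so critic updates are reflected in the target immediately.
    a_next = on_the_fly_action(next_state, theta)
    return reward + GAMMA * q_value(next_state, a_next, theta)

theta = rng.normal(size=9)
target = td_target(0.0, np.ones(3), theta)
```

Because the backup maximizes over fresh candidates at every step rather than relying on a slowly re-trained actor, changes in Q propagate to the target without waiting for policy-gradient updates.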
Edit policy with distance constraint for action refinement

The method introduces a small Gaussian edit policy that locally refines actions generated by the base expressive policy to maximize Q-values while maintaining proximity to the original actions through an edit distance constraint. This design allows the edit policy to solve a simpler local optimization problem, enabling efficient training with entropy regularization for exploration.

10 retrieved papers
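A minimal sketch of such a constrained Gaussian edit is shown below, with the distance bound implemented as projection onto an L2 ball; `MAX_EDIT` and the projection choice are assumptions for illustration, and the paper's exact constraint may differ.

```python
import numpy as np

rng = np.random.default_rng(2)
MAX_EDIT = 0.2  # assumed edit-distance bound; a tunable hyperparameter

def edit_action(base_action, mean, log_std):
    # Sample a Gaussian edit (its stochasticity doubles as entropy-driven
    # exploration), then project the edit onto an L2 ball around the base
    # action so the result stays close to the base policy's distribution.
    delta = mean + np.exp(log_std) * rng.normal(size=base_action.shape)
    norm = np.linalg.norm(delta)
    if norm > MAX_EDIT:
        delta *= MAX_EDIT / norm
    return base_action + delta

base = np.array([0.3, -0.1])
edited = edit_action(base, np.zeros(2), np.full(2, -1.0))
```

Keeping the edit small is what makes the edit policy's optimization problem local and easy: it only needs to climb the Q-landscape in a neighborhood of an already-plausible base action.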

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
