EXPO: Stable Reinforcement Learning with Expressive Policies

ICLR 2026 Conference Submission
Anonymous Authors
Reinforcement Learning, Imitation Learning
Abstract:

We study the problem of training and fine-tuning expressive policies with online reinforcement learning (RL) given an offline dataset. Training expressive policy classes with online RL presents a unique challenge of stable value maximization. Unlike the simpler Gaussian policies commonly used in online RL, expressive policies such as diffusion and flow-matching policies are parameterized by a long denoising chain, which hinders stable gradient propagation from actions to policy parameters when optimizing against a value function. Our key insight is that stable value maximization can be achieved by avoiding direct value optimization with the expressive policy, instead constructing an on-the-fly RL policy to maximize the Q-value. We propose Expressive Policy Optimization (EXPO), a sample-efficient online RL algorithm that maximizes value with an on-the-fly policy built from two parameterized policies -- a larger expressive base policy trained with a stable imitation-learning objective, and a lightweight Gaussian edit policy that edits the actions sampled from the base policy toward a higher-value distribution. The on-the-fly policy refines the actions from the base policy with the learned edit policy and chooses the value-maximizing action among the base and edited actions for both sampling and temporal-difference (TD) backups. Our approach yields up to a 2-3x average improvement in sample efficiency over prior methods, both when fine-tuning a pretrained policy given offline data and when leveraging offline data for online training.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes EXPO, an algorithm for training expressive policies (specifically diffusion and flow-matching models) in online reinforcement learning settings. It resides in the 'Online Diffusion Policy Training and Optimization' leaf, which contains five papers total including the original work. This leaf sits within the broader 'Diffusion-Based Policy Learning' branch, which itself is part of 'Expressive Policy Architectures and Representations'. The taxonomy reveals this is a moderately populated research direction: the immediate leaf has four sibling papers, and the parent branch includes an additional three papers on offline and theoretical diffusion policy work, indicating sustained but not overwhelming activity in diffusion-based RL methods.

The taxonomy structure shows that EXPO's leaf is adjacent to 'Offline and Theoretical Diffusion Policy Work' (three papers) and sits alongside other expressive policy branches including 'Flow-Based and Generative Policy Methods' (two papers) and 'Maximum Entropy and Multimodal Policy Learning' (two papers). The scope note for EXPO's leaf emphasizes 'stable online RL training of diffusion policies addressing gradient pathologies and sample efficiency', while explicitly excluding offline-only methods. This boundary clarifies that EXPO targets the specific challenge of interactive learning with diffusion models, distinguishing it from offline behavioral cloning approaches and from flow-matching methods that use different generative architectures.

Among the three contributions analyzed, the literature search examined twenty-eight candidate papers total. For the core EXPO algorithm contribution, nine candidates were examined with zero refutable matches. The on-the-fly policy parameterization examined nine candidates (zero refutable), and the edit policy with distance constraint examined ten candidates (zero refutable). These statistics indicate that within the limited search scope—roughly thirty semantically similar papers—no prior work was identified that clearly overlaps with EXPO's specific combination of techniques. The absence of refutable candidates across all three contributions suggests the approach may occupy a relatively unexplored niche, though the search scale means this assessment is necessarily provisional.

Based on the limited literature search of approximately thirty candidates, EXPO appears to introduce a novel combination of mechanisms for online diffusion policy training. The taxonomy context reveals this work sits in an active but not saturated research area, with roughly a dozen papers across diffusion-based policy learning. The zero-refutation finding across all contributions should be interpreted cautiously given the search scope: it indicates no obvious prior work among top semantic matches, but does not constitute exhaustive verification of novelty across the entire field.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 0

Research Landscape Overview

Core task: online reinforcement learning with expressive policy classes. The field encompasses a broad spectrum of approaches for learning sequential decision-making policies that can represent complex behaviors while training directly from interaction.

The taxonomy organizes this landscape into several major branches: Expressive Policy Architectures and Representations explores rich function approximators such as diffusion models and flow-based policies; Policy Optimization Algorithms and Training Frameworks addresses the core algorithmic machinery for updating these policies; State Representation and Observation Learning tackles how agents encode and process environmental information; Multi-Agent and Interactive Decision Making considers settings with multiple learners or dynamic environments; and branches like Exploration and Sample Efficiency, Robustness and Adaptation, and Human-Centered Learning address complementary challenges in data collection, generalization, and alignment with human preferences. Application-Specific Decision Making and Hierarchical Policy Representations further refine methods for particular domains or structured action spaces, while Offline Reinforcement Learning provides a contrasting paradigm that learns from fixed datasets rather than online interaction.

Within the Expressive Policy Architectures branch, diffusion-based policy learning has emerged as a particularly active area, leveraging generative modeling techniques to represent multimodal action distributions. Works such as Q-Weighted Diffusion[3] and Efficient Diffusion[28] explore how to integrate value-based guidance and computational efficiency into diffusion policy training, while MaxEnt Diffusion[2] and Discrete Diffusion[5] extend these ideas to maximum-entropy frameworks and discrete action spaces.
EXPO[0] sits squarely in this cluster, focusing on online diffusion policy training and optimization—a setting that contrasts with many offline diffusion methods by emphasizing direct interaction and iterative policy updates. Compared to Q-Weighted Diffusion[3], which emphasizes value-function weighting for action selection, EXPO[0] appears to prioritize the online training dynamics and optimization strategies needed to make diffusion policies practical in interactive environments. This positioning highlights ongoing questions about how to balance the expressive power of diffusion models with the sample efficiency and stability requirements of online RL, a trade-off that remains central across the broader taxonomy.

Claimed Contributions

EXPO algorithm for stable online RL with expressive policies

EXPO is a novel online reinforcement learning algorithm designed to train and fine-tune expressive policy classes (such as diffusion or flow-matching policies) in a stable manner. The method avoids direct value optimization of the expressive policy by using a base policy trained with imitation learning and a lightweight Gaussian edit policy that refines actions toward higher Q-values, combined with an on-the-fly action selection mechanism.

9 retrieved papers
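As a rough illustration of the two-policy structure described above, the sketch below mocks it up in plain NumPy. The Q-function, base policy, and edit policy here are toy stand-ins (a quadratic critic and random proposals), not the paper's learned models; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_value(state, action):
    # Toy stand-in critic (quadratic bowl); the real method learns Q with TD.
    return float(-np.sum((action - 0.5) ** 2))

def base_policy(state, n=4):
    # Stand-in for the expressive base policy (e.g. a diffusion model
    # trained with an imitation objective); here, random 2-D proposals.
    return rng.normal(0.0, 1.0, size=(n, 2))

def edit_policy(state, action):
    # Stand-in Gaussian edit policy: a small stochastic perturbation.
    return action + rng.normal(0.0, 0.1, size=action.shape)

def select_action(state):
    # On-the-fly policy: pool base actions with their edited versions
    # and keep the Q-maximizing candidate.
    base_actions = base_policy(state)
    edited = np.array([edit_policy(state, a) for a in base_actions])
    candidates = np.concatenate([base_actions, edited], axis=0)
    scores = np.array([q_value(state, a) for a in candidates])
    return candidates[int(np.argmax(scores))]

state = np.zeros(3)
action = select_action(state)
```

The point of the construction is that gradients never flow through the expressive base policy's denoising chain: only the small edit policy and the critic are optimized against Q.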
On-the-fly policy parameterization for value maximization

The authors introduce an on-the-fly policy construction that performs value maximization without directly optimizing the expressive policy parameters. This policy samples actions from both the base expressive policy and an edit policy, then selects the highest Q-value action for both environment interaction and temporal-difference backup, enabling more stable and immediate reflection of Q-function changes.

9 retrieved papers
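The use of the selected action in the TD backup can be sketched as follows; the linear critic, candidate generation, and names such as `GAMMA` and `td_target` are placeholder assumptions for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
GAMMA = 0.99  # illustrative discount factor

def q_value(state, action, theta):
    # Toy linear critic over a fixed feature map; illustration only.
    feats = np.concatenate([state, action, state * action])
    return float(feats @ theta)

def on_the_fly_action(state, theta):
    # Candidates come from a stand-in base policy plus Gaussian edits;
    # the on-the-fly policy keeps the Q-maximizing one.
    base = rng.normal(0.0, 1.0, size=(4, 3))
    edited = base + rng.normal(0.0, 0.1, size=base.shape)
    candidates = np.vstack([base, edited])
    scores = [q_value(state, a, theta) for a in candidates]
    return candidates[int(np.argmax(scores))]

def td_target(reward, next_state, theta):
    # The TD backup bootstraps from the on-the-fly policy's action at the
    # next state, so critic updates are reflected in the target immediately.
    a_next = on_the_fly_action(next_state, theta)
    return reward + GAMMA * q_value(next_state, a_next, theta)

theta = rng.normal(size=9)
target = td_target(0.0, np.ones(3), theta)
```

Because the backup maximizes over fresh candidates at every step rather than relying on a slowly re-trained actor, changes in Q propagate to the target without waiting for policy-gradient updates.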
Edit policy with distance constraint for action refinement

The method introduces a small Gaussian edit policy that locally refines actions generated by the base expressive policy to maximize Q-values while maintaining proximity to the original actions through an edit distance constraint. This design allows the edit policy to solve a simpler local optimization problem, enabling efficient training with entropy regularization for exploration.

10 retrieved papers
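A minimal sketch of such a constrained Gaussian edit is shown below, with the distance bound implemented as projection onto an L2 ball; `MAX_EDIT` and the projection choice are assumptions for illustration, and the paper's exact constraint may differ.

```python
import numpy as np

rng = np.random.default_rng(2)
MAX_EDIT = 0.2  # assumed edit-distance bound; a tunable hyperparameter

def edit_action(base_action, mean, log_std):
    # Sample a Gaussian edit (its stochasticity doubles as entropy-driven
    # exploration), then project the edit onto an L2 ball around the base
    # action so the result stays close to the base policy's distribution.
    delta = mean + np.exp(log_std) * rng.normal(size=base_action.shape)
    norm = np.linalg.norm(delta)
    if norm > MAX_EDIT:
        delta *= MAX_EDIT / norm
    return base_action + delta

base = np.array([0.3, -0.1])
edited = edit_action(base, np.zeros(2), np.full(2, -1.0))
```

Keeping the edit small is what makes the edit policy's optimization problem local and easy: it only needs to climb the Q-landscape in a neighborhood of an already-plausible base action.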

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
