Peak-Return Greedy Slicing: Subtrajectory Selection for Transformer-based Offline RL
Overview
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce PRGS, a framework that partitions trajectories at the timestep level to identify and prioritize high-quality subtrajectories. This lets Transformer-based offline RL methods exploit and recombine valuable trajectory segments, addressing a limitation of existing methods that condition on complete trajectories regardless of segment quality.
The framework employs a Maximum Mean Discrepancy-based return estimator that approximates the distribution of potential future returns for individual state-action pairs. By selecting the top particles from this distribution, it produces optimistic return estimates that guide the identification of high-value subtrajectories.
The authors propose an adaptive mechanism that dynamically truncates historical trajectory information during policy evaluation. This mechanism compares estimated values across timesteps to determine whether to retain or discard past information, ensuring consistency between the training phase (which uses subtrajectories) and the evaluation phase.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Peak-Return Greedy Slicing (PRGS) framework for subtrajectory selection
The authors introduce PRGS, a framework that partitions trajectories at the timestep level to identify and prioritize high-quality subtrajectories. This lets Transformer-based offline RL methods exploit and recombine valuable trajectory segments, addressing a limitation of existing methods that condition on complete trajectories regardless of segment quality.
[6] Trajectory-based explainability framework for offline RL
[32] RTDiff: Reverse trajectory synthesis via diffusion for offline reinforcement learning
[33] Sub-trajectory clustering with deep reinforcement learning
[34] Scalable clustering of segmented trajectories within a continuous time framework: application to maritime traffic data
[35] Less is more: Refining datasets for offline reinforcement learning with reward machines
[36] Efficient multi-agent offline coordination via diffusion-based trajectory stitching
[38] SLIM: Subtrajectory-Level Elimination for More Effective Reasoning
[39] Sub-Goal Trees: A Framework for Goal-Directed Trajectory Prediction and Optimization
[40] A Generalized Apprenticeship Learning Framework for Capturing Evolving Student Pedagogical Strategies
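The timestep-level partitioning can be sketched as follows. The paper's exact slicing rule is not reproduced here; this minimal sketch assumes a simple threshold rule over per-timestep (optimistic) return estimates, keeping only contiguous segments whose estimate clears the threshold. The function name and threshold rule are illustrative assumptions, not the authors' implementation.

```python
def slice_by_peak_return(returns_est, threshold):
    """Partition a trajectory at the timestep level, keeping contiguous
    segments whose estimated return is at least `threshold`.

    returns_est: per-timestep return estimates for one trajectory.
    Returns a list of (start, end) half-open index pairs.
    """
    segments, start = [], None
    for t, r in enumerate(returns_est):
        if r >= threshold and start is None:
            start = t                      # segment opens at first high-value step
        elif r < threshold and start is not None:
            segments.append((start, t))    # segment closes when value drops
            start = None
    if start is not None:                  # trajectory ends inside a segment
        segments.append((start, len(returns_est)))
    return segments
```

Downstream, a Transformer policy would then be trained only on the returned segments rather than on whole trajectories, which is what allows recombination of high-value pieces from otherwise mediocre trajectories.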
MMD-based return estimator for optimistic value estimation
The framework employs a Maximum Mean Discrepancy-based return estimator that approximates the distribution of potential future returns for individual state-action pairs. By selecting the top particles from this distribution, it produces optimistic return estimates that guide the identification of high-value subtrajectories.
[26] Distributional reinforcement learning via moment matching
[22] From Wasserstein to Maximum Mean Discrepancy Barycenters: A novel framework for uncertainty propagation in model-free reinforcement learning
[23] Temporal difference and return optimism in cooperative multi-agent reinforcement learning
[24] Utilizing Maximum Mean Discrepancy Barycenter for propagating the uncertainty of value functions in reinforcement learning
[25] Distributional reinforcement learning with regularized Wasserstein loss
[27] Distributional reinforcement learning for multi-dimensional reward functions
[28] Distributional reinforcement learning with maximum mean discrepancy
[29] Policy Regularization for Model-Based Offline Reinforcement Learning
[30] Model-based deep reinforcement learning for financial portfolio optimization
[31] Model-Based Reinforcement Learning in Multi-Objective Environments with a Distributional Critic
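The two ingredients of this contribution, an MMD objective over return particles and an optimistic point estimate from the top particles, can be illustrated in a few lines. This is a sketch under assumptions: a Gaussian kernel with a fixed bandwidth and a mean-of-top-k rule are common choices in the MMD distributional-RL literature (cf. [26], [28]), but the paper's exact kernel and selection rule are not specified here.

```python
import numpy as np

def mmd_squared(x, y, bandwidth=1.0):
    """Squared MMD between two 1-D particle sets under a Gaussian kernel.
    Used as a matching loss between predicted and target return particles."""
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def optimistic_estimate(particles, top_k):
    """Optimistic value estimate: mean of the top-k return particles
    approximating the return distribution of a state-action pair."""
    return np.sort(particles)[-top_k:].mean()
```

A critic trained to minimize `mmd_squared` between its particles and bootstrapped targets yields a distribution per state-action pair; `optimistic_estimate` then collapses it to the upper-tail score used to rank subtrajectories.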
Adaptive history truncation mechanism for training-evaluation alignment
The authors propose an adaptive mechanism that dynamically truncates historical trajectory information during policy evaluation. This mechanism compares estimated values across timesteps to determine whether to retain or discard past information, ensuring consistency between the training phase (which uses subtrajectories) and the evaluation phase.
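The truncation rule described above can be sketched as follows. The comparison rule is an assumption made for illustration: the context is cut back to the timestep whose estimated value peaks, but only when that peak exceeds the oldest retained timestep's estimate by a margin. The function and parameter names are hypothetical, not the authors' interface.

```python
def maybe_truncate(context, values, margin=0.0):
    """Adaptive history truncation at evaluation time.

    context: the history fed to the Transformer (one entry per timestep).
    values:  per-timestep value estimates for the same history.
    If the peak-value timestep beats the oldest timestep's estimate by
    more than `margin`, discard everything before the peak, so the
    evaluation context resembles the subtrajectories seen in training.
    """
    peak = max(range(len(values)), key=values.__getitem__)
    if values[peak] > values[0] + margin:
        return context[peak:], values[peak:]
    return context, values   # keep full history when no later step dominates
```

Calling this once per evaluation step keeps the policy's conditioning window aligned with the sliced, high-value segments it was trained on, rather than with full trajectories it never saw.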