Peak-Return Greedy Slicing: Subtrajectory Selection for Transformer-based Offline RL

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Offline Reinforcement Learning, Transformer
Abstract:

Offline reinforcement learning enables policy learning solely from fixed datasets, without costly or risky environment interactions, making it highly valuable for real-world applications. While Transformer-based approaches have recently demonstrated strong sequence-modeling capabilities, they typically learn from complete trajectories conditioned on final returns, which limits their ability to exploit high-quality segments embedded in otherwise suboptimal trajectories. To mitigate this limitation, we propose the Peak-Return Greedy Slicing (PRGS) framework, which explicitly partitions trajectories at the timestep level and emphasizes high-quality subtrajectories. PRGS first leverages an MMD-based return estimator to characterize the distribution of future returns for state-action pairs, yielding optimistic return estimates. It then performs greedy slicing to extract high-return subtrajectories for training. During evaluation, an adaptive history-truncation mechanism aligns the inference process with the training procedure. Extensive experiments across multiple benchmark datasets indicate that PRGS significantly improves the performance of Transformer-based offline reinforcement learning methods by enhancing their ability to exploit and recombine valuable subtrajectories.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

- Core-task Taxonomy Papers: 21
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 29
- Refutable Papers: 1
Research Landscape Overview

Core task: subtrajectory selection for transformer-based offline reinforcement learning. The field centers on how to effectively train sequence models—especially transformers—on pre-collected trajectory data without further environment interaction. The taxonomy reveals several complementary research directions. One major branch focuses on trajectory decomposition and subtrajectory extraction, exploring how to segment or filter offline datasets to improve learning quality; representative works here include return-based selection strategies such as Advantage-Guided Transformer[19] and Advantage-Guided Conditional Sequence[21], as well as the current Peak-Return Greedy Slicing[0] approach. A second branch examines conditioning and prompt engineering, investigating how to guide policy behavior through return-to-go targets, preferences, or learned prompts (e.g., Decision Transformer Preference[3], Prompt-Tuning Bandits[13]). Architectural innovations form a third branch, with studies proposing hierarchical decompositions (Hierarchical Decision Transformer[1]), alternative sequence models (Mamba Trajectory Optimization[8]), and temporal modeling refinements (Time-Delay Transformers[9]). Additional branches address online adaptation and finetuning (Online Finetuning Decision[11]), meta-learning and multi-task generalization (Meta-DT World Disentanglement[20]), and empirical evaluation frameworks that compare these diverse methods.

A particularly active line of work revolves around how to extract high-quality training signals from suboptimal offline data. Some approaches emphasize value-based filtering or advantage weighting (Value-Guided Decision Transformer[4], Advantage-Guided Transformer[19]) to prioritize informative transitions, while others regularize trajectory returns (Trajectory Return Regularization[5]) or augment data to improve coverage (Adaptive Data Augmentation[7]).
Peak-Return Greedy Slicing[0] sits squarely within the return-based subtrajectory selection cluster, sharing the goal of identifying promising segments with Advantage-Guided Transformer[19] and Advantage-Guided Conditional Sequence[21]. Where those neighbors typically rely on advantage estimates or conditional distributions to weight or filter data, Peak-Return Greedy Slicing[0] adopts a greedy slicing heuristic that directly targets peak-return windows. This design choice reflects a broader tension in the field: whether to use learned value functions for sophisticated filtering or simpler return-based heuristics that avoid additional estimation errors. Across these branches, open questions remain about the trade-offs between data efficiency, computational overhead, and generalization to diverse task distributions.

Claimed Contributions

Peak-Return Greedy Slicing (PRGS) framework for subtrajectory selection

The authors introduce PRGS, a framework that partitions trajectories at the timestep level to identify and prioritize high-quality subtrajectories. This enables Transformer-based offline RL methods to better exploit and recombine valuable trajectory segments, addressing limitations in existing methods that process complete trajectories.

Retrieved papers: 9
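The slicing idea described above can be made concrete with a small sketch. The following is an illustrative reimplementation, not the authors' code: it assumes each timestep carries an optimistic return estimate (`returns_hat`), and the parameters `keep_ratio`, `min_len`, and `min_peak` are hypothetical knobs standing in for whatever criteria the paper actually uses.

```python
def greedy_slice(returns_hat, keep_ratio=0.8, min_len=4, min_peak=0.0):
    """Greedily extract subtrajectories around peak-return timesteps.

    returns_hat : per-timestep optimistic return estimates
    keep_ratio  : a segment grows while estimates stay >= keep_ratio * peak
    min_len     : discard segments shorter than this
    min_peak    : ignore peaks at or below this value
    """
    n = len(returns_hat)
    used = [False] * n
    slices = []
    # Visit timesteps in decreasing order of estimated return (greedy).
    for t in sorted(range(n), key=lambda i: -returns_hat[i]):
        if returns_hat[t] <= min_peak:
            break  # all remaining peaks are too low to matter
        if used[t]:
            continue
        peak = returns_hat[t]
        lo = hi = t
        # Expand the window while neighbours stay close to the peak value.
        while lo > 0 and not used[lo - 1] and returns_hat[lo - 1] >= keep_ratio * peak:
            lo -= 1
        while hi < n - 1 and not used[hi + 1] and returns_hat[hi + 1] >= keep_ratio * peak:
            hi += 1
        if hi - lo + 1 >= min_len:
            slices.append((lo, hi))
            for i in range(lo, hi + 1):
                used[i] = True
    return sorted(slices)
```

Each segment is grown outward from the highest unused peak while neighbouring estimates stay within `keep_ratio` of the peak, mirroring the idea of prioritizing high-quality subtrajectories over whole trajectories.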
MMD-based return estimator for optimistic value estimation

The framework employs a Maximum Mean Discrepancy-based return estimator that approximates the distribution of potential future returns for individual state-action pairs. By selecting top particles from this distribution, it produces optimistic estimates that guide the identification of high-value subtrajectories.

Retrieved papers: 10 (can refute)
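How "top particles" yield an optimistic estimate can be sketched in a few lines. This is a hedged illustration rather than the paper's estimator: the training loop that fits the particles is omitted, and we only show a biased Gaussian-kernel MMD between two 1-D return samples (the kind of discrepancy such an estimator could minimize) plus the top-fraction mean that produces the optimistic value. All names, defaults, and the kernel choice are assumptions.

```python
import numpy as np

def gaussian_mmd(x, y, bandwidth=1.0):
    """Biased squared MMD between two 1-D samples under a Gaussian kernel."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    def k(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * bandwidth ** 2))
    return float(k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean())

def optimistic_return(particles, top_frac=0.2):
    """Mean of the largest top_frac fraction of return particles."""
    particles = np.asarray(particles, dtype=float)
    k = max(1, int(round(top_frac * particles.size)))
    return float(np.sort(particles)[-k:].mean())  # average of the top-k particles
```

Taking the mean of the upper tail, rather than the full-sample mean, is what makes the estimate optimistic: it scores a state-action pair by the best returns its distribution supports.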
Adaptive history truncation mechanism for training-evaluation alignment

The authors propose an adaptive mechanism that dynamically truncates historical trajectory information during policy evaluation. This mechanism compares estimated values across timesteps to determine whether to retain or discard past information, ensuring consistency between the training phase (which uses subtrajectories) and the evaluation phase.

Retrieved papers: 10
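A minimal sketch of the described compare-and-truncate logic, under the assumption that the policy keeps a context of past steps along with a value estimate per step; the names `context`, `values`, and `max_len`, and the specific rule shown, are illustrative, since this report does not specify the paper's actual criterion.

```python
def truncate_history(context, values, max_len=20):
    """Hypothetical truncation rule: restart the context whenever the newest
    step attains the highest estimated value in the retained history, and
    always cap the context at max_len steps."""
    if values and values[-1] >= max(values):
        # The newest step starts a fresh high-value segment: drop older
        # steps so the evaluation context resembles the short, high-return
        # subtrajectories seen during training.
        context, values = context[-1:], values[-1:]
    return context[-max_len:], values[-max_len:]
```

The point of any such rule is training-evaluation alignment: a policy trained on sliced subtrajectories should not be conditioned at test time on long histories of a kind it never saw during training.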

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Peak-Return Greedy Slicing (PRGS) framework for subtrajectory selection

The authors introduce PRGS, a framework that partitions trajectories at the timestep level to identify and prioritize high-quality subtrajectories. This enables Transformer-based offline RL methods to better exploit and recombine valuable trajectory segments, addressing limitations in existing methods that process complete trajectories.

Contribution

MMD-based return estimator for optimistic value estimation

The framework employs a Maximum Mean Discrepancy-based return estimator that approximates the distribution of potential future returns for individual state-action pairs. By selecting top particles from this distribution, it produces optimistic estimates that guide the identification of high-value subtrajectories.

Contribution

Adaptive history truncation mechanism for training-evaluation alignment

The authors propose an adaptive mechanism that dynamically truncates historical trajectory information during policy evaluation. This mechanism compares estimated values across timesteps to determine whether to retain or discard past information, ensuring consistency between the training phase (which uses subtrajectories) and the evaluation phase.