HAMLET: Switch Your Vision-Language-Action Model into a History-Aware Policy

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Vision-language-action models, Robot manipulation
Abstract:

Robotic manipulation tasks are inherently history-dependent: leveraging past context can be beneficial. However, most existing Vision-Language-Action models (VLAs) have been designed without this aspect in mind, i.e., they rely solely on the current observation and ignore the preceding context. In this paper, we propose HAMLET, a scalable framework that adapts VLAs to attend to historical context during action prediction. Specifically, we introduce moment tokens that compactly encode perceptual information at each timestep. Their representations are initialized with time-contrastive learning, allowing them to better capture temporally distinctive aspects. Next, we employ a lightweight memory module that integrates the moment tokens across past timesteps into memory features, which are then leveraged for action prediction. Through empirical evaluation, we show that HAMLET successfully transforms a state-of-the-art VLA into a history-aware policy, demonstrating significant improvements especially on long-horizon tasks that require historical context. In particular, on top of GR00T N1.5, HAMLET achieves an average success rate of 76.4% on history-dependent real-world tasks, surpassing the baseline by 47.2%. Furthermore, HAMLET improves the prior state of the art from 64.1% to 66.4% on RoboCasa Kitchen (100-demo setup) and from 95.6% to 97.7% on LIBERO, highlighting its effectiveness even on generic robot-manipulation benchmarks.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces HAMLET, a framework that adapts vision-language-action models to leverage historical context through moment tokens and a lightweight memory module. Within the taxonomy, HAMLET resides in the 'Compact Temporal Representations' leaf under 'Efficiency and Optimization for Historical Processing'. This leaf contains only two papers total, indicating a relatively sparse research direction focused specifically on compressing temporal information to reduce computational overhead. The positioning suggests the work addresses an emerging concern: balancing temporal expressiveness with deployment constraints on real robotic systems.

The taxonomy reveals that HAMLET's parent branch ('Efficiency and Optimization') sits alongside richer but more computationally intensive approaches. Neighboring leaves include 'Token Pruning and Selection' and 'Layer-Level Optimization', which tackle efficiency through different mechanisms. More distant branches like 'General Memory Modules' (containing MemoryVLA and related work) and 'Temporal Attention and Selection' represent alternative strategies that maintain more elaborate historical representations or use attention-based selection without explicit compression. The scope note for HAMLET's leaf explicitly excludes pruning methods, clarifying that compact representations differ from token-level selection strategies.

Among the three contributions analyzed, the overall HAMLET framework has one refutable candidate among the ten papers examined, suggesting some overlap with prior work on history-aware VLA adaptation within the limited search scope. For the moment tokens with time-contrastive initialization and for the lightweight memory module, ten candidates each were examined with zero refutations, indicating that these specific technical choices appear more distinctive among the thirty candidates reviewed in total. These statistics reflect a focused but not exhaustive literature search, covering primarily the top semantic matches rather than the entire field of temporal VLA methods.

Based on the limited search scope of thirty candidates, HAMLET appears to occupy a relatively underexplored niche at the intersection of temporal modeling and computational efficiency. The sparse population of its taxonomy leaf and the low refutation rates for specific technical contributions suggest novelty in the particular combination of compact temporal encoding and time-contrastive initialization, though the broader framework concept shows some prior overlap. The analysis does not cover the full landscape of efficiency-oriented temporal methods beyond top semantic matches.

Taxonomy

Core-task Taxonomy Papers: 45
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: incorporating historical context into vision-language-action models for robotic manipulation. The field has evolved around several complementary directions that address how robots can leverage temporal information to improve decision-making. The taxonomy reveals ten major branches:

- Memory architecture and representation: how to store and organize past observations
- Temporal encoding and attention mechanisms: how to selectively weight historical data
- Motion and trajectory representation: capturing dynamic patterns over time
- Reasoning and planning with historical context: using past experience for deliberation
- Spatial-temporal coherence: maintaining consistency across frames
- Efficiency and optimization: reducing computational overhead
- Specialized applications: task-specific adaptations
- Foundational representations and pretraining: learning general-purpose embeddings
- Surveys and benchmarks: evaluation frameworks
- Auxiliary mechanisms: supporting techniques

Works like MemoryVLA[2] and Instruction-driven history-aware policies[4] exemplify explicit memory systems, while others such as R3M[3] and Vision-Language Foundation Models[5] focus on foundational representations that implicitly capture temporal structure. A particularly active tension exists between methods that maintain rich, explicit historical representations and those that compress temporal information for efficiency. The Efficiency and Optimization branch addresses this trade-off directly, exploring compact temporal representations that balance expressiveness with computational cost. HAMLET[0] sits within this efficiency-focused cluster, emphasizing compact temporal representations to reduce the overhead of processing long action histories. This contrasts with approaches like MemoryVLA[2], which maintains more elaborate memory structures, and CoT-VLA[6], which incorporates chain-of-thought reasoning over historical context.

Nearby work such as Resolving State Ambiguity[39] also tackles efficiency concerns by addressing when and how historical information disambiguates current observations. The central challenge across these directions remains retaining sufficient temporal context for complex manipulation tasks while keeping models deployable on real robotic systems with limited computational resources.

Claimed Contributions

HAMLET framework for history-aware VLA adaptation

The authors introduce HAMLET, a fine-tuning framework that transforms pre-trained Vision-Language-Action models into history-aware policies without requiring costly retraining from scratch. The framework enables VLAs to leverage past context for improved action prediction on long-horizon manipulation tasks.

Retrieved papers compared: 10 (one refutable candidate)
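The adaptation recipe described above can be sketched at a high level. Everything here (the function names, the feature shapes, and the concatenation-based fusion of current and memory features) is an illustrative assumption, not the authors' actual implementation:

```python
import numpy as np

def history_aware_action(vla_encode, vla_act, memory_module, past_obs, current_obs):
    """Sketch of wiring a small memory path onto a pre-trained VLA:
    per-timestep observations are compressed into moment-token features,
    pooled into a single memory feature, and concatenated with the current
    observation feature before the action head."""
    moment_feats = np.stack([vla_encode(o) for o in past_obs])  # (T, dim)
    memory_feat = memory_module(moment_feats)                   # (dim,)
    current_feat = vla_encode(current_obs)                      # (dim,)
    # History-augmented feature fed to the (reused) action head.
    fused = np.concatenate([current_feat, memory_feat])         # (2 * dim,)
    return vla_act(fused)
```

The point of the sketch is that only the memory path is new; the encoder and action head come from the pre-trained VLA, which is what makes the adaptation cheap relative to retraining from scratch.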
Moment tokens with time-contrastive learning initialization

The authors propose learnable moment tokens that compress observations at each timestep into compact representations. These tokens are initialized using time-contrastive learning to capture temporally distinctive aspects while filtering out redundant information like static backgrounds.

Retrieved papers compared: 10 (no refutations)
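A common way to realize time-contrastive initialization is an InfoNCE-style objective where temporally nearby frames are positives and distant frames are negatives, so the learned representation emphasizes what changes over time rather than static background. The sketch below is a minimal NumPy version of that generic objective; the window size, temperature, and exact form are assumptions, not the paper's specification:

```python
import numpy as np

def time_contrastive_loss(embeddings, anchor_idx, pos_window=2, temperature=0.1):
    """InfoNCE-style time-contrastive loss over a (T, dim) sequence of frame
    embeddings. Frames within `pos_window` steps of the anchor are positives;
    all other non-anchor frames act as negatives."""
    # L2-normalize so dot products are cosine similarities.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    anchor = z[anchor_idx]
    sims = z @ anchor / temperature            # similarity of each frame to the anchor
    mask = np.ones(len(z), dtype=bool)
    mask[anchor_idx] = False                   # exclude the anchor itself
    pos = np.abs(np.arange(len(z)) - anchor_idx) <= pos_window
    pos &= mask
    # Log-softmax over all non-anchor frames, averaged over the positives.
    logits = sims[mask]
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return -np.mean(log_probs[pos[mask]])
```

Minimizing this loss pulls nearby-in-time frames together in embedding space and pushes distant frames apart, which matches the stated goal of capturing temporally distinctive aspects.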
Lightweight memory module for temporal aggregation

The authors design a lightweight Transformer-based memory module that selectively aggregates moment token representations across timesteps. This module produces history-augmented features for action prediction, treating different timesteps with varying importance rather than equal weighting.

Retrieved papers compared: 10 (no refutations)
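"Treating different timesteps with varying importance" is naturally expressed as attention pooling: a learned query scores each timestep's moment token and the memory feature is the resulting weighted sum. The single-head NumPy sketch below illustrates that mechanism only; the module's real depth, head count, and parameterization are not given here and these names are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class MemoryReadout:
    """Single-head attention that pools per-timestep moment tokens into one
    memory feature, weighting timesteps unequally via learned projections."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        self.query = rng.normal(scale=scale, size=dim)        # learned readout query
        self.w_k = rng.normal(scale=scale, size=(dim, dim))   # key projection
        self.w_v = rng.normal(scale=scale, size=(dim, dim))   # value projection

    def __call__(self, moment_tokens):
        # moment_tokens: (T, dim), one compact token per past timestep.
        keys = moment_tokens @ self.w_k
        values = moment_tokens @ self.w_v
        attn = softmax(keys @ self.query / np.sqrt(keys.shape[-1]))
        return attn @ values, attn   # memory feature (dim,), per-timestep weights (T,)
```

Because the attention weights are a softmax over timesteps, uninformative frames can be down-weighted toward zero rather than averaged in with equal weight.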

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
