HAMLET: Switch Your Vision-Language-Action Model into a History-Aware Policy

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Vision-language-action models, Robot manipulation
Abstract:

Robotic manipulation tasks are inherently history-dependent: leveraging past context can be beneficial. However, most existing Vision-Language-Action models (VLAs) have been designed without this aspect in mind, i.e., they rely solely on the current observation and ignore the preceding context. In this paper, we propose HAMLET, a scalable framework that adapts VLAs to attend to historical context during action prediction. Specifically, we introduce moment tokens that compactly encode perceptual information at each timestep. Their representations are initialized with time-contrastive learning, allowing them to better capture temporally distinctive aspects. Next, we employ a lightweight memory module that integrates the moment tokens across past timesteps into memory features, which are then leveraged for action prediction. Through empirical evaluation, we show that HAMLET successfully transforms a state-of-the-art VLA into a history-aware policy, demonstrating significant improvements especially on long-horizon tasks that require historical context. In particular, on top of GR00T N1.5, HAMLET achieves an average success rate of 76.4% on history-dependent real-world tasks, surpassing the baseline by 47.2%. Furthermore, HAMLET improves the prior state of the art from 64.1% to 66.4% on RoboCasa Kitchen (100-demo setup) and from 95.6% to 97.7% on LIBERO, highlighting its effectiveness even on generic robot-manipulation benchmarks.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces HAMLET, a framework that adapts vision-language-action models to leverage historical context through moment tokens and a lightweight memory module. Within the taxonomy, HAMLET resides in the 'Compact Temporal Representations' leaf under 'Efficiency and Optimization for Historical Processing'. This leaf contains only two papers total, indicating a relatively sparse research direction focused specifically on compressing temporal information to reduce computational overhead. The positioning suggests the work addresses an emerging concern: balancing temporal expressiveness with deployment constraints on real robotic systems.

The taxonomy reveals that HAMLET's parent branch ('Efficiency and Optimization') sits alongside richer but more computationally intensive approaches. Neighboring leaves include 'Token Pruning and Selection' and 'Layer-Level Optimization', which tackle efficiency through different mechanisms. More distant branches like 'General Memory Modules' (containing MemoryVLA and related work) and 'Temporal Attention and Selection' represent alternative strategies that maintain more elaborate historical representations or use attention-based selection without explicit compression. The scope note for HAMLET's leaf explicitly excludes pruning methods, clarifying that compact representations differ from token-level selection strategies.

Among the three contributions analyzed, the overall HAMLET framework has one refutable candidate among the ten papers examined, suggesting some overlap with prior work on history-aware VLA adaptation within the limited search scope. For the moment tokens with time-contrastive initialization and for the lightweight memory module, ten candidates each were examined with zero refutations, indicating that these specific technical choices appear more distinctive among the thirty candidates reviewed in total. These statistics reflect a focused but not exhaustive literature search, covering primarily the top semantic matches rather than the entire field of temporal VLA methods.

Based on the limited search scope of thirty candidates, HAMLET appears to occupy a relatively underexplored niche at the intersection of temporal modeling and computational efficiency. The sparse population of its taxonomy leaf and the low refutation rates for specific technical contributions suggest novelty in the particular combination of compact temporal encoding and time-contrastive initialization, though the broader framework concept shows some prior overlap. The analysis does not cover the full landscape of efficiency-oriented temporal methods beyond top semantic matches.

Taxonomy

Core-task Taxonomy Papers: 45
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: incorporating historical context into vision-language-action models for robotic manipulation. The field has evolved around several complementary directions that address how robots can leverage temporal information to improve decision-making. The taxonomy reveals ten major branches:

- Memory architecture and representation: how to store and organize past observations
- Temporal encoding and attention mechanisms: how to selectively weight historical data
- Motion and trajectory representation: capturing dynamic patterns over time
- Reasoning and planning with historical context: using past experience for deliberation
- Spatial-temporal coherence: maintaining consistency across frames
- Efficiency and optimization: reducing computational overhead
- Specialized applications: task-specific adaptations
- Foundational representations and pretraining: learning general-purpose embeddings
- Surveys and benchmarks: evaluation frameworks
- Auxiliary mechanisms: supporting techniques

Works like MemoryVLA[2] and Instruction-driven history-aware policies[4] exemplify explicit memory systems, while others such as R3M[3] and Vision-Language Foundation Models[5] focus on foundational representations that implicitly capture temporal structure. A particularly active tension exists between methods that maintain rich, explicit historical representations and those that compress temporal information for efficiency. The Efficiency and Optimization branch addresses this trade-off directly, exploring compact temporal representations that balance expressiveness with computational cost. HAMLET[0] sits within this efficiency-focused cluster, emphasizing compact temporal representations to reduce the overhead of processing long action histories. This contrasts with approaches like MemoryVLA[2], which maintains more elaborate memory structures, and CoT-VLA[6], which incorporates chain-of-thought reasoning over historical context.

Nearby work such as Resolving State Ambiguity[39] also tackles efficiency concerns by addressing when and how historical information disambiguates current observations. The central challenge across these directions remains retaining sufficient temporal context for complex manipulation tasks while keeping models deployable on real robotic systems with limited computational resources.

Claimed Contributions

HAMLET framework for history-aware VLA adaptation

The authors introduce HAMLET, a fine-tuning framework that transforms pre-trained Vision-Language-Action models into history-aware policies without requiring costly retraining from scratch. The framework enables VLAs to leverage past context for improved action prediction on long-horizon manipulation tasks.

Retrieved papers compared: 10 (one refutable candidate)
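The adaptation recipe described above can be sketched at a high level. Everything here (the function names, the feature shapes, and the concatenation-based fusion of current and memory features) is an illustrative assumption, not the authors' actual implementation:

```python
import numpy as np

def history_aware_action(vla_encode, vla_act, memory_module, past_obs, current_obs):
    """Sketch of wiring a small memory path onto a pre-trained VLA:
    per-timestep observations are compressed into moment-token features,
    pooled into a single memory feature, and concatenated with the current
    observation feature before the action head."""
    moment_feats = np.stack([vla_encode(o) for o in past_obs])  # (T, dim)
    memory_feat = memory_module(moment_feats)                   # (dim,)
    current_feat = vla_encode(current_obs)                      # (dim,)
    # History-augmented feature fed to the (reused) action head.
    fused = np.concatenate([current_feat, memory_feat])         # (2 * dim,)
    return vla_act(fused)
```

The point of the sketch is that only the memory path is new; the encoder and action head come from the pre-trained VLA, which is what makes the adaptation cheap relative to retraining from scratch.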
Moment tokens with time-contrastive learning initialization

The authors propose learnable moment tokens that compress observations at each timestep into compact representations. These tokens are initialized using time-contrastive learning to capture temporally distinctive aspects while filtering out redundant information like static backgrounds.

Retrieved papers compared: 10 (no refutations)
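A common way to realize time-contrastive initialization is an InfoNCE-style objective where temporally nearby frames are positives and distant frames are negatives, so the learned representation emphasizes what changes over time rather than static background. The sketch below is a minimal NumPy version of that generic objective; the window size, temperature, and exact form are assumptions, not the paper's specification:

```python
import numpy as np

def time_contrastive_loss(embeddings, anchor_idx, pos_window=2, temperature=0.1):
    """InfoNCE-style time-contrastive loss over a (T, dim) sequence of frame
    embeddings. Frames within `pos_window` steps of the anchor are positives;
    all other non-anchor frames act as negatives."""
    # L2-normalize so dot products are cosine similarities.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    anchor = z[anchor_idx]
    sims = z @ anchor / temperature            # similarity of each frame to the anchor
    mask = np.ones(len(z), dtype=bool)
    mask[anchor_idx] = False                   # exclude the anchor itself
    pos = np.abs(np.arange(len(z)) - anchor_idx) <= pos_window
    pos &= mask
    # Log-softmax over all non-anchor frames, averaged over the positives.
    logits = sims[mask]
    log_probs = logits - np.log(np.sum(np.exp(logits)))
    return -np.mean(log_probs[pos[mask]])
```

Minimizing this loss pulls nearby-in-time frames together in embedding space and pushes distant frames apart, which matches the stated goal of capturing temporally distinctive aspects.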
Lightweight memory module for temporal aggregation

The authors design a lightweight Transformer-based memory module that selectively aggregates moment token representations across timesteps. This module produces history-augmented features for action prediction, treating different timesteps with varying importance rather than equal weighting.

Retrieved papers compared: 10 (no refutations)
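"Treating different timesteps with varying importance" is naturally expressed as attention pooling: a learned query scores each timestep's moment token and the memory feature is the resulting weighted sum. The single-head NumPy sketch below illustrates that mechanism only; the module's real depth, head count, and parameterization are not given here and these names are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class MemoryReadout:
    """Single-head attention that pools per-timestep moment tokens into one
    memory feature, weighting timesteps unequally via learned projections."""

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        self.query = rng.normal(scale=scale, size=dim)        # learned readout query
        self.w_k = rng.normal(scale=scale, size=(dim, dim))   # key projection
        self.w_v = rng.normal(scale=scale, size=(dim, dim))   # value projection

    def __call__(self, moment_tokens):
        # moment_tokens: (T, dim), one compact token per past timestep.
        keys = moment_tokens @ self.w_k
        values = moment_tokens @ self.w_v
        attn = softmax(keys @ self.query / np.sqrt(keys.shape[-1]))
        return attn @ values, attn   # memory feature (dim,), per-timestep weights (T,)
```

Because the attention weights are a softmax over timesteps, uninformative frames can be down-weighted toward zero rather than averaged in with equal weight.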

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
