HAMLET: Switch Your Vision-Language-Action Model into a History-Aware Policy
Overview
Overall Novelty Assessment
The paper introduces HAMLET, a framework that adapts vision-language-action models to leverage historical context through moment tokens and a lightweight memory module. Within the taxonomy, HAMLET resides in the 'Compact Temporal Representations' leaf under 'Efficiency and Optimization for Historical Processing'. This leaf contains only two papers total, indicating a relatively sparse research direction focused specifically on compressing temporal information to reduce computational overhead. The positioning suggests the work addresses an emerging concern: balancing temporal expressiveness with deployment constraints on real robotic systems.
The taxonomy reveals that HAMLET's parent branch ('Efficiency and Optimization') sits alongside richer but more computationally intensive approaches. Neighboring leaves include 'Token Pruning and Selection' and 'Layer-Level Optimization', which tackle efficiency through different mechanisms. More distant branches like 'General Memory Modules' (containing MemoryVLA and related work) and 'Temporal Attention and Selection' represent alternative strategies that maintain more elaborate historical representations or use attention-based selection without explicit compression. The scope note for HAMLET's leaf explicitly excludes pruning methods, clarifying that compact representations differ from token-level selection strategies.
Among the three contributions analyzed, the overall HAMLET framework had one of its ten examined candidates flagged as a potential refutation, suggesting some overlap with prior work on history-aware VLA adaptation within the limited search scope. For the moment tokens with time-contrastive learning initialization and for the lightweight memory module, ten candidates each were examined with zero refutations, indicating that these specific technical choices appear more distinctive among the thirty candidates reviewed in total. These statistics reflect a focused rather than exhaustive literature search, covering primarily the top semantic matches rather than the entire field of temporal VLA methods.
Based on the limited search scope of thirty candidates, HAMLET appears to occupy a relatively underexplored niche at the intersection of temporal modeling and computational efficiency. The sparse population of its taxonomy leaf and the low refutation rates for specific technical contributions suggest novelty in the particular combination of compact temporal encoding and time-contrastive initialization, though the broader framework concept shows some prior overlap. The analysis does not cover the full landscape of efficiency-oriented temporal methods beyond top semantic matches.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce HAMLET, a fine-tuning framework that transforms pre-trained Vision-Language-Action models into history-aware policies without requiring costly retraining from scratch. The framework enables VLAs to leverage past context for improved action prediction on long-horizon manipulation tasks.
The authors propose learnable moment tokens that compress observations at each timestep into compact representations. These tokens are initialized using time-contrastive learning to capture temporally distinctive aspects while filtering out redundant information like static backgrounds.
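The paper's exact initialization objective is not reproduced here; as a rough illustration of the idea, a time-contrastive (InfoNCE-style) loss treats temporally adjacent frame embeddings as positives and distant frames as negatives, pushing representations to encode what changes over time rather than static background. The function name, positive-window size, and numpy formulation below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def time_contrastive_loss(embeddings, temperature=0.1, pos_window=1):
    """InfoNCE-style time-contrastive loss (illustrative sketch).

    Frames within `pos_window` timesteps of an anchor are positives;
    all other frames act as negatives.
    embeddings: (T, D) array of per-timestep embeddings.
    """
    # L2-normalize so dot products are cosine similarities
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature  # (T, T) scaled similarity matrix
    T = len(z)
    losses = []
    for i in range(T):
        # positives: temporally adjacent frames, excluding the anchor itself
        pos = [j for j in range(T) if j != i and abs(j - i) <= pos_window]
        if not pos:
            continue
        # log-softmax over all non-anchor frames
        logits = np.delete(sim[i], i)
        log_prob = logits - np.log(np.exp(logits).sum())
        # map positive indices into the anchor-removed coordinates
        pos_idx = [j if j < i else j - 1 for j in pos]
        losses.append(-log_prob[pos_idx].mean())
    return float(np.mean(losses))
```

A trajectory whose embeddings drift smoothly over time yields a lower loss than the same embeddings in shuffled temporal order, which is the property that makes this objective useful for emphasizing temporally distinctive content.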
The authors design a lightweight Transformer-based memory module that selectively aggregates moment token representations across timesteps. This module produces history-augmented features for action prediction, treating different timesteps with varying importance rather than equal weighting.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[39] Resolving State Ambiguity in Robot Manipulation via Adaptive Working Memory Recoding
Contribution Analysis
Detailed comparisons for each claimed contribution
HAMLET framework for history-aware VLA adaptation
The authors introduce HAMLET, a fine-tuning framework that transforms pre-trained Vision-Language-Action models into history-aware policies without requiring costly retraining from scratch. The framework enables VLAs to leverage past context for improved action prediction on long-horizon manipulation tasks.
[2] MemoryVLA: Perceptual-Cognitive Memory in Vision-Language-Action Models for Robotic Manipulation
[7] VLMPC: Vision-Language Model Predictive Control for Robotic Manipulation
[15] MAP-VLA: Memory-Augmented Prompting for Vision-Language-Action Model in Robotic Manipulation
[66] TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation
[67] Physically Grounded Vision-Language Models for Robotic Manipulation
[68] VLA-Cache: Towards Efficient Vision-Language-Action Model via Adaptive Token Caching in Robotic Manipulation
[69] RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation
[70] OpenHelix: A Short Survey, Empirical Analysis, and Open-Source Dual-System VLA Model for Robotic Manipulation
[71] Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation
[72] CrayonRobo: Object-Centric Prompt-Driven Vision-Language-Action Model for Robotic Manipulation
Moment tokens with time-contrastive learning initialization
The authors propose learnable moment tokens that compress observations at each timestep into compact representations. These tokens are initialized using time-contrastive learning to capture temporally distinctive aspects while filtering out redundant information like static backgrounds.
[56] Self-Supervised Contrastive Representation Learning for Semi-Supervised Time-Series Classification
[57] Spatiotemporal Contrastive Video Representation Learning
[58] Self-Supervised Contrastive Representation Learning for Large-Scale Trajectories
[59] TS2Vec: Towards Universal Representation of Time Series
[60] Intent Contrastive Learning for Sequential Recommendation
[61] Contrastive Learning for Sequential Recommendation
[62] Contrastive Learning for Representation Degeneration Problem in Sequential Recommendation
[63] MSST: Multi-Scale Spatial-Temporal Representation Learning for Trajectory Similarity Computation
[64] VideoMoCo: Contrastive Video Representation Learning with Temporally Adversarial Examples
[65] TCLR: Temporal Contrastive Learning for Video Representation
Lightweight memory module for temporal aggregation
The authors design a lightweight Transformer-based memory module that selectively aggregates moment token representations across timesteps. This module produces history-augmented features for action prediction, treating different timesteps with varying importance rather than equal weighting.
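The described selective aggregation can be pictured as cross-attention from the current observation over the buffer of past moment tokens, so that timesteps receive learned relevance weights rather than a uniform average. The minimal single-head, numpy-only sketch below is a hypothetical illustration of that mechanism, not the paper's module; the class name, residual fusion, and random weight initialization are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class MemoryModule:
    """Single-head attention over per-timestep moment tokens (sketch).

    The current observation's feature forms the query; past moment tokens
    form the keys and values, so each timestep contributes according to a
    learned relevance weight instead of equal weighting.
    """
    def __init__(self, dim, rng=None):
        rng = rng or np.random.default_rng(0)
        s = 1.0 / np.sqrt(dim)
        self.Wq = rng.normal(0, s, (dim, dim))
        self.Wk = rng.normal(0, s, (dim, dim))
        self.Wv = rng.normal(0, s, (dim, dim))

    def __call__(self, current_feat, moment_tokens):
        # current_feat: (dim,); moment_tokens: (T, dim) history buffer
        q = current_feat @ self.Wq                # (dim,)
        k = moment_tokens @ self.Wk               # (T, dim)
        v = moment_tokens @ self.Wv               # (T, dim)
        attn = softmax(k @ q / np.sqrt(len(q)))   # (T,) per-timestep weights
        history = attn @ v                        # weighted aggregate of history
        # residual fusion: history-augmented feature for the action head
        return current_feat + history, attn
```

In a full Transformer-based module this would be stacked with feed-forward layers and normalization, but the core point is that the softmax weights let the policy attend unevenly across the history.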