FASTer: Toward Powerful and Efficient Autoregressive Vision–Language–Action Models with Learnable Action Tokenizer and Block-wise Decoding
Claimed Contributions
FASTerVQ is a learnable neural action tokenizer that encodes action chunks as single-channel images and discretizes them with transformer-based residual vector quantization. It achieves high compression ratios while preserving reconstruction fidelity by capturing global spatio-temporal dependencies and modeling actions in both the temporal and frequency domains.
The method introduces block-wise autoregressive decoding, which predicts multiple action tokens in parallel within each block and thereby cuts the number of sequential inference steps. A lightweight action expert module bridges the modality gap between linguistic reasoning and continuous control while keeping the parameter count low.
The authors build an extensive evaluation framework spanning multiple real-world robotic platforms and simulated environments to systematically analyze action-tokenization methods for vision-language-action models, and report consistently strong performance across diverse embodiments and tasks.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[36] FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization
[39] VQ-VLA: Improving Vision-Language-Action Models via Scaling Vector-Quantized Action Tokenizers
Contribution Analysis
Detailed comparisons for each claimed contribution
FASTerVQ: Learnable Action Tokenizer with Transformer-based RVQ
FASTerVQ is a learnable neural action tokenizer that encodes action chunks as single-channel images and discretizes them with transformer-based residual vector quantization. It achieves high compression ratios while preserving reconstruction fidelity by capturing global spatio-temporal dependencies and modeling actions in both the temporal and frequency domains.
[14] Behavior Generation with Latent Actions
[57] Baku: An efficient transformer for multi-task policy learning
[58] Causal Motion Tokenizer for Streaming Motion Generation
[59] OmniSAT: Compact Action Token, Faster Auto Regression
[60] Grounding multimodal large language models in actions
[61] HOIGPT: Learning Long Sequence Hand-Object Interaction with Language Models
[62] Towards Generally Intelligent Robots That Simply Work Everywhere
[63] VersatileMotion: A Unified Framework for Motion Synthesis and Comprehension
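The core mechanism these tokenizers share, residual vector quantization, can be sketched in a few lines. The NumPy code below is an illustrative sketch, not FASTerVQ itself: the codebook sizes, the random codebook contents, and the flattening of an action chunk into a single vector are all assumptions; FASTerVQ's transformer encoder and frequency-domain modeling sit on top of this basic scheme.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: each stage quantizes the
    residual left over by the previous stages."""
    residual = np.asarray(x, dtype=float).copy()
    indices = []
    recon = np.zeros_like(residual)
    for cb in codebooks:                      # cb: (codebook_size, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        k = int(np.argmin(dists))             # nearest code to the residual
        indices.append(k)
        recon = recon + cb[k]
        residual = residual - cb[k]
    return indices, recon

# Toy usage: a flattened 4-step, 2-DoF action chunk (hypothetical shape)
# quantized by a stack of two random codebooks.
rng = np.random.default_rng(0)
chunk = rng.normal(size=8)
codebooks = [rng.normal(size=(16, 8)) for _ in range(2)]
idx, recon = rvq_encode(chunk, codebooks)
```

A chunk is thus represented by one small index per stage, which is where the high compression ratio comes from; with trained (rather than random) codebooks, each additional stage refines the reconstruction of the remaining residual.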
Block-wise Autoregressive Decoding with Lightweight Action Expert
The method introduces block-wise autoregressive decoding, which predicts multiple action tokens in parallel within each block and thereby cuts the number of sequential inference steps. A lightweight action expert module bridges the modality gap between linguistic reasoning and continuous control while keeping the parameter count low.
[53] Carp: Visuomotor policy learning via coarse-to-fine autoregressive prediction
[1] Unified Vision-Language-Action Model
[4] FAST: Efficient Action Tokenization for Vision-Language-Action Models
[31] LLaDA-VLA: Vision Language Diffusion Action Models
[36] FASTer: Toward Efficient Autoregressive Vision Language Action Modeling via Neural Action Tokenization
[51] WorldVLA: Towards Autoregressive Action World Model
[52] SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
[54] Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge
[55] Pure vision language action (vla) models: A comprehensive survey
[56] Handsonvlm: Vision-language models for hand-object interaction prediction
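The inference-step saving from block-wise decoding is easy to see in a toy sketch. `predict_block` below is a hypothetical stand-in for one forward pass of the model that emits a whole block of action tokens; only the step-counting logic reflects the idea, not any particular model.

```python
def blockwise_decode(predict_block, prompt, num_tokens, block_size):
    """Autoregressive decoding, block_size tokens per forward pass.

    predict_block(context, n) must return n new tokens from a single
    pass; here it is a caller-supplied stand-in for the VLA backbone.
    """
    tokens = list(prompt)
    steps = 0                                    # forward passes used
    while len(tokens) - len(prompt) < num_tokens:
        n = min(block_size, num_tokens - (len(tokens) - len(prompt)))
        tokens.extend(predict_block(tokens, n))  # n tokens, one pass
        steps += 1
    return tokens[len(prompt):], steps

# Dummy model: the next token is just the current sequence length.
dummy = lambda ctx, n: [len(ctx) + i for i in range(n)]
out, steps = blockwise_decode(dummy, prompt=[0], num_tokens=8, block_size=4)
# 8 tokens in 2 forward passes instead of 8 token-by-token passes
```

With block_size = 1 the same loop reduces to standard token-by-token autoregression, so the number of sequential passes shrinks by roughly a factor of block_size.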
Comprehensive Benchmark for Action Tokenization in VLAs
The authors build an extensive evaluation framework spanning multiple real-world robotic platforms and simulated environments to systematically analyze action-tokenization methods for vision-language-action models, and report consistently strong performance across diverse embodiments and tasks.