Abstract:

Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce \textbf{FASTer}, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 2

Research Landscape Overview

Core task: action tokenization for vision-language-action models. Vision-language-action (VLA) models integrate visual perception, language understanding, and robotic control by converting continuous action spaces into discrete or learned representations that can be processed alongside text and image tokens. The field's taxonomy reveals several major branches: Action Tokenization Methods explore how to represent actions—whether through discrete vector quantization (VQ-VLA[39], FAST[4], FASTer[0]), latent embeddings (Latent Actions[14]), or semantic abstractions (Semantic Tokenization[45])—while VLA Architecture and Multimodal Integration addresses how to fuse vision and language backbones with action prediction heads (RT-2[7], 3D-VLA[6], SpatialVLA[2]). Training and Adaptation branches cover fine-tuning strategies (Fine-Tuning VLA[8], Preserving Pretrained[18]) and cross-embodiment transfer (Embodiment Transfer[12]), whereas Inference Optimization focuses on efficiency gains through caching (VLA-Cache[13]), pruning (VLA-Pruner[37]), and asynchronous processing (AsyncVLA[16]). Temporal Modeling examines multi-frame reasoning, Evaluation establishes benchmarks, and Surveys (VLA Recipe Survey[17], Action Tokenization Survey[42]) synthesize methodological insights.

A particularly active line of work centers on discrete action tokenization via vector quantization, where methods like FAST[4] and VQ-VLA[39] learn compact codebooks to represent continuous actions as discrete tokens compatible with language model architectures. FASTer[0] sits squarely within this cluster, extending vector quantization-based tokenization to improve codebook utilization and reconstruction fidelity. Nearby, FASTer Neural[36] explores neural variants of the same approach, while Object-Agent Tokenization[3] emphasizes object-centric representations that complement action discretization.
These discrete tokenization strategies contrast with latent action approaches (Latent Actions[14], CronusVLA Latent[41]) that embed actions in continuous spaces, and with methods that treat actions as natural language sequences (Actions as Language[46]). The trade-offs revolve around expressiveness versus compatibility with pretrained language models, with discrete tokenization offering seamless integration at the cost of potential quantization error, a challenge FASTer[0] addresses through refined codebook learning.

Claimed Contributions

FASTerVQ: Learnable Action Tokenizer with Transformer-based RVQ

FASTerVQ is a neural action tokenizer that encodes action chunks as single-channel images using transformer-based residual vector quantization. It achieves high compression ratios while maintaining reconstruction fidelity by capturing global spatio-temporal dependencies and modeling actions in both temporal and frequency domains.
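The core mechanism named here, residual vector quantization (RVQ), can be illustrated with a minimal sketch: each stage quantizes the residual left by the previous stage, so a few small codebooks compose into a fine-grained code. Everything below (codebook sizes, vector dimensions, the nearest-neighbour rule) is an illustrative assumption about generic RVQ, not FASTerVQ's actual transformer-based configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Quantize vector x with a stack of codebooks; each stage
    quantizes the residual left by the previous stage."""
    residual = x.astype(np.float64)
    codes = []
    recon = np.zeros_like(residual)
    for cb in codebooks:
        # pick the codeword nearest to the current residual
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        recon += cb[idx]
        residual = residual - cb[idx]
    return codes, recon

# Toy setup: 3 stages of 16 codewords over 8-dim action vectors.
dim, n_codes, n_stages = 8, 16, 3
codebooks = [rng.normal(size=(n_codes, dim)) for _ in range(n_stages)]
x = rng.normal(size=dim)
codes, recon = rvq_encode(x, codebooks)
```

With 3 stages of 16 codewords, the effective codebook has 16^3 = 4096 entries while only 48 vectors are stored, which is the compression-versus-fidelity lever the contribution describes.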

8 retrieved papers
Can Refute
Block-wise Autoregressive Decoding with Lightweight Action Expert

The method introduces block-wise autoregressive decoding that predicts multiple tokens in parallel within each block, reducing inference steps. A lightweight action expert module is added to bridge the modality gap between linguistic reasoning and continuous control while maintaining parameter efficiency.

10 retrieved papers
Can Refute
Comprehensive Benchmark for Action Tokenization in VLAs

The authors create an extensive evaluation framework spanning multiple real-world robotic platforms and simulated environments to systematically analyze action tokenization methods for vision-language-action models, demonstrating superior performance across diverse embodiments and tasks.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

FASTerVQ: Learnable Action Tokenizer with Transformer-based RVQ

FASTerVQ is a neural action tokenizer that encodes action chunks as single-channel images using transformer-based residual vector quantization. It achieves high compression ratios while maintaining reconstruction fidelity by capturing global spatio-temporal dependencies and modeling actions in both temporal and frequency domains.

Contribution

Block-wise Autoregressive Decoding with Lightweight Action Expert

The method introduces block-wise autoregressive decoding that predicts multiple tokens in parallel within each block, reducing inference steps. A lightweight action expert module is added to bridge the modality gap between linguistic reasoning and continuous control while maintaining parameter efficiency.

Contribution

Comprehensive Benchmark for Action Tokenization in VLAs

The authors create an extensive evaluation framework spanning multiple real-world robotic platforms and simulated environments to systematically analyze action tokenization methods for vision-language-action models, demonstrating superior performance across diverse embodiments and tasks.

FASTer: Toward Powerful and Efficient Autoregressive Vision–Language–Action Models with Learnable Action Tokenizer and Block-wise Decoding | Novelty Validation