Abstract:

Autoregressive vision-language-action (VLA) models have recently demonstrated strong capabilities in robotic manipulation. However, their core process of action tokenization often involves a trade-off between reconstruction fidelity and inference efficiency. We introduce \textbf{FASTer}, a unified framework for efficient and generalizable robot learning that integrates a learnable tokenizer with an autoregressive policy built upon it. FASTerVQ encodes action chunks as single-channel images, capturing global spatio-temporal dependencies while maintaining a high compression ratio. FASTerVLA builds on this tokenizer with block-wise autoregressive decoding and a lightweight action expert, achieving both faster inference and higher task performance. Extensive experiments across simulated and real-world benchmarks show that FASTerVQ delivers superior reconstruction quality, high token utilization, and strong cross-task and cross-embodiment generalization, while FASTerVLA further improves overall capability, surpassing previous state-of-the-art VLA models in both inference speed and task performance.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 2

Research Landscape Overview

Core task: action tokenization for vision-language-action models. Vision-language-action (VLA) models integrate visual perception, language understanding, and robotic control by converting continuous action spaces into discrete or learned representations that can be processed alongside text and image tokens. The field's taxonomy reveals several major branches: Action Tokenization Methods explore how to represent actions—whether through discrete vector quantization (VQ-VLA[39], FAST[4], FASTer[0]), latent embeddings (Latent Actions[14]), or semantic abstractions (Semantic Tokenization[45])—while VLA Architecture and Multimodal Integration addresses how to fuse vision and language backbones with action prediction heads (RT-2[7], 3D-VLA[6], SpatialVLA[2]). Training and Adaptation branches cover fine-tuning strategies (Fine-Tuning VLA[8], Preserving Pretrained[18]) and cross-embodiment transfer (Embodiment Transfer[12]), whereas Inference Optimization focuses on efficiency gains through caching (VLA-Cache[13]), pruning (VLA-Pruner[37]), and asynchronous processing (AsyncVLA[16]). Temporal Modeling examines multi-frame reasoning, Evaluation establishes benchmarks, and Surveys (VLA Recipe Survey[17], Action Tokenization Survey[42]) synthesize methodological insights.

A particularly active line of work centers on discrete action tokenization via vector quantization, where methods like FAST[4] and VQ-VLA[39] learn compact codebooks to represent continuous actions as discrete tokens compatible with language model architectures. FASTer[0] sits squarely within this cluster, extending vector quantization-based tokenization to improve codebook utilization and reconstruction fidelity. Nearby, FASTer Neural[36] explores neural variants of the same approach, while Object-Agent Tokenization[3] emphasizes object-centric representations that complement action discretization.
These discrete tokenization strategies contrast with latent action approaches (Latent Actions[14], CronusVLA Latent[41]) that embed actions in continuous spaces, and with methods that treat actions as natural language sequences (Actions as Language[46]). The trade-offs revolve around expressiveness versus compatibility with pretrained language models, with discrete tokenization offering seamless integration at the cost of potential quantization error, a challenge FASTer[0] addresses through refined codebook learning.

Claimed Contributions

FASTerVQ: Learnable Action Tokenizer with Transformer-based RVQ

FASTerVQ is a neural action tokenizer that encodes action chunks as single-channel images using transformer-based residual vector quantization. It achieves high compression ratios while maintaining reconstruction fidelity by capturing global spatio-temporal dependencies and modeling actions in both temporal and frequency domains.
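The core mechanism named here, residual vector quantization (RVQ), can be illustrated with a minimal sketch: each stage quantizes the residual left by the previous stage, so a few small codebooks compose into a fine-grained code. Everything below (codebook sizes, vector dimensions, the nearest-neighbour rule) is an illustrative assumption about generic RVQ, not FASTerVQ's actual transformer-based configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Quantize vector x with a stack of codebooks; each stage
    quantizes the residual left by the previous stage."""
    residual = x.astype(np.float64)
    codes = []
    recon = np.zeros_like(residual)
    for cb in codebooks:
        # pick the codeword nearest to the current residual
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)
        recon += cb[idx]
        residual = residual - cb[idx]
    return codes, recon

# Toy setup: 3 stages of 16 codewords over 8-dim action vectors.
dim, n_codes, n_stages = 8, 16, 3
codebooks = [rng.normal(size=(n_codes, dim)) for _ in range(n_stages)]
x = rng.normal(size=dim)
codes, recon = rvq_encode(x, codebooks)
```

With 3 stages of 16 codewords, the effective codebook has 16^3 = 4096 entries while only 48 vectors are stored, which is the compression-versus-fidelity lever the contribution describes.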

8 retrieved papers
Can Refute
Block-wise Autoregressive Decoding with Lightweight Action Expert

The method introduces block-wise autoregressive decoding that predicts multiple tokens in parallel within each block, reducing inference steps. A lightweight action expert module is added to bridge the modality gap between linguistic reasoning and continuous control while maintaining parameter efficiency.

10 retrieved papers
Can Refute
Comprehensive Benchmark for Action Tokenization in VLAs

The authors create an extensive evaluation framework spanning multiple real-world robotic platforms and simulated environments to systematically analyze action tokenization methods for vision-language-action models, demonstrating superior performance across diverse embodiments and tasks.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

FASTerVQ: Learnable Action Tokenizer with Transformer-based RVQ

FASTerVQ is a neural action tokenizer that encodes action chunks as single-channel images using transformer-based residual vector quantization. It achieves high compression ratios while maintaining reconstruction fidelity by capturing global spatio-temporal dependencies and modeling actions in both temporal and frequency domains.

Contribution

Block-wise Autoregressive Decoding with Lightweight Action Expert

The method introduces block-wise autoregressive decoding that predicts multiple tokens in parallel within each block, reducing inference steps. A lightweight action expert module is added to bridge the modality gap between linguistic reasoning and continuous control while maintaining parameter efficiency.

Contribution

Comprehensive Benchmark for Action Tokenization in VLAs

The authors create an extensive evaluation framework spanning multiple real-world robotic platforms and simulated environments to systematically analyze action tokenization methods for vision-language-action models, demonstrating superior performance across diverse embodiments and tasks.

FASTer: Toward Powerful and Efficient Autoregressive Vision–Language–Action Models with Learnable Action Tokenizer and Block-wise Decoding | Novelty Validation