RoboOmni: Actions Are Just Another Modality for Your Vision-Language Models

ICLR 2026 Conference Submission · Anonymous Authors
Vision Language Action Model · Multi-Modal Learning · Manipulation
Abstract:

Integrating Vision-Language Models (VLMs) into robotics has enabled building generalizable Vision-Language Action (VLA) models for robotic manipulation. While decoupled designs with a separate action expert often outperform unified frameworks, the latter (e.g., OpenVLA) offer an appealing, conceptually integrated architecture. Nevertheless, current unified approaches typically suffer from poor historical context integration and distribution shift because they cannot predict action chunks.

We introduce RoboOmni, a unified multi-modal next-token prediction framework for robotic manipulation designed to overcome these issues. Compared with decoupled approaches, RoboOmni unifies the multi-modal representations and minimizes the distribution gap between vision-language pretraining and action finetuning. In contrast to prior unified approaches, RoboOmni introduces an action chunking mechanism, Multi-Token Action Prediction (MTAP), which supports both the FAST and Bin tokenizers and crucially alleviates action distribution shift when training on noisy real-world data. Moreover, by preserving the original VLM training pipeline, RoboOmni naturally supports co-training with multi-modal data and various VLM optimization techniques (e.g., fast inference optimization), which significantly improves its generalization capabilities and extensibility.

We conduct extensive experiments on both the CALVIN benchmark and a real-world robot, demonstrating state-of-the-art (SOTA) performance. Our MTAP implementation with the FAST tokenizer achieves a 94.4% average success rate on CALVIN. Furthermore, our Bin tokenizer implementation, deployed with existing VLM serving frameworks such as SGLang, achieves a 27× inference speedup over OpenVLA.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

RoboOmni proposes a unified multi-modal next-token prediction framework for robotic manipulation, introducing Multi-Token Action Prediction (MTAP) to enable discrete action chunking within a single architecture. The paper resides in the 'Unified End-to-End Frameworks' leaf, which contains five papers in total, including OpenVLA and RT-2. This leaf represents a moderately populated research direction within the broader 'Model Architecture and Design' branch, suggesting active but not overcrowded exploration of unified VLA architectures that integrate vision, language, and action prediction without decoupled components.

The taxonomy reveals neighboring research directions that contextualize RoboOmni's positioning. Adjacent leaves include 'Hierarchical and Modular Architectures' (two papers emphasizing decomposed planning and perception) and 'Generative and Diffusion-Based Models' (three papers using diffusion processes for action prediction). The 'Reasoning and Cognitive Capabilities' branch explores explicit chain-of-thought and memory mechanisms, while 'Training Paradigms' addresses web-scale pretraining and demonstration learning. RoboOmni's unified design diverges from hierarchical approaches by avoiding modular decomposition, yet shares conceptual ground with generative models through its next-token prediction paradigm.

Among the thirty candidates examined, the contribution-level analysis reveals mixed novelty signals. For the MTAP mechanism for discrete action chunking, ten candidates were examined and one appears to refute novelty, suggesting that some prior work addresses action chunking in unified frameworks. For the unified multi-modal framework itself, ten candidates were likewise examined with one refutable match, indicating that overlapping architectural concepts exist. For the multi-modal co-training strategy, ten candidates were examined with two refutable instances, pointing to established precedents for joint vision-language-action training. These statistics reflect a limited search scope rather than exhaustive coverage, but they suggest incremental refinement over existing unified VLA approaches.

Based on the top-thirty semantic matches examined, RoboOmni appears to offer architectural refinements within an established research direction rather than pioneering entirely new territory. The analysis does not cover broader literature beyond these candidates, and the taxonomy shows this is one of several active unified framework efforts. The contribution's distinctiveness likely hinges on specific implementation details of MTAP and co-training that differentiate it from sibling papers like OpenVLA, though the limited search scope prevents definitive assessment of these nuances.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 4

Research Landscape Overview

Core task: Vision-language-action modeling for robotic manipulation. This field integrates visual perception, natural language understanding, and action generation to enable robots to perform complex manipulation tasks guided by human instructions.

The taxonomy reveals a rich landscape organized around several major themes. Model Architecture and Design encompasses unified end-to-end frameworks that directly map multimodal inputs to actions, exemplified by works like RT-2[16] and OpenVLA[48], as well as specialized architectural innovations such as RoboMamba[15]. Reasoning and Cognitive Capabilities explores how models can perform chain-of-thought planning and hierarchical decision-making, while Efficiency and Optimization addresses practical deployment concerns through compression and acceleration techniques seen in TinyVLA[1] and BitVLA[2]. Training Paradigms and Data Utilization examines how models leverage diverse datasets and learning strategies, and Multimodal Input and Interaction investigates the fusion of vision, language, and additional sensory modalities. Action Representation and Generation focuses on how policies output executable robot commands, while the Specialized Applications, Evaluation and Benchmarking, and Representation Learning branches round out the taxonomy by addressing domain-specific challenges, systematic testing, and foundational pretraining methods.

Recent work has particularly concentrated on balancing model expressiveness with computational efficiency, and on improving generalization across diverse manipulation scenarios. RoboOmni[0] sits within the Unified End-to-End Frameworks branch alongside neighbors like Unified VLA[49] and InstructVLA[50], emphasizing holistic architectures that process vision and language jointly to produce actions without modular decomposition.
Compared to Pi Zero[3] and Physically Grounded VLM[5], which may incorporate explicit physical reasoning or grounding mechanisms, RoboOmni[0] appears to prioritize seamless integration of multimodal streams in a single forward pass. This design choice contrasts with approaches that separate perception, planning, and control stages, reflecting an ongoing debate about whether end-to-end learning or compositional modularity better supports robust, generalizable manipulation policies. The field continues to grapple with trade-offs between model scale, data requirements, and real-world adaptability.

Claimed Contributions

Multi-Token Action Prediction (MTAP) mechanism for discrete action chunking

The authors introduce MTAP, a mechanism that enables discrete tokenizers to predict multiple future action steps in parallel. This resolves temporal modeling limitations and reduces compounding errors inherent in sequential autoregressive prediction, supporting both binning-based and frequency-domain (FAST) tokenization schemes.

10 retrieved papers · Can Refute
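The binning-based side of this mechanism can be sketched as follows. This is an illustrative, hypothetical implementation, not the paper's code: it assumes actions are normalized to [-1, 1] and discretized into 256 uniform bins, and it shows how a whole action chunk is flattened into one block of discrete tokens that can be predicted jointly rather than one step at a time.

```python
import numpy as np

class BinActionTokenizer:
    """Illustrative uniform-binning action tokenizer: maps each
    continuous action dimension (normalized to [-1, 1]) to one of
    `n_bins` discrete token ids, and flattens a whole action chunk
    into a single token block."""

    def __init__(self, n_bins: int = 256):
        self.n_bins = n_bins
        self.edges = np.linspace(-1.0, 1.0, n_bins + 1)
        self.centers = (self.edges[:-1] + self.edges[1:]) / 2.0

    def encode(self, chunk: np.ndarray) -> np.ndarray:
        # chunk: (chunk_len, action_dim), values in [-1, 1].
        # Interior edges only, so ids land in [0, n_bins - 1].
        ids = np.digitize(chunk, self.edges[1:-1])
        # Flattening the chunk means all future steps are emitted as
        # one token block rather than by step-by-step autoregression.
        return ids.reshape(-1)

    def decode(self, token_ids: np.ndarray, action_dim: int) -> np.ndarray:
        # Map each token id back to its bin center.
        return self.centers[token_ids].reshape(-1, action_dim)

tok = BinActionTokenizer(n_bins=256)
chunk = np.array([[0.10, -0.50, 0.90],
                  [0.00, 0.30, -1.00]])   # 2 future steps x 3 action dims
ids = tok.encode(chunk)                    # 6 token ids for the whole chunk
recon = tok.decode(ids, action_dim=3)
```

Decoding recovers each action up to half a bin width (here 1/256 in normalized units), which is the usual precision trade-off of discrete bin tokenizers.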
RoboOmni unified multi-modal next-token prediction framework

The authors present RoboOmni, a framework that treats actions as another modality within VLMs. By preserving the original VLM training pipeline, it naturally supports co-training with multi-modal information and leverages VLM optimization techniques, improving generalization and extensibility.

10 retrieved papers · Can Refute
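The "actions as another modality" idea can be sketched as follows: discrete action bins become extra token ids appended after the text vocabulary, so a single output head and the ordinary next-token loss cover both text and actions. The vocabulary size, bin count, and helper names here are assumptions for illustration, not the paper's actual configuration.

```python
# Hypothetical sketch of actions as extra vocabulary tokens.
TEXT_VOCAB_SIZE = 32_000   # assumed base VLM vocabulary size
N_ACTION_BINS = 256        # assumed discrete bins per action dimension
ACTION_TOKEN_OFFSET = TEXT_VOCAB_SIZE  # action ids follow the text ids

def action_bin_to_token(bin_id: int) -> int:
    """Map a discrete action bin to its id in the extended vocabulary."""
    assert 0 <= bin_id < N_ACTION_BINS
    return ACTION_TOKEN_OFFSET + bin_id

def token_to_action_bin(token_id: int) -> int:
    """Inverse mapping, used when decoding sampled action tokens."""
    assert ACTION_TOKEN_OFFSET <= token_id < ACTION_TOKEN_OFFSET + N_ACTION_BINS
    return token_id - ACTION_TOKEN_OFFSET

def build_training_sequence(prompt_ids, action_bin_ids):
    """Concatenate text-prompt tokens with action tokens; a standard
    causal LM loss over the action suffix then trains the policy
    through the unchanged VLM pipeline."""
    return list(prompt_ids) + [action_bin_to_token(b) for b in action_bin_ids]

seq = build_training_sequence([5, 17, 42], [0, 128, 255])
# seq == [5, 17, 42, 32000, 32128, 32255]
```

Because actions live in the same vocabulary as text, the model needs no separate action head, which is what allows standard VLM serving and optimization stacks to be reused unchanged.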
Multi-modal co-training strategy with vision-language tasks

The authors incorporate auxiliary vision-language tasks such as visual grounding, point trace prediction, and visual question answering into the training process. This co-training strategy enhances spatial understanding, temporal reasoning, and semantic comprehension, improving the model's generalization capabilities.

10 retrieved papers · Can Refute
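A minimal sketch of such a co-training mixture, assuming a weighted sampler over tasks: the task names follow the description above, while the weights and the sampler itself are hypothetical, not the paper's actual configuration.

```python
import random

# Illustrative co-training mixture: each training example is drawn
# either from robot-action data or from an auxiliary vision-language
# task. Weights are assumed for illustration only.
TASK_WEIGHTS = {
    "action_prediction": 0.60,
    "visual_grounding": 0.15,
    "point_trace_prediction": 0.10,
    "visual_question_answering": 0.15,
}

def sample_task(rng: random.Random) -> str:
    """Draw one task name according to the mixture weights."""
    tasks, weights = zip(*TASK_WEIGHTS.items())
    return rng.choices(tasks, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {task: 0 for task in TASK_WEIGHTS}
for _ in range(10_000):
    counts[sample_task(rng)] += 1
# The empirical mixture roughly tracks the configured weights, so the
# action objective dominates while auxiliary VL tasks stay present.
```

Because every task is phrased as next-token prediction, mixing them requires no architectural change; only the data sampler differs between pure action finetuning and co-training.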

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Multi-Token Action Prediction (MTAP) mechanism for discrete action chunking

Contribution

RoboOmni unified multi-modal next-token prediction framework

Contribution

Multi-modal co-training strategy with vision-language tasks
