RoboOmni: Actions Are Just Another Modality for Your Vision-Language Models

ICLR 2026 Conference Submission · Anonymous Authors
Vision Language Action Model · Multi-Modal Learning · Manipulation
Abstract:

Integrating Vision-Language Models (VLMs) into robotics has enabled building generalizable Vision-Language Action (VLA) models for robotic manipulation. While decoupled designs with a separate action expert often outperform unified frameworks, the latter (e.g., OpenVLA) offer an appealing, conceptually integrated architecture. Nevertheless, current unified approaches typically suffer from poor historical context integration and distribution shift because they cannot predict action chunks.

We introduce RoboOmni, a unified multi-modal next-token prediction framework for robotic manipulation designed to overcome these issues. Compared with decoupled approaches, RoboOmni unifies the multi-modal representations and minimizes the distribution gap between vision-language pretraining and action finetuning. In contrast to prior unified approaches, RoboOmni introduces an action chunking mechanism, Multi-Token Action Prediction (MTAP), which supports both the FAST and Bin tokenizers and crucially alleviates action distribution shift when training on noisy real-world data. Moreover, by preserving the original VLM training pipeline, RoboOmni naturally supports co-training with multi-modal data and various VLM optimization techniques (e.g., fast inference optimization), which significantly improves its generalization capabilities and extensibility.

We conduct extensive experiments on both the CALVIN benchmark and a real-world robot, demonstrating state-of-the-art (SOTA) performance. Our MTAP implementation with the FAST tokenizer achieves a 94.4% average success rate on CALVIN. Furthermore, our Bin tokenizer implementation, deployed with existing VLM serving frameworks such as SGLang, achieves a 27× inference speedup over OpenVLA.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

RoboOmni proposes a unified multi-modal next-token prediction framework for robotic manipulation, introducing Multi-Token Action Prediction (MTAP) to enable discrete action chunking within a single architecture. The paper resides in the 'Unified End-to-End Frameworks' leaf, which contains five papers in total, including OpenVLA and RT-2. This leaf represents a moderately populated research direction within the broader 'Model Architecture and Design' branch, suggesting active but not overcrowded exploration of unified VLA architectures that integrate vision, language, and action prediction without decoupled components.

The taxonomy reveals neighboring research directions that contextualize RoboOmni's positioning. Adjacent leaves include 'Hierarchical and Modular Architectures' (two papers emphasizing decomposed planning and perception) and 'Generative and Diffusion-Based Models' (three papers using diffusion processes for action prediction). The 'Reasoning and Cognitive Capabilities' branch explores explicit chain-of-thought and memory mechanisms, while 'Training Paradigms' addresses web-scale pretraining and demonstration learning. RoboOmni's unified design diverges from hierarchical approaches by avoiding modular decomposition, yet shares conceptual ground with generative models through its next-token prediction paradigm.

Among the thirty candidates examined, the contribution-level analysis reveals mixed novelty signals. For the MTAP mechanism for discrete action chunking, ten candidates were examined and one appears to refute novelty, suggesting that some prior work addresses action chunking in unified frameworks. For the unified multi-modal framework itself, ten candidates were likewise examined with one refutable match, indicating that overlapping architectural concepts exist. For the multi-modal co-training strategy, ten candidates were examined with two refutable instances, pointing to established precedents for joint vision-language-action training. These statistics reflect a limited search scope rather than exhaustive coverage, but they suggest incremental refinement over existing unified VLA approaches.

Based on the top-thirty semantic matches examined, RoboOmni appears to offer architectural refinements within an established research direction rather than pioneering entirely new territory. The analysis does not cover broader literature beyond these candidates, and the taxonomy shows this is one of several active unified framework efforts. The contribution's distinctiveness likely hinges on specific implementation details of MTAP and co-training that differentiate it from sibling papers like OpenVLA, though the limited search scope prevents definitive assessment of these nuances.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 4

Research Landscape Overview

Core task: Vision-language-action modeling for robotic manipulation. This field integrates visual perception, natural language understanding, and action generation to enable robots to perform complex manipulation tasks guided by human instructions.

The taxonomy reveals a rich landscape organized around several major themes. Model Architecture and Design encompasses unified end-to-end frameworks that directly map multimodal inputs to actions, exemplified by works like RT-2[16] and OpenVLA[48], as well as specialized architectural innovations such as RoboMamba[15]. Reasoning and Cognitive Capabilities explores how models can perform chain-of-thought planning and hierarchical decision-making, while Efficiency and Optimization addresses practical deployment concerns through compression and acceleration techniques seen in TinyVLA[1] and BitVLA[2]. Training Paradigms and Data Utilization examines how models leverage diverse datasets and learning strategies, and Multimodal Input and Interaction investigates the fusion of vision, language, and additional sensory modalities. Action Representation and Generation focuses on how policies output executable robot commands, while the Specialized Applications, Evaluation and Benchmarking, and Representation Learning branches round out the taxonomy by addressing domain-specific challenges, systematic testing, and foundational pretraining methods.

Recent work has particularly concentrated on balancing model expressiveness with computational efficiency, and on improving generalization across diverse manipulation scenarios. RoboOmni[0] sits within the Unified End-to-End Frameworks branch alongside neighbors like Unified VLA[49] and InstructVLA[50], emphasizing holistic architectures that process vision and language jointly to produce actions without modular decomposition.
Compared to Pi Zero[3] and Physically Grounded VLM[5], which may incorporate explicit physical reasoning or grounding mechanisms, RoboOmni[0] appears to prioritize seamless integration of multimodal streams in a single forward pass. This design choice contrasts with approaches that separate perception, planning, and control stages, reflecting an ongoing debate about whether end-to-end learning or compositional modularity better supports robust, generalizable manipulation policies. The field continues to grapple with trade-offs between model scale, data requirements, and real-world adaptability.

Claimed Contributions

Multi-Token Action Prediction (MTAP) mechanism for discrete action chunking

The authors introduce MTAP, a mechanism that enables discrete tokenizers to predict multiple future action steps in parallel. This resolves temporal modeling limitations and reduces compounding errors inherent in sequential autoregressive prediction, supporting both binning-based and frequency-domain (FAST) tokenization schemes.

10 retrieved papers · Can Refute
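The binning-based side of this mechanism can be sketched as follows. This is an illustrative, hypothetical implementation, not the paper's code: it assumes actions are normalized to [-1, 1] and discretized into 256 uniform bins, and it shows how a whole action chunk is flattened into one block of discrete tokens that can be predicted jointly rather than one step at a time.

```python
import numpy as np

class BinActionTokenizer:
    """Illustrative uniform-binning action tokenizer: maps each
    continuous action dimension (normalized to [-1, 1]) to one of
    `n_bins` discrete token ids, and flattens a whole action chunk
    into a single token block."""

    def __init__(self, n_bins: int = 256):
        self.n_bins = n_bins
        self.edges = np.linspace(-1.0, 1.0, n_bins + 1)
        self.centers = (self.edges[:-1] + self.edges[1:]) / 2.0

    def encode(self, chunk: np.ndarray) -> np.ndarray:
        # chunk: (chunk_len, action_dim), values in [-1, 1].
        # Interior edges only, so ids land in [0, n_bins - 1].
        ids = np.digitize(chunk, self.edges[1:-1])
        # Flattening the chunk means all future steps are emitted as
        # one token block rather than by step-by-step autoregression.
        return ids.reshape(-1)

    def decode(self, token_ids: np.ndarray, action_dim: int) -> np.ndarray:
        # Map each token id back to its bin center.
        return self.centers[token_ids].reshape(-1, action_dim)

tok = BinActionTokenizer(n_bins=256)
chunk = np.array([[0.10, -0.50, 0.90],
                  [0.00, 0.30, -1.00]])   # 2 future steps x 3 action dims
ids = tok.encode(chunk)                    # 6 token ids for the whole chunk
recon = tok.decode(ids, action_dim=3)
```

Decoding recovers each action up to half a bin width (here 1/256 in normalized units), which is the usual precision trade-off of discrete bin tokenizers.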
RoboOmni unified multi-modal next-token prediction framework

The authors present RoboOmni, a framework that treats actions as another modality within VLMs. By preserving the original VLM training pipeline, it naturally supports co-training with multi-modal information and leverages VLM optimization techniques, improving generalization and extensibility.

10 retrieved papers · Can Refute
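The "actions as another modality" idea can be sketched as follows: discrete action bins become extra token ids appended after the text vocabulary, so a single output head and the ordinary next-token loss cover both text and actions. The vocabulary size, bin count, and helper names here are assumptions for illustration, not the paper's actual configuration.

```python
# Hypothetical sketch of actions as extra vocabulary tokens.
TEXT_VOCAB_SIZE = 32_000   # assumed base VLM vocabulary size
N_ACTION_BINS = 256        # assumed discrete bins per action dimension
ACTION_TOKEN_OFFSET = TEXT_VOCAB_SIZE  # action ids follow the text ids

def action_bin_to_token(bin_id: int) -> int:
    """Map a discrete action bin to its id in the extended vocabulary."""
    assert 0 <= bin_id < N_ACTION_BINS
    return ACTION_TOKEN_OFFSET + bin_id

def token_to_action_bin(token_id: int) -> int:
    """Inverse mapping, used when decoding sampled action tokens."""
    assert ACTION_TOKEN_OFFSET <= token_id < ACTION_TOKEN_OFFSET + N_ACTION_BINS
    return token_id - ACTION_TOKEN_OFFSET

def build_training_sequence(prompt_ids, action_bin_ids):
    """Concatenate text-prompt tokens with action tokens; a standard
    causal LM loss over the action suffix then trains the policy
    through the unchanged VLM pipeline."""
    return list(prompt_ids) + [action_bin_to_token(b) for b in action_bin_ids]

seq = build_training_sequence([5, 17, 42], [0, 128, 255])
# seq == [5, 17, 42, 32000, 32128, 32255]
```

Because actions live in the same vocabulary as text, the model needs no separate action head, which is what allows standard VLM serving and optimization stacks to be reused unchanged.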
Multi-modal co-training strategy with vision-language tasks

The authors incorporate auxiliary vision-language tasks such as visual grounding, point trace prediction, and visual question answering into the training process. This co-training strategy enhances spatial understanding, temporal reasoning, and semantic comprehension, improving the model's generalization capabilities.

10 retrieved papers · Can Refute
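A minimal sketch of such a co-training mixture, assuming a weighted sampler over tasks: the task names follow the description above, while the weights and the sampler itself are hypothetical, not the paper's actual configuration.

```python
import random

# Illustrative co-training mixture: each training example is drawn
# either from robot-action data or from an auxiliary vision-language
# task. Weights are assumed for illustration only.
TASK_WEIGHTS = {
    "action_prediction": 0.60,
    "visual_grounding": 0.15,
    "point_trace_prediction": 0.10,
    "visual_question_answering": 0.15,
}

def sample_task(rng: random.Random) -> str:
    """Draw one task name according to the mixture weights."""
    tasks, weights = zip(*TASK_WEIGHTS.items())
    return rng.choices(tasks, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {task: 0 for task in TASK_WEIGHTS}
for _ in range(10_000):
    counts[sample_task(rng)] += 1
# The empirical mixture roughly tracks the configured weights, so the
# action objective dominates while auxiliary VL tasks stay present.
```

Because every task is phrased as next-token prediction, mixing them requires no architectural change; only the data sampler differs between pure action finetuning and co-training.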

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Multi-Token Action Prediction (MTAP) mechanism for discrete action chunking

Contribution

RoboOmni unified multi-modal next-token prediction framework

Contribution

Multi-modal co-training strategy with vision-language tasks
