Unified Vision-Language-Action Model

ICLR 2026 Conference Submission | Anonymous Authors
Keywords: world model, robotics, vision-language-action model
Abstract:

Vision-language-action models (VLAs) have garnered significant attention for their potential in advancing robotic manipulation. However, previous approaches predominantly rely on the general comprehension capabilities of vision-language models (VLMs) to generate action signals, often overlooking the rich temporal and causal structure embedded in visual observations. In this paper, we present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences. This tokenized formulation naturally supports flexible multimodal task learning, particularly from large-scale video data, and further demonstrates that generative vision supervision can significantly enhance visual understanding. By incorporating world modeling during post-training, UniVLA captures causal dynamics from videos, facilitating effective transfer to downstream policy learning, especially for long-horizon tasks. Our approach sets new state-of-the-art results across several widely used simulation benchmarks, including CALVIN, LIBERO, and SimplerEnv-Bridge, substantially outperforming prior methods. For example, UniVLA achieves a 95.5% average success rate on the LIBERO benchmark, surpassing π₀-FAST's 85.5%. We further demonstrate its broad applicability through experiments on real-world ALOHA manipulation tasks and autonomous driving scenarios.
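To make the tokenized formulation above concrete, here is a minimal, hypothetical sketch of how continuous robot actions can be binned into a finite vocabulary so that vision, language, and action share one autoregressive token stream. The vocabulary sizes, bin count, and value range are illustrative assumptions, not UniVLA's actual configuration.

```python
# Hypothetical sketch of a shared discrete vocabulary for a VLA model.
# Vocabulary sizes and bin counts below are illustrative assumptions.
import numpy as np

TEXT_VOCAB = 32_000        # assumed size of the text vocabulary
VISION_VOCAB = 8_192       # assumed size of a VQ image-token codebook
ACTION_BINS = 256          # assumed per-dimension action bins

# Action token IDs are placed after the text and vision ID ranges.
ACTION_OFFSET = TEXT_VOCAB + VISION_VOCAB

def tokenize_action(action, low=-1.0, high=1.0, bins=ACTION_BINS):
    """Uniformly bin each action dimension into a discrete token ID."""
    a = np.clip(np.asarray(action, dtype=np.float64), low, high)
    ids = np.floor((a - low) / (high - low) * (bins - 1) + 0.5).astype(int)
    return (ACTION_OFFSET + ids).tolist()

def detokenize_action(token_ids, low=-1.0, high=1.0, bins=ACTION_BINS):
    """Map action token IDs back to bin-center continuous values."""
    ids = np.asarray(token_ids) - ACTION_OFFSET
    return (low + ids / (bins - 1) * (high - low)).tolist()

# A 7-DoF arm action round-trips through the shared vocabulary:
action = [0.1, -0.5, 0.9, 0.0, -1.0, 1.0, 0.3]
tokens = tokenize_action(action)
recovered = detokenize_action(tokens)
```

With 256 bins over [-1, 1], the reconstruction error is bounded by half a bin width, which illustrates the precision trade-off that purely discrete action representations accept in exchange for a single unified sequence model.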

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

UniVLA proposes a unified autoregressive framework that models vision, language, and action as discrete token sequences, incorporating generative vision supervision and world modeling during post-training. The paper resides in the 'Native Multimodal VLA Architectures' leaf, which currently contains no sibling papers in the taxonomy. This positioning suggests the work occupies a relatively sparse research direction within the broader unified multimodal autoregressive modeling branch, distinguishing itself from hybrid paradigms and reasoning-augmented approaches that populate neighboring taxonomy leaves.

The taxonomy reveals several active neighboring directions. The 'Hybrid Action Generation Paradigms' branch explores autoregressive-diffusion combinations and diffusion-based VLA models, while 'Reasoning and Chain-of-Thought Integration' investigates explicit intermediate reasoning steps before action generation. The 'World Modeling and Predictive Dynamics' branch, particularly relevant to UniVLA's post-training approach, contains methods for autoregressive world models and occupancy-based representations. UniVLA's emphasis on native multimodal tokenization and world modeling positions it at the intersection of unified sequence modeling and predictive dynamics, diverging from hybrid architectures that separate discrete and continuous action representations.

Among the 30 candidates examined, the contribution-level analysis reveals mixed novelty signals. The core UniVLA architecture shows overlap with prior work: 2 of its 10 examined candidates provide potentially refuting evidence for the unified modeling framework itself. The unified sequence modeling contribution appears more distinctive, with none of its 10 examined candidates offering refuting evidence, suggesting this aspect may represent a clearer advance. The benchmark performance claims face stronger prior work, with 3 of 10 candidates offering comparable or overlapping results. These statistics reflect a limited semantic search scope rather than exhaustive coverage; substantial additional related work likely exists in this rapidly evolving area.

Based on the top-30 semantic matches and taxonomy structure, UniVLA appears to refine existing unified autoregressive approaches rather than introduce fundamentally new paradigms. The sparse population of its taxonomy leaf suggests either emerging novelty or incomplete literature coverage in this analysis. The world modeling integration during post-training may represent the most distinctive technical contribution, though the limited search scope prevents definitive assessment of how this compares to concurrent developments in predictive dynamics for VLA systems.

Taxonomy

Core-task Taxonomy Papers: 38
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 5

Research Landscape Overview

Core task: unified vision-language-action modeling through autoregressive token sequences. This emerging field seeks to unify perception, language understanding, and action generation within a single autoregressive framework, treating all modalities as discrete token sequences.

The taxonomy reveals a rich landscape organized around several key themes. Action Tokenization and Representation explores how continuous control signals are discretized for autoregressive modeling, while Reasoning and Chain-of-Thought Integration examines methods that interleave explicit reasoning steps with action prediction. Hybrid Action Generation Paradigms investigates architectures that blend discrete token prediction with continuous action heads or diffusion processes, as seen in works like HybridVLA[12] and DiffusionVLA[15]. World Modeling and Predictive Dynamics focuses on learning forward models that predict future states, enabling planning and simulation. Domain-Specific VLA Applications addresses specialized deployments in robotics, autonomous driving, and embodied AI, with examples like OpenDriveVLA[7] and DrivingGPT[9]. Unified Multimodal Autoregressive Modeling, the branch housing the original paper, emphasizes native architectures that process vision, language, and action tokens within a single transformer backbone, exemplified by Unified-IO 2[3] and related systems.

Several active research directions reveal key trade-offs and open questions. One line explores whether pure autoregressive token prediction suffices or whether hybrid approaches combining discrete and continuous representations yield better control precision and sample efficiency. Another examines the role of explicit reasoning: CoT-VLA[4] and DeepThinkVLA[11] demonstrate that chain-of-thought prompting can improve decision quality, yet questions remain about computational overhead and generalization.
Unified VLA[0] sits within the Native Multimodal VLA Architectures cluster, emphasizing end-to-end autoregressive modeling without hybrid components. Compared to Unified-IO 2[3], which pioneered broad multimodal unification, Unified VLA[0] likely refines architectural choices or training strategies for tighter vision-language-action integration. Relative to reasoning-augmented approaches like CoT-VLA[4], it appears to prioritize streamlined token-level prediction, trading explicit intermediate reasoning for architectural simplicity and potentially faster inference.

Claimed Contributions

UniVLA unified vision-language-action model

The authors introduce UniVLA, a novel architecture that represents vision, language, and action modalities as discrete tokens in a unified vocabulary and models them autoregressively. This design enables tighter cross-modal integration and supports large-scale video-based training, offering an alternative to existing VLA paradigms that rely on separate vision encoders.

10 retrieved papers · Can Refute
Unified sequence modeling framework supporting multimodal tasks

The framework enables diverse multimodal tasks including text-supervised perception grounding, vision-supervised world modeling, and action-supervised policy learning within a single architecture. The authors demonstrate that world model post-training substantially improves performance and efficiency in downstream policy learning across simulation benchmarks, real-world robots, and driving scenarios.
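The multi-task behavior described above can be pictured as one autoregressive sequence with task-dependent loss masking: the same interleaved token stream serves perception grounding, world modeling, or policy learning depending on which modality's tokens receive the next-token prediction loss. The modality tags, sequence layout, and masking scheme below are illustrative assumptions, not UniVLA's published implementation.

```python
# Illustrative sketch (not UniVLA's actual code): one interleaved token
# sequence supports several training objectives via loss masking.
# Modality tags and sequence layout here are assumptions for illustration.

def loss_mask(modalities, mode):
    """Return a 0/1 mask selecting which tokens are supervised:
    vision tokens for world modeling, action tokens for policy
    learning, text tokens for perception grounding."""
    target = {"world_model": "vision", "policy": "action", "grounding": "text"}[mode]
    return [1 if m == target else 0 for m in modalities]

# Assumed layout: language instruction (4 tokens), observation frame
# (6 vision tokens), action chunk (7 tokens), predicted next frame (6 tokens).
seq_modalities = ["text"] * 4 + ["vision"] * 6 + ["action"] * 7 + ["vision"] * 6

wm_mask = loss_mask(seq_modalities, "world_model")  # supervise frame tokens
pi_mask = loss_mask(seq_modalities, "policy")       # supervise action tokens
```

Under this view, world-model post-training and downstream policy learning differ only in the mask, which is one plausible reading of why the former transfers efficiently to the latter.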

10 retrieved papers
State-of-the-art performance on robotic manipulation benchmarks

UniVLA achieves new state-of-the-art results on multiple simulation benchmarks including CALVIN, LIBERO, and SimplerEnv-Bridge, significantly outperforming prior methods. The model also demonstrates effective transfer to real ALOHA platform and autonomous driving scenarios, highlighting its potential for generalist embodied intelligence.

10 retrieved papers · Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: UniVLA unified vision-language-action model
Contribution 2: Unified sequence modeling framework supporting multimodal tasks
Contribution 3: State-of-the-art performance on robotic manipulation benchmarks
