Unified Vision-Language-Action Model

ICLR 2026 Conference Submission | Anonymous Authors
Keywords: world model, robotics, vision-language-action model
Abstract:

Vision-language-action models (VLAs) have garnered significant attention for their potential in advancing robotic manipulation. However, previous approaches predominantly rely on the general comprehension capabilities of vision-language models (VLMs) to generate action signals, often overlooking the rich temporal and causal structure embedded in visual observations. In this paper, we present UniVLA, a unified and native multimodal VLA model that autoregressively models vision, language, and action signals as discrete token sequences. This tokenized formulation naturally supports flexible multimodal task learning, particularly from large-scale video data, and further demonstrates that generative vision supervision can significantly enhance visual understanding. By incorporating world modeling during post-training, UniVLA captures causal dynamics from videos, facilitating effective transfer to downstream policy learning, especially for long-horizon tasks. Our approach sets new state-of-the-art results across several widely used simulation benchmarks, including CALVIN, LIBERO, and SimplerEnv-Bridge, substantially outperforming prior methods. For example, UniVLA achieves a 95.5% average success rate on the LIBERO benchmark, surpassing π₀-FAST's 85.5%. We further demonstrate its broad applicability through experiments on real-world ALOHA manipulation tasks and autonomous driving scenarios.
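To make the tokenized formulation above concrete, here is a minimal, hypothetical sketch of how continuous robot actions can be binned into a finite vocabulary so that vision, language, and action share one autoregressive token stream. The vocabulary sizes, bin count, and value range are illustrative assumptions, not UniVLA's actual configuration.

```python
# Hypothetical sketch of a shared discrete vocabulary for a VLA model.
# Vocabulary sizes and bin counts below are illustrative assumptions.
import numpy as np

TEXT_VOCAB = 32_000        # assumed size of the text vocabulary
VISION_VOCAB = 8_192       # assumed size of a VQ image-token codebook
ACTION_BINS = 256          # assumed per-dimension action bins

# Action token IDs are placed after the text and vision ID ranges.
ACTION_OFFSET = TEXT_VOCAB + VISION_VOCAB

def tokenize_action(action, low=-1.0, high=1.0, bins=ACTION_BINS):
    """Uniformly bin each action dimension into a discrete token ID."""
    a = np.clip(np.asarray(action, dtype=np.float64), low, high)
    ids = np.floor((a - low) / (high - low) * (bins - 1) + 0.5).astype(int)
    return (ACTION_OFFSET + ids).tolist()

def detokenize_action(token_ids, low=-1.0, high=1.0, bins=ACTION_BINS):
    """Map action token IDs back to bin-center continuous values."""
    ids = np.asarray(token_ids) - ACTION_OFFSET
    return (low + ids / (bins - 1) * (high - low)).tolist()

# A 7-DoF arm action round-trips through the shared vocabulary:
action = [0.1, -0.5, 0.9, 0.0, -1.0, 1.0, 0.3]
tokens = tokenize_action(action)
recovered = detokenize_action(tokens)
```

With 256 bins over [-1, 1], the reconstruction error is bounded by half a bin width, which illustrates the precision trade-off that purely discrete action representations accept in exchange for a single unified sequence model.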

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

UniVLA proposes a unified autoregressive framework that models vision, language, and action as discrete token sequences, incorporating generative vision supervision and world modeling during post-training. The paper resides in the 'Native Multimodal VLA Architectures' leaf, which currently contains no sibling papers in the taxonomy. This positioning suggests the work occupies a relatively sparse research direction within the broader unified multimodal autoregressive modeling branch, distinguishing itself from hybrid paradigms and reasoning-augmented approaches that populate neighboring taxonomy leaves.

The taxonomy reveals several active neighboring directions. The 'Hybrid Action Generation Paradigms' branch explores autoregressive-diffusion combinations and diffusion-based VLA models, while 'Reasoning and Chain-of-Thought Integration' investigates explicit intermediate reasoning steps before action generation. The 'World Modeling and Predictive Dynamics' branch, particularly relevant to UniVLA's post-training approach, contains methods for autoregressive world models and occupancy-based representations. UniVLA's emphasis on native multimodal tokenization and world modeling positions it at the intersection of unified sequence modeling and predictive dynamics, diverging from hybrid architectures that separate discrete and continuous action representations.

Among the 30 candidates examined, the contribution-level analysis reveals mixed novelty signals. The core UniVLA architecture shows overlap with prior work: 2 of its 10 examined candidates provide potentially refuting evidence for the unified modeling framework itself. The unified sequence modeling contribution appears more distinctive, with none of its 10 examined candidates offering refuting evidence, suggesting this aspect may represent a clearer advance. The benchmark performance claims face stronger prior work, with 3 of 10 candidates offering comparable or overlapping results. These statistics reflect a limited semantic search scope rather than exhaustive coverage; substantial additional related work likely exists in this rapidly evolving area.

Based on the top-30 semantic matches and taxonomy structure, UniVLA appears to refine existing unified autoregressive approaches rather than introduce fundamentally new paradigms. The sparse population of its taxonomy leaf suggests either emerging novelty or incomplete literature coverage in this analysis. The world modeling integration during post-training may represent the most distinctive technical contribution, though the limited search scope prevents definitive assessment of how this compares to concurrent developments in predictive dynamics for VLA systems.

Taxonomy

Core-task Taxonomy Papers: 38
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 5

Research Landscape Overview

Core task: unified vision-language-action modeling through autoregressive token sequences. This emerging field seeks to unify perception, language understanding, and action generation within a single autoregressive framework, treating all modalities as discrete token sequences.

The taxonomy reveals a rich landscape organized around several key themes. Action Tokenization and Representation explores how continuous control signals are discretized for autoregressive modeling, while Reasoning and Chain-of-Thought Integration examines methods that interleave explicit reasoning steps with action prediction. Hybrid Action Generation Paradigms investigates architectures that blend discrete token prediction with continuous action heads or diffusion processes, as seen in works like HybridVLA[12] and DiffusionVLA[15]. World Modeling and Predictive Dynamics focuses on learning forward models that predict future states, enabling planning and simulation. Domain-Specific VLA Applications addresses specialized deployments in robotics, autonomous driving, and embodied AI, with examples like OpenDriveVLA[7] and DrivingGPT[9]. Unified Multimodal Autoregressive Modeling, the branch housing the original paper, emphasizes native architectures that process vision, language, and action tokens within a single transformer backbone, exemplified by Unified-IO 2[3] and related systems.

Several active research directions reveal key trade-offs and open questions. One line explores whether pure autoregressive token prediction suffices or whether hybrid approaches combining discrete and continuous representations yield better control precision and sample efficiency. Another examines the role of explicit reasoning: CoT-VLA[4] and DeepThinkVLA[11] demonstrate that chain-of-thought prompting can improve decision quality, yet questions remain about computational overhead and generalization.
Unified VLA[0] sits within the Native Multimodal VLA Architectures cluster, emphasizing end-to-end autoregressive modeling without hybrid components. Compared to Unified-IO 2[3], which pioneered broad multimodal unification, Unified VLA[0] likely refines architectural choices or training strategies for tighter vision-language-action integration. Relative to reasoning-augmented approaches like CoT-VLA[4], it appears to prioritize streamlined token-level prediction, trading explicit intermediate reasoning for architectural simplicity and potentially faster inference.

Claimed Contributions

UniVLA unified vision-language-action model

The authors introduce UniVLA, a novel architecture that represents vision, language, and action modalities as discrete tokens in a unified vocabulary and models them autoregressively. This design enables tighter cross-modal integration and supports large-scale video-based training, offering an alternative to existing VLA paradigms that rely on separate vision encoders.

10 retrieved papers · Can Refute
Unified sequence modeling framework supporting multimodal tasks

The framework enables diverse multimodal tasks including text-supervised perception grounding, vision-supervised world modeling, and action-supervised policy learning within a single architecture. The authors demonstrate that world model post-training substantially improves performance and efficiency in downstream policy learning across simulation benchmarks, real-world robots, and driving scenarios.
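The multi-task behavior described above can be pictured as one autoregressive sequence with task-dependent loss masking: the same interleaved token stream serves perception grounding, world modeling, or policy learning depending on which modality's tokens receive the next-token prediction loss. The modality tags, sequence layout, and masking scheme below are illustrative assumptions, not UniVLA's published implementation.

```python
# Illustrative sketch (not UniVLA's actual code): one interleaved token
# sequence supports several training objectives via loss masking.
# Modality tags and sequence layout here are assumptions for illustration.

def loss_mask(modalities, mode):
    """Return a 0/1 mask selecting which tokens are supervised:
    vision tokens for world modeling, action tokens for policy
    learning, text tokens for perception grounding."""
    target = {"world_model": "vision", "policy": "action", "grounding": "text"}[mode]
    return [1 if m == target else 0 for m in modalities]

# Assumed layout: language instruction (4 tokens), observation frame
# (6 vision tokens), action chunk (7 tokens), predicted next frame (6 tokens).
seq_modalities = ["text"] * 4 + ["vision"] * 6 + ["action"] * 7 + ["vision"] * 6

wm_mask = loss_mask(seq_modalities, "world_model")  # supervise frame tokens
pi_mask = loss_mask(seq_modalities, "policy")       # supervise action tokens
```

Under this view, world-model post-training and downstream policy learning differ only in the mask, which is one plausible reading of why the former transfers efficiently to the latter.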

10 retrieved papers
State-of-the-art performance on robotic manipulation benchmarks

UniVLA achieves new state-of-the-art results on multiple simulation benchmarks including CALVIN, LIBERO, and SimplerEnv-Bridge, significantly outperforming prior methods. The model also demonstrates effective transfer to real ALOHA platform and autonomous driving scenarios, highlighting its potential for generalist embodied intelligence.

10 retrieved papers · Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: UniVLA unified vision-language-action model
Contribution 2: Unified sequence modeling framework supporting multimodal tasks
Contribution 3: State-of-the-art performance on robotic manipulation benchmarks
