OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: generalist agent; GUI agent; embodied agent; MoE
Abstract:

Multimodal large language models are progressively advancing toward multimodal agents that can proactively execute tasks. Existing research on multimodal agents primarily targets either GUI or embodied scenarios, corresponding to interactions within the 2D virtual world and the 3D physical world, respectively. However, many real-world tasks inherently require agents to interleave interactions across both types of environments. We initially mix GUI and embodied data to train models, but observe performance degradation caused by data conflicts. Further analysis reveals that GUI and embodied data exhibit synergy at shallow layers but conflict at deep layers, resembling the cerebrum-cerebellum mechanism in the human brain. Motivated by this, we introduce OmniActor, a high-performance generalist agent designed from both structural and data perspectives. First, we propose a Layer-heterogeneous MoE that separates parameters at deep layers to eliminate conflict, while sharing parameters at shallow layers to leverage synergy. This design enables OmniActor to outperform agents trained solely on GUI or embodied data on their respective tasks. Second, we unify the action spaces of GUI and embodied tasks and collect large-scale datasets from diverse sources for training. This substantially enhances the performance of OmniActor across various scenarios, especially GUI tasks. The code will be publicly available.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces OmniActor, a generalist agent designed to operate across both GUI and embodied environments through a Layer-heterogeneous Mixture-of-Experts architecture and unified action space. It resides in the GUI-Embodied Dual-World Agents leaf, which contains only three papers total, indicating a relatively sparse but emerging research direction. This positioning suggests the work targets a specific niche within the broader unified agent training landscape, where most prior efforts focus on either GUI automation or embodied control separately rather than their explicit integration.

The taxonomy reveals that neighboring research directions include Generalist Multi-Embodiment Agents (two papers) focusing on diverse embodied domains with unified tokenization, and Specialized GUI Automation Agents addressing desktop and web control without physical grounding. The paper's dual-world framing distinguishes it from these adjacent categories: it explicitly bridges 2D virtual and 3D physical interaction rather than treating them as separate specializations. The Training Methodologies branch offers complementary perspectives on language-guided and reinforcement learning approaches, but does not directly address the architectural challenge of reconciling conflicting data distributions across modalities.

Among thirty candidates examined, the Layer-heterogeneous MoE architecture shows no clear refutation across ten candidates, suggesting architectural novelty within the limited search scope. However, the unified action space contribution encountered two refutable candidates among ten examined, and the OmniActor generalist agent concept found three refutable candidates among ten, indicating more substantial prior work in these areas. The statistics suggest that while the architectural innovation appears less contested, the broader goals of unified action representation and cross-domain generalist agents have received attention in related literature, though the search scope remains constrained.

Based on the limited thirty-candidate search, the work appears to occupy a sparsely populated research direction with modest architectural novelty but more crowded conceptual territory around unified action spaces and generalist agent design. The analysis does not cover exhaustive citation networks or domain-specific venues, leaving open questions about how the cerebrum-cerebellum-inspired layer separation compares to alternative modular or routing strategies in broader multi-task learning literature.

Taxonomy

Core-task Taxonomy Papers: 11
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 5

Research Landscape Overview

Core task: unified agent training for GUI and embodied tasks.

The field is organized around several complementary branches that together address the challenge of building agents capable of operating across digital and physical environments. Unified Multi-Domain Agent Architectures explore models that can handle both GUI interactions and embodied control, often leveraging shared representations or dual-world training paradigms. Training Methodologies for Agent Learning focus on learning strategies, ranging from reinforcement learning to imitation and self-supervised approaches, that enable agents to acquire skills across diverse task distributions. Specialized GUI Automation Agents concentrate on web navigation, desktop control, and interface understanding, while User-Centric and Personalized Agent Systems emphasize adaptation to individual user preferences. Supporting Infrastructure and Evaluation Platforms provide the benchmarks and simulation environments necessary for reproducible research, and Physical Interaction and Constraint Modeling addresses the unique challenges of grounding actions in real-world physics and spatial reasoning.

Within the Unified Multi-Domain Agent Architectures branch, a particularly active line of work targets GUI-Embodied Dual-World Agents that bridge digital and physical modalities. OmniActor[0] exemplifies this direction by proposing a unified framework for training agents that operate seamlessly in both GUI and embodied settings, addressing the challenge of transferring learned behaviors across these distinct interaction paradigms. This work sits alongside efforts like NaviMaster[1] and Embodied Web Agents[3], which similarly explore cross-domain generalization but may emphasize different aspects of the dual-world problem, such as navigation-centric tasks or web-grounded embodied reasoning. A central open question in this cluster is how to balance domain-specific inductive biases against the flexibility required for true multi-domain competence, with ongoing debate about whether shared architectures or modular designs better support scalable agent learning.

Claimed Contributions

Layer-heterogeneous MoE architecture

A novel mixture-of-experts architecture that shares parameters in shallow layers to exploit synergy between GUI and embodied data, while separating parameters in deep layers to mitigate conflicts arising from action differences. This design is inspired by the cerebrum-cerebellum mechanism in the human brain.
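The shallow-shared, deep-separated design can be illustrated with a toy sketch. This is a hypothetical illustration of the routing idea only, not the paper's implementation: the layer functions below are simple stand-ins for transformer blocks, and the class and function names (`LayerHeterogeneousMoE`, `make_affine`) are invented for this example.

```python
# Hypothetical sketch of the layer-heterogeneous MoE idea: shallow layers
# are shared across domains (to exploit synergy), while each deep layer
# holds one expert per domain ("gui" vs. "embodied") selected by task type
# (to avoid conflict). Layers here are toy stand-ins for transformer blocks.

from typing import Callable, Dict, List

Layer = Callable[[List[float]], List[float]]

def make_affine(scale: float, bias: float) -> Layer:
    """Toy stand-in for a transformer block."""
    return lambda x: [scale * v + bias for v in x]

class LayerHeterogeneousMoE:
    def __init__(self,
                 shared_layers: List[Layer],
                 expert_layers: List[Dict[str, Layer]]):
        # shared_layers: applied to every input regardless of domain.
        # expert_layers: one {domain: layer} dict per deep layer.
        self.shared_layers = shared_layers
        self.expert_layers = expert_layers

    def forward(self, x: List[float], domain: str) -> List[float]:
        for layer in self.shared_layers:      # shallow: shared parameters
            x = layer(x)
        for experts in self.expert_layers:    # deep: domain-specific experts
            x = experts[domain](x)
        return x

model = LayerHeterogeneousMoE(
    shared_layers=[make_affine(2.0, 0.0)],
    expert_layers=[{"gui": make_affine(1.0, 1.0),
                    "embodied": make_affine(1.0, -1.0)}],
)

print(model.forward([1.0], "gui"))       # shared layer, then GUI expert
print(model.forward([1.0], "embodied"))  # shared layer, then embodied expert
```

The same input diverges only after the shared shallow stack, mirroring the claimed synergy-then-conflict pattern across depth.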

10 retrieved papers
Unified action space and large-scale dataset collection

A unified data format and action space that standardizes GUI and embodied tasks, enabling training on large-scale datasets from multiple sources (OS-Atlas, Uground, Aguvis, Aria-UI, and LIBERO). This unification substantially enhances agent performance across various scenarios.
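One way to picture the unification is a single serialization that both GUI and embodied actions map into, so one decoder head can emit either. The schema and action names below (`UnifiedAction`, `click`, `move_gripper`) are illustrative assumptions, not the paper's actual action-space specification.

```python
# Hypothetical sketch of a unified action space: a GUI action (a click at
# normalized screen coordinates) and an embodied action (an end-effector
# delta) serialize into one shared textual format. The schema and action
# names are invented for illustration.

import json
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class UnifiedAction:
    domain: str                          # "gui" or "embodied"
    name: str                            # e.g. "click", "move_gripper"
    args: Dict[str, float] = field(default_factory=dict)

    def serialize(self) -> str:
        # One flat string format shared by both domains.
        return json.dumps({"domain": self.domain,
                           "action": self.name,
                           "args": self.args},
                          sort_keys=True)

# A GUI click and an embodied gripper move share one output format:
click = UnifiedAction("gui", "click", {"x": 0.42, "y": 0.87})
move = UnifiedAction("embodied", "move_gripper",
                     {"dx": 0.01, "dy": 0.0, "dz": -0.02})

print(click.serialize())
print(move.serialize())
```

With a shared format like this, heterogeneous datasets (GUI grounding corpora and robot demonstration data alike) can be converted once and mixed freely at training time.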

10 retrieved papers
Can Refute
OmniActor generalist agent

A generalist multimodal agent capable of performing tasks in both 2D virtual worlds (GUI tasks) and 3D physical worlds (embodied tasks). OmniActor outperforms agents trained solely on GUI or embodied data and surpasses existing generalist agents in both domains.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Layer-heterogeneous MoE architecture

A novel mixture-of-experts architecture that shares parameters in shallow layers to exploit synergy between GUI and embodied data, while separating parameters in deep layers to mitigate conflicts arising from action differences. This design is inspired by the cerebrum-cerebellum mechanism in the human brain.

Contribution

Unified action space and large-scale dataset collection

A unified data format and action space that standardizes GUI and embodied tasks, enabling training on large-scale datasets from multiple sources (OS-Atlas, Uground, Aguvis, Aria-UI, and LIBERO). This unification substantially enhances agent performance across various scenarios.

Contribution

OmniActor generalist agent

A generalist multimodal agent capable of performing tasks in both 2D virtual worlds (GUI tasks) and 3D physical worlds (embodied tasks). OmniActor outperforms agents trained solely on GUI or embodied data and surpasses existing generalist agents in both domains.