OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: generalist agent; GUI agent; embodied agent; MoE
Abstract:

Multimodal large language models are progressively advancing toward multimodal agents that can proactively execute tasks. Existing research on multimodal agents primarily targets either GUI or embodied scenarios, corresponding to interactions within the 2D virtual world and the 3D physical world, respectively. However, many real-world tasks inherently require agents to interleave interactions across both types of environments. We initially mix GUI and embodied data to train models, but observe performance degradation caused by data conflicts. Further analysis reveals that GUI and embodied data exhibit synergy at shallow layers but conflict at deep layers, resembling the cerebrum-cerebellum mechanism in the human brain. Motivated by this, we introduce OmniActor, a high-performance generalist agent designed from both structural and data perspectives. First, we propose a Layer-heterogeneous MoE that separates parameters at deep layers to eliminate conflict, while sharing parameters at shallow layers to leverage synergy. This design enables OmniActor to outperform agents trained solely on GUI or embodied data on their respective tasks. Second, we unify the action spaces of GUI and embodied tasks and collect large-scale datasets from diverse sources for training. This substantially enhances the performance of OmniActor across various scenarios, especially GUI tasks. The code will be publicly available.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces OmniActor, a generalist agent designed to operate across both GUI and embodied environments through a Layer-heterogeneous Mixture-of-Experts architecture and unified action space. It resides in the GUI-Embodied Dual-World Agents leaf, which contains only three papers total, indicating a relatively sparse but emerging research direction. This positioning suggests the work targets a specific niche within the broader unified agent training landscape, where most prior efforts focus on either GUI automation or embodied control separately rather than their explicit integration.

The taxonomy reveals that neighboring research directions include Generalist Multi-Embodiment Agents (two papers) focusing on diverse embodied domains with unified tokenization, and Specialized GUI Automation Agents addressing desktop and web control without physical grounding. The paper's dual-world framing distinguishes it from these adjacent categories: it explicitly bridges 2D virtual and 3D physical interaction rather than treating them as separate specializations. The Training Methodologies branch offers complementary perspectives on language-guided and reinforcement learning approaches, but does not directly address the architectural challenge of reconciling conflicting data distributions across modalities.

Among thirty candidates examined, the Layer-heterogeneous MoE architecture shows no clear refutation across ten candidates, suggesting architectural novelty within the limited search scope. However, the unified action space contribution encountered two refutable candidates among ten examined, and the OmniActor generalist agent concept found three refutable candidates among ten, indicating more substantial prior work in these areas. The statistics suggest that while the architectural innovation appears less contested, the broader goals of unified action representation and cross-domain generalist agents have received attention in related literature, though the search scope remains constrained.

Based on the limited thirty-candidate search, the work appears to occupy a sparsely populated research direction with modest architectural novelty but more crowded conceptual territory around unified action spaces and generalist agent design. The analysis does not cover exhaustive citation networks or domain-specific venues, leaving open questions about how the cerebrum-cerebellum-inspired layer separation compares to alternative modular or routing strategies in broader multi-task learning literature.

Taxonomy

Core-task Taxonomy Papers: 11
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 5

Research Landscape Overview

Core task: unified agent training for GUI and embodied tasks.

The field is organized around several complementary branches that together address the challenge of building agents capable of operating across digital and physical environments. Unified Multi-Domain Agent Architectures explore models that can handle both GUI interactions and embodied control, often leveraging shared representations or dual-world training paradigms. Training Methodologies for Agent Learning focus on learning strategies, ranging from reinforcement learning to imitation and self-supervised approaches, that enable agents to acquire skills across diverse task distributions. Specialized GUI Automation Agents concentrate on web navigation, desktop control, and interface understanding, while User-Centric and Personalized Agent Systems emphasize adaptation to individual user preferences. Supporting Infrastructure and Evaluation Platforms provide the benchmarks and simulation environments necessary for reproducible research, and Physical Interaction and Constraint Modeling addresses the unique challenges of grounding actions in real-world physics and spatial reasoning.

Within the Unified Multi-Domain Agent Architectures branch, a particularly active line of work targets GUI-Embodied Dual-World Agents that bridge digital and physical modalities. OmniActor[0] exemplifies this direction by proposing a unified framework for training agents that operate seamlessly in both GUI and embodied settings, addressing the challenge of transferring learned behaviors across these distinct interaction paradigms. This work sits alongside efforts like NaviMaster[1] and Embodied Web Agents[3], which similarly explore cross-domain generalization but may emphasize different aspects of the dual-world problem, such as navigation-centric tasks or web-grounded embodied reasoning. A central open question in this cluster is how to balance domain-specific inductive biases against the flexibility required for true multi-domain competence, with ongoing debate about whether shared architectures or modular designs better support scalable agent learning.

Claimed Contributions

Layer-heterogeneous MoE architecture

A novel mixture-of-experts architecture that shares parameters in shallow layers to exploit synergy between GUI and embodied data, while separating parameters in deep layers to mitigate conflicts arising from action differences. This design is inspired by the cerebrum-cerebellum mechanism in the human brain.
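The shallow-shared, deep-separated design can be illustrated with a toy sketch. This is a hypothetical illustration of the routing idea only, not the paper's implementation: the layer functions below are simple stand-ins for transformer blocks, and the class and function names (`LayerHeterogeneousMoE`, `make_affine`) are invented for this example.

```python
# Hypothetical sketch of the layer-heterogeneous MoE idea: shallow layers
# are shared across domains (to exploit synergy), while each deep layer
# holds one expert per domain ("gui" vs. "embodied") selected by task type
# (to avoid conflict). Layers here are toy stand-ins for transformer blocks.

from typing import Callable, Dict, List

Layer = Callable[[List[float]], List[float]]

def make_affine(scale: float, bias: float) -> Layer:
    """Toy stand-in for a transformer block."""
    return lambda x: [scale * v + bias for v in x]

class LayerHeterogeneousMoE:
    def __init__(self,
                 shared_layers: List[Layer],
                 expert_layers: List[Dict[str, Layer]]):
        # shared_layers: applied to every input regardless of domain.
        # expert_layers: one {domain: layer} dict per deep layer.
        self.shared_layers = shared_layers
        self.expert_layers = expert_layers

    def forward(self, x: List[float], domain: str) -> List[float]:
        for layer in self.shared_layers:      # shallow: shared parameters
            x = layer(x)
        for experts in self.expert_layers:    # deep: domain-specific experts
            x = experts[domain](x)
        return x

model = LayerHeterogeneousMoE(
    shared_layers=[make_affine(2.0, 0.0)],
    expert_layers=[{"gui": make_affine(1.0, 1.0),
                    "embodied": make_affine(1.0, -1.0)}],
)

print(model.forward([1.0], "gui"))       # shared layer, then GUI expert
print(model.forward([1.0], "embodied"))  # shared layer, then embodied expert
```

The same input diverges only after the shared shallow stack, mirroring the claimed synergy-then-conflict pattern across depth.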

10 retrieved papers
Unified action space and large-scale dataset collection

A unified data format and action space that standardizes GUI and embodied tasks, enabling training on large-scale datasets from multiple sources (OS-Atlas, Uground, Aguvis, Aria-UI, and LIBERO). This unification substantially enhances agent performance across various scenarios.
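One way to picture the unification is a single serialization that both GUI and embodied actions map into, so one decoder head can emit either. The schema and action names below (`UnifiedAction`, `click`, `move_gripper`) are illustrative assumptions, not the paper's actual action-space specification.

```python
# Hypothetical sketch of a unified action space: a GUI action (a click at
# normalized screen coordinates) and an embodied action (an end-effector
# delta) serialize into one shared textual format. The schema and action
# names are invented for illustration.

import json
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class UnifiedAction:
    domain: str                          # "gui" or "embodied"
    name: str                            # e.g. "click", "move_gripper"
    args: Dict[str, float] = field(default_factory=dict)

    def serialize(self) -> str:
        # One flat string format shared by both domains.
        return json.dumps({"domain": self.domain,
                           "action": self.name,
                           "args": self.args},
                          sort_keys=True)

# A GUI click and an embodied gripper move share one output format:
click = UnifiedAction("gui", "click", {"x": 0.42, "y": 0.87})
move = UnifiedAction("embodied", "move_gripper",
                     {"dx": 0.01, "dy": 0.0, "dz": -0.02})

print(click.serialize())
print(move.serialize())
```

With a shared format like this, heterogeneous datasets (GUI grounding corpora and robot demonstration data alike) can be converted once and mixed freely at training time.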

10 retrieved papers
Can Refute
OmniActor generalist agent

A generalist multimodal agent capable of performing tasks in both 2D virtual worlds (GUI tasks) and 3D physical worlds (embodied tasks). OmniActor outperforms agents trained solely on GUI or embodied data and surpasses existing generalist agents in both domains.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Layer-heterogeneous MoE architecture

A novel mixture-of-experts architecture that shares parameters in shallow layers to exploit synergy between GUI and embodied data, while separating parameters in deep layers to mitigate conflicts arising from action differences. This design is inspired by the cerebrum-cerebellum mechanism in the human brain.

Contribution

Unified action space and large-scale dataset collection

A unified data format and action space that standardizes GUI and embodied tasks, enabling training on large-scale datasets from multiple sources (OS-Atlas, Uground, Aguvis, Aria-UI, and LIBERO). This unification substantially enhances agent performance across various scenarios.

Contribution

OmniActor generalist agent

A generalist multimodal agent capable of performing tasks in both 2D virtual worlds (GUI tasks) and 3D physical worlds (embodied tasks). OmniActor outperforms agents trained solely on GUI or embodied data and surpasses existing generalist agents in both domains.