OmniActor: A Generalist GUI and Embodied Agent for 2D&3D Worlds
Overview
Overall Novelty Assessment
The paper introduces OmniActor, a generalist agent designed to operate across both GUI and embodied environments through a Layer-heterogeneous Mixture-of-Experts architecture and unified action space. It resides in the GUI-Embodied Dual-World Agents leaf, which contains only three papers total, indicating a relatively sparse but emerging research direction. This positioning suggests the work targets a specific niche within the broader unified agent training landscape, where most prior efforts focus on either GUI automation or embodied control separately rather than their explicit integration.
The taxonomy reveals that neighboring research directions include Generalist Multi-Embodiment Agents (two papers) focusing on diverse embodied domains with unified tokenization, and Specialized GUI Automation Agents addressing desktop and web control without physical grounding. The paper's dual-world framing distinguishes it from these adjacent categories: it explicitly bridges 2D virtual and 3D physical interaction rather than treating them as separate specializations. The Training Methodologies branch offers complementary perspectives on language-guided and reinforcement learning approaches, but does not directly address the architectural challenge of reconciling conflicting data distributions across modalities.
Of the thirty candidates examined in total (ten per contribution), none clearly refuted the Layer-heterogeneous MoE architecture, suggesting architectural novelty within the limited search scope. However, two of the ten candidates examined for the unified action space were refutable, as were three of the ten examined for the OmniActor generalist agent concept, indicating more substantial prior work in these areas. These statistics suggest that while the architectural innovation appears less contested, the broader goals of unified action representation and cross-domain generalist agents have received attention in related literature, though the search scope remains constrained.
Based on the limited thirty-candidate search, the work appears to occupy a sparsely populated research direction with modest architectural novelty but more crowded conceptual territory around unified action spaces and generalist agent design. The analysis does not cover exhaustive citation networks or domain-specific venues, leaving open questions about how the cerebrum-cerebellum-inspired layer separation compares to alternative modular or routing strategies in broader multi-task learning literature.
Taxonomy
Research Landscape Overview
Claimed Contributions
A novel mixture-of-experts architecture that shares parameters in shallow layers to exploit synergy between GUI and embodied data, while separating parameters in deep layers to mitigate conflicts arising from action differences. This design is inspired by the cerebrum-cerebellum mechanism in the human brain.
A unified data format and action space that standardizes GUI and embodied tasks, enabling training on large-scale datasets from multiple sources (OS-Atlas, Uground, Aguvis, Aria-UI, and LIBERO). This unification substantially enhances agent performance across various scenarios.
A generalist multimodal agent capable of performing tasks in both 2D virtual worlds (GUI tasks) and 3D physical worlds (embodied tasks). OmniActor outperforms agents trained solely on GUI or embodied data and surpasses existing generalist agents in both domains.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] NaviMaster: Learning a Unified Policy for GUI and Embodied Navigation Tasks PDF
[3] Embodied Web Agents: Bridging Physical-Digital Realms for Integrated Agent Intelligence PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
Layer-heterogeneous MoE architecture
A novel mixture-of-experts architecture that shares parameters in shallow layers to exploit synergy between GUI and embodied data, while separating parameters in deep layers to mitigate conflicts arising from action differences. This design is inspired by the cerebrum-cerebellum mechanism in the human brain.
[12] Revisiting Sparse Mixture of Experts for Resource-adaptive Federated Fine-tuning Foundation Models PDF
[13] Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models PDF
[14] MoEfication: Transformer Feed-forward Layers are Mixtures of Experts PDF
[15] The MoE-Empowered Edge LLMs Deployment: Architecture, Challenges, and Opportunities PDF
[16] CoMoE: Collaborative Optimization of Expert Aggregation and Offloading for MoE-based LLMs at Edge PDF
[17] Ensemble and Mixture-of-Experts DeepONets For Operator Learning PDF
[18] FSMoE: A Flexible and Scalable Training System for Sparse Mixture-of-Experts Models PDF
[19] Lancet: Accelerating Mixture-of-Experts Training via Whole-Graph Computation-Communication Overlapping PDF
[20] MoE-Infinity: Offloading-Efficient MoE Model Serving PDF
[21] Condense, Don't Just Prune: Enhancing Efficiency and Performance in MoE Layer Pruning PDF
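The layer-heterogeneous sharing pattern under comparison here (shared shallow parameters, domain-specific deep experts) can be illustrated with a toy forward pass. The class, layer counts, and scaling values below are illustrative assumptions for exposition, not OmniActor's actual parameterization.

```python
# Toy sketch of a layer-heterogeneous stack: shallow layers are shared
# across domains, while deep layers hold one expert set per domain
# ("gui" vs "embodied"). Names and values are hypothetical.

def make_linear(scale):
    """Return a toy 'layer': scales every feature by a constant."""
    return lambda x: [scale * v for v in x]

class LayerHeterogeneousStack:
    def __init__(self, n_shared=2, n_expert=2):
        # Shallow layers: one parameter set serving both domains (synergy).
        self.shared = [make_linear(1.0 + 0.1 * i) for i in range(n_shared)]
        # Deep layers: separate parameters per domain, absorbing the
        # conflict between GUI actions and embodied actions.
        self.experts = {
            "gui":      [make_linear(2.0) for _ in range(n_expert)],
            "embodied": [make_linear(0.5) for _ in range(n_expert)],
        }

    def forward(self, x, domain):
        for layer in self.shared:           # shared shallow computation
            x = layer(x)
        for layer in self.experts[domain]:  # domain-specific deep computation
            x = layer(x)
        return x

stack = LayerHeterogeneousStack()
gui_out = stack.forward([1.0, 1.0], "gui")
emb_out = stack.forward([1.0, 1.0], "embodied")
```

The same input diverges only after the shared prefix, which is the intended behavior: common perception is pooled while action-specific processing is isolated.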
Unified action space and large-scale dataset collection
A unified data format and action space that standardizes GUI and embodied tasks, enabling training on large-scale datasets from multiple sources (OS-Atlas, Uground, Aguvis, Aria-UI, and LIBERO). This unification substantially enhances agent performance across various scenarios.
[4] From Multimodal LLMs to Generalist Embodied Agents: Methods and Lessons PDF
[33] Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction PDF
[1] NaviMaster: Learning a Unified Policy for GUI and Embodied Navigation Tasks PDF
[32] ShowUI: One Vision-Language-Action Model for GUI Visual Agent PDF
[34] Vision-Language-Action Models in Robotic Manipulation: A Systematic Review PDF
[35] Uni-NaVid: A Video-Based Vision-Language-Action Model for Unifying Embodied Navigation Tasks PDF
[36] AgentStudio: A Toolkit for Building General Virtual Agents PDF
[37] Hermes: Human-to-Robot Embodied Learning from Multi-Source Motion Data for Mobile Dexterous Manipulation PDF
[38] GUI Testing Arena: A Unified Benchmark for Advancing Autonomous GUI Testing Agent PDF
[39] D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI PDF
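The unified data format claimed above (one action schema spanning GUI clicks and embodied control) can be sketched minimally. The schema, field names, and text serialization below are hypothetical assumptions about what such a format might look like, not the paper's actual specification.

```python
# Hypothetical unified action record: GUI clicks and embodied
# end-effector moves serialize into one schema, so heterogeneous
# datasets (e.g. GUI corpora and LIBERO trajectories) can be mixed
# into a single training stream. Field names are illustrative.

from dataclasses import dataclass, field

@dataclass
class UnifiedAction:
    domain: str       # "gui" or "embodied"
    action_type: str  # e.g. "click", "type", "move_ee"
    args: dict = field(default_factory=dict)

    def to_text(self):
        """Render as a flat token string a language model could emit."""
        arg_str = " ".join(f"{k}={v}" for k, v in sorted(self.args.items()))
        return f"<{self.domain}> {self.action_type} {arg_str}".strip()

def from_gui_click(x, y):
    return UnifiedAction("gui", "click", {"x": x, "y": y})

def from_ee_delta(dx, dy, dz, grip):
    return UnifiedAction("embodied", "move_ee",
                         {"dx": dx, "dy": dy, "dz": dz, "grip": grip})

click = from_gui_click(120, 340).to_text()
move  = from_ee_delta(0.01, -0.02, 0.0, 1).to_text()
```

Serializing both domains to one token vocabulary is what lets a single decoder head be trained on mixed batches, which is the premise the citation list above probes for prior art.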
OmniActor generalist agent
A generalist multimodal agent capable of performing tasks in both 2D virtual worlds (GUI tasks) and 3D physical worlds (embodied tasks). OmniActor outperforms agents trained solely on GUI or embodied data and surpasses existing generalist agents in both domains.