ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data
Overview
Overall Novelty Assessment
The paper contributes a large-scale cross-platform dataset spanning six operating systems and three task domains, alongside a family of foundation models (ScaleCUA) trained on this data. Within the taxonomy, it resides in the 'Cross-Platform Dataset Construction' leaf under 'Data Collection and Scaling Pipelines,' sharing this leaf with only one sibling paper (OS-Atlas). This positioning indicates a relatively sparse research direction focused specifically on multi-OS data aggregation, distinguishing it from mobile-only annotation efforts and from foundation model architectures that consume such data.
The taxonomy reveals neighboring work in 'Open-Source Foundation Action Models' (three papers on VLM-based GUI agents) and 'Mobile-Specific Annotation Datasets' (one paper on mobile GUI annotations). The paper bridges data collection and model training, connecting to both the dataset construction branch and the foundation model branch. Its scope explicitly includes automated data pipelines with human-in-the-loop validation, aligning with the leaf's scope note emphasizing 'automated or hybrid collection pipelines.' The exclusion note clarifies that mobile-only datasets belong elsewhere, reinforcing this work's cross-platform emphasis.
Among the 29 candidates examined, the data pipeline contribution (Contribution A) had one refutable candidate out of 10, the base model contribution (Contribution B) likewise had one refutable candidate out of 10, and the evaluation contribution (Contribution C) showed no clear refutation across nine candidates. These statistics suggest that, within the limited search scope, the data pipeline and model architecture face some overlap with prior work, whereas the evaluation insights appear more distinctive. The low refutation counts (at most one per contribution) indicate that most examined candidates address different facets or scales of the problem.
Given the limited search scope of 29 candidates from top-K semantic matches, this analysis captures the most semantically proximate prior work but does not constitute an exhaustive field survey. The sparse taxonomy leaf (two papers total) and low refutation rates suggest the work occupies a relatively underexplored niche within computer use agents, though the presence of any refutable candidates indicates incremental overlap with existing cross-platform data efforts. The evaluation contribution appears most novel within this constrained search.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose a dual-loop data acquisition pipeline that combines automated agent exploration with human expert supervision to collect computer-use data across six operating systems (Windows, macOS, Linux, Android, iOS, Web) and three task domains (understanding, grounding, task completion). This pipeline addresses the scarcity of computer-use training data by balancing automation with quality control.
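As a rough illustration of how such a dual-loop pipeline can be organized, the sketch below pairs an automated exploration loop with a human verification loop. All class and function names (Step, Trajectory, automated_loop, human_loop) are hypothetical placeholders for illustration, not the paper's actual interfaces.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Step:
    screenshot: bytes   # raw screen capture taken before the action
    action: str         # e.g. "click(x=0.42, y=0.17)" in a shared action space
    platform: str       # "windows" | "macos" | "linux" | "android" | "ios" | "web"

@dataclass
class Trajectory:
    task: str
    steps: List[Step] = field(default_factory=list)
    verified: bool = False   # set by the human loop

def automated_loop(propose_task: Callable[[], str],
                   act: Callable[[str], Step],
                   max_steps: int = 15) -> Trajectory:
    # Inner loop: an agent explores the environment and records raw steps.
    traj = Trajectory(task=propose_task())
    for _ in range(max_steps):
        step = act(traj.task)
        traj.steps.append(step)
        if step.action == "terminate()":
            break
    return traj

def human_loop(raw: List[Trajectory],
               review: Callable[[Trajectory], bool]) -> List[Trajectory]:
    # Outer loop: experts verify or reject trajectories before training.
    for traj in raw:
        traj.verified = review(traj)
    return [t for t in raw if t.verified]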
The authors develop a series of vision-language models (3B, 7B, 32B parameters) built on Qwen2.5-VL that support three inference paradigms: grounding mode for UI element localization, direct action mode for efficient task completion, and reasoned action mode with chain-of-thought reasoning. The models use a unified action space enabling seamless cross-platform interaction.
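To make the unified action space and the three inference paradigms concrete, the following sketch shows one plausible encoding; the Mode enum, the action helpers, and build_prompt are assumptions for illustration, not ScaleCUA's released API.

from enum import Enum

class Mode(Enum):
    GROUNDING = "grounding"       # return only the target element's coordinates
    DIRECT = "direct_action"      # emit the next action without reasoning text
    REASONED = "reasoned_action"  # emit chain-of-thought reasoning, then the action

# A platform-agnostic action space: coordinates are normalized to [0, 1] so the
# same call can drive desktop, mobile, and web screens at any resolution.
def click(x: float, y: float) -> str:
    return f"click(x={x:.3f}, y={y:.3f})"

def write(text: str) -> str:
    return f"write(text={text!r})"

def build_prompt(instruction: str, mode: Mode) -> str:
    # Compose a request for the model under the chosen inference mode.
    if mode is Mode.GROUNDING:
        return f"Locate the element for: {instruction}. Answer with coordinates only."
    if mode is Mode.DIRECT:
        return f"Task: {instruction}. Output the next action."
    return f"Task: {instruction}. Think step by step, then output the next action."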
The authors perform extensive experiments across GUI understanding, grounding, and task completion benchmarks on multiple platforms. Their evaluation provides fundamental insights into factors affecting agent performance, including data scaling effects, resolution trade-offs, inference mode comparisons, and cross-platform training strategies, establishing new state-of-the-art results on several benchmarks.
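A hedged sketch of how such factors might be probed: the toy harness below scores grounding accuracy (a predicted point falling inside the ground-truth box) at a given input resolution, so the same loop could sweep resolutions or swap the predictor between direct and reasoned modes. The predict callable is a stub supplied by the caller; no numbers from the paper are implied.

from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom), normalized to [0, 1]

def hit(pred: Tuple[float, float], box: Box) -> bool:
    # A grounding prediction counts if the point lands inside the target box.
    x, y = pred
    left, top, right, bottom = box
    return left <= x <= right and top <= y <= bottom

def grounding_accuracy(samples: List[Tuple[str, Box]],
                       predict: Callable[[str, int], Tuple[float, float]],
                       resolution: int) -> float:
    # Evaluate one predictor at one input resolution (longest image side, in pixels).
    hits = sum(hit(predict(instruction, resolution), box) for instruction, box in samples)
    return hits / len(samples) if samples else 0.0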
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[5] CCAgent: Coordinating Collaborative Data Scaling for Operating System Agents via Web3
Contribution Analysis
Detailed comparisons for each claimed contribution
Cross-platform interactive data pipeline for computer use agents
The authors propose a dual-loop data acquisition pipeline that combines automated agent exploration with human expert supervision to collect computer-use data across six operating systems (Windows, macOS, Linux, Android, iOS, Web) and three task domains (understanding, grounding, task completion). This pipeline addresses the scarcity of computer-use training data by balancing automation with quality control.
[2] OpenCUA: Open foundations for computer-use agents
[32] 3EED: Ground everything everywhere in 3D
[33] Exploring the Integration of Generative AI Tools in Software Testing Education: A Case Study on ChatGPT and Copilot for Preparatory Testing Artifacts in Postgraduate …
[34] CRAB: Cross-platform agent benchmark for multi-modal embodied language model agents
[35] Multi-Agent Systems in AIOps: Enhancing Detection, Diagnosis, and Remediation
[36] OmniACT: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web
[37] Automated Database Tuning Using AI: A Comprehensive Framework for Real-Time Performance Optimization
[38] REPP: A robust cross-platform solution for online sensorimotor synchronization experiments
[39] Chouette: An Automated Cross-Platform UI Crawler for Improving App Quality
[40] A Multi-Agent Monitoring System for Computer Networks
ScaleCUA family of base agent models with unified action space
The authors develop a series of vision-language models (3B, 7B, 32B parameters) built on Qwen2.5-VL that support three inference paradigms: grounding mode for UI element localization, direct action mode for efficient task completion, and reasoned action mode with chain-of-thought reasoning. The models use a unified action space enabling seamless cross-platform interaction.
[16] Aguvis: Unified pure vision agents for autonomous GUI interaction
[14] InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency
[15] CogAgent: A Visual Language Model for GUI Agents
[17] MUG: Interactive multimodal grounding on user interfaces
[18] MobileVLM: A vision-language model for better intra- and inter-UI understanding
[19] GUI-R1: A generalist R1-style vision-language action model for GUI agents
[20] Seed1.5-VL technical report
[21] Aria-UI: Visual grounding for GUI instructions
[22] ScreenAgent: A Vision Language Model-driven Computer Control Agent
[23] InfiGUI-R1: Advancing multimodal GUI agents from reactive actors to deliberative reasoners
Comprehensive evaluation and insights for computer use agents
The authors perform extensive experiments across GUI understanding, grounding, and task completion benchmarks on multiple platforms. Their evaluation provides fundamental insights into factors affecting agent performance, including data scaling effects, resolution trade-offs, inference mode comparisons, and cross-platform training strategies, establishing new state-of-the-art results on several benchmarks.