ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: GUI Agent, GUI Data Pipeline, Computer Use, Open Source
Abstract:

Vision-Language Models (VLMs) have enabled computer use agents (CUAs) that operate GUIs autonomously, showing great potential, yet progress is limited by the lack of large-scale, open-source computer use data and foundation models. In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, built via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose computer use agents. We will release data, models, and code to advance future research.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper contributes a large-scale cross-platform dataset spanning six operating systems and three task domains, alongside a family of foundation models (ScaleCUA) trained on this data. Within the taxonomy, it resides in the 'Cross-Platform Dataset Construction' leaf under 'Data Collection and Scaling Pipelines,' sharing this leaf with only one sibling paper (OS-Atlas). This positioning indicates a relatively sparse research direction focused specifically on multi-OS data aggregation, distinguishing it from mobile-only annotation efforts and from foundation model architectures that consume such data.

The taxonomy reveals neighboring work in 'Open-Source Foundation Action Models' (three papers on VLM-based GUI agents) and 'Mobile-Specific Annotation Datasets' (one paper on mobile GUI annotations). The paper bridges data collection and model training, connecting to both the dataset construction branch and the foundation model branch. Its scope explicitly includes automated data pipelines with human-in-the-loop validation, aligning with the leaf's scope note emphasizing 'automated or hybrid collection pipelines.' The exclude note clarifies that mobile-only datasets belong elsewhere, reinforcing this work's cross-platform emphasis.

Among the 29 candidates examined, the analysis found one refutable candidate out of 10 for the data pipeline contribution (Contribution A), and likewise one out of 10 for the base model contribution (Contribution B). The evaluation contribution (Contribution C) showed no clear refutation across its nine candidates. These statistics suggest that, within the limited search scope, the data pipeline and model architecture face some prior overlap, whereas the evaluation insights appear more distinctive. The low refutation counts (at most one per contribution) indicate that most examined candidates address different facets or scales of the problem.

Given the limited search scope of 29 candidates from top-K semantic matches, this analysis captures the most semantically proximate prior work but does not constitute an exhaustive field survey. The sparse taxonomy leaf (two papers total) and low refutation rates suggest the work occupies a relatively underexplored niche within computer use agents, though the presence of any refutable candidates indicates incremental overlap with existing cross-platform data efforts. The evaluation contribution appears most novel within this constrained search.

Taxonomy

Core-task Taxonomy Papers: 13
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 2

Research Landscape Overview

Core task: Scaling open-source computer use agents with cross-platform data.

The field of computer use agents has rapidly evolved around four main branches. Foundation Models and Architectures for GUI Agents explores the underlying neural architectures and pretraining strategies that enable agents to perceive and interact with graphical interfaces. Data Collection and Scaling Pipelines focuses on methods for gathering, annotating, and synthesizing large-scale interaction traces across diverse operating systems and applications, addressing the bottleneck of training data availability. Evaluation Frameworks and Benchmarks develops standardized testbeds and metrics to measure agent performance on realistic tasks, such as MMBench-GUI[3] and FineState-Bench[10]. Specialized Agent Capabilities and Applications examines domain-specific skills, ranging from mobile navigation in Mobile-Agent-v3[8] to web browsing in Surfer 2[12], and how agents generalize across platforms.

A particularly active line of work centers on cross-platform dataset construction, where researchers aim to unify interaction data from Windows, macOS, Linux, mobile, and web environments to improve agent robustness. ScaleCUA[0] exemplifies this direction by systematically aggregating diverse platform traces to train open-source models at scale. Closely related efforts include OS-Atlas[1], which emphasizes operating-system-level grounding, and CCAgent[5], which tackles cross-platform consistency in action spaces. Meanwhile, OpenCUA[2] and UltraCUA[6] explore complementary strategies for data synthesis and quality filtering. The main trade-off in this cluster revolves around breadth versus depth: some works prioritize coverage across many platforms, while others focus on high-fidelity annotations within a narrower scope. ScaleCUA[0] sits squarely in the breadth-focused camp, leveraging cross-platform diversity to enhance generalization, whereas CCAgent[5] places greater emphasis on ensuring semantic alignment of actions across different GUI paradigms.

Claimed Contributions

Cross-platform interactive data pipeline for computer use agents

The authors propose a dual-loop data acquisition pipeline that combines automated agent exploration with human expert supervision to collect computer-use data across six operating systems (Windows, macOS, Linux, Android, iOS, Web) and three task domains (understanding, grounding, task completion). This pipeline addresses the scarcity of computer-use training data by balancing automation with quality control.

10 retrieved papers
Can Refute
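To make the dual-loop idea concrete, the following is a minimal Python sketch of such a pipeline under stated assumptions: an inner loop where an automated agent rolls out trajectories, and an outer loop where a human expert gates what enters the corpus. All names here (`Trajectory`, `agent_explore`, `human_review`, `collect`) are hypothetical illustrations, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    platform: str            # e.g. "windows", "android", "web"
    task: str                # natural-language task description
    steps: list = field(default_factory=list)  # (screenshot, action) pairs
    verified: bool = False   # set only after human review

def agent_explore(platform: str, task: str) -> Trajectory:
    """Inner loop: an automated agent rolls out a trajectory (stubbed here)."""
    return Trajectory(platform=platform, task=task,
                      steps=[("screen_0.png", "click(120, 340)")])

def human_review(traj: Trajectory) -> bool:
    """Outer loop: a human expert accepts or rejects the trajectory (stubbed)."""
    return len(traj.steps) > 0

def collect(platforms, tasks):
    """Run the dual loop over every (platform, task) pair; keep verified traces."""
    dataset = []
    for platform in platforms:
        for task in tasks:
            traj = agent_explore(platform, task)
            if human_review(traj):   # only expert-approved traces enter the corpus
                traj.verified = True
                dataset.append(traj)
    return dataset
```

The key design point this sketch captures is the quality gate: automation provides scale across platforms, while the human check bounds label noise before data reaches training.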
ScaleCUA family of base agent models with unified action space

The authors develop a series of vision-language models (3B, 7B, 32B parameters) built on Qwen2.5-VL that support three inference paradigms: grounding mode for UI element localization, direct action mode for efficient task completion, and reasoned action mode with chain-of-thought reasoning. The models use a unified action space enabling seamless cross-platform interaction.

10 retrieved papers
Can Refute
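A unified action space can be pictured as a small shared vocabulary of primitives that every platform-specific event is normalized into, so one output format serves desktop, mobile, and web alike. The sketch below is a hypothetical illustration of that idea; the `Action` type, normalized coordinates, and the `parse_model_output` string format are assumptions for exposition, not ScaleCUA's actual interface.

```python
from typing import NamedTuple

class Action(NamedTuple):
    name: str    # e.g. "click", "type", "scroll", "swipe"
    args: dict

def click(x: float, y: float) -> Action:
    # Coordinates normalized to [0, 1] so a single output convention
    # transfers across screen sizes and platforms.
    return Action("click", {"x": x, "y": y})

def type_text(text: str) -> Action:
    return Action("type", {"text": text})

def parse_model_output(raw: str) -> Action:
    # A reasoned-action model might emit "click(0.42, 0.87)" after its
    # chain of thought; direct-action mode emits the same call alone.
    name, _, rest = raw.partition("(")
    body = rest.rstrip(")")
    if name == "click":
        x, y = (float(a.strip()) for a in body.split(","))
        return click(x, y)
    if name == "type":
        return type_text(body.strip("'\""))
    raise ValueError(f"unknown action: {name}")
```

Under this framing, the three inference paradigms differ only in what surrounds the final action string (nothing, or a reasoning trace, or a grounding answer), while the parsed `Action` stays identical across platforms.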
Comprehensive evaluation and insights for computer use agents

The authors perform extensive experiments across GUI understanding, grounding, and task completion benchmarks on multiple platforms. Their evaluation provides fundamental insights into factors affecting agent performance, including data scaling effects, resolution trade-offs, inference mode comparisons, and cross-platform training strategies, establishing new state-of-the-art results on several benchmarks.

9 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
