GTA1: GUI Test-time Scaling Agent

ICLR 2026 Conference Submission (Anonymous Authors)
GUI Agent; Multimodal Large Language Model
Abstract:

Graphical user interface (GUI) agents autonomously complete tasks across platforms (e.g., Linux) by sequentially decomposing user instructions into action proposals that iteratively interact with visual elements in the evolving environment. However, two main challenges arise: i) planning (i.e., selecting the action proposal sequence) in an expansive action space, where choosing an appropriate plan is non-trivial because many valid plans may exist; ii) accurately grounding actions in complex, high-resolution interfaces, i.e., precisely interacting with visual targets. This paper addresses both challenges with our GUI Test-time Scaling Agent, GTA1. First, we apply test-time scaling to select the most appropriate action proposal: at each step, multiple candidate proposals are sampled concurrently and evaluated by a judge model, which selects the best one, trading additional computation for better decision quality. Second, we propose a model that improves the grounding of selected action proposals to their corresponding visual elements. Our key insight is that reinforcement learning (RL) facilitates grounding through direct objective alignment, rewarding successful clicks on interface elements. Experimentally, GTA1 achieves state-of-the-art performance on both grounding and agent task-execution benchmarks.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a GUI agent that combines test-time scaling for action proposal selection with reinforcement learning for visual grounding. It resides in the 'Grounding Enhancement via Reinforcement Learning' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy of fifty papers. This leaf focuses specifically on applying RL to improve grounding precision, distinguishing it from coordinate-based methods or supervised grounding approaches found in sibling categories. The small cluster size suggests this particular intersection of RL and GUI grounding remains an emerging area rather than a saturated research direction.

The taxonomy reveals neighboring work in coordinate-based grounding methods and alternative grounding paradigms, which collectively address element localization without RL. Adjacent branches cover agent reasoning and decision-making, including hierarchical reasoning and error recovery mechanisms, while training methodologies explore supervised fine-tuning and online exploration strategies. The paper's dual focus on test-time scaling (typically found in reasoning categories) and RL-based grounding creates a bridge between these branches. The scope notes clarify that this leaf excludes general RL for task planning, concentrating instead on grounding-specific RL applications, positioning the work at the intersection of visual localization and adaptive learning.

Among the twenty-eight candidates examined, the test-time scaling contribution has two refutable candidates out of ten examined, the RL-based grounding model has four out of ten, and the data cleaning strategy has four out of eight. These statistics indicate that each contribution has identifiable prior work within the limited search scope, with the grounding and data cleaning aspects showing more substantial overlap. The analysis does not claim exhaustive coverage; rather, it reflects patterns among top-K semantic matches and citation expansions, suggesting that while the individual components have precedents, their specific combination may offer differentiation.

Based on the limited search scope of twenty-eight candidates, the work appears to synthesize existing techniques—test-time scaling, RL-driven grounding, and data cleaning—in a GUI agent context. The sparse population of its taxonomy leaf and the moderate refutation rates across contributions suggest incremental advancement rather than foundational novelty. The analysis cannot assess whether the specific integration or empirical results provide sufficient differentiation, as this depends on implementation details and experimental outcomes beyond the scope of this literature-based assessment.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 10

Research Landscape Overview

Core task: Graphical user interface agent task execution and grounding. The field encompasses a diverse set of challenges, from building end-to-end agent frameworks and architectures that orchestrate perception and action, to developing specialized visual grounding and element localization techniques that pinpoint UI components in screenshots or live environments. Training methodologies and data construction form another major branch, addressing how to curate large-scale datasets and design effective learning pipelines, while agent reasoning and decision-making explores planning, reflection, and multi-step strategies. Benchmarks and evaluation frameworks provide standardized testbeds such as VisualWebArena[3], and domain-specific applications extend these ideas to mobile, desktop, or web contexts. Security and robustness studies like VisualTrap[11] examine adversarial scenarios, and survey literature such as GUI Agents Survey[21] and OS Agents Survey[23] synthesize recent progress across these branches.

Within visual grounding and element localization, a particularly active line of work leverages reinforcement learning to refine grounding accuracy through iterative feedback. GTA1[0] exemplifies this approach by using RL-based optimization to improve how agents map natural language instructions to precise UI elements, contrasting with purely supervised methods like SeeClick[2] or universal grounding frameworks such as Universal Visual Grounding[4]. Neighboring efforts include Self-Evolutionary RL[30], which iteratively evolves grounding policies, and GUI-G1[42], which integrates grounding into broader reasoning loops. These RL-driven techniques address the challenge of handling dynamic or ambiguous interfaces where static training data may fall short, offering a complementary pathway to the large-scale pretraining strategies seen in models like CogAgent[9] or ShowUI[24]. GTA1[0] sits squarely in this reinforcement-enhanced grounding cluster, emphasizing adaptive learning over fixed supervision to achieve robust localization in complex GUI environments.

Claimed Contributions

Test-time scaling strategy for GUI agent planning

The authors propose a test-time scaling method that samples multiple candidate action proposals at each step and uses a multimodal large language model judge to select the most appropriate one. This approach enables robust planning in complex GUI environments without requiring full action sequence rollouts.

10 retrieved papers
Can Refute
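The sample-then-judge loop claimed above can be sketched in a few lines. Here `propose_actions` and `judge_score` are hypothetical stand-ins (not interfaces from the paper): in a real agent, the former would sample proposals from a multimodal LLM planner at nonzero temperature, and the latter would query a judge model with the current screenshot.

```python
import random

def propose_actions(observation: str, n: int) -> list:
    """Hypothetical planner stand-in: sample n candidate action proposals.
    A real agent would query a multimodal LLM with temperature > 0."""
    buttons = ["submit", "cancel", "search", "login"]
    return ["click(%s)" % random.choice(buttons) for _ in range(n)]

def judge_score(observation: str, proposal: str) -> float:
    """Hypothetical judge stand-in: a real system would ask an MLLM to rate
    how well the proposal advances the task in the current state."""
    preferred = {"click(search)": 1.0, "click(submit)": 0.5}
    return preferred.get(proposal, 0.0)

def select_action(observation: str, n: int = 8) -> str:
    """Best-of-N selection: sample N proposals (concurrently in practice),
    score each with the judge, and execute only the top-scored one."""
    candidates = propose_actions(observation, n)
    return max(candidates, key=lambda p: judge_score(observation, p))
```

Note that no full rollout is needed: the judge scores single-step proposals, so the extra cost is one batch of planner samples plus one judging pass per step.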
RL-based GUI grounding model without explicit reasoning

The authors introduce a grounding model optimized with reinforcement learning (specifically GRPO) that directly predicts click coordinates and receives a reward when a prediction falls within the target UI element. Unlike prior work, this approach does not require intermediate 'thinking' or chain-of-thought (CoT) reasoning steps.

10 retrieved papers
Can Refute
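The reward described above is a simple inside-the-box test, and GRPO then normalizes rewards within each group of sampled predictions to form advantages. A minimal sketch, with illustrative function names (the paper's exact reward shaping and hyperparameters may differ):

```python
def click_reward(pred_xy, target_box):
    """Binary grounding reward: 1.0 if the predicted click lands inside the
    target element's bounding box (x1, y1, x2, y2), else 0.0."""
    x, y = pred_xy
    x1, y1, x2, y2 = target_box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0

def grpo_advantages(rewards):
    """GRPO-style group-relative advantages: rewards for a group of samples
    drawn from the same prompt are normalized by the group mean and std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (var ** 0.5 + 1e-8) for r in rewards]
```

Because the reward is computed directly from the click and the ground-truth box, no intermediate reasoning trace is needed to produce a learning signal.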
Data cleaning strategy for GUI grounding datasets

The authors develop a lightweight data cleaning strategy that uses OmniParser to detect UI elements and filters out training samples whose annotated bounding boxes are misaligned with the actual visual targets, i.e., whose Intersection over Union (IoU) with the detected elements falls below a threshold.

8 retrieved papers
Can Refute
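The filtering rule above amounts to keeping a training sample only if its annotated box sufficiently overlaps some detector-proposed box. A sketch under assumptions: the 0.5 threshold is an illustrative default (the paper's exact value is not given here), and `detected_boxes` stands for OmniParser's output.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def keep_sample(annotated_box, detected_boxes, threshold=0.5):
    """Keep the sample only if the annotation aligns (IoU >= threshold)
    with at least one element found by the detector."""
    return any(iou(annotated_box, d) >= threshold for d in detected_boxes)
```

Samples whose annotations match no detected element are discarded, which removes mislabeled boxes without any manual relabeling.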

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
