GTA1: GUI Test-time Scaling Agent

ICLR 2026 Conference Submission (Anonymous Authors)
GUI Agent; Multimodal Large Language Model
Abstract:

Graphical user interface (GUI) agents autonomously complete tasks across platforms (e.g., Linux) by sequentially decomposing user instructions into action proposals that iteratively interact with visual elements in the evolving environment. However, two main challenges arise: i) planning (i.e., selecting the action proposal sequence) in an expansive action space, where choosing an appropriate plan is non-trivial because many valid plans may exist; ii) accurately grounding actions in complex, high-resolution interfaces, i.e., precisely interacting with visual targets. This paper addresses both challenges with our GUI Test-time Scaling Agent, GTA1. First, we apply test-time scaling to select the most appropriate action proposal: at each step, multiple candidate proposals are sampled concurrently and evaluated by a judge model, which selects the best one, trading additional computation for better decision quality. Second, we propose a model that improves the grounding of selected action proposals to their corresponding visual elements. Our key insight is that reinforcement learning (RL) facilitates grounding through direct objective alignment, rewarding successful clicks on interface elements. Experimentally, GTA1 achieves state-of-the-art performance on both grounding and agent task-execution benchmarks.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a GUI agent that combines test-time scaling for action proposal selection with reinforcement learning for visual grounding. It resides in the 'Grounding Enhancement via Reinforcement Learning' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy of fifty papers. This leaf focuses specifically on applying RL to improve grounding precision, distinguishing it from coordinate-based methods or supervised grounding approaches found in sibling categories. The small cluster size suggests this particular intersection of RL and GUI grounding remains an emerging area rather than a saturated research direction.

The taxonomy reveals neighboring work in coordinate-based grounding methods and alternative grounding paradigms, which collectively address element localization without RL. Adjacent branches cover agent reasoning and decision-making, including hierarchical reasoning and error recovery mechanisms, while training methodologies explore supervised fine-tuning and online exploration strategies. The paper's dual focus on test-time scaling (typically found in reasoning categories) and RL-based grounding creates a bridge between these branches. The scope notes clarify that this leaf excludes general RL for task planning, concentrating instead on grounding-specific RL applications, positioning the work at the intersection of visual localization and adaptive learning.

Among the twenty-eight candidates examined, the test-time scaling contribution has two refutable candidates out of ten examined, the RL-based grounding model has four out of ten, and the data cleaning strategy has four out of eight. These statistics indicate that each contribution has identifiable prior work within the limited search scope, with the grounding and data cleaning aspects showing more substantial overlap. The analysis does not claim exhaustive coverage; rather, it reflects patterns among top-K semantic matches and citation expansions, suggesting that while the individual components have precedents, their specific combination may offer differentiation.

Based on the limited search scope of twenty-eight candidates, the work appears to synthesize existing techniques—test-time scaling, RL-driven grounding, and data cleaning—in a GUI agent context. The sparse population of its taxonomy leaf and the moderate refutation rates across contributions suggest incremental advancement rather than foundational novelty. The analysis cannot assess whether the specific integration or empirical results provide sufficient differentiation, as this depends on implementation details and experimental outcomes beyond the scope of this literature-based assessment.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 10

Research Landscape Overview

Core task: Graphical user interface agent task execution and grounding. The field encompasses a diverse set of challenges, from building end-to-end agent frameworks and architectures that orchestrate perception and action, to developing specialized visual grounding and element localization techniques that pinpoint UI components in screenshots or live environments. Training methodologies and data construction form another major branch, addressing how to curate large-scale datasets and design effective learning pipelines, while agent reasoning and decision-making explores planning, reflection, and multi-step strategies. Benchmarks and evaluation frameworks provide standardized testbeds such as VisualWebArena[3], and domain-specific applications extend these ideas to mobile, desktop, or web contexts. Security and robustness studies like VisualTrap[11] examine adversarial scenarios, and survey literature such as GUI Agents Survey[21] and OS Agents Survey[23] synthesize recent progress across these branches.

Within visual grounding and element localization, a particularly active line of work leverages reinforcement learning to refine grounding accuracy through iterative feedback. GTA1[0] exemplifies this approach by using RL-based optimization to improve how agents map natural language instructions to precise UI elements, contrasting with purely supervised methods like SeeClick[2] or universal grounding frameworks such as Universal Visual Grounding[4]. Neighboring efforts include Self-Evolutionary RL[30], which iteratively evolves grounding policies, and GUI-G1[42], which integrates grounding into broader reasoning loops. These RL-driven techniques address the challenge of handling dynamic or ambiguous interfaces where static training data may fall short, offering a complementary pathway to the large-scale pretraining strategies seen in models like CogAgent[9] or ShowUI[24]. GTA1[0] sits squarely in this reinforcement-enhanced grounding cluster, emphasizing adaptive learning over fixed supervision to achieve robust localization in complex GUI environments.

Claimed Contributions

Test-time scaling strategy for GUI agent planning

The authors propose a test-time scaling method that samples multiple candidate action proposals at each step and uses a multimodal large language model judge to select the most appropriate one. This approach enables robust planning in complex GUI environments without requiring full action sequence rollouts.

10 retrieved papers
Can Refute
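The sample-then-judge loop claimed above can be sketched in a few lines. Here `propose_actions` and `judge_score` are hypothetical stand-ins (not interfaces from the paper): in a real agent, the former would sample proposals from a multimodal LLM planner at nonzero temperature, and the latter would query a judge model with the current screenshot.

```python
import random

def propose_actions(observation: str, n: int) -> list:
    """Hypothetical planner stand-in: sample n candidate action proposals.
    A real agent would query a multimodal LLM with temperature > 0."""
    buttons = ["submit", "cancel", "search", "login"]
    return ["click(%s)" % random.choice(buttons) for _ in range(n)]

def judge_score(observation: str, proposal: str) -> float:
    """Hypothetical judge stand-in: a real system would ask an MLLM to rate
    how well the proposal advances the task in the current state."""
    preferred = {"click(search)": 1.0, "click(submit)": 0.5}
    return preferred.get(proposal, 0.0)

def select_action(observation: str, n: int = 8) -> str:
    """Best-of-N selection: sample N proposals (concurrently in practice),
    score each with the judge, and execute only the top-scored one."""
    candidates = propose_actions(observation, n)
    return max(candidates, key=lambda p: judge_score(observation, p))
```

Note that no full rollout is needed: the judge scores single-step proposals, so the extra cost is one batch of planner samples plus one judging pass per step.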
RL-based GUI grounding model without explicit reasoning

The authors introduce a grounding model optimized with reinforcement learning (specifically GRPO) that directly predicts click coordinates and receives a reward when a prediction falls within the target UI element. Unlike prior work, this approach does not require intermediate 'thinking' or chain-of-thought (CoT) reasoning steps.

10 retrieved papers
Can Refute
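The reward described above is a simple inside-the-box test, and GRPO then normalizes rewards within each group of sampled predictions to form advantages. A minimal sketch, with illustrative function names (the paper's exact reward shaping and hyperparameters may differ):

```python
def click_reward(pred_xy, target_box):
    """Binary grounding reward: 1.0 if the predicted click lands inside the
    target element's bounding box (x1, y1, x2, y2), else 0.0."""
    x, y = pred_xy
    x1, y1, x2, y2 = target_box
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0

def grpo_advantages(rewards):
    """GRPO-style group-relative advantages: rewards for a group of samples
    drawn from the same prompt are normalized by the group mean and std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (var ** 0.5 + 1e-8) for r in rewards]
```

Because the reward is computed directly from the click and the ground-truth box, no intermediate reasoning trace is needed to produce a learning signal.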
Data cleaning strategy for GUI grounding datasets

The authors develop a lightweight data cleaning strategy that uses OmniParser to detect UI elements and filters out training samples whose annotated bounding boxes are misaligned with the actual visual targets, i.e., whose Intersection over Union (IoU) with the detected elements falls below a threshold.

8 retrieved papers
Can Refute
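The filtering rule above amounts to keeping a training sample only if its annotated box sufficiently overlaps some detector-proposed box. A sketch under assumptions: the 0.5 threshold is an illustrative default (the paper's exact value is not given here), and `detected_boxes` stands for OmniParser's output.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def keep_sample(annotated_box, detected_boxes, threshold=0.5):
    """Keep the sample only if the annotation aligns (IoU >= threshold)
    with at least one element found by the detector."""
    return any(iou(annotated_box, d) >= threshold for d in detected_boxes)
```

Samples whose annotations match no detected element are discarded, which removes mislabeled boxes without any manual relabeling.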

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
