OpenApps: Simulating Environment Variations to Measure UI Agent Reliability

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: reinforcement learning, agents, environment, reliability
Abstract:

Reliability is key to realizing the promise of autonomous UI agents: multimodal agents that directly interact with the apps humans use, since users must be able to trust an agent to complete a given task. Current evaluations rely on fixed environments, often clones of existing apps, which can only shed light on whether, or how often, an agent completes a task within that specific environment. When deployed, however, agents are likely to encounter variations in app design and content that can affect their ability to complete a task. To address this blind spot in measuring agent reliability across app variations, we develop OpenApps, a lightweight open-source ecosystem of six apps (messenger, calendar, maps, etc.) that are configurable in appearance and content. OpenApps requires just a single CPU to run, enabling easy generation and deployment of thousands of versions of each app. Using this ecosystem, we run more than 10,000 independent evaluations to study reliability across seven leading multimodal agents. We find that while standard reliability within a fixed app is relatively stable, reliability can vary drastically when measured across app variations: task success rates for many agents fluctuate by more than 50% across app versions. For example, Kimi-VL-3B's average success across all tasks ranges from 63% to just 4% depending on the app version. We also find that agent behaviors such as looping or hallucinating actions can differ drastically depending on the environment configuration. These initial findings highlight the importance of measuring reliability along this new dimension of app variations.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

- Core-task Taxonomy Papers: 30
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 0
- Refutable Papers: 0

Research Landscape Overview

Core task: measuring UI agent reliability across environment variations.

The field has organized itself around several complementary perspectives. Agent Robustness to Interface and Environmental Variations examines how agents handle GUI anomalies, layout shifts, and real-world interface changes: work such as the GUI Agents Survey[2] and Trustworthy GUI Agents[3] surveys these challenges broadly, while studies like Locator Robustness[1] and GUI-Robust[6] drill into specific failure modes. Test Automation Robustness and Reliability focuses on traditional software testing concerns, including locator stability and cross-platform consistency. Agent Architectures and Task Automation Systems explores the design of end-to-end systems that orchestrate perception, planning, and action, often integrating multimodal models. Domain-Specific Agent Benchmarks and Applications tailors evaluation to particular environments such as mobile apps, web browsers, or desktop software. Formal Methods and Model-Based Approaches for UI Reliability brings verification techniques and symbolic reasoning to bear on correctness guarantees. Finally, Simulation and Modeling for Agent Evaluation provides controlled testbeds where environmental parameters can be systematically varied.

Within the robustness branch, a particularly active line of work investigates how agents degrade under realistic interface perturbations (missing elements, dynamic content, or visual noise) and whether current architectures can generalize beyond clean benchmarks. OpenApps[0] sits squarely in this cluster, emphasizing systematic measurement of reliability when GUI conditions shift. It shares thematic ground with Trustworthy GUI Agents[3], which also prioritizes robustness and safety properties, and with GUI-Robust[6], which targets adversarial or noisy interface scenarios.

Compared to these neighbors, OpenApps[0] appears to focus more explicitly on quantifying degradation across a spectrum of environmental variations rather than proposing a single hardening technique. This positioning reflects a broader trend: as agent capabilities improve, the community is moving from proof-of-concept demonstrations toward rigorous stress-testing and reliability engineering, ensuring that deployed systems remain dependable when real-world interfaces inevitably deviate from training distributions.

Claimed Contributions

OpenApps: A configurable ecosystem for measuring UI-agent reliability across app variations

The authors introduce OpenApps, a lightweight Python-based environment containing six configurable apps that can generate thousands of versions to measure agent reliability across app variations in appearance and content, rather than only within fixed environments.

0 retrieved papers
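The claimed mechanism, generating many app versions from a configuration space, can be pictured with a small sketch. The configuration axes and field names below are hypothetical illustrations, not the actual OpenApps API:

```python
import itertools
import random

# Hypothetical configuration axes for one app; the real OpenApps
# ecosystem exposes its own (different) appearance/content options.
THEMES = ["light", "dark", "high-contrast"]
FONTS = ["sans", "serif", "mono"]
LAYOUTS = ["list", "grid"]

def generate_configs(n: int, seed: int = 0) -> list[dict]:
    """Sample n distinct app configurations from the axis grid."""
    rng = random.Random(seed)
    combos = list(itertools.product(THEMES, FONTS, LAYOUTS))
    rng.shuffle(combos)
    return [{"theme": t, "font": f, "layout": lay} for t, f, lay in combos[:n]]

configs = generate_configs(5)
```

Because each version is just a configuration record, thousands of app variants can be enumerated cheaply, which is consistent with the paper's claim that the ecosystem runs on a single CPU.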
Measurement of reliability across app variations as a new dimension

The authors establish a new dimension for evaluating UI-agent reliability by measuring performance fluctuations across different app variations (design, appearance, content), addressing a blind spot in current evaluations that rely on fixed environment clones.

0 retrieved papers
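As a minimal illustration of this dimension (assuming per-episode success records keyed by app version; this is not the paper's actual evaluation code), the fluctuation can be summarized as the spread between the best and worst per-version success rates:

```python
# Sketch: quantifying how much an agent's success rate fluctuates
# across app variations, given (version_id, succeeded) records.
def success_rates_by_version(results):
    """results: iterable of (version_id, succeeded: bool) pairs."""
    by_version = {}
    for version, ok in results:
        by_version.setdefault(version, []).append(ok)
    return {v: sum(oks) / len(oks) for v, oks in by_version.items()}

def fluctuation(results):
    """Max minus min per-version success rate."""
    rates = success_rates_by_version(results)
    return max(rates.values()) - min(rates.values())

# Toy data mirroring the reported Kimi-VL-3B range (63% vs. 4%):
results = ([("v1", True)] * 63 + [("v1", False)] * 37
           + [("v2", True)] * 4 + [("v2", False)] * 96)
```

On this toy data, `fluctuation(results)` is 0.59, the kind of cross-version spread the paper argues is invisible to fixed-environment evaluation.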
Ground-truth state-based reward function avoiding trajectory imitation and reward hacking

The authors design a deterministic reward function based on complete app state verification that avoids the limitations of human-trajectory rewards and change-based checks, preventing agents from gaming rewards through unintended actions.

0 retrieved papers
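A minimal sketch of this idea, assuming a hypothetical dictionary-valued app state (the paper's actual state schema and verification are richer): success is determined by checking the final app state against a goal specification, rather than matching a human trajectory or merely detecting that something changed:

```python
# Sketch of a ground-truth, state-based reward. The state keys here
# are invented for illustration; OpenApps defines its own app state.
def reward(final_state: dict, goal_state: dict) -> float:
    """Return 1.0 only if every goal field matches the final app state.

    Verifying the state itself (not a diff, not an action sequence)
    means an agent cannot earn reward through unintended side effects
    or by reproducing a reference trajectory without achieving the goal.
    """
    return 1.0 if all(
        final_state.get(key) == value for key, value in goal_state.items()
    ) else 0.0

goal = {"event_title": "Dentist", "event_date": "2025-03-01"}
```

For example, a final state that contains the goal fields (plus unrelated fields such as the theme) earns 1.0, while a state with a wrong date earns 0.0, regardless of how many actions the agent took along the way.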

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

OpenApps: A configurable ecosystem for measuring UI-agent reliability across app variations

The authors introduce OpenApps, a lightweight Python-based environment containing six configurable apps that can generate thousands of versions to measure agent reliability across app variations in appearance and content, rather than only within fixed environments.

Contribution

Measurement of reliability across app variations as a new dimension

The authors establish a new dimension for evaluating UI-agent reliability by measuring performance fluctuations across different app variations (design, appearance, content), addressing a blind spot in current evaluations that rely on fixed environment clones.

Contribution

Ground-truth state-based reward function avoiding trajectory imitation and reward hacking

The authors design a deterministic reward function based on complete app state verification that avoids the limitations of human-trajectory rewards and change-based checks, preventing agents from gaming rewards through unintended actions.
