OpenApps: Simulating Environment Variations to Measure UI Agent Reliability
Claimed Contributions
The authors introduce OpenApps, a lightweight Python-based environment of six configurable apps. Varying the apps' appearance and content yields thousands of versions, so agent reliability can be measured across app variations rather than only within a fixed environment.
The authors establish a new dimension for evaluating UI-agent reliability by measuring performance fluctuations across different app variations (design, appearance, content), addressing a blind spot in current evaluations that rely on fixed environment clones.
The authors design a deterministic reward function based on complete app state verification that avoids the limitations of human-trajectory rewards and change-based checks, preventing agents from gaming rewards through unintended actions.
Contribution Analysis
Detailed comparisons for each claimed contribution
OpenApps: A configurable ecosystem for measuring UI-agent reliability across app variations
The authors introduce OpenApps, a lightweight Python-based environment of six configurable apps. Varying the apps' appearance and content yields thousands of versions, so agent reliability can be measured across app variations rather than only within a fixed environment.
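To illustrate how a handful of configuration axes can multiply into thousands of app versions, here is a minimal sketch. It is not the OpenApps API: the app names (`app_1` to `app_6`), the `AppVariant` class, the axis names, and their values are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AppVariant:
    """One concrete version of an app (hypothetical schema, not OpenApps')."""
    app: str
    theme: str
    font_size: int
    layout: str
    content_seed: int


# Placeholder names: the source does not list the six apps.
APPS = [f"app_{i}" for i in range(1, 7)]

# Assumed variation axes covering appearance and content.
AXES = {
    "theme": ["light", "dark"],
    "font_size": [12, 16, 20],
    "layout": ["compact", "wide"],
    "content_seed": list(range(100)),
}


def enumerate_variants(apps, axes):
    """Yield every combination of axis values for every app."""
    names = list(axes)
    for app in apps:
        for values in product(*(axes[name] for name in names)):
            yield AppVariant(app=app, **dict(zip(names, values)))


variants = list(enumerate_variants(APPS, AXES))
print(len(variants))
```

Because the axes combine multiplicatively, even a few options per axis produce a large space of versions from a small configuration.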
Measurement of reliability across app variations as a new dimension
The authors establish a new dimension for evaluating UI-agent reliability by measuring performance fluctuations across different app variations (design, appearance, content), addressing a blind spot in current evaluations that rely on fixed environment clones.
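One way to quantify such fluctuations is to compute a success rate per app variation and then summarize the spread of those rates. The sketch below is an assumption about how this could be done, not the paper's metric: the function name `reliability_summary`, the variation labels, and the choice of standard deviation and range as spread measures are all illustrative.

```python
from statistics import mean, pstdev


def reliability_summary(outcomes):
    """Summarize task success across app variations.

    outcomes maps a variation label to a list of booleans,
    one per attempt of the same task in that variation.
    """
    rates = {
        variation: mean(1.0 if ok else 0.0 for ok in runs)
        for variation, runs in outcomes.items()
    }
    values = list(rates.values())
    return {
        "per_variation": rates,
        "mean": mean(values),               # average success rate
        "std": pstdev(values),              # spread across variations
        "range": max(values) - min(values)  # best case minus worst case
    }


# Made-up outcomes for one task under three hypothetical variations.
example = {
    "default": [True, True, True, True],
    "dark_theme": [True, False, True, False],
    "large_font": [False, False, False, False],
}
summary = reliability_summary(example)
print(summary)
```

An evaluation on the `default` variation alone would report a perfect score here, while the range across variations shows that the same agent can fail completely under a different appearance.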
Ground-truth state-based reward function avoiding trajectory imitation and reward hacking
The authors design a deterministic reward function based on complete app state verification that avoids the limitations of human-trajectory rewards and change-based checks, preventing agents from gaming rewards through unintended actions.
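The difference between a change-based check and complete state verification can be shown with a small example. This is a sketch under assumptions, not the authors' implementation: the to-do list state, the task, and the function names `change_based_check` and `state_reward` are hypothetical.

```python
def change_based_check(final_state, required_item):
    """Pass if the requested change is present, ignoring everything else."""
    return 1.0 if required_item in final_state["todos"] else 0.0


def state_reward(final_state, expected_state):
    """Pass only if the complete final state matches the expected state."""
    return 1.0 if final_state == expected_state else 0.0


# Hypothetical task: add "call Alice" to a to-do list holding "buy milk".
expected = {"todos": ["buy milk", "call Alice"]}

# A careful agent makes only the requested change.
careful = {"todos": ["buy milk", "call Alice"]}

# A sloppy agent adds the item but also deletes an unrelated one.
sloppy = {"todos": ["call Alice"]}

for name, final in [("careful", careful), ("sloppy", sloppy)]:
    print(name, change_based_check(final, "call Alice"), state_reward(final, expected))
```

The change-based check rewards both agents, since the requested item is present in each final state. Comparing the complete state rejects the sloppy run because the unintended deletion makes the final state differ from the expected one.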