GhostEI-Bench: Do Mobile Agent Resilience to Environmental Injection in Dynamic On-Device Environments?
Overview
Overall Novelty Assessment
The paper introduces GhostEI-Bench, a benchmark for evaluating mobile agents under environmental injection attacks within executable Android environments. It resides in the 'Benchmark and Evaluation Frameworks for Environmental Injection' leaf, which contains four papers total. This is a relatively sparse research direction within the broader taxonomy of 50 papers, suggesting that systematic evaluation frameworks for environmental injection remain underdeveloped. The work targets a specific gap: moving beyond static image-based assessments to dynamic, executable workflows where adversarial UI elements are injected into realistic application contexts.
The taxonomy reveals that environmental injection attacks on mobile and GUI agents form one major branch, with sibling leaves addressing security vulnerabilities and defense mechanisms. Neighboring branches cover false data injection in multi-agent systems and adversarial perturbations in reinforcement learning, which focus on sensor spoofing and state-space attacks rather than GUI-level manipulation. The scope note for this leaf explicitly excludes general robustness testing without environmental injection focus, positioning GhostEI-Bench within a narrow but critical niche: evaluating how agents perceive and respond to adversarial visual cues in mobile interfaces, distinct from prompt-based or communication-layer attacks.
Among 29 candidates examined, the analysis identified potential overlaps across all three contributions. The benchmark contribution examined 10 candidates with 1 refutable match, the evaluation protocol examined 10 with 2 refutable matches, and the threat model formalization examined 9 with 3 refutable matches. These statistics indicate that within the limited search scope, some prior work addresses related evaluation methodologies or threat characterizations. However, the relatively low refutation counts suggest that the specific combination of executable Android environments, dynamic injection, and fine-grained failure analysis may offer incremental novelty over existing static or web-focused benchmarks.
Given the sparse taxonomy leaf and limited search scope of 29 candidates, the work appears to occupy a moderately novel position within environmental injection evaluation. The analysis does not cover exhaustive literature beyond top-K semantic matches, so additional related work may exist in adjacent domains such as web agent security or mobile app testing. The contribution-level statistics suggest that while individual components have precedents, the integrated benchmark design targeting mobile GUI agents in executable environments may represent a meaningful step forward in a nascent research area.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present GhostEI-Bench, a comprehensive benchmark that systematically evaluates mobile agent robustness against environmental injection attacks in fully operational Android emulators. The benchmark includes 110 test cases spanning seven critical risk fields and three attack vectors, moving beyond static image-based assessments to inject adversarial events into realistic application workflows.
The authors propose an evaluation protocol that uses a judge LLM to analyze agent action trajectories and screenshots, identifying precise failure points in perception, recognition, or reasoning. This protocol enables systematic assessment of both capability and robustness through metrics including Task Completion, Full/Partial Attack Success, and Vulnerability Rate.
The authors establish environmental injection as a unique threat vector that contaminates agent visual perception through adversarial UI elements like deceptive overlays or spoofed notifications. This formalization defines a unified threat model encompassing three attack vectors: Deceptive Instruction, Static Environmental Injection, and Dynamic Environmental Injection across seven critical risk fields.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] Evaluating the robustness of multimodal agents against active environmental injection attacks PDF
[2] AEIA-MN: Evaluating the Robustness of Multimodal LLM-Powered Mobile Agents Against Active Environmental Injection Attacks PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
GhostEI-Bench benchmark for environmental injection attacks
The authors present GhostEI-Bench, a comprehensive benchmark that systematically evaluates mobile agent robustness against environmental injection attacks in fully operational Android emulators. The benchmark includes 110 test cases spanning seven critical risk fields and three attack vectors, moving beyond static image-based assessments to inject adversarial events into realistic application workflows.
[6] Hijacking JARVIS: Benchmarking Mobile GUI Agents against Unprivileged Third Parties PDF
[1] Evaluating the robustness of multimodal agents against active environmental injection attacks PDF
[2] AEIA-MN: Evaluating the Robustness of Multimodal LLM-Powered Mobile Agents Against Active Environmental Injection Attacks PDF
[16] Mobilesafetybench: Evaluating safety of autonomous agents in mobile device control PDF
[66] MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents PDF
[67] WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks PDF
[68] To Protect the LLM Agent Against the Prompt Injection Attack with Polymorphic Prompt PDF
[69] WebInject: Prompt Injection Attack to Web Agents PDF
[70] BabelView: Evaluating the Impact of Code Injection Attacks in Mobile Webviews PDF
Novel LLM-based evaluation protocol with fine-grained failure analysis
The authors propose an evaluation protocol that uses a judge LLM to analyze agent action trajectories and screenshots, identifying precise failure points in perception, recognition, or reasoning. This protocol enables systematic assessment of both capability and robustness through metrics including Task Completion, Full/Partial Attack Success, and Vulnerability Rate.
[58] Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge PDF
[65] Os-sentinel: Towards safety-enhanced mobile gui agents via hybrid validation in realistic workflows PDF
[56] Why Do Multi-Agent LLM Systems Fail? PDF
[57] The danger of overthinking: Examining the reasoning-action dilemma in agentic tasks PDF
[59] Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges PDF
[60] Abduct, Act, Predict: Scaffolding Causal Inference for Automated Failure Attribution in Multi-Agent Systems PDF
[61] Plan Verification for LLM-Based Embodied Task Completion Agents PDF
[62] Why do multiagent systems fail? PDF
[63] Reasoningbank: Scaling agent self-evolving with reasoning memory PDF
[64] Agentic Program Repair from Test Failures at Scale: A Neuro-symbolic approach with static analysis and test execution feedback PDF
Formalization of environmental injection as a distinct threat model
The authors establish environmental injection as a unique threat vector that contaminates agent visual perception through adversarial UI elements like deceptive overlays or spoofed notifications. This formalization defines a unified threat model encompassing three attack vectors: Deceptive Instruction, Static Environmental Injection, and Dynamic Environmental Injection across seven critical risk fields.