ViMo: A Generative Visual GUI World Model for App Agents
Overview
Overall Novelty Assessment
The paper introduces ViMo as the first visual world model for mobile app agents, generating future GUI observations as images rather than as text descriptions. In the taxonomy, it resides in the 'Image-Based GUI World Models' leaf under 'Visual GUI State Prediction for Agent Planning'. This leaf contains only two papers: ViMo itself and one sibling work. The sparse population suggests an emerging research direction rather than a crowded subfield, with limited prior exploration of pixel-level GUI prediction for autonomous agents.
The taxonomy reveals that neighboring research directions pursue fundamentally different approaches. The sibling branch 'Visual Aesthetics Distribution Prediction' focuses on scoring interface designs rather than predicting state transitions. Meanwhile, the broader 'User-Initiated App and Action Prediction' branch encompasses five distinct leaves with thirteen papers in total, all of which forecast user behavior from usage logs rather than model GUI states visually. The taxonomy's scope and exclusion notes clarify that ViMo's image-based generative approach deliberately diverges from text-only world models and from user-centric prediction tasks, positioning it at the intersection of computer vision and agent planning.
Among the twenty candidates examined, the contribution-level analysis reveals mixed novelty signals. The core claim of being the 'first visual GUI world model' has one refutable candidate among the nine examined, so at least one prior work explores overlapping territory within this limited search scope. The Symbolic Text Representation contribution likewise has one refutable candidate among the ten examined. The two-stage architecture combining the STR Predictor and GUI-text Predictor shows no refutation, though only a single candidate was examined. These statistics indicate that while the overall visual world-modeling direction appears relatively novel, specific technical components may have precedents in the examined literature.
Based on the limited search scope of twenty semantically similar papers, the work appears to occupy a sparsely populated research area with some prior overlap in its core claims. The taxonomy structure confirms that image-based GUI world models form a small, emerging cluster, though the analysis cannot rule out relevant work beyond the top-K semantic matches examined. The contribution-level findings suggest incremental novelty in the architectural choices, while the broader visual-prediction paradigm has at least one substantial prior reference.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce ViMo, the first generative visual GUI world model, which predicts future app observations in the visual modality, i.e., as images rather than text descriptions. This enables more realistic and concrete visual GUI predictions than language-based methods.
The authors propose STR, a novel data representation that decouples GUI generation into graphic generation and text-content generation by overlaying text regions with uniform symbolic placeholders (text symbols). This reduces text-content generation to text-location generation, sidestepping the pixel-level accuracy required to render readable text in GUIs.
The authors design a two-stage architecture in which a diffusion-based STR Predictor generates the graphic structure of the next GUI (with text symbols), and an LLM-based GUI-text Predictor then generates the actual text content for each symbol. This division enables accurate generation of both the visual layout and the semantic text content.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[3] ViMo: A Generative Visual GUI World Model for App Agents
Contribution Analysis
Detailed comparisons for each claimed contribution
ViMo: First Visual GUI World Model for App Agents
The authors introduce ViMo, the first generative visual GUI world model, which predicts future app observations in the visual modality, i.e., as images rather than text descriptions. This enables more realistic and concrete visual GUI predictions than language-based methods.
[3] ViMo: A Generative Visual GUI World Model for App Agents
[28] WorldVLA: Towards Autoregressive Action World Model
[29] GUI Agents: A Survey
[30] GUI-World: A Dataset for GUI-Orientated Multimodal Large Language Models
[31] Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control
[32] V-Zen: Efficient GUI Understanding and Precise Grounding With A Novel Multimodal LLM
[33] MobileDreamer: Generative Sketch World Model for GUI Agent
[34] UISim: An Interactive Image-Based UI Simulator for Dynamic Mobile Environments
[35] A Survey on Reinforcement Learning of Vision-Language-Action Models for Robotic Manipulation
Symbolic Text Representation (STR) for GUI Generation
The authors propose STR, a novel data representation that decouples GUI generation into graphic generation and text-content generation by overlaying text regions with uniform symbolic placeholders (text symbols). This reduces text-content generation to text-location generation, sidestepping the pixel-level accuracy required to render readable text in GUIs.
[3] ViMo: A Generative Visual GUI World Model for App Agents
[19] Fashioning Creative Expertise with Generative AI: Graphical Interfaces for Design Space Exploration Better Support Ideation Than Text Prompts
[20] Understanding Mobile GUI: From Pixel-Words to Screen-Sentences
[21] CoCo-Agent: A Comprehensive Cognitive MLLM Agent for Smartphone GUI Automation
[22] Comprehensive Cognitive LLM Agent for Smartphone GUI Automation
[23] MP-GUI: Modality Perception with MLLMs for GUI Understanding
[24] Regenerating a Graphical User Interface Using Deep Learning
[25] DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning
[26] A Parse-Then-Place Approach for Generating Graphic Layouts from Textual Descriptions
[27] Development of a Graphical User Interface for Automatic Separation of Human Voice from Doppler Ultrasound Audio in Diving Experiments
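The STR idea described above can be sketched in a few lines: given a screenshot and the bounding boxes of its text regions (e.g., from OCR), every text region is overwritten with a uniform placeholder value, so a generative model only has to predict where text goes, not render readable glyphs. This is a minimal illustrative sketch, not ViMo's implementation; the grayscale-array image format, the placeholder value, and all names here are assumptions.

```python
# Hypothetical sketch of Symbolic Text Representation (STR).
# An "image" is a 2-D list of grayscale ints; real STR operates on
# actual screenshots, so everything below is illustrative only.

PLACEHOLDER = 128  # uniform symbol value standing in for rendered text


def to_str_image(image, text_boxes):
    """Return a copy of `image` with every text box overwritten by the
    placeholder, plus the ordered slot list that a downstream GUI-text
    predictor would later fill with actual strings."""
    out = [row[:] for row in image]  # deep-enough copy of the 2-D grid
    for (x0, y0, x1, y1) in text_boxes:
        for y in range(y0, y1):
            for x in range(x0, x1):
                out[y][x] = PLACEHOLDER
    return out, list(text_boxes)


# Toy 4x6 "screenshot" with one text region spanning cols 1..4, rows 1..2.
img = [[0] * 6 for _ in range(4)]
str_img, slots = to_str_image(img, [(1, 1, 5, 3)])
```

The key property the sketch demonstrates is the decoupling: `str_img` carries only layout (placeholder locations), while `slots` defers the actual string content to a second model.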
Two-Stage Architecture: STR Predictor and GUI-text Predictor
The authors design a two-stage architecture in which a diffusion-based STR Predictor generates the graphic structure of the next GUI (with text symbols), and an LLM-based GUI-text Predictor then generates the actual text content for each symbol. This division enables accurate generation of both the visual layout and the semantic text content.
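The data flow of this two-stage design can be sketched as follows: a (stubbed) STR Predictor proposes the next GUI's layout with text-symbol slots, then a (stubbed) GUI-text Predictor fills each slot with a string. Both model calls are replaced by trivial stand-ins; all function names, the dictionary-based observation format, and the toy transition are assumptions for illustration, not ViMo's actual interfaces.

```python
# Minimal data-flow sketch of a two-stage GUI world model.
# Real systems would back these stubs with a diffusion model and an LLM.

def str_predictor(str_image, action):
    """Stub for the diffusion-based STR Predictor: returns the predicted
    next STR image (layout with text symbols) and its text-symbol slots.
    Here the 'prediction' is a fixed toy transition."""
    next_image = {"layout": f"{str_image['layout']}->{action}"}
    slots = ["slot_0", "slot_1"]
    return next_image, slots


def gui_text_predictor(slots, action):
    """Stub for the LLM-based GUI-text Predictor: maps each symbol slot
    to concrete text content conditioned on the action."""
    return {s: f"text for {s} after '{action}'" for s in slots}


def predict_next_gui(str_image, action):
    """Two-stage rollout: graphics first, then text for each symbol."""
    next_image, slots = str_predictor(str_image, action)
    texts = gui_text_predictor(slots, action)
    return {"image": next_image, "texts": texts}


obs = predict_next_gui({"layout": "home"}, "tap settings")
```

An agent planner could call `predict_next_gui` repeatedly to roll out candidate action sequences and score the imagined observations before acting.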