ViMo: A Generative Visual GUI World Model for App Agents

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: World Model, GUI Generation, App Agent
Abstract:

App agents, which autonomously operate mobile Apps through GUIs, have gained significant interest in real-world applications. Yet, they often struggle with long-horizon planning, failing to identify optimal actions for complex, multi-step tasks. To address this, world models are used to predict the next GUI observation given a user action, enabling more effective agent planning. However, existing world models primarily generate textual descriptions, lacking essential visual details. To fill this gap, we propose ViMo, the first Visual world Model designed to generate future App observations as images. To address the challenge of generating text within image patches, where even minor pixel errors can distort readability, we decompose GUI generation into graphic and text content generation. We propose a novel data representation, the Symbolic Text Representation (STR), which overlays text content with symbolic placeholders while preserving graphics. With this design, ViMo employs an STR Predictor to predict future GUIs' graphics and a GUI-text Predictor to generate the corresponding text. Moreover, we deploy ViMo to enhance agent-focused tasks by predicting the outcomes of actions. Experiments show that ViMo establishes visual world models as a compelling alternative to language-based approaches, producing visually plausible and functionally effective GUIs that empower App agents to make more informed decisions.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces ViMo as the first visual world model for mobile app agents, generating future GUI observations as images rather than text descriptions. According to the taxonomy, it resides in the 'Image-Based GUI World Models' leaf under 'Visual GUI State Prediction for Agent Planning'. This leaf contains only two papers: ViMo itself and one sibling work. The sparse population suggests this is an emerging research direction rather than a crowded subfield, with limited prior exploration of pixel-level GUI prediction for autonomous agents.

The taxonomy reveals that neighboring research directions pursue fundamentally different approaches. The sibling branch 'Visual Aesthetics Distribution Prediction' focuses on scoring interface designs rather than predicting state transitions. Meanwhile, the broader 'User-Initiated App and Action Prediction' branch encompasses five distinct leaves with thirteen papers total, all emphasizing user behavior forecasting from usage logs rather than visual world modeling. The taxonomy's scope and exclude notes clarify that ViMo's image-based generative approach deliberately diverges from text-only world models and user-centric prediction tasks, positioning it at the intersection of computer vision and agent planning.

Among twenty candidates examined, the contribution-level analysis reveals mixed novelty signals. The core claim of being the 'first visual GUI world model' shows one refutable candidate among nine examined, suggesting at least one prior work explores overlapping territory within this limited search scope. The Symbolic Text Representation contribution similarly encounters one refutable candidate among ten examined. However, the two-stage architecture combining STR Predictor and GUI-text Predictor shows no refutations among the single candidate examined. These statistics indicate that while the overall visual world modeling direction appears relatively novel, specific technical components may have precedents in the examined literature.

Based on the limited search scope of twenty semantically similar papers, the work appears to occupy a sparsely populated research area with some prior overlap in core claims. The taxonomy structure confirms that image-based GUI world models constitute a small emerging cluster, though the analysis cannot rule out relevant work beyond the top-K semantic matches examined. The contribution-level findings suggest incremental novelty in architectural choices while the broader visual prediction paradigm shows at least one substantial prior reference.

Taxonomy

- Core-task Taxonomy Papers: 18
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 20
- Refutable Papers: 2

Research Landscape Overview

Core task: predicting future mobile app GUI observations from the current state and user actions.

The field encompasses several distinct branches that reflect different motivations and technical approaches. Visual GUI State Prediction for Agent Planning focuses on building world models that enable autonomous agents to anticipate interface changes, often using image-based representations to support planning and decision-making in interactive environments. User-Initiated App and Action Prediction emphasizes forecasting which applications or actions a user will invoke next, drawing on usage logs and sequential patterns to improve recommendations and system responsiveness. Adaptive Interface Design and User Intent Modeling explores how interfaces can dynamically adjust based on inferred user goals, while Web and Cross-Platform Usage Pattern Analysis extends predictive modeling beyond mobile apps to broader digital ecosystems. These branches share the common goal of anticipating future states but differ in whether the prediction serves an autonomous agent, a personalized recommendation system, or an adaptive interface.

Within Visual GUI State Prediction for Agent Planning, a small handful of works have emerged that treat GUI transitions as learnable dynamics. ViMo[0] constructs an image-based world model to predict pixel-level GUI changes given actions, enabling agents to simulate outcomes before execution. This approach contrasts with ViMo App Agent[3], which builds on similar predictive machinery but integrates it more tightly into an agent's planning loop for task completion. Meanwhile, branches like User-Initiated App and Action Prediction include studies such as Predicting Mobile Usage[1] and Next App Prediction[8], which rely on temporal usage traces rather than visual state representations.
The tension between pixel-level world models and symbolic or log-based forecasting highlights an open question: whether richer visual predictions justify their computational cost compared to lighter sequential models. ViMo[0] sits squarely in the image-based world model cluster, emphasizing generative fidelity over interpretability, and differs from adaptive interface works like AI Interface Design[5] that prioritize real-time user intent inference.

Claimed Contributions

ViMo: First Visual GUI World Model for App Agents

The authors introduce ViMo, the first generative visual GUI world model that predicts future App observations in visual modality (as images) rather than text descriptions. This enables more realistic and concrete visual GUI predictions compared to language-based methods.

Retrieved papers: 9 · Verdict: Can Refute
Symbolic Text Representation (STR) for GUI Generation

The authors propose STR, a novel data representation that decouples GUI generation into graphic and text content generation by overlaying text with uniform symbolic placeholders (text symbols). This simplifies text content generation to text location generation, addressing the challenge of pixel-level accuracy required for readable text in GUIs.

Retrieved papers: 10 · Verdict: Can Refute
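The STR idea described above can be illustrated with a minimal sketch: each text region in a GUI screenshot is overlaid with a uniform placeholder patch, so a graphics model only needs to predict where text appears, not its exact glyph pixels. The box format, the placeholder fill value, and the function names below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Uniform gray fill standing in for the paper's "text symbol" (assumed value)
PLACEHOLDER = 128

def to_str_image(gui: np.ndarray, text_boxes: list) -> tuple:
    """Return (str_image, slots): the GUI with every text box overlaid by a
    uniform placeholder, plus the ordered list of text slots that a separate
    text predictor would later fill in."""
    str_img = gui.copy()
    slots = []
    for i, (x0, y0, x1, y1) in enumerate(text_boxes):
        # Graphics outside the box are preserved; glyph pixels are erased
        str_img[y0:y1, x0:x1] = PLACEHOLDER
        slots.append({"slot_id": i, "box": (x0, y0, x1, y1)})
    return str_img, slots

# Toy usage: a blank 100x200 "screenshot" with one text box
gui = np.zeros((100, 200), dtype=np.uint8)
str_img, slots = to_str_image(gui, [(10, 10, 60, 30)])
```

Decoupling the two subproblems this way means pixel-level errors in the graphics stage can no longer corrupt text readability, which is the motivation the paper gives for STR.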
Two-Stage Architecture: STR Predictor and GUI-text Predictor

The authors design a two-stage architecture where a diffusion-based STR Predictor generates the graphic structure (with text symbols) of the next GUI, and an LLM-based GUI-text Predictor generates the actual text content for each symbol. This architecture enables accurate generation of both visual layout and semantic text content.

Retrieved papers: 1 · no refutation found
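The control flow of the two-stage design can be sketched as a simple composition: a graphics stage produces the next GUI's layout with symbolic text slots, and a text stage fills each slot. In the paper the first stage is a diffusion model and the second an LLM; here both are stand-in stubs (all names and return shapes are assumptions) so that only the pipeline structure is shown.

```python
def predict_str(gui_image, action):
    """Stub for the diffusion-based STR Predictor: returns the next GUI's
    graphic layout with symbolic text slots (faked here as a dict)."""
    return {"layout": f"{gui_image}+{action}", "slots": [0, 1]}

def predict_text(str_layout, action):
    """Stub for the LLM-based GUI-text Predictor: produces the actual text
    content for each symbolic slot in the predicted layout."""
    return {slot: f"text_for_slot_{slot}" for slot in str_layout["slots"]}

def predict_next_gui(gui_image, action):
    """One world-model step: graphics first, then text content."""
    str_layout = predict_str(gui_image, action)
    texts = predict_text(str_layout, action)
    return str_layout, texts

# Toy usage: predict the outcome of tapping search on a home screen
str_layout, texts = predict_next_gui("home_screen", "tap_search")
```

An agent would call `predict_next_gui` for each candidate action and compare the predicted observations before committing to a step, which is how the paper deploys ViMo for planning.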

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

- ViMo: First Visual GUI World Model for App Agents
- Symbolic Text Representation (STR) for GUI Generation
- Two-Stage Architecture: STR Predictor and GUI-text Predictor