Visual Planning: Let's Think Only with Images

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: visual planning
Abstract:

Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations for these "vision-first" tasks, as a supplementary channel to language-based reasoning. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements on three representative visual navigation tasks: FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising supplement to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a Visual Planning paradigm that reasons exclusively through sequences of images, positioning itself in the 'Purely Visual Planning Paradigms' leaf of the taxonomy. Notably, this leaf contains only the original paper itself; no sibling papers exist in this category. This isolation suggests the work occupies a sparse research direction within the broader Visual Planning Frameworks and Architectures branch, which itself contains only two leaves. The taxonomy reveals that most prior work integrates visual representations with language, symbolic reasoning, or explicit geometric maps, rather than pursuing purely image-based planning.

The taxonomy shows neighboring leaves in Hybrid Planning Approaches, which combine visual representations with other modalities, and extensive work in Vision-Based Navigation and Control, where learned policies or map-based methods dominate. The scope notes clarify that purely visual planning excludes language or symbolic reasoning, distinguishing it from Vision-Language Navigation and Vision-Language-Action Reasoning clusters. The field context indicates that while visual representation learning and visuomotor control are well-populated areas, the specific paradigm of planning through image sequences without intermediate abstractions remains underexplored, with most methods relying on feature embeddings, semantic maps, or language grounding.

Among the thirty candidates examined, none clearly refuted any of the three contributions. Each contribution (the Visual Planning paradigm, the VPRL framework, and the RL-for-image-generation claim) was compared against ten candidates, with zero refutable matches. This absence of refutation reflects the limited search scope rather than definitive novelty: the analysis covers top-K semantic matches and citation expansion, not an exhaustive survey. Within this bounded search, the contributions appear distinct from examined prior work, though the small candidate pool and the sparse taxonomy leaf suggest the field may lack extensive, directly comparable research.

The analysis reveals a work positioned in an underpopulated research direction, with no sibling papers in its taxonomy leaf and limited overlap among thirty examined candidates. The taxonomy structure shows that most visual planning research integrates additional modalities or representations, leaving purely image-based reasoning relatively unexplored. However, the limited search scope and sparse field structure mean these findings reflect a snapshot of accessible literature rather than comprehensive coverage of all potentially relevant work.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Visual planning through purely visual representations.

The field organizes itself around several complementary branches that reflect different emphases in how vision guides action. Visual Representation Learning for Planning focuses on extracting and encoding spatial or temporal features from images to support downstream decision-making, often leveraging learned embeddings or predictive models such as Visual Foresight[3]. Vision-Based Navigation and Control addresses how agents use camera inputs to traverse environments, spanning classical teach-and-repeat paradigms and modern neural approaches like ViNT[12] or GaussNav[10]. Visuomotor Manipulation and Imitation targets robotic grasping and object interaction, where visual feedback directly informs low-level control. Language-Vision Integration for Planning explores how natural language instructions or semantic cues combine with visual observations, exemplified by Visual Language Maps[2]. Visual Planning Frameworks and Architectures examines overarching system designs, such as Universal Planning Networks[1], that unify perception and planning in end-to-end architectures. Finally, Domain-Specific Visual Planning Applications captures specialized settings like urban scene analysis or underwater navigation, demonstrating how core methods adapt to particular constraints.

Within these branches, a central tension emerges between end-to-end learned policies and modular pipelines that separate perception from planning. Some lines of work pursue fully differentiable frameworks that map pixels to actions without explicit geometric reasoning, while others maintain structured representations or classical search components like Neural A*[8].

Visual Planning[0] sits squarely in the Purely Visual Planning Paradigms cluster, emphasizing direct image-to-plan mappings without intermediate symbolic abstractions. This contrasts with approaches such as ThinkAct[5], which interleaves reasoning steps with visual input, and with methods that rely on language grounding like Visual Language Maps[2]. By forgoing explicit semantic or geometric scaffolding, Visual Planning[0] aligns closely with works that treat the visual stream as the primary substrate for both world modeling and action selection, raising open questions about generalization, sample efficiency, and interpretability in purely pixel-driven planning.

Claimed Contributions

Visual Planning paradigm for reasoning purely through visual representations

The authors introduce Visual Planning, a paradigm where planning is executed via sequences of images that encode step-by-step inference in the visual domain, without language mediation. This approach enables models to reason directly in the visual modality for vision-first tasks involving spatial and geometrical information.

10 retrieved papers
VPRL framework: Visual Planning via Reinforcement Learning

The authors propose VPRL, a novel two-stage reinforcement learning framework empowered by GRPO for post-training large vision models. The framework includes a policy initialization stage followed by RL training with progress rewards to enable visual planning through sequential image generation.
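The two ingredients named here, a GRPO objective and a progress reward over generated state images, can be illustrated with a minimal sketch of the critic-free, group-relative advantage computation that GRPO-style training uses. All function names, signatures, and reward values below are illustrative assumptions for exposition, not details taken from the paper:

```python
# Minimal sketch: GRPO-style group-relative advantages paired with a
# hypothetical distance-based progress reward for visual-plan rollouts.

def progress_reward(dist_before: int, dist_after: int) -> float:
    """Score one generated state image by whether it moves the agent
    closer to the goal (+1), leaves it unchanged (0), or regresses (-1)."""
    if dist_after < dist_before:
        return 1.0   # progress toward the goal
    if dist_after == dist_before:
        return 0.0   # no progress (e.g., an invalid or null move)
    return -1.0      # moved away from the goal

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: normalize each rollout's reward against
    the mean/std of its sampled group, with no learned value critic."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

# Example: a group of 4 sampled rollouts, each scored by the change in
# goal distance induced by its generated next-state image.
rewards = [progress_reward(b, a) for b, a in [(5, 4), (5, 5), (5, 6), (5, 4)]]
advs = grpo_advantages(rewards)
```

Rollouts that make progress receive positive advantages and regressing rollouts receive negative ones, so the policy is pushed toward image generations that advance the plan, relative to its own sampled group rather than an absolute baseline.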

10 retrieved papers
First application of RL to image generation for planning tasks

The authors claim to be the first to apply reinforcement learning techniques to image generation specifically for planning tasks, demonstrating substantial performance improvements and better generalization compared to supervised baselines in visual spatial planning settings.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, though still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Visual Planning paradigm for reasoning purely through visual representations

The authors introduce Visual Planning, a paradigm where planning is executed via sequences of images that encode step-by-step inference in the visual domain, without language mediation. This approach enables models to reason directly in the visual modality for vision-first tasks involving spatial and geometrical information.

Comparison result: of the 10 retrieved candidate papers, none refuted this contribution.

Contribution 2: VPRL framework: Visual Planning via Reinforcement Learning

The authors propose VPRL, a novel two-stage reinforcement learning framework empowered by GRPO for post-training large vision models. The framework comprises a policy initialization stage followed by RL training with progress rewards, enabling visual planning through sequential image generation.

Comparison result: of the 10 retrieved candidate papers, none refuted this contribution.

Contribution 3: First application of RL to image generation for planning tasks

The authors claim to be the first to apply reinforcement learning techniques to image generation specifically for planning tasks, demonstrating substantial performance improvements and better generalization over supervised baselines in visual spatial planning settings.

Comparison result: of the 10 retrieved candidate papers, none refuted this contribution.