Visual Planning: Let's Think Only with Images

ICLR 2026 Conference Submission. Anonymous Authors.
Keywords: visual planning
Abstract:

Recent advancements in Large Language Models (LLMs) and their multimodal extensions (MLLMs) have substantially enhanced machine reasoning across diverse tasks. However, these models predominantly rely on pure text as the medium for both expressing and structuring reasoning, even when visual information is present. In this work, we argue that language may not always be the most natural or effective modality for reasoning, particularly in tasks involving spatial and geometrical information. Motivated by this, we propose a new paradigm, Visual Planning, which enables planning through purely visual representations for these "vision-first" tasks, as a supplementary channel to language-based reasoning. In this paradigm, planning is executed via sequences of images that encode step-by-step inference in the visual domain, akin to how humans sketch or visualize future actions. We introduce a novel reinforcement learning framework, Visual Planning via Reinforcement Learning (VPRL), empowered by GRPO for post-training large vision models, leading to substantial improvements on three representative visual navigation tasks: FrozenLake, Maze, and MiniBehavior. Our visual planning paradigm outperforms all other planning variants that conduct reasoning in the text-only space. Our results establish Visual Planning as a viable and promising supplement to language-based reasoning, opening new avenues for tasks that benefit from intuitive, image-based inference.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a Visual Planning paradigm that reasons exclusively through sequences of images, positioning itself in the 'Purely Visual Planning Paradigms' leaf of the taxonomy. Notably, this leaf contains only the original paper itself; no sibling papers exist in this category. This isolation suggests the work occupies a sparse research direction within the broader Visual Planning Frameworks and Architectures branch, which itself contains only two leaves. The taxonomy reveals that most prior work integrates visual representations with language, symbolic reasoning, or explicit geometric maps, rather than pursuing purely image-based planning.

The taxonomy shows neighboring leaves in Hybrid Planning Approaches, which combine visual representations with other modalities, and extensive work in Vision-Based Navigation and Control, where learned policies or map-based methods dominate. The scope notes clarify that purely visual planning excludes language or symbolic reasoning, distinguishing it from Vision-Language Navigation and Vision-Language-Action Reasoning clusters. The field context indicates that while visual representation learning and visuomotor control are well-populated areas, the specific paradigm of planning through image sequences without intermediate abstractions remains underexplored, with most methods relying on feature embeddings, semantic maps, or language grounding.

Among the thirty candidates examined, none clearly refuted any of the three contributions. Each contribution (the Visual Planning paradigm, the VPRL framework, and the RL-for-image-generation claim) was compared against ten candidates, with zero refutable matches. This absence of refutation reflects the limited search scope rather than definitive novelty: the analysis covers top-K semantic matches and citation expansion, not an exhaustive survey. Within this bounded search, the contributions appear distinct from examined prior work, though the small candidate pool and the sparse taxonomy leaf suggest the field may lack extensive, directly comparable research.

The analysis reveals a work positioned in an underpopulated research direction, with no sibling papers in its taxonomy leaf and limited overlap among thirty examined candidates. The taxonomy structure shows that most visual planning research integrates additional modalities or representations, leaving purely image-based reasoning relatively unexplored. However, the limited search scope and sparse field structure mean these findings reflect a snapshot of accessible literature rather than comprehensive coverage of all potentially relevant work.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Visual planning through purely visual representations.

The field organizes itself around several complementary branches that reflect different emphases in how vision guides action. Visual Representation Learning for Planning focuses on extracting and encoding spatial or temporal features from images to support downstream decision-making, often leveraging learned embeddings or predictive models such as Visual Foresight[3]. Vision-Based Navigation and Control addresses how agents use camera inputs to traverse environments, spanning classical teach-and-repeat paradigms and modern neural approaches like ViNT[12] or GaussNav[10]. Visuomotor Manipulation and Imitation targets robotic grasping and object interaction, where visual feedback directly informs low-level control. Language-Vision Integration for Planning explores how natural language instructions or semantic cues combine with visual observations, exemplified by Visual Language Maps[2]. Visual Planning Frameworks and Architectures examines overarching system designs, such as Universal Planning Networks[1], that unify perception and planning in end-to-end architectures. Finally, Domain-Specific Visual Planning Applications captures specialized settings like urban scene analysis or underwater navigation, demonstrating how core methods adapt to particular constraints.

Within these branches, a central tension emerges between end-to-end learned policies and modular pipelines that separate perception from planning. Some lines of work pursue fully differentiable frameworks that map pixels to actions without explicit geometric reasoning, while others maintain structured representations or classical search components like Neural A*[8].

Visual Planning[0] sits squarely in the Purely Visual Planning Paradigms cluster, emphasizing direct image-to-plan mappings without intermediate symbolic abstractions. This contrasts with approaches such as ThinkAct[5], which interleaves reasoning steps with visual input, and with methods that rely on language grounding like Visual Language Maps[2]. By forgoing explicit semantic or geometric scaffolding, Visual Planning[0] aligns closely with works that treat the visual stream as the primary substrate for both world modeling and action selection, raising open questions about generalization, sample efficiency, and interpretability in purely pixel-driven planning.

Claimed Contributions

Visual Planning paradigm for reasoning purely through visual representations

The authors introduce Visual Planning, a paradigm where planning is executed via sequences of images that encode step-by-step inference in the visual domain, without language mediation. This approach enables models to reason directly in the visual modality for vision-first tasks involving spatial and geometrical information.

10 retrieved papers
VPRL framework: Visual Planning via Reinforcement Learning

The authors propose VPRL, a novel two-stage reinforcement learning framework empowered by GRPO for post-training large vision models. The framework includes a policy initialization stage followed by RL training with progress rewards to enable visual planning through sequential image generation.
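The two ingredients named here, a GRPO objective and a progress reward over generated state images, can be illustrated with a minimal sketch of the critic-free, group-relative advantage computation that GRPO-style training uses. All function names, signatures, and reward values below are illustrative assumptions for exposition, not details taken from the paper:

```python
# Minimal sketch: GRPO-style group-relative advantages paired with a
# hypothetical distance-based progress reward for visual-plan rollouts.

def progress_reward(dist_before: int, dist_after: int) -> float:
    """Score one generated state image by whether it moves the agent
    closer to the goal (+1), leaves it unchanged (0), or regresses (-1)."""
    if dist_after < dist_before:
        return 1.0   # progress toward the goal
    if dist_after == dist_before:
        return 0.0   # no progress (e.g., an invalid or null move)
    return -1.0      # moved away from the goal

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: normalize each rollout's reward against
    the mean/std of its sampled group, with no learned value critic."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

# Example: a group of 4 sampled rollouts, each scored by the change in
# goal distance induced by its generated next-state image.
rewards = [progress_reward(b, a) for b, a in [(5, 4), (5, 5), (5, 6), (5, 4)]]
advs = grpo_advantages(rewards)
```

Rollouts that make progress receive positive advantages and regressing rollouts receive negative ones, so the policy is pushed toward image generations that advance the plan, relative to its own sampled group rather than an absolute baseline.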

10 retrieved papers
First application of RL to image generation for planning tasks

The authors claim to be the first to apply reinforcement learning techniques to image generation specifically for planning tasks, demonstrating substantial performance improvements and better generalization compared to supervised baselines in visual spatial planning settings.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, though still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Visual Planning paradigm for reasoning purely through visual representations

The authors introduce Visual Planning, a paradigm where planning is executed via sequences of images that encode step-by-step inference in the visual domain, without language mediation. This approach enables models to reason directly in the visual modality for vision-first tasks involving spatial and geometrical information.

Comparison result: of the 10 retrieved candidate papers, none refuted this contribution.

Contribution 2: VPRL framework: Visual Planning via Reinforcement Learning

The authors propose VPRL, a novel two-stage reinforcement learning framework empowered by GRPO for post-training large vision models. The framework comprises a policy initialization stage followed by RL training with progress rewards, enabling visual planning through sequential image generation.

Comparison result: of the 10 retrieved candidate papers, none refuted this contribution.

Contribution 3: First application of RL to image generation for planning tasks

The authors claim to be the first to apply reinforcement learning techniques to image generation specifically for planning tasks, demonstrating substantial performance improvements and better generalization over supervised baselines in visual spatial planning settings.

Comparison result: of the 10 retrieved candidate papers, none refuted this contribution.