From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Spatial VLMs, General Robotic Manipulation, VLM Reasoning, Spatial Chain-of-Thought
Abstract:

Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, although built on top of general Vision-Language Models (VLMs), still fall short of robust zero-shot performance due to the scarcity and heterogeneity of embodied datasets. To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. Our approach combines a hierarchical data construction pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals. Through extensive experiments, we comprehensively validate FSD’s capabilities in both “seeing” and “doing”, achieving outstanding performance across 8 benchmarks for general spatial reasoning and embodied reference abilities, as well as on our proposed, more challenging benchmark VABench. We also verify FSD’s zero-shot capabilities in robot manipulation, demonstrating significant improvements over baseline methods in both SimplerEnv and real-robot settings. Experimental results show that FSD achieves a 40.6% success rate in SimplerEnv and a 72% success rate across 8 real-world tasks, outperforming the strongest baseline by 30%.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes FSD, a vision-language model that generates intermediate spatial representations through chain-of-thought reasoning to guide robotic manipulation. It sits in the 'Embodiment-Agnostic Representations' leaf under 'Cross-Domain and Embodiment Transfer', which currently contains no sibling papers in the taxonomy. This positioning suggests the work targets a relatively sparse research direction focused on transfer across robot platforms through shared intermediate representations, rather than the more crowded areas of VLM-driven planning or 3D spatial encoding where multiple competing approaches exist.

The taxonomy reveals substantial activity in neighboring branches. 'Language-Conditioned Spatial Reasoning' (one paper) and 'Scene Understanding for Manipulation' (one paper) represent closely related directions within the same parent category, while 'VLM-Driven Spatial Reasoning and Planning' (seven papers) and '3D Spatial Encoding and Descriptor Fields' (four papers) show concentrated effort in complementary approaches. The taxonomy's exclude_note clarifies that embodiment-agnostic methods differ from embodiment-specific techniques by relying on intermediate representations such as pointing or affordances. FSD's spatial relationship reasoning appears to bridge VLM-based planning with explicit spatial encoding, occupying a position between these established clusters.

Among the thirty candidates examined, none clearly refutes any of the three core contributions. Ten candidates were examined for the SrCoT framework with zero refutable overlaps; the same held for the ten candidates examined for the hierarchical data construction pipeline and the ten for the self-consistency alignment mechanism. This suggests either genuine novelty in the specific combination of techniques or limitations in the search scope. The statistics indicate a focused rather than exhaustive literature review, meaning substantial prior work in spatial reasoning for manipulation may exist outside the top-thirty semantic matches. The absence of sibling papers in the same taxonomy leaf reinforces that this particular framing—embodiment-agnostic spatial representations via chain-of-thought—appears underexplored in the examined literature.

Based on the limited search scope of thirty candidates, the work appears to occupy a relatively novel position combining VLM-based spatial reasoning with embodiment transfer objectives. However, the analysis cannot rule out relevant prior work in the broader manipulation literature, particularly in areas like spatial affordance learning or cross-embodiment policy transfer that may not have surfaced in semantic search. The taxonomy structure suggests the field has concentrated effort in adjacent but distinct directions, leaving this specific intersection less densely populated.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: zero-shot generalization in robotic manipulation through spatial reasoning.

The field has evolved into a rich landscape organized around several complementary directions. Vision-Language Foundation Model Integration explores how large pretrained models can ground language instructions in visual scenes, while Spatial Representation Learning focuses on encoding geometric and relational structure to support reasoning about object arrangements and affordances. Action Representation and Primitive Learning investigates how to parameterize and compose manipulation skills, and Data-Driven Generalization Strategies examines how to leverage diverse datasets and simulation to improve transfer. Reinforcement Learning with Spatial Reasoning combines trial-and-error learning with structured spatial knowledge, whereas Task-Specific Zero-Shot Applications targets concrete problem domains like grasping or assembly. Cross-Domain and Embodiment Transfer addresses the challenge of deploying policies across different robots and environments, and Supporting Technologies and Benchmarks provides the infrastructure for evaluation and comparison.

Representative works such as CLIPort[6] and ReKep[5] illustrate how spatial representations can be integrated with vision-language models to enable flexible manipulation. A particularly active theme concerns how to build embodiment-agnostic representations that transfer across robot morphologies and sensor configurations. Seeing to Doing[0] sits within the Cross-Domain and Embodiment Transfer branch, specifically targeting embodiment-agnostic representations that enable zero-shot deployment on new platforms. This contrasts with approaches like Maniplvm[3] and SpatialVLA[7], which emphasize tighter integration of vision-language models with spatial reasoning but may require more embodiment-specific tuning.
Meanwhile, works such as Omnimanip[2] and Ten Demonstrations[4] explore data-driven strategies that aggregate cross-embodiment demonstrations to improve generalization. The central tension across these lines involves balancing the expressiveness of spatial reasoning—whether through explicit geometric encodings, learned latent representations, or foundation model priors—against the need for policies that generalize immediately to unseen robots and tasks without additional training or fine-tuning.

Claimed Contributions

FSD framework with Spatial Relationship-Focused Chain-of-Thought (SrCoT)

The authors introduce FSD, a framework that generates visual aids (spatial affordance boxes, points, and visual traces) through structured spatial reasoning. The core SrCoT mechanism treats visual aid generation as a reasoning process, first analyzing spatial relationships between objects before generating comprehensive object-centric visual aids.

10 retrieved papers
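To make the two-stage structure of SrCoT concrete, the sketch below illustrates the idea of reasoning over spatial relations before emitting object-centric visual aids. All class names, fields, and coordinates here are hypothetical illustrations, not the paper's actual interface; in the real model both stages would be produced autoregressively by the VLM.

```python
from dataclasses import dataclass

@dataclass
class SpatialRelation:
    subject: str
    relation: str   # e.g. "left_of", "on_top_of", "behind"
    object: str

@dataclass
class VisualAids:
    affordance_box: tuple    # (x1, y1, x2, y2) in image coordinates
    affordance_point: tuple  # (x, y) contact point
    trace: list              # ordered (x, y) end-effector waypoints

def srcot_generate(scene_relations, instruction):
    """Stage 1: select spatial relations relevant to the instruction.
    Stage 2: derive visual aids conditioned on that reasoning
    (stubbed here with fixed coordinates for illustration)."""
    relevant = [r for r in scene_relations
                if r.subject in instruction or r.object in instruction]
    aids = VisualAids(affordance_box=(40, 60, 120, 140),
                      affordance_point=(80, 100),
                      trace=[(80, 100), (150, 100), (150, 40)])
    return relevant, aids

relations = [SpatialRelation("cup", "left_of", "plate"),
             SpatialRelation("fork", "behind", "plate")]
reasoning, aids = srcot_generate(relations, "move the cup onto the plate")
```

The point of the sketch is the ordering: the relation-filtering step stands in for the explicit spatial reasoning that precedes visual-aid generation.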
Hierarchical data construction pipeline for weak-to-strong capability enhancement

The authors develop a progressive data construction pipeline that builds five hierarchical capability levels (region grounding, spatial relationship understanding, spatial reasoning, spatial affordance generation, and visual trace generation). This pipeline processes 300K demonstrations from large-scale embodied datasets to cultivate spatial reasoning abilities hierarchically.

10 retrieved papers
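The weak-to-strong ordering over the five capability levels can be sketched as a simple curriculum builder. The level names come from the contribution description above; the function, sampling scheme, and per-level quota are assumptions for illustration only.

```python
# The five capability levels, ordered from weaker to stronger as described.
LEVELS = ["region_grounding",
          "spatial_relationship_understanding",
          "spatial_reasoning",
          "spatial_affordance_generation",
          "visual_trace_generation"]

def build_curriculum(demos, samples_per_level):
    """Assign demonstrations to capability levels, filling lower
    (weaker) levels before higher ones."""
    curriculum = {level: [] for level in LEVELS}
    it = iter(demos)
    for level in LEVELS:                 # weak -> strong ordering
        for _ in range(samples_per_level):
            try:
                demo = next(it)
            except StopIteration:
                return curriculum
            curriculum[level].append({"demo": demo, "level": level})
    return curriculum

demos = list(range(10))                  # placeholder for the 300K demonstrations
cur = build_curriculum(demos, samples_per_level=2)
```

In practice each level would also require its own annotation format (boxes, relations, traces); the sketch only shows the hierarchical ordering.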
Self-consistency alignment mechanism for spatial understanding and generation

The authors propose a self-consistency mechanism that frames generation tasks inversely as understanding problems, enabling bidirectional training where the model both generates visual aids from instructions and predicts instructions from visual aids. This approach aligns coordinate space with image-text modalities and enhances spatial reasoning capabilities.

10 retrieved papers
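The bidirectional framing can be sketched as a data-construction step: each forward sample (instruction to visual aid) is mirrored as an inverse sample (visual aid to instruction), so generation is also trained as understanding. The function name, dictionary keys, and coordinate tag format below are hypothetical.

```python
def make_bidirectional_pairs(samples):
    """Mirror each (instruction, visual_aid) sample into a forward
    generation pair and an inverse understanding pair."""
    pairs = []
    for s in samples:
        # Forward: generate coordinates from the instruction.
        pairs.append({"input": s["instruction"],
                      "target": s["visual_aid"],
                      "direction": "generation"})
        # Inverse: recover the instruction from the coordinates.
        pairs.append({"input": s["visual_aid"],
                      "target": s["instruction"],
                      "direction": "understanding"})
    return pairs

samples = [{"instruction": "pick up the red cup",
            "visual_aid": "<box>(40,60,120,140)</box>"}]
pairs = make_bidirectional_pairs(samples)
```

Training on both directions is what, per the description, ties the coordinate space to the image-text modalities.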

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

FSD framework with Spatial Relationship-Focused Chain-of-Thought (SrCoT)

The authors introduce FSD, a framework that generates visual aids (spatial affordance boxes, points, and visual traces) through structured spatial reasoning. The core SrCoT mechanism treats visual aid generation as a reasoning process, first analyzing spatial relationships between objects before generating comprehensive object-centric visual aids.

Contribution

Hierarchical data construction pipeline for weak-to-strong capability enhancement

The authors develop a progressive data construction pipeline that builds five hierarchical capability levels (region grounding, spatial relationship understanding, spatial reasoning, spatial affordance generation, and visual trace generation). This pipeline processes 300K demonstrations from large-scale embodied datasets to cultivate spatial reasoning abilities hierarchically.

Contribution

Self-consistency alignment mechanism for spatial understanding and generation

The authors propose a self-consistency mechanism that frames generation tasks inversely as understanding problems, enabling bidirectional training where the model both generates visual aids from instructions and predicts instructions from visual aids. This approach aligns coordinate space with image-text modalities and enhances spatial reasoning capabilities.