From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Spatial VLMs, General Robotic Manipulation, VLM Reasoning, Spatial Chain-of-Thought
Abstract:

Achieving generalization in robotic manipulation remains a critical challenge, particularly for unseen scenarios and novel tasks. Current Vision-Language-Action (VLA) models, although built on top of general Vision-Language Models (VLMs), still fall short of robust zero-shot performance due to the scarcity and heterogeneity of embodied datasets. To address these limitations, we propose FSD (From Seeing to Doing), a novel vision-language model that generates intermediate representations through spatial relationship reasoning, providing fine-grained guidance for robotic manipulation. Our approach combines a hierarchical data construction pipeline for training with a self-consistency mechanism that aligns spatial coordinates with visual signals. Through extensive experiments, we comprehensively validate FSD’s capabilities in both “seeing” and “doing”, achieving outstanding performance across 8 benchmarks for general spatial reasoning and embodied reference abilities, as well as on our proposed, more challenging benchmark VABench. We also verify FSD’s zero-shot capabilities in robot manipulation, demonstrating significant improvements over baseline methods in both SimplerEnv and real-robot settings. Experimental results show that FSD achieves a 40.6% success rate in SimplerEnv and a 72% success rate across 8 real-world tasks, outperforming the strongest baseline by 30%.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes FSD, a vision-language model that generates intermediate spatial representations through chain-of-thought reasoning to guide robotic manipulation. It sits in the 'Embodiment-Agnostic Representations' leaf under 'Cross-Domain and Embodiment Transfer', which currently contains no sibling papers in the taxonomy. This positioning suggests the work targets a relatively sparse research direction focused on transfer across robot platforms through shared intermediate representations, rather than the more crowded areas of VLM-driven planning or 3D spatial encoding where multiple competing approaches exist.

The taxonomy reveals substantial activity in neighboring branches. 'Language-Conditioned Spatial Reasoning' (one paper) and 'Scene Understanding for Manipulation' (one paper) represent closely related directions within the same parent category, while 'VLM-Driven Spatial Reasoning and Planning' (seven papers) and '3D Spatial Encoding and Descriptor Fields' (four papers) show concentrated effort in complementary approaches. The taxonomy's exclude_note clarifies that embodiment-agnostic methods differ from embodiment-specific techniques by relying on intermediate representations such as pointing or affordances. FSD's spatial relationship reasoning appears to bridge VLM-based planning with explicit spatial encoding, occupying a position between these established clusters.

Among the thirty candidates examined, none clearly refutes any of the three core contributions. Ten candidates were examined for the SrCoT framework with zero refutable overlaps; the same held for the ten candidates examined for the hierarchical data construction pipeline and the ten for the self-consistency alignment mechanism. This suggests either genuine novelty in the specific combination of techniques or limitations in the search scope. The statistics indicate a focused rather than exhaustive literature review, meaning substantial prior work in spatial reasoning for manipulation may exist outside the top-thirty semantic matches. The absence of sibling papers in the same taxonomy leaf reinforces that this particular framing—embodiment-agnostic spatial representations via chain-of-thought—appears underexplored in the examined literature.

Based on the limited search scope of thirty candidates, the work appears to occupy a relatively novel position combining VLM-based spatial reasoning with embodiment transfer objectives. However, the analysis cannot rule out relevant prior work in the broader manipulation literature, particularly in areas like spatial affordance learning or cross-embodiment policy transfer that may not have surfaced in semantic search. The taxonomy structure suggests the field has concentrated effort in adjacent but distinct directions, leaving this specific intersection less densely populated.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: zero-shot generalization in robotic manipulation through spatial reasoning.

The field has evolved into a rich landscape organized around several complementary directions. Vision-Language Foundation Model Integration explores how large pretrained models can ground language instructions in visual scenes, while Spatial Representation Learning focuses on encoding geometric and relational structure to support reasoning about object arrangements and affordances. Action Representation and Primitive Learning investigates how to parameterize and compose manipulation skills, and Data-Driven Generalization Strategies examines how to leverage diverse datasets and simulation to improve transfer. Reinforcement Learning with Spatial Reasoning combines trial-and-error learning with structured spatial knowledge, whereas Task-Specific Zero-Shot Applications targets concrete problem domains like grasping or assembly. Cross-Domain and Embodiment Transfer addresses the challenge of deploying policies across different robots and environments, and Supporting Technologies and Benchmarks provides the infrastructure for evaluation and comparison.

Representative works such as CLIPort[6] and ReKep[5] illustrate how spatial representations can be integrated with vision-language models to enable flexible manipulation. A particularly active theme concerns how to build embodiment-agnostic representations that transfer across robot morphologies and sensor configurations. Seeing to Doing[0] sits within the Cross-Domain and Embodiment Transfer branch, specifically targeting embodiment-agnostic representations that enable zero-shot deployment on new platforms. This contrasts with approaches like Maniplvm[3] and SpatialVLA[7], which emphasize tighter integration of vision-language models with spatial reasoning but may require more embodiment-specific tuning.
Meanwhile, works such as Omnimanip[2] and Ten Demonstrations[4] explore data-driven strategies that aggregate cross-embodiment demonstrations to improve generalization. The central tension across these lines involves balancing the expressiveness of spatial reasoning—whether through explicit geometric encodings, learned latent representations, or foundation model priors—against the need for policies that generalize immediately to unseen robots and tasks without additional training or fine-tuning.

Claimed Contributions

FSD framework with Spatial Relationship-Focused Chain-of-Thought (SrCoT)

The authors introduce FSD, a framework that generates visual aids (spatial affordance boxes, points, and visual traces) through structured spatial reasoning. The core SrCoT mechanism treats visual aid generation as a reasoning process, first analyzing spatial relationships between objects before generating comprehensive object-centric visual aids.

10 retrieved papers
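To make the two-stage structure of SrCoT concrete, the sketch below illustrates the idea of reasoning over spatial relations before emitting object-centric visual aids. All class names, fields, and coordinates here are hypothetical illustrations, not the paper's actual interface; in the real model both stages would be produced autoregressively by the VLM.

```python
from dataclasses import dataclass

@dataclass
class SpatialRelation:
    subject: str
    relation: str   # e.g. "left_of", "on_top_of", "behind"
    object: str

@dataclass
class VisualAids:
    affordance_box: tuple    # (x1, y1, x2, y2) in image coordinates
    affordance_point: tuple  # (x, y) contact point
    trace: list              # ordered (x, y) end-effector waypoints

def srcot_generate(scene_relations, instruction):
    """Stage 1: select spatial relations relevant to the instruction.
    Stage 2: derive visual aids conditioned on that reasoning
    (stubbed here with fixed coordinates for illustration)."""
    relevant = [r for r in scene_relations
                if r.subject in instruction or r.object in instruction]
    aids = VisualAids(affordance_box=(40, 60, 120, 140),
                      affordance_point=(80, 100),
                      trace=[(80, 100), (150, 100), (150, 40)])
    return relevant, aids

relations = [SpatialRelation("cup", "left_of", "plate"),
             SpatialRelation("fork", "behind", "plate")]
reasoning, aids = srcot_generate(relations, "move the cup onto the plate")
```

The point of the sketch is the ordering: the relation-filtering step stands in for the explicit spatial reasoning that precedes visual-aid generation.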
Hierarchical data construction pipeline for weak-to-strong capability enhancement

The authors develop a progressive data construction pipeline that builds five hierarchical capability levels (region grounding, spatial relationship understanding, spatial reasoning, spatial affordance generation, and visual trace generation). This pipeline processes 300K demonstrations from large-scale embodied datasets to cultivate spatial reasoning abilities hierarchically.

10 retrieved papers
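The weak-to-strong ordering over the five capability levels can be sketched as a simple curriculum builder. The level names come from the contribution description above; the function, sampling scheme, and per-level quota are assumptions for illustration only.

```python
# The five capability levels, ordered from weaker to stronger as described.
LEVELS = ["region_grounding",
          "spatial_relationship_understanding",
          "spatial_reasoning",
          "spatial_affordance_generation",
          "visual_trace_generation"]

def build_curriculum(demos, samples_per_level):
    """Assign demonstrations to capability levels, filling lower
    (weaker) levels before higher ones."""
    curriculum = {level: [] for level in LEVELS}
    it = iter(demos)
    for level in LEVELS:                 # weak -> strong ordering
        for _ in range(samples_per_level):
            try:
                demo = next(it)
            except StopIteration:
                return curriculum
            curriculum[level].append({"demo": demo, "level": level})
    return curriculum

demos = list(range(10))                  # placeholder for the 300K demonstrations
cur = build_curriculum(demos, samples_per_level=2)
```

In practice each level would also require its own annotation format (boxes, relations, traces); the sketch only shows the hierarchical ordering.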
Self-consistency alignment mechanism for spatial understanding and generation

The authors propose a self-consistency mechanism that frames generation tasks inversely as understanding problems, enabling bidirectional training where the model both generates visual aids from instructions and predicts instructions from visual aids. This approach aligns coordinate space with image-text modalities and enhances spatial reasoning capabilities.

10 retrieved papers
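The bidirectional framing can be sketched as a data-construction step: each forward sample (instruction to visual aid) is mirrored as an inverse sample (visual aid to instruction), so generation is also trained as understanding. The function name, dictionary keys, and coordinate tag format below are hypothetical.

```python
def make_bidirectional_pairs(samples):
    """Mirror each (instruction, visual_aid) sample into a forward
    generation pair and an inverse understanding pair."""
    pairs = []
    for s in samples:
        # Forward: generate coordinates from the instruction.
        pairs.append({"input": s["instruction"],
                      "target": s["visual_aid"],
                      "direction": "generation"})
        # Inverse: recover the instruction from the coordinates.
        pairs.append({"input": s["visual_aid"],
                      "target": s["instruction"],
                      "direction": "understanding"})
    return pairs

samples = [{"instruction": "pick up the red cup",
            "visual_aid": "<box>(40,60,120,140)</box>"}]
pairs = make_bidirectional_pairs(samples)
```

Training on both directions is what, per the description, ties the coordinate space to the image-text modalities.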

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

FSD framework with Spatial Relationship-Focused Chain-of-Thought (SrCoT)

The authors introduce FSD, a framework that generates visual aids (spatial affordance boxes, points, and visual traces) through structured spatial reasoning. The core SrCoT mechanism treats visual aid generation as a reasoning process, first analyzing spatial relationships between objects before generating comprehensive object-centric visual aids.

Contribution

Hierarchical data construction pipeline for weak-to-strong capability enhancement

The authors develop a progressive data construction pipeline that builds five hierarchical capability levels (region grounding, spatial relationship understanding, spatial reasoning, spatial affordance generation, and visual trace generation). This pipeline processes 300K demonstrations from large-scale embodied datasets to cultivate spatial reasoning abilities hierarchically.

Contribution

Self-consistency alignment mechanism for spatial understanding and generation

The authors propose a self-consistency mechanism that frames generation tasks inversely as understanding problems, enabling bidirectional training where the model both generates visual aids from instructions and predicts instructions from visual aids. This approach aligns coordinate space with image-text modalities and enhances spatial reasoning capabilities.