Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: spatial understanding, benchmark, multi-view, VLM, robotics
Abstract:

Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for recent Vision-Language-Action (VLA) models. Yet most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information largely underexplored. At the same time, multi-camera setups are increasingly standard on robotic platforms, as they provide complementary perspectives that mitigate occlusion and depth ambiguity. Whether VLMs can effectively leverage such multi-view inputs for robotic reasoning therefore remains an open question. To bridge this gap, we introduce MV-RoboBench, a benchmark specifically designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic manipulation. MV-RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided into two primary categories: spatial understanding and robotic execution. We evaluate a diverse set of existing VLMs, both open-source and closed-source, along with variants augmented by Chain-of-Thought (CoT)-inspired prompting. The results show that state-of-the-art models remain far below human performance, underscoring the substantial challenges VLMs face in multi-view robotic perception. Additionally, our analysis uncovers two key findings: (i) spatial intelligence and robotic task reasoning are correlated in multi-view robotic scenarios; and (ii) strong performance on existing general-purpose single-view spatial understanding benchmarks does not reliably translate to success in the robotic spatial tasks assessed by our benchmark. We release MV-RoboBench as an open resource to foster progress in spatially grounded VLMs and VLAs, providing a foundation for advancing embodied multi-view intelligence in robotics.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MV-RoboBench, a benchmark comprising 1.7k manually curated question-answer items designed to evaluate multi-view spatial reasoning capabilities of vision-language models in robotic manipulation contexts. Within the taxonomy, it resides in the 'Benchmarking and Evaluation' leaf, which contains only two papers total. This makes it a relatively sparse research direction compared to more crowded areas like 'Transformer-Based Multi-View Encoding' (four papers) or 'VLM-Based Spatial Understanding and Grounding' (three papers). The benchmark focuses on assessing existing models rather than proposing new architectures or learning methods.

The taxonomy reveals that most neighboring work concentrates on algorithmic contributions: multi-view representation learning, vision-language integration, and end-to-end policy learning. The 'Benchmarking and Evaluation' leaf sits apart from these methodological branches, serving a complementary role by providing systematic assessment tools. Its sibling paper in the same leaf addresses evaluation in a different context (construction-focused interactive environments), while nearby branches like 'Vision-Language Integration for Spatial Reasoning' and 'Multi-View 3D Representation Learning' develop the techniques that benchmarks like MV-RoboBench aim to measure. This positioning suggests the paper addresses a gap in systematic evaluation infrastructure rather than competing directly with method-oriented work.

Among 30 candidates examined through semantic search, none were found to clearly refute any of the three identified contributions: the MV-RoboBench benchmark itself (10 candidates examined, 0 refutable), the comprehensive VLM evaluation with CoT enhancements (10 candidates, 0 refutable), and the correlation analysis between spatial and robotic reasoning (10 candidates, 0 refutable). The absence of refutable prior work across all contributions suggests that within this limited search scope, the specific combination of multi-view focus, robotic manipulation context, and systematic VLM evaluation appears relatively unexplored. However, this reflects the scale of the search rather than an exhaustive literature review.

Based on the limited search of 30 semantically similar papers, the work appears to occupy a distinct position by providing evaluation infrastructure for multi-view spatial reasoning in robotics. The sparse 'Benchmarking and Evaluation' category and lack of overlapping prior work within the examined candidates suggest novelty in this specific assessment focus. However, the analysis does not cover the full breadth of vision-language or robotic benchmarking literature, and a more comprehensive search might reveal additional related evaluation efforts or datasets addressing similar capabilities.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 0

Research Landscape Overview

Core task: multi-view spatial reasoning in robotic manipulation scenarios. The field organizes itself around several complementary branches that address different facets of enabling robots to understand and act upon spatial information from multiple viewpoints. Multi-View 3D Representation Learning focuses on how to encode geometric structure from camera arrays, with works like Robotic View Transformer[3] and 3D Multiview Pretraining[2] building rich volumetric or feature-based representations. Vision-Language Integration for Spatial Reasoning explores how natural language instructions can guide spatial understanding, exemplified by approaches such as Robo2vlm[1] and RoboRefer[8]. Spatial Reasoning Mechanisms and Representations delve into explicit reasoning modules, ranging from coordinate transforms to attention over spatial features, while Action Representation and Execution and End-to-End Policy Learning address how spatial understanding translates into executable robot actions. Relational and Graph-Based Reasoning captures methods that model object interactions and scene structure explicitly, and Data Generation and Augmentation tackles the challenge of obtaining diverse training scenarios. Interactive and Closed-Loop Reasoning, Spatial Reasoning for Human-Robot Collaboration, and Benchmarking and Evaluation round out the taxonomy by addressing online adaptation, collaborative settings, and systematic assessment of spatial reasoning capabilities.

Recent activity highlights a tension between end-to-end learned policies and modular pipelines that separate perception, reasoning, and control. Many studies pursue tighter integration of vision and language to handle complex instructions, as seen in Embodied-r1[5] and Incentivizing Multimodal Reasoning[6], while others emphasize robust 3D scene understanding through multi-view fusion or explicit spatial representations like SpatialCoT[13] and RoboSpatial[23].
Seeing Across Views[0] sits squarely within the Benchmarking and Evaluation branch, providing systematic assessment tools for multi-view spatial reasoning rather than proposing a new architecture. Its emphasis contrasts with neighboring work like MineAnyBuild[37], which also evaluates spatial capabilities but does so in a construction-focused interactive environment. By offering structured benchmarks, Seeing Across Views[0] complements the broader ecosystem: it helps quantify progress across the diverse methodological branches and identifies which spatial reasoning challenges remain open, thereby guiding future work in representation learning, policy design, and human-robot collaboration.

Claimed Contributions

MV-RoboBench benchmark for multi-view spatial reasoning in robotic manipulation

The authors present MV-RoboBench, the first benchmark that integrates spatial understanding and robotic execution tasks using synchronized multi-view camera inputs from real robotic demonstrations. It contains 1.7k manually curated QA items across eight subtasks to systematically evaluate vision-language models in multi-view robotic scenarios.

10 retrieved papers
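The report does not reproduce the benchmark's data schema. As an illustration only, a multiple-choice QA item of the kind described above, together with per-subtask accuracy scoring, might be represented as follows; every field and subtask name here is hypothetical, not taken from the paper.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class QAItem:
    """One multiple-choice QA item (all field names are hypothetical)."""
    item_id: str
    category: str    # e.g. "spatial_understanding" or "robotic_execution"
    subtask: str     # one of the eight subtasks
    view_paths: list # synchronized frames from the multi-view camera setup
    question: str
    choices: list
    answer: str      # gold choice label, e.g. "A"

def per_subtask_accuracy(items, predictions):
    """Accuracy per subtask, given {item_id: predicted label}."""
    correct, total = defaultdict(int), defaultdict(int)
    for it in items:
        total[it.subtask] += 1
        if predictions.get(it.item_id) == it.answer:
            correct[it.subtask] += 1
    return {s: correct[s] / total[s] for s in total}
```

Keeping the category and subtask on each item makes it straightforward to aggregate scores separately for spatial understanding and robotic execution, as the benchmark's two primary categories require.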
Comprehensive evaluation of VLMs with CoT-inspired enhancements

The authors conduct extensive experiments evaluating various VLMs on multi-view robotic reasoning and explore three CoT-inspired enhancement directions: textual scene descriptions, view synthesis, and depth priors. Their results show state-of-the-art models remain far below human performance.

10 retrieved papers
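The three enhancement directions are not specified in detail in this report. To make the first one concrete: a textual-scene-description enhancement amounts to prepending a scene summary to the question prompt before it is sent to the VLM alongside the multi-view images. A minimal sketch, with hypothetical function names and wording that are not the authors' implementation:

```python
def build_prompt(question, choices, scene_description=None):
    """Compose a multiple-choice prompt for a VLM; optionally prepend a
    textual scene description as a CoT-style enhancement (illustrative)."""
    parts = []
    if scene_description:
        parts.append("Scene description: " + scene_description)
    parts.append("Question: " + question)
    parts.append("Choices: " + "; ".join(
        f"({chr(65 + i)}) {c}" for i, c in enumerate(choices)))
    parts.append("Think step by step, then answer with a single letter.")
    return "\n".join(parts)
```

The other two directions (view synthesis and depth priors) would instead modify the visual inputs, e.g. by adding rendered novel views or depth maps to the image list, rather than the text prompt.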
Correlation analysis revealing relationships between spatial and robotic reasoning

The authors provide systematic correlation analysis demonstrating that spatial and robotic reasoning are related in multi-view manipulation settings, while also showing that single-view spatial benchmark performance does not transfer reliably to multi-view robotic tasks, highlighting unique challenges of embodied multi-view intelligence.

10 retrieved papers
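The report does not state which statistic underlies the correlation finding. One common choice for relating paired per-model scores (spatial understanding vs. robotic execution) is the Pearson coefficient, which can be computed without external libraries; the pairing of scores below is purely hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two paired lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value near +1 across models would support finding (i), while a weak correlation between single-view benchmark scores and multi-view robotic scores would illustrate finding (ii).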

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
