Seeing Across Views: Benchmarking Spatial Reasoning of Vision-Language Models in Robotic Scenes

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: spatial understanding, benchmark, multi-view, VLM, robotics
Abstract:

Vision-language models (VLMs) are essential to Embodied AI, enabling robots to perceive, reason, and act in complex environments. They also serve as the foundation for recent Vision-Language-Action (VLA) models. Yet most evaluations of VLMs focus on single-view settings, leaving their ability to integrate multi-view information largely underexplored. At the same time, multi-camera setups are increasingly standard on robotic platforms, as they provide complementary perspectives that mitigate occlusion and depth ambiguity. Whether VLMs can effectively leverage such multi-view inputs for robotic reasoning therefore remains an open question. To bridge this gap, we introduce MV-RoboBench, a benchmark specifically designed to evaluate the multi-view spatial reasoning capabilities of VLMs in robotic manipulation. MV-RoboBench consists of 1.7k manually curated QA items across eight subtasks, divided into two primary categories: spatial understanding and robotic execution. We evaluate a diverse set of existing VLMs, both open-source and closed-source, along with variants augmented by Chain-of-Thought (CoT)-inspired prompting. The results show that state-of-the-art models remain far below human performance, underscoring the substantial challenges VLMs face in multi-view robotic perception. Additionally, our analysis uncovers two key findings: (i) spatial intelligence and robotic task reasoning are correlated in multi-view robotic scenarios; and (ii) strong performance on existing general-purpose single-view spatial understanding benchmarks does not reliably translate to success in the robotic spatial tasks assessed by our benchmark. We release MV-RoboBench as an open resource to foster progress in spatially grounded VLMs and VLAs, providing a foundation for advancing embodied multi-view intelligence in robotics.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MV-RoboBench, a benchmark comprising 1.7k manually curated question-answer items designed to evaluate multi-view spatial reasoning capabilities of vision-language models in robotic manipulation contexts. Within the taxonomy, it resides in the 'Benchmarking and Evaluation' leaf, which contains only two papers total. This makes it a relatively sparse research direction compared to more crowded areas like 'Transformer-Based Multi-View Encoding' (four papers) or 'VLM-Based Spatial Understanding and Grounding' (three papers). The benchmark focuses on assessing existing models rather than proposing new architectures or learning methods.

The taxonomy reveals that most neighboring work concentrates on algorithmic contributions: multi-view representation learning, vision-language integration, and end-to-end policy learning. The 'Benchmarking and Evaluation' leaf sits apart from these methodological branches, serving a complementary role by providing systematic assessment tools. Its sibling paper in the same leaf addresses evaluation in a different context (construction-focused interactive environments), while nearby branches like 'Vision-Language Integration for Spatial Reasoning' and 'Multi-View 3D Representation Learning' develop the techniques that benchmarks like MV-RoboBench aim to measure. This positioning suggests the paper addresses a gap in systematic evaluation infrastructure rather than competing directly with method-oriented work.

Among 30 candidates examined through semantic search, none were found to clearly refute any of the three identified contributions: the MV-RoboBench benchmark itself (10 candidates examined, 0 refutable), the comprehensive VLM evaluation with CoT enhancements (10 candidates, 0 refutable), and the correlation analysis between spatial and robotic reasoning (10 candidates, 0 refutable). The absence of refutable prior work across all contributions suggests that within this limited search scope, the specific combination of multi-view focus, robotic manipulation context, and systematic VLM evaluation appears relatively unexplored. However, this reflects the scale of the search rather than an exhaustive literature review.

Based on the limited search of 30 semantically similar papers, the work appears to occupy a distinct position by providing evaluation infrastructure for multi-view spatial reasoning in robotics. The sparse 'Benchmarking and Evaluation' category and lack of overlapping prior work within the examined candidates suggest novelty in this specific assessment focus. However, the analysis does not cover the full breadth of vision-language or robotic benchmarking literature, and a more comprehensive search might reveal additional related evaluation efforts or datasets addressing similar capabilities.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 0

Research Landscape Overview

Core task: multi-view spatial reasoning in robotic manipulation scenarios. The field organizes itself around several complementary branches that address different facets of enabling robots to understand and act upon spatial information from multiple viewpoints. Multi-View 3D Representation Learning focuses on how to encode geometric structure from camera arrays, with works like Robotic View Transformer[3] and 3D Multiview Pretraining[2] building rich volumetric or feature-based representations. Vision-Language Integration for Spatial Reasoning explores how natural language instructions can guide spatial understanding, exemplified by approaches such as Robo2vlm[1] and RoboRefer[8]. Spatial Reasoning Mechanisms and Representations delve into explicit reasoning modules, ranging from coordinate transforms to attention over spatial features, while Action Representation and Execution and End-to-End Policy Learning address how spatial understanding translates into executable robot actions. Relational and Graph-Based Reasoning captures methods that model object interactions and scene structure explicitly, and Data Generation and Augmentation tackles the challenge of obtaining diverse training scenarios. Interactive and Closed-Loop Reasoning, Spatial Reasoning for Human-Robot Collaboration, and Benchmarking and Evaluation round out the taxonomy by addressing online adaptation, collaborative settings, and systematic assessment of spatial reasoning capabilities.

Recent activity highlights a tension between end-to-end learned policies and modular pipelines that separate perception, reasoning, and control. Many studies pursue tighter integration of vision and language to handle complex instructions, as seen in Embodied-r1[5] and Incentivizing Multimodal Reasoning[6], while others emphasize robust 3D scene understanding through multi-view fusion or explicit spatial representations like SpatialCoT[13] and RoboSpatial[23].
Seeing Across Views[0] sits squarely within the Benchmarking and Evaluation branch, providing systematic assessment tools for multi-view spatial reasoning rather than proposing a new architecture. Its emphasis contrasts with neighboring work like MineAnyBuild[37], which also evaluates spatial capabilities but does so in a construction-focused interactive environment. By offering structured benchmarks, Seeing Across Views[0] complements the broader ecosystem: it helps quantify progress across the diverse methodological branches and identifies which spatial reasoning challenges remain open, thereby guiding future work in representation learning, policy design, and human-robot collaboration.

Claimed Contributions

MV-RoboBench benchmark for multi-view spatial reasoning in robotic manipulation

The authors present MV-RoboBench, the first benchmark that integrates spatial understanding and robotic execution tasks using synchronized multi-view camera inputs from real robotic demonstrations. It contains 1.7k manually curated QA items across eight subtasks to systematically evaluate vision-language models in multi-view robotic scenarios.

10 retrieved papers
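The report does not reproduce the benchmark's data schema. As an illustration only, a multiple-choice QA item of the kind described above, together with per-subtask accuracy scoring, might be represented as follows; every field and subtask name here is hypothetical, not taken from the paper.

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class QAItem:
    """One multiple-choice QA item (all field names are hypothetical)."""
    item_id: str
    category: str    # e.g. "spatial_understanding" or "robotic_execution"
    subtask: str     # one of the eight subtasks
    view_paths: list # synchronized frames from the multi-view camera setup
    question: str
    choices: list
    answer: str      # gold choice label, e.g. "A"

def per_subtask_accuracy(items, predictions):
    """Accuracy per subtask, given {item_id: predicted label}."""
    correct, total = defaultdict(int), defaultdict(int)
    for it in items:
        total[it.subtask] += 1
        if predictions.get(it.item_id) == it.answer:
            correct[it.subtask] += 1
    return {s: correct[s] / total[s] for s in total}
```

Keeping the category and subtask on each item makes it straightforward to aggregate scores separately for spatial understanding and robotic execution, as the benchmark's two primary categories require.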
Comprehensive evaluation of VLMs with CoT-inspired enhancements

The authors conduct extensive experiments evaluating various VLMs on multi-view robotic reasoning and explore three CoT-inspired enhancement directions: textual scene descriptions, view synthesis, and depth priors. Their results show state-of-the-art models remain far below human performance.

10 retrieved papers
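The three enhancement directions are not specified in detail in this report. To make the first one concrete: a textual-scene-description enhancement amounts to prepending a scene summary to the question prompt before it is sent to the VLM alongside the multi-view images. A minimal sketch, with hypothetical function names and wording that are not the authors' implementation:

```python
def build_prompt(question, choices, scene_description=None):
    """Compose a multiple-choice prompt for a VLM; optionally prepend a
    textual scene description as a CoT-style enhancement (illustrative)."""
    parts = []
    if scene_description:
        parts.append("Scene description: " + scene_description)
    parts.append("Question: " + question)
    parts.append("Choices: " + "; ".join(
        f"({chr(65 + i)}) {c}" for i, c in enumerate(choices)))
    parts.append("Think step by step, then answer with a single letter.")
    return "\n".join(parts)
```

The other two directions (view synthesis and depth priors) would instead modify the visual inputs, e.g. by adding rendered novel views or depth maps to the image list, rather than the text prompt.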
Correlation analysis revealing relationships between spatial and robotic reasoning

The authors provide systematic correlation analysis demonstrating that spatial and robotic reasoning are related in multi-view manipulation settings, while also showing that single-view spatial benchmark performance does not transfer reliably to multi-view robotic tasks, highlighting unique challenges of embodied multi-view intelligence.

10 retrieved papers
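The report does not state which statistic underlies the correlation finding. One common choice for relating paired per-model scores (spatial understanding vs. robotic execution) is the Pearson coefficient, which can be computed without external libraries; the pairing of scores below is purely hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two paired lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value near +1 across models would support finding (i), while a weak correlation between single-view benchmark scores and multi-view robotic scores would illustrate finding (ii).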

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
