Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Vision Language Model, Spatial Reasoning, Multi-View Images
Abstract:

Understanding 3D spatial relationships remains a major limitation of current Vision-Language Models (VLMs). Prior work has addressed this issue by creating spatial question-answering (QA) datasets based on single images or indoor videos. However, real-world embodied AI agents, such as robots and self-driving cars, typically rely on ego-centric, multi-view observations. To this end, we introduce Ego3D-Bench, a new benchmark designed to evaluate the spatial reasoning abilities of VLMs using ego-centric, multi-view outdoor data. Ego3D-Bench comprises over 8,600 QA pairs, created with significant involvement from human annotators to ensure quality and diversity. We benchmark 16 SOTA VLMs, including GPT-4o, Gemini 1.5 Pro, InternVL3, and Qwen2.5-VL. Our results reveal a notable gap between human scores and VLM performance, highlighting that current VLMs still fall short of human-level spatial understanding (SU). To bridge this gap, we propose Ego3D-VLM, a post-training framework that enhances the 3D spatial reasoning of VLMs. Ego3D-VLM generates a cognitive map based on estimated global 3D coordinates, yielding average improvements of 12% on multi-choice QA and 56% on absolute distance estimation. Ego3D-VLM can be integrated with any existing VLM. Together, Ego3D-Bench and Ego3D-VLM offer valuable tools for advancing toward human-level SU in real-world, multi-view environments. Code is available in the supplementary materials.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Ego3D-Bench, a benchmark for evaluating 3D spatial reasoning in vision-language models using ego-centric multi-view outdoor data, and proposes Ego3D-VLM, a post-training framework that generates cognitive maps from estimated 3D coordinates. Within the taxonomy, this work resides in the 'Ego-Centric Spatial Reasoning and QA Benchmarks' leaf, which contains four papers total. This leaf sits under the broader 'Ego-Centric Multi-View Datasets and Benchmarks' branch, indicating a moderately populated research direction focused on evaluation resources rather than algorithmic innovation alone.

The taxonomy reveals neighboring leaves addressing related but distinct challenges. 'Ego-Exo Multi-View Activity and Interaction Datasets' captures simultaneous first-person and third-person views with activity annotations, while 'Multi-View 3D Scene Understanding Benchmarks' emphasizes cross-viewpoint integration without the ego-centric constraint. The 'Vision-Language Models for Spatial Reasoning' branch, particularly 'Multi-Perspective Spatial Reasoning in VLMs,' explores perspective-taking mechanisms that complement this work's focus on outdoor ego-centric scenarios. The taxonomy's scope notes clarify that purely exocentric or single-view benchmarks fall outside this leaf, positioning Ego3D-Bench as addressing a specific gap in outdoor, multi-view ego-centric evaluation.

Among the thirty candidate papers examined (ten per contribution), the benchmark contribution (Contribution A) showed no clear refutation, suggesting limited prior work on outdoor ego-centric multi-view spatial QA at this scale. The training-free framework (Contribution B) likewise encountered no refutable candidates, indicating novelty in the cognitive map generation approach for VLM enhancement. However, one refutable candidate was identified for the textual cognitive map generation (Contribution C), pointing to some overlap with existing spatial representation methods. Because the search scope is limited, these findings reflect top-ranked semantic matches rather than exhaustive coverage of the field.

Given the analysis of thirty candidates and the taxonomy structure, the work appears to occupy a relatively sparse niche within ego-centric spatial reasoning benchmarks, particularly for outdoor multi-view settings. The cognitive map generation component shows partial overlap with prior spatial representation work, while the benchmark and post-training framework contributions demonstrate clearer differentiation. The assessment is constrained by the top-K semantic search methodology and does not capture all potentially relevant work in adjacent domains such as embodied AI navigation or indoor scene understanding.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: 3D spatial reasoning in ego-centric multi-view scenes.

The field organizes around several major branches that reflect distinct methodological and application-driven priorities. Ego-Centric Multi-View Datasets and Benchmarks provide foundational resources for training and evaluation, including large-scale collections like Ego-Exo4D[1] and specialized question-answering benchmarks such as Robospatial[3] and ViewSpatial-Bench[13]. Vision-Language Models for Spatial Reasoning and 3D Visual Grounding and Localization focus on integrating linguistic and visual cues to understand spatial relationships, while Multi-View 3D Object Detection and Reconstruction addresses geometric inference from multiple viewpoints, exemplified by methods like BEVDepth[4] and Sparse4D[26]. Parallel branches explore Multi-View Consistent Generation and Rendering, Ego-Centric 3D Pose Estimation for human motion capture, and Cognitive and Neuroscience Perspectives that draw on mental imagery and first-person consciousness research. Finally, Immersive and Embodied Interaction Systems and Specialized Applications tackle domain-specific challenges in virtual reality, robotics, and industrial navigation.

A particularly active line of work centers on ego-centric spatial reasoning and question-answering benchmarks, where researchers probe how models interpret viewpoint-dependent spatial queries. Spatial Reasoning Egocentric[0] fits naturally within this cluster, emphasizing the challenges of reasoning about 3D scenes from a first-person perspective. It shares thematic ground with Viewsrd[5] and Dynamic Egocentric Scenes[9], which similarly address viewpoint variability and temporal dynamics in ego-centric settings. In contrast, works like Robospatial[3] and MV-ScanQA[15] extend spatial reasoning to multi-modal or multi-view question answering, highlighting trade-offs between single-viewpoint depth and cross-view consistency. Open questions persist around how best to leverage ego-exo correspondences, integrate cognitive priors from neuroscience studies, and scale these methods to real-world embodied agents navigating complex environments.

Claimed Contributions

Contribution A. Ego3D-Bench: Ego-centric multi-view 3D spatial reasoning benchmark

The authors introduce Ego3D-Bench, a benchmark comprising over 8,600 QA pairs across five categories (absolute distance, relative distance, localization, motion reasoning, and travel time), designed to evaluate VLMs' 3D spatial understanding in ego-centric multi-view outdoor scenarios. The benchmark is constructed from the validation sets of three public datasets, with significant human annotator involvement to ensure quality and diversity.
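
For concreteness, here is a minimal sketch of what a single benchmark item could look like. The field names, category labels, and file names below are assumptions inferred from the five categories listed above, not the authors' released schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema for one Ego3D-Bench QA item; every field name here is
# illustrative, inferred only from the five question categories named above.
@dataclass
class Ego3DBenchItem:
    question_id: str
    category: str            # "absolute_distance" | "relative_distance" |
                             # "localization" | "motion_reasoning" | "travel_time"
    image_paths: List[str]   # ego-centric multi-view images of the same scene
    question: str
    choices: List[str] = field(default_factory=list)  # empty for free-form answers
    answer: str = ""

# Example multi-choice item (contents invented for illustration).
item = Ego3DBenchItem(
    question_id="sample-0001",
    category="relative_distance",
    image_paths=["cam_front.jpg", "cam_front_left.jpg", "cam_back.jpg"],
    question="Which object is closer to the ego vehicle: the red truck or the pedestrian?",
    choices=["the red truck", "the pedestrian"],
    answer="the pedestrian",
)
```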

Retrieved papers compared: 10

Contribution B. Ego3D-VLM: Training-free framework for enhancing 3D spatial reasoning

The authors propose Ego3D-VLM, a training-free method that improves VLMs' 3D spatial understanding by generating a textual cognitive map based on estimated global 3D coordinates. This framework can be integrated with any existing VLM and achieves 12% and 56% average improvements on multi-choice QA and absolute distance estimation, respectively.
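
A minimal sketch of how a training-free, model-agnostic integration of this kind could work, assuming the cognitive map has already been rendered to text (see the generator sketch under Contribution C below): the map is simply prepended to the question as plain prompt text, so no model weights are touched. The wrapper signature is an assumption, not the authors' implementation.

```python
from typing import Callable, List

def answer_with_cognitive_map(
    vlm_generate: Callable[[List[str], str], str],  # any VLM wrapped as (images, prompt) -> text
    image_paths: List[str],
    question: str,
    cognitive_map: str,  # textual map produced upstream
) -> str:
    """Training-free integration: inject the cognitive map as prompt text.

    Because the map is ordinary text and no weights are updated, the same
    wrapper can sit in front of any existing VLM.
    """
    prompt = (
        "Cognitive map of the scene (ego-centered 3D coordinates, in meters):\n"
        f"{cognitive_map}\n\n"
        f"Using the map and the multi-view images, answer: {question}"
    )
    return vlm_generate(image_paths, prompt)
```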

Retrieved papers compared: 10

Contribution C. Textual cognitive map generation for multi-view spatial reasoning

The authors develop a cognitive map generator function that creates a textual representation of the 3D scene: it defines a coordinate system centered on the ego and locates the important objects in that 3D coordinate space. Because the approach serializes only the objects referred to in the question, it is more compact than point-cloud or BEV-image representations while still enabling grounded spatial reasoning.
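
A minimal sketch of such a generator under stated assumptions: object detections with estimated global 3D coordinates are already available (e.g., from an off-the-shelf detector plus metric depth estimation), and objects are matched to the question by simple name lookup. The coordinate convention and output format are illustrative, not the paper's exact design.

```python
from typing import Dict, Tuple

def generate_cognitive_map(
    detections: Dict[str, Tuple[float, float, float]],  # object name -> estimated (x, y, z), meters
    question: str,
) -> str:
    """Serialize an ego-centered textual cognitive map.

    Assumes the ego sits at the origin of the frame (an illustrative
    convention, not necessarily the paper's). Only objects mentioned in
    the question are emitted, which keeps the representation far smaller
    than a point cloud or a BEV image.
    """
    referred = {n: p for n, p in detections.items() if n.lower() in question.lower()}
    lines = ["ego at (0.0, 0.0, 0.0)"]
    for name, (x, y, z) in sorted(referred.items()):
        dist = (x**2 + y**2 + z**2) ** 0.5
        lines.append(f"{name} at ({x:.1f}, {y:.1f}, {z:.1f}), {dist:.1f} m from ego")
    return "\n".join(lines)

# Example: only the object named in the question survives the filter.
detections = {"red truck": (12.0, 30.5, 0.0), "pedestrian": (-2.0, 8.0, 0.0)}
print(generate_cognitive_map(detections, "How far is the pedestrian from the ego?"))
# -> ego at (0.0, 0.0, 0.0)
#    pedestrian at (-2.0, 8.0, 0.0), 8.2 m from ego
```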

Retrieved papers compared: 10
Refutable candidates found: 1 of 10

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A. Ego3D-Bench: Ego-centric multi-view 3D spatial reasoning benchmark (summarized above)

Contribution B. Ego3D-VLM: Training-free framework for enhancing 3D spatial reasoning (summarized above)

Contribution C. Textual cognitive map generation for multi-view spatial reasoning (summarized above)