Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Vision Language Model, Spatial Reasoning, Multi-View Images
Abstract:

Understanding 3D spatial relationships remains a major limitation of current Vision-Language Models (VLMs). Prior work has addressed this issue by creating spatial question-answering (QA) datasets based on single images or indoor videos. However, real-world embodied AI agents, such as robots and self-driving cars, typically rely on ego-centric, multi-view observations. To this end, we introduce Ego3D-Bench, a new benchmark designed to evaluate the spatial reasoning abilities of VLMs using ego-centric, multi-view outdoor data. Ego3D-Bench comprises over 8,600 QA pairs, created with significant involvement from human annotators to ensure quality and diversity. We benchmark 16 SOTA VLMs, including GPT-4o, Gemini 1.5 Pro, InternVL3, and Qwen2.5-VL. Our results reveal a notable gap between human scores and VLM performance, highlighting that current VLMs still fall short of human-level spatial understanding (SU). To bridge this gap, we propose Ego3D-VLM, a post-training framework that enhances the 3D spatial reasoning of VLMs. Ego3D-VLM generates a cognitive map based on estimated global 3D coordinates, yielding average improvements of 12% on multi-choice QA and 56% on absolute distance estimation. Ego3D-VLM can be integrated with any existing VLM. Together, Ego3D-Bench and Ego3D-VLM offer valuable tools for advancing toward human-level SU in real-world, multi-view environments. Code is available in the supplementary materials.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Ego3D-Bench, a benchmark for evaluating 3D spatial reasoning in vision-language models using ego-centric multi-view outdoor data, and proposes Ego3D-VLM, a post-training framework that generates cognitive maps from estimated 3D coordinates. Within the taxonomy, this work resides in the 'Ego-Centric Spatial Reasoning and QA Benchmarks' leaf, which contains four papers total. This leaf sits under the broader 'Ego-Centric Multi-View Datasets and Benchmarks' branch, indicating a moderately populated research direction focused on evaluation resources rather than algorithmic innovation alone.

The taxonomy reveals neighboring leaves addressing related but distinct challenges. 'Ego-Exo Multi-View Activity and Interaction Datasets' captures simultaneous first-person and third-person views with activity annotations, while 'Multi-View 3D Scene Understanding Benchmarks' emphasizes cross-viewpoint integration without the ego-centric constraint. The 'Vision-Language Models for Spatial Reasoning' branch, particularly 'Multi-Perspective Spatial Reasoning in VLMs,' explores perspective-taking mechanisms that complement this work's focus on outdoor ego-centric scenarios. The taxonomy's scope notes clarify that purely exocentric or single-view benchmarks fall outside this leaf, positioning Ego3D-Bench as addressing a specific gap in outdoor, multi-view ego-centric evaluation.

Among the thirty candidate papers examined (ten per contribution), the benchmark contribution (Contribution A) showed no clear refutation, suggesting limited prior work on outdoor ego-centric multi-view spatial QA at this scale. The training-free framework (Contribution B) likewise encountered no refutable candidates, indicating novelty in the cognitive map generation approach for VLM enhancement. However, one refutable candidate was identified for the textual cognitive map generation (Contribution C), pointing to some overlap with existing spatial representation methods. Because the search scope is limited, these findings reflect top-ranked semantic matches rather than exhaustive coverage of the field.

Given the analysis of thirty candidates and the taxonomy structure, the work appears to occupy a relatively sparse niche within ego-centric spatial reasoning benchmarks, particularly for outdoor multi-view settings. The cognitive map generation component shows partial overlap with prior spatial representation work, while the benchmark and post-training framework contributions demonstrate clearer differentiation. The assessment is constrained by the top-K semantic search methodology and does not capture all potentially relevant work in adjacent domains such as embodied AI navigation or indoor scene understanding.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: 3D spatial reasoning in ego-centric multi-view scenes.

The field organizes around several major branches that reflect distinct methodological and application-driven priorities. Ego-Centric Multi-View Datasets and Benchmarks provide foundational resources for training and evaluation, including large-scale collections like Ego-Exo4D[1] and specialized question-answering benchmarks such as Robospatial[3] and ViewSpatial-Bench[13]. Vision-Language Models for Spatial Reasoning and 3D Visual Grounding and Localization focus on integrating linguistic and visual cues to understand spatial relationships, while Multi-View 3D Object Detection and Reconstruction addresses geometric inference from multiple viewpoints, exemplified by methods like BEVDepth[4] and Sparse4D[26]. Parallel branches explore Multi-View Consistent Generation and Rendering, Ego-Centric 3D Pose Estimation for human motion capture, and Cognitive and Neuroscience Perspectives that draw on mental imagery and first-person consciousness research. Finally, Immersive and Embodied Interaction Systems and Specialized Applications tackle domain-specific challenges in virtual reality, robotics, and industrial navigation.

A particularly active line of work centers on ego-centric spatial reasoning and question-answering benchmarks, where researchers probe how models interpret viewpoint-dependent spatial queries. Spatial Reasoning Egocentric[0] fits naturally within this cluster, emphasizing the challenges of reasoning about 3D scenes from a first-person perspective. It shares thematic ground with Viewsrd[5] and Dynamic Egocentric Scenes[9], which similarly address viewpoint variability and temporal dynamics in ego-centric settings. In contrast, works like Robospatial[3] and MV-ScanQA[15] extend spatial reasoning to multi-modal or multi-view question answering, highlighting trade-offs between single-viewpoint depth and cross-view consistency. Open questions persist around how best to leverage ego-exo correspondences, integrate cognitive priors from neuroscience studies, and scale these methods to real-world embodied agents navigating complex environments.

Claimed Contributions

Contribution A. Ego3D-Bench: Ego-centric multi-view 3D spatial reasoning benchmark

The authors introduce Ego3D-Bench, a benchmark comprising over 8,600 QA pairs across five categories (absolute distance, relative distance, localization, motion reasoning, and travel time), designed to evaluate VLMs' 3D spatial understanding in ego-centric multi-view outdoor scenarios. The benchmark is constructed from the validation sets of three public datasets, with significant human annotator involvement to ensure quality and diversity.
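
For concreteness, here is a minimal sketch of what a single benchmark item could look like. The field names, category labels, and file names below are assumptions inferred from the five categories listed above, not the authors' released schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical schema for one Ego3D-Bench QA item; every field name here is
# illustrative, inferred only from the five question categories named above.
@dataclass
class Ego3DBenchItem:
    question_id: str
    category: str            # "absolute_distance" | "relative_distance" |
                             # "localization" | "motion_reasoning" | "travel_time"
    image_paths: List[str]   # ego-centric multi-view images of the same scene
    question: str
    choices: List[str] = field(default_factory=list)  # empty for free-form answers
    answer: str = ""

# Example multi-choice item (contents invented for illustration).
item = Ego3DBenchItem(
    question_id="sample-0001",
    category="relative_distance",
    image_paths=["cam_front.jpg", "cam_front_left.jpg", "cam_back.jpg"],
    question="Which object is closer to the ego vehicle: the red truck or the pedestrian?",
    choices=["the red truck", "the pedestrian"],
    answer="the pedestrian",
)
```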

Retrieved papers compared: 10

Contribution B. Ego3D-VLM: Training-free framework for enhancing 3D spatial reasoning

The authors propose Ego3D-VLM, a training-free method that improves VLMs' 3D spatial understanding by generating a textual cognitive map based on estimated global 3D coordinates. This framework can be integrated with any existing VLM and achieves 12% and 56% average improvements on multi-choice QA and absolute distance estimation, respectively.
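
A minimal sketch of how a training-free, model-agnostic integration of this kind could work, assuming the cognitive map has already been rendered to text (see the generator sketch under Contribution C below): the map is simply prepended to the question as plain prompt text, so no model weights are touched. The wrapper signature is an assumption, not the authors' implementation.

```python
from typing import Callable, List

def answer_with_cognitive_map(
    vlm_generate: Callable[[List[str], str], str],  # any VLM wrapped as (images, prompt) -> text
    image_paths: List[str],
    question: str,
    cognitive_map: str,  # textual map produced upstream
) -> str:
    """Training-free integration: inject the cognitive map as prompt text.

    Because the map is ordinary text and no weights are updated, the same
    wrapper can sit in front of any existing VLM.
    """
    prompt = (
        "Cognitive map of the scene (ego-centered 3D coordinates, in meters):\n"
        f"{cognitive_map}\n\n"
        f"Using the map and the multi-view images, answer: {question}"
    )
    return vlm_generate(image_paths, prompt)
```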

Retrieved papers compared: 10

Contribution C. Textual cognitive map generation for multi-view spatial reasoning

The authors develop a cognitive map generator function that creates a textual representation of the 3D scene: it defines a coordinate system centered on the ego and locates the important objects in that 3D coordinate space. Because the approach serializes only the objects referred to in the question, it is more compact than point-cloud or BEV-image representations while still enabling grounded spatial reasoning.
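
A minimal sketch of such a generator under stated assumptions: object detections with estimated global 3D coordinates are already available (e.g., from an off-the-shelf detector plus metric depth estimation), and objects are matched to the question by simple name lookup. The coordinate convention and output format are illustrative, not the paper's exact design.

```python
from typing import Dict, Tuple

def generate_cognitive_map(
    detections: Dict[str, Tuple[float, float, float]],  # object name -> estimated (x, y, z), meters
    question: str,
) -> str:
    """Serialize an ego-centered textual cognitive map.

    Assumes the ego sits at the origin of the frame (an illustrative
    convention, not necessarily the paper's). Only objects mentioned in
    the question are emitted, which keeps the representation far smaller
    than a point cloud or a BEV image.
    """
    referred = {n: p for n, p in detections.items() if n.lower() in question.lower()}
    lines = ["ego at (0.0, 0.0, 0.0)"]
    for name, (x, y, z) in sorted(referred.items()):
        dist = (x**2 + y**2 + z**2) ** 0.5
        lines.append(f"{name} at ({x:.1f}, {y:.1f}, {z:.1f}), {dist:.1f} m from ego")
    return "\n".join(lines)

# Example: only the object named in the question survives the filter.
detections = {"red truck": (12.0, 30.5, 0.0), "pedestrian": (-2.0, 8.0, 0.0)}
print(generate_cognitive_map(detections, "How far is the pedestrian from the ego?"))
# -> ego at (0.0, 0.0, 0.0)
#    pedestrian at (-2.0, 8.0, 0.0), 8.2 m from ego
```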

Retrieved papers compared: 10
Refutable candidates found: 1 of 10

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A. Ego3D-Bench: Ego-centric multi-view 3D spatial reasoning benchmark (summarized above)

Contribution B. Ego3D-VLM: Training-free framework for enhancing 3D spatial reasoning (summarized above)

Contribution C. Textual cognitive map generation for multi-view spatial reasoning (summarized above)