On the Generalization Capacities of MLLMs for Spatial Intelligence

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: 3D Computer Vision, Multimodal Large Language Model, Spatial Intelligence, Embodied AI
Abstract:

Multimodal Large Language Models (MLLMs) that directly process RGB inputs for tasks like 3D localization and navigation have shown remarkable potential. However, we argue that these "RGB-only" approaches are fundamentally flawed in their ability to generalize across cameras. By ignoring camera parameters, they entangle an object's physical properties with the camera's perspective, creating an irresolvable ambiguity. We show that this leads MLLMs to overfit to the training camera distribution rather than learning true, generalizable 3D geometric principles. To address this, we propose a Camera-Aware MLLM framework that learns generalizable spatial reasoning by: (i) injecting camera intrinsics via a dense embedding that conditions each visual token; (ii) introducing a camera-aware data augmentation strategy that synthetically varies camera parameters, forcing the model to disentangle camera properties from scene content; and (iii) distilling geometric priors from a 3D vision foundation model. Extensive experiments demonstrate that camera-aware MLLMs substantially outperform their naive counterparts, particularly in cross-camera generalization tests on spatially grounded tasks, indicating that camera-awareness is not merely beneficial but a prerequisite for robust and generalizable spatial intelligence in MLLMs.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a Camera-Aware MLLM framework that injects camera intrinsics via dense embeddings, applies camera-aware data augmentation, and distills geometric priors from 3D vision models. It resides in the Camera Parameter Integration leaf, which contains only three papers in total: this work and two siblings (SpatialRGPT and SpaceMind). This is a relatively sparse research direction within the broader taxonomy of 44 papers, suggesting that the explicit integration of camera parameters into MLLMs remains an emerging area rather than a crowded subfield.

The taxonomy reveals that Camera Parameter Integration sits under Camera-Aware Spatial Reasoning Frameworks, which contrasts sharply with Camera-Agnostic Spatial Reasoning (covering RGB-only methods, perspective-taking, and multi-view integration) and 3D-Enhanced Spatial Reasoning (using point clouds or depth). The scope note for Camera Parameter Integration explicitly excludes RGB-only approaches, positioning this work as fundamentally distinct from methods that attempt spatial reasoning without camera metadata. Neighboring leaves like Camera-Guided Multimodal Fusion and Implicit 3D Reasoning from 2D explore related but orthogonal strategies, indicating the field has multiple parallel pathways for achieving spatial understanding.

Among 30 candidates examined, the analysis of geometric ambiguity in RGB-only reasoning found no refutable prior work across 10 candidates, suggesting this conceptual framing may be relatively novel. However, the Camera-Aware MLLM framework itself encountered one refutable candidate among 10 examined, and the empirical validation of camera-awareness for generalization also found one refutable match among 10 candidates. These statistics indicate that while the core architectural approach has some overlap with existing work, the scale of the search was limited and the majority of examined candidates did not directly refute the contributions.

Given the limited search scope of 30 candidates and the sparse population of the Camera Parameter Integration leaf, the work appears to occupy a less-explored niche within spatial reasoning for MLLMs. The taxonomy structure suggests that explicit camera modeling is not yet mainstream, with most prior efforts concentrated in camera-agnostic or 3D-enhanced directions. The contribution-level statistics reflect partial novelty, though a more exhaustive literature review would be needed to confirm the extent of overlap with the identified refutable candidates.

Taxonomy

Core-task taxonomy papers: 44
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 2

Research Landscape Overview

Core task: camera-aware spatial reasoning in multimodal large language models. The field has organized itself around several complementary directions that reflect different strategies for equipping vision-language models with spatial understanding. Camera-Aware Spatial Reasoning Frameworks explicitly integrate camera parameters—such as intrinsic matrices, extrinsics, or viewpoint metadata—to ground spatial predictions in geometric principles, as seen in works like SpatialRGPT[1] and SpaceMind[23]. In contrast, Camera-Agnostic Spatial Reasoning approaches attempt to infer spatial relations directly from visual cues without explicit camera information, while 3D-Enhanced Spatial Reasoning leverages depth maps, point clouds, or volumetric representations to enrich the model's geometric awareness. Spatial Reasoning Benchmarks and Evaluation provide standardized testbeds such as ViewSpatial-Bench[5] and 3DSRBench[40] to measure progress, and Application-Driven Spatial Intelligence explores deployment in robotics, navigation, and autonomous driving. Auxiliary Techniques for Spatial Reasoning encompass prompting strategies like SpatialPrompting[3], data augmentation, and architectural modifications that support these core capabilities.

Recent work has highlighted trade-offs between explicit geometric grounding and end-to-end learned representations. Camera-aware methods promise stronger generalization by encoding known physical constraints, yet they require access to camera metadata that may not always be available or accurate. On the Generalization Capacities[0] sits within the Camera Parameter Integration cluster, examining how well models trained with explicit camera information transfer across diverse viewpoints and scenes—a question closely related to efforts like SpatialRGPT[1], which also integrates camera parameters for improved spatial grounding, and SpaceMind[23], which explores viewpoint-conditioned reasoning.
Meanwhile, benchmarks such as ViewSpatial-Bench[5] reveal that many models still struggle with perspective shifts and occlusion reasoning, underscoring open challenges in robustness and scalability. These contrasting lines of work collectively shape an evolving landscape where geometric priors and data-driven learning continue to be balanced.

Claimed Contributions

Analysis of geometric ambiguity in RGB-only spatial reasoning

The authors formally analyze how RGB-only MLLMs suffer from fundamental geometric ambiguities (focal-depth and size-depth) when camera intrinsics are unknown. They show theoretically and empirically that this causes models to overfit to training camera distributions rather than learning generalizable 3D principles.

10 retrieved papers compared; no refutable candidate found.
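The focal-depth ambiguity this contribution analyzes follows directly from the pinhole projection model: apparent image size scales with focal length divided by depth, so jointly scaling both leaves the image unchanged. A minimal numeric illustration of that claim (the function name and the specific values are illustrative, not taken from the paper):

```python
def projected_height_px(object_height_m: float, depth_m: float, focal_px: float) -> float:
    """Pinhole projection: apparent height in pixels of an object of the
    given physical height (m) at the given depth (m), with focal length f (px)."""
    return focal_px * object_height_m / depth_m

# A 1.5 m object at 4 m seen through f = 1000 px ...
h1 = projected_height_px(1.5, 4.0, 1000.0)
# ... is pixel-identical to the same object at 8 m through f = 2000 px.
h2 = projected_height_px(1.5, 8.0, 2000.0)
assert h1 == h2  # both 375 px: depth cannot be recovered without knowing f
```

An RGB-only model receives identical pixels in both cases, which is exactly the irresolvable entanglement of depth and focal length that the contribution formalizes.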
Camera-Aware MLLM framework

The authors introduce a novel framework that makes spatial MLLMs explicitly camera-aware via three mechanisms: dense camera ray embeddings that condition visual tokens on intrinsic parameters, camera-aware data augmentation that synthetically varies camera parameters, and distillation of geometric priors from a 3D vision foundation model.

10 retrieved papers compared; one refutable candidate found (Can Refute).
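The dense camera ray embedding described above can be grounded in standard pinhole geometry: each pixel (u, v) back-projects through the inverse intrinsics to a viewing ray, and those per-pixel directions are what a learned embedding would condition the visual tokens on. A minimal NumPy sketch under that assumption (function and variable names are illustrative; the paper's actual embedding architecture is not reproduced here):

```python
import numpy as np

def pixel_ray_directions(K: np.ndarray, height: int, width: int) -> np.ndarray:
    """Unit viewing ray per pixel: d proportional to K_inv @ [u, v, 1].
    Exposing these rays to the model (e.g. via an embedding added to the
    visual tokens) makes the camera intrinsics visible to the network."""
    K_inv = np.linalg.inv(K)
    # Pixel-center coordinates for every location in the image grid.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels_h = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3) homogeneous
    rays = pixels_h @ K_inv.T                              # back-project to camera space
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

# Example intrinsics: f = 800 px, principal point at the image center.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
rays = pixel_ray_directions(K, height=480, width=640)      # shape (480, 640, 3)
```

Because the ray map changes whenever the intrinsics change, two images with identical pixels but different cameras are no longer indistinguishable to the model.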
Empirical validation of camera-awareness as prerequisite for generalization

Through comprehensive experiments on cross-camera generalization tasks and spatial reasoning benchmarks, the authors demonstrate that their camera-aware approach substantially outperforms camera-agnostic baselines, particularly when tested on out-of-distribution cameras, proving that camera-awareness is essential for robust spatial intelligence.

10 retrieved papers compared; one refutable candidate found (Can Refute).
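A standard way to realize the synthetic camera variation underlying such cross-camera tests is crop-and-resize: cropping shifts the principal point and resizing rescales the focal length, yielding a new, geometrically valid pinhole camera for the same scene. A hedged sketch of the intrinsics bookkeeping (names and numbers are illustrative, not taken from the paper):

```python
def crop_resize_intrinsics(fx: float, fy: float, cx: float, cy: float,
                           crop: tuple, out_w: int, out_h: int) -> tuple:
    """Update pinhole intrinsics after taking a crop (x0, y0, w, h) and
    resizing it to (out_w, out_h). Cropping shifts the principal point;
    resizing scales the focal lengths and principal point alike."""
    x0, y0, w, h = crop
    sx, sy = out_w / w, out_h / h
    return (fx * sx, fy * sy, (cx - x0) * sx, (cy - y0) * sy)

# A centered half-size crop of a 640x480 image, resized back to 640x480,
# doubles the effective focal length: a synthetic "zoomed-in" camera.
fx2, fy2, cx2, cy2 = crop_resize_intrinsics(
    800.0, 800.0, 320.0, 240.0, crop=(160, 120, 320, 240), out_w=640, out_h=480)
assert (fx2, fy2) == (1600.0, 1600.0)
```

A camera-agnostic model sees only the resampled pixels, while a camera-aware model also receives the updated intrinsics, which is what the out-of-distribution camera tests probe.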

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
