On the Generalization Capacities of MLLMs for Spatial Intelligence

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: 3D Computer Vision, Multimodal Large Language Model, Spatial Intelligence, Embodied AI
Abstract:

Multimodal Large Language Models (MLLMs) that directly process RGB inputs for tasks like 3D localization and navigation have shown remarkable potential. However, we argue that these "RGB-only" approaches are fundamentally flawed in their ability to generalize across cameras. By ignoring camera parameters, they entangle an object's physical properties with the camera's perspective, creating an irresolvable ambiguity. We show that this leads MLLMs to overfit to the training camera distribution rather than learning true, generalizable 3D geometric principles. To address this, we propose a Camera-Aware MLLM framework that learns generalizable spatial reasoning by: (i) injecting camera intrinsics via a dense embedding that conditions each visual token; (ii) introducing a camera-aware data augmentation strategy that synthetically varies camera parameters, forcing the model to disentangle camera properties from scene content; and (iii) distilling geometric priors from a 3D vision foundation model. Extensive experiments demonstrate that camera-aware MLLMs substantially outperform their naive counterparts, particularly in cross-camera generalization tests on spatially grounded tasks, indicating that camera-awareness is not merely beneficial but a prerequisite for robust and generalizable spatial intelligence in MLLMs.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a Camera-Aware MLLM framework that injects camera intrinsics via dense embeddings, applies camera-aware data augmentation, and distills geometric priors from 3D vision models. It resides in the Camera Parameter Integration leaf, which contains only three papers in total: this work and two siblings (SpatialRGPT and SpaceMind). This is a relatively sparse research direction within the broader taxonomy of 44 papers, suggesting that the explicit integration of camera parameters into MLLMs remains an emerging area rather than a crowded subfield.

The taxonomy reveals that Camera Parameter Integration sits under Camera-Aware Spatial Reasoning Frameworks, which contrasts sharply with Camera-Agnostic Spatial Reasoning (covering RGB-only methods, perspective-taking, and multi-view integration) and 3D-Enhanced Spatial Reasoning (using point clouds or depth). The scope note for Camera Parameter Integration explicitly excludes RGB-only approaches, positioning this work as fundamentally distinct from methods that attempt spatial reasoning without camera metadata. Neighboring leaves like Camera-Guided Multimodal Fusion and Implicit 3D Reasoning from 2D explore related but orthogonal strategies, indicating the field has multiple parallel pathways for achieving spatial understanding.

Among 30 candidates examined, the analysis of geometric ambiguity in RGB-only reasoning found no refutable prior work across 10 candidates, suggesting this conceptual framing may be relatively novel. However, the Camera-Aware MLLM framework itself encountered one refutable candidate among 10 examined, and the empirical validation of camera-awareness for generalization also found one refutable match among 10 candidates. These statistics indicate that while the core architectural approach has some overlap with existing work, the scale of the search was limited and the majority of examined candidates did not directly refute the contributions.

Given the limited search scope of 30 candidates and the sparse population of the Camera Parameter Integration leaf, the work appears to occupy a less-explored niche within spatial reasoning for MLLMs. The taxonomy structure suggests that explicit camera modeling is not yet mainstream, with most prior efforts concentrated in camera-agnostic or 3D-enhanced directions. The contribution-level statistics reflect partial novelty, though a more exhaustive literature review would be needed to confirm the extent of overlap with the identified refutable candidates.

Taxonomy

Core-task taxonomy papers: 44
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 2

Research Landscape Overview

Core task: camera-aware spatial reasoning in multimodal large language models. The field has organized itself around several complementary directions that reflect different strategies for equipping vision-language models with spatial understanding. Camera-Aware Spatial Reasoning Frameworks explicitly integrate camera parameters—such as intrinsic matrices, extrinsics, or viewpoint metadata—to ground spatial predictions in geometric principles, as seen in works like SpatialRGPT[1] and SpaceMind[23]. In contrast, Camera-Agnostic Spatial Reasoning approaches attempt to infer spatial relations directly from visual cues without explicit camera information, while 3D-Enhanced Spatial Reasoning leverages depth maps, point clouds, or volumetric representations to enrich the model's geometric awareness. Spatial Reasoning Benchmarks and Evaluation provide standardized testbeds such as ViewSpatial-Bench[5] and 3DSRBench[40] to measure progress, and Application-Driven Spatial Intelligence explores deployment in robotics, navigation, and autonomous driving. Auxiliary Techniques for Spatial Reasoning encompass prompting strategies like SpatialPrompting[3], data augmentation, and architectural modifications that support these core capabilities.

Recent work has highlighted trade-offs between explicit geometric grounding and end-to-end learned representations. Camera-aware methods promise stronger generalization by encoding known physical constraints, yet they require access to camera metadata that may not always be available or accurate. On the Generalization Capacities[0] sits within the Camera Parameter Integration cluster, examining how well models trained with explicit camera information transfer across diverse viewpoints and scenes—a question closely related to efforts like SpatialRGPT[1], which also integrates camera parameters for improved spatial grounding, and SpaceMind[23], which explores viewpoint-conditioned reasoning.
Meanwhile, benchmarks such as ViewSpatial-Bench[5] reveal that many models still struggle with perspective shifts and occlusion reasoning, underscoring open challenges in robustness and scalability. These contrasting lines of work collectively shape an evolving landscape where geometric priors and data-driven learning continue to be balanced.

Claimed Contributions

Analysis of geometric ambiguity in RGB-only spatial reasoning

The authors formally analyze how RGB-only MLLMs suffer from fundamental geometric ambiguities (focal-depth and size-depth) when camera intrinsics are unknown. They show theoretically and empirically that this causes models to overfit to training camera distributions rather than learning generalizable 3D principles.

10 retrieved papers compared; no refutable candidate found.
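The focal-depth ambiguity this contribution analyzes follows directly from the pinhole projection model: apparent image size scales with focal length divided by depth, so jointly scaling both leaves the image unchanged. A minimal numeric illustration of that claim (the function name and the specific values are illustrative, not taken from the paper):

```python
def projected_height_px(object_height_m: float, depth_m: float, focal_px: float) -> float:
    """Pinhole projection: apparent height in pixels of an object of the
    given physical height (m) at the given depth (m), with focal length f (px)."""
    return focal_px * object_height_m / depth_m

# A 1.5 m object at 4 m seen through f = 1000 px ...
h1 = projected_height_px(1.5, 4.0, 1000.0)
# ... is pixel-identical to the same object at 8 m through f = 2000 px.
h2 = projected_height_px(1.5, 8.0, 2000.0)
assert h1 == h2  # both 375 px: depth cannot be recovered without knowing f
```

An RGB-only model receives identical pixels in both cases, which is exactly the irresolvable entanglement of depth and focal length that the contribution formalizes.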
Camera-Aware MLLM framework

The authors introduce a novel framework that makes spatial MLLMs explicitly camera-aware via three mechanisms: dense camera ray embeddings that condition visual tokens on intrinsic parameters, camera-aware data augmentation that synthetically varies camera parameters, and distillation of geometric priors from a 3D vision foundation model.

10 retrieved papers compared; one refutable candidate found (Can Refute).
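The dense camera ray embedding described above can be grounded in standard pinhole geometry: each pixel (u, v) back-projects through the inverse intrinsics to a viewing ray, and those per-pixel directions are what a learned embedding would condition the visual tokens on. A minimal NumPy sketch under that assumption (function and variable names are illustrative; the paper's actual embedding architecture is not reproduced here):

```python
import numpy as np

def pixel_ray_directions(K: np.ndarray, height: int, width: int) -> np.ndarray:
    """Unit viewing ray per pixel: d proportional to K_inv @ [u, v, 1].
    Exposing these rays to the model (e.g. via an embedding added to the
    visual tokens) makes the camera intrinsics visible to the network."""
    K_inv = np.linalg.inv(K)
    # Pixel-center coordinates for every location in the image grid.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels_h = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3) homogeneous
    rays = pixels_h @ K_inv.T                              # back-project to camera space
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

# Example intrinsics: f = 800 px, principal point at the image center.
K = np.array([[800.0,   0.0, 320.0],
              [  0.0, 800.0, 240.0],
              [  0.0,   0.0,   1.0]])
rays = pixel_ray_directions(K, height=480, width=640)      # shape (480, 640, 3)
```

Because the ray map changes whenever the intrinsics change, two images with identical pixels but different cameras are no longer indistinguishable to the model.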
Empirical validation of camera-awareness as prerequisite for generalization

Through comprehensive experiments on cross-camera generalization tasks and spatial reasoning benchmarks, the authors demonstrate that their camera-aware approach substantially outperforms camera-agnostic baselines, particularly when tested on out-of-distribution cameras, proving that camera-awareness is essential for robust spatial intelligence.

10 retrieved papers compared; one refutable candidate found (Can Refute).
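A standard way to realize the synthetic camera variation underlying such cross-camera tests is crop-and-resize: cropping shifts the principal point and resizing rescales the focal length, yielding a new, geometrically valid pinhole camera for the same scene. A hedged sketch of the intrinsics bookkeeping (names and numbers are illustrative, not taken from the paper):

```python
def crop_resize_intrinsics(fx: float, fy: float, cx: float, cy: float,
                           crop: tuple, out_w: int, out_h: int) -> tuple:
    """Update pinhole intrinsics after taking a crop (x0, y0, w, h) and
    resizing it to (out_w, out_h). Cropping shifts the principal point;
    resizing scales the focal lengths and principal point alike."""
    x0, y0, w, h = crop
    sx, sy = out_w / w, out_h / h
    return (fx * sx, fy * sy, (cx - x0) * sx, (cy - y0) * sy)

# A centered half-size crop of a 640x480 image, resized back to 640x480,
# doubles the effective focal length: a synthetic "zoomed-in" camera.
fx2, fy2, cx2, cy2 = crop_resize_intrinsics(
    800.0, 800.0, 320.0, 240.0, crop=(160, 120, 320, 240), out_w=640, out_h=480)
assert (fx2, fy2) == (1600.0, 1600.0)
```

A camera-agnostic model sees only the resampled pixels, while a camera-aware model also receives the updated intrinsics, which is what the out-of-distribution camera tests probe.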

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
