Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
Overview
Overall Novelty Assessment
Puffin proposes a unified camera-centric multimodal model that jointly performs understanding and generation conditioned on camera parameters. Within the taxonomy, it resides in the 'Unified Camera-Centric Multimodal Models' leaf under 'Camera-Aware Perception and Reasoning'. This leaf contains only three papers in total, including Puffin itself, indicating a sparse and still-emerging research direction. The sibling works, Agent3D-Zero and MVLLaVA, focus on zero-shot 3D scene understanding and agent-based novel view synthesis respectively, whereas Puffin emphasizes broader integration of camera control with vision-language architectures for both perception and generation.
The taxonomy reveals that most camera-centric work concentrates on synthesis tasks (Novel View Synthesis, Dynamic Scene View Synthesis, Scene-Level Generation) or specialized modalities (panoramic and event-based rendering). Camera-Aware Perception and Reasoning is a smaller branch with only two leaf nodes: Camera-Conditioned Semantic Understanding (two papers on segmentation) and Unified Camera-Centric Multimodal Models (three papers). Puffin's position suggests it bridges perception-focused methods and generative approaches, diverging from the purely synthesis-oriented branches by incorporating explicit reasoning over camera parameters within a single multimodal framework. The taxonomy's scope and exclusion notes further clarify that Puffin's unified architecture distinguishes it from the single-task models in neighboring leaves.
Across the three claimed contributions, 28 candidate papers were examined and no clearly refutable prior work was identified: 10 candidates for the unified-model contribution, 9 for the 'thinking with camera' paradigm, and 9 for the Puffin-4M dataset, with zero refutations in each case. Within the limited search scope (top-K semantic matches plus citation expansion), no existing work directly anticipates Puffin's combination of camera-as-language reasoning, a joint understanding-generation architecture, and large-scale vision-language-camera triplet training. The absence of refutations across all contributions indicates potential novelty, though the search was not exhaustive.
Based on the limited literature search of 28 candidates, Puffin appears to occupy a relatively unexplored niche at the intersection of camera-aware reasoning and multimodal generation. The sparse population of its taxonomy leaf and the lack of refutable prior work suggest meaningful novelty, though a broader search might reveal additional related efforts. The analysis covers top semantic matches and does not claim completeness across the entire field.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce Puffin, a unified framework that jointly performs camera-centric understanding (estimating camera parameters from images) and generation (controllable image synthesis from camera parameters). This represents the first attempt to unify these two traditionally isolated tasks within a single multimodal model.
The authors propose a novel mechanism, termed thinking with camera, that bridges the modality gap between numeric camera parameters and vision-language models. It aligns spatially grounded visual cues with professional photographic terms through structured spatial reasoning over geometric camera context (roll, pitch, and field of view), enabling both accurate parameter prediction and controllable generation.
The authors construct Puffin-4M, a large-scale dataset containing 4 million vision-language-camera triplets with precise camera parameters, descriptive captions, pixel-wise camera maps, and spatial reasoning annotations. They also establish comprehensive benchmarks (Puffin-Gen and Puffin-Und) for evaluating camera-centric multimodal models.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[40] Agent3D-Zero: An Agent for Zero-Shot 3D Understanding
[45] MVLLaVA: An Intelligent Agent for Unified and Flexible Novel View Synthesis
Contribution Analysis
Detailed comparisons for each claimed contribution
Puffin: unified camera-centric multimodal model
The authors introduce Puffin, a unified framework that jointly performs camera-centric understanding (estimating camera parameters from images) and generation (controllable image synthesis from camera parameters). This represents the first attempt to unify these two traditionally isolated tasks within a single multimodal model.
[69] VGGT: Visual Geometry Grounded Transformer
[70] ImageDream: Image-Prompt Multi-View Diffusion for 3D Generation
[71] GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control
[72] FloVD: Optical Flow Meets Video Diffusion Model for Enhanced Camera-Controlled Video Synthesis
[73] CamEdit: Continuous Camera Parameter Control for Photorealistic Image Editing
[74] VidCraft3: Camera, Object, and Lighting Control for Image-to-Video Generation
[75] Training-Free Camera Control for Video Generation
[76] RealCam-I2V: Real-World Image-to-Video Generation with Interactive Complex Camera Control
[77] Multimodal Image Synthesis and Editing: A Survey and Taxonomy
[78] Generative Photography: Scene-Consistent Camera Control for Realistic Text-to-Image Synthesis
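The unification claim can be made concrete with a minimal sketch. All names below are hypothetical illustrations, not Puffin's actual API; the point is only that both task directions share one camera representation, which a real unified model would serve from a single multimodal backbone:

```python
from dataclasses import dataclass

@dataclass
class CameraParams:
    """Shared camera representation used in both directions (degrees)."""
    roll: float   # rotation about the optical axis
    pitch: float  # tilt of the optical axis above/below the horizon
    fov: float    # horizontal field of view

class UnifiedCameraModel:
    """Toy stand-in for a single model serving both tasks (not Puffin's API)."""

    def understand(self, image) -> CameraParams:
        # Understanding direction: image -> camera parameters (stubbed estimate).
        return CameraParams(roll=0.0, pitch=0.0, fov=60.0)

    def generate(self, params: CameraParams, caption: str) -> dict:
        # Generation direction: camera parameters + text -> image
        # (stubbed as a dict describing the requested render).
        return {"caption": caption, "camera": params}

model = UnifiedCameraModel()
params = model.understand(image=None)                # perception: image -> parameters
render = model.generate(params, "a street at dusk")  # synthesis: parameters -> image
```

In prior work the two directions are typically served by separate single-task models; the sketch shows what it means for one interface to consume and produce the same `CameraParams` structure.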
Thinking with camera paradigm
The authors propose a novel mechanism, termed thinking with camera, that bridges the modality gap between numeric camera parameters and vision-language models. It aligns spatially grounded visual cues with professional photographic terms through structured spatial reasoning over geometric camera context (roll, pitch, and field of view), enabling both accurate parameter prediction and controllable generation.
[53] SpatialRGPT: Grounded Spatial Reasoning in Vision-Language Models
[61] TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics
[62] Grounding Actions in Camera Space: Observation-Centric Vision-Language-Action Policy
[63] 3DS-VLA: A 3D Spatial-Aware Vision-Language-Action Model for Robust Multi-Task Manipulation
[64] RoboMM: All-in-One Multimodal Large Model for Robotic Manipulation
[65] CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography
[66] SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models
[67] RoboRetriever: Single-Camera Robot Object Retrieval via Active and Interactive Perception with Dynamic Scene Graph
[68] Large Language Models and 3D Vision for Intelligent Robotic Perception and Autonomy: A Review
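To illustrate the kind of alignment the 'thinking with camera' paradigm describes, here is a toy sketch mapping raw camera parameters to professional photographic vocabulary. The function name and all thresholds are illustrative assumptions, not values from the paper:

```python
def camera_to_photo_terms(roll: float, pitch: float, fov: float) -> list[str]:
    """Map raw camera parameters (degrees) to photographic vocabulary.

    Thresholds are illustrative guesses, not the paper's actual alignment.
    """
    terms = []
    # Field of view: wide-angle vs. telephoto framing.
    if fov >= 90:
        terms.append("ultra-wide shot")
    elif fov >= 60:
        terms.append("wide-angle shot")
    elif fov <= 30:
        terms.append("telephoto shot")
    else:
        terms.append("normal lens")
    # Pitch: low-angle / high-angle composition.
    if pitch >= 15:
        terms.append("low-angle view (camera tilted up)")
    elif pitch <= -15:
        terms.append("high-angle view (camera tilted down)")
    # Roll: a tilted horizon reads as a Dutch angle.
    if abs(roll) >= 10:
        terms.append("Dutch angle")
    return terms

terms = camera_to_photo_terms(roll=12.0, pitch=30.0, fov=95.0)
# -> ['ultra-wide shot', 'low-angle view (camera tilted up)', 'Dutch angle']
```

The sketch runs in the understanding direction (parameters to terms); a text-to-image model conditioned on such terms would use the alignment in reverse for controllable generation.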
Puffin-4M dataset and benchmark
The authors construct Puffin-4M, a large-scale dataset containing 4 million vision-language-camera triplets with precise camera parameters, descriptive captions, pixel-wise camera maps, and spatial reasoning annotations. They also establish comprehensive benchmarks (Puffin-Gen and Puffin-Und) for evaluating camera-centric multimodal models.
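As a rough illustration of what one vision-language-camera triplet might contain, here is a hypothetical record layout; every field name is a guess based on the contribution description, not Puffin-4M's actual schema:

```python
from dataclasses import dataclass

@dataclass
class VLCTriplet:
    """One hypothetical Puffin-4M-style record; field names are guesses."""
    image_path: str
    caption: str        # descriptive caption
    roll: float         # camera roll in degrees
    pitch: float        # camera pitch in degrees
    fov: float          # horizontal field of view in degrees
    reasoning: str      # spatial-reasoning annotation
    camera_map: list    # H x W per-pixel camera encoding (tiny toy placeholder)

sample = VLCTriplet(
    image_path="scene_000001.jpg",
    caption="a cathedral interior photographed looking upward",
    roll=2.5,
    pitch=35.0,
    fov=85.0,
    reasoning="converging vertical lines and a visible ceiling suggest strong upward pitch",
    camera_map=[[0.0] * 4 for _ in range(4)],
)
```

The sketch is only meant to show how precise parameters, captions, pixel-wise camera maps, and reasoning annotations could coexist in a single training record.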