Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Unified Multimodal Model, Spatial Intelligence, Controllable Generation, Camera Calibration
Abstract:

Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin’s superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance. We will release the code, models, dataset pipeline, and benchmark to advance multimodal spatial intelligence research.
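The abstract distinguishes global camera parameters (roll, pitch, FoV) from pixel-wise camera maps. As a minimal sketch of that distinction, the code below derives a per-pixel latitude map from the three global parameters under an assumed pinhole model with vertical FoV; the function name, sign conventions, and map definition are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def pixelwise_latitude_map(h: int, w: int, roll_deg: float,
                           pitch_deg: float, fov_deg: float) -> np.ndarray:
    """Per-pixel latitude map from global camera parameters.

    A minimal sketch assuming a pinhole camera with vertical FoV,
    image y pointing down, and positive pitch meaning the camera
    looks up. The paper's camera-map definition may differ.
    """
    # Focal length in pixels from the vertical field of view.
    f = (h / 2) / np.tan(np.radians(fov_deg) / 2)

    # Viewing ray per pixel, centered at the principal point.
    ys, xs = np.meshgrid(np.arange(h) - (h - 1) / 2,
                         np.arange(w) - (w - 1) / 2, indexing="ij")
    rays = np.stack([xs, ys, np.full_like(xs, f, dtype=float)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)

    # Camera-to-world rotation: roll about the optical axis, then pitch.
    r, p = np.radians(roll_deg), np.radians(pitch_deg)
    Rz = np.array([[np.cos(r), -np.sin(r), 0.0],
                   [np.sin(r),  np.cos(r), 0.0],
                   [0.0,        0.0,       1.0]])
    Rx = np.array([[1.0, 0.0,        0.0],
                   [0.0, np.cos(p), -np.sin(p)],
                   [0.0, np.sin(p),  np.cos(p)]])
    world = rays @ (Rx @ Rz).T

    # Latitude: elevation of each viewing ray above the horizon plane
    # (world "up" is -y because image y points down).
    return np.degrees(np.arcsin(np.clip(-world[..., 1], -1.0, 1.0)))
```

For example, `pixelwise_latitude_map(256, 256, 0.0, 0.0, 60.0)` yields a map that is 0° along the central horizon row and grows toward the top of the image; changing roll or pitch tilts or shifts that horizon, which is exactly the per-pixel signal a global parameter triple cannot express on its own.
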

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

Puffin proposes a unified camera-centric multimodal model that jointly performs understanding and generation tasks conditioned on camera parameters. Within the taxonomy, it resides in the 'Unified Camera-Centric Multimodal Models' leaf under 'Camera-Aware Perception and Reasoning'. This leaf contains only three papers total, including Puffin itself, indicating a relatively sparse and emerging research direction. The sibling works, Agent3D Zero and MVLLaVA, focus on embodied navigation and multi-view visual question answering, respectively, whereas Puffin emphasizes a broader integration of camera control with vision-language architectures for both perception and generation.

The taxonomy reveals that most camera-centric work concentrates on synthesis tasks (Novel View Synthesis, Dynamic Scene View Synthesis, Scene-Level Generation) or specialized modalities (panoramic, event-based rendering). Camera-Aware Perception and Reasoning is a smaller branch with only two leaf nodes: Camera-Conditioned Semantic Understanding (two papers on segmentation) and Unified Camera-Centric Multimodal Models (three papers). Puffin's positioning suggests it bridges perception-focused methods and generative approaches, diverging from purely synthesis-oriented branches by incorporating explicit reasoning over camera parameters within a multimodal framework. The taxonomy's scope and exclusion notes clarify that Puffin's unified architecture distinguishes it from single-task models in neighboring leaves.

Among 28 candidates examined across three contributions, no clearly refutable prior work was identified. For the unified model contribution, 10 candidates were examined with zero refutations; for the 'thinking with camera' paradigm, 9 candidates yielded no refutations; and for the Puffin-4M dataset, 9 candidates similarly showed no overlapping prior work. This suggests that within the limited search scope—top-K semantic matches plus citation expansion—no existing work directly anticipates Puffin's combination of camera-as-language reasoning, joint understanding-generation architecture, and large-scale vision-language-camera triplet training. The absence of refutations across all contributions indicates potential novelty, though the search was not exhaustive.

Based on the limited literature search of 28 candidates, Puffin appears to occupy a relatively unexplored niche at the intersection of camera-aware reasoning and multimodal generation. The sparse population of its taxonomy leaf and the lack of refutable prior work suggest meaningful novelty, though a broader search might reveal additional related efforts. The analysis covers top semantic matches and does not claim completeness across the entire field.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 28
Refutable papers: 0

Research Landscape Overview

Core task: camera-centric understanding and generation from arbitrary viewpoints. The field encompasses methods that synthesize, reason about, or manipulate visual content under varying camera perspectives.

At the highest level, the taxonomy divides into eight major branches. Novel View Synthesis from Sparse or Single Views (e.g., Synsin[2], Free View Synthesis[5]) focuses on reconstructing scenes from limited input, while Dynamic Scene View Synthesis extends this to temporal settings. Scene-Level Generation and Exploration (e.g., Megascenes[3], 3D SceneDreamer[7]) emphasizes creating or navigating large-scale environments, and Multi-View Consistent Generation ensures coherence across viewpoints. Specialized View Synthesis Modalities address domain-specific rendering (panoramic, event-based, etc.), whereas Camera-Aware Perception and Reasoning targets tasks like bird's-eye-view segmentation or multimodal understanding that explicitly leverage camera geometry. Camera Control and Interaction (e.g., Generative Camera Dolly[8]) provides user-driven trajectory specification, and Foundational Representations and Theory underpins the geometric and learning frameworks common to all branches.

Several active lines of work highlight contrasting emphases: some pursue end-to-end generative models that produce novel views directly from text or sparse images (Flare[1], ArbiViewGen[18]), while others build explicit 3D representations before rendering. Trade-offs between computational efficiency, geometric fidelity, and generalization remain central.

Within Camera-Aware Perception and Reasoning, a small cluster of Unified Camera-Centric Multimodal Models integrates vision-language understanding with viewpoint reasoning. Thinking with Camera[0] sits squarely in this cluster, emphasizing joint reasoning over camera parameters and scene semantics in a multimodal framework. Compared to Agent3D Zero[40], which focuses on embodied navigation tasks, and MVLLaVA[45], which targets multi-view visual question answering, Thinking with Camera[0] appears to prioritize a broader integration of camera control signals within large-scale vision-language architectures, bridging perception and interactive generation.

Claimed Contributions

Puffin: unified camera-centric multimodal model

The authors introduce Puffin, a unified framework that jointly performs camera-centric understanding (estimating camera parameters from images) and generation (controllable image synthesis from camera parameters). This represents the first attempt to unify these two traditionally isolated tasks within a single multimodal model.

Retrieved papers: 10
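As a rough sketch of what such a joint interface could look like, the skeleton below pairs the two directions in one model class. The class and method names (UnifiedCameraModel, understand, generate) are hypothetical placeholders, not the released Puffin API.

```python
from dataclasses import dataclass

@dataclass
class Camera:
    """Global camera parameters, as described in the paper's abstract."""
    roll_deg: float
    pitch_deg: float
    fov_deg: float

class UnifiedCameraModel:
    """Schematic interface for joint camera-centric understanding and generation.

    A sketch under assumed names; per the abstract, Puffin couples language
    regression (understanding) with diffusion-based decoding (generation)
    in one model rather than exposing two separate methods like these.
    """

    def understand(self, image) -> Camera:
        # Understanding direction: image -> camera parameters,
        # e.g., regressed as language tokens by the multimodal backbone.
        raise NotImplementedError

    def generate(self, caption: str, camera: Camera):
        # Generation direction: (caption, camera) -> image,
        # e.g., decoded by a camera-conditioned diffusion head.
        raise NotImplementedError
```
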
Thinking with camera paradigm

The authors propose a novel mechanism called thinking with camera that bridges the modality gap between camera parameters and vision-language models. It aligns spatially grounded visual cues with professional photographic terms through structured spatial reasoning across geometric context (roll, pitch, FoV), enabling both accurate parameter prediction and controllable generation.

Retrieved papers: 9
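To make the camera-as-language alignment concrete, here is a hypothetical sketch that discretizes roll, pitch, and FoV into photographic vocabulary. The thresholds, sign conventions (positive pitch meaning the camera tilts up), and terms are illustrative assumptions, not the paper's actual mapping.

```python
def camera_to_terms(roll_deg: float, pitch_deg: float, fov_deg: float) -> list[str]:
    """Translate numeric camera parameters into photographic vocabulary.

    A hypothetical illustration of treating camera as language: thresholds,
    sign conventions, and terms are assumptions, not the paper's mapping.
    """
    terms = []
    if abs(roll_deg) > 10:
        terms.append("Dutch angle")
    if pitch_deg > 15:
        terms.append("low-angle shot (camera tilted up)")
    elif pitch_deg < -15:
        terms.append("high-angle shot (camera tilted down)")
    else:
        terms.append("eye-level shot")
    if fov_deg >= 90:
        terms.append("ultra wide-angle lens")
    elif fov_deg >= 60:
        terms.append("wide-angle lens")
    elif fov_deg <= 30:
        terms.append("telephoto lens")
    else:
        terms.append("standard lens")
    return terms

# Example: a tilted, upward-looking, very wide shot.
# camera_to_terms(20.0, 25.0, 95.0)
# -> ['Dutch angle', 'low-angle shot (camera tilted up)', 'ultra wide-angle lens']
```
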
Puffin-4M dataset and benchmark

The authors construct Puffin-4M, a large-scale dataset containing 4 million vision-language-camera triplets with precise camera parameters, descriptive captions, pixel-wise camera maps, and spatial reasoning annotations. They also establish comprehensive benchmarks (Puffin-Gen and Puffin-Und) for evaluating camera-centric multimodal models.

Retrieved papers: 9
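For readers wondering what one vision-language-camera triplet with these annotations might look like on disk, here is a hypothetical per-sample schema. All field names and types are illustrative assumptions based on the contribution description; the released Puffin-4M format may differ.

```python
from dataclasses import dataclass

@dataclass
class PuffinSample:
    """Hypothetical record for one vision-language-camera triplet.

    Field names and types are illustrative assumptions; the released
    Puffin-4M schema may differ.
    """
    image_path: str        # the rendered or captured view
    caption: str           # descriptive scene caption (the "language" leg)
    reasoning: str         # spatial reasoning annotation for thinking with camera
    roll_deg: float        # global camera parameters (the "camera" leg)
    pitch_deg: float
    fov_deg: float
    camera_map_path: str   # pixel-wise camera map aligned with the image
```
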

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Puffin: unified camera-centric multimodal model
Thinking with camera paradigm
Puffin-4M dataset and benchmark
