IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: 3D Scene Understanding; Multi-View Reconstruction
Abstract:

Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions, enabling coherent and accurate understanding of complex scenes. Most prior approaches, however, train large geometry models for low-level 3D reconstruction and treat high-level spatial understanding in isolation, overlooking the crucial interplay between these two fundamental aspects of 3D scene analysis; this limits generalization and leads to poor performance on downstream 3D understanding tasks. Recent attempts mitigate this issue by simply aligning 3D models with a specific language model, which restricts perception to the aligned model's capacity and limits adaptability to downstream tasks. In this paper, we propose the Instance-Grounded Geometry Transformer (IGGT), an end-to-end unified transformer that combines knowledge for spatial reconstruction with instance-level contextual understanding. Specifically, we design a 3D-Consistent Contrastive Learning strategy that guides IGGT to encode a unified representation capturing both geometric structure and instance-grounded clustering from 2D visual inputs alone. This representation supports consistently lifting 2D visual inputs into a coherent 3D scene with explicitly distinct object instances. To facilitate this task, we further construct InsScene-15K, a large-scale dataset with high-quality RGB images, camera poses, depth maps, and 3D-consistent instance-level mask annotations, built with a novel data curation pipeline. Unlike previous methods that are bound to a specific language model, we introduce an Instance-Grounded Scene Understanding paradigm in which instance masks serve as the bridge between our unified representation and diverse Visual Language Models (VLMs) in a plug-and-play manner, substantially expanding downstream understanding capabilities.
Extensive experiments on multi-view instance matching, open-vocabulary segmentation, and QA-based scene grounding demonstrate that IGGT outperforms state-of-the-art methods in both quality and consistency for semantic 3D reconstruction.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate; the results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: semantic 3D reconstruction with instance-level understanding. This field aims to recover complete 3D scene geometry while simultaneously identifying, segmenting, and semantically labeling individual object instances. The taxonomy reveals a rich landscape organized around several complementary directions. Instance-Aware 3D Reconstruction and Decomposition covers methods that explicitly model and separate distinct objects during reconstruction, often leveraging volumetric representations or incremental mapping strategies (e.g., Volumetric Instance Mapping[7], RGBD Instance Map[5]). Scene-Level Semantic Completion and Understanding emphasizes filling in occluded or unobserved regions with semantic predictions, as in MonoScene[2] and Voxeland[10]. Panoptic and Unified 3D Scene Understanding merges instance and semantic segmentation into holistic frameworks (PanopticRecon[18]), while Open-Vocabulary and Language-Grounded 3D Understanding extends these capabilities to arbitrary textual queries (SceneVerse[27], Lowis3D[26]). Neural Radiance and Gaussian Splatting for Semantic 3D Scenes explores differentiable rendering for joint geometry and semantics (FMGS[12], COB-GS[36]), and Specialized Applications and Domains targets settings such as autonomous driving (InstDrive[31]) and aerial imagery (Aerial MVS Segmentation[44]). Finally, Foundational Resources and End-to-End Architectures provides datasets (OmniObject3D[3]) and integrated pipelines that unify multiple stages of perception and reconstruction. Recent work increasingly emphasizes end-to-end learning and the integration of large-scale pretrained models to handle diverse scene types and open-vocabulary queries.
Within the Foundational Resources and End-to-End Architectures branch, IGGT[0] exemplifies this trend by proposing a unified architecture that tightly couples instance detection, semantic labeling, and geometric reconstruction in a single forward pass. This contrasts with earlier modular pipelines such as Atlas[47] and RFD-Net[48], which relied on separate stages for depth estimation, segmentation, and fusion. By learning joint representations, IGGT[0] aims to exploit cross-task synergies and reduce error propagation, positioning itself alongside other recent end-to-end efforts (InstaScene[6], Symphonize[4]) that similarly streamline the reconstruction pipeline. The main trade-off remains between the flexibility of modular designs and the efficiency and coherence of integrated architectures, with ongoing research exploring how best to incorporate language grounding and neural rendering into these unified frameworks.

Claimed Contributions

Instance-Grounded Geometry Transformer (IGGT)

The authors introduce IGGT, a unified end-to-end transformer framework that jointly performs 3D geometric reconstruction and instance-level semantic understanding. The model uses a 3D-Consistent Contrastive Learning strategy to encode unified representations capturing both geometric structures and instance-grounded clustering from 2D visual inputs.
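The report does not reproduce the loss itself; as a rough, hypothetical illustration of what a 3D-consistent contrastive objective could look like (not the authors' implementation), the sketch below treats pixel features that share an instance ID but come from different views as positive pairs:

```python
import math

def _cos(u, v):
    # cosine similarity between two feature vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def contrastive_3d_loss(feats, inst_ids, view_ids, tau=0.1):
    """InfoNCE-style loss over sampled pixel features: two pixels form a
    positive pair when they carry the same instance ID but come from
    different views, pushing the encoder toward view-invariant,
    instance-grounded clusters; all other pixels act as negatives."""
    total, pairs = 0.0, 0
    n = len(feats)
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i
                     and inst_ids[j] == inst_ids[i]
                     and view_ids[j] != view_ids[i]]
        if not positives:
            continue
        # softmax denominator over every other sample
        denom = sum(math.exp(_cos(feats[i], feats[j]) / tau)
                    for j in range(n) if j != i)
        for j in positives:
            p = math.exp(_cos(feats[i], feats[j]) / tau) / denom
            total += -math.log(p)
            pairs += 1
    return total / max(pairs, 1)
```

When an instance's features agree across views the loss is near zero; when they collide with another instance's features, the loss grows, which is the clustering behavior the contribution describes.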

10 retrieved papers
InsScene-15K dataset

The authors curate a large-scale dataset comprising 15,000 scenes with high-quality RGB images, camera poses, depth maps, and 3D-consistent instance masks. The dataset is constructed using a novel data curation pipeline that integrates synthetic, video-captured, and RGBD-scan sources with SAM2-driven annotation.
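The report does not detail the curation pipeline. One core step, keeping instance IDs consistent across frames, can be illustrated with a simplified greedy IoU-linking sketch; the actual pipeline reportedly relies on SAM2-driven propagation, so everything below is a hypothetical stand-in, not the real annotation code:

```python
def mask_iou(a, b):
    """IoU between two binary masks given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def propagate_instance_ids(frames, iou_thresh=0.5):
    """frames: list of dicts {local_id: set_of_pixels}, one dict per frame
    (e.g. per-frame masks from a video segmenter). Returns one mapping
    local_id -> global_id per frame, so the same object keeps a single
    ID across frames by greedily linking masks with high overlap."""
    next_gid = 0
    prev = {}  # global_id -> that instance's mask in the previous frame
    out = []
    for masks in frames:
        mapping, used = {}, set()
        for lid, m in masks.items():
            best_gid, best_iou = None, iou_thresh
            for gid, pm in prev.items():
                if gid in used:
                    continue  # each global ID matches at most one mask
                iou = mask_iou(m, pm)
                if iou > best_iou:
                    best_gid, best_iou = gid, iou
            if best_gid is None:  # no overlap: start a new instance
                best_gid, next_gid = next_gid, next_gid + 1
            used.add(best_gid)
            mapping[lid] = best_gid
        out.append(mapping)
        prev = {mapping[lid]: m for lid, m in masks.items()}
    return out
```

This only links temporally adjacent frames; a real pipeline would also need to survive occlusions and re-entries, which is where a propagation model such as SAM2 earns its keep.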

9 retrieved papers
Instance-Grounded Scene Understanding paradigm

The authors propose a scene understanding strategy where instance masks act as bridges between the unified representation and various VLMs or LMMs. This plug-and-play approach decouples the framework from specific language models, enabling flexible integration with different foundation models and supporting diverse downstream tasks like open-vocabulary segmentation and scene grounding.
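As a hedged sketch of how a mask can act as a model-agnostic bridge: crop each instance to its mask and hand the crop, plus a question, to any caller-supplied VLM. The `vlm` callable and both helper names below are placeholders for illustration, not the paper's API:

```python
def mask_to_crop(image, mask):
    """image: H x W grid (list of rows) of pixel values; mask: set of
    (row, col) pixels belonging to one instance. Returns the tight
    bounding-box crop with non-instance pixels zeroed out."""
    rows = [r for r, _ in mask]
    cols = [c for _, c in mask]
    r0, r1, c0, c1 = min(rows), max(rows), min(cols), max(cols)
    return [[image[r][c] if (r, c) in mask else 0
             for c in range(c0, c1 + 1)]
            for r in range(r0, r1 + 1)]

def query_instances(image, masks, question, vlm):
    """Plug-and-play grounding: one VLM call per instance mask.
    `vlm` is any function (crop, question) -> answer, so the pipeline
    stays agnostic to which foundation model backs it."""
    return {inst_id: vlm(mask_to_crop(image, m), question)
            for inst_id, m in masks.items()}
```

Because the VLM is injected as a plain callable, swapping one foundation model for another changes nothing upstream, which is the decoupling the paradigm claims.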

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Instance-Grounded Geometry Transformer (IGGT)

The authors introduce IGGT, a unified end-to-end transformer framework that jointly performs 3D geometric reconstruction and instance-level semantic understanding. The model uses a 3D-Consistent Contrastive Learning strategy to encode unified representations capturing both geometric structures and instance-grounded clustering from 2D visual inputs.

Contribution

InsScene-15K dataset

The authors curate a large-scale dataset comprising 15,000 scenes with high-quality RGB images, camera poses, depth maps, and 3D-consistent instance masks. The dataset is constructed using a novel data curation pipeline that integrates synthetic, video-captured, and RGBD-scan sources with SAM2-driven annotation.

Contribution

Instance-Grounded Scene Understanding paradigm

The authors propose a scene understanding strategy where instance masks act as bridges between the unified representation and various VLMs or LMMs. This plug-and-play approach decouples the framework from specific language models, enabling flexible integration with different foundation models and supporting diverse downstream tasks like open-vocabulary segmentation and scene grounding.