IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: 3D Scene Understanding; Multi-View Reconstruction
Abstract:

Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions, enabling coherent and accurate understanding of complex scenes. Most prior approaches, however, train large geometry models for low-level 3D reconstruction and treat high-level spatial understanding in isolation, overlooking the crucial interplay between these two fundamental aspects of 3D scene analysis; this limits generalization and leads to poor performance on downstream 3D understanding tasks. Recent attempts mitigate this issue by simply aligning 3D models with a specific language model, which restricts perception to the aligned model's capacity and limits adaptability to downstream tasks. In this paper, we propose the Instance-Grounded Geometry Transformer (IGGT), an end-to-end unified transformer that combines knowledge for spatial reconstruction with instance-level contextual understanding. Specifically, we design a 3D-Consistent Contrastive Learning strategy that guides IGGT to encode a unified representation capturing both geometric structure and instance-grounded clustering from 2D visual inputs alone. This representation supports consistently lifting 2D visual inputs into a coherent 3D scene with explicitly distinct object instances. To facilitate this task, we further construct InsScene-15K, a large-scale dataset with high-quality RGB images, camera poses, depth maps, and 3D-consistent instance-level mask annotations, built with a novel data curation pipeline. Unlike previous methods that are bound to a specific language model, we introduce an Instance-Grounded Scene Understanding paradigm in which instance masks serve as the bridge between our unified representation and diverse Visual Language Models (VLMs) in a plug-and-play manner, substantially expanding downstream understanding capabilities.
Extensive experiments on multi-view instance matching, open-vocabulary segmentation, and QA-based scene grounding demonstrate that IGGT outperforms state-of-the-art methods in both quality and consistency for semantic 3D reconstruction.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate; the results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: semantic 3D reconstruction with instance-level understanding. This field aims to recover complete 3D scene geometry while simultaneously identifying, segmenting, and semantically labeling individual object instances. The taxonomy reveals a rich landscape organized around several complementary directions. Instance-Aware 3D Reconstruction and Decomposition covers methods that explicitly model and separate distinct objects during reconstruction, often leveraging volumetric representations or incremental mapping strategies (e.g., Volumetric Instance Mapping[7], RGBD Instance Map[5]). Scene-Level Semantic Completion and Understanding emphasizes filling in occluded or unobserved regions with semantic predictions, as in MonoScene[2] and Voxeland[10]. Panoptic and Unified 3D Scene Understanding merges instance and semantic segmentation into holistic frameworks (PanopticRecon[18]), while Open-Vocabulary and Language-Grounded 3D Understanding extends these capabilities to arbitrary textual queries (SceneVerse[27], Lowis3D[26]). Neural Radiance and Gaussian Splatting for Semantic 3D Scenes explores differentiable rendering for joint geometry and semantics (FMGS[12], COB-GS[36]), and Specialized Applications and Domains targets settings such as autonomous driving (InstDrive[31]) and aerial imagery (Aerial MVS Segmentation[44]). Finally, Foundational Resources and End-to-End Architectures provides datasets (OmniObject3D[3]) and integrated pipelines that unify multiple stages of perception and reconstruction. Recent work increasingly emphasizes end-to-end learning and the integration of large-scale pretrained models to handle diverse scene types and open-vocabulary queries.
Within the Foundational Resources and End-to-End Architectures branch, IGGT[0] exemplifies this trend by proposing a unified architecture that tightly couples instance detection, semantic labeling, and geometric reconstruction in a single forward pass. This contrasts with earlier modular pipelines such as Atlas[47] and RFD-Net[48], which relied on separate stages for depth estimation, segmentation, and fusion. By learning joint representations, IGGT[0] aims to exploit cross-task synergies and reduce error propagation, positioning itself alongside other recent end-to-end efforts (InstaScene[6], Symphonize[4]) that similarly streamline the reconstruction pipeline. The main trade-off remains between the flexibility of modular designs and the efficiency and coherence of integrated architectures, with ongoing research exploring how best to incorporate language grounding and neural rendering into these unified frameworks.

Claimed Contributions

Instance-Grounded Geometry Transformer (IGGT)

The authors introduce IGGT, a unified end-to-end transformer framework that jointly performs 3D geometric reconstruction and instance-level semantic understanding. The model uses a 3D-Consistent Contrastive Learning strategy to encode unified representations capturing both geometric structures and instance-grounded clustering from 2D visual inputs.
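The report does not reproduce the loss itself; as a rough, hypothetical illustration of what a 3D-consistent contrastive objective could look like (not the authors' implementation), the sketch below treats pixel features that share an instance ID but come from different views as positive pairs:

```python
import math

def _cos(u, v):
    # cosine similarity between two feature vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def contrastive_3d_loss(feats, inst_ids, view_ids, tau=0.1):
    """InfoNCE-style loss over sampled pixel features: two pixels form a
    positive pair when they carry the same instance ID but come from
    different views, pushing the encoder toward view-invariant,
    instance-grounded clusters; all other pixels act as negatives."""
    total, pairs = 0.0, 0
    n = len(feats)
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i
                     and inst_ids[j] == inst_ids[i]
                     and view_ids[j] != view_ids[i]]
        if not positives:
            continue
        # softmax denominator over every other sample
        denom = sum(math.exp(_cos(feats[i], feats[j]) / tau)
                    for j in range(n) if j != i)
        for j in positives:
            p = math.exp(_cos(feats[i], feats[j]) / tau) / denom
            total += -math.log(p)
            pairs += 1
    return total / max(pairs, 1)
```

When an instance's features agree across views the loss is near zero; when they collide with another instance's features, the loss grows, which is the clustering behavior the contribution describes.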

10 retrieved papers
InsScene-15K dataset

The authors curate a large-scale dataset comprising 15,000 scenes with high-quality RGB images, camera poses, depth maps, and 3D-consistent instance masks. The dataset is constructed using a novel data curation pipeline that integrates synthetic, video-captured, and RGBD-scan sources with SAM2-driven annotation.
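The report does not detail the curation pipeline. One core step, keeping instance IDs consistent across frames, can be illustrated with a simplified greedy IoU-linking sketch; the actual pipeline reportedly relies on SAM2-driven propagation, so everything below is a hypothetical stand-in, not the real annotation code:

```python
def mask_iou(a, b):
    """IoU between two binary masks given as sets of pixel indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def propagate_instance_ids(frames, iou_thresh=0.5):
    """frames: list of dicts {local_id: set_of_pixels}, one dict per frame
    (e.g. per-frame masks from a video segmenter). Returns one mapping
    local_id -> global_id per frame, so the same object keeps a single
    ID across frames by greedily linking masks with high overlap."""
    next_gid = 0
    prev = {}  # global_id -> that instance's mask in the previous frame
    out = []
    for masks in frames:
        mapping, used = {}, set()
        for lid, m in masks.items():
            best_gid, best_iou = None, iou_thresh
            for gid, pm in prev.items():
                if gid in used:
                    continue  # each global ID matches at most one mask
                iou = mask_iou(m, pm)
                if iou > best_iou:
                    best_gid, best_iou = gid, iou
            if best_gid is None:  # no overlap: start a new instance
                best_gid, next_gid = next_gid, next_gid + 1
            used.add(best_gid)
            mapping[lid] = best_gid
        out.append(mapping)
        prev = {mapping[lid]: m for lid, m in masks.items()}
    return out
```

This only links temporally adjacent frames; a real pipeline would also need to survive occlusions and re-entries, which is where a propagation model such as SAM2 earns its keep.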

9 retrieved papers
Instance-Grounded Scene Understanding paradigm

The authors propose a scene understanding strategy where instance masks act as bridges between the unified representation and various VLMs or LMMs. This plug-and-play approach decouples the framework from specific language models, enabling flexible integration with different foundation models and supporting diverse downstream tasks like open-vocabulary segmentation and scene grounding.
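As a hedged sketch of how a mask can act as a model-agnostic bridge: crop each instance to its mask and hand the crop, plus a question, to any caller-supplied VLM. The `vlm` callable and both helper names below are placeholders for illustration, not the paper's API:

```python
def mask_to_crop(image, mask):
    """image: H x W grid (list of rows) of pixel values; mask: set of
    (row, col) pixels belonging to one instance. Returns the tight
    bounding-box crop with non-instance pixels zeroed out."""
    rows = [r for r, _ in mask]
    cols = [c for _, c in mask]
    r0, r1, c0, c1 = min(rows), max(rows), min(cols), max(cols)
    return [[image[r][c] if (r, c) in mask else 0
             for c in range(c0, c1 + 1)]
            for r in range(r0, r1 + 1)]

def query_instances(image, masks, question, vlm):
    """Plug-and-play grounding: one VLM call per instance mask.
    `vlm` is any function (crop, question) -> answer, so the pipeline
    stays agnostic to which foundation model backs it."""
    return {inst_id: vlm(mask_to_crop(image, m), question)
            for inst_id, m in masks.items()}
```

Because the VLM is injected as a plain callable, swapping one foundation model for another changes nothing upstream, which is the decoupling the paradigm claims.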

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Instance-Grounded Geometry Transformer (IGGT)

The authors introduce IGGT, a unified end-to-end transformer framework that jointly performs 3D geometric reconstruction and instance-level semantic understanding. The model uses a 3D-Consistent Contrastive Learning strategy to encode unified representations capturing both geometric structures and instance-grounded clustering from 2D visual inputs.

Contribution

InsScene-15K dataset

The authors curate a large-scale dataset comprising 15,000 scenes with high-quality RGB images, camera poses, depth maps, and 3D-consistent instance masks. The dataset is constructed using a novel data curation pipeline that integrates synthetic, video-captured, and RGBD-scan sources with SAM2-driven annotation.

Contribution

Instance-Grounded Scene Understanding paradigm

The authors propose a scene understanding strategy where instance masks act as bridges between the unified representation and various VLMs or LMMs. This plug-and-play approach decouples the framework from specific language models, enabling flexible integration with different foundation models and supporting diverse downstream tasks like open-vocabulary segmentation and scene grounding.