IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction
Overview
Taxonomy: research landscape overview
Claimed Contributions
The authors introduce IGGT, a unified end-to-end transformer that jointly performs 3D geometric reconstruction and instance-level semantic understanding. Trained with a 3D-Consistent Contrastive Learning strategy, the model encodes a unified representation from 2D visual inputs that captures both geometric structure and instance-grounded clustering.
The authors curate InsScene-15K, a large-scale dataset of 15,000 scenes with high-quality RGB images, camera poses, depth maps, and 3D-consistent instance masks, built with a novel curation pipeline that combines synthetic, video-captured, and RGBD-scan sources with SAM2-driven annotation.
The authors propose a scene understanding strategy in which instance masks act as bridges between the unified representation and off-the-shelf vision-language models (VLMs) or large multimodal models (LMMs). This plug-and-play design decouples the framework from any specific language model, enabling flexible integration with different foundation models and supporting downstream tasks such as open-vocabulary segmentation and scene grounding.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[47] Atlas: End-to-end 3D scene reconstruction from posed images
[48] RfD-Net: Point scene understanding by semantic instance reconstruction
Contribution Analysis
Detailed comparisons for each claimed contribution
Instance-Grounded Geometry Transformer (IGGT)
The authors introduce IGGT, a unified end-to-end transformer that jointly performs 3D geometric reconstruction and instance-level semantic understanding. Trained with a 3D-Consistent Contrastive Learning strategy, the model encodes a unified representation from 2D visual inputs that captures both geometric structure and instance-grounded clustering. A sketch of this contrastive objective follows the comparison list below.
[39] Semantic scene completion via semantic-aware guidance and interactive refinement transformer
[51] VoxFormer: Sparse voxel transformer for camera-based 3D semantic scene completion
[52] InstanceBEV: Unifying instance and BEV representation for global modeling
[53] Large spatial model: End-to-end unposed images to semantic 3D
[54] Unifying 3D vision-language understanding via promptable queries
[55] Uni-3D: A universal model for panoptic 3D scene reconstruction
[56] Uni3R: Unified 3D reconstruction and semantic understanding via generalizable Gaussian splatting from unposed multi-view images
[57] MRFTrans: Multimodal representation fusion transformer for monocular 3D semantic scene completion
[58] SIU3R: Simultaneous scene understanding and 3D reconstruction beyond feature alignment
[59] GVLM: Geometry-grounded vision language model with unified 3D reconstruction and spatial reasoning
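To make the training signal concrete, here is a minimal sketch of a 3D-consistent contrastive objective: embeddings that map to the same 3D instance, even when they come from different views, are treated as positives, and all others as negatives. This is a generic supervised InfoNCE formulation written for illustration; the paper's exact loss, sampling scheme, and feature granularity may differ.

```python
# Sketch of a 3D-consistent contrastive loss over instance embeddings.
# feats: (N, D) pixel/patch embeddings pooled across all input views.
# instance_ids: (N,) cross-view-consistent instance label per embedding.
import torch
import torch.nn.functional as F

def instance_contrastive_loss(feats: torch.Tensor,
                              instance_ids: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    feats = F.normalize(feats, dim=-1)
    logits = feats @ feats.t() / temperature          # (N, N) similarities
    n = feats.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=feats.device)
    logits = logits.masked_fill(eye, float("-inf"))   # exclude self-pairs
    # Positives share the same 3D-consistent instance ID across views.
    pos_mask = (instance_ids[:, None] == instance_ids[None, :]) & ~eye
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average the log-likelihood over each anchor's positives;
    # masked_fill avoids -inf * 0 = nan on the excluded diagonal.
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_counts
    # Anchors with no positive partner contribute nothing.
    return loss[pos_mask.any(dim=1)].mean()
```

Pulling same-instance embeddings together across views is what grounds the clustering in 3D: two pixels from different cameras that land on the same object are forced toward a shared representation.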
InsScene-15K dataset
The authors curate InsScene-15K, a large-scale dataset of 15,000 scenes with high-quality RGB images, camera poses, depth maps, and 3D-consistent instance masks, built with a novel curation pipeline that combines synthetic, video-captured, and RGBD-scan sources with SAM2-driven annotation. A sketch of the 3D-consistency step such a pipeline requires follows the reference list below.
[53] Large spatial model: End-to-end unposed images to semantic 3D
[60] ARKitScenes: A diverse real-world dataset for 3D indoor scene understanding using mobile RGB-D data
[62] Matterport3D: Learning from RGB-D data in indoor environments
[63] Automatically annotating indoor images with CAD models via RGB-D scans
[64] MCD-Net: Toward RGB-D video inpainting in real-world scenes
[65] 3DMatch: Learning local geometric descriptors from RGB-D reconstructions
[66] 3D shape segmentation with projective convolutional networks
[67] Learning rich features from RGB-D images for object detection and segmentation
[68] RIO: 3D object instance re-localization in changing indoor environments
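To illustrate the 3D-consistency such a pipeline must enforce, the sketch below back-projects per-frame masks (e.g., produced by SAM2 video tracking) into world space using depth and camera pose, then merges masks whose voxelized point clouds overlap into a single instance ID. The voxel size, IoU threshold, and greedy merging are illustrative assumptions, not the authors' released tooling.

```python
# Merging per-frame 2D masks into 3D-consistent instance IDs via voxel overlap.
# Thresholds and the voxel hashing scheme are illustrative assumptions.
import numpy as np

def backproject(mask, depth, K, cam_to_world, voxel=0.05):
    """Return the set of world-space voxels covered by a 2D mask."""
    v, u = np.nonzero(mask)
    z = depth[v, u]
    u, v, z = u[z > 0], v[z > 0], z[z > 0]             # drop invalid depth
    # Unproject pixels to camera space, then transform to world space.
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts = cam_to_world @ np.stack([x, y, z, np.ones_like(z)])
    return {tuple(p) for p in np.floor(pts[:3].T / voxel).astype(int)}

def merge_instances(per_frame_masks, iou_thresh=0.25):
    """per_frame_masks: list of (mask, depth, K, cam_to_world) tuples.
    Greedily assigns a shared instance ID when voxel IoU is high enough."""
    instances, labels = [], []
    for args in per_frame_masks:
        voxels = backproject(*args)
        best, best_iou = None, iou_thresh
        for inst_id, inst_vox in instances:
            union = len(voxels | inst_vox)
            iou = len(voxels & inst_vox) / union if union else 0.0
            if iou > best_iou:
                best, best_iou = inst_id, iou
        if best is None:                               # new 3D instance
            best = len(instances)
            instances.append((best, set(voxels)))
        else:                                          # same instance, new view
            instances[best][1].update(voxels)
        labels.append(best)
    return labels
```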
Instance-Grounded Scene Understanding paradigm
The authors propose a scene understanding strategy in which instance masks act as bridges between the unified representation and off-the-shelf vision-language models (VLMs) or large multimodal models (LMMs). This plug-and-play design decouples the framework from any specific language model, enabling flexible integration with different foundation models and supporting downstream tasks such as open-vocabulary segmentation and scene grounding.
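The bridging idea can be made concrete with a small sketch: each predicted instance mask defines an image crop, a frozen vision-language model scores the crop against an open vocabulary, and the winning label is assigned back to the mask. OpenAI's CLIP stands in here for whichever VLM is plugged in; the cropping heuristic and prompt template are illustrative assumptions, not the paper's prescribed interface.

```python
# Instance masks as a plug-and-play bridge to a frozen VLM (CLIP here).
# The crop-and-score strategy is an illustrative assumption.
import numpy as np
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def label_instances(image: Image.Image, masks: list[np.ndarray],
                    vocabulary: list[str]) -> list[str]:
    """masks: (H, W) boolean arrays, e.g., decoded from the unified
    representation. Returns one open-vocabulary label per mask."""
    prompts = clip.tokenize([f"a photo of a {c}" for c in vocabulary]).to(device)
    with torch.no_grad():
        text_feats = model.encode_text(prompts)
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
        labels = []
        for mask in masks:
            # Crop the instance's bounding box and score it against all prompts.
            ys, xs = np.nonzero(mask)
            crop = image.crop((int(xs.min()), int(ys.min()),
                               int(xs.max()) + 1, int(ys.max()) + 1))
            img_feat = model.encode_image(preprocess(crop).unsqueeze(0).to(device))
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            labels.append(vocabulary[int((img_feat @ text_feats.T).argmax())])
    return labels
```

Because the masks, not the language model, carry the grounding, swapping CLIP for a different foundation model requires no retraining of the reconstruction backbone, which is exactly what makes the integration plug-and-play.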