Dens3R: A Foundation Model for 3D Geometry Prediction

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Visual Foundation Model · 3D Geometry Prediction
Abstract:

Dense 3D reconstruction has made significant progress in recent years, yet accurate, unified geometric prediction remains a major challenge. Most existing methods are limited to predicting a single geometric quantity from input images. However, geometric quantities such as depth, surface normals, and point maps are inherently correlated, and estimating them in isolation often fails to ensure consistency, thereby limiting both accuracy and practical applicability. This motivates us to explore a unified framework that explicitly models the structural coupling among different geometric properties to enable joint regression. In this paper, we present Dens3R, a 3D foundation model designed for joint geometric dense prediction and adaptable to a wide range of downstream tasks. Dens3R adopts a two-stage training framework to progressively build a pointmap representation that is both generalizable and intrinsically invariant. Specifically, we design a lightweight shared encoder-decoder backbone and introduce position-interpolated rotary positional encoding to maintain expressive power while enhancing robustness to high-resolution inputs. By integrating image-pair matching features with intrinsic invariance modeling, Dens3R accurately regresses multiple geometric quantities such as surface normals and depth, achieving consistent geometry perception from single-view to multi-view inputs. Additionally, we propose a post-processing pipeline that supports geometrically consistent multi-view inference. Extensive experiments demonstrate the superior performance of Dens3R across various tasks and highlight its potential for broader applications.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Dens3R, a foundation model for joint geometric dense prediction that simultaneously estimates depth, surface normals, and point maps from unconstrained images. According to the taxonomy tree, this work resides in the 'Unified Foundation Models for Geometry' leaf, which contains only two papers including the original submission. This leaf sits under 'Multi-Geometry Joint Prediction', indicating a relatively sparse research direction focused on explicitly modeling structural coupling among geometric properties rather than predicting them in isolation.

The taxonomy reveals that neighboring research directions include 'Feed-Forward 3D Scene Reconstruction' (three papers on single-pass networks for uncalibrated reconstruction) and 'Classical Multi-View Stereo Approaches' (two papers on traditional geometry-based methods). While these adjacent leaves address 3D reconstruction from uncalibrated inputs, they differ in scope: the sibling leaf focuses on scene structure and camera parameters, whereas Dens3R's leaf emphasizes unified geometric representation across multiple coupled quantities. The taxonomy's exclude_note clarifies that task-specific models without unified representation belong elsewhere, positioning this work at the intersection of multi-task learning and geometric foundation models.

Among the thirty candidates examined, the contribution-level analysis shows mixed novelty signals. For the core Dens3R foundation model (Contribution 1), ten candidates were examined and one appears to provide overlapping prior work, suggesting some precedent exists within this limited search scope. For the two-stage training framework with intrinsic-invariant pointmap representation (Contribution 2) and the position-interpolated rotary positional encoding (Contribution 3), ten candidates each were examined with zero refutable matches, indicating these technical components appear more distinctive among the sampled literature. The analysis explicitly notes that it is based on top-K semantic search plus citation expansion, not exhaustive coverage.

Given the limited search scope of thirty candidates and the sparse taxonomy leaf containing only one sibling paper, the work appears to occupy a relatively underexplored niche within joint geometric prediction. The single refutable match for the core model suggests some conceptual overlap exists, though the technical framework components show fewer direct precedents among examined candidates. A more comprehensive literature search would be needed to assess whether the unified geometric coupling approach represents a fundamental departure from prior multi-task methods or an incremental refinement of existing foundation model paradigms.

Taxonomy

22 Core-task Taxonomy Papers
3 Claimed Contributions
30 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: joint geometric dense prediction from unconstrained images. The field encompasses a diverse set of approaches organized into five main branches. Multi-Geometry Joint Prediction focuses on unified frameworks that simultaneously estimate multiple geometric properties such as depth, normals, and camera parameters, often leveraging foundation models to handle diverse scene types. 3D Reconstruction from Uncalibrated Images addresses structure-from-motion and multi-view stereo without strict calibration assumptions, exemplified by works like DiffusionSfM[8] and classical methods such as Multi Viewpoint Stereo[16]. Human-Centric 3D Reconstruction specializes in reconstructing faces, bodies, and hands from images, with representative efforts including PifuHD[4] and Joint Face Reconstruction[12]. Geometric Transformation Estimation targets tasks like homography and change detection, as seen in Deep Homography[5] and Geometric Change Detection[20]. Finally, Generative and Perceptual 3D Models explore synthesis and perception through generative frameworks such as EG3D[2].

Recent activity has concentrated on unified foundation models that predict multiple geometric cues in a single forward pass, contrasting with earlier specialized pipelines. Dens3R[0] exemplifies this trend by jointly estimating dense geometry across unconstrained images, positioning itself within the Unified Foundation Models for Geometry cluster alongside Surface Normal Constraint[19]. While Surface Normal Constraint[19] emphasizes leveraging normal information to refine depth predictions, Dens3R[0] adopts a broader multi-task perspective that integrates several geometric outputs. This shift toward holistic geometric reasoning reflects a move away from isolated depth or normal estimation, as seen in older works like Quasi Dense[11], toward end-to-end systems that exploit cross-task synergies.
Key open questions include how to best balance task-specific inductive biases with the flexibility of large-scale pretraining, and whether such unified models can match or exceed the performance of domain-specific methods in challenging scenarios.

Claimed Contributions

Dens3R foundation model for unified geometric dense prediction

The authors introduce Dens3R, a feed-forward visual foundation model that jointly predicts multiple geometric quantities (depth, surface normals, pointmaps) from unconstrained images. Unlike prior methods that estimate geometry in isolation, Dens3R explicitly models structural coupling among these properties to ensure consistency and improve accuracy.

10 retrieved papers
Can Refute
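The structural coupling this contribution refers to can be made concrete: surface normals are (up to sign) determined by the spatial derivatives of a pointmap, so predicting both jointly lets one quantity constrain the other. A minimal NumPy sketch of that relationship, illustrative only and not the authors' implementation:

```python
import numpy as np

def normals_from_pointmap(pts):
    """Estimate per-pixel surface normals from a pointmap of shape (H, W, 3)
    via the cross product of finite-difference tangent vectors."""
    du = np.gradient(pts, axis=1)   # tangent along image x
    dv = np.gradient(pts, axis=0)   # tangent along image y
    n = np.cross(du, dv)            # normal = cross product of tangents
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8
    return n
```

A joint model can exploit exactly this dependency as a consistency signal between its normal and pointmap heads, rather than regressing each output independently.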
Two-stage training framework with intrinsic-invariant pointmap representation

The authors propose a novel two-stage training strategy. Stage 1 learns a scale-invariant pointmap via cross-view matching features. Stage 2 incorporates surface normals and one-to-one correspondence constraints to transform the representation into an intrinsic-invariant pointmap, simplifying training and improving normal prediction accuracy.

10 retrieved papers
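Stage 1's scale-invariant objective can be illustrated with a DUSt3R-style normalization, in which predicted and ground-truth pointmaps are each rescaled by the mean distance of their valid points before comparison. The sketch below is an assumption about the general form of such a loss, not the paper's exact formulation:

```python
import numpy as np

def scale_invariant_pointmap_loss(pred, gt, valid):
    """Scale-invariant pointmap regression loss.

    pred, gt: (H, W, 3) pointmaps; valid: (H, W) boolean mask.
    Each map is normalized by the mean distance of its valid points
    to the origin, so the loss compares shape up to a global scale.
    """
    p, g = pred[valid], gt[valid]
    p_scale = np.mean(np.linalg.norm(p, axis=-1)) + 1e-8
    g_scale = np.mean(np.linalg.norm(g, axis=-1)) + 1e-8
    return np.mean(np.linalg.norm(p / p_scale - g / g_scale, axis=-1))
```

Because both maps are normalized by their own scale, a prediction that is correct up to a global scale factor incurs (near-)zero loss, which is what makes monocular training on mixed datasets tractable.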
Position-interpolated rotary positional encoding for multi-resolution inputs

The authors design a position-interpolated rotary positional encoding mechanism that enables the model to handle high-resolution and multi-resolution inputs without performance degradation, addressing instability issues in existing methods when processing images beyond training resolution.

10 retrieved papers
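Position interpolation for rotary encodings is typically realized by rescaling test-time token positions into the range seen during training, so the rotation angles stay within the trained distribution. The following sketch is hypothetical, following the standard RoPE plus position-interpolation recipe rather than the paper's code, and shows the idea for a 1D token sequence; a 2D image-patch variant would apply the same rescaling per axis:

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Rotation angles for rotary positional encoding (RoPE)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)          # (seq, dim // 2)

def apply_rope(x, positions):
    """Rotate consecutive feature pairs of x (seq, dim) by position-dependent angles."""
    dim = x.shape[1]
    ang = rope_angles(positions, dim)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def interpolated_positions(seq_len, train_len):
    """Position interpolation: rescale test-time positions into the
    [0, train_len) range the model saw during training."""
    scale = min(1.0, train_len / seq_len)
    return np.arange(seq_len) * scale
```

At resolutions beyond the training length, feeding `interpolated_positions(seq_len, train_len)` to `apply_rope` keeps all angles inside the trained range instead of extrapolating, which is the standard explanation for why interpolation is more stable than naive extrapolation.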

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Dens3R foundation model for unified geometric dense prediction

The authors introduce Dens3R, a feed-forward visual foundation model that jointly predicts multiple geometric quantities (depth, surface normals, pointmaps) from unconstrained images. Unlike prior methods that estimate geometry in isolation, Dens3R explicitly models structural coupling among these properties to ensure consistency and improve accuracy.

Contribution

Two-stage training framework with intrinsic-invariant pointmap representation

The authors propose a novel two-stage training strategy. Stage 1 learns a scale-invariant pointmap via cross-view matching features. Stage 2 incorporates surface normals and one-to-one correspondence constraints to transform the representation into an intrinsic-invariant pointmap, simplifying training and improving normal prediction accuracy.

Contribution

Position-interpolated rotary positional encoding for multi-resolution inputs

The authors design a position-interpolated rotary positional encoding mechanism that enables the model to handle high-resolution and multi-resolution inputs without performance degradation, addressing instability issues in existing methods when processing images beyond training resolution.