Dens3R: A Foundation Model for 3D Geometry Prediction

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Visual Foundation Model · 3D Geometry Prediction
Abstract:

Dense 3D reconstruction has made significant progress in recent years, yet accurate, unified geometric prediction remains a major challenge. Most existing methods are limited to predicting a single geometric quantity from input images. However, geometric quantities such as depth, surface normals, and point maps are inherently correlated, and estimating them in isolation often fails to ensure consistency, thereby limiting both accuracy and practical applicability. This motivates us to explore a unified framework that explicitly models the structural coupling among different geometric properties to enable joint regression. In this paper, we present Dens3R, a 3D foundation model designed for joint geometric dense prediction and adaptable to a wide range of downstream tasks. Dens3R adopts a two-stage training framework to progressively build a pointmap representation that is both generalizable and intrinsically invariant. Specifically, we design a lightweight shared encoder-decoder backbone and introduce position-interpolated rotary positional encoding to maintain expressive power while enhancing robustness to high-resolution inputs. By integrating image-pair matching features with intrinsic invariance modeling, Dens3R accurately regresses multiple geometric quantities such as surface normals and depth, achieving consistent geometry perception from single-view to multi-view inputs. Additionally, we propose a post-processing pipeline that supports geometrically consistent multi-view inference. Extensive experiments demonstrate the superior performance of Dens3R across various tasks and highlight its potential for broader applications.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Dens3R, a foundation model for joint geometric dense prediction that simultaneously estimates depth, surface normals, and point maps from unconstrained images. According to the taxonomy tree, this work resides in the 'Unified Foundation Models for Geometry' leaf, which contains only two papers including the original submission. This leaf sits under 'Multi-Geometry Joint Prediction', indicating a relatively sparse research direction focused on explicitly modeling structural coupling among geometric properties rather than predicting them in isolation.

The taxonomy reveals that neighboring research directions include 'Feed-Forward 3D Scene Reconstruction' (three papers on single-pass networks for uncalibrated reconstruction) and 'Classical Multi-View Stereo Approaches' (two papers on traditional geometry-based methods). While these adjacent leaves address 3D reconstruction from uncalibrated inputs, they differ in scope: the sibling leaf focuses on scene structure and camera parameters, whereas Dens3R's leaf emphasizes unified geometric representation across multiple coupled quantities. The taxonomy's exclude_note clarifies that task-specific models without unified representation belong elsewhere, positioning this work at the intersection of multi-task learning and geometric foundation models.

Among the thirty candidates examined, the contribution-level analysis shows mixed novelty signals. For the core Dens3R foundation model (Contribution 1), ten candidates were examined and one appears to provide overlapping prior work, suggesting some precedent exists within this limited search scope. For the two-stage training framework with intrinsic-invariant pointmap representation (Contribution 2) and the position-interpolated rotary positional encoding (Contribution 3), ten candidates each were examined with zero refutable matches, indicating these technical components appear more distinctive among the sampled literature. The analysis explicitly notes that it is based on top-K semantic search plus citation expansion, not exhaustive coverage.

Given the limited search scope of thirty candidates and the sparse taxonomy leaf containing only one sibling paper, the work appears to occupy a relatively underexplored niche within joint geometric prediction. The single refutable match for the core model suggests some conceptual overlap exists, though the technical framework components show fewer direct precedents among examined candidates. A more comprehensive literature search would be needed to assess whether the unified geometric coupling approach represents a fundamental departure from prior multi-task methods or an incremental refinement of existing foundation model paradigms.

Taxonomy

22 Core-task Taxonomy Papers
3 Claimed Contributions
30 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: joint geometric dense prediction from unconstrained images. The field encompasses a diverse set of approaches organized into five main branches. Multi-Geometry Joint Prediction focuses on unified frameworks that simultaneously estimate multiple geometric properties such as depth, normals, and camera parameters, often leveraging foundation models to handle diverse scene types. 3D Reconstruction from Uncalibrated Images addresses structure-from-motion and multi-view stereo without strict calibration assumptions, exemplified by works like DiffusionSfM[8] and classical methods such as Multi Viewpoint Stereo[16]. Human-Centric 3D Reconstruction specializes in reconstructing faces, bodies, and hands from images, with representative efforts including PifuHD[4] and Joint Face Reconstruction[12]. Geometric Transformation Estimation targets tasks like homography and change detection, as seen in Deep Homography[5] and Geometric Change Detection[20]. Finally, Generative and Perceptual 3D Models explore synthesis and perception through generative frameworks such as EG3D[2].

Recent activity has concentrated on unified foundation models that predict multiple geometric cues in a single forward pass, contrasting with earlier specialized pipelines. Dens3R[0] exemplifies this trend by jointly estimating dense geometry across unconstrained images, positioning itself within the Unified Foundation Models for Geometry cluster alongside Surface Normal Constraint[19]. While Surface Normal Constraint[19] emphasizes leveraging normal information to refine depth predictions, Dens3R[0] adopts a broader multi-task perspective that integrates several geometric outputs. This shift toward holistic geometric reasoning reflects a move away from isolated depth or normal estimation, as seen in older works like Quasi Dense[11], toward end-to-end systems that exploit cross-task synergies.
Key open questions include how to best balance task-specific inductive biases with the flexibility of large-scale pretraining, and whether such unified models can match or exceed the performance of domain-specific methods in challenging scenarios.

Claimed Contributions

Dens3R foundation model for unified geometric dense prediction

The authors introduce Dens3R, a feed-forward visual foundation model that jointly predicts multiple geometric quantities (depth, surface normals, pointmaps) from unconstrained images. Unlike prior methods that estimate geometry in isolation, Dens3R explicitly models structural coupling among these properties to ensure consistency and improve accuracy.

10 retrieved papers
Can Refute
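The structural coupling this contribution refers to can be made concrete: surface normals are (up to sign) determined by the spatial derivatives of a pointmap, so predicting both jointly lets one quantity constrain the other. A minimal NumPy sketch of that relationship, illustrative only and not the authors' implementation:

```python
import numpy as np

def normals_from_pointmap(pts):
    """Estimate per-pixel surface normals from a pointmap of shape (H, W, 3)
    via the cross product of finite-difference tangent vectors."""
    du = np.gradient(pts, axis=1)   # tangent along image x
    dv = np.gradient(pts, axis=0)   # tangent along image y
    n = np.cross(du, dv)            # normal = cross product of tangents
    n /= np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8
    return n
```

A joint model can exploit exactly this dependency as a consistency signal between its normal and pointmap heads, rather than regressing each output independently.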
Two-stage training framework with intrinsic-invariant pointmap representation

The authors propose a novel two-stage training strategy. Stage 1 learns a scale-invariant pointmap via cross-view matching features. Stage 2 incorporates surface normals and one-to-one correspondence constraints to transform the representation into an intrinsic-invariant pointmap, simplifying training and improving normal prediction accuracy.

10 retrieved papers
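Stage 1's scale-invariant objective can be illustrated with a DUSt3R-style normalization, in which predicted and ground-truth pointmaps are each rescaled by the mean distance of their valid points before comparison. The sketch below is an assumption about the general form of such a loss, not the paper's exact formulation:

```python
import numpy as np

def scale_invariant_pointmap_loss(pred, gt, valid):
    """Scale-invariant pointmap regression loss.

    pred, gt: (H, W, 3) pointmaps; valid: (H, W) boolean mask.
    Each map is normalized by the mean distance of its valid points
    to the origin, so the loss compares shape up to a global scale.
    """
    p, g = pred[valid], gt[valid]
    p_scale = np.mean(np.linalg.norm(p, axis=-1)) + 1e-8
    g_scale = np.mean(np.linalg.norm(g, axis=-1)) + 1e-8
    return np.mean(np.linalg.norm(p / p_scale - g / g_scale, axis=-1))
```

Because both maps are normalized by their own scale, a prediction that is correct up to a global scale factor incurs (near-)zero loss, which is what makes monocular training on mixed datasets tractable.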
Position-interpolated rotary positional encoding for multi-resolution inputs

The authors design a position-interpolated rotary positional encoding mechanism that enables the model to handle high-resolution and multi-resolution inputs without performance degradation, addressing instability issues in existing methods when processing images beyond training resolution.

10 retrieved papers
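Position interpolation for rotary encodings is typically realized by rescaling test-time token positions into the range seen during training, so the rotation angles stay within the trained distribution. The following sketch is hypothetical, following the standard RoPE plus position-interpolation recipe rather than the paper's code, and shows the idea for a 1D token sequence; a 2D image-patch variant would apply the same rescaling per axis:

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    """Rotation angles for rotary positional encoding (RoPE)."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)          # (seq, dim // 2)

def apply_rope(x, positions):
    """Rotate consecutive feature pairs of x (seq, dim) by position-dependent angles."""
    dim = x.shape[1]
    ang = rope_angles(positions, dim)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def interpolated_positions(seq_len, train_len):
    """Position interpolation: rescale test-time positions into the
    [0, train_len) range the model saw during training."""
    scale = min(1.0, train_len / seq_len)
    return np.arange(seq_len) * scale
```

At resolutions beyond the training length, feeding `interpolated_positions(seq_len, train_len)` to `apply_rope` keeps all angles inside the trained range instead of extrapolating, which is the standard explanation for why interpolation is more stable than naive extrapolation.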

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Dens3R foundation model for unified geometric dense prediction

The authors introduce Dens3R, a feed-forward visual foundation model that jointly predicts multiple geometric quantities (depth, surface normals, pointmaps) from unconstrained images. Unlike prior methods that estimate geometry in isolation, Dens3R explicitly models structural coupling among these properties to ensure consistency and improve accuracy.

Contribution

Two-stage training framework with intrinsic-invariant pointmap representation

The authors propose a novel two-stage training strategy. Stage 1 learns a scale-invariant pointmap via cross-view matching features. Stage 2 incorporates surface normals and one-to-one correspondence constraints to transform the representation into an intrinsic-invariant pointmap, simplifying training and improving normal prediction accuracy.

Contribution

Position-interpolated rotary positional encoding for multi-resolution inputs

The authors design a position-interpolated rotary positional encoding mechanism that enables the model to handle high-resolution and multi-resolution inputs without performance degradation, addressing instability issues in existing methods when processing images beyond training resolution.