Abstract:

We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINOv2 encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 35.7% in camera pose accuracy and 23.6% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

Depth Anything 3 proposes a unified depth-ray prediction framework using a plain transformer backbone to handle arbitrary numbers of visual inputs with or without known camera poses. The paper resides in the 'All-in-One Geometric Prediction with Flexible Priors' leaf under 'Unified and Multi-Task 3D Prediction'. Notably, this leaf contains no papers in the taxonomy beyond the original paper itself, suggesting this specific research direction—combining flexible input handling with minimal architectural specialization—is relatively sparse within the broader field of spatially consistent geometry prediction.

The taxonomy reveals that neighboring research directions are more densely populated. The sibling leaf 'Feed-Forward Gaussian Splatting with Semantic Fields' contains two papers exploring semantic integration with 3D Gaussians. Adjacent branches include 'Multi-View Stereo and Depth Estimation' (nine papers across three sub-categories) and 'Generative and Diffusion-Based 3D Reconstruction' (seven papers). While these areas emphasize explicit multi-view correspondence or generative priors, DA3 diverges by pursuing architectural minimalism and a single prediction target, positioning itself at the intersection of multi-task flexibility and geometric consistency without specialized modules.

Among the 28 candidates examined through semantic search, the teacher-student training paradigm yields one refutable candidate among the 10 examined, indicating some prior work in distillation-based approaches for geometric tasks. The core architectural contribution (minimal transformer with depth-ray prediction) found no clear refutations across 10 candidates, suggesting relative novelty in this specific design choice. The visual geometry benchmark contribution also appears distinct, with zero refutations among the eight examined candidates. These statistics reflect a limited search scope rather than exhaustive coverage, but they suggest the architectural simplification and unified prediction target represent less-explored directions within the examined literature.

Based on the top-28 semantic matches and taxonomy structure, DA3 appears to occupy a sparsely populated niche combining input flexibility with architectural minimalism. The analysis does not cover the full breadth of monocular depth estimation or multi-view stereo literature, focusing instead on methods addressing arbitrary-input geometry prediction. The teacher-student paradigm shows some overlap with existing distillation approaches, while the core architectural choices and benchmark design appear more distinctive within the examined scope.

Taxonomy

- 50 core-task taxonomy papers
- 3 claimed contributions
- 28 contribution candidate papers compared
- 1 refutable paper

Research Landscape Overview

Core task: predicting spatially consistent geometry from an arbitrary number of visual inputs.

The field encompasses a diverse set of approaches organized around how methods handle input flexibility and geometric reasoning. Multi-View Stereo and Depth Estimation focuses on classical correspondence-based techniques that aggregate information across calibrated views, while Generative and Diffusion-Based 3D Reconstruction leverages learned priors to synthesize plausible geometry even from limited observations. Novel View Synthesis from Sparse Inputs emphasizes rendering quality and view interpolation, often trading off geometric accuracy for visual coherence. Unified and Multi-Task 3D Prediction aims to build flexible architectures that can handle varying numbers of inputs and produce multiple geometric outputs simultaneously, as seen in works like GoMVS[1] and Geometry Aware Prior[3]. Domain-Specific Geometric Reconstruction targets specialized settings such as indoor scenes or human bodies, while Shape Completion and Reconstruction from Partial Observations addresses the challenge of inferring occluded or missing structure. Additional branches cover 3D Segmentation and Part-Level Understanding, Spatial Reasoning and Scene Understanding, and Specialized Geometric Inference Tasks that tackle niche problems like topology-aware matching or amodal completion.

Recent activity highlights tensions between generalization and specialization. Many studies explore how to inject geometric priors into diffusion models (e.g., Diffusion4D[2], FantasyWorld[5]) or how to enforce multi-view consistency in generative pipelines (CoherentGS[10], WorldMirror[11]). Within the Unified and Multi-Task branch, Depth Anything[0] sits alongside methods that emphasize all-in-one geometric prediction with flexible priors, aiming to handle arbitrary input counts without retraining for specific configurations.

Compared to Geometry Aware Prior[3], which explicitly incorporates geometric constraints into generative processes, Depth Anything[0] focuses on robust depth estimation that generalizes across diverse visual conditions. Meanwhile, works like SPAD[4] and Stereo Forcing[6] explore how to refine consistency through iterative or adversarial mechanisms. The central open question remains how to balance the expressiveness of learned priors with the reliability of geometric constraints, especially when input views are sparse or uncalibrated.

Claimed Contributions

Depth Anything 3 model with minimal architecture and unified depth-ray prediction

The authors introduce Depth Anything 3, a unified model that recovers 3D geometry from any number of images using a single plain transformer backbone without architectural modifications. The model employs a minimal depth-ray representation as the sole prediction target, avoiding complex multi-task learning frameworks used in prior work.
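To make the depth-ray target concrete, the sketch below shows one plausible reading of such a representation: each pixel carries a ray (origin and direction) plus a depth along that ray, so a single prediction head yields both camera geometry and a 3D point map. The function names and the exact parameterization (depth as distance along the ray) are illustrative assumptions, not the paper's verified formulation.

```python
import numpy as np

def unproject_depth_rays(origins, directions, depth):
    """Lift a per-pixel depth-ray prediction to a 3D point map.

    origins:    (H, W, 3) predicted ray origins (camera center per pixel)
    directions: (H, W, 3) predicted ray directions (need not be unit length)
    depth:      (H, W)    predicted depth along each ray

    Returns an (H, W, 3) point map: p = o + z * d / ||d||.
    NOTE: this is an assumed parameterization for illustration; the paper
    may define depth along the camera z-axis instead.
    """
    d = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    return origins + depth[..., None] * d

# Toy example: rays through a pinhole camera at the world origin.
H, W, f = 2, 2, 1.0
ys, xs = np.meshgrid(np.arange(H) - H / 2 + 0.5,
                     np.arange(W) - W / 2 + 0.5, indexing="ij")
dirs = np.stack([xs / f, ys / f, np.ones_like(xs)], axis=-1)
pts = unproject_depth_rays(np.zeros((H, W, 3)), dirs, np.full((H, W), 2.0))
```

Because the same target encodes both where each camera is (ray origins) and what it sees (depth), a single loss can supervise pose and geometry jointly, which is the sense in which one target can replace a multi-task head.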

10 retrieved papers
Teacher-student training paradigm for handling diverse real-world data

The authors develop a teacher-student learning approach where a monocular depth teacher model trained on synthetic data generates high-quality pseudo-labels for real-world training data. This strategy addresses noisy and incomplete real-world depth captures while preserving geometric accuracy.
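A minimal sketch of this kind of pseudo-labeling, under the assumption that the teacher outputs relative depth that must be scale-and-shift aligned to each sensor capture before filling holes (the alignment recipe and function names here are illustrative, not the paper's exact procedure):

```python
import numpy as np

def make_pseudo_labels(teacher_depth, sensor_depth, valid_mask):
    """Merge a noisy, incomplete sensor depth map with teacher predictions.

    Where the real-world capture is missing or invalid, fall back to the
    teacher's scale-aligned prediction; elsewhere keep the sensor value.
    A simplified sketch of a teacher-student labeling scheme.
    """
    # Least-squares align the teacher to sensor scale on valid pixels:
    # sensor ~= scale * teacher + shift.
    t_valid = teacher_depth[valid_mask]
    s_valid = sensor_depth[valid_mask]
    A = np.stack([t_valid, np.ones_like(t_valid)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, s_valid, rcond=None)
    aligned = scale * teacher_depth + shift
    # Keep measured depth where it exists; fill gaps with pseudo-labels.
    return np.where(valid_mask, sensor_depth, aligned)
```

The student is then trained on the merged maps, which is how dense supervision can be recovered from real-world captures whose raw depth is noisy or incomplete.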

10 retrieved papers (1 refutable candidate found)
Visual geometry benchmark for evaluating pose, geometry, and rendering

The authors introduce a comprehensive benchmark spanning 5 datasets with 89 scenes that directly evaluates camera pose accuracy, depth (via reconstruction accuracy), and visual rendering quality. The benchmark also includes a novel feed-forward novel view synthesis evaluation across 160 scenes.
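Pose-estimation benchmarks of this kind commonly summarize accuracy as the area under the recall-vs-error curve (AUC) over angular pose errors; the sketch below shows that standard recipe. The metric definitions here are the conventional ones, assumed for illustration; the benchmark's exact protocol may differ.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def pose_auc(errors, threshold):
    """Area under the recall-vs-error curve up to `threshold` degrees,
    normalized to [0, 1]. A common pose-accuracy summary statistic."""
    errors = np.sort(np.asarray(errors, dtype=float))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    errors = np.concatenate([[0.0], errors])
    recall = np.concatenate([[0.0], recall])
    last = np.searchsorted(errors, threshold)
    if last == 0:
        return 0.0  # every error exceeds the threshold
    e = np.concatenate([errors[:last], [threshold]])
    r = np.concatenate([recall[:last], [recall[last - 1]]])
    return np.trapz(r, e) / threshold
```

Reporting AUC at several thresholds (e.g., 5, 10, 30 degrees) is the usual way a single benchmark number captures both fine and coarse pose accuracy.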

8 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though the signal remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Depth Anything 3 model with minimal architecture and unified depth-ray prediction

The authors introduce Depth Anything 3, a unified model that recovers 3D geometry from any number of images using a single plain transformer backbone without architectural modifications. The model employs a minimal depth-ray representation as the sole prediction target, avoiding complex multi-task learning frameworks used in prior work.

Contribution

Teacher-student training paradigm for handling diverse real-world data

The authors develop a teacher-student learning approach where a monocular depth teacher model trained on synthetic data generates high-quality pseudo-labels for real-world training data. This strategy addresses noisy and incomplete real-world depth captures while preserving geometric accuracy.

Contribution

Visual geometry benchmark for evaluating pose, geometry, and rendering

The authors introduce a comprehensive benchmark spanning 5 datasets with 89 scenes that directly evaluates camera pose accuracy, depth (via reconstruction accuracy), and visual rendering quality. The benchmark also includes a novel feed-forward novel view synthesis evaluation across 160 scenes.