Abstract:

We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINOv2 encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 35.7% in camera pose accuracy and 23.6% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

Depth Anything 3 proposes a unified depth-ray prediction framework using a plain transformer backbone to handle arbitrary numbers of visual inputs with or without known camera poses. The paper resides in the 'All-in-One Geometric Prediction with Flexible Priors' leaf under 'Unified and Multi-Task 3D Prediction'. Notably, this leaf contains no papers in the taxonomy beyond the original paper itself, suggesting this specific research direction—combining flexible input handling with minimal architectural specialization—is relatively sparse within the broader field of spatially consistent geometry prediction.

The taxonomy reveals that neighboring research directions are more densely populated. The sibling leaf 'Feed-Forward Gaussian Splatting with Semantic Fields' contains two papers exploring semantic integration with 3D Gaussians. Adjacent branches include 'Multi-View Stereo and Depth Estimation' (nine papers across three sub-categories) and 'Generative and Diffusion-Based 3D Reconstruction' (seven papers). While these areas emphasize explicit multi-view correspondence or generative priors, DA3 diverges by pursuing architectural minimalism and a single prediction target, positioning itself at the intersection of multi-task flexibility and geometric consistency without specialized modules.

Among the 28 candidates examined through semantic search, the teacher-student training paradigm yields one refutable candidate among the 10 examined, indicating some prior work in distillation-based approaches for geometric tasks. The core architectural contribution (minimal transformer with depth-ray prediction) found no clear refutations across 10 candidates, suggesting relative novelty in this specific design choice. The visual geometry benchmark contribution also appears distinct, with zero refutations among the eight examined candidates. These statistics reflect a limited search scope rather than exhaustive coverage, but they suggest the architectural simplification and unified prediction target represent less-explored directions within the examined literature.

Based on the top-28 semantic matches and taxonomy structure, DA3 appears to occupy a sparsely populated niche combining input flexibility with architectural minimalism. The analysis does not cover the full breadth of monocular depth estimation or multi-view stereo literature, focusing instead on methods addressing arbitrary-input geometry prediction. The teacher-student paradigm shows some overlap with existing distillation approaches, while the core architectural choices and benchmark design appear more distinctive within the examined scope.

Taxonomy

- 50 core-task taxonomy papers
- 3 claimed contributions
- 28 contribution candidate papers compared
- 1 refutable paper

Research Landscape Overview

Core task: predicting spatially consistent geometry from an arbitrary number of visual inputs.

The field encompasses a diverse set of approaches organized around how methods handle input flexibility and geometric reasoning. Multi-View Stereo and Depth Estimation focuses on classical correspondence-based techniques that aggregate information across calibrated views, while Generative and Diffusion-Based 3D Reconstruction leverages learned priors to synthesize plausible geometry even from limited observations. Novel View Synthesis from Sparse Inputs emphasizes rendering quality and view interpolation, often trading off geometric accuracy for visual coherence. Unified and Multi-Task 3D Prediction aims to build flexible architectures that can handle varying numbers of inputs and produce multiple geometric outputs simultaneously, as seen in works like GoMVS[1] and Geometry Aware Prior[3]. Domain-Specific Geometric Reconstruction targets specialized settings such as indoor scenes or human bodies, while Shape Completion and Reconstruction from Partial Observations addresses the challenge of inferring occluded or missing structure. Additional branches cover 3D Segmentation and Part-Level Understanding, Spatial Reasoning and Scene Understanding, and Specialized Geometric Inference Tasks that tackle niche problems like topology-aware matching or amodal completion.

Recent activity highlights tensions between generalization and specialization. Many studies explore how to inject geometric priors into diffusion models (e.g., Diffusion4D[2], FantasyWorld[5]) or how to enforce multi-view consistency in generative pipelines (CoherentGS[10], WorldMirror[11]). Within the Unified and Multi-Task branch, Depth Anything[0] sits alongside methods that emphasize all-in-one geometric prediction with flexible priors, aiming to handle arbitrary input counts without retraining for specific configurations.

Compared to Geometry Aware Prior[3], which explicitly incorporates geometric constraints into generative processes, Depth Anything[0] focuses on robust depth estimation that generalizes across diverse visual conditions. Meanwhile, works like SPAD[4] and Stereo Forcing[6] explore how to refine consistency through iterative or adversarial mechanisms. The central open question remains how to balance the expressiveness of learned priors with the reliability of geometric constraints, especially when input views are sparse or uncalibrated.

Claimed Contributions

Depth Anything 3 model with minimal architecture and unified depth-ray prediction

The authors introduce Depth Anything 3, a unified model that recovers 3D geometry from any number of images using a single plain transformer backbone without architectural modifications. The model employs a minimal depth-ray representation as the sole prediction target, avoiding complex multi-task learning frameworks used in prior work.
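To make the depth-ray target concrete, the sketch below shows one plausible reading of such a representation: each pixel carries a ray (origin and direction) plus a depth along that ray, so a single prediction head yields both camera geometry and a 3D point map. The function names and the exact parameterization (depth as distance along the ray) are illustrative assumptions, not the paper's verified formulation.

```python
import numpy as np

def unproject_depth_rays(origins, directions, depth):
    """Lift a per-pixel depth-ray prediction to a 3D point map.

    origins:    (H, W, 3) predicted ray origins (camera center per pixel)
    directions: (H, W, 3) predicted ray directions (need not be unit length)
    depth:      (H, W)    predicted depth along each ray

    Returns an (H, W, 3) point map: p = o + z * d / ||d||.
    NOTE: this is an assumed parameterization for illustration; the paper
    may define depth along the camera z-axis instead.
    """
    d = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    return origins + depth[..., None] * d

# Toy example: rays through a pinhole camera at the world origin.
H, W, f = 2, 2, 1.0
ys, xs = np.meshgrid(np.arange(H) - H / 2 + 0.5,
                     np.arange(W) - W / 2 + 0.5, indexing="ij")
dirs = np.stack([xs / f, ys / f, np.ones_like(xs)], axis=-1)
pts = unproject_depth_rays(np.zeros((H, W, 3)), dirs, np.full((H, W), 2.0))
```

Because the same target encodes both where each camera is (ray origins) and what it sees (depth), a single loss can supervise pose and geometry jointly, which is the sense in which one target can replace a multi-task head.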

10 retrieved papers
Teacher-student training paradigm for handling diverse real-world data

The authors develop a teacher-student learning approach where a monocular depth teacher model trained on synthetic data generates high-quality pseudo-labels for real-world training data. This strategy addresses noisy and incomplete real-world depth captures while preserving geometric accuracy.
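A minimal sketch of this kind of pseudo-labeling, under the assumption that the teacher outputs relative depth that must be scale-and-shift aligned to each sensor capture before filling holes (the alignment recipe and function names here are illustrative, not the paper's exact procedure):

```python
import numpy as np

def make_pseudo_labels(teacher_depth, sensor_depth, valid_mask):
    """Merge a noisy, incomplete sensor depth map with teacher predictions.

    Where the real-world capture is missing or invalid, fall back to the
    teacher's scale-aligned prediction; elsewhere keep the sensor value.
    A simplified sketch of a teacher-student labeling scheme.
    """
    # Least-squares align the teacher to sensor scale on valid pixels:
    # sensor ~= scale * teacher + shift.
    t_valid = teacher_depth[valid_mask]
    s_valid = sensor_depth[valid_mask]
    A = np.stack([t_valid, np.ones_like(t_valid)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, s_valid, rcond=None)
    aligned = scale * teacher_depth + shift
    # Keep measured depth where it exists; fill gaps with pseudo-labels.
    return np.where(valid_mask, sensor_depth, aligned)
```

The student is then trained on the merged maps, which is how dense supervision can be recovered from real-world captures whose raw depth is noisy or incomplete.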

10 retrieved papers (1 refutable candidate found)
Visual geometry benchmark for evaluating pose, geometry, and rendering

The authors introduce a comprehensive benchmark spanning 5 datasets with 89 scenes that directly evaluates camera pose accuracy, depth (via reconstruction accuracy), and visual rendering quality. The benchmark also includes a novel feed-forward novel view synthesis evaluation across 160 scenes.
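Pose-estimation benchmarks of this kind commonly summarize accuracy as the area under the recall-vs-error curve (AUC) over angular pose errors; the sketch below shows that standard recipe. The metric definitions here are the conventional ones, assumed for illustration; the benchmark's exact protocol may differ.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    cos = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def pose_auc(errors, threshold):
    """Area under the recall-vs-error curve up to `threshold` degrees,
    normalized to [0, 1]. A common pose-accuracy summary statistic."""
    errors = np.sort(np.asarray(errors, dtype=float))
    recall = (np.arange(len(errors)) + 1) / len(errors)
    errors = np.concatenate([[0.0], errors])
    recall = np.concatenate([[0.0], recall])
    last = np.searchsorted(errors, threshold)
    if last == 0:
        return 0.0  # every error exceeds the threshold
    e = np.concatenate([errors[:last], [threshold]])
    r = np.concatenate([recall[:last], [recall[last - 1]]])
    return np.trapz(r, e) / threshold
```

Reporting AUC at several thresholds (e.g., 5, 10, 30 degrees) is the usual way a single benchmark number captures both fine and coarse pose accuracy.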

8 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is one partial signal of novelty, though the signal remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Depth Anything 3 model with minimal architecture and unified depth-ray prediction

The authors introduce Depth Anything 3, a unified model that recovers 3D geometry from any number of images using a single plain transformer backbone without architectural modifications. The model employs a minimal depth-ray representation as the sole prediction target, avoiding complex multi-task learning frameworks used in prior work.

Contribution

Teacher-student training paradigm for handling diverse real-world data

The authors develop a teacher-student learning approach where a monocular depth teacher model trained on synthetic data generates high-quality pseudo-labels for real-world training data. This strategy addresses noisy and incomplete real-world depth captures while preserving geometric accuracy.

Contribution

Visual geometry benchmark for evaluating pose, geometry, and rendering

The authors introduce a comprehensive benchmark spanning 5 datasets with 89 scenes that directly evaluates camera pose accuracy, depth (via reconstruction accuracy), and visual rendering quality. The benchmark also includes a novel feed-forward novel view synthesis evaluation across 160 scenes.