DA²: Depth Anything in Any Direction

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Panoramas, Depth (Distance) Estimation
Abstract:

Panoramas have a full 360°×180° FoV, offering a more complete visual description than perspective images. Thanks to this characteristic, panoramic depth estimation is gaining increasing traction in 3D vision. However, due to the scarcity of panoramic data, previous methods are often restricted to in-domain settings, leading to poor zero-shot generalization. Furthermore, due to the spherical distortions inherent in panoramas, many approaches rely on perspective splitting (e.g., cubemaps), which leads to suboptimal efficiency. To address these challenges, we propose DA²: Depth Anything in Any Direction, an accurate, zero-shot generalizable, and fully end-to-end panoramic depth estimator. Specifically, to scale up panoramic data, we introduce a data curation engine that generates high-quality panoramic depth data from perspective data, creating ~543K panoramic RGB-depth pairs and bringing the total to ~607K. To further mitigate spherical distortions, we present SphereViT, which explicitly leverages spherical coordinates to enforce spherical geometric consistency in panoramic image features, yielding improved performance. A comprehensive benchmark on multiple datasets clearly demonstrates DA²'s SoTA performance, with an average 38% improvement in AbsRel over the strongest zero-shot baseline. Surprisingly, DA² even outperforms prior in-domain methods, highlighting its superior zero-shot generalization. Moreover, as an end-to-end solution, DA² exhibits much higher efficiency than fusion-based approaches. Both the code and the curated panoramic data will be released.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 3

Research Landscape Overview

Core task: panoramic depth estimation. The field has evolved into several major branches that reflect different input modalities and architectural strategies. Monocular panoramic depth estimation forms the largest branch, encompassing distortion-aware architectures that handle spherical geometry through specialized convolutions, coordinate-based methods, and transformer-based designs. Stereo and multi-view approaches leverage multiple panoramic images to improve geometric consistency, while sensor fusion methods combine panoramic cameras with LiDAR or other modalities to extend depth range and accuracy. High-resolution and perspective-panoramic registration techniques address the challenge of aligning standard pinhole images with 360-degree views, and application-specific branches tailor depth estimation to domains such as autonomous driving or indoor scene understanding. Datasets, benchmarks, and foundation models provide the infrastructure for training and evaluation, with recent works like Depth Anywhere[35] and Depth Any Panoramas[47] exploring generalization across diverse panoramic scenarios. Within monocular methods, a central tension exists between approaches that explicitly model spherical distortion and those that adapt standard perspective architectures. Distortion-aware filters and spherical convolutions, exemplified by early work like Distortion Aware Filters[11] and Omnidepth[18], directly address equirectangular projection artifacts, while coordinate-based methods such as Spherical Deep Network[12] and EGformer[33] encode geometric priors through positional embeddings or spherical harmonics. The original paper, Depth Anything in Any Direction[0], sits within this coordinate-based cluster, emphasizing directional cues in spherical space similarly to SPDET[37] and EGformer[33].
Compared to these neighbors, Depth Anything in Any Direction[0] appears to push toward more flexible geometric representations that can generalize across viewing conditions, contrasting with SPDET[37]'s focus on explicit tangent-plane decompositions. This line of work reflects ongoing exploration of how best to encode 360-degree geometry without sacrificing the representational power of modern deep networks.

Claimed Contributions

Panoramic data curation engine

A pipeline that converts perspective RGB-depth pairs into full panoramic data through Perspective-to-Equirectangular projection and panoramic out-painting using FLUX-I2P. This engine scales up panoramic training data by approximately 10 times, significantly improving zero-shot generalization.
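The Perspective-to-Equirectangular (P2E) step described above can be sketched as follows. This is a minimal nearest-neighbor illustration only, not the paper's engine: the function name, the y-up/z-forward axis convention, and the yaw/pitch parameterization are assumptions, and the FLUX-I2P out-painting stage is omitted entirely.

```python
import numpy as np

def p2e(img, fov_deg, yaw_deg, pitch_deg, out_h, out_w):
    """Paste a pinhole image onto an equirectangular canvas (nearest-neighbor).

    Hypothetical sketch: for each ERP pixel, build its spherical ray,
    rotate it into the camera frame, and sample the pinhole image where
    the ray projects inside the frustum.
    """
    h, w = img.shape[:2]
    f = 0.5 * w / np.tan(np.radians(fov_deg) / 2)  # focal length in pixels

    # Longitude/latitude at every equirectangular pixel center.
    lon = (np.arange(out_w) + 0.5) / out_w * 2 * np.pi - np.pi
    lat = np.pi / 2 - (np.arange(out_h) + 0.5) / out_h * np.pi
    lon, lat = np.meshgrid(lon, lat)  # both (out_h, out_w)

    # World-frame unit rays (y up, z forward -- assumed convention).
    d = np.stack([np.cos(lat) * np.sin(lon),
                  np.sin(lat),
                  np.cos(lat) * np.cos(lon)], axis=-1)

    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    Ry = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                   [0, 1, 0],
                   [-np.sin(yaw), 0, np.cos(yaw)]])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch), np.cos(pitch)]])
    R = Ry @ Rx          # camera-to-world rotation
    dc = d @ R           # world rays expressed in the camera frame (R^T d)

    # Pinhole projection of rays with positive depth.
    valid = dc[..., 2] > 1e-6
    z = np.where(valid, dc[..., 2], 1.0)
    u = f * dc[..., 0] / z + w / 2
    v = -f * dc[..., 1] / z + h / 2  # image v grows downward
    inside = valid & (u >= 0) & (u < w) & (v >= 0) & (v < h)

    canvas = np.zeros((out_h, out_w) + img.shape[2:], dtype=img.dtype)
    ui, vi = u.astype(int), v.astype(int)
    canvas[inside] = img[vi[inside], ui[inside]]
    return canvas, inside
```

A real engine would warp the paired depth map the same way (converting pinhole depth to spherical distance) and then out-paint the uncovered region of the canvas.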

9 retrieved papers
Can Refute
SphereViT architecture

A Vision Transformer backbone that uses cross-attention with spherical embeddings derived from azimuth and polar angles. Image features attend to fixed spherical embeddings to produce distortion-aware representations, mitigating spherical distortions without requiring auxiliary modules or cubemap fusion.
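The described data flow can be sketched as below. This is a bare-bones illustration under stated assumptions: sinusoidal encodings of azimuth and polar angles, and a single attention head with no learned projections. The actual SphereViT presumably uses learned query/key/value layers inside a full ViT; both function names here are hypothetical.

```python
import numpy as np

def spherical_embedding(h, w, dim):
    """Sinusoidal embedding of (azimuth, polar) angles per ERP patch.

    Assumption: half the channels encode azimuth phi, half the polar
    angle theta, each at dim/4 dyadic frequencies (dim divisible by 4).
    """
    phi = (np.arange(w) + 0.5) / w * 2 * np.pi    # azimuth in [0, 2pi)
    theta = (np.arange(h) + 0.5) / h * np.pi      # polar in (0, pi)
    phi, theta = np.meshgrid(phi, theta)          # (h, w)
    freqs = 2.0 ** np.arange(dim // 4)
    ang = np.stack([phi, theta], -1)[..., None] * freqs   # (h, w, 2, dim/4)
    emb = np.concatenate([np.sin(ang), np.cos(ang)], -1)  # (h, w, 2, dim/2)
    return emb.reshape(h, w, -1)                          # (h, w, dim)

def cross_attend(feat, emb):
    """Image features (queries) attend to the fixed spherical embeddings
    (keys; values also taken from the embeddings for illustration)."""
    q = feat.reshape(-1, feat.shape[-1])
    k = v = emb.reshape(-1, emb.shape[-1])
    attn = q @ k.T / np.sqrt(k.shape[-1])
    attn = np.exp(attn - attn.max(-1, keepdims=True))     # stable softmax
    attn /= attn.sum(-1, keepdims=True)
    return (attn @ v).reshape(feat.shape)
```

The key design point the contribution claims is that the embeddings are fixed functions of viewing direction, so the geometric prior is injected without cubemap splitting or auxiliary distortion modules.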

10 retrieved papers
Can Refute
Comprehensive benchmark for panoramic depth estimation

A thorough evaluation framework comparing both zero-shot and in-domain methods, as well as panoramic and perspective approaches, across multiple recognized datasets. The benchmark demonstrates that DA2 achieves state-of-the-art zero-shot performance and even surpasses prior in-domain methods.
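For reference, AbsRel (the metric quoted in the abstract) is the mean absolute relative depth error. A minimal sketch follows; the median scale alignment shown is an assumption, since alignment protocols vary across benchmarks and papers.

```python
import numpy as np

def abs_rel(pred, gt, mask=None):
    """Mean absolute relative error between predicted and ground-truth depth.

    Pixels with non-positive ground truth are excluded; prediction is
    median-scaled to the ground truth first (common in zero-shot evaluation,
    but protocol-dependent).
    """
    if mask is None:
        mask = gt > 0
    pred, gt = pred[mask], gt[mask]
    pred = pred * np.median(gt) / np.median(pred)  # align global scale
    return float(np.mean(np.abs(pred - gt) / gt))
```

With this metric, "38% improvement on AbsRel" means the averaged error is 38% lower than the strongest zero-shot baseline's.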

9 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Panoramic data curation engine

A pipeline that converts perspective RGB-depth pairs into full panoramic data through Perspective-to-Equirectangular projection and panoramic out-painting using FLUX-I2P. This engine scales up panoramic training data by approximately 10 times, significantly improving zero-shot generalization.

Contribution

SphereViT architecture

A Vision Transformer backbone that uses cross-attention with spherical embeddings derived from azimuth and polar angles. Image features attend to fixed spherical embeddings to produce distortion-aware representations, mitigating spherical distortions without requiring auxiliary modules or cubemap fusion.

Contribution

Comprehensive benchmark for panoramic depth estimation

A thorough evaluation framework comparing both zero-shot and in-domain methods, as well as panoramic and perspective approaches, across multiple recognized datasets. The benchmark demonstrates that DA2 achieves state-of-the-art zero-shot performance and even surpasses prior in-domain methods.