ORCaS: Unsupervised Depth Completion via Occluded Region Completion as Supervision

ICLR 2026 Conference Submission
Anonymous Authors
Depth Completion · Unsupervised Learning · 3D Reconstruction · Multi-modal Learning
Abstract:

We propose a method for inferring an egocentric dense depth map from an RGB image and a sparse point cloud. The crux of our method lies in modeling the 3D scene implicitly within the latent space and learning an inductive bias in an unsupervised manner through principles of Structure-from-Motion. To force the learning of this inductive bias, we optimize for an ill-posed objective: predicting latent features that are not observed in the input view but exist in the 3D scene. This is facilitated by rigid warping of latent features from the input view to a nearby or adjacent (co-visible) view of the same 3D scene. "Empty" regions in the latent space, corresponding to regions occluded from the input view, are completed by a Contextual eXtrapolation mechanism based on features visible in the input view. Once learned, the inductive bias can be transferred to modulate the features of the input view to improve fidelity. We term our method "Occluded Region Completion as Supervision", or ORCaS. We evaluate ORCaS on the VOID1500 and NYUv2 benchmark datasets, where we improve over the best existing method by 8.91% across all metrics. ORCaS also improves generalization from VOID1500 to ScanNet and NYUv2 by 15.7%, and robustness to low-density inputs by 31.2%. Code will be released.
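The rigid warping of latent features described in the abstract can be sketched under simplified assumptions: a per-pixel depth estimate for the input view, pinhole intrinsics `K`, and a known relative pose `T`. All function and variable names below are illustrative, not the paper's implementation.

```python
import numpy as np

def warp_features(feat, depth, K, T):
    """Rigidly warp a latent feature map from the input view to an
    adjacent (co-visible) view.

    feat : (H, W, C) latent features of the input view
    depth: (H, W) per-pixel depth in the input view
    K    : (3, 3) pinhole camera intrinsics
    T    : (4, 4) rigid transform, input view -> adjacent view

    Returns warped features plus a validity mask. Pixels of the
    adjacent view that receive no projection stay "empty" -- these
    are the regions the paper proposes to complete.
    """
    H, W, C = feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T
    # Backproject input-view pixels to 3D using their depth
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    pts = (T @ np.vstack([pts, np.ones((1, H * W))]))[:3]
    # Project into the adjacent view
    z = np.maximum(pts[2], 1e-6)
    proj = K @ pts
    u = np.round(proj[0] / z).astype(int)
    v = np.round(proj[1] / z).astype(int)
    valid = (pts[2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out = np.zeros_like(feat)
    mask = np.zeros((H, W), dtype=bool)
    src = feat.reshape(-1, C)
    out[v[valid], u[valid]] = src[valid]
    mask[v[valid], u[valid]] = True
    return out, mask
```

With an identity pose the warp is a no-op and the mask is fully valid; under real camera motion, unfilled entries in `mask` delimit the occluded regions.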

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes ORCaS, a method for unsupervised depth completion that learns an inductive bias by predicting latent features in occluded regions through rigid warping and contextual extrapolation. It resides in the 'Geometric and Structural Constraint Methods' leaf, which contains only three papers total, including ORCaS itself. This leaf sits within the broader 'Self-Supervised Learning Frameworks' branch, indicating a relatively sparse research direction focused on geometric priors rather than photometric or feature-metric losses. The small sibling count suggests this specific angle—using occluded region completion as supervision—is not heavily explored.

The taxonomy reveals that most self-supervised depth completion work clusters around photometric consistency (four papers) or feature-metric odometry (three papers), while geometric constraint methods remain less populated. Neighboring branches include 'Multi-Modal Fusion Architectures' with attention-based and hierarchical fusion strategies, and 'Specialized Depth Representation' methods using 3D spatial processing or implicit representations. ORCaS diverges from these by emphasizing latent-space geometric reasoning over explicit fusion modules or 3D voxel grids, positioning it at the intersection of self-supervision and implicit scene modeling without relying on photometric reconstruction or foundation model priors.

Across three contributions, the analysis examined thirty candidate papers total, with ten candidates per contribution. None of the contributions were clearly refuted by prior work in this limited search. The novel supervision signal from occluded regions, the ORCaS architecture with 3D feature broadcasting, and the alternating training loss function all showed zero refutable candidates among the ten examined for each. This suggests that within the top-thirty semantic matches and their citations, no overlapping prior work was identified, though the search scope remains constrained and does not cover the entire literature exhaustively.

Given the sparse taxonomy leaf and absence of refutations in the limited search, ORCaS appears to occupy a relatively unexplored niche within geometric self-supervision for depth completion. However, the analysis is based on thirty candidates from semantic search, not a comprehensive survey, and the field's broader landscape includes many fusion and foundation model approaches that may address related challenges differently. The novelty assessment is thus provisional, reflecting the examined scope rather than definitive coverage of all relevant prior art.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Unsupervised depth completion from RGB image and sparse point cloud. The field is organized around several complementary strategies for fusing visual and sparse geometric cues without ground-truth supervision. Self-Supervised Learning Frameworks leverage photometric consistency, geometric constraints, and temporal signals to train models that propagate sparse depth measurements across the image, with works like Calibrated Backprojection[3] and HR-Depth[4] exploiting camera geometry and multi-scale reasoning. Multi-Modal Fusion Architectures focus on designing network modules that effectively combine RGB features with sparse LiDAR or radar inputs, often through attention mechanisms or specialized convolution operators as seen in Sparse Dense CNNs[15] and PointMBF[21]. Specialized Depth Representation and Processing methods explore novel ways to encode and refine depth maps, including decomposition strategies (Depth Map Decomposition[8]) and probabilistic formulations (Dense Depth Posterior[9]). Monocular Depth Estimation and Auxiliary Depth Integration approaches incorporate pretrained monocular networks or transfer knowledge from dense depth predictors to guide sparse completion, while Foundation Model-Based and Knowledge Distillation Approaches harness large-scale pretrained models like Depth Anything[47] or Depth Pro[48] to provide robust priors.

Application-Specific and Domain-Adapted Methods tailor solutions to challenging scenarios such as all-day operation (All-day Depth[6]), UAV mapping (UAV Depth Mapping[11]), or medical imaging (Endoscopy Densification[29]). Recent work has increasingly explored the interplay between geometric constraints and learned feature fusion, with many studies investigating how to best exploit camera motion, stereo cues, or temporal consistency alongside sparse measurements.
ORCaS[0] sits within the Self-Supervised Learning Frameworks branch, specifically among Geometric and Structural Constraint Methods, emphasizing rigorous geometric reasoning to guide unsupervised training. This places it close to Heterogeneous Depth Completion[2] and RGB Sparse Completion[38], which similarly prioritize structural consistency and multi-modal alignment without relying on dense supervision. Compared to these neighbors, ORCaS[0] appears to place particular emphasis on leveraging calibrated geometric relationships and structural priors, in contrast with approaches that lean more heavily on learned fusion architectures or foundation model distillation. The broader tension in the field remains between purely data-driven fusion strategies and methods that encode explicit geometric or physical constraints, with ORCaS[0] contributing to the latter tradition.

Claimed Contributions

Novel supervision signal from occluded regions for unsupervised depth completion

The authors propose using regions occluded from the input view but visible in adjacent views as a supervision signal during training. This forces the network to learn an inductive bias about 3D scene structure rather than relying solely on 2D image-based regularizers, improving depth completion fidelity.

10 retrieved papers
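To make the claimed supervision signal concrete, the sketch below identifies adjacent-view regions that receive no projection from the input view, i.e., regions occluded from (or outside) the input view. This is a hedged illustration under simplified assumptions (per-pixel depth, pinhole intrinsics, known relative pose); names are hypothetical, not taken from the paper.

```python
import numpy as np

def occluded_region_mask(depth, K, T, adj_shape):
    """Return a boolean mask over the adjacent view that is True
    where no input-view pixel lands after rigid warping -- the
    regions that can serve as the occlusion-based supervision signal.

    depth    : (h, w) input-view depth
    K        : (3, 3) camera intrinsics
    T        : (4, 4) rigid transform, input view -> adjacent view
    adj_shape: (H, W) adjacent-view resolution
    """
    h, w = depth.shape
    H, W = adj_shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T
    # Backproject to 3D, transform into the adjacent view, project
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    pts = (T @ np.vstack([pts, np.ones((1, h * w))]))[:3]
    z = np.maximum(pts[2], 1e-6)
    proj = K @ pts
    u = np.round(proj[0] / z).astype(int)
    v = np.round(proj[1] / z).astype(int)
    hit = np.zeros((H, W), dtype=bool)
    ok = (pts[2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    hit[v[ok], u[ok]] = True
    return ~hit
```

Under an identity pose the mask is empty; a lateral camera motion leaves a band of unhit pixels, which is where a loss on completed features would be applied.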
ORCaS architecture with 3D feature broadcasting and ConteXt mechanism

The authors introduce an architecture that broadcasts 2D features into 3D volumes across depth planes, rigidly warps them to adjacent views, and uses a Contextual eXtrapolation (ConteXt) mechanism to complete empty regions corresponding to occlusions. The learned inductive bias modulates input view features at inference.

10 retrieved papers
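The broadcasting step described above can be sketched as a plane-sweep-style lift of 2D features across depth hypotheses, followed by a stand-in for the ConteXt completion. The mean-of-visible-features fill below is a deliberately crude proxy for the learned extrapolation; all names are illustrative, not the paper's implementation.

```python
import numpy as np

def broadcast_to_volume(feat, depth_planes):
    """Broadcast a 2D feature map (H, W, C) along D hypothesized
    depth planes, yielding a (D, H, W, C) volume. In a full pipeline
    each plane would be rigidly warped to the adjacent view at its
    hypothesized depth."""
    D = len(depth_planes)
    return np.repeat(feat[None], D, axis=0)

def context_fill(volume, empty):
    """Minimal stand-in for the ConteXt mechanism: complete empty
    cells (True in `empty`, shape (D, H, W)) by extrapolating from
    the visible features -- here, their per-plane mean."""
    out = volume.copy()
    for d in range(volume.shape[0]):
        vis = ~empty[d]
        if vis.any():
            out[d][empty[d]] = volume[d][vis].mean(axis=0)
    return out
```

A learned ConteXt module would replace the mean with a conditional prediction from visible context, but the data flow (lift, warp, complete empty cells) is the same.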
ORCaS loss function for alternating training

The authors design a loss function that enforces consistency between predicted adjacent view features and encoded adjacent view features. This loss is optimized in an alternating training scheme to learn the parameters of the ConteXt mechanism while maintaining standard depth completion objectives.

10 retrieved papers
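The consistency objective can be sketched as a masked feature distance between predicted and encoded adjacent-view features, with the occluded ("empty") regions weighted separately. This is a sketch of the idea only; the symbols, weighting, and alternation schedule are assumptions, not the paper's loss.

```python
import numpy as np

def feature_consistency_loss(pred_adj, enc_adj, empty_mask, w_occ=1.0):
    """Squared-error consistency between predicted adjacent-view
    features `pred_adj` and encoded adjacent-view features `enc_adj`
    (both (H, W, C)), with the occluded regions (`empty_mask`,
    (H, W) bool) weighted by w_occ.

    Alternating scheme (sketch): on one phase, optimize the ConteXt
    parameters with this loss; on the other, optimize the depth
    completion network with its standard unsupervised objectives.
    """
    diff = (pred_adj - enc_adj) ** 2
    vis = diff[~empty_mask].mean() if (~empty_mask).any() else 0.0
    occ = diff[empty_mask].mean() if empty_mask.any() else 0.0
    return vis + w_occ * occ
```

The loss is zero when predicted and encoded features agree everywhere, and the `w_occ` term lets the occluded regions dominate the gradient during the ConteXt phase.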

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Novel supervision signal from occluded regions for unsupervised depth completion

The authors propose using regions occluded from the input view but visible in adjacent views as a supervision signal during training. This forces the network to learn an inductive bias about 3D scene structure rather than relying solely on 2D image-based regularizers, improving depth completion fidelity.

Contribution

ORCaS architecture with 3D feature broadcasting and ConteXt mechanism

The authors introduce an architecture that broadcasts 2D features into 3D volumes across depth planes, rigidly warps them to adjacent views, and uses a Contextual eXtrapolation (ConteXt) mechanism to complete empty regions corresponding to occlusions. The learned inductive bias modulates input view features at inference.

Contribution

ORCaS loss function for alternating training

The authors design a loss function that enforces consistency between predicted adjacent view features and encoded adjacent view features. This loss is optimized in an alternating training scheme to learn the parameters of the ConteXt mechanism while maintaining standard depth completion objectives.