ORCaS: Unsupervised Depth Completion via Occluded Region Completion as Supervision

ICLR 2026 Conference Submission
Anonymous Authors
Depth Completion · Unsupervised Learning · 3D Reconstruction · Multi-modal Learning
Abstract:

We propose a method for inferring an egocentric dense depth map from an RGB image and a sparse point cloud. The crux of our method lies in modeling the 3D scene implicitly within the latent space and learning an inductive bias in an unsupervised manner through principles of Structure-from-Motion. To force the learning of this inductive bias, we optimize for an ill-posed objective: predicting latent features that are not observed in the input view but exist in the 3D scene. This is facilitated by rigid warping of latent features from the input view to a nearby or adjacent (co-visible) view of the same 3D scene. "Empty" regions in the latent space, corresponding to regions occluded from the input view, are completed by a Contextual eXtrapolation mechanism based on features visible in the input view. Once learned, the inductive bias can be transferred to modulate the features of the input view to improve fidelity. We term our method "Occluded Region Completion as Supervision", or ORCaS. We evaluate ORCaS on the VOID1500 and NYUv2 benchmark datasets, where we improve over the best existing method by 8.91% across all metrics. ORCaS also improves generalization from VOID1500 to ScanNet and NYUv2 by 15.7%, and robustness to low-density inputs by 31.2%. Code will be released.
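The rigid warping of latent features described in the abstract can be sketched under simplified assumptions: a per-pixel depth estimate for the input view, pinhole intrinsics `K`, and a known relative pose `T`. All function and variable names below are illustrative, not the paper's implementation.

```python
import numpy as np

def warp_features(feat, depth, K, T):
    """Rigidly warp a latent feature map from the input view to an
    adjacent (co-visible) view.

    feat : (H, W, C) latent features of the input view
    depth: (H, W) per-pixel depth in the input view
    K    : (3, 3) pinhole camera intrinsics
    T    : (4, 4) rigid transform, input view -> adjacent view

    Returns warped features plus a validity mask. Pixels of the
    adjacent view that receive no projection stay "empty" -- these
    are the regions the paper proposes to complete.
    """
    H, W, C = feat.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T
    # Backproject input-view pixels to 3D using their depth
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    pts = (T @ np.vstack([pts, np.ones((1, H * W))]))[:3]
    # Project into the adjacent view
    z = np.maximum(pts[2], 1e-6)
    proj = K @ pts
    u = np.round(proj[0] / z).astype(int)
    v = np.round(proj[1] / z).astype(int)
    valid = (pts[2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out = np.zeros_like(feat)
    mask = np.zeros((H, W), dtype=bool)
    src = feat.reshape(-1, C)
    out[v[valid], u[valid]] = src[valid]
    mask[v[valid], u[valid]] = True
    return out, mask
```

With an identity pose the warp is a no-op and the mask is fully valid; under real camera motion, unfilled entries in `mask` delimit the occluded regions.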

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes ORCaS, a method for unsupervised depth completion that learns an inductive bias by predicting latent features in occluded regions through rigid warping and contextual extrapolation. It resides in the 'Geometric and Structural Constraint Methods' leaf, which contains only three papers total, including ORCaS itself. This leaf sits within the broader 'Self-Supervised Learning Frameworks' branch, indicating a relatively sparse research direction focused on geometric priors rather than photometric or feature-metric losses. The small sibling count suggests this specific angle—using occluded region completion as supervision—is not heavily explored.

The taxonomy reveals that most self-supervised depth completion work clusters around photometric consistency (four papers) or feature-metric odometry (three papers), while geometric constraint methods remain less populated. Neighboring branches include 'Multi-Modal Fusion Architectures' with attention-based and hierarchical fusion strategies, and 'Specialized Depth Representation' methods using 3D spatial processing or implicit representations. ORCaS diverges from these by emphasizing latent-space geometric reasoning over explicit fusion modules or 3D voxel grids, positioning it at the intersection of self-supervision and implicit scene modeling without relying on photometric reconstruction or foundation model priors.

Across three contributions, the analysis examined thirty candidate papers total, with ten candidates per contribution. None of the contributions were clearly refuted by prior work in this limited search. The novel supervision signal from occluded regions, the ORCaS architecture with 3D feature broadcasting, and the alternating training loss function all showed zero refutable candidates among the ten examined for each. This suggests that within the top-thirty semantic matches and their citations, no overlapping prior work was identified, though the search scope remains constrained and does not cover the entire literature exhaustively.

Given the sparse taxonomy leaf and absence of refutations in the limited search, ORCaS appears to occupy a relatively unexplored niche within geometric self-supervision for depth completion. However, the analysis is based on thirty candidates from semantic search, not a comprehensive survey, and the field's broader landscape includes many fusion and foundation model approaches that may address related challenges differently. The novelty assessment is thus provisional, reflecting the examined scope rather than definitive coverage of all relevant prior art.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Unsupervised depth completion from RGB image and sparse point cloud. The field is organized around several complementary strategies for fusing visual and sparse geometric cues without ground-truth supervision. Self-Supervised Learning Frameworks leverage photometric consistency, geometric constraints, and temporal signals to train models that propagate sparse depth measurements across the image, with works like Calibrated Backprojection[3] and HR-Depth[4] exploiting camera geometry and multi-scale reasoning. Multi-Modal Fusion Architectures focus on designing network modules that effectively combine RGB features with sparse LiDAR or radar inputs, often through attention mechanisms or specialized convolution operators as seen in Sparse Dense CNNs[15] and PointMBF[21]. Specialized Depth Representation and Processing methods explore novel ways to encode and refine depth maps, including decomposition strategies (Depth Map Decomposition[8]) and probabilistic formulations (Dense Depth Posterior[9]). Monocular Depth Estimation and Auxiliary Depth Integration approaches incorporate pretrained monocular networks or transfer knowledge from dense depth predictors to guide sparse completion, while Foundation Model-Based and Knowledge Distillation Approaches harness large-scale pretrained models like Depth Anything[47] or Depth Pro[48] to provide robust priors.

Application-Specific and Domain-Adapted Methods tailor solutions to challenging scenarios such as all-day operation (All-day Depth[6]), UAV mapping (UAV Depth Mapping[11]), or medical imaging (Endoscopy Densification[29]). Recent work has increasingly explored the interplay between geometric constraints and learned feature fusion, with many studies investigating how to best exploit camera motion, stereo cues, or temporal consistency alongside sparse measurements.
ORCaS[0] sits within the Self-Supervised Learning Frameworks branch, specifically among Geometric and Structural Constraint Methods, emphasizing rigorous geometric reasoning to guide unsupervised training. This places it close to Heterogeneous Depth Completion[2] and RGB Sparse Completion[38], which similarly prioritize structural consistency and multi-modal alignment without relying on dense supervision. Compared to these neighbors, ORCaS[0] appears to place particular emphasis on leveraging calibrated geometric relationships and structural priors, in contrast with approaches that lean more heavily on learned fusion architectures or foundation model distillation. The broader tension in the field remains between purely data-driven fusion strategies and methods that encode explicit geometric or physical constraints, with ORCaS[0] contributing to the latter tradition.

Claimed Contributions

Novel supervision signal from occluded regions for unsupervised depth completion

The authors propose using regions occluded from the input view but visible in adjacent views as a supervision signal during training. This forces the network to learn an inductive bias about 3D scene structure rather than relying solely on 2D image-based regularizers, improving depth completion fidelity.

10 retrieved papers
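To make the claimed supervision signal concrete, the sketch below identifies adjacent-view regions that receive no projection from the input view, i.e., regions occluded from (or outside) the input view. This is a hedged illustration under simplified assumptions (per-pixel depth, pinhole intrinsics, known relative pose); names are hypothetical, not taken from the paper.

```python
import numpy as np

def occluded_region_mask(depth, K, T, adj_shape):
    """Return a boolean mask over the adjacent view that is True
    where no input-view pixel lands after rigid warping -- the
    regions that can serve as the occlusion-based supervision signal.

    depth    : (h, w) input-view depth
    K        : (3, 3) camera intrinsics
    T        : (4, 4) rigid transform, input view -> adjacent view
    adj_shape: (H, W) adjacent-view resolution
    """
    h, w = depth.shape
    H, W = adj_shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T
    # Backproject to 3D, transform into the adjacent view, project
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    pts = (T @ np.vstack([pts, np.ones((1, h * w))]))[:3]
    z = np.maximum(pts[2], 1e-6)
    proj = K @ pts
    u = np.round(proj[0] / z).astype(int)
    v = np.round(proj[1] / z).astype(int)
    hit = np.zeros((H, W), dtype=bool)
    ok = (pts[2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    hit[v[ok], u[ok]] = True
    return ~hit
```

Under an identity pose the mask is empty; a lateral camera motion leaves a band of unhit pixels, which is where a loss on completed features would be applied.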
ORCaS architecture with 3D feature broadcasting and ConteXt mechanism

The authors introduce an architecture that broadcasts 2D features into 3D volumes across depth planes, rigidly warps them to adjacent views, and uses a Contextual eXtrapolation (ConteXt) mechanism to complete empty regions corresponding to occlusions. The learned inductive bias modulates input view features at inference.

10 retrieved papers
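The broadcasting step described above can be sketched as a plane-sweep-style lift of 2D features across depth hypotheses, followed by a stand-in for the ConteXt completion. The mean-of-visible-features fill below is a deliberately crude proxy for the learned extrapolation; all names are illustrative, not the paper's implementation.

```python
import numpy as np

def broadcast_to_volume(feat, depth_planes):
    """Broadcast a 2D feature map (H, W, C) along D hypothesized
    depth planes, yielding a (D, H, W, C) volume. In a full pipeline
    each plane would be rigidly warped to the adjacent view at its
    hypothesized depth."""
    D = len(depth_planes)
    return np.repeat(feat[None], D, axis=0)

def context_fill(volume, empty):
    """Minimal stand-in for the ConteXt mechanism: complete empty
    cells (True in `empty`, shape (D, H, W)) by extrapolating from
    the visible features -- here, their per-plane mean."""
    out = volume.copy()
    for d in range(volume.shape[0]):
        vis = ~empty[d]
        if vis.any():
            out[d][empty[d]] = volume[d][vis].mean(axis=0)
    return out
```

A learned ConteXt module would replace the mean with a conditional prediction from visible context, but the data flow (lift, warp, complete empty cells) is the same.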
ORCaS loss function for alternating training

The authors design a loss function that enforces consistency between predicted adjacent view features and encoded adjacent view features. This loss is optimized in an alternating training scheme to learn the parameters of the ConteXt mechanism while maintaining standard depth completion objectives.

10 retrieved papers
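The consistency objective can be sketched as a masked feature distance between predicted and encoded adjacent-view features, with the occluded ("empty") regions weighted separately. This is a sketch of the idea only; the symbols, weighting, and alternation schedule are assumptions, not the paper's loss.

```python
import numpy as np

def feature_consistency_loss(pred_adj, enc_adj, empty_mask, w_occ=1.0):
    """Squared-error consistency between predicted adjacent-view
    features `pred_adj` and encoded adjacent-view features `enc_adj`
    (both (H, W, C)), with the occluded regions (`empty_mask`,
    (H, W) bool) weighted by w_occ.

    Alternating scheme (sketch): on one phase, optimize the ConteXt
    parameters with this loss; on the other, optimize the depth
    completion network with its standard unsupervised objectives.
    """
    diff = (pred_adj - enc_adj) ** 2
    vis = diff[~empty_mask].mean() if (~empty_mask).any() else 0.0
    occ = diff[empty_mask].mean() if empty_mask.any() else 0.0
    return vis + w_occ * occ
```

The loss is zero when predicted and encoded features agree everywhere, and the `w_occ` term lets the occluded regions dominate the gradient during the ConteXt phase.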

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Novel supervision signal from occluded regions for unsupervised depth completion

The authors propose using regions occluded from the input view but visible in adjacent views as a supervision signal during training. This forces the network to learn an inductive bias about 3D scene structure rather than relying solely on 2D image-based regularizers, improving depth completion fidelity.

Contribution

ORCaS architecture with 3D feature broadcasting and ConteXt mechanism

The authors introduce an architecture that broadcasts 2D features into 3D volumes across depth planes, rigidly warps them to adjacent views, and uses a Contextual eXtrapolation (ConteXt) mechanism to complete empty regions corresponding to occlusions. The learned inductive bias modulates input view features at inference.

Contribution

ORCaS loss function for alternating training

The authors design a loss function that enforces consistency between predicted adjacent view features and encoded adjacent view features. This loss is optimized in an alternating training scheme to learn the parameters of the ConteXt mechanism while maintaining standard depth completion objectives.