Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: 3D perception; manipulation; sim-to-real; depth foundation model
Abstract:

Modern robotic manipulation primarily relies on visual observations in a 2D color space for skill learning but suffers from poor generalization. In contrast, humans, living in a 3D world, depend more on physical properties, such as distance, size, and shape, than on texture when interacting with objects. Since such 3D geometric information can be acquired from widely available depth cameras, it appears feasible to endow robots with similar perceptual capabilities. Our pilot study found that using depth cameras for manipulation is challenging, primarily due to their limited accuracy and susceptibility to various types of noise. In this work, we propose Camera Depth Models (CDMs) as a simple plugin for everyday depth cameras, which take RGB images and raw depth signals as input and output denoised, accurate metric depth. To achieve this, we develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera's noise pattern. Our results show that CDMs achieve nearly simulation-level accuracy in depth prediction, effectively bridging the sim-to-real gap for manipulation tasks. Notably, our experiments demonstrate, for the first time, that a policy trained on raw simulated depth, without added noise or real-world fine-tuning, generalizes seamlessly to real-world robots on two challenging long-horizon tasks involving articulated, reflective, and slender objects, with little to no performance degradation. We hope our findings will inspire future research on utilizing simulation data and 3D information in general robot policies.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Camera Depth Models (CDMs) as a plugin to enhance depth accuracy from commodity RGB-D sensors for robotic manipulation. It resides in the 'Simulation-to-Real Transfer for Depth-Based Manipulation' leaf, which currently contains only this paper among the 50 surveyed works. This isolation suggests the taxonomy captures a relatively sparse research direction explicitly focused on sim-to-real depth transfer, distinguishing it from the broader 'Depth Acquisition and Enhancement Methods' branch where most depth refinement work clusters. The paper's emphasis on modeling depth camera noise patterns to bridge simulation and reality positions it at the intersection of depth enhancement and domain adaptation.

The taxonomy reveals substantial activity in neighboring areas. The 'Depth Completion and Refinement for Challenging Materials' subtopic contains ten papers addressing transparent objects and general depth enhancement, while 'Grasp Detection and Synthesis Using Depth' includes multiple subtopics with methods fusing RGB-D data for manipulation. The 'Foundation Model-Based 3D Manipulation' leaf explores lifting 2D representations to 3D for generalizable policies. The original paper diverges from these by targeting the upstream problem of depth sensor fidelity rather than downstream task-specific fusion or material-specific completion, though its neural data engine approach shares methodological overlap with learned depth refinement techniques in adjacent leaves.

Among the 30 candidates examined, the neural data engine contribution shows the most substantial overlap with prior work: three refutable candidates were identified among the ten examined for it. The CDM plugin concept and the ByteCameraDepth dataset were each compared against ten candidates with zero refutations, suggesting these elements may be more distinctive within the limited search scope. The statistics indicate that while the depth noise modeling approach has recognizable precedents in the examined literature, the specific framing as a camera-agnostic plugin and the dataset contribution appear less directly anticipated by the top-30 semantic matches and their citations.

Given the limited search scope of 30 candidates, this assessment captures novelty relative to closely related work but cannot claim exhaustive coverage of depth enhancement or sim-to-real transfer literature. The paper's unique taxonomy position and the dataset's zero refutations suggest potential distinctiveness, though the neural data engine's three refutable candidates indicate this component builds on established noise modeling techniques. A broader search might reveal additional precedents, particularly in computer vision depth estimation or domain randomization literature outside the manipulation-focused scope examined here.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: depth enhancement for robotic manipulation using camera depth models. The field addresses how robots can acquire, refine, and exploit depth information to perform reliable grasping and manipulation in diverse environments. The taxonomy organizes research into four main branches:

- Depth Acquisition and Enhancement Methods: improving raw depth signals through learning-based refinement, stereo reconstruction, and specialized techniques for challenging materials such as transparent or reflective objects (e.g., ClearDepth[11], Transparent Object Depth[3]).
- Depth-Guided Manipulation Frameworks: integrating enhanced depth into end-to-end policies and grasp-planning systems (e.g., GraspNet[7], Lift3D Foundation[8]).
- Benchmarks, Datasets, and Evaluation Frameworks: standardized testbeds and metrics.
- Simulation-to-Real Transfer for Depth-Based Manipulation: how depth models trained in simulation can generalize to physical systems.

A particularly active line of work targets transparent and reflective objects, where standard depth sensors fail; methods like Clear Grasp[6] and Rethinking Transparent Grasping[4] propose neural completion and physics-informed priors to recover missing geometry. A contrasting direction emphasizes large-scale foundation models and multi-modal fusion (Lift3D Policy[14], Prompting Depth Anything[38]) that leverage pre-trained representations to generalize across object categories. The original paper, Manipulation as Simulation[0], sits within the Simulation-to-Real Transfer branch and emphasizes bridging the gap between synthetic training environments and real-world deployment.
Compared to works like KineDepth[5], which refines depth online during manipulation, or Transparent Depth Completion[22], which focuses on material-specific enhancement, Manipulation as Simulation[0] appears to prioritize robust transfer mechanisms that maintain depth fidelity across the sim-to-real boundary, addressing domain-shift challenges that remain central to deploying learned depth models in practice.

Claimed Contributions

Camera Depth Models (CDMs)

The authors introduce Camera Depth Models, a plug-in solution for depth cameras that processes RGB images and noisy depth signals to produce high-quality, denoised metric depth. CDMs are designed to enhance geometric accuracy for specific depth cameras, enabling robots to perceive 3D information with near-simulation-level precision.

10 retrieved papers
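The plugin framing above implies a simple call signature: an RGB frame plus the camera's raw depth in, clean metric depth out. Below is a minimal sketch of that interface only; the function name is hypothetical, and a trivial median-fill heuristic stands in for the learned neural model described in the paper.

```python
import numpy as np

def cdm_denoise(rgb, raw_depth, max_depth=10.0):
    """Hypothetical CDM-style interface: RGB + raw metric depth in, denoised depth out.

    Stand-in logic only: clip out-of-range values and fill zero-valued
    holes with the median of valid pixels. The actual CDM replaces this
    with a learned network; `rgb` is unused here but part of the interface.
    """
    depth = np.clip(raw_depth.astype(np.float32), 0.0, max_depth)
    valid = depth > 0
    if valid.any():
        # Fill missing returns (holes) with a robust global estimate.
        depth[~valid] = np.median(depth[valid])
    return depth
```

In a manipulation pipeline, this call would sit between the camera driver and whatever consumes point clouds or depth images, which is what makes the "plugin" framing camera-specific but policy-agnostic.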
Neural data engine for depth camera noise modeling

The authors develop a neural data engine that learns and models the noise patterns of depth cameras to synthesize high-quality paired training data in simulation. This includes training hole noise and value noise models on real-world data, then using them to generate realistic noisy depth images for training CDMs.

10 retrieved papers
Can refute (3 of the 10 retrieved candidates)
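The hole-noise / value-noise split described above can be sketched as a paired-data generator: take a clean simulated depth map, corrupt it, and keep the clean map as the training target. This is a toy stand-in using simple random distributions, whereas the paper trains both noise models on real captures; the function name and parameters are hypothetical.

```python
import numpy as np

def apply_camera_noise(clean_depth, hole_prob=0.05, value_sigma=0.01, seed=0):
    """Inject sensor-like noise into a clean simulated depth map.

    Two components, mirroring the hole-noise / value-noise split:
      - value noise: multiplicative Gaussian jitter on depth readings
      - hole noise: random pixels zeroed out (missing sensor returns)
    Returns (noisy_depth, clean_depth) as one paired training sample.
    """
    rng = np.random.default_rng(seed)
    noisy = clean_depth * (1.0 + value_sigma * rng.standard_normal(clean_depth.shape))
    holes = rng.random(clean_depth.shape) < hole_prob
    noisy[holes] = 0.0
    return noisy.astype(np.float32), clean_depth.astype(np.float32)
```

A CDM-style model would then be trained to map (RGB, noisy depth) back to the clean target, which is the supervision scheme the data engine exists to provide.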
ByteCameraDepth dataset

The authors collect and release ByteCameraDepth, a multi-camera depth dataset containing over 170,000 RGB-depth pairs from seven different depth cameras across ten depth modes. This dataset captures typical depth patterns and noise characteristics from commonly used depth cameras in robotic experiments.

10 retrieved papers
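For illustration only, a dataset of this shape (RGB-depth pairs indexed by camera model and depth mode) might be organized as below. The field names and camera identifiers are hypothetical and not taken from ByteCameraDepth's actual release format.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class DepthSample:
    """One RGB-depth pair, tagged by the sensor and mode that captured it."""
    camera: str      # illustrative identifier, e.g. "cam_a"
    mode: str        # depth mode identifier, e.g. "mode_1"
    rgb_path: str
    depth_path: str

def filter_by_camera(samples: List[DepthSample], camera: str) -> List[DepthSample]:
    """Select all samples captured by one camera model, e.g. to train a
    camera-specific CDM on that sensor's noise pattern."""
    return [s for s in samples if s.camera == camera]
```

Per-camera filtering matters here because the report describes CDMs as camera-specific plugins, so each model would train on one sensor's slice of the dataset.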

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Camera Depth Models (CDMs)

Contribution: Neural data engine for depth camera noise modeling

Contribution: ByteCameraDepth dataset

Full descriptions of each contribution appear under Claimed Contributions above.