Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: 3D perception; manipulation; sim-to-real; depth foundation model
Abstract:

Modern robotic manipulation primarily relies on visual observations in a 2D color space for skill learning but suffers from poor generalization. In contrast, humans, living in a 3D world, depend more on physical properties, such as distance, size, and shape, than on texture when interacting with objects. Since such 3D geometric information can be acquired from widely available depth cameras, it appears feasible to endow robots with similar perceptual capabilities. Our pilot study found that using depth cameras for manipulation is challenging, primarily due to their limited accuracy and susceptibility to various types of noise. In this work, we propose Camera Depth Models (CDMs) as a simple plugin for everyday depth cameras, which take RGB images and raw depth signals as input and output denoised, accurate metric depth. To achieve this, we develop a neural data engine that generates high-quality paired data from simulation by modeling a depth camera's noise pattern. Our results show that CDMs achieve nearly simulation-level accuracy in depth prediction, effectively bridging the sim-to-real gap for manipulation tasks. Notably, our experiments demonstrate, for the first time, that a policy trained on raw simulated depth, without added noise or real-world fine-tuning, generalizes seamlessly to real-world robots on two challenging long-horizon tasks involving articulated, reflective, and slender objects, with little to no performance degradation. We hope our findings will inspire future research on utilizing simulation data and 3D information in general robot policies.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Camera Depth Models (CDMs) as a plugin to enhance depth accuracy from commodity RGB-D sensors for robotic manipulation. It resides in the 'Simulation-to-Real Transfer for Depth-Based Manipulation' leaf, which currently contains only this paper among the 50 surveyed works. This isolation suggests the taxonomy captures a relatively sparse research direction explicitly focused on sim-to-real depth transfer, distinguishing it from the broader 'Depth Acquisition and Enhancement Methods' branch where most depth refinement work clusters. The paper's emphasis on modeling depth camera noise patterns to bridge simulation and reality positions it at the intersection of depth enhancement and domain adaptation.

The taxonomy reveals substantial activity in neighboring areas. The 'Depth Completion and Refinement for Challenging Materials' subtopic contains ten papers addressing transparent objects and general depth enhancement, while 'Grasp Detection and Synthesis Using Depth' includes multiple subtopics with methods fusing RGB-D data for manipulation. The 'Foundation Model-Based 3D Manipulation' leaf explores lifting 2D representations to 3D for generalizable policies. The original paper diverges from these by targeting the upstream problem of depth sensor fidelity rather than downstream task-specific fusion or material-specific completion, though its neural data engine approach shares methodological overlap with learned depth refinement techniques in adjacent leaves.

Among the 30 candidates examined, the neural data engine contribution shows the most substantial overlap with prior work: three refutable candidates were identified among the ten examined for it. The CDM plugin concept and the ByteCameraDepth dataset were each compared against ten candidates with zero refutations, suggesting these elements may be more distinctive within the limited search scope. The statistics indicate that while the depth noise modeling approach has recognizable precedents in the examined literature, the specific framing as a camera-agnostic plugin and the dataset contribution appear less directly anticipated by the top-30 semantic matches and their citations.

Given the limited search scope of 30 candidates, this assessment captures novelty relative to closely related work but cannot claim exhaustive coverage of depth enhancement or sim-to-real transfer literature. The paper's unique taxonomy position and the dataset's zero refutations suggest potential distinctiveness, though the neural data engine's three refutable candidates indicate this component builds on established noise modeling techniques. A broader search might reveal additional precedents, particularly in computer vision depth estimation or domain randomization literature outside the manipulation-focused scope examined here.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: depth enhancement for robotic manipulation using camera depth models. The field addresses how robots can acquire, refine, and exploit depth information to perform reliable grasping and manipulation in diverse environments. The taxonomy organizes research into four main branches:

- Depth Acquisition and Enhancement Methods: improving raw depth signals through learning-based refinement, stereo reconstruction, and specialized techniques for challenging materials such as transparent or reflective objects (e.g., ClearDepth[11], Transparent Object Depth[3]).
- Depth-Guided Manipulation Frameworks: integrating enhanced depth into end-to-end policies and grasp-planning systems (e.g., GraspNet[7], Lift3D Foundation[8]).
- Benchmarks, Datasets, and Evaluation Frameworks: standardized testbeds and metrics.
- Simulation-to-Real Transfer for Depth-Based Manipulation: how depth models trained in simulation can generalize to physical systems.

A particularly active line of work targets transparent and reflective objects, where standard depth sensors fail; methods like Clear Grasp[6] and Rethinking Transparent Grasping[4] propose neural completion and physics-informed priors to recover missing geometry. A contrasting direction emphasizes large-scale foundation models and multi-modal fusion (Lift3D Policy[14], Prompting Depth Anything[38]) that leverage pre-trained representations to generalize across object categories. The original paper, Manipulation as Simulation[0], sits within the Simulation-to-Real Transfer branch and emphasizes bridging the gap between synthetic training environments and real-world deployment.
Compared to works like KineDepth[5], which refines depth online during manipulation, or Transparent Depth Completion[22], which focuses on material-specific enhancement, Manipulation as Simulation[0] appears to prioritize robust transfer mechanisms that maintain depth fidelity across the sim-to-real boundary, addressing domain-shift challenges that remain central to deploying learned depth models in practice.

Claimed Contributions

Camera Depth Models (CDMs)

The authors introduce Camera Depth Models, a plug-in solution for depth cameras that processes RGB images and noisy depth signals to produce high-quality, denoised metric depth. CDMs are designed to enhance geometric accuracy for specific depth cameras, enabling robots to perceive 3D information with near-simulation-level precision.

10 retrieved papers
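The plugin framing above implies a simple call signature: an RGB frame plus the camera's raw depth in, clean metric depth out. Below is a minimal sketch of that interface only; the function name is hypothetical, and a trivial median-fill heuristic stands in for the learned neural model described in the paper.

```python
import numpy as np

def cdm_denoise(rgb, raw_depth, max_depth=10.0):
    """Hypothetical CDM-style interface: RGB + raw metric depth in, denoised depth out.

    Stand-in logic only: clip out-of-range values and fill zero-valued
    holes with the median of valid pixels. The actual CDM replaces this
    with a learned network; `rgb` is unused here but part of the interface.
    """
    depth = np.clip(raw_depth.astype(np.float32), 0.0, max_depth)
    valid = depth > 0
    if valid.any():
        # Fill missing returns (holes) with a robust global estimate.
        depth[~valid] = np.median(depth[valid])
    return depth
```

In a manipulation pipeline, this call would sit between the camera driver and whatever consumes point clouds or depth images, which is what makes the "plugin" framing camera-specific but policy-agnostic.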
Neural data engine for depth camera noise modeling

The authors develop a neural data engine that learns and models the noise patterns of depth cameras to synthesize high-quality paired training data in simulation. This includes training hole noise and value noise models on real-world data, then using them to generate realistic noisy depth images for training CDMs.

10 retrieved papers
Can refute (3 of the 10 retrieved candidates)
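The hole-noise / value-noise split described above can be sketched as a paired-data generator: take a clean simulated depth map, corrupt it, and keep the clean map as the training target. This is a toy stand-in using simple random distributions, whereas the paper trains both noise models on real captures; the function name and parameters are hypothetical.

```python
import numpy as np

def apply_camera_noise(clean_depth, hole_prob=0.05, value_sigma=0.01, seed=0):
    """Inject sensor-like noise into a clean simulated depth map.

    Two components, mirroring the hole-noise / value-noise split:
      - value noise: multiplicative Gaussian jitter on depth readings
      - hole noise: random pixels zeroed out (missing sensor returns)
    Returns (noisy_depth, clean_depth) as one paired training sample.
    """
    rng = np.random.default_rng(seed)
    noisy = clean_depth * (1.0 + value_sigma * rng.standard_normal(clean_depth.shape))
    holes = rng.random(clean_depth.shape) < hole_prob
    noisy[holes] = 0.0
    return noisy.astype(np.float32), clean_depth.astype(np.float32)
```

A CDM-style model would then be trained to map (RGB, noisy depth) back to the clean target, which is the supervision scheme the data engine exists to provide.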
ByteCameraDepth dataset

The authors collect and release ByteCameraDepth, a multi-camera depth dataset containing over 170,000 RGB-depth pairs from seven different depth cameras across ten depth modes. This dataset captures typical depth patterns and noise characteristics from commonly used depth cameras in robotic experiments.

10 retrieved papers
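For illustration only, a dataset of this shape (RGB-depth pairs indexed by camera model and depth mode) might be organized as below. The field names and camera identifiers are hypothetical and not taken from ByteCameraDepth's actual release format.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class DepthSample:
    """One RGB-depth pair, tagged by the sensor and mode that captured it."""
    camera: str      # illustrative identifier, e.g. "cam_a"
    mode: str        # depth mode identifier, e.g. "mode_1"
    rgb_path: str
    depth_path: str

def filter_by_camera(samples: List[DepthSample], camera: str) -> List[DepthSample]:
    """Select all samples captured by one camera model, e.g. to train a
    camera-specific CDM on that sensor's noise pattern."""
    return [s for s in samples if s.camera == camera]
```

Per-camera filtering matters here because the report describes CDMs as camera-specific plugins, so each model would train on one sensor's slice of the dataset.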

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Camera Depth Models (CDMs)

Contribution: Neural data engine for depth camera noise modeling

Contribution: ByteCameraDepth dataset

Full descriptions of each contribution appear under Claimed Contributions above.