D-REX: Differentiable Real-to-Sim-to-Real Engine for Learning Dexterous Grasping

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Real-to-Sim-to-Real; Differentiable Simulation; Learning Robotic Policies from Videos; System Identification
Abstract:

Simulation provides a cost-effective and flexible platform for data generation and policy learning in the development of robotic systems. However, bridging the gap between simulated and real-world dynamics remains a significant challenge, especially for physical parameter identification. In this work, we introduce a real-to-sim-to-real engine that leverages Gaussian Splat representations to build a differentiable engine, enabling object mass identification from real-world visual observations and robot control signals while simultaneously learning grasping policies. By optimizing the mass of the manipulated object, our method automatically builds high-fidelity, physically plausible digital twins. Additionally, we propose a novel approach that trains force-aware grasping policies from limited data by transferring feasible human demonstrations into simulated robot demonstrations. Through comprehensive experiments, we demonstrate that our engine achieves accurate and robust mass identification across various object geometries and mass values. The identified masses in turn facilitate force-aware policy learning, achieving superior performance in object grasping and effectively reducing the sim-to-real gap. Our code is included in the Supplementary Material and will be open-sourced to facilitate reproducibility. An anonymous project page is available at https://robot-drex-engine.github.io.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a differentiable real-to-sim-to-real framework that identifies object mass from visual observations and robot control signals while simultaneously learning force-aware grasping policies. It resides in the Force-Aware and Compliant Manipulation leaf under Multi-Modal Sensing and Fusion. Notably, this leaf contains only one paper in the taxonomy (the original submission itself), indicating a relatively sparse research direction within the broader field of fifty surveyed works. This positioning suggests the work addresses a niche intersection of physical parameter identification and force-aware policy learning.

The taxonomy reveals that neighboring leaves focus on Tactile-Visual Integration (three papers) and broader Vision-Based Deep Reinforcement Learning branches (multiple subtopics with two to four papers each). The scope note for Force-Aware and Compliant Manipulation explicitly includes force control and compliance for adaptive grasping, excluding purely visual or tactile methods. The paper's differentiable simulation approach connects to the Sim-to-Real Policy Transfer leaf (one paper) and contrasts with purely vision-driven methods in Closed-Loop Vision-Based Control (three papers). This structural context highlights that force-aware manipulation remains less explored than tactile-vision fusion or standard visual reinforcement learning.

Among the twenty-nine candidates examined, the contribution-level statistics reveal varying degrees of prior overlap. For the differentiable real-to-sim-to-real framework, ten candidates were examined, three of which appear to provide overlapping prior work. For force-aware policy learning from human demonstrations, ten candidates were examined, with one refutable match. For end-to-end mass identification through differentiable simulation, nine candidates were examined, five of which show potential overlap. These numbers indicate that, within the limited search scope, several existing works address related parameter-identification or force-aware-learning problems, though the specific combination of Gaussian Splat representations with simultaneous mass identification and policy learning may offer a distinct integration.

Based on the top-thirty semantic matches examined, the work appears to occupy a moderately explored niche. The taxonomy structure confirms that force-aware manipulation is less crowded than tactile-vision fusion or standard visual reinforcement learning. However, the contribution-level statistics suggest that the individual technical components (differentiable simulation, mass identification, force-aware policies) have precedents in the examined literature. The analysis does not exhaustively cover domain-specific venues or recent preprints beyond the candidate set, leaving open the question of incremental versus transformative novelty.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 9

Research Landscape Overview

Core task: learning dexterous grasping from visual observations and robot control signals. The field organizes around several complementary branches that reflect different methodological emphases and problem settings. Vision-Based Deep Reinforcement Learning and Simulation-to-Reality Transfer focus on end-to-end policy learning, often leveraging large-scale synthetic data and domain randomization to bridge the sim-to-real gap, as seen in works like QT-Opt[4] and Scalable Vision Manipulation[3]. Learning from Human Demonstrations and Priors emphasizes imitation and teleoperation to bootstrap policies efficiently, while Multi-Modal Sensing and Fusion integrates tactile, force, and proprioceptive signals alongside vision to enable compliant and adaptive manipulation. Vision-Language-Action Models and Foundational Methods explore how pre-trained representations and language grounding can generalize across tasks, with recent efforts like Dexterous Arma-Hand VLA[19] and RoboDexVLM[27] pushing toward unified architectures. Meanwhile, Specialized Dexterous Manipulation Tasks and Application-Specific branches address domain constraints in areas such as assembly, deformable object handling, and assistive robotics, and Emerging Paradigms investigate active perception and next-generation sensing modalities.

Within this landscape, a particularly active line of work centers on fusing multiple sensory modalities to achieve robust, contact-rich manipulation. D-REX[0] sits squarely in the Multi-Modal Sensing and Fusion branch under Force-Aware and Compliant Manipulation, emphasizing the integration of visual feedback with force or tactile cues to handle delicate grasping scenarios. This contrasts with purely vision-driven approaches like Vision Deep RL Grasping[10] or Simulated Depth Grasping[42], which rely on depth or RGB alone, and complements recent tactile-vision fusion methods such as ViTacFormer[31] and See to Touch[38]. Compared to Compliant Vision Demonstration[21], which also targets compliant control but leans on human demonstrations, D-REX[0] explores how force-aware policies can be learned more autonomously from multi-modal observations. The central trade-off across these branches remains balancing sensor complexity, data efficiency, and generalization: while richer modalities promise finer control, they also raise challenges in sensor calibration, sim-to-real transfer, and scalable data collection.

Claimed Contributions

Differentiable real-to-sim-to-real framework for object mass identification

The authors propose a framework that combines Gaussian Splat representations with differentiable physics simulation to identify object mass from visual observations and robot control signals. This enables automatic construction of high-fidelity, physically plausible digital twins through end-to-end optimization.

10 retrieved papers
Can Refute
Force-aware grasping policy learning from human demonstrations

The authors introduce a method that transfers human demonstrations into robot-executable trajectories in simulation and trains policies that combine position and force control conditioned on identified object mass. This hybrid control approach enables robust grasping across varying object masses.

10 retrieved papers
Can Refute
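To illustrate why conditioning on an identified mass matters for this contribution, the following is a hypothetical hybrid position/force command for a simple two-finger friction grasp. It is not the paper's learned policy: `grasp_command`, the proportional gain, and the friction and safety constants are assumptions chosen purely for the sketch. The grip-force term shows how a heavier identified mass directly raises the commanded normal force.

```python
def grasp_command(target_pos, current_pos, mass,
                  kp=5.0, mu=0.8, g=9.81, safety=1.5):
    """Hypothetical hybrid command for a two-finger friction grasp.

    Combines a proportional position term with a grip-force setpoint
    derived from the identified object mass. All gains and constants
    are illustrative assumptions, not values from the paper.
    """
    # Proportional position command toward the target.
    position_cmd = kp * (target_pos - current_pos)
    # Minimum total normal force so that friction at two contacts
    # supports the object's weight: 2 * mu * F_n >= m * g,
    # scaled by a safety margin.
    grip_force = safety * mass * g / (2.0 * mu)
    return position_cmd, grip_force

# A 2 kg object demands roughly 18.4 N of grip under these constants.
pos_cmd, grip = grasp_command(target_pos=0.3, current_pos=0.1, mass=2.0)
print(pos_cmd, grip)
```

In the paper's setting the mapping from mass to force is learned rather than hand-derived, but the sketch makes the dependency explicit: underestimating mass yields too little grip force (slip), while overestimating it risks crushing delicate objects.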
End-to-end mass identification through differentiable simulation

The framework leverages differentiable physics engines to optimize object mass by minimizing trajectory discrepancies between simulation and real-world robot-object interactions. Unlike prior methods requiring manually specified forces, this approach uses consistent robotic control signals for end-to-end optimization.

9 retrieved papers
Can Refute
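To make the claimed optimization concrete, here is a minimal 1-D sketch of mass identification by gradient descent through a differentiable rollout. It is not the authors' implementation, which couples Gaussian Splat rendering with a full differentiable physics engine; `simulate` and `identify_mass` are illustrative names, and the closed-form gradient exploits the fact that, for zero initial state, a point-mass trajectory is linear in the inverse mass.

```python
import numpy as np

def simulate(mass, forces, dt=0.01):
    """Roll out a 1-D point mass under known applied forces (semi-implicit Euler)."""
    v = x = 0.0
    xs = np.empty(len(forces))
    for k, f in enumerate(forces):
        v += (f / mass) * dt
        x += v * dt
        xs[k] = x
    return xs

def identify_mass(real_traj, forces, m0=0.5, lr=0.05, iters=300):
    """Gradient descent on inverse mass u = 1/m.

    With zero initial state the trajectory equals u times the unit-mass
    trajectory, so the gradient of the squared trajectory error with
    respect to u is available in closed form.
    """
    unit_traj = simulate(1.0, forces)      # rollout at m = 1, so sim = u * unit_traj
    u = 1.0 / m0
    for _ in range(iters):
        residual = u * unit_traj - real_traj
        u -= lr * 2.0 * np.dot(residual, unit_traj)
    return 1.0 / u

forces = np.ones(100)                      # a known 1 N push applied for 1 s
real_traj = simulate(2.0, forces)          # stand-in for the observed rollout of a 2 kg object
print(identify_mass(real_traj, forces))    # converges to ~2.0
```

The same trajectory-matching loss carries over to full rigid-body engines, where the gradient is obtained by automatic differentiation through contact dynamics rather than in closed form, and the "real" trajectory comes from visual tracking instead of a reference simulation.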

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Differentiable real-to-sim-to-real framework for object mass identification

The authors propose a framework that combines Gaussian Splat representations with differentiable physics simulation to identify object mass from visual observations and robot control signals. This enables automatic construction of high-fidelity, physically plausible digital twins through end-to-end optimization.

Contribution

Force-aware grasping policy learning from human demonstrations

The authors introduce a method that transfers human demonstrations into robot-executable trajectories in simulation and trains policies that combine position and force control conditioned on identified object mass. This hybrid control approach enables robust grasping across varying object masses.

Contribution

End-to-end mass identification through differentiable simulation

The framework leverages differentiable physics engines to optimize object mass by minimizing trajectory discrepancies between simulation and real-world robot-object interactions. Unlike prior methods requiring manually specified forces, this approach uses consistent robotic control signals for end-to-end optimization.