Visuo-Tactile World Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: world models · robotics · tactile sensing
Abstract:

We introduce multi-task Visuo-Tactile World Models (VT-WM), which capture the physics of contact through touch reasoning. By complementing vision with tactile sensing, VT-WM better understands robot–object interactions in contact-rich tasks, avoiding common failure modes of vision-only models under occlusion or ambiguous contact states, such as objects disappearing, teleporting, or moving in ways that violate basic physics. Trained across a set of contact-rich manipulation tasks, VT-WM improves physical fidelity in imagination, achieving 33% better object permanence and 29% better compliance with the laws of motion in autoregressive rollouts. Experiments further show that this grounding in contact dynamics translates to planning: in zero-shot real-robot experiments, VT-WM achieves up to 35% higher success rates, with the largest gains in multi-step, contact-rich tasks. Finally, VT-WM is data-efficient when targeting a new task, outperforming a behavioral cloning policy by over 3.5× in success rate with limited demonstrations.
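
To make the mechanism in the abstract above concrete, below is a minimal sketch (not the authors' code) of an autoregressive visuo-tactile rollout: a latent dynamics model repeatedly consumes the current latent state, a fingertip tactile reading, and a candidate action, and feeds its own prediction back in as the next input. All module names, dimensions, and interfaces here are assumptions, since this report does not describe the paper's actual architecture.

```python
# A minimal sketch of autoregressive rollout with a visuo-tactile latent
# dynamics model. Every module name, shape, and hyperparameter is assumed.
import torch
import torch.nn as nn


class ToyVisuoTactileDynamics(nn.Module):
    """Hypothetical latent dynamics: z_{t+1} = f(z_t, tactile_t, action_t)."""

    def __init__(self, latent_dim=32, tactile_dim=6, action_dim=7):
        super().__init__()
        self.step = nn.Sequential(
            nn.Linear(latent_dim + tactile_dim + action_dim, 128),
            nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, z, tactile, action):
        return self.step(torch.cat([z, tactile, action], dim=-1))


def rollout(dynamics, z0, tactile0, actions):
    """Imagine a trajectory by feeding predictions back in autoregressively."""
    z, tactile = z0, tactile0
    imagined = []
    for a in actions:                      # actions: list of (B, action_dim)
        z = dynamics(z, tactile, a)        # predicted next latent state
        imagined.append(z)
        # A full model would also predict the next tactile reading; here we
        # simply reuse the last one for brevity.
    return torch.stack(imagined)           # (T, B, latent_dim)


if __name__ == "__main__":
    dyn = ToyVisuoTactileDynamics()
    z0, tac0 = torch.zeros(1, 32), torch.zeros(1, 6)
    plan = [torch.zeros(1, 7) for _ in range(5)]
    print(rollout(dyn, z0, tac0, plan).shape)  # torch.Size([5, 1, 32])
```

The point of the sketch is only the feedback loop: prediction errors compound over the horizon, which is why metrics like object permanence and law-of-motion compliance are reported on autoregressive rollouts rather than single-step predictions.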

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a multi-task visuo-tactile world model that predicts future states by integrating vision and tactile sensing for contact-rich manipulation. It resides in the 'World Models and Predictive Frameworks' leaf of the taxonomy, which currently contains only this single paper. This isolation suggests the leaf represents a relatively sparse research direction within the broader field of contact-rich manipulation, where most work concentrates on sensor design, policy learning, or application-specific tasks rather than predictive modeling with multimodal fusion.

The taxonomy reveals neighboring branches focused on reactive control, imitation learning, and reinforcement learning for contact-rich tasks, as well as cross-modal representation learning and simulation frameworks. The original paper diverges from reactive strategies by emphasizing forward prediction rather than immediate feedback loops, and differs from pure representation learning by targeting planning and rollout quality. Its position bridges perception-focused work on tactile state estimation and control-oriented policy learning, occupying a niche that combines predictive modeling with multimodal grounding in contact dynamics.

Of the 28 candidate papers examined across all contributions, the core world-model contribution has one refutable candidate among its 10, indicating that some prior work on visuo-tactile prediction exists within this limited search scope. The imagination-quality contribution was compared against 8 candidates with none refutable, suggesting this aspect is less directly addressed in the prior literature. The zero-shot planning contribution was compared against 10 candidates with none refutable, hinting at relative novelty in applying learned world models to real-robot planning without task-specific fine-tuning, though the search scope remains constrained.

Based on top-28 semantic matches, the work appears to occupy a moderately explored intersection of world modeling and multimodal sensing, with the strongest prior overlap in predictive modeling itself but less direct precedent for the specific combination of visuo-tactile grounding and zero-shot planning. The analysis does not cover exhaustive citation networks or domain-specific venues, leaving open the possibility of additional related work beyond the examined candidates.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 1

Research Landscape Overview

Core task: contact-rich robotic manipulation with vision and tactile sensing. This field encompasses a broad spectrum of research directions, from the foundational hardware and sensor design that enables multimodal perception, through representation learning and simulation frameworks that bridge the gap between synthetic and real-world data, to high-level policy learning and control strategies that leverage both visual and tactile cues. The taxonomy reveals several major branches: some focus on the physical substrates, developing novel tactile sensors and hardware platforms, while others emphasize computational challenges such as learning robust representations from high-dimensional tactile signals, building accurate simulators for contact dynamics, or integrating vision-language-action models with tactile feedback. Application-specific branches address tasks like dexterous in-hand manipulation, bimanual coordination, and handling deformable or fragile objects, while benchmarks and datasets provide standardized evaluation protocols. Surveys and perspectives offer meta-level insights into the evolving landscape of tactile manipulation research.

Among the most active lines of work, a central theme is the development of world models and predictive frameworks that can anticipate contact dynamics and guide manipulation policies. Visuo-Tactile World Models[0] exemplifies this direction by learning forward models that predict future sensory states from combined vision and touch, enabling more robust planning in contact-rich scenarios. This approach contrasts with reactive strategies such as Reactive Manipulation[3], which emphasizes fast feedback loops without explicit predictive modeling, and complements simulation-driven methods like Soft Contact Simulation[2] that focus on accurate physics modeling. Meanwhile, vision-language-action integration efforts such as VTLA[4] and OmniVTLA[6] explore how to ground language instructions in multimodal sensory streams, opening pathways toward more generalizable manipulation policies.

The original paper sits squarely within the predictive modeling cluster, sharing conceptual ground with works that build internal models of contact dynamics, yet it distinguishes itself by emphasizing the synergy between visual and tactile modalities in learning these forward models, a theme that resonates with recent benchmarks like the ManiSkill-ViTac Challenge[1] and representation learning frameworks such as Robot Synesthesia[8].

Claimed Contributions

Multi-task Visuo-Tactile World Model (VT-WM)

The authors propose the first multi-task world model that integrates fingertip tactile sensing with exocentric vision to jointly model global context and local contact dynamics. This enables the model to ground imagination in contact physics and produce more accurate rollouts for contact-rich manipulation tasks.

10 retrieved papers
Can Refute
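
As a rough illustration of the fusion described in the VT-WM contribution above, the sketch below embeds an exocentric RGB frame (global scene context) and a fingertip tactile reading (local contact) and fuses them into a single latent that a dynamics model could roll forward. It is a minimal sketch under assumed encoders and dimensions; none of the module choices are taken from the paper.

```python
# A minimal sketch of visuo-tactile fusion: a small CNN over the exocentric
# camera image plus an MLP over fingertip tactile readings, concatenated and
# projected to one latent. All shapes and layer choices are assumptions.
import torch
import torch.nn as nn


class VisuoTactileEncoder(nn.Module):
    def __init__(self, latent_dim=64, tactile_dim=6):
        super().__init__()
        # Global context: small CNN over the exocentric camera image.
        self.vision = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Local contact: MLP over per-fingertip force/shear readings.
        self.tactile = nn.Sequential(nn.Linear(tactile_dim, 32), nn.ReLU())
        self.fuse = nn.Linear(32 + 32, latent_dim)

    def forward(self, image, tactile):
        feats = torch.cat([self.vision(image), self.tactile(tactile)], dim=-1)
        return self.fuse(feats)


if __name__ == "__main__":
    enc = VisuoTactileEncoder()
    z = enc(torch.zeros(1, 3, 64, 64), torch.zeros(1, 6))
    print(z.shape)  # torch.Size([1, 64])
```
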
Improved imagination quality through visuo-tactile grounding

The authors demonstrate that incorporating tactile sensing improves the world model's ability to maintain object permanence under occlusion and to comply with physical laws of motion during autoregressive rollouts, evaluated across multiple manipulation tasks.

8 retrieved papers
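
The report does not state how object permanence or compliance with the laws of motion are scored, so the following is only an assumed proxy for the kind of check this contribution implies: track a predicted object mask over a rollout, and flag frames where the object vanishes or its centroid jumps farther than a plausible per-step displacement. The thresholds and function names are hypothetical; the paper's actual evaluation protocol may differ.

```python
# Assumed proxy metrics for rollout physical fidelity (not the paper's).
import numpy as np


def permanence_rate(masks, min_area=10):
    """Fraction of rollout frames in which the tracked object is still visible."""
    areas = masks.reshape(len(masks), -1).sum(axis=1)
    return float((areas >= min_area).mean())


def motion_violation_rate(centroids, max_step=15.0):
    """Fraction of transitions whose displacement exceeds a velocity bound."""
    steps = np.linalg.norm(np.diff(centroids, axis=0), axis=1)
    return float((steps > max_step).mean())


if __name__ == "__main__":
    T = 8
    masks = np.ones((T, 32, 32), dtype=bool)          # object visible throughout
    centroids = np.cumsum(np.ones((T, 2)) * 2.0, 0)   # smooth 2 px/step motion
    print(permanence_rate(masks), motion_violation_rate(centroids))  # 1.0 0.0
```
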
More reliable zero-shot planning on real robots

The authors show that the improved contact perception from VT-WM translates to better planning performance, achieving higher success rates in real-robot experiments, particularly in multi-step contact-rich manipulation tasks.

10 retrieved papers
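
The report likewise does not name the planner behind the zero-shot results in this contribution; a common pattern with learned world models is sampling-based model-predictive control, sketched below as simple random shooting over a stand-in dynamics function. Every function, cost, and parameter here is hypothetical.

```python
# A minimal random-shooting MPC sketch over an assumed one-step dynamics
# function standing in for the learned world model's latent prediction.
import numpy as np

rng = np.random.default_rng(0)


def imagined_dynamics(state, action):
    """Stand-in for the world model's one-step prediction in latent space."""
    return state + 0.1 * action


def plan(state, goal, horizon=10, n_samples=256, action_dim=2):
    """Sample action sequences, roll them out, keep the best first action."""
    candidates = rng.uniform(-1.0, 1.0, size=(n_samples, horizon, action_dim))
    costs = np.zeros(n_samples)
    for i, seq in enumerate(candidates):
        s = state
        for a in seq:
            s = imagined_dynamics(s, a)
        costs[i] = np.linalg.norm(s - goal)   # distance to goal after rollout
    return candidates[np.argmin(costs), 0]    # MPC: execute only the first action


if __name__ == "__main__":
    print(plan(state=np.zeros(2), goal=np.ones(2)))
```

In a full MPC loop the robot would execute only the first action, observe the new visual and tactile state, and replan at the next step.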

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Multi-task Visuo-Tactile World Model (VT-WM)

The authors propose the first multi-task world model that integrates fingertip tactile sensing with exocentric vision to jointly model global context and local contact dynamics. This enables the model to ground imagination in contact physics and produce more accurate rollouts for contact-rich manipulation tasks.

Contribution

Improved imagination quality through visuo-tactile grounding

The authors demonstrate that incorporating tactile sensing improves the world model's ability to maintain object permanence under occlusion and to comply with physical laws of motion during autoregressive rollouts, evaluated across multiple manipulation tasks.

Contribution

More reliable zero-shot planning on real robots

The authors show that the improved contact perception from VT-WM translates to better planning performance, achieving higher success rates in real-robot experiments, particularly in multi-step contact-rich manipulation tasks.