Visuo-Tactile World Models
Overview
Overall Novelty Assessment
The paper introduces a multi-task visuo-tactile world model that predicts future states by integrating vision and tactile sensing for contact-rich manipulation. It resides in the 'World Models and Predictive Frameworks' leaf of the taxonomy, which currently contains only this paper. This isolation suggests a relatively sparse research direction within the broader field of contact-rich manipulation, where most work concentrates on sensor design, policy learning, or application-specific tasks rather than predictive modeling with multimodal fusion.
The taxonomy reveals neighboring branches focused on reactive control, imitation learning, and reinforcement learning for contact-rich tasks, as well as cross-modal representation learning and simulation frameworks. The original paper diverges from reactive strategies by emphasizing forward prediction rather than immediate feedback loops, and differs from pure representation learning by targeting planning and rollout quality. Its position bridges perception-focused work on tactile state estimation and control-oriented policy learning, occupying a niche that combines predictive modeling with multimodal grounding in contact dynamics.
Of the 28 candidates examined, 10 were compared against the core world model contribution, and one was judged refutable, indicating that some prior work on visuo-tactile prediction exists within this limited search scope. The imagination-quality contribution was compared against 8 candidates, none refutable, suggesting this aspect is less directly addressed in the prior literature. The zero-shot planning contribution was compared against 10 candidates, also none refutable, hinting at relative novelty in applying learned world models to real-robot planning without task-specific fine-tuning, though the search scope remains constrained.
Based on top-28 semantic matches, the work appears to occupy a moderately explored intersection of world modeling and multimodal sensing, with the strongest prior overlap in predictive modeling itself but less direct precedent for the specific combination of visuo-tactile grounding and zero-shot planning. The analysis does not cover exhaustive citation networks or domain-specific venues, leaving open the possibility of additional related work beyond the examined candidates.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose the first multi-task world model that integrates fingertip tactile sensing with exocentric vision to jointly model global context and local contact dynamics. This enables the model to ground imagination in contact physics and produce more accurate rollouts for contact-rich manipulation tasks.
The authors demonstrate that incorporating tactile sensing improves the world model's ability to maintain object permanence under occlusion and to comply with physical laws of motion during autoregressive rollouts, evaluated across multiple manipulation tasks.
The authors show that the improved contact perception from VT-WM translates to better planning performance, achieving higher success rates in real-robot experiments, particularly in multi-step contact-rich manipulation tasks.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
Multi-task Visuo-Tactile World Model (VT-WM)
The authors propose the first multi-task world model that integrates fingertip tactile sensing with exocentric vision to jointly model global context and local contact dynamics. This enables the model to ground imagination in contact physics and produce more accurate rollouts for contact-rich manipulation tasks. A minimal architectural sketch follows the candidate list below.
[40] Making sense of vision and touch: Learning multimodal representations for contact-rich tasks
[4] VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation
[10] Multimodal tactile sensing fused with vision for dexterous robotic housekeeping
[46] MLA: A Multisensory Language-Action Model for Multimodal Understanding and Forecasting in Robotic Manipulation
[60] Reactive Diffusion Policy: Slow-Fast Visual-Tactile Policy Learning for Contact-Rich Manipulation
[61] TLA: Tactile-Language-Action Model for Contact-Rich Manipulation
[62] Binding touch to everything: Learning unified multimodal tactile representations
[63] Masked Visual-Tactile Pre-training for Robot Manipulation
[64] Towards interpretable visuo-tactile predictive models for soft robot interactions
[65] Making sense of vision and touch: Self-supervised learning of multimodal representations for contact-rich tasks
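The paper does not publish reference code in this report, but the claimed architecture can be illustrated with a minimal sketch: separate encoders for exocentric RGB (global context) and fingertip tactile images (local contact), a fused latent state, and an action-conditioned recurrent transition model rolled out autoregressively. All module shapes and the use of a GRU transition are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a visuo-tactile latent world model (not the authors' code).
# Assumed inputs: exocentric RGB frames, fingertip tactile images, and
# low-dimensional actions; all layer sizes are illustrative.
import torch
import torch.nn as nn

class VisuoTactileWorldModel(nn.Module):
    def __init__(self, latent_dim=128, action_dim=7):
        super().__init__()
        # Exocentric vision encoder: captures global scene context.
        self.vision_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )
        # Fingertip tactile encoder: captures local contact dynamics.
        self.tactile_enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, latent_dim),
        )
        # Fused latent transition model, conditioned on the action.
        self.fuse = nn.Linear(2 * latent_dim, latent_dim)
        self.dynamics = nn.GRUCell(latent_dim + action_dim, latent_dim)

    def encode(self, rgb, tactile):
        # Fuse global (vision) and local (touch) features into one state.
        z = torch.cat([self.vision_enc(rgb), self.tactile_enc(tactile)], dim=-1)
        return self.fuse(z)

    def rollout(self, z0, actions):
        # Autoregressive imagination: predict future latents from actions alone.
        z, traj = z0, []
        for a in actions.unbind(dim=1):  # actions: (B, T, action_dim)
            z = self.dynamics(torch.cat([z, a], dim=-1), z)
            traj.append(z)
        return torch.stack(traj, dim=1)  # (B, T, latent_dim)

# Usage with dummy tensors.
model = VisuoTactileWorldModel()
rgb = torch.randn(2, 3, 64, 64)       # exocentric camera frames
tactile = torch.randn(2, 3, 32, 32)   # fingertip tactile images
acts = torch.randn(2, 10, 7)          # 10-step action sequence
latents = model.rollout(model.encode(rgb, tactile), acts)
print(latents.shape)  # torch.Size([2, 10, 128])
```

The key design point the claim rests on is the fusion step: because the tactile stream enters the latent state, contact events constrain the transition model even when the object is visually occluded.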
Improved imagination quality through visuo-tactile grounding
The authors demonstrate that incorporating tactile sensing improves the world model's ability to maintain object permanence under occlusion and to comply with physical laws of motion during autoregressive rollouts, evaluated across multiple manipulation tasks. A sketch of one way to quantify such rollout quality follows the candidate list below.
[66] CAD model based virtual assembly simulation, planning and training
[67] A sequential group VAE for robot learning of haptic representations
[68] Toward a computational model of constraint-driven exploration and haptic object identification
[69] The tactile continuity illusion.
[70] Reality skins: Creating immersive and tactile virtual environments
[71] Deformation Control of a Deformable Object Based on Visual and Tactile Feedback
[72] A physical constraint on perceptual learning: tactile spatial acuity improves with training to a limit set by finger size.
[73] Michael Zillich, Johann Prankl, Markus Vincze, Yasemin
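The report does not specify the paper's evaluation protocol, but a standard way to measure the claimed imagination quality is per-step prediction error over autoregressive rollouts: error growth over the horizon reveals physics drift, and a flat error curve under visual occlusion is one proxy for object permanence. The sketch below assumes the VisuoTactileWorldModel defined above and ground-truth (rgb, tactile, action) sequences; the latent-MSE metric is illustrative, not the paper's metric.

```python
# Minimal sketch of a rollout-quality evaluation, assuming the
# VisuoTactileWorldModel sketch above; the metric is illustrative.
import torch

@torch.no_grad()
def rollout_error(model, rgb_seq, tactile_seq, actions):
    """Per-step latent prediction error over an autoregressive rollout.

    rgb_seq:     (B, T+1, 3, H, W) exocentric frames
    tactile_seq: (B, T+1, 3, h, w) fingertip tactile images
    actions:     (B, T, action_dim)
    """
    Tp1 = rgb_seq.shape[1]
    z0 = model.encode(rgb_seq[:, 0], tactile_seq[:, 0])
    pred = model.rollout(z0, actions)  # (B, T, latent_dim)
    # Encode the ground-truth future as the target latent trajectory.
    target = torch.stack(
        [model.encode(rgb_seq[:, t], tactile_seq[:, t]) for t in range(1, Tp1)],
        dim=1,
    )
    return (pred - target).pow(2).mean(dim=(0, 2))  # (T,) per-step MSE
```

Comparing this curve for a vision-only model against the visuo-tactile model, on sequences where the object passes behind an occluder, would operationalize the object-permanence claim.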
More reliable zero-shot planning on real robots
The authors show that the improved contact perception from VT-WM translates to better planning performance, achieving higher success rates in real-robot experiments, particularly in multi-step contact-rich manipulation tasks.
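The report does not describe the planner itself. A common way a learned world model supports zero-shot planning is sampling-based model-predictive control: sample candidate action sequences, imagine their outcomes with the model, score them against a goal, and execute the best first action before replanning. The sketch below assumes the VisuoTactileWorldModel defined earlier and a goal specified as an image pair encoded into a target latent; random-shooting MPC and the latent-distance cost are baseline assumptions, not necessarily the paper's planner.

```python
# Minimal sketch of zero-shot planning via random-shooting MPC over the
# learned world model (a common baseline; the paper's planner may differ).
import torch

@torch.no_grad()
def plan(model, rgb, tactile, goal_rgb, goal_tactile,
         horizon=10, n_samples=256, action_dim=7):
    z0 = model.encode(rgb, tactile)                # (1, latent_dim)
    z_goal = model.encode(goal_rgb, goal_tactile)  # (1, latent_dim)
    # Sample candidate action sequences and imagine their outcomes.
    candidates = torch.randn(n_samples, horizon, action_dim)
    rollouts = model.rollout(z0.expand(n_samples, -1), candidates)
    # Cost: distance between the final imagined latent and the goal latent.
    cost = (rollouts[:, -1] - z_goal).pow(2).sum(dim=-1)
    best = candidates[cost.argmin()]
    return best[0]  # execute the first action, then replan (MPC)
```

Because no policy is trained for a specific task, planning of this form is zero-shot: task-specific behavior comes entirely from the goal encoding and the model's imagined contact dynamics.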