TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions
Overview
Overall Novelty Assessment
The paper introduces free-form hand-object interaction generation, extending beyond fixed grasping patterns to diverse manipulations like pushing and poking. It resides in the Contact-Guided and Constraint-Based Diffusion leaf, which contains three papers including TOUCH itself. This leaf sits within the broader Diffusion-Based Interaction Generation branch, indicating a moderately populated research direction focused on incorporating explicit physical constraints into diffusion models. The taxonomy reveals this is an active but not overcrowded area, with sibling leaves exploring dual-branch architectures and temporal decomposition strategies.
Within the taxonomy, TOUCH's neighboring leaves are Dual-Branch and Modular Diffusion Architectures and Staged and Temporal Diffusion Processes, both of which address complementary aspects of interaction synthesis. The broader Interaction Synthesis Approaches branch encompasses alternative paradigms such as LLM-based token generation and joint-level kinematic modeling. The WildO2 dataset contribution connects to the Data Collection and Annotation branch, specifically Video-Based Dataset Construction, which contains only one other paper. This positioning suggests the work bridges generative-modeling innovations with data-infrastructure needs at a relatively underexplored intersection.
Among the thirty candidates examined, none clearly refutes the three core contributions. For the free-form interaction task formulation, ten candidates were examined without refutation, suggesting novelty in extending beyond stability-focused grasping. For the WildO2 dataset construction pipeline, ten candidates likewise showed no overlapping prior work, though the limited search scope means comprehensive video-based HOI datasets may exist outside this sample. For the TOUCH framework's multi-level diffusion architecture, ten candidates were examined without refutation, indicating that the specific combination of contact guidance and fine-grained semantic control appears distinctive within the examined literature.
Based on the top-thirty semantic matches and taxonomy structure, the work appears to occupy a relatively sparse intersection of contact-aware diffusion and diverse interaction modeling. The analysis covers diffusion-based synthesis methods and related dataset construction efforts but does not exhaustively survey all video-based HOI datasets or alternative generative paradigms. The absence of refutations across contributions suggests meaningful novelty within the examined scope, though the limited candidate pool precludes definitive claims about the broader literature landscape.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose a new task that extends hand-object interaction generation beyond traditional grasp-centric approaches to encompass diverse non-grasping manipulations such as pushing, poking, and rotating. This task emphasizes fine-grained semantic control and physical plausibility while capturing the rich diversity of daily interactions.
The authors build WildO2, a large-scale 3D hand-object interaction dataset collected from in-the-wild videos. The dataset includes diverse non-grasping interactions with fine-grained semantic annotations, constructed through an automated pipeline that recovers 3D interactions from internet videos.
The authors introduce TOUCH, a three-stage generation framework featuring explicit contact modeling, multi-level diffusion with coarse-to-fine semantic control, and physical constraint refinement. This framework enables the generation of diverse, controllable, and physically plausible hand-object interactions guided by fine-grained textual descriptions.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[9] DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions
[18] HOIDiNi: Human-Object Interaction through Diffusion Noise Optimization
Contribution Analysis
Detailed comparisons for each claimed contribution
Free-form hand-object interaction generation task
The authors propose a new task that extends hand-object interaction generation beyond traditional grasp-centric approaches to encompass diverse non-grasping manipulations such as pushing, poking, and rotating. This task emphasizes fine-grained semantic control and physical plausibility while capturing the rich diversity of daily interactions.
[9] DiffH2O: Diffusion-Based Synthesis of Hand-Object Interactions from Textual Descriptions
[55] Predictive Visuo-Tactile Interactive Perception Framework for Object Properties Inference
[56] Trends and Challenges in Robot Manipulation
[57] O2O-Afford: Annotation-Free Large-Scale Object-Object Affordance Learning
[58] Push to Know! Visuo-Tactile Based Active Object Parameter Inference with Dual Differentiable Filtering
[59] Learning from Human Videos for Robotic Manipulation
[60] MultiSCOPE: Disambiguating In-Hand Object Poses with Proprioception and Sequential Interactions
[61] Push-Grasping with Dexterous Hands: Mechanics and a Method
[62] Hand-Object Interaction: From Grasping to Using
[63] Experimental Evaluation of Precise Placement with Pushing Primitive Based on Cartesian Force Control
WildO2 dataset with automated construction pipeline
The authors build WildO2, a large-scale 3D hand-object interaction dataset collected from in-the-wild videos. The dataset includes diverse non-grasping interactions with fine-grained semantic annotations, constructed through an automated pipeline that recovers 3D interactions from internet videos.
[40] Affordance Diffusion: Synthesizing Hand-Object Interactions
[41] Fine-Grained Egocentric Hand-Object Segmentation: Dataset, Model, and Applications
[42] HOI4D: A 4D Egocentric Dataset for Category-Level Human-Object Interaction
[43] Understanding Human Hands in Contact at Internet Scale
[44] H2O: Two Hands Manipulating Objects for First Person Interaction Recognition
[45] H+O: Unified Egocentric Recognition of 3D Hand-Object Poses and Interactions
[46] HOT3D: Hand and Object Tracking in 3D from Egocentric Multi-View Videos
[47] HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction
[48] HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics
[49] RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos
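The automated construction pipeline described above (detecting interaction segments in internet videos, recovering 3D hand-object state, and attaching fine-grained semantic labels) can be sketched as a staged dataset builder. The paper does not publish this interface; every function, field, and threshold below is a hypothetical stand-in for the learned components, shown only to make the stage boundaries concrete.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    video_id: str
    frames: list  # per-frame dicts; here only a detector "score" field


@dataclass
class Annotation:
    video_id: str
    hand_pose_3d: list
    object_mesh_id: str
    interaction_label: str


def detect_interaction_frames(clip):
    # Stand-in for a hand-object contact detector: keep frames whose
    # (hypothetical) detection score clears a threshold.
    return [f for f in clip.frames if f["score"] > 0.5]


def recover_3d(frames):
    # Stand-in for monocular 3D hand/object reconstruction:
    # emit a placeholder 21-joint hand pose per kept frame.
    return [{"joints": [[0.0, 0.0, 0.0]] * 21} for _ in frames]


def label_interaction(frames):
    # Stand-in for fine-grained semantic labeling (e.g. "push", "poke"):
    # a toy rule based on segment length.
    return "push" if len(frames) > 2 else "touch"


def build_dataset(clips):
    annotations = []
    for clip in clips:
        frames = detect_interaction_frames(clip)
        if not frames:
            continue  # no detected interaction: drop the clip
        annotations.append(Annotation(
            video_id=clip.video_id,
            hand_pose_3d=recover_3d(frames),
            object_mesh_id=f"{clip.video_id}_obj",
            interaction_label=label_interaction(frames),
        ))
    return annotations


clips = [
    Clip("vid_a", [{"score": 0.9}, {"score": 0.8}, {"score": 0.7}]),
    Clip("vid_b", [{"score": 0.1}]),  # filtered out entirely
]
dataset = build_dataset(clips)
print(len(dataset))  # → 1
```

The point of the sketch is the filter-reconstruct-label decomposition, which lets each learned component be swapped or audited independently; the actual WildO2 pipeline components are not specified at this level of detail in the material above.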
TOUCH framework for controllable HOI generation
The authors introduce TOUCH, a three-stage generation framework featuring explicit contact modeling, multi-level diffusion with coarse-to-fine semantic control, and physical constraint refinement. This framework enables the generation of diverse, controllable, and physically plausible hand-object interactions guided by fine-grained textual descriptions.
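The three-stage structure described above (contact modeling, contact-conditioned diffusion, physical refinement) can be illustrated with a minimal numerical sketch. This is not the authors' implementation: the contact scorer, the single-pass guided denoising loop, and the norm-projection "physics" step are all simplified stand-ins chosen only to show how the stages compose.

```python
import numpy as np

rng = np.random.default_rng(0)


def predict_contact_map(object_points, text_embedding):
    # Stage 1 (illustrative): score each object point's contact likelihood.
    # A stand-in for learned text-conditioned contact modeling.
    scores = object_points @ text_embedding[:3]
    return 1.0 / (1.0 + np.exp(-scores))  # sigmoid -> [0, 1]


def diffusion_sample(contact_map, steps=10):
    # Stage 2 (illustrative): iteratively denoise hand joint positions,
    # conditioned on the contact map. Coarse-to-fine semantic control is
    # abstracted into a single guidance term pulling toward a target.
    x = rng.normal(size=(21, 3))  # 21 hand joints in 3D, from pure noise
    target = np.full((21, 3), contact_map.mean())
    for t in range(steps):
        noise_scale = 1.0 - (t + 1) / steps  # noise decays over steps
        x = x + 0.3 * (target - x) + noise_scale * 0.01 * rng.normal(size=x.shape)
    return x


def refine_physics(hand_joints, max_norm=2.0):
    # Stage 3 (illustrative): project the sample onto a feasible set,
    # standing in for penetration/contact constraint refinement.
    norms = np.linalg.norm(hand_joints, axis=1, keepdims=True)
    scale = np.minimum(1.0, max_norm / np.maximum(norms, 1e-8))
    return hand_joints * scale


def generate(object_points, text_embedding):
    contact = predict_contact_map(object_points, text_embedding)
    sample = diffusion_sample(contact)
    return refine_physics(sample)


points = rng.normal(size=(128, 3))   # placeholder object point cloud
text = rng.normal(size=(8,))         # placeholder text embedding
hand = generate(points, text)
print(hand.shape)  # → (21, 3)
```

The separation mirrors the claimed design: the refinement stage operates purely on the diffusion output, so physical constraints can be enforced without retraining the generative stages.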