TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: hand-object interaction, 3D generation
Abstract:

Hand-object interaction (HOI) is fundamental to how humans express intent. Existing HOI generation research is predominantly confined to fixed grasping patterns, where control is tied to physical priors such as force closure or to generic intent instructions, even when these are expressed through elaborate language. Such overly general conditioning imposes a strong inductive bias toward stable grasps and thus fails to capture the diversity of daily HOI. To address these limitations, we introduce Free-Form HOI Generation, which aims to generate controllable, diverse, and physically plausible HOI conditioned on fine-grained intent, extending HOI from grasping to free-form interactions such as pushing, poking, and rotating. To support this task, we construct WildO2, an in-the-wild 3D HOI dataset of diverse interactions derived from internet videos. It contains 4.4k unique interactions across 92 intents and 403 object categories, each with detailed semantic annotations. Building on this dataset, we propose TOUCH, a three-stage framework centered on a multi-level diffusion model that enables fine-grained semantic control to generate versatile hand poses beyond grasping priors. This process leverages explicit contact modeling for conditioning and is subsequently refined with contact-consistency and physical constraints to ensure realism. Comprehensive experiments demonstrate our method's ability to generate controllable, diverse, and physically plausible hand interactions representative of daily activities.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work; the current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces free-form hand-object interaction generation, extending beyond fixed grasping patterns to diverse manipulations like pushing and poking. It resides in the Contact-Guided and Constraint-Based Diffusion leaf, which contains three papers including TOUCH itself. This leaf sits within the broader Diffusion-Based Interaction Generation branch, indicating a moderately populated research direction focused on incorporating explicit physical constraints into diffusion models. The taxonomy reveals this is an active but not overcrowded area, with sibling leaves exploring dual-branch architectures and temporal decomposition strategies.

The taxonomy structure shows TOUCH's leaf neighbors include Dual-Branch and Modular Diffusion Architectures and Staged and Temporal Diffusion Processes, both addressing complementary aspects of interaction synthesis. The broader Interaction Synthesis Approaches branch encompasses alternative paradigms like LLM-based token generation and joint-level kinematic modeling. The WildO2 dataset contribution connects to the Data Collection and Annotation branch, specifically Video-Based Dataset Construction, which contains only one other paper. This positioning suggests the work bridges generative modeling innovations with data infrastructure needs in a relatively underexplored intersection.

Among the thirty candidates examined, none clearly refutes the three core contributions. For the free-form interaction task formulation, ten candidates were examined with zero refutations, suggesting novelty in extending beyond stability-focused grasping. For the WildO2 dataset construction pipeline, the ten examined candidates likewise showed no overlapping prior work, though the limited search scope means comprehensive video-based HOI datasets may exist outside this sample. For the TOUCH framework's multi-level diffusion architecture, ten candidates were examined without refutation, indicating that the specific combination of contact guidance and fine-grained semantic control appears distinctive within the examined literature.

Based on the top-thirty semantic matches and taxonomy structure, the work appears to occupy a relatively sparse intersection of contact-aware diffusion and diverse interaction modeling. The analysis covers diffusion-based synthesis methods and related dataset construction efforts but does not exhaustively survey all video-based HOI datasets or alternative generative paradigms. The absence of refutations across contributions suggests meaningful novelty within the examined scope, though the limited candidate pool precludes definitive claims about the broader literature landscape.

Taxonomy

Core-task Taxonomy Papers: 39
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: text-guided controllable generation of free-form hand-object interactions.

The field has organized itself around several complementary branches that address different facets of synthesizing realistic hand-object interactions from natural language descriptions. Interaction Synthesis Approaches encompasses the algorithmic strategies—ranging from diffusion-based methods like Hoidiffusion[3] and Diffh2o[9] to transformer and state-space architectures such as Mamba HOI[17]—that generate plausible motion sequences. Interaction Representation and Decomposition focuses on how to encode and structure the problem, often breaking interactions into contact patterns, grasp phases, or temporal stages, as seen in works like Chainhoi[5]. Specialized Interaction Contexts targets domain-specific scenarios such as robotic handovers or egocentric manipulation, while Data Collection and Annotation and Affordance and Contact Modeling provide the foundational resources and geometric reasoning needed to ground these generations in physical plausibility. Robotic Manipulation Applications bridges the gap to real-world deployment, and Related Motion Generation Tasks situates this work within the broader landscape of human motion synthesis.

Within the diffusion-based synthesis branch, a particularly active line of research emphasizes contact-guided and constraint-based generation to ensure physical realism and fine-grained control. TOUCH[0] exemplifies this direction by incorporating explicit contact constraints into the diffusion process, enabling more precise control over where and how the hand engages with objects. This approach contrasts with earlier diffusion methods like Hoidiffusion[3], which may rely more heavily on learned priors without explicit geometric guidance, and complements recent efforts such as HOIDiNi[18] that explore alternative constraint formulations.
The trade-off centers on balancing generative flexibility with physical fidelity: purely data-driven diffusion can produce diverse outputs but may struggle with rare or geometrically intricate interactions, whereas contact-aware methods like TOUCH[0] sacrifice some variability to maintain tighter adherence to physical plausibility. Open questions remain around scalability to complex multi-object scenes and the integration of higher-level semantic reasoning from text, as explored in works like HOIGPT[16] and Text2HOI[24].
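The contact-guided diffusion idea discussed above can be made concrete with a toy sampler. The sketch below is a generic classifier-guidance-style construction, not TOUCH's actual model: the denoiser is a stand-in that predicts zero noise, the noise schedule is fixed, and the contact loss simply pulls hand keypoints toward assigned surface points. All names and parameters here are illustrative assumptions.

```python
import numpy as np

def contact_penalty_grad(x, contacts):
    """Gradient of 0.5 * ||x - contacts||^2: pulls each hand keypoint
    toward its assigned contact point on the object surface."""
    return x - contacts

def guided_reverse_step(x_t, t, eps_pred, alpha, alpha_bar, contacts,
                        guidance_weight=0.1, rng=None):
    """One DDPM-style reverse step whose posterior mean is steered down
    the contact-loss gradient (classifier-guidance style)."""
    mean = (x_t - (1.0 - alpha) / np.sqrt(1.0 - alpha_bar) * eps_pred) / np.sqrt(alpha)
    mean = mean - guidance_weight * contact_penalty_grad(mean, contacts)
    if t > 0 and rng is not None:
        mean = mean + np.sqrt(1.0 - alpha) * rng.standard_normal(x_t.shape)
    return mean

def sample(contacts, n_keypoints=5, steps=50, seed=0):
    """Run the guided reverse chain from Gaussian noise with a stand-in
    denoiser (eps_pred = 0), so only the contact guidance shapes the result."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_keypoints, 3))
    for t in reversed(range(steps)):
        eps_pred = np.zeros_like(x)  # stand-in for a learned denoiser
        x = guided_reverse_step(x, t, eps_pred, alpha=0.999, alpha_bar=0.5,
                                contacts=contacts, rng=None)
    return x
```

With the denoiser ablated, each step contracts the sample toward the contact targets, which is exactly the trade-off noted above: guidance tightens physical adherence at the cost of some generative variability.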

Claimed Contributions

Free-form hand-object interaction generation task

The authors propose a new task that extends hand-object interaction generation beyond traditional grasp-centric approaches to encompass diverse non-grasping manipulations such as pushing, poking, and rotating. This task emphasizes fine-grained semantic control and physical plausibility while capturing the rich diversity of daily interactions.

10 retrieved papers

WildO2 dataset with automated construction pipeline

The authors build WildO2, a large-scale 3D hand-object interaction dataset collected from in-the-wild videos. The dataset includes diverse non-grasping interactions with fine-grained semantic annotations, constructed through an automated pipeline that recovers 3D interactions from internet videos.

10 retrieved papers

TOUCH framework for controllable HOI generation

The authors introduce TOUCH, a three-stage generation framework featuring explicit contact modeling, multi-level diffusion with coarse-to-fine semantic control, and physical constraint refinement. This framework enables the generation of diverse, controllable, and physically plausible hand-object interactions guided by fine-grained textual descriptions.

10 retrieved papers
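The three-stage structure described above (explicit contact modeling, contact-conditioned generation, physical refinement) can be sketched as a minimal pipeline. Every function below is an illustrative stand-in under assumed interfaces: a per-point contact score against an intent embedding projected to 3-D, a noisy-descent surrogate for the diffusion stage, and a nearest-point penetration fix. None of this reflects TOUCH's actual implementation.

```python
import numpy as np

def predict_contact_map(object_points, intent_embedding):
    """Stage 1 (illustrative): score each object point against the intent
    embedding and squash to a per-point contact probability."""
    scores = object_points @ intent_embedding
    return 1.0 / (1.0 + np.exp(-scores))

def sample_hand_root(contact_map, object_points, steps=100, seed=0):
    """Stage 2 (illustrative): a noisy-descent stand-in for conditional
    diffusion -- drive the hand root toward the contact-weighted centroid."""
    rng = np.random.default_rng(seed)
    target = (contact_map[:, None] * object_points).sum(0) / contact_map.sum()
    x = rng.standard_normal(3)
    for _ in range(steps):
        x = x + 0.2 * (target - x) + 0.005 * rng.standard_normal(3)
    return x

def refine_physics(hand_root, object_points, min_dist=0.05):
    """Stage 3 (illustrative): if the hand root penetrates within min_dist
    of the nearest object point, push it back out along that direction."""
    d = np.linalg.norm(object_points - hand_root, axis=1)
    i = int(d.argmin())
    if d[i] < min_dist:
        direction = (hand_root - object_points[i]) / max(d[i], 1e-8)
        hand_root = object_points[i] + min_dist * direction
    return hand_root
```

The point of the skeleton is the data flow, matching the claimed framework's staging: contact evidence conditions generation, and a separate refinement pass enforces physical constraints after sampling.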

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
