TOUCH: Text-guided Controllable Generation of Free-Form Hand-Object Interactions

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: hand-object interaction, 3D generation
Abstract:

Hand-object interaction (HOI) is fundamental to how humans express intent. Existing HOI generation research is predominantly confined to fixed grasping patterns, where control is tied to physical priors such as force closure or to generic intent instructions, even when these are expressed through elaborate language. Such overly general conditioning imposes a strong inductive bias toward stable grasps and thus fails to capture the diversity of daily HOI. To address these limitations, we introduce Free-Form HOI Generation, which aims to generate controllable, diverse, and physically plausible HOI conditioned on fine-grained intent, extending HOI from grasping to free-form interactions such as pushing, poking, and rotating. To support this task, we construct WildO2, an in-the-wild 3D HOI dataset of diverse interactions derived from internet videos. It contains 4.4k unique interactions across 92 intents and 403 object categories, each with detailed semantic annotations. Building on this dataset, we propose TOUCH, a three-stage framework centered on a multi-level diffusion model that enables fine-grained semantic control to generate versatile hand poses beyond grasping priors. This process leverages explicit contact modeling for conditioning and is subsequently refined with contact-consistency and physical constraints to ensure realism. Comprehensive experiments demonstrate our method's ability to generate controllable, diverse, and physically plausible hand interactions representative of daily activities.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work; the current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces free-form hand-object interaction generation, extending beyond fixed grasping patterns to diverse manipulations like pushing and poking. It resides in the Contact-Guided and Constraint-Based Diffusion leaf, which contains three papers including TOUCH itself. This leaf sits within the broader Diffusion-Based Interaction Generation branch, indicating a moderately populated research direction focused on incorporating explicit physical constraints into diffusion models. The taxonomy reveals this is an active but not overcrowded area, with sibling leaves exploring dual-branch architectures and temporal decomposition strategies.

The taxonomy structure shows TOUCH's leaf neighbors include Dual-Branch and Modular Diffusion Architectures and Staged and Temporal Diffusion Processes, both addressing complementary aspects of interaction synthesis. The broader Interaction Synthesis Approaches branch encompasses alternative paradigms like LLM-based token generation and joint-level kinematic modeling. The WildO2 dataset contribution connects to the Data Collection and Annotation branch, specifically Video-Based Dataset Construction, which contains only one other paper. This positioning suggests the work bridges generative modeling innovations with data infrastructure needs in a relatively underexplored intersection.

Among the thirty candidates examined, none clearly refutes the three core contributions. For the free-form interaction task formulation, ten candidates were examined with zero refutations, suggesting novelty in extending beyond stability-focused grasping. For the WildO2 dataset construction pipeline, the ten examined candidates likewise showed no overlapping prior work, though the limited search scope means comprehensive video-based HOI datasets may exist outside this sample. For the TOUCH framework's multi-level diffusion architecture, ten candidates were examined without refutation, indicating that the specific combination of contact guidance and fine-grained semantic control appears distinctive within the examined literature.

Based on the top-thirty semantic matches and taxonomy structure, the work appears to occupy a relatively sparse intersection of contact-aware diffusion and diverse interaction modeling. The analysis covers diffusion-based synthesis methods and related dataset construction efforts but does not exhaustively survey all video-based HOI datasets or alternative generative paradigms. The absence of refutations across contributions suggests meaningful novelty within the examined scope, though the limited candidate pool precludes definitive claims about the broader literature landscape.

Taxonomy

Core-task Taxonomy Papers: 39
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: text-guided controllable generation of free-form hand-object interactions.

The field has organized itself around several complementary branches that address different facets of synthesizing realistic hand-object interactions from natural language descriptions. Interaction Synthesis Approaches encompasses the algorithmic strategies—ranging from diffusion-based methods like Hoidiffusion[3] and Diffh2o[9] to transformer and state-space architectures such as Mamba HOI[17]—that generate plausible motion sequences. Interaction Representation and Decomposition focuses on how to encode and structure the problem, often breaking interactions into contact patterns, grasp phases, or temporal stages, as seen in works like Chainhoi[5]. Specialized Interaction Contexts targets domain-specific scenarios such as robotic handovers or egocentric manipulation, while Data Collection and Annotation and Affordance and Contact Modeling provide the foundational resources and geometric reasoning needed to ground these generations in physical plausibility. Robotic Manipulation Applications bridges the gap to real-world deployment, and Related Motion Generation Tasks situates this work within the broader landscape of human motion synthesis.

Within the diffusion-based synthesis branch, a particularly active line of research emphasizes contact-guided and constraint-based generation to ensure physical realism and fine-grained control. TOUCH[0] exemplifies this direction by incorporating explicit contact constraints into the diffusion process, enabling more precise control over where and how the hand engages with objects. This approach contrasts with earlier diffusion methods like Hoidiffusion[3], which may rely more heavily on learned priors without explicit geometric guidance, and complements recent efforts such as HOIDiNi[18] that explore alternative constraint formulations.
The trade-off centers on balancing generative flexibility with physical fidelity: purely data-driven diffusion can produce diverse outputs but may struggle with rare or geometrically intricate interactions, whereas contact-aware methods like TOUCH[0] sacrifice some variability to maintain tighter adherence to physical plausibility. Open questions remain around scalability to complex multi-object scenes and the integration of higher-level semantic reasoning from text, as explored in works like HOIGPT[16] and Text2HOI[24].
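The contact-guided diffusion idea discussed above can be made concrete with a toy sampler. The sketch below is a generic classifier-guidance-style construction, not TOUCH's actual model: the denoiser is a stand-in that predicts zero noise, the noise schedule is fixed, and the contact loss simply pulls hand keypoints toward assigned surface points. All names and parameters here are illustrative assumptions.

```python
import numpy as np

def contact_penalty_grad(x, contacts):
    """Gradient of 0.5 * ||x - contacts||^2: pulls each hand keypoint
    toward its assigned contact point on the object surface."""
    return x - contacts

def guided_reverse_step(x_t, t, eps_pred, alpha, alpha_bar, contacts,
                        guidance_weight=0.1, rng=None):
    """One DDPM-style reverse step whose posterior mean is steered down
    the contact-loss gradient (classifier-guidance style)."""
    mean = (x_t - (1.0 - alpha) / np.sqrt(1.0 - alpha_bar) * eps_pred) / np.sqrt(alpha)
    mean = mean - guidance_weight * contact_penalty_grad(mean, contacts)
    if t > 0 and rng is not None:
        mean = mean + np.sqrt(1.0 - alpha) * rng.standard_normal(x_t.shape)
    return mean

def sample(contacts, n_keypoints=5, steps=50, seed=0):
    """Run the guided reverse chain from Gaussian noise with a stand-in
    denoiser (eps_pred = 0), so only the contact guidance shapes the result."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_keypoints, 3))
    for t in reversed(range(steps)):
        eps_pred = np.zeros_like(x)  # stand-in for a learned denoiser
        x = guided_reverse_step(x, t, eps_pred, alpha=0.999, alpha_bar=0.5,
                                contacts=contacts, rng=None)
    return x
```

With the denoiser ablated, each step contracts the sample toward the contact targets, which is exactly the trade-off noted above: guidance tightens physical adherence at the cost of some generative variability.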

Claimed Contributions

Free-form hand-object interaction generation task

The authors propose a new task that extends hand-object interaction generation beyond traditional grasp-centric approaches to encompass diverse non-grasping manipulations such as pushing, poking, and rotating. This task emphasizes fine-grained semantic control and physical plausibility while capturing the rich diversity of daily interactions.

10 retrieved papers

WildO2 dataset with automated construction pipeline

The authors build WildO2, a large-scale 3D hand-object interaction dataset collected from in-the-wild videos. The dataset includes diverse non-grasping interactions with fine-grained semantic annotations, constructed through an automated pipeline that recovers 3D interactions from internet videos.

10 retrieved papers

TOUCH framework for controllable HOI generation

The authors introduce TOUCH, a three-stage generation framework featuring explicit contact modeling, multi-level diffusion with coarse-to-fine semantic control, and physical constraint refinement. This framework enables the generation of diverse, controllable, and physically plausible hand-object interactions guided by fine-grained textual descriptions.

10 retrieved papers
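The three-stage structure described above (explicit contact modeling, contact-conditioned generation, physical refinement) can be sketched as a minimal pipeline. Every function below is an illustrative stand-in under assumed interfaces: a per-point contact score against an intent embedding projected to 3-D, a noisy-descent surrogate for the diffusion stage, and a nearest-point penetration fix. None of this reflects TOUCH's actual implementation.

```python
import numpy as np

def predict_contact_map(object_points, intent_embedding):
    """Stage 1 (illustrative): score each object point against the intent
    embedding and squash to a per-point contact probability."""
    scores = object_points @ intent_embedding
    return 1.0 / (1.0 + np.exp(-scores))

def sample_hand_root(contact_map, object_points, steps=100, seed=0):
    """Stage 2 (illustrative): a noisy-descent stand-in for conditional
    diffusion -- drive the hand root toward the contact-weighted centroid."""
    rng = np.random.default_rng(seed)
    target = (contact_map[:, None] * object_points).sum(0) / contact_map.sum()
    x = rng.standard_normal(3)
    for _ in range(steps):
        x = x + 0.2 * (target - x) + 0.005 * rng.standard_normal(3)
    return x

def refine_physics(hand_root, object_points, min_dist=0.05):
    """Stage 3 (illustrative): if the hand root penetrates within min_dist
    of the nearest object point, push it back out along that direction."""
    d = np.linalg.norm(object_points - hand_root, axis=1)
    i = int(d.argmin())
    if d[i] < min_dist:
        direction = (hand_root - object_points[i]) / max(d[i], 1e-8)
        hand_root = object_points[i] + min_dist * direction
    return hand_root
```

The point of the skeleton is the data flow, matching the claimed framework's staging: contact evidence conditions generation, and a separate refinement pass enforces physical constraints after sampling.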

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
