Human-Object Interaction via Automatically Designed VLM-Guided Motion Policy

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Human-object interaction, Character animation, Human motion generation
Abstract:

Human-object interaction (HOI) synthesis is crucial for applications in animation, simulation, and robotics. However, existing approaches either rely on expensive motion capture data or require manual reward engineering, limiting their scalability and generalizability. In this work, we introduce the first unified physics-based HOI framework that leverages Vision-Language Models (VLMs) to enable long-horizon interactions with diverse object types, including static, dynamic, and articulated objects. At its core is VLM-Guided Relative Movement Dynamics (RMD), a fine-grained spatio-temporal bipartite representation that automatically constructs goal states and reward functions for reinforcement learning. By encoding structured relationships between human and object parts, RMD enables VLMs to generate semantically grounded, interaction-aware motion guidance without manual reward tuning. To support our methodology, we present Interplay, a novel dataset with thousands of long-horizon static and dynamic interaction plans. Extensive experiments demonstrate that our framework outperforms existing methods in synthesizing natural, human-like motions across both simple single-task and complex multi-task scenarios. For more details, please refer to our project webpage: https://vlm-rmd.github.io/.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a unified physics-based framework that uses Vision-Language Models to guide reinforcement learning for long-horizon human-object interactions across static, dynamic, and articulated objects. It falls in the 'VLM-Guided Motion Policy Design' leaf under 'Language and Vision-Guided Interaction'; this leaf contains only two papers in total, including this one. That makes it a relatively sparse research direction within the broader taxonomy of 50 papers across 32 leaf nodes, suggesting the specific combination of VLMs with physics-based HOI synthesis is an emerging area rather than a crowded subfield.

The parent branch 'Language and Vision-Guided Interaction' encompasses five leaves addressing different aspects of language-conditioned synthesis: VLM-guided policy design, LLM-driven task planning, text-to-3D generation, language-guided sparse control, and contact-aware text-driven motion. Neighboring branches include 'Physics-Based Motion Imitation and Control' (which emphasizes learning from motion capture without language guidance) and 'Kinematic and Diffusion-Based Synthesis' (which uses generative models rather than reinforcement learning). The taxonomy's scope notes clarify that VLM-guided methods specifically automate reward design, distinguishing them from manual reward engineering approaches in adjacent physics-based leaves.

Among 29 candidates examined through semantic search and citation expansion, none were found to clearly refute any of the three main contributions. The first contribution (unified VLM-physics framework) examined 10 candidates with zero refutations; the second (RMD representation) examined 9 with zero refutations; the third (Interplay dataset) examined 10 with zero refutations. This limited search scope—covering roughly 60% of the taxonomy's total papers—suggests that within the examined literature, the specific integration of VLMs for automatic reward construction in physics-based HOI appears relatively unexplored, though the analysis cannot claim exhaustive coverage of all potentially relevant prior work.

The analysis indicates novelty within the examined scope, particularly in combining VLM-based semantic reasoning with physics simulation for diverse object types. However, the search examined only top-K semantic matches rather than a comprehensive field survey, and the sparse population of the target taxonomy leaf (2 papers) may reflect either genuine novelty or incomplete taxonomy coverage. The contribution-level statistics consistently show no clear refutations, but this should be interpreted as 'no overlapping work found among 29 candidates' rather than definitive proof of absolute novelty across the entire research landscape.
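For concreteness, the retrieval protocol described above (top-K semantic matching followed by citation expansion) can be pictured as in the sketch below. This is a minimal illustration, not the WisPaper implementation: the bag-of-words `embed` stand-in, the `retrieve_candidates` helper, and the toy corpus are all assumptions made for exposition.

```python
# Minimal sketch of the report's retrieval protocol: embed the claimed
# contribution, rank the corpus by cosine similarity, keep the top-K
# matches, then expand the candidate set one hop along citation edges.
# The bag-of-words embedding is a toy stand-in for a real text encoder.
from collections import Counter
import math

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_candidates(claim: str, corpus: dict, citations: dict, k: int = 10) -> set:
    """corpus: {paper_id: abstract}; citations: {paper_id: [cited paper_ids]}."""
    q = embed(claim)
    ranked = sorted(corpus, key=lambda pid: cosine(q, embed(corpus[pid])), reverse=True)
    candidates = set(ranked[:k])
    for pid in ranked[:k]:  # one-hop citation expansion
        candidates.update(citations.get(pid, []))
    return candidates

corpus = {
    "p1": "physics based human object interaction with reinforcement learning",
    "p2": "diffusion models for kinematic human motion synthesis",
    "p3": "vision language models guide reward design for interaction policies",
}
citations = {"p3": ["p1"]}
print(retrieve_candidates("vlm guided reward design for physics based hoi", corpus, citations, k=2))
```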

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: synthesizing physics-based human-object interactions. The field organizes around several complementary branches that address different facets of generating realistic human motions with objects. Physics-Based Motion Imitation and Control emphasizes reinforcement learning and trajectory optimization to produce dynamically stable behaviors, often drawing on reference motion data or learned policies (e.g., Intermimic Universal Control[3], PhysHOI Imitation[5]). Kinematic and Diffusion-Based Synthesis leverages generative models—particularly diffusion frameworks—to sample plausible interaction sequences while balancing kinematic realism with physical constraints (HOI Diff[4], InterDiff Physics[6]). Data-Driven Interaction Modeling and Datasets focuses on curating large-scale collections and benchmarks that capture diverse contact patterns and object affordances. Scene-Aware Interaction Synthesis tackles the challenge of placing and adapting motions within cluttered or geometrically complex environments, while Language and Vision-Guided Interaction explores how high-level instructions or visual cues can steer motion policies. Physical Consistency and Contact Refinement refines generated outputs to satisfy contact mechanics and force balance, and Specialized Interaction Domains targets niche settings such as hand manipulation or cooperative tasks.

Recent work highlights a tension between purely kinematic generation—which can produce visually smooth results quickly—and physics-driven approaches that enforce dynamic feasibility at the cost of greater computational expense. A growing number of studies blend diffusion priors with physics-based post-processing to achieve both diversity and stability (Physics Aware Denoising[12], Physics Driven Generation[13]).

Within the Language and Vision-Guided Interaction branch, VLM Guided Motion[0] exemplifies efforts to integrate vision-language models into motion policy design, enabling more intuitive control through natural language or image-based prompts. This direction contrasts with purely data-driven methods like Full Body HOI[2] or force-centric frameworks such as Force Physics HOI[1], which prioritize contact realism over high-level semantic guidance. By situating language-conditioned policies alongside physics simulators, VLM Guided Motion[0] bridges the gap between user intent and physically grounded execution, a theme echoed by neighboring work on relative movement reasoning (VLM Relative Movement[36]).

Claimed Contributions

Contribution 1: Unified physics-based HOI framework leveraging VLMs for long-horizon interactions

The authors introduce a unified framework that uses Vision-Language Models to enable physics-based synthesis of long-horizon human-object interactions. This framework supports diverse object types (static, dynamic, and articulated) without requiring expensive motion capture data or manual reward engineering.

Retrieved candidate papers: 10
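To make the claim concrete, the loop this contribution describes (a VLM emits a reward specification, which is compiled into a reward function for RL) might look roughly like the sketch below. The `query_vlm` stub, the JSON reward schema, and the exponential distance shaping are illustrative assumptions, not the paper's actual interface.

```python
# Sketch of a VLM-in-the-loop reward pipeline: a task prompt goes to a
# VLM, which returns a structured reward spec; the spec is compiled
# into a callable reward used by the RL trainer. query_vlm is a stub
# standing in for a real VLM call.
import json
import numpy as np

def query_vlm(task_prompt: str) -> str:
    # Hypothetical response: one distance term per human/object part pair.
    return json.dumps({
        "terms": [
            {"type": "distance", "body_part": "right_hand",
             "object_part": "handle", "target": 0.0, "weight": 1.0},
            {"type": "distance", "body_part": "pelvis",
             "object_part": "seat", "target": 0.4, "weight": 0.5},
        ]
    })

def compile_reward(spec_json: str):
    spec = json.loads(spec_json)
    def reward(state: dict) -> float:
        # Each term rewards matching a target distance between two parts.
        total = 0.0
        for term in spec["terms"]:
            d = np.linalg.norm(state[term["body_part"]] - state[term["object_part"]])
            total += term["weight"] * np.exp(-abs(d - term["target"]))
        return float(total)
    return reward

reward_fn = compile_reward(query_vlm("sit on the chair, then open the drawer"))
state = {"right_hand": np.array([0.1, 0.0, 1.0]), "handle": np.array([0.1, 0.0, 1.0]),
         "pelvis": np.array([0.0, 0.0, 0.9]), "seat": np.array([0.0, 0.0, 0.5])}
print(round(reward_fn(state), 3))
```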
Contribution 2: VLM-Guided Relative Movement Dynamics (RMD) representation

The authors propose RMD, a structured spatio-temporal representation that encodes fine-grained relationships between human and object parts. This representation enables VLMs to automatically generate goal states and reward functions for reinforcement learning, eliminating the need for manual reward engineering while supporting both static and dynamic interactions.

Retrieved candidate papers: 9
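Read literally, a "fine-grained spatio-temporal bipartite representation" suggests time-indexed edges between human parts and object parts. The sketch below is one minimal way to encode that; the class names, fields, and relation semantics are guesses for illustration, not the paper's RMD definition.

```python
# Sketch of a spatio-temporal bipartite interaction representation:
# each edge relates one human part to one object part over a time
# window, with a desired relative offset and a contact flag. Field
# names are illustrative, not the paper's RMD schema.
from dataclasses import dataclass, field

@dataclass
class RelativeMovementEdge:
    human_part: str    # e.g. "left_hand"
    object_part: str   # e.g. "cabinet_door_handle"
    t_start: float     # seconds, start of the interaction phase
    t_end: float       # seconds, end of the interaction phase
    rel_offset: tuple  # desired offset in the object part's frame (x, y, z)
    in_contact: bool   # whether contact should be maintained

@dataclass
class RMDPlan:
    task: str
    edges: list = field(default_factory=list)

    def active_edges(self, t: float):
        """Edges that constrain the motion at time t."""
        return [e for e in self.edges if e.t_start <= t <= e.t_end]

plan = RMDPlan(task="open the cabinet door")
plan.edges.append(RelativeMovementEdge(
    "right_hand", "door_handle", 0.0, 1.5, (0.0, 0.0, 0.0), True))
plan.edges.append(RelativeMovementEdge(
    "pelvis", "cabinet_front", 0.0, 3.0, (0.0, -0.6, 0.0), False))
print([e.human_part for e in plan.active_edges(1.0)])
```

Under this reading, a reward term can be derived from each active edge at every timestep, which would be consistent with the automatic reward construction described above.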
Contribution 3: Interplay dataset for long-horizon static and dynamic interaction tasks

The authors present Interplay, a new dataset containing thousands of long-horizon interaction plans spanning both static and dynamic interaction tasks across varied scene contexts. It addresses a gap left by existing datasets, which typically cover static interactions or object rearrangement in isolation.

Retrieved candidate papers: 10
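A dataset of long-horizon plans presumably stores each task as an ordered sequence of interaction stages over static, dynamic, and articulated objects. The JSON layout and loader below are a hypothetical sketch of such an entry; they are not the released Interplay format.

```python
# Hypothetical loader for a long-horizon interaction plan. The schema
# (stages with an action, a target object, an object type, and a
# duration) is an assumption made for illustration; it is not the
# published Interplay format.
import json

PLAN = """
{
  "scene": "kitchen_04",
  "stages": [
    {"action": "walk_to",  "object": "fridge",  "object_type": "articulated", "duration": 2.0},
    {"action": "open",     "object": "fridge",  "object_type": "articulated", "duration": 1.5},
    {"action": "pick_up",  "object": "bottle",  "object_type": "dynamic",     "duration": 1.0},
    {"action": "place_on", "object": "counter", "object_type": "static",      "duration": 1.5}
  ]
}
"""

def load_plan(raw: str):
    plan = json.loads(raw)
    horizon = sum(s["duration"] for s in plan["stages"])
    return plan["stages"], horizon

stages, horizon = load_plan(PLAN)
print(f"{len(stages)} stages, {horizon:.1f}s horizon")
for s in stages:
    print(f'  {s["action"]:<9} {s["object"]:<8} ({s["object_type"]})')
```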

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Per-contribution results (full descriptions are given under 'Claimed Contributions' above):

Contribution 1: Unified physics-based HOI framework leveraging VLMs for long-horizon interactions. 10 candidates examined; no refutations found.

Contribution 2: VLM-Guided Relative Movement Dynamics (RMD) representation. 9 candidates examined; no refutations found.

Contribution 3: Interplay dataset for long-horizon static and dynamic interaction tasks. 10 candidates examined; no refutations found.
