Human-Object Interaction via Automatically Designed VLM-Guided Motion Policy
Overview
Overall Novelty Assessment
The paper proposes a unified physics-based framework that uses Vision-Language Models to guide reinforcement learning for long-horizon human-object interactions across static, dynamic, and articulated objects. It resides in the 'VLM-Guided Motion Policy Design' leaf under 'Language and Vision-Guided Interaction', a leaf that contains only two papers in total (including this one). This is a relatively sparse research direction within the broader taxonomy of 50 papers across 32 leaf nodes, suggesting that the specific combination of VLMs with physics-based HOI synthesis is an emerging area rather than a crowded subfield.
The parent branch 'Language and Vision-Guided Interaction' encompasses five leaves addressing different aspects of language-conditioned synthesis: VLM-guided policy design, LLM-driven task planning, text-to-3D generation, language-guided sparse control, and contact-aware text-driven motion. Neighboring branches include 'Physics-Based Motion Imitation and Control' (which emphasizes learning from motion capture without language guidance) and 'Kinematic and Diffusion-Based Synthesis' (which uses generative models rather than reinforcement learning). The taxonomy's scope notes clarify that VLM-guided methods specifically automate reward design, distinguishing them from manual reward engineering approaches in adjacent physics-based leaves.
Among 29 candidates examined through semantic search and citation expansion, none were found to clearly refute any of the three main contributions. For the first contribution (the unified VLM-physics framework), 10 candidates were examined with zero refutations; for the second (the RMD representation), 9; for the third (the Interplay dataset), 10. This limited search scope, covering roughly 60% of the taxonomy's 50 papers, suggests that within the examined literature the specific integration of VLMs for automatic reward construction in physics-based HOI is relatively unexplored, though the analysis cannot claim exhaustive coverage of all potentially relevant prior work.
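For transparency, the arithmetic behind the coverage figure is a one-line check using only the numbers stated in this report:

```python
# Trivial check of the counts quoted above: the per-contribution candidate
# counts sum to 29, which is about 60% of the 50 papers in the taxonomy.
per_contribution = {"unified framework": 10, "RMD": 9, "Interplay": 10}
total_candidates = sum(per_contribution.values())
taxonomy_size = 50
print(total_candidates, f"{total_candidates / taxonomy_size:.0%}")  # 29 58%
```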
The analysis indicates novelty within the examined scope, particularly in combining VLM-based semantic reasoning with physics simulation for diverse object types. However, the search examined only top-K semantic matches rather than conducting a comprehensive field survey, and the sparse population of the target taxonomy leaf (2 papers) may reflect either genuine novelty or incomplete taxonomy coverage. The contribution-level statistics consistently show no clear refutations, but this should be read as 'no overlapping work found among 29 candidates' rather than as definitive proof of novelty across the entire research landscape.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce a unified framework that uses Vision-Language Models to enable physics-based synthesis of long-horizon human-object interactions. This framework supports diverse object types (static, dynamic, and articulated) without requiring expensive motion capture data or manual reward engineering.
The authors propose RMD, a structured spatio-temporal representation that encodes fine-grained relationships between human and object parts. This representation enables VLMs to automatically generate goal states and reward functions for reinforcement learning, eliminating the need for manual reward engineering while supporting both static and dynamic interactions.
The authors present Interplay, a new dataset containing thousands of long-horizon interaction plans that span both static and dynamic interaction tasks across varied scene contexts. This addresses a gap in existing datasets, which typically cover static interactions or object rearrangement in isolation rather than both.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[36] Human-Object Interaction with Vision-Language Model Guided Relative Movement Dynamics
Contribution Analysis
Detailed comparisons for each claimed contribution
Unified physics-based HOI framework leveraging VLMs for long-horizon interactions
The authors introduce a unified framework that uses Vision-Language Models to enable physics-based synthesis of long-horizon human-object interactions. This framework supports diverse object types (static, dynamic, and articulated) without requiring expensive motion capture data or manual reward engineering.
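To make the claimed pipeline concrete, the following is a minimal sketch of how a VLM-in-the-loop reward-design stage might feed an RL trainer. Every name in it (TaskSpec, query_vlm, build_reward, train_policy) is a hypothetical placeholder inferred from the description above, not the authors' actual API:

```python
# Minimal sketch of a VLM-guided reward-design loop for physics-based HOI.
# All identifiers are hypothetical placeholders, not the paper's API. The
# VLM is assumed to return the source of a Python reward function, matching
# the claim that reward engineering is automated rather than hand-written.
from dataclasses import dataclass

@dataclass
class TaskSpec:
    instruction: str   # e.g. "sit on the chair, then push it to the desk"
    object_type: str   # "static", "dynamic", or "articulated"

def query_vlm(prompt: str) -> str:
    """Placeholder for a VLM call that returns reward-function source code."""
    raise NotImplementedError("wire up a VLM client here")

def build_reward(task: TaskSpec):
    prompt = (
        f"Task: {task.instruction}\n"
        f"Object type: {task.object_type}\n"
        "Write a Python function reward(state) -> float scoring progress."
    )
    source = query_vlm(prompt)
    namespace = {}
    exec(source, namespace)   # executing model output: sandbox in practice
    return namespace["reward"]

def train_policy(env, reward_fn, steps=1_000_000):
    """Placeholder for any standard RL loop (e.g. PPO) in a physics simulator."""
    ...
```

The point the sketch makes is the division of labor the contribution claims: the VLM turns a natural-language task into an executable reward, so neither motion capture data nor hand-tuned reward terms enter the loop.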
[10] Human-object interaction from human-level instructions
[15] Controllable human-object interaction synthesis
[60] OpenHOI: Open-World Hand-Object Interaction Synthesis with Multimodal Large Language Model
[61] AffordanceLLM: Grounding Affordance from Vision Language Models
[62] Sofar: Language-grounded orientation bridges spatial reasoning and object manipulation
[63] Generating Human Motion in 3D Scenes from Text Descriptions
[64] Anyskill: Learning open-vocabulary physical skill for interactive agents
[65] PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large Multimodal Models
[66] Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions
[67] HumanVLA: Towards Vision-Language Directed Object Rearrangement by Physical Humanoid
VLM-Guided Relative Movement Dynamics (RMD) representation
The authors propose RMD, a structured spatio-temporal representation that encodes fine-grained relationships between human and object parts. This representation enables VLMs to automatically generate goal states and reward functions for reinforcement learning, eliminating the need for manual reward engineering while supporting both static and dynamic interactions.
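This report does not reproduce RMD's exact schema, but the description suggests a structure along the following lines: per-phase relative-movement targets between named human and object parts, dense enough for a physics simulator to score at every step. The field names below are illustrative guesses, not the published format:

```python
# Hypothetical sketch of an RMD-style record. Field names are illustrative,
# inferred from the description above; they are not the paper's schema.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class RelativeMovement:
    human_part: str                              # e.g. "right_hand"
    object_part: str                             # e.g. "drawer_handle"
    relation: str                                # e.g. "approach", "contact"
    target_offset: Tuple[float, float, float]    # desired relative position (m)
    weight: float = 1.0                          # share of the phase reward

@dataclass
class Phase:
    name: str                                    # e.g. "reach", "pull_open"
    movements: List[RelativeMovement]
    success_threshold: float = 0.05              # meters; phase ends below this

def phase_reward(phase: Phase, relative_positions: dict) -> float:
    """Dense reward: negative weighted distance to each relative target.

    `relative_positions` maps (human_part, object_part) pairs to the
    current offset vector as read from the physics simulator.
    """
    total = 0.0
    for m in phase.movements:
        cur = relative_positions[(m.human_part, m.object_part)]
        dist = sum((c - t) ** 2 for c, t in zip(cur, m.target_offset)) ** 0.5
        total -= m.weight * dist
    return total

# Usage: one phase of a "reach for the drawer handle" task.
reach = Phase("reach", [RelativeMovement("right_hand", "drawer_handle",
                                         "approach", (0.0, 0.0, 0.0))])
print(phase_reward(reach, {("right_hand", "drawer_handle"): (0.3, 0.1, 0.0)}))
```

Because the targets are relative rather than absolute, the same phase description would transfer across scenes, which is consistent with the claim that one representation covers both static and dynamic interactions.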
[51] Human-oriented representation learning for robotic manipulation
[52] Task-Oriented Scanpath Prediction with Spatial-Temporal Information in Driving Scenarios
[53] GENFLOWRL: Generative Object-Centric Flow Matching for Reward Shaping in Visual Reinforcement Learning
[54] Deep selective feature learning for action recognition
[55] A Survey on Reinforcement Learning of Vision-Language-Action Models for Robotic Manipulation
[56] LSTM-GCN Hybrid Architecture for Model Predictive Control of Deformable Linear Objects
[57] Learning human utility from video demonstrations for deductive planning in robotics
[58] Teaching Virtual Agents to Perform Complex Spatial-Temporal Activities
[59] Learning Reward Functions for Robotic Manipulation by Observing Humans
Interplay dataset for long-horizon static and dynamic interaction tasks
The authors present Interplay, a new dataset containing thousands of long-horizon interaction plans that span both static and dynamic interaction tasks across varied scene contexts. This addresses a gap in existing datasets, which typically cover static interactions or object rearrangement in isolation rather than both.
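As an illustration only (the published format may differ), a long-horizon Interplay-style plan could be serialized as an ordered list of typed interaction steps, mixing static and dynamic tasks within one scene:

```python
# Illustrative sketch only: the published Interplay format may differ. A
# long-horizon plan is modeled as an ordered list of typed steps, mixing
# static interactions with dynamic object rearrangement in one scene.
example_plan = {
    "plan_id": "livingroom_0001",
    "scene": "living_room",
    "steps": [
        {"type": "static",  "action": "sit",   "object": "sofa"},
        {"type": "dynamic", "action": "carry", "object": "box",
         "target_location": "desk"},
        {"type": "dynamic", "action": "push",  "object": "chair",
         "target_location": "table"},
    ],
}

# Each step would be handed to the policy in sequence; "long-horizon" means
# the policy must chain completions across all steps of the plan.
for step in example_plan["steps"]:
    print(step["type"], step["action"], step["object"])
```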