Human-Object Interaction via Automatically Designed VLM-Guided Motion Policy
Overview
Overall Novelty Assessment
The paper proposes a unified physics-based framework that uses Vision-Language Models to guide reinforcement learning for long-horizon human-object interactions across static, dynamic, and articulated objects. It resides in the 'VLM-Guided Motion Policy Design' leaf under 'Language and Vision-Guided Interaction', a leaf that contains only two papers in total (including this one). This is a relatively sparse research direction within the broader taxonomy of 50 papers across 32 leaf nodes, suggesting that the specific combination of VLMs with physics-based HOI synthesis is an emerging area rather than a crowded subfield.
The parent branch 'Language and Vision-Guided Interaction' encompasses five leaves addressing different aspects of language-conditioned synthesis: VLM-guided policy design, LLM-driven task planning, text-to-3D generation, language-guided sparse control, and contact-aware text-driven motion. Neighboring branches include 'Physics-Based Motion Imitation and Control' (which emphasizes learning from motion capture without language guidance) and 'Kinematic and Diffusion-Based Synthesis' (which uses generative models rather than reinforcement learning). The taxonomy's scope notes clarify that VLM-guided methods specifically automate reward design, distinguishing them from manual reward engineering approaches in adjacent physics-based leaves.
Among 29 candidates examined through semantic search and citation expansion, none were found to clearly refute any of the three main contributions. For the first contribution (the unified VLM-physics framework), 10 candidates were examined with zero refutations; for the second (the RMD representation), 9; for the third (the Interplay dataset), 10. This limited search scope, covering roughly 60% of the taxonomy's 50 papers, suggests that within the examined literature the specific integration of VLMs for automatic reward construction in physics-based HOI is relatively unexplored, though the analysis cannot claim exhaustive coverage of all potentially relevant prior work.
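For transparency, the arithmetic behind the coverage figure is a one-line check using only the numbers stated in this report:

```python
# Trivial check of the counts quoted above: the per-contribution candidate
# counts sum to 29, which is about 60% of the 50 papers in the taxonomy.
per_contribution = {"unified framework": 10, "RMD": 9, "Interplay": 10}
total_candidates = sum(per_contribution.values())
taxonomy_size = 50
print(total_candidates, f"{total_candidates / taxonomy_size:.0%}")  # 29 58%
```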
The analysis indicates novelty within the examined scope, particularly in combining VLM-based semantic reasoning with physics simulation for diverse object types. However, the search examined only top-K semantic matches rather than conducting a comprehensive field survey, and the sparse population of the target taxonomy leaf (2 papers) may reflect either genuine novelty or incomplete taxonomy coverage. The contribution-level statistics consistently show no clear refutations, but this should be read as 'no overlapping work found among 29 candidates' rather than as definitive proof of novelty across the entire research landscape.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce a unified framework that uses Vision-Language Models to enable physics-based synthesis of long-horizon human-object interactions. This framework supports diverse object types (static, dynamic, and articulated) without requiring expensive motion capture data or manual reward engineering.
The authors propose RMD, a structured spatio-temporal representation that encodes fine-grained relationships between human and object parts. This representation enables VLMs to automatically generate goal states and reward functions for reinforcement learning, eliminating the need for manual reward engineering while supporting both static and dynamic interactions.
The authors present Interplay, a new dataset containing thousands of long-horizon interaction plans that span both static and dynamic interaction tasks across varied scene contexts. This addresses a gap in existing datasets, which typically cover static interactions or object rearrangement in isolation rather than both.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[36] Human-Object Interaction with Vision-Language Model Guided Relative Movement Dynamics
Contribution Analysis
Detailed comparisons for each claimed contribution
Unified physics-based HOI framework leveraging VLMs for long-horizon interactions
The authors introduce a unified framework that uses Vision-Language Models to enable physics-based synthesis of long-horizon human-object interactions. This framework supports diverse object types (static, dynamic, and articulated) without requiring expensive motion capture data or manual reward engineering.
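To make the claimed pipeline concrete, the following is a minimal sketch of how a VLM-in-the-loop reward-design stage might feed an RL trainer. Every name in it (TaskSpec, query_vlm, build_reward, train_policy) is a hypothetical placeholder inferred from the description above, not the authors' actual API:

```python
# Minimal sketch of a VLM-guided reward-design loop for physics-based HOI.
# All identifiers are hypothetical placeholders, not the paper's API. The
# VLM is assumed to return the source of a Python reward function, matching
# the claim that reward engineering is automated rather than hand-written.
from dataclasses import dataclass

@dataclass
class TaskSpec:
    instruction: str   # e.g. "sit on the chair, then push it to the desk"
    object_type: str   # "static", "dynamic", or "articulated"

def query_vlm(prompt: str) -> str:
    """Placeholder for a VLM call that returns reward-function source code."""
    raise NotImplementedError("wire up a VLM client here")

def build_reward(task: TaskSpec):
    prompt = (
        f"Task: {task.instruction}\n"
        f"Object type: {task.object_type}\n"
        "Write a Python function reward(state) -> float scoring progress."
    )
    source = query_vlm(prompt)
    namespace = {}
    exec(source, namespace)   # executing model output: sandbox in practice
    return namespace["reward"]

def train_policy(env, reward_fn, steps=1_000_000):
    """Placeholder for any standard RL loop (e.g. PPO) in a physics simulator."""
    ...
```

The point the sketch makes is the division of labor the contribution claims: the VLM turns a natural-language task into an executable reward, so neither motion capture data nor hand-tuned reward terms enter the loop.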
[10] Human-object interaction from human-level instructions
[15] Controllable human-object interaction synthesis
[60] OpenHOI: Open-World Hand-Object Interaction Synthesis with Multimodal Large Language Model
[61] AffordanceLLM: Grounding Affordance from Vision Language Models
[62] Sofar: Language-grounded orientation bridges spatial reasoning and object manipulation
[63] Generating Human Motion in 3D Scenes from Text Descriptions
[64] Anyskill: Learning open-vocabulary physical skill for interactive agents
[65] PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large Multimodal Models
[66] Grounding 3D Object Affordance with Language Instructions, Visual Observations and Interactions
[67] HumanVLA: Towards Vision-Language Directed Object Rearrangement by Physical Humanoid
VLM-Guided Relative Movement Dynamics (RMD) representation
The authors propose RMD, a structured spatio-temporal representation that encodes fine-grained relationships between human and object parts. This representation enables VLMs to automatically generate goal states and reward functions for reinforcement learning, eliminating the need for manual reward engineering while supporting both static and dynamic interactions.
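This report does not reproduce RMD's exact schema, but the description suggests a structure along the following lines: per-phase relative-movement targets between named human and object parts, dense enough for a physics simulator to score at every step. The field names below are illustrative guesses, not the published format:

```python
# Hypothetical sketch of an RMD-style record. Field names are illustrative,
# inferred from the description above; they are not the paper's schema.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class RelativeMovement:
    human_part: str                              # e.g. "right_hand"
    object_part: str                             # e.g. "drawer_handle"
    relation: str                                # e.g. "approach", "contact"
    target_offset: Tuple[float, float, float]    # desired relative position (m)
    weight: float = 1.0                          # share of the phase reward

@dataclass
class Phase:
    name: str                                    # e.g. "reach", "pull_open"
    movements: List[RelativeMovement]
    success_threshold: float = 0.05              # meters; phase ends below this

def phase_reward(phase: Phase, relative_positions: dict) -> float:
    """Dense reward: negative weighted distance to each relative target.

    `relative_positions` maps (human_part, object_part) pairs to the
    current offset vector as read from the physics simulator.
    """
    total = 0.0
    for m in phase.movements:
        cur = relative_positions[(m.human_part, m.object_part)]
        dist = sum((c - t) ** 2 for c, t in zip(cur, m.target_offset)) ** 0.5
        total -= m.weight * dist
    return total

# Usage: one phase of a "reach for the drawer handle" task.
reach = Phase("reach", [RelativeMovement("right_hand", "drawer_handle",
                                         "approach", (0.0, 0.0, 0.0))])
print(phase_reward(reach, {("right_hand", "drawer_handle"): (0.3, 0.1, 0.0)}))
```

Because the targets are relative rather than absolute, the same phase description would transfer across scenes, which is consistent with the claim that one representation covers both static and dynamic interactions.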
[51] Human-oriented representation learning for robotic manipulation
[52] Task-Oriented Scanpath Prediction with Spatial-Temporal Information in Driving Scenarios
[53] GENFLOWRL: Generative Object-Centric Flow Matching for Reward Shaping in Visual Reinforcement Learning
[54] Deep selective feature learning for action recognition
[55] A Survey on Reinforcement Learning of Vision-Language-Action Models for Robotic Manipulation
[56] LSTM-GCN Hybrid Architecture for Model Predictive Control of Deformable Linear Objects
[57] Learning human utility from video demonstrations for deductive planning in robotics
[58] Teaching Virtual Agents to Perform Complex Spatial-Temporal Activities
[59] Learning Reward Functions for Robotic Manipulation by Observing Humans
Interplay dataset for long-horizon static and dynamic interaction tasks
The authors present Interplay, a new dataset containing thousands of long-horizon interaction plans that span both static and dynamic interaction tasks across varied scene contexts. This addresses a gap in existing datasets, which typically cover static interactions or object rearrangement in isolation rather than both.
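As an illustration only (the published format may differ), a long-horizon Interplay-style plan could be serialized as an ordered list of typed interaction steps, mixing static and dynamic tasks within one scene:

```python
# Illustrative sketch only: the published Interplay format may differ. A
# long-horizon plan is modeled as an ordered list of typed steps, mixing
# static interactions with dynamic object rearrangement in one scene.
example_plan = {
    "plan_id": "livingroom_0001",
    "scene": "living_room",
    "steps": [
        {"type": "static",  "action": "sit",   "object": "sofa"},
        {"type": "dynamic", "action": "carry", "object": "box",
         "target_location": "desk"},
        {"type": "dynamic", "action": "push",  "object": "chair",
         "target_location": "table"},
    ],
}

# Each step would be handed to the policy in sequence; "long-horizon" means
# the policy must chain completions across all steps of the plan.
for step in example_plan["steps"]:
    print(step["type"], step["action"], step["object"])
```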