InfBaGel: Human-Object-Scene Interaction Generation with Dynamic Perception and Iterative Refinement

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Interaction Generation · Consistency Model · Human Motion
Abstract:

Human–object–scene interaction (HOSI) generation has broad applications in embodied AI, simulation, and animation. Unlike human–object interaction (HOI) and human–scene interaction (HSI), HOSI generation requires reasoning over dynamic object–scene changes, yet it suffers from limited annotated data. To address these issues, we propose a coarse-to-fine, instruction-conditioned interaction generation framework that is explicitly aligned with the iterative denoising process of a consistency model. In particular, we adopt a dynamic perception strategy that leverages trajectories from the preceding refinement to update the scene context and condition the subsequent refinement at each denoising step of the consistency model, yielding consistent interactions. To further reduce physical artifacts, we introduce bump-aware guidance that mitigates collisions and penetrations during sampling without requiring fine-grained scene geometry, enabling real-time generation. To overcome data scarcity, we design a hybrid training strategy that synthesizes pseudo-HOSI samples by injecting voxelized scene occupancy into HOI datasets and jointly trains on high-fidelity HSI data, allowing the model to learn diverse interactions while preserving realistic scene awareness. Extensive experiments demonstrate that our method achieves state-of-the-art performance in both HOSI and HOI generation, along with strong generalization to unseen scenes. Code and datasets will be released upon acceptance.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a coarse-to-fine framework for generating human-object-scene interactions from textual instructions, employing a consistency model with dynamic perception and bump-aware guidance. It resides in the 'Instruction-Driven Multi-Stage Interaction Generation' leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader taxonomy of 38 papers. This leaf focuses on methods that decompose interaction synthesis into sequential stages guided by language, distinguishing it from unified end-to-end diffusion approaches or reinforcement learning-based control methods.

The taxonomy reveals that neighboring leaves include 'Unified Interaction Generation with Diffusion Models' (four papers) and 'Contact-Guided and Relation-Based Interaction Modeling' (two papers), both addressing full-body interactions but with different architectural philosophies. The paper's multi-stage approach contrasts with unified diffusion methods that jointly generate human and object motion without explicit stage separation. Its dynamic perception strategy and iterative refinement align with the scope of instruction-driven decomposition, while the bump-aware guidance shares conceptual overlap with contact-based modeling, though without requiring fine-grained geometry annotations.

Among the 30 candidates examined across the three contributions, none was flagged as clearly refuting the proposed methods. For the coarse-to-fine framework with dynamic perception, 10 candidates were examined with zero refutable overlaps; the same held for the hybrid training strategy and the bump-aware guidance. This suggests that, within the limited search scope, the specific combination of consistency models, dynamic scene-context updates, and voxelized occupancy injection appears underexplored. However, the absence of refutations reflects the search scale rather than exhaustive coverage of the literature, and the sparse leaf population hints at emerging rather than saturated research terrain.

Given the limited 30-candidate search and the sparse three-paper leaf, the work appears positioned in a relatively novel direction within instruction-driven interaction generation. The consistency model integration and hybrid data strategy may represent incremental advances over sibling methods, but the analysis does not capture potential overlaps in broader diffusion-based or contact-aware frameworks outside the examined scope. The novelty assessment remains provisional pending deeper exploration of related diffusion and multi-stage synthesis literature.

Taxonomy

Core-task Taxonomy Papers: 38
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Generating human-object-scene interactions from textual instructions and goal locations.

This field encompasses a diverse landscape of methods for synthesizing realistic interactions between humans, objects, and environments. The taxonomy reveals several major branches: Full-Body Human-Object-Scene Interaction Synthesis addresses holistic character animation in 3D spaces, often requiring multi-stage pipelines that handle navigation, approach, and manipulation. Hand-Object Interaction Synthesis focuses on fine-grained dexterous manipulation, while Scene-Agnostic approaches aim for generalization across environments. Zero-Shot and Out-of-Domain methods tackle novel object categories and unseen scenarios, and Multi-Human and Social Interaction Generation extends to collaborative settings. Reinforcement Learning-Based Control and Video-Based or Image-Based methods offer alternative paradigms, alongside specialized modalities and theoretical contributions that provide foundational understanding and benchmarks.

Within Full-Body Human-Object-Scene Interaction Synthesis, a particularly active line of work centers on instruction-driven multi-stage generation, where systems decompose complex tasks into sequential sub-goals. InfBaGel[0] exemplifies this approach by generating interactions from textual instructions and goal locations, emphasizing the coordination of whole-body motion with scene geometry. This contrasts with works like Human Level Instructions[1], which may prioritize high-level task planning, and Autonomous Character Interaction[5], which explores autonomous decision-making in interactive environments. A key trade-off across these methods involves balancing physical plausibility with semantic fidelity to instructions, as well as handling the compositional complexity of multi-step interactions. InfBaGel[0] sits within this cluster of instruction-driven frameworks, sharing the challenge of bridging natural language understanding with physically grounded motion synthesis while navigating cluttered, realistic scenes.

Claimed Contributions

Coarse-to-fine instruction-conditioned interaction generation framework with dynamic perception and iterative refinement

The authors introduce a unified generation framework that performs coarse-to-fine motion synthesis by aligning with the few-step denoising of a consistency model. A dynamic perception strategy updates scene context at each denoising step using trajectories from preceding refinement, enabling consistent human-object-scene interactions. Additionally, bump-aware guidance mitigates collisions during sampling without requiring fine-grained scene geometry.

10 retrieved papers
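To make the described loop concrete, the sketch below interleaves a few consistency-model refinement steps with a scene re-query driven by the previous step's trajectory, which is the essence of the dynamic perception strategy. All names here (`denoise_step`, `perceive_scene`, `coarse_to_fine_sample`) and the toy occupancy lookup are hypothetical stand-ins for illustration, not the paper's actual components.

```python
import numpy as np

def denoise_step(motion, scene_feat, t):
    """Placeholder for one consistency-model refinement step.

    A real model would condition on text and the scene features;
    here we simply shrink the noisy motion as a stand-in.
    """
    return motion * (1.0 - 0.5 * t)

def perceive_scene(scene_voxels, trajectory):
    """Dynamic perception: re-query the scene along the latest trajectory.

    Looks up occupancy at each root position, a crude stand-in for a
    learned scene encoder that would run at every denoising step.
    """
    feats = []
    for pos in trajectory:
        idx = np.clip(pos.astype(int), 0, np.array(scene_voxels.shape) - 1)
        feats.append(scene_voxels[tuple(idx)])
    return np.array(feats)

def coarse_to_fine_sample(scene_voxels, num_steps=4, horizon=8):
    # Start from pure noise (coarse stage), then refine over a few steps.
    motion = np.random.default_rng(0).normal(size=(horizon, 3))
    for step in range(num_steps):
        t = 1.0 - step / num_steps
        # Update scene context from the *preceding* refinement's trajectory,
        # then condition the next refinement on it.
        scene_feat = perceive_scene(scene_voxels, motion)
        motion = denoise_step(motion, scene_feat, t)
    return motion
```

The key design point illustrated is the ordering: perception runs between denoising steps, so each refinement sees scene context that reflects the trajectory it is about to correct.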
Hybrid data training strategy combining synthesized HOSI and high-fidelity HSI data

The authors propose a training strategy that addresses data scarcity by synthesizing pseudo-HOSI samples through voxelizing the spatial volume occupied by humans and objects in HOI datasets, then jointly training with real high-fidelity HSI data. This approach enables the model to learn diverse interactions while preserving realistic scene awareness and achieving strong zero-shot generalization to unseen scenes.

10 retrieved papers
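The pseudo-HOSI synthesis step can be sketched as voxelizing the volume swept by human and object geometry in an HOI clip and attaching it to the sample as stand-in scene occupancy. The function names and the fixed grid bounds below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def voxelize_occupancy(points, grid_size=32, bounds=(-1.0, 1.0)):
    """Voxelize a point cloud into a binary occupancy grid.

    `points` is an (N, 3) array, e.g. sampled from the human and object
    surfaces in an HOI clip; occupied voxels act as pseudo scene geometry.
    """
    lo, hi = bounds
    grid = np.zeros((grid_size,) * 3, dtype=bool)
    # Map each point into voxel indices and mark the cell occupied.
    idx = ((points - lo) / (hi - lo) * grid_size).astype(int)
    idx = np.clip(idx, 0, grid_size - 1)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = True
    return grid

def make_pseudo_hosi(hoi_points):
    """Inject voxelized human/object occupancy into an HOI sample,
    turning it into a pseudo-HOSI training example."""
    return {
        "motion": hoi_points,
        "scene_occupancy": voxelize_occupancy(hoi_points),
    }
```

Under this sketch, batches of such pseudo-HOSI samples would be mixed with real HSI data during training, so the model sees diverse interactions alongside genuine scene layouts.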
Bump-aware guidance for collision-free motion generation

The authors develop a lightweight guidance mechanism that detects collisions using voxelized scene representations and directs iterative sampling toward collision-free solutions. This approach avoids the computational cost of detailed mesh-based collision detection while progressively reducing penetration and enhancing physical plausibility during the consistency model sampling process.

10 retrieved papers
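A minimal sketch of voxel-based collision guidance, under the assumption that a penalty counted over occupied voxels is reduced during iterative sampling: positions that land in occupied cells are nudged along axis directions that lower the penalty. The coordinate-descent search and all names (`bump_penalty`, `bump_aware_guide`) are hypothetical simplifications; the paper's guidance operates inside consistency-model sampling rather than as a standalone post-process.

```python
import numpy as np

def bump_penalty(positions, occupancy, voxel_size=1.0):
    """Count how many positions land in occupied voxels (collisions)."""
    idx = np.clip((positions / voxel_size).astype(int), 0,
                  np.array(occupancy.shape) - 1)
    return occupancy[idx[:, 0], idx[:, 1], idx[:, 2]].sum()

def bump_aware_guide(positions, occupancy, step=0.1, iters=10):
    """Nudge colliding positions toward collision-free ones.

    Tries small axis-aligned shifts and keeps any that reduce the
    voxel-occupancy penalty, a cheap proxy for gradient guidance that
    never touches detailed scene meshes.
    """
    pos = positions.copy()
    for _ in range(iters):
        if bump_penalty(pos, occupancy) == 0:
            break  # already collision-free
        for axis in range(3):
            shifted = pos.copy()
            shifted[:, axis] += step
            # Accept the shift only if it reduces collisions.
            if bump_penalty(shifted, occupancy) < bump_penalty(pos, occupancy):
                pos = shifted
    return pos
```

Because the penalty only reads a voxel grid, each guidance step is a handful of array lookups, which is what makes this kind of check compatible with real-time sampling.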

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Each of the three claimed contributions (the coarse-to-fine framework with dynamic perception and iterative refinement, the hybrid training strategy combining synthesized HOSI and high-fidelity HSI data, and the bump-aware guidance for collision-free motion generation) was compared against 10 retrieved candidate papers. None of the 30 candidates was flagged as a refutable overlap, so no per-candidate refutation analyses are listed here; the full contribution descriptions appear under Claimed Contributions above.