Disentangled Hierarchical VAE for 3D Human-Human Interaction Generation

ICLR 2026 Conference Submission
Anonymous Authors
Human Motion · Human-Human Interaction · 3D CV · Motion Generation
Abstract:

Generating realistic 3D Human-Human Interaction (HHI) requires coherent modeling of the physical plausibility of the agents and their interaction semantics. Existing methods compress all motion information into a single latent representation, limiting their ability to capture fine-grained actions and inter-agent interactions. This often leads to semantic misalignment and physically implausible artifacts, such as penetration or missed contact. We propose Disentangled Hierarchical Variational Autoencoder (DHVAE) based latent diffusion for structured and controllable HHI generation. DHVAE explicitly disentangles the global interaction context and individual motion patterns into a decoupled latent structure by employing a CoTransformer module. To mitigate implausible and physically inconsistent contacts in HHI, we incorporate contrastive learning constraints with our DHVAE to promote a more discriminative and physically plausible latent interaction space. For high-fidelity interaction synthesis, DHVAE employs a DDIM-based diffusion denoising process in the hierarchical latent space, enhanced by a skip-connected AdaLN-Transformer denoiser. Extensive evaluations show that DHVAE achieves superior motion fidelity, text alignment, and physical plausibility with greater computational efficiency.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes the paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a Disentangled Hierarchical Variational Autoencoder (DHVAE) for text-driven 3D human-human interaction generation, explicitly separating global interaction context from individual motion patterns using a CoTransformer module and contrastive learning constraints. It resides in the 'Hierarchical and Disentangled Latent Diffusion' leaf under 'Diffusion-Based Dyadic Interaction Synthesis', which contains only two papers including this one. This is a relatively sparse research direction within the broader multi-person interaction generation branch, suggesting the hierarchical disentanglement approach is not yet widely explored in the field.

The taxonomy reveals that the paper's immediate neighbors include 'Rectified Flow and Unified Multi-Modal Frameworks' and several sibling categories like 'Masked Modeling and Discrete Token Approaches' and 'Autoregressive and State-Space Models'. These adjacent leaves represent alternative architectural strategies for dyadic interaction synthesis, such as discrete tokenization or autoregressive generation. The paper diverges from these by emphasizing continuous hierarchical latent structures and diffusion-based denoising, positioning itself at the intersection of structured representation learning and probabilistic generative modeling rather than discrete or sequential paradigms.

Among the four candidates examined, the contrastive learning strategy for the interaction latent has one refutable candidate, while the DHVAE architecture itself was compared against one candidate with no clear refutation. The skip-connected AdaLN-Transformer denoiser was not examined against any candidates. This limited search scope (only four papers in total) means the analysis captures a narrow slice of potentially relevant prior work. The contrastive learning component appears to have more substantial overlap with existing methods, whereas the hierarchical disentanglement architecture may be more distinctive within the examined set.

Based on the top-four semantic matches examined, the hierarchical disentanglement and skip-connected denoiser components appear less directly addressed by prior work, while the contrastive learning strategy has clearer precedent. The sparse population of the taxonomy leaf and the limited search scope suggest the work occupies a relatively underexplored niche, though a broader literature review would be needed to confirm whether the hierarchical VAE design and CoTransformer module represent substantive architectural novelty beyond the examined candidates.

Taxonomy

Core-task Taxonomy Papers: 33
Claimed Contributions: 3
Contribution Candidate Papers Compared: 4
Refutable Papers: 1

Research Landscape Overview

Core task: 3D human-human interaction generation from text. The field has evolved from foundational single-person text-to-motion generation methods into a diverse landscape addressing increasingly complex scenarios. The taxonomy reveals five major branches: Single-Person Text-to-Motion Generation establishes baseline techniques for mapping language to individual movement; Human-Object Interaction Generation explores how humans engage with physical objects; Multi-Person Interaction Generation tackles the challenge of synthesizing coordinated dyadic or group behaviors; Multimodal and Holistic Behavior Generation integrates speech, facial expressions, and body motion; and Application-Oriented and Platform Systems focus on practical deployment and user interfaces.

Works like TM2T[7] and Diverse Natural Motions[2] exemplify early single-person approaches, while HOI-Diff[3] and Contact-aware Motion[4] represent object-centric methods. The multi-person branch has grown particularly dense, with diffusion-based approaches becoming a dominant paradigm for modeling the intricate dependencies between interacting individuals.

Within multi-person interaction synthesis, a central tension exists between modeling flexibility and structural control. Many recent efforts employ diffusion models to capture the stochastic nature of human coordination, yet they differ in how they represent and disentangle interaction semantics. Text2interact[10] and InterControl[5] illustrate contrasting strategies for conditioning on textual descriptions, while Open Domain Multi-Person[1] and Fine-grained Dual-Human[23] explore scalability and fine-grained control. Disentangled Hierarchical VAE[0] sits within the hierarchical and disentangled latent diffusion cluster, emphasizing a structured latent space that separates global interaction semantics from local motion details. Compared to Text2interact[10], which focuses on end-to-end diffusion, Disentangled Hierarchical VAE[0] introduces an explicit hierarchical decomposition to improve interpretability and controllability. This design choice reflects ongoing debates about whether to prioritize expressive generative capacity or modular, editable representations in synthesizing realistic dyadic interactions.

Claimed Contributions

Disentangled Hierarchical Variational Autoencoder (DHVAE)

The authors introduce a hierarchical VAE architecture that explicitly disentangles HHI representations into three latent variables: individual motion latents for each person and a shared global interaction latent. This structured decomposition, implemented via a CoTransformer module, enables fine-grained control over agent-specific behaviors and shared interaction semantics.

1 retrieved paper
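As a rough illustration of the kind of latent structure this contribution describes, the sketch below encodes two agents into per-agent latents plus a shared interaction latent via the standard VAE reparameterization. This is a plain-Python toy with hypothetical names and shapes, not the paper's CoTransformer implementation:

```python
import math
import random

random.seed(0)

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps (standard VAE reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * random.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

def encode_interaction(feat_a, feat_b, dim=4):
    """Toy stand-in for a disentangling encoder: each agent gets its own
    latent, and a shared interaction latent is built from joint statistics."""
    # Per-agent "posterior" means (illustrative, not learned parameters).
    mu_a, mu_b = feat_a[:dim], feat_b[:dim]
    logvar = [0.0] * dim  # unit variance for the sketch
    z_a = reparameterize(mu_a, logvar)
    z_b = reparameterize(mu_b, logvar)
    # Shared interaction latent: symmetric pooling over both agents.
    mu_int = [(a + b) / 2.0 for a, b in zip(mu_a, mu_b)]
    z_int = reparameterize(mu_int, logvar)
    return z_a, z_b, z_int

z_a, z_b, z_int = encode_interaction([1.0, 2.0, 3.0, 4.0],
                                     [0.0, 1.0, 0.0, 1.0])
print(len(z_a), len(z_b), len(z_int))
```

The point of the three-way split is that a decoder (or downstream denoiser) can condition each agent's motion on its own latent plus the shared one, so edits to `z_int` would affect the interaction semantics without rewriting either agent's individual style.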
Contrastive learning strategy for interaction latent

The authors develop a contrastive learning approach that constructs positive and negative motion pairs based on physical contact and spatial plausibility. This strategy imposes supervision on the global interaction latent to encourage physically realistic contact modeling and reduce implausible artifacts like penetration or missed contact.

3 retrieved papers
Can Refute
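The supervision described here resembles a standard InfoNCE-style contrastive objective over interaction latents. Below is a minimal stdlib sketch under that assumption, with hypothetical latent vectors in which the positive is a contact-consistent pair and the negatives stand in for penetration or missed-contact artifacts; it is not the paper's actual loss:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: pull the anchor toward the contact-consistent
    positive, push it away from physically implausible negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))

anchor = [1.0, 0.0, 0.5]
positive = [0.9, 0.1, 0.4]      # contact-consistent interaction latent
negatives = [[-1.0, 0.2, 0.0],  # e.g. a penetration artifact
             [0.0, -1.0, 0.3]]  # e.g. a missed contact
loss = info_nce(anchor, positive, negatives)
print(round(loss, 4))
```

With the positive near the anchor and the negatives far away, the loss is close to zero; it grows when an implausible pair is placed in the positive slot, which is exactly the pressure that would shape the interaction latent space toward physical plausibility.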
Skip-connected AdaLN-Transformer denoiser for hierarchical latent diffusion

The authors propose a denoiser architecture that combines adaptive layer normalization transformers with U-Net-style skip connections, segment positional encoding, and token scaling. This design enables stable and efficient diffusion-based generation in the structured latent space while handling scale imbalance across latent components.

0 retrieved papers
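The two named building blocks, adaptive layer normalization and U-Net-style long skip connections, can be sketched generically. The stdlib toy below shows only their wiring and is not the proposed denoiser; it omits attention, segment positional encoding, and token scaling, and all shapes are hypothetical:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean, unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def ada_ln(x, cond_scale, cond_shift):
    """Adaptive LayerNorm: normalize, then scale/shift with parameters
    that would be predicted from the conditioning (timestep + text)."""
    return [s * v + b
            for v, s, b in zip(layer_norm(x), cond_scale, cond_shift)]

def denoise_block(tokens, cond_scale, cond_shift):
    """Placeholder block: AdaLN followed by an identity residual."""
    return [[xi + yi for xi, yi in zip(tok, ada_ln(tok, cond_scale, cond_shift))]
            for tok in tokens]

def skip_denoiser(tokens, cond_scale, cond_shift, depth=2):
    """U-Net-style layout: stash first-half activations and fuse them back
    into the matching second-half blocks via long skip connections."""
    stack = []
    for _ in range(depth):                      # "encoder" half
        tokens = denoise_block(tokens, cond_scale, cond_shift)
        stack.append(tokens)
    for _ in range(depth):                      # "decoder" half
        skip = stack.pop()
        fused = [[0.5 * (a + b) for a, b in zip(t, s)]  # skip fusion
                 for t, s in zip(tokens, skip)]
        tokens = denoise_block(fused, cond_scale, cond_shift)
    return tokens

out = skip_denoiser([[1.0, 2.0, 3.0], [0.5, 0.0, -0.5]],
                    cond_scale=[1.0, 1.0, 1.0], cond_shift=[0.0, 0.0, 0.0])
print(len(out), len(out[0]))
```

The long skips give late blocks direct access to early, less-processed activations, which is commonly credited with stabilizing transformer-based diffusion denoisers; the claimed token scaling would additionally rebalance the differently-scaled individual and interaction latents before they share one token sequence.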

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Disentangled Hierarchical Variational Autoencoder (DHVAE)

The authors introduce a hierarchical VAE architecture that explicitly disentangles HHI representations into three latent variables: individual motion latents for each person and a shared global interaction latent. This structured decomposition, implemented via a CoTransformer module, enables fine-grained control over agent-specific behaviors and shared interaction semantics.

Contribution

Contrastive learning strategy for interaction latent

The authors develop a contrastive learning approach that constructs positive and negative motion pairs based on physical contact and spatial plausibility. This strategy imposes supervision on the global interaction latent to encourage physically realistic contact modeling and reduce implausible artifacts like penetration or missed contact.

Contribution

Skip-connected AdaLN-Transformer denoiser for hierarchical latent diffusion

The authors propose a denoiser architecture that combines adaptive layer normalization transformers with U-Net-style skip connections, segment positional encoding, and token scaling. This design enables stable and efficient diffusion-based generation in the structured latent space while handling scale imbalance across latent components.
