H³DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Imitation Learning · Representation Learning · Diffusion Model
Abstract:

Visuomotor policy learning has witnessed substantial progress in robotic manipulation, with recent approaches predominantly relying on generative models to model the action distribution. However, these methods often overlook the critical coupling between visual perception and action prediction. In this work, we introduce Triply-Hierarchical Diffusion Policy (H³DP), a novel visuomotor learning framework that explicitly incorporates hierarchical structures to strengthen the integration between visual features and action generation. H³DP contains 3 levels of hierarchy: (1) depth-aware input layering that organizes RGB-D observations based on depth information; (2) multi-scale visual representations that encode semantic features at varying levels of granularity; and (3) a hierarchically conditioned diffusion process that aligns the generation of coarse-to-fine actions with corresponding visual features. Extensive experiments demonstrate that H³DP yields a +27.5% average relative improvement over baselines across 44 simulation tasks and achieves superior performance on 4 challenging bimanual real-world manipulation tasks. Project page: https://h3-dp.github.io/.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), so the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, and human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a triply-hierarchical diffusion policy framework that integrates depth-aware input layering, multi-scale visual representations, and hierarchically conditioned action generation. It resides in the Hierarchical Visual Feature Encoding leaf, which contains four papers total (including this one). This leaf sits within Multi-Scale Visual Representation Learning, a moderately populated branch addressing how visual features at different granularities inform action prediction. The taxonomy reveals this is an active but not overcrowded research direction, with sibling leaves exploring spatial attention mechanisms and multi-view perception.

The broader Multi-Scale Visual Representation Learning branch neighbors Hierarchical Policy Architectures (which decomposes tasks into high-level planning and low-level execution) and Generative Action Models (which treats actions as distributions). The Hierarchical Visual Feature Encoding leaf explicitly excludes methods using only final-layer features or single-scale encodings, positioning it as a middle ground between flat visual processing and full task decomposition. Nearby leaves like Spatial Attention and Graph-Based Reasoning focus on relational modeling rather than scale-based feature hierarchies, while Multi-View and Depth-Aware Perception emphasizes viewpoint integration over hierarchical conditioning.

Among the twenty-seven candidates examined, the triply-hierarchical framework itself shows no clear refutation (nine candidates, zero refutable). The hierarchically conditioned diffusion process likewise appears novel (nine candidates, zero refutable). However, the depth-aware layering strategy encounters one refutable candidate among the nine examined, suggesting that some prior work addresses depth-based input organization. Given the limited search scope, these statistics reflect top-K semantic matches and citation expansion, not exhaustive coverage. The framework's novelty appears strongest in its integrated coupling of visual and action hierarchies rather than in any individual component.

Because the search examined twenty-seven candidates across three contributions, the analysis captures immediate semantic neighbors but cannot rule out relevant work outside this scope. The taxonomy structure suggests the paper occupies a moderately explored niche where hierarchical visual encoding meets diffusion-based action generation. The depth-aware layering component shows the most overlap with prior art, while the end-to-end integration of three hierarchical levels appears less directly anticipated by the examined literature.

Taxonomy

- 49 Core-task Taxonomy Papers
- 3 Claimed Contributions
- 27 Contribution Candidate Papers Compared
- 1 Refutable Paper

Research Landscape Overview

Core task: visuomotor policy learning with hierarchical visual-action coupling. The field addresses how agents can learn to map visual observations to motor actions by exploiting structure at multiple levels of abstraction. The taxonomy reveals several complementary perspectives: Hierarchical Policy Architectures decompose decision-making into high-level planning and low-level control (e.g., Hierarchical Imitation Driving[3], HAMSTER[5]); Multi-Scale Visual Representation Learning focuses on extracting features at different spatial or temporal resolutions (e.g., HDP[6], VAT[48]); Generative Action Models treat action sequences as distributions to be sampled or refined; Multi-Modal Sensory Integration combines vision with tactile or proprioceptive signals; Specialized Task Domains target navigation, manipulation, or driving; Learning Paradigms span imitation, reinforcement, and self-supervised methods; and Neuroscience-Inspired models draw on predictive coding or active inference principles. Together, these branches reflect a shared goal of bridging the gap between raw sensory input and coordinated motor output through layered representations.

A particularly active line of work explores how to encode visual information hierarchically so that coarse scene understanding guides fine-grained action selection. H3DP[0] sits within the Multi-Scale Visual Representation Learning branch, specifically under Hierarchical Visual Feature Encoding, where it emphasizes coupling visual features at different scales directly to corresponding action granularities. This contrasts with approaches like Spatial Policy[1], which may prioritize spatial attention mechanisms, or HDP[6], which structures policies around explicit hierarchical decompositions of the action space. Meanwhile, VAT[48] leverages transformer architectures for multi-scale encoding, highlighting a trend toward attention-based feature aggregation.
The central trade-off across these methods involves balancing representational expressiveness—capturing rich visual detail—with computational efficiency and sample complexity during training. H3DP[0] addresses this by tightly integrating visual and action hierarchies, aiming to improve generalization across tasks that demand both global scene context and precise local control.

Claimed Contributions

Triply-Hierarchical Diffusion Policy (H³DP) framework

The authors propose H³DP, a visuomotor policy learning framework that integrates three levels of hierarchy: depth-aware input layering of RGB-D observations, multi-scale visual representations encoding features at varying granularity, and a hierarchically conditioned diffusion process aligning coarse-to-fine action generation with corresponding visual features.

9 retrieved papers
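The multi-scale visual representation component of this framework can be illustrated with a minimal sketch. The pooling pyramid below is a hypothetical stand-in (the paper's actual encoder architecture is not specified in this report): each coarser scale halves the spatial resolution by 2×2 average pooling, yielding a coarse-to-fine feature pyramid.

```python
import numpy as np

def multiscale_features(feat_map, num_scales=3):
    """Build a coarse-to-fine feature pyramid via 2x2 average pooling.

    Hypothetical sketch, not the paper's encoder.
    feat_map: (H, W, C) array with H, W divisible by 2**(num_scales - 1).
    Returns a list of num_scales arrays, coarsest first.
    """
    pyramid = [feat_map]
    for _ in range(num_scales - 1):
        f = pyramid[-1]
        h, w, c = f.shape
        # 2x2 average pooling: split into blocks, mean over each block
        f = f.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))
        pyramid.append(f)
    return pyramid[::-1]  # coarsest scale first
```

A real implementation would use learned convolutional or transformer features at each scale; the pyramid shape and coarse-first ordering are what matter for the hierarchy.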
Depth-aware layering strategy for RGB-D input

The authors introduce a method that decomposes RGB-D images into multiple non-overlapping layers based on depth values, enabling the policy to explicitly distinguish foreground from background and suppress distractors and occlusions, thereby enhancing spatial structure understanding in cluttered visual scenarios.

9 retrieved papers
Can Refute
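The depth-based decomposition described above can be sketched as follows. This is an illustrative reading, assuming uniform depth bins over a fixed range (the paper's actual binning scheme may differ):

```python
import numpy as np

def depth_layering(rgb, depth, num_layers=3, d_min=0.0, d_max=1.0):
    """Split an RGB image into non-overlapping layers by depth bins.

    Hypothetical sketch: bin edges are uniform in [d_min, d_max).
    rgb:   (H, W, 3) float array
    depth: (H, W) float array, same spatial size as rgb
    Returns (num_layers, H, W, 3); pixels whose depth falls outside
    a layer's bin are zeroed out in that layer.
    """
    edges = np.linspace(d_min, d_max, num_layers + 1)
    layers = []
    for i in range(num_layers):
        # mask selects pixels whose depth falls in the i-th bin
        mask = (depth >= edges[i]) & (depth < edges[i + 1])
        layers.append(rgb * mask[..., None])
    return np.stack(layers)
```

Because the bins are non-overlapping, each pixel appears in exactly one layer, which is what lets a downstream policy treat foreground and background separately.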
Hierarchically conditioned diffusion process for action generation

The authors design a diffusion-based action generation mechanism where coarse visual features guide initial denoising steps to shape global action structure (low-frequency components), while fine-grained features inform later steps to refine precise details (high-frequency components), establishing tighter coupling between action generation and visual encoding.

9 retrieved papers
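The coarse-to-fine conditioning schedule described above can be sketched as a reverse diffusion loop. This is a hypothetical reading of the mechanism: early (high-noise) steps condition on coarse features, later steps switch to fine-grained features. The `denoiser` signature and the single-threshold switch are assumptions, and the update rule is a simplified placeholder rather than a full DDPM/DDIM sampler.

```python
import numpy as np

def hierarchical_denoise(denoiser, coarse_feat, fine_feat,
                         horizon=16, action_dim=7,
                         num_steps=50, switch_frac=0.5, seed=0):
    """Sketch of a hierarchically conditioned reverse diffusion loop.

    denoiser(x, t, cond) is assumed to predict the noise residual for
    the noisy action sequence x at step t under visual condition cond.
    Steps t >= switch point use coarse features (global, low-frequency
    action structure); later steps use fine features (local detail).
    """
    rng = np.random.default_rng(seed)
    actions = rng.standard_normal((horizon, action_dim))  # pure noise
    switch_t = int(num_steps * switch_frac)
    for t in reversed(range(num_steps)):
        cond = coarse_feat if t >= switch_t else fine_feat
        eps = denoiser(actions, t, cond)
        actions = actions - eps / num_steps  # placeholder denoising step
    return actions
```

The key design point is that the conditioning signal, not the network, changes across denoising steps, which is how the action hierarchy is tied to the visual hierarchy.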

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
