H³DP: Triply-Hierarchical Diffusion Policy for Visuomotor Learning

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Imitation Learning · Representation Learning · Diffusion Model
Abstract:

Visuomotor policy learning has witnessed substantial progress in robotic manipulation, with recent approaches predominantly relying on generative models to model the action distribution. However, these methods often overlook the critical coupling between visual perception and action prediction. In this work, we introduce Triply-Hierarchical Diffusion Policy (H³DP), a novel visuomotor learning framework that explicitly incorporates hierarchical structures to strengthen the integration between visual features and action generation. H³DP contains 3 levels of hierarchy: (1) depth-aware input layering that organizes RGB-D observations based on depth information; (2) multi-scale visual representations that encode semantic features at varying levels of granularity; and (3) a hierarchically conditioned diffusion process that aligns the generation of coarse-to-fine actions with corresponding visual features. Extensive experiments demonstrate that H³DP yields a +27.5% average relative improvement over baselines across 44 simulation tasks and achieves superior performance on 4 challenging bimanual real-world manipulation tasks. Project page: https://h3-dp.github.io/.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), so the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, and human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a triply-hierarchical diffusion policy framework that integrates depth-aware input layering, multi-scale visual representations, and hierarchically conditioned action generation. It resides in the Hierarchical Visual Feature Encoding leaf, which contains four papers total (including this one). This leaf sits within Multi-Scale Visual Representation Learning, a moderately populated branch addressing how visual features at different granularities inform action prediction. The taxonomy reveals this is an active but not overcrowded research direction, with sibling leaves exploring spatial attention mechanisms and multi-view perception.

The broader Multi-Scale Visual Representation Learning branch neighbors Hierarchical Policy Architectures (which decomposes tasks into high-level planning and low-level execution) and Generative Action Models (which treats actions as distributions). The Hierarchical Visual Feature Encoding leaf explicitly excludes methods using only final-layer features or single-scale encodings, positioning it as a middle ground between flat visual processing and full task decomposition. Nearby leaves like Spatial Attention and Graph-Based Reasoning focus on relational modeling rather than scale-based feature hierarchies, while Multi-View and Depth-Aware Perception emphasizes viewpoint integration over hierarchical conditioning.

Among the twenty-seven candidates examined, the triply-hierarchical framework itself shows no clear refutation (nine candidates, zero refutable). The hierarchically conditioned diffusion process likewise appears novel (nine candidates, zero refutable). However, the depth-aware layering strategy encounters one refutable candidate among the nine examined, suggesting that some prior work addresses depth-based input organization. Given the limited search scope, these statistics reflect top-K semantic matches and citation expansion, not exhaustive coverage. The framework's novelty appears strongest in its integrated coupling of visual and action hierarchies rather than in any individual component.

Because the search examined twenty-seven candidates across three contributions, the analysis captures immediate semantic neighbors but cannot rule out relevant work outside this scope. The taxonomy structure suggests the paper occupies a moderately explored niche where hierarchical visual encoding meets diffusion-based action generation. The depth-aware layering component shows the most overlap with prior art, while the end-to-end integration of three hierarchical levels appears less directly anticipated by the examined literature.

Taxonomy

- 49 Core-task Taxonomy Papers
- 3 Claimed Contributions
- 27 Contribution Candidate Papers Compared
- 1 Refutable Paper

Research Landscape Overview

Core task: visuomotor policy learning with hierarchical visual-action coupling. The field addresses how agents can learn to map visual observations to motor actions by exploiting structure at multiple levels of abstraction. The taxonomy reveals several complementary perspectives: Hierarchical Policy Architectures decompose decision-making into high-level planning and low-level control (e.g., Hierarchical Imitation Driving[3], HAMSTER[5]); Multi-Scale Visual Representation Learning focuses on extracting features at different spatial or temporal resolutions (e.g., HDP[6], VAT[48]); Generative Action Models treat action sequences as distributions to be sampled or refined; Multi-Modal Sensory Integration combines vision with tactile or proprioceptive signals; Specialized Task Domains target navigation, manipulation, or driving; Learning Paradigms span imitation, reinforcement, and self-supervised methods; and Neuroscience-Inspired models draw on predictive coding or active inference principles. Together, these branches reflect a shared goal of bridging the gap between raw sensory input and coordinated motor output through layered representations.

A particularly active line of work explores how to encode visual information hierarchically so that coarse scene understanding guides fine-grained action selection. H3DP[0] sits within the Multi-Scale Visual Representation Learning branch, specifically under Hierarchical Visual Feature Encoding, where it emphasizes coupling visual features at different scales directly to corresponding action granularities. This contrasts with approaches like Spatial Policy[1], which may prioritize spatial attention mechanisms, or HDP[6], which structures policies around explicit hierarchical decompositions of the action space. Meanwhile, VAT[48] leverages transformer architectures for multi-scale encoding, highlighting a trend toward attention-based feature aggregation.
The central trade-off across these methods involves balancing representational expressiveness—capturing rich visual detail—with computational efficiency and sample complexity during training. H3DP[0] addresses this by tightly integrating visual and action hierarchies, aiming to improve generalization across tasks that demand both global scene context and precise local control.

Claimed Contributions

Triply-Hierarchical Diffusion Policy (H³DP) framework

The authors propose H³DP, a visuomotor policy learning framework that integrates three levels of hierarchy: depth-aware input layering of RGB-D observations, multi-scale visual representations encoding features at varying granularity, and a hierarchically conditioned diffusion process aligning coarse-to-fine action generation with corresponding visual features.

9 retrieved papers
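The multi-scale visual representation component of this framework can be illustrated with a minimal sketch. The pooling pyramid below is a hypothetical stand-in (the paper's actual encoder architecture is not specified in this report): each coarser scale halves the spatial resolution by 2×2 average pooling, yielding a coarse-to-fine feature pyramid.

```python
import numpy as np

def multiscale_features(feat_map, num_scales=3):
    """Build a coarse-to-fine feature pyramid via 2x2 average pooling.

    Hypothetical sketch, not the paper's encoder.
    feat_map: (H, W, C) array with H, W divisible by 2**(num_scales - 1).
    Returns a list of num_scales arrays, coarsest first.
    """
    pyramid = [feat_map]
    for _ in range(num_scales - 1):
        f = pyramid[-1]
        h, w, c = f.shape
        # 2x2 average pooling: split into blocks, mean over each block
        f = f.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))
        pyramid.append(f)
    return pyramid[::-1]  # coarsest scale first
```

A real implementation would use learned convolutional or transformer features at each scale; the pyramid shape and coarse-first ordering are what matter for the hierarchy.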
Depth-aware layering strategy for RGB-D input

The authors introduce a method that decomposes RGB-D images into multiple non-overlapping layers based on depth values, enabling the policy to explicitly distinguish foreground from background and suppress distractors and occlusions, thereby enhancing spatial structure understanding in cluttered visual scenarios.

9 retrieved papers
Can Refute
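The depth-based decomposition described above can be sketched as follows. This is an illustrative reading, assuming uniform depth bins over a fixed range (the paper's actual binning scheme may differ):

```python
import numpy as np

def depth_layering(rgb, depth, num_layers=3, d_min=0.0, d_max=1.0):
    """Split an RGB image into non-overlapping layers by depth bins.

    Hypothetical sketch: bin edges are uniform in [d_min, d_max).
    rgb:   (H, W, 3) float array
    depth: (H, W) float array, same spatial size as rgb
    Returns (num_layers, H, W, 3); pixels whose depth falls outside
    a layer's bin are zeroed out in that layer.
    """
    edges = np.linspace(d_min, d_max, num_layers + 1)
    layers = []
    for i in range(num_layers):
        # mask selects pixels whose depth falls in the i-th bin
        mask = (depth >= edges[i]) & (depth < edges[i + 1])
        layers.append(rgb * mask[..., None])
    return np.stack(layers)
```

Because the bins are non-overlapping, each pixel appears in exactly one layer, which is what lets a downstream policy treat foreground and background separately.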
Hierarchically conditioned diffusion process for action generation

The authors design a diffusion-based action generation mechanism where coarse visual features guide initial denoising steps to shape global action structure (low-frequency components), while fine-grained features inform later steps to refine precise details (high-frequency components), establishing tighter coupling between action generation and visual encoding.

9 retrieved papers
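The coarse-to-fine conditioning schedule described above can be sketched as a reverse diffusion loop. This is a hypothetical reading of the mechanism: early (high-noise) steps condition on coarse features, later steps switch to fine-grained features. The `denoiser` signature and the single-threshold switch are assumptions, and the update rule is a simplified placeholder rather than a full DDPM/DDIM sampler.

```python
import numpy as np

def hierarchical_denoise(denoiser, coarse_feat, fine_feat,
                         horizon=16, action_dim=7,
                         num_steps=50, switch_frac=0.5, seed=0):
    """Sketch of a hierarchically conditioned reverse diffusion loop.

    denoiser(x, t, cond) is assumed to predict the noise residual for
    the noisy action sequence x at step t under visual condition cond.
    Steps t >= switch point use coarse features (global, low-frequency
    action structure); later steps use fine features (local detail).
    """
    rng = np.random.default_rng(seed)
    actions = rng.standard_normal((horizon, action_dim))  # pure noise
    switch_t = int(num_steps * switch_frac)
    for t in reversed(range(num_steps)):
        cond = coarse_feat if t >= switch_t else fine_feat
        eps = denoiser(actions, t, cond)
        actions = actions - eps / num_steps  # placeholder denoising step
    return actions
```

The key design point is that the conditioning signal, not the network, changes across denoising steps, which is how the action hierarchy is tied to the visual hierarchy.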

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
