Learning Part-Aware Dense 3D Feature Field For Generalizable Articulated Object Manipulation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Robotic Manipulation, Imitation Learning, 3D Representation, Generalizable Policy
Abstract:

Articulated object manipulation is essential for many real-world robotic tasks, yet generalizing across diverse objects remains a major challenge. A key to generalization lies in understanding functional parts (e.g., door handles and knobs), which indicate where and how to manipulate across diverse object categories and shapes. Previous works attempted to achieve generalization by introducing foundation features, but these features are mostly 2D-based and do not specifically model functional parts. Lifting such 2D features into geometry-rich 3D space raises further challenges, including long runtimes, multi-view inconsistencies, and low spatial resolution with insufficient geometric information. To address these issues, we propose the Part-Aware 3D Feature Field (PA3FF), a novel dense 3D feature with part awareness for generalizable articulated object manipulation. PA3FF is trained on 3D part proposals from large-scale labeled datasets via a contrastive learning formulation. Given a point cloud as input, PA3FF predicts a continuous 3D feature field in a feedforward manner, where the distance between point features reflects the proximity of functional parts: points with similar features are more likely to belong to the same part. Building on this feature, we introduce the Part-Aware Diffusion Policy (PADP), an imitation learning framework aimed at enhancing sample efficiency and generalization for robotic manipulation. We evaluate PADP on several simulated and real-world tasks, demonstrating that PA3FF consistently outperforms a range of 2D and 3D representations, including CLIP, DINOv2, and Grounded-SAM, achieving state-of-the-art performance. Beyond imitation learning, PA3FF supports diverse downstream methods, including correspondence learning and segmentation tasks, making it a versatile foundation for robotic manipulation.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Part-Aware 3D Feature Field (PA3FF), a dense continuous 3D feature representation trained via contrastive learning on large-scale part-annotated datasets, and Part-Aware Diffusion Policy (PADP) for manipulation. It resides in the 'Dense 3D Feature Fields for Part-Aware Manipulation' leaf, which contains only three papers including the original work. This is a relatively sparse research direction within the broader taxonomy of fifty papers, suggesting that continuous 3D feature fields specifically designed for part-aware manipulation remain an emerging area compared to more populated branches like cross-category generalization or articulation modeling.

The taxonomy reveals several neighboring research directions. The sibling leaf 'Cross-Category Part-Based Generalization' (four papers) emphasizes shared part semantics across categories, while 'Affordance and Actionable Part Learning' (three papers) focuses on predicting interaction points rather than dense fields. The parent branch 'Part-Aware Representation Learning for Manipulation' also includes 'Part-Level Instruction Following' and 'Superpoint and Hierarchical Part Representations', indicating that the field explores multiple granularities of part encoding. The paper's approach of learning continuous fields contrasts with discrete segmentation methods in 'Articulation Modeling and Motion Estimation', particularly 'Part Segmentation and Motion Decomposition' (five papers), which jointly segment and estimate motion parameters.

Among thirty candidates examined, the contrastive learning framework for part-aware features shows overlap with prior work: two refutable candidates were identified from ten examined. The PA3FF contribution itself (ten candidates examined, zero refutable) and PADP (ten candidates examined, zero refutable) appear more novel within the limited search scope. The statistics suggest that while the core feature field and policy components may be relatively unexplored in this specific formulation, the training methodology via contrastive learning on part proposals has more substantial precedent. The analysis does not claim exhaustive coverage; these findings reflect top-thirty semantic matches and their citation neighborhoods.

Based on the limited literature search, the work appears to occupy a sparsely populated niche at the intersection of dense 3D representations and part-aware manipulation. The taxonomy structure and contribution-level statistics suggest moderate novelty for the feature field and policy components, with the contrastive learning approach showing clearer connections to existing methods. The scope examined—thirty candidates across three contributions—provides a snapshot rather than definitive coverage of the field.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 2

Research Landscape Overview

Core task: generalizable articulated object manipulation using part-aware 3D features. The field organizes around several complementary branches that together address the challenge of enabling robots to interact with articulated objects such as doors, drawers, and cabinets. Part-Aware Representation Learning for Manipulation focuses on extracting dense geometric and semantic features that distinguish functional parts, often leveraging neural fields or point cloud encodings to ground manipulation policies. Articulation Modeling and Motion Estimation tackles the inverse problem of inferring kinematic structures and motion parameters from observations, while Manipulation Policy Learning and Execution develops control strategies that exploit these representations. Supporting branches include 3D Reconstruction and Pose Estimation for building object models, Generative Modeling of Articulated Objects for synthesis and data augmentation, and Simulation Environments and Benchmarks such as SAPIEN[28] and ArtiBench[49] that provide standardized testbeds. Specialized Manipulation Scenarios address domain-specific challenges such as deformable connections or temporal logic constraints.

Recent work reveals a tension between end-to-end learning approaches and modular pipelines that explicitly model part semantics and articulation structure. Dense feature field methods such as Part-Aware Dense Feature[0] and its close neighbors PartGS[23] and Part2GS[46] emphasize learning rich 3D representations that encode part boundaries and affordances directly from visual input, enabling generalization across object categories. In contrast, works such as GAPartNet[4] and PartManip[12] rely on explicit part segmentation and geometric reasoning to guide manipulation. Part-Aware Dense Feature[0] sits within the dense feature field cluster, sharing with PartGS[23] and Part2GS[46] an emphasis on continuous spatial encodings but differing in how part-level structure is regularized or supervised. This line of work contrasts with more structured approaches such as UniArt[5] and ArtFormer[10], which impose stronger priors on articulation types, highlighting an ongoing exploration of how much geometric inductive bias is necessary for robust generalization.

Claimed Contributions

Part-Aware 3D Feature Field (PA3FF)

A 3D-native representation that encodes dense, semantic, and functional part-aware features directly from point clouds in a feedforward manner. The feature field is trained using contrastive learning on 3D part proposals from large-scale labeled datasets, where feature proximity reflects functional part similarity.

10 retrieved papers
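The description above says that feature distance in PA3FF encodes part proximity, so points can be grouped into parts directly from their features. The sketch below illustrates that grouping rule only; the feature vectors, similarity threshold, and greedy assignment are hypothetical stand-ins, not the paper's actual inference code.

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def group_by_feature(features, threshold=0.9):
    # Greedy grouping: a point joins the first group whose representative
    # feature is within the similarity threshold, mirroring the idea that
    # nearby features imply a shared functional part.
    groups = []  # list of (representative_feature, member_indices)
    for i, f in enumerate(features):
        for rep, members in groups:
            if cosine(f, rep) >= threshold:
                members.append(i)
                break
        else:
            groups.append((f, [i]))
    return [members for _, members in groups]

# Hypothetical per-point features: two handle points, one door-panel point.
feats = [[1.0, 0.0], [0.98, 0.05], [0.0, 1.0]]
print(group_by_feature(feats))  # → [[0, 1], [2]]
```

A learned field would produce far higher-dimensional features from the point cloud, but the same proximity rule applies.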
Part-Aware Diffusion Policy (PADP)

An imitation learning framework that integrates PA3FF with a diffusion policy architecture for action generation. PADP leverages the part-aware 3D features to achieve sample-efficient and generalizable manipulation behaviors across diverse objects.

10 retrieved papers
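PADP generates actions by iteratively denoising, with the denoiser conditioned on part-aware features. The following is a minimal sketch of that refinement loop under a toy noise predictor; the real denoiser is a learned network conditioned on PA3FF features, and the `target_action` conditioning dict here is purely illustrative.

```python
def toy_noise_predictor(action, cond):
    # Hypothetical stand-in for the learned denoiser: predicts noise as
    # the offset from a condition-dependent target action.
    target = cond["target_action"]
    return [a - t for a, t in zip(action, target)]

def denoise(action, cond, steps=50, rate=0.2):
    # Each step removes a fraction of the predicted noise, as in
    # DDPM-style iterative refinement (simplified: no noise schedule).
    for _ in range(steps):
        eps = toy_noise_predictor(action, cond)
        action = [a - rate * e for a, e in zip(action, eps)]
    return action

# In practice cond would carry the PA3FF features of the observed scene.
cond = {"target_action": [0.5, -0.2, 0.1]}
a = denoise([0.0, 0.0, 0.0], cond)
```

After enough steps the action converges toward the conditioned target; a real diffusion policy additionally injects scheduled noise and trains the predictor from demonstrations.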
Contrastive learning framework for part-aware features

A training approach combining a geometric loss (encouraging spatial consistency within parts) and a semantic loss (aligning point features with part-name embeddings from SigLIP) to enhance part awareness in the 3D feature field.

10 retrieved papers
Can Refute
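The two-term objective described for this contribution can be sketched as an InfoNCE-style geometric loss plus a cosine-alignment semantic loss. All vectors and the temperature below are illustrative; the paper's actual losses, feature dimensions, and SigLIP embeddings are not reproduced here.

```python
import math

def cos(u, v):
    d = sum(a * b for a, b in zip(u, v))
    return d / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def geometric_loss(anchor, positive, negatives, tau=0.1):
    # InfoNCE-style: pull same-part point features together,
    # push other-part features away.
    pos = math.exp(cos(anchor, positive) / tau)
    neg = sum(math.exp(cos(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

def semantic_loss(point_feature, text_embedding):
    # Align a point feature with its part-name text embedding
    # (SigLIP in the paper; any text encoder fits the sketch).
    return 1.0 - cos(point_feature, text_embedding)

anchor, positive = [1.0, 0.0], [0.95, 0.1]   # two points on the same part
negatives = [[0.0, 1.0]]                      # a point on a different part
text_emb = [1.0, 0.05]  # hypothetical embedding of the part name "handle"

total = geometric_loss(anchor, positive, negatives) + semantic_loss(anchor, text_emb)
```

Well-separated part features yield a much lower geometric loss than collapsed features, which is the signal the contrastive training exploits.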

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Part-Aware 3D Feature Field (PA3FF)

A 3D-native representation that encodes dense, semantic, and functional part-aware features directly from point clouds in a feedforward manner. The feature field is trained using contrastive learning on 3D part proposals from large-scale labeled datasets, where feature proximity reflects functional part similarity.

Contribution

Part-Aware Diffusion Policy (PADP)

An imitation learning framework that integrates PA3FF with a diffusion policy architecture for action generation. PADP leverages the part-aware 3D features to achieve sample-efficient and generalizable manipulation behaviors across diverse objects.

Contribution

Contrastive learning framework for part-aware features

A training approach combining a geometric loss (encouraging spatial consistency within parts) and a semantic loss (aligning point features with part-name embeddings from SigLIP) to enhance part awareness in the 3D feature field.