GenCape: Structure-Inductive Generative Modeling for Category-Agnostic Pose Estimation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Category-Agnostic Pose Estimation · Graph Learning · Variational Autoencoder
Abstract:

Category-agnostic pose estimation (CAPE) aims to localize keypoints on query images from arbitrary categories, using only a few annotated support examples for guidance. Recent approaches either treat keypoints as isolated entities or rely on manually defined skeleton priors, which are costly to annotate and inherently inflexible across diverse categories. Such oversimplification limits the model's capacity to capture the instance-wise structural cues critical for accurate pixel-level localization. To overcome these limitations, we propose GenCape, a generative framework for CAPE that infers keypoint relationships solely from image-based support inputs, without additional textual descriptions or predefined skeletons. Our framework consists of two principal components: an iterative Structure-aware Variational Autoencoder (i-SVAE) and a Compositional Graph Transfer (CGT) module. The former infers soft, instance-specific adjacency matrices from support features through variational inference, embedding them layer-wise into the Graph Transformer Decoder for progressive refinement of structural priors. The latter adaptively aggregates multiple latent graphs into a query-aware structure via Bayesian fusion and attention-based reweighting, enhancing resilience to visual uncertainty and support-induced bias. This structure-aware design facilitates effective message propagation among keypoints and promotes semantic alignment across object categories with diverse keypoint topologies. Experimental results on the MP-100 dataset show that our method achieves substantial gains over graph-support baselines under both 1-shot and 5-shot settings, while remaining competitive with text-support counterparts.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 34
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 6

Research Landscape Overview

Core task: category-agnostic pose estimation with learned keypoint structure. The field addresses the challenge of estimating object pose and keypoint configurations without relying on category-specific priors, enabling generalization across diverse object classes. The taxonomy reveals several complementary research directions. Few-Shot Support-Based Pose Estimation methods leverage small sets of annotated examples to guide keypoint prediction, often employing matching or meta-learning strategies to transfer structural knowledge from support images to query instances. Multimodal and Support-Free Approaches explore alternative supervision signals, such as language or self-supervised cues, to bypass the need for explicit support sets. Multi-Object and Open-Vocabulary Detection extends pose estimation to scenarios involving multiple objects or open-ended category vocabularies, while Category-Level 6D Pose Estimation focuses on recovering full 3D orientation and translation. Domain-Specific Pose Estimation tailors methods to particular application areas like human or animal pose, and Foundational Techniques and Auxiliary Methods provide core algorithmic building blocks such as graph-based reasoning and keypoint refinement.

Within Few-Shot Support-Based Pose Estimation, a particularly active line of work explores how to model keypoint structure explicitly. Generative and Bayesian Frameworks offer probabilistic treatments of keypoint dependencies, enabling uncertainty quantification and structured prediction. GenCape[0] exemplifies this direction by employing structure-inductive generative modeling to learn keypoint relationships in a principled manner. This contrasts with more direct matching-based approaches like X-pose[2] or CapeX[4], which rely on feature correspondence between support and query images without explicit probabilistic modeling. Meanwhile, works such as Learning structure-supporting dependencies via[3] and Recurrent Feature Mining and[5] emphasize iterative refinement and dependency learning, highlighting ongoing efforts to balance structural expressiveness with computational efficiency. GenCape[0] sits at the intersection of these themes, leveraging generative priors to capture keypoint structure while maintaining flexibility across object categories.

Claimed Contributions

GenCape framework with iterative Structure-aware Variational Autoencoder

The authors propose GenCape, a generative framework that uses an iterative Structure-aware Variational Autoencoder (i-SVAE) to learn instance-specific keypoint relationships (adjacency matrices) directly from support images, eliminating the need for predefined anatomical priors or textual descriptions.
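To make the claimed mechanism concrete, the sketch below shows one plausible way a structure-aware VAE could infer a soft adjacency matrix over keypoints from support features. This is a hypothetical illustration, not the authors' implementation: the class name `StructureVAE`, the pairwise-concatenation encoder, and all layer sizes are assumptions; only the general recipe (per-edge latent distributions, reparameterization, a sigmoid-decoded soft adjacency, and a KL term against a standard normal prior) follows the contribution as described.

```python
import torch
import torch.nn as nn


class StructureVAE(nn.Module):
    """Hypothetical i-SVAE-style sketch: infers a soft adjacency
    matrix over K keypoints from per-keypoint support features."""

    def __init__(self, feat_dim: int, latent_dim: int = 32):
        super().__init__()
        # Encode each ordered keypoint pair into a latent edge distribution.
        self.encoder = nn.Sequential(nn.Linear(2 * feat_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        # Decode each latent edge code into a scalar edge strength.
        self.decoder = nn.Linear(latent_dim, 1)

    def forward(self, support_feats: torch.Tensor):
        # support_feats: (K, feat_dim), one feature vector per keypoint.
        K, D = support_feats.shape
        pairs = torch.cat([
            support_feats.unsqueeze(1).expand(K, K, D),
            support_feats.unsqueeze(0).expand(K, K, D),
        ], dim=-1)                                    # (K, K, 2D) ordered pairs
        h = self.encoder(pairs)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample edge codes differentiably.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        adj = torch.sigmoid(self.decoder(z)).squeeze(-1)  # soft (K, K) adjacency
        adj = 0.5 * (adj + adj.T)                     # symmetrize the graph
        # KL divergence against a standard normal prior, as in a vanilla VAE.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return adj, kl
```

In an iterative variant such as the paper describes, the resulting `adj` would be fed into each layer of a graph transformer decoder, with refined features re-entering the encoder at the next iteration.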

10 retrieved papers
Compositional Graph Transfer mechanism

The authors introduce a Compositional Graph Transfer (CGT) module that fuses multiple latent graph hypotheses into a query-aware structure using Bayesian fusion and attention-based reweighting, improving robustness under noisy or mismatched support scenarios.
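A minimal sketch of what such a fusion step might look like, under stated assumptions: the function name, the use of per-edge log-variances as uncertainty, the dot-product attention over global support embeddings, and the 50/50 blend of the two estimates are all illustrative choices, not the paper's actual CGT module. The sketch only demonstrates the two ingredients the contribution names: precision-weighted (Bayesian) fusion of several latent graphs, and attention-based reweighting toward supports that resemble the query.

```python
import torch
import torch.nn.functional as F


def compositional_graph_transfer(adjs, logvars, support_emb, query_emb):
    """Hypothetical CGT-style sketch: fuse S latent graphs into one
    query-aware adjacency matrix.

    adjs:        (S, K, K) soft adjacency means, one per support image
    logvars:     (S, K, K) per-edge log-variances (uncertainty)
    support_emb: (S, D) one global embedding per support image
    query_emb:   (D,) global embedding of the query image
    """
    # Bayesian (precision-weighted) fusion: low-variance edges dominate.
    precision = torch.exp(-logvars)                        # (S, K, K)
    fused = (precision * adjs).sum(0) / precision.sum(0)   # (K, K)

    # Attention-based reweighting toward supports similar to the query.
    scores = support_emb @ query_emb / support_emb.shape[-1] ** 0.5  # (S,)
    attn = F.softmax(scores, dim=0)
    reweighted = torch.einsum('s,skl->kl', attn, adjs)     # (K, K)

    # Blend the two estimates; the equal mixing ratio is an assumption.
    return 0.5 * (fused + reweighted)
```

Because both estimates are convex combinations of the input adjacencies, the fused graph stays in [0, 1] and degrades gracefully when one support is noisy or mismatched: its edges receive both low precision and low attention weight.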

10 retrieved papers
State-of-the-art performance on MP-100 without external annotations

The authors report that GenCape achieves state-of-the-art results on the MP-100 benchmark in both 1-shot and 5-shot settings, outperforming existing methods without requiring external structural or textual annotations.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

GenCape framework with iterative Structure-aware Variational Autoencoder

The authors propose GenCape, a generative framework that uses an iterative Structure-aware Variational Autoencoder (i-SVAE) to learn instance-specific keypoint relationships (adjacency matrices) directly from support images, eliminating the need for predefined anatomical priors or textual descriptions.

Contribution

Compositional Graph Transfer mechanism

The authors introduce a Compositional Graph Transfer (CGT) module that fuses multiple latent graph hypotheses into a query-aware structure using Bayesian fusion and attention-based reweighting, improving robustness under noisy or mismatched support scenarios.

Contribution

State-of-the-art performance on MP-100 without external annotations

The authors report that GenCape achieves state-of-the-art results on the MP-100 benchmark in both 1-shot and 5-shot settings, outperforming existing methods without requiring external structural or textual annotations.