GenCape: Structure-Inductive Generative Modeling for Category-Agnostic Pose Estimation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Category-Agnostic Pose Estimation · Graph Learning · Variational Autoencoder
Abstract:

Category-agnostic pose estimation (CAPE) aims to localize keypoints on query images from arbitrary categories, using only a few annotated support examples for guidance. Recent approaches either treat keypoints as isolated entities or rely on manually defined skeleton priors, which are costly to annotate and inherently inflexible across diverse categories. Such oversimplification limits the model's capacity to capture the instance-wise structural cues critical for accurate pixel-level localization. To overcome these limitations, we propose GenCape, a generative framework for CAPE that infers keypoint relationships solely from image-based support inputs, without additional textual descriptions or predefined skeletons. Our framework consists of two principal components: an iterative Structure-aware Variational Autoencoder (i-SVAE) and a Compositional Graph Transfer (CGT) module. The former infers soft, instance-specific adjacency matrices from support features through variational inference, embedding them layer-wise into the Graph Transformer Decoder for progressive refinement of structural priors. The latter adaptively aggregates multiple latent graphs into a query-aware structure via Bayesian fusion and attention-based reweighting, enhancing resilience to visual uncertainty and support-induced bias. This structure-aware design facilitates effective message propagation among keypoints and promotes semantic alignment across object categories with diverse keypoint topologies. Experimental results on the MP-100 dataset show that our method achieves substantial gains over graph-support baselines under both 1-shot and 5-shot settings, while remaining competitive with text-support counterparts.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 34
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 6

Research Landscape Overview

Core task: category-agnostic pose estimation with learned keypoint structure. The field addresses the challenge of estimating object pose and keypoint configurations without relying on category-specific priors, enabling generalization across diverse object classes. The taxonomy reveals several complementary research directions. Few-Shot Support-Based Pose Estimation methods leverage small sets of annotated examples to guide keypoint prediction, often employing matching or meta-learning strategies to transfer structural knowledge from support images to query instances. Multimodal and Support-Free Approaches explore alternative supervision signals, such as language or self-supervised cues, to bypass the need for explicit support sets. Multi-Object and Open-Vocabulary Detection extends pose estimation to scenarios involving multiple objects or open-ended category vocabularies, while Category-Level 6D Pose Estimation focuses on recovering full 3D orientation and translation. Domain-Specific Pose Estimation tailors methods to particular application areas like human or animal pose, and Foundational Techniques and Auxiliary Methods provide core algorithmic building blocks such as graph-based reasoning and keypoint refinement.

Within Few-Shot Support-Based Pose Estimation, a particularly active line of work explores how to model keypoint structure explicitly. Generative and Bayesian Frameworks offer probabilistic treatments of keypoint dependencies, enabling uncertainty quantification and structured prediction. GenCape[0] exemplifies this direction by employing structure-inductive generative modeling to learn keypoint relationships in a principled manner. This contrasts with more direct matching-based approaches like X-pose[2] or CapeX[4], which rely on feature correspondence between support and query images without explicit probabilistic modeling. Meanwhile, works such as Learning structure-supporting dependencies via[3] and Recurrent Feature Mining and[5] emphasize iterative refinement and dependency learning, highlighting ongoing efforts to balance structural expressiveness with computational efficiency. GenCape[0] sits at the intersection of these themes, leveraging generative priors to capture keypoint structure while maintaining flexibility across object categories.

Claimed Contributions

GenCape framework with iterative Structure-aware Variational Autoencoder

The authors propose GenCape, a generative framework that uses an iterative Structure-aware Variational Autoencoder (i-SVAE) to learn instance-specific keypoint relationships (adjacency matrices) directly from support images, eliminating the need for predefined anatomical priors or textual descriptions.
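To make the claimed mechanism concrete, the sketch below shows one plausible way a structure-aware VAE could infer a soft adjacency matrix over keypoints from support features. This is a hypothetical illustration, not the authors' implementation: the class name `StructureVAE`, the pairwise-concatenation encoder, and all layer sizes are assumptions; only the general recipe (per-edge latent distributions, reparameterization, a sigmoid-decoded soft adjacency, and a KL term against a standard normal prior) follows the contribution as described.

```python
import torch
import torch.nn as nn


class StructureVAE(nn.Module):
    """Hypothetical i-SVAE-style sketch: infers a soft adjacency
    matrix over K keypoints from per-keypoint support features."""

    def __init__(self, feat_dim: int, latent_dim: int = 32):
        super().__init__()
        # Encode each ordered keypoint pair into a latent edge distribution.
        self.encoder = nn.Sequential(nn.Linear(2 * feat_dim, 128), nn.ReLU())
        self.to_mu = nn.Linear(128, latent_dim)
        self.to_logvar = nn.Linear(128, latent_dim)
        # Decode each latent edge code into a scalar edge strength.
        self.decoder = nn.Linear(latent_dim, 1)

    def forward(self, support_feats: torch.Tensor):
        # support_feats: (K, feat_dim), one feature vector per keypoint.
        K, D = support_feats.shape
        pairs = torch.cat([
            support_feats.unsqueeze(1).expand(K, K, D),
            support_feats.unsqueeze(0).expand(K, K, D),
        ], dim=-1)                                    # (K, K, 2D) ordered pairs
        h = self.encoder(pairs)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        # Reparameterization trick: sample edge codes differentiably.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        adj = torch.sigmoid(self.decoder(z)).squeeze(-1)  # soft (K, K) adjacency
        adj = 0.5 * (adj + adj.T)                     # symmetrize the graph
        # KL divergence against a standard normal prior, as in a vanilla VAE.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return adj, kl
```

In an iterative variant such as the paper describes, the resulting `adj` would be fed into each layer of a graph transformer decoder, with refined features re-entering the encoder at the next iteration.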

10 retrieved papers
Compositional Graph Transfer mechanism

The authors introduce a Compositional Graph Transfer (CGT) module that fuses multiple latent graph hypotheses into a query-aware structure using Bayesian fusion and attention-based reweighting, improving robustness under noisy or mismatched support scenarios.
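A minimal sketch of what such a fusion step might look like, under stated assumptions: the function name, the use of per-edge log-variances as uncertainty, the dot-product attention over global support embeddings, and the 50/50 blend of the two estimates are all illustrative choices, not the paper's actual CGT module. The sketch only demonstrates the two ingredients the contribution names: precision-weighted (Bayesian) fusion of several latent graphs, and attention-based reweighting toward supports that resemble the query.

```python
import torch
import torch.nn.functional as F


def compositional_graph_transfer(adjs, logvars, support_emb, query_emb):
    """Hypothetical CGT-style sketch: fuse S latent graphs into one
    query-aware adjacency matrix.

    adjs:        (S, K, K) soft adjacency means, one per support image
    logvars:     (S, K, K) per-edge log-variances (uncertainty)
    support_emb: (S, D) one global embedding per support image
    query_emb:   (D,) global embedding of the query image
    """
    # Bayesian (precision-weighted) fusion: low-variance edges dominate.
    precision = torch.exp(-logvars)                        # (S, K, K)
    fused = (precision * adjs).sum(0) / precision.sum(0)   # (K, K)

    # Attention-based reweighting toward supports similar to the query.
    scores = support_emb @ query_emb / support_emb.shape[-1] ** 0.5  # (S,)
    attn = F.softmax(scores, dim=0)
    reweighted = torch.einsum('s,skl->kl', attn, adjs)     # (K, K)

    # Blend the two estimates; the equal mixing ratio is an assumption.
    return 0.5 * (fused + reweighted)
```

Because both estimates are convex combinations of the input adjacencies, the fused graph stays in [0, 1] and degrades gracefully when one support is noisy or mismatched: its edges receive both low precision and low attention weight.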

10 retrieved papers
State-of-the-art performance on MP-100 without external annotations

The authors report that GenCape achieves state-of-the-art results on the MP-100 benchmark in both 1-shot and 5-shot settings, outperforming existing methods without requiring external structural or textual annotations.

10 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

GenCape framework with iterative Structure-aware Variational Autoencoder

The authors propose GenCape, a generative framework that uses an iterative Structure-aware Variational Autoencoder (i-SVAE) to learn instance-specific keypoint relationships (adjacency matrices) directly from support images, eliminating the need for predefined anatomical priors or textual descriptions.

Contribution

Compositional Graph Transfer mechanism

The authors introduce a Compositional Graph Transfer (CGT) module that fuses multiple latent graph hypotheses into a query-aware structure using Bayesian fusion and attention-based reweighting, improving robustness under noisy or mismatched support scenarios.

Contribution

State-of-the-art performance on MP-100 without external annotations

The authors report that GenCape achieves state-of-the-art results on the MP-100 benchmark in both 1-shot and 5-shot settings, outperforming existing methods without requiring external structural or textual annotations.