DenseMarks: Learning Canonical Embeddings for Human Head Images via Point Tracks

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: human heads, 3D shape correspondence, foundation models, vision transformer, point tracking
Abstract:

We propose DenseMarks -- a new learned representation for human heads, enabling high-quality dense correspondences of human head images. For a 2D image of a human head, a Vision Transformer network predicts a 3D embedding for each pixel, which corresponds to a location in a 3D canonical unit cube. To train our network, we collect a dataset of pairwise point matches, estimated by a state-of-the-art point tracker over a collection of diverse in-the-wild talking head videos, and guide the mapping via a contrastive loss, encouraging matched points to have close embeddings. We further employ multi-task learning with face landmark and segmentation constraints, and impose spatial continuity of embeddings through latent cube features, which results in an interpretable and queryable canonical space. The representation can be used for finding common semantic parts, face/head tracking, and stereo reconstruction. Due to the strong supervision, our method is robust to pose variations and covers the entire head, including hair. Additionally, the canonical space bottleneck ensures that the obtained representations are consistent across diverse poses and individuals. We demonstrate state-of-the-art results in geometry-aware point matching and monocular head tracking with 3D Morphable Models. The code and the model checkpoint will be made available to the public.
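As a concrete illustration of how per-pixel canonical embeddings yield dense correspondences, the sketch below matches pixels between two images by nearest-neighbor search in the predicted 3D canonical coordinates. This is a minimal illustration, not the paper's implementation; the `match_pixels` helper and the toy embeddings are assumptions for demonstration only.

```python
import numpy as np

def match_pixels(emb_a, emb_b):
    """For each pixel embedding of image A, find the pixel of image B whose
    predicted canonical-cube coordinate is nearest in Euclidean distance.

    emb_a: (Na, 3) array of per-pixel 3D canonical embeddings for image A
    emb_b: (Nb, 3) array for image B
    Returns an (Na,) array of indices into emb_b.
    """
    # Pairwise squared distances between canonical coordinates
    d2 = ((emb_a[:, None, :] - emb_b[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

# Toy example: the same canonical coordinates should match regardless of order
a = np.array([[0.1, 0.2, 0.3], [0.8, 0.5, 0.9]])
b = np.array([[0.8, 0.5, 0.9], [0.1, 0.2, 0.3]])
print(match_pixels(a, b))  # → [1 0]
```

Because matching happens in the shared canonical space rather than in image space, the same procedure works across different poses and identities.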

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces DenseMarks, a learned 3D canonical embedding representation for human heads that maps each pixel in a 2D image to a location in a 3D unit cube. It resides in the Learned Embedding-Based Correspondence leaf, which contains five papers total, including the original work. This leaf sits within the broader Correspondence Representation and Learning Frameworks branch, indicating a moderately populated research direction focused on data-driven embedding methods rather than explicit parametric models. The taxonomy shows this is an active but not overcrowded area, with sibling works exploring similar embedding-based strategies for dense correspondence.

The taxonomy reveals neighboring research directions that provide important context. The sibling leaf, Parametric Model-Based Correspondence, contains seven papers using 3D morphable models and template registration, representing a more traditional approach. Adjacent branches include Correspondence-Driven Reconstruction and Synthesis (with subcategories for neural rendering and multi-view reconstruction) and Correspondence for Pose and Alignment Estimation (covering head pose and dense alignment). The scope notes clarify that DenseMarks belongs in the embedding-based category because it learns correspondences from data rather than relying on explicit parametric templates, distinguishing it from morphable model approaches in the neighboring leaf.

Among the twenty-five candidates examined, the contribution-level analysis reveals varying degrees of overlap with prior work. For the core DenseMarks representation, ten candidates were examined and two were found potentially refutable, suggesting some existing work on learned canonical embeddings. For the training procedure using point tracks and contrastive learning, ten candidates were examined with one refutable match, indicating that contrastive learning for correspondence is not entirely novel. For the interpretable canonical space, five candidates were examined with zero refutations, making this contribution appear more distinctive. These statistics reflect a limited semantic search scope rather than an exhaustive survey, so the true novelty landscape may differ from what these top-K matches suggest.

Based on the limited search scope of twenty-five semantically similar papers, the work appears to occupy a moderately explored niche within learned embedding-based correspondence. The taxonomy structure shows this is an established research direction with several related methods, though not as densely populated as parametric model-based approaches. The contribution-level statistics suggest incremental advances over existing embedding and contrastive learning techniques, with the canonical space design showing fewer direct precedents among examined candidates. A broader literature search might reveal additional overlapping work not captured in this top-K analysis.

Taxonomy

Core-task Taxonomy Papers: 47
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 3

Research Landscape Overview

Core task: Dense correspondence learning for human head images. The field organizes around several major branches that reflect different emphases in how correspondence is represented, applied, and evaluated. Correspondence Representation and Learning Frameworks explores foundational methods for encoding point-to-point mappings, ranging from classical geometric alignment to modern learned embeddings that capture semantic similarity across poses and identities. Correspondence-Driven Reconstruction and Synthesis focuses on leveraging these mappings to build 3D models or generate novel views, often integrating morphable models or neural rendering pipelines. Correspondence for Pose and Alignment Estimation targets the estimation of head orientation and facial landmark localization, which are critical for downstream tasks like gaze tracking and expression analysis. Correspondence for Body and Scene Understanding extends similar ideas beyond the head to full-body or environmental contexts, while Specialized Correspondence Applications and Analysis addresses niche problems such as medical imaging, interspecies alignment, or asymmetry measurement.

Representative works like Denserac[3] and Learning Dense Facial Correspondences[4] illustrate how embedding-based approaches have matured over time, bridging classical registration techniques with deep learning. Recent activity highlights a tension between dense pixel-level methods and sparser landmark-driven schemes, as well as trade-offs between generalization across identities versus fine-grained per-subject accuracy. Within the Learned Embedding-Based Correspondence branch, several studies pursue self-supervised or contrastive strategies to learn robust feature spaces without exhaustive manual annotation.
Densemarks[0] sits naturally in this cluster, emphasizing dense semantic embeddings for head images in a manner closely related to Learning Dense Correspondence for[1] and ConVol-E[24], both of which also explore volumetric or embedding-based representations. Compared to earlier geometric methods like Denserac[3], Densemarks[0] leverages modern neural architectures to handle greater variability in pose and lighting, while neighboring works such as Dense interspecies face embedding[25] extend similar embedding ideas to cross-species scenarios. These developments underscore ongoing questions about scalability, the role of synthetic training data, and the balance between dense correspondence quality and computational efficiency.

Claimed Contributions

DenseMarks: a learned 3D canonical embedding representation for human heads

The authors introduce a method that maps each pixel of a 2D human head image to a 3D location in a canonical unit cube via a Vision Transformer network. This representation enables dense correspondences across diverse poses and individuals, covering the entire head including hair and accessories.

10 retrieved papers · Can refute
Training procedure using point tracks from in-the-wild videos with contrastive learning

The authors develop a training approach that leverages point tracks from an off-the-shelf tracker on talking head videos. They use a contrastive loss to encourage matched points to have close embeddings, combined with multi-task learning using face landmarks and segmentation constraints.

10 retrieved papers · Can refute
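The report does not give the exact form of the contrastive loss; a common choice consistent with the description (matched points pulled together, other points in the batch serving as negatives) is an InfoNCE-style objective. The sketch below is a hedged numpy approximation, not the paper's loss; the `temperature` value and cosine normalization are assumptions.

```python
import numpy as np

def contrastive_loss(emb_q, emb_k, temperature=0.07):
    """InfoNCE-style loss over a batch of matched point pairs.

    emb_q[i] and emb_k[i] are embeddings of the same tracked point seen in
    two frames; all emb_k[j], j != i, serve as in-batch negatives.
    """
    # Cosine-normalize both sets of embeddings
    q = emb_q / np.linalg.norm(emb_q, axis=1, keepdims=True)
    k = emb_k / np.linalg.norm(emb_k, axis=1, keepdims=True)
    logits = q @ k.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (the matched pair) as the positive class
    return -np.mean(np.diag(log_prob))
```

Minimizing this loss drives embeddings of the same tracked point toward each other while pushing apart embeddings of different points, which is what shapes the canonical space.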
Interpretable and queryable canonical space with spatial continuity

The authors design a structured 3D canonical space by discretizing a unit cube with learnable latent features at each voxel and applying spatial smoothness via Gaussian filtering. This produces a semantically meaningful space where users can query specific regions and the embeddings remain smooth and continuous.

5 retrieved papers
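The voxelized canonical space described above can be sketched as a small feature grid with separable Gaussian smoothing and coordinate-based lookup. The grid size, sigma, and nearest-voxel querying below are illustrative assumptions; the paper's actual latent cube may differ (for instance, it may use trilinear interpolation).

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Normalized 1D Gaussian kernel of length 2*radius + 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth_grid(grid, sigma=1.0):
    """Separable Gaussian filtering of a (G, G, G, C) voxel feature grid,
    encouraging spatially continuous features across the canonical cube."""
    k = gaussian_kernel1d(sigma, radius=int(3 * sigma))
    out = grid
    for axis in range(3):  # filter along each spatial axis in turn
        out = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="same"), axis, out)
    return out

def query(grid, xyz):
    """Nearest-voxel lookup of features at canonical coordinates in [0, 1]^3."""
    G = grid.shape[0]
    idx = np.clip((np.asarray(xyz) * G).astype(int), 0, G - 1)
    return grid[tuple(idx)]
```

Smoothing the per-voxel features makes nearby canonical locations carry similar features, which is what allows users to query a region of the cube and retrieve a semantically coherent part of the head.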

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
