DenseMarks: Learning Canonical Embeddings for Human Head Images via Point Tracks

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: human heads, 3D shape correspondence, foundation models, vision transformer, point tracking
Abstract:

We propose DenseMarks -- a new learned representation for human heads, enabling high-quality dense correspondences of human head images. For a 2D image of a human head, a Vision Transformer network predicts a 3D embedding for each pixel, which corresponds to a location in a 3D canonical unit cube. To train our network, we collect a dataset of pairwise point matches, estimated by a state-of-the-art point tracker over a collection of diverse in-the-wild talking head videos, and guide the mapping via a contrastive loss, encouraging matched points to have close embeddings. We further employ multi-task learning with face landmark and segmentation constraints, and impose spatial continuity of embeddings through latent cube features, which results in an interpretable and queryable canonical space. The representation can be used for finding common semantic parts, face/head tracking, and stereo reconstruction. Due to the strong supervision, our method is robust to pose variations and covers the entire head, including hair. Additionally, the canonical space bottleneck ensures that the obtained representations are consistent across diverse poses and individuals. We demonstrate state-of-the-art results in geometry-aware point matching and monocular head tracking with 3D Morphable Models. The code and the model checkpoint will be made available to the public.
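As a concrete illustration of how per-pixel canonical embeddings yield dense correspondences, the sketch below matches pixels between two images by nearest-neighbor search in the predicted 3D canonical coordinates. This is a minimal illustration, not the paper's implementation; the `match_pixels` helper and the toy embeddings are assumptions for demonstration only.

```python
import numpy as np

def match_pixels(emb_a, emb_b):
    """For each pixel embedding of image A, find the pixel of image B whose
    predicted canonical-cube coordinate is nearest in Euclidean distance.

    emb_a: (Na, 3) array of per-pixel 3D canonical embeddings for image A
    emb_b: (Nb, 3) array for image B
    Returns an (Na,) array of indices into emb_b.
    """
    # Pairwise squared distances between canonical coordinates
    d2 = ((emb_a[:, None, :] - emb_b[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)

# Toy example: the same canonical coordinates should match regardless of order
a = np.array([[0.1, 0.2, 0.3], [0.8, 0.5, 0.9]])
b = np.array([[0.8, 0.5, 0.9], [0.1, 0.2, 0.3]])
print(match_pixels(a, b))  # → [1 0]
```

Because matching happens in the shared canonical space rather than in image space, the same procedure works across different poses and identities.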

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces DenseMarks, a learned 3D canonical embedding representation for human heads that maps each pixel in a 2D image to a location in a 3D unit cube. It resides in the Learned Embedding-Based Correspondence leaf, which contains five papers total, including the original work. This leaf sits within the broader Correspondence Representation and Learning Frameworks branch, indicating a moderately populated research direction focused on data-driven embedding methods rather than explicit parametric models. The taxonomy shows this is an active but not overcrowded area, with sibling works exploring similar embedding-based strategies for dense correspondence.

The taxonomy reveals neighboring research directions that provide important context. The sibling leaf, Parametric Model-Based Correspondence, contains seven papers using 3D morphable models and template registration, representing a more traditional approach. Adjacent branches include Correspondence-Driven Reconstruction and Synthesis (with subcategories for neural rendering and multi-view reconstruction) and Correspondence for Pose and Alignment Estimation (covering head pose and dense alignment). The scope notes clarify that DenseMarks belongs in the embedding-based category because it learns correspondences from data rather than relying on explicit parametric templates, distinguishing it from morphable model approaches in the neighboring leaf.

Among the twenty-five candidates examined, the contribution-level analysis reveals varying degrees of overlap with prior work. For the core DenseMarks representation, ten candidates were examined and two were found potentially refutable, suggesting some existing work on learned canonical embeddings. For the training procedure using point tracks and contrastive learning, ten candidates were examined with one refutable match, indicating that contrastive learning for correspondence is not entirely novel. For the interpretable canonical space, five candidates were examined with zero refutations, making this contribution appear more distinctive. These statistics reflect a limited semantic search scope rather than an exhaustive survey, so the true novelty landscape may differ from what these top-K matches suggest.

Based on the limited search scope of twenty-five semantically similar papers, the work appears to occupy a moderately explored niche within learned embedding-based correspondence. The taxonomy structure shows this is an established research direction with several related methods, though not as densely populated as parametric model-based approaches. The contribution-level statistics suggest incremental advances over existing embedding and contrastive learning techniques, with the canonical space design showing fewer direct precedents among examined candidates. A broader literature search might reveal additional overlapping work not captured in this top-K analysis.

Taxonomy

Core-task Taxonomy Papers: 47
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 3

Research Landscape Overview

Core task: Dense correspondence learning for human head images. The field organizes around several major branches that reflect different emphases in how correspondence is represented, applied, and evaluated. Correspondence Representation and Learning Frameworks explores foundational methods for encoding point-to-point mappings, ranging from classical geometric alignment to modern learned embeddings that capture semantic similarity across poses and identities. Correspondence-Driven Reconstruction and Synthesis focuses on leveraging these mappings to build 3D models or generate novel views, often integrating morphable models or neural rendering pipelines. Correspondence for Pose and Alignment Estimation targets the estimation of head orientation and facial landmark localization, which are critical for downstream tasks like gaze tracking and expression analysis. Correspondence for Body and Scene Understanding extends similar ideas beyond the head to full-body or environmental contexts, while Specialized Correspondence Applications and Analysis addresses niche problems such as medical imaging, interspecies alignment, or asymmetry measurement.

Representative works like Denserac[3] and Learning Dense Facial Correspondences[4] illustrate how embedding-based approaches have matured over time, bridging classical registration techniques with deep learning. Recent activity highlights a tension between dense pixel-level methods and sparser landmark-driven schemes, as well as trade-offs between generalization across identities versus fine-grained per-subject accuracy. Within the Learned Embedding-Based Correspondence branch, several studies pursue self-supervised or contrastive strategies to learn robust feature spaces without exhaustive manual annotation.
Densemarks[0] sits naturally in this cluster, emphasizing dense semantic embeddings for head images in a manner closely related to Learning Dense Correspondence for[1] and ConVol-E[24], both of which also explore volumetric or embedding-based representations. Compared to earlier geometric methods like Denserac[3], Densemarks[0] leverages modern neural architectures to handle greater variability in pose and lighting, while neighboring works such as Dense interspecies face embedding[25] extend similar embedding ideas to cross-species scenarios. These developments underscore ongoing questions about scalability, the role of synthetic training data, and the balance between dense correspondence quality and computational efficiency.

Claimed Contributions

DenseMarks: a learned 3D canonical embedding representation for human heads

The authors introduce a method that maps each pixel of a 2D human head image to a 3D location in a canonical unit cube via a Vision Transformer network. This representation enables dense correspondences across diverse poses and individuals, covering the entire head including hair and accessories.

10 retrieved papers · Can refute
Training procedure using point tracks from in-the-wild videos with contrastive learning

The authors develop a training approach that leverages point tracks from an off-the-shelf tracker on talking head videos. They use a contrastive loss to encourage matched points to have close embeddings, combined with multi-task learning using face landmarks and segmentation constraints.

10 retrieved papers · Can refute
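The report does not give the exact form of the contrastive loss; a common choice consistent with the description (matched points pulled together, other points in the batch serving as negatives) is an InfoNCE-style objective. The sketch below is a hedged numpy approximation, not the paper's loss; the `temperature` value and cosine normalization are assumptions.

```python
import numpy as np

def contrastive_loss(emb_q, emb_k, temperature=0.07):
    """InfoNCE-style loss over a batch of matched point pairs.

    emb_q[i] and emb_k[i] are embeddings of the same tracked point seen in
    two frames; all emb_k[j], j != i, serve as in-batch negatives.
    """
    # Cosine-normalize both sets of embeddings
    q = emb_q / np.linalg.norm(emb_q, axis=1, keepdims=True)
    k = emb_k / np.linalg.norm(emb_k, axis=1, keepdims=True)
    logits = q @ k.T / temperature               # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (the matched pair) as the positive class
    return -np.mean(np.diag(log_prob))
```

Minimizing this loss drives embeddings of the same tracked point toward each other while pushing apart embeddings of different points, which is what shapes the canonical space.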
Interpretable and queryable canonical space with spatial continuity

The authors design a structured 3D canonical space by discretizing a unit cube with learnable latent features at each voxel and applying spatial smoothness via Gaussian filtering. This produces a semantically meaningful space where users can query specific regions and the embeddings remain smooth and continuous.

5 retrieved papers
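The voxelized canonical space described above can be sketched as a small feature grid with separable Gaussian smoothing and coordinate-based lookup. The grid size, sigma, and nearest-voxel querying below are illustrative assumptions; the paper's actual latent cube may differ (for instance, it may use trilinear interpolation).

```python
import numpy as np

def gaussian_kernel1d(sigma, radius):
    """Normalized 1D Gaussian kernel of length 2*radius + 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def smooth_grid(grid, sigma=1.0):
    """Separable Gaussian filtering of a (G, G, G, C) voxel feature grid,
    encouraging spatially continuous features across the canonical cube."""
    k = gaussian_kernel1d(sigma, radius=int(3 * sigma))
    out = grid
    for axis in range(3):  # filter along each spatial axis in turn
        out = np.apply_along_axis(
            lambda v: np.convolve(v, k, mode="same"), axis, out)
    return out

def query(grid, xyz):
    """Nearest-voxel lookup of features at canonical coordinates in [0, 1]^3."""
    G = grid.shape[0]
    idx = np.clip((np.asarray(xyz) * G).astype(int), 0, G - 1)
    return grid[tuple(idx)]
```

Smoothing the per-voxel features makes nearby canonical locations carry similar features, which is what allows users to query a region of the cube and retrieve a semantically coherent part of the head.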

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
