Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Single Image Face Reconstruction, Face Tracking, Foundation Model Finetuning
Abstract:

We address the 3D reconstruction of human faces from a single RGB image. To this end, we propose Pixel3DMM, a set of highly generalized vision transformers that predict per-pixel geometric cues in order to constrain the optimization of a 3D morphable face model (3DMM). We exploit the latent features of the DINO foundation model and introduce a tailored surface normal and uv-coordinate prediction head. We train our model by registering three high-quality 3D face datasets against the FLAME mesh topology, which results in a total of over 1,000 identities and 976K images. For 3D face reconstruction, we propose a FLAME fitting optimization that solves for the 3DMM parameters from the uv-coordinate and normal estimates. To evaluate our method, we introduce a new benchmark for single-image face reconstruction, which features high diversity in facial expressions, viewing angles, and ethnicities. Crucially, our benchmark is the first to evaluate both posed and neutral facial geometry. Ultimately, our method outperforms the state of the art (SoTA) by over 15% in terms of geometric accuracy for posed facial expressions.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Pixel3DMM, a vision transformer-based approach that predicts per-pixel geometric cues (surface normals and UV coordinates) to guide FLAME 3DMM fitting. It resides in the 'Photometric and Geometric Refinement' leaf under 'Optimization-Based Fitting', which contains only two papers including this one. This leaf represents a relatively focused research direction within the broader parametric model-based reconstruction landscape, emphasizing iterative refinement through photometric and geometric constraints rather than direct parameter regression.

The taxonomy reveals that parametric model-based reconstruction dominates the field, with neighboring branches including regression-based parameter prediction (three sub-leaves with 7+ papers) and detail enhancement techniques (two sub-leaves with 4 papers). The paper's approach bridges optimization-based fitting with learned geometric priors, connecting to both the regression-based methods that predict parameters directly and the detail enhancement methods that add high-frequency geometry. The taxonomy's scope notes clarify that this leaf excludes landmark-only methods and pure photometric optimization, positioning the work at the intersection of learned feature extraction and geometric refinement.

Among the 22 candidates examined across the three claimed contributions, no clearly refuting prior work was identified: 3 candidates were examined for the vision transformer component for per-pixel cues, 9 for the FLAME fitting optimization, and 10 for the benchmark contribution, with no refutations in any group. This limited search scope suggests that, within the top-K semantic matches and citation expansion, the specific combination of transformer-based per-pixel prediction with uv-coordinate-driven FLAME fitting appears relatively unexplored, though the analysis does not claim exhaustive coverage of all related work.

Based on the limited literature search of 22 candidates, the work appears to occupy a relatively sparse position within its immediate taxonomy leaf. The combination of foundation model features, per-pixel geometric prediction, and FLAME optimization represents a specific technical approach not directly matched in the examined candidates. However, the search scope remains constrained to semantic similarity and citations, leaving open the possibility of related work in adjacent research directions not captured by this analysis.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 22
Refutable papers: 0

Research Landscape Overview

Core task: 3D face reconstruction from single RGB images. The field has evolved into several major branches that reflect different modeling philosophies and application priorities. Parametric Model-Based Reconstruction remains a dominant paradigm, leveraging statistical face models (e.g., 3DMM) through optimization-based fitting or regression-based prediction to ensure plausible geometry. Neural Implicit Representations have emerged as a flexible alternative, encoding surfaces via learned functions that can capture fine details without explicit mesh topology. Direct Geometry Prediction methods bypass intermediate representations to output meshes or point clouds end-to-end, while Generative and Adversarial Approaches exploit GANs and diffusion models to synthesize realistic 3D faces. Specialized Reconstruction Scenarios address challenges like occlusion, extreme pose, or video sequences, and a smaller set of works tackles General Scene and Object Reconstruction beyond faces. Survey and Review Papers provide periodic snapshots of progress across these diverse directions.

Within Parametric Model-Based Reconstruction, a particularly active line focuses on Optimization-Based Fitting with Photometric and Geometric Refinement, where methods iteratively adjust model parameters to match image evidence while preserving realistic shape priors. Pixel3DMM[0] sits squarely in this cluster, emphasizing pixel-level photometric consistency and geometric detail refinement to improve reconstruction fidelity. This contrasts with purely regression-based approaches that predict parameters in a single forward pass, trading off iterative accuracy for speed. Nearby works such as Detailed RGB Face[9] similarly pursue high-fidelity geometry through careful alignment and refinement stages, while Learning Detailed Face[1] explores learning-based detail layers atop coarse parametric fits.
The central tension across these branches involves balancing model expressiveness, computational efficiency, and robustness to in-the-wild variations, with Pixel3DMM[0] contributing refined optimization strategies that leverage dense photometric cues for enhanced geometric accuracy.

Claimed Contributions

Pixel3DMM: Vision transformers for per-pixel geometric cues

The authors introduce Pixel3DMM, a pair of vision transformer networks that predict per-pixel surface normals and uv-coordinates. These networks exploit DINO foundation model features and are trained on registered high-quality 3D face datasets to provide geometric priors for constraining 3DMM optimization.

3 retrieved papers
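To make this contribution concrete, the sketch below illustrates the general shape of such a per-pixel prediction head: frozen ViT patch features are upsampled to pixel resolution and passed through small linear heads that output unit-length surface normals and uv-coordinates in [0, 1]. All shapes, dimensions, and the nearest-neighbour upsampling here are illustrative assumptions, not the authors' actual architecture.

```python
import numpy as np

# Hypothetical per-pixel prediction head on frozen ViT patch features.
# PATCH and C_FEAT are assumptions (DINOv2-style backbones use 14x14
# patches); the real Pixel3DMM heads are not specified at this level.
PATCH = 14      # assumed ViT patch size
C_FEAT = 384    # assumed backbone feature dimension

def upsample_nearest(feats, scale):
    """Nearest-neighbour upsample of an (h, w, c) feature map."""
    return np.repeat(np.repeat(feats, scale, axis=0), scale, axis=1)

def predict_pixel_cues(patch_feats, w_normal, w_uv):
    """Map patch features to per-pixel surface normals and uv-coordinates.

    patch_feats: (h, w, C_FEAT) frozen backbone features.
    w_normal:    (C_FEAT, 3) linear head for normals.
    w_uv:        (C_FEAT, 2) linear head for uv-coordinates.
    Returns normals (H, W, 3, unit length) and uvs (H, W, 2 in [0, 1]).
    """
    dense = upsample_nearest(patch_feats, PATCH)            # (H, W, C_FEAT)
    normals = dense @ w_normal                              # (H, W, 3)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True) + 1e-8
    uvs = 1.0 / (1.0 + np.exp(-(dense @ w_uv)))             # sigmoid -> [0, 1]
    return normals, uvs

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 16, C_FEAT))                   # e.g. a 224x224 image
normals, uvs = predict_pixel_cues(
    feats, rng.normal(size=(C_FEAT, 3)), rng.normal(size=(C_FEAT, 2)))
print(normals.shape, uvs.shape)                             # (224, 224, 3) (224, 224, 2)
```

In practice the linear heads would be replaced by learned transformer decoders, but the input/output contract (dense features in, per-pixel normals and uvs out) is the same.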
FLAME fitting optimization using uv-coordinates and normals

The authors develop a novel optimization-based 3D face reconstruction approach that fits FLAME model parameters by leveraging predicted uv-coordinates and surface normals. The method transfers uv-coordinate information into a 2D vertex loss to provide a wider basin of attraction during optimization.

9 retrieved papers
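The uv-to-2D-vertex idea described above can be sketched as follows: each foreground pixel's predicted uv-coordinate is matched to the mesh vertex with the nearest canonical uv, per-vertex pixel targets are averaged from those matches, and the loss penalizes the 2D distance between projected vertices and their targets. The brute-force nearest-neighbour matching and averaging scheme here are simplifying assumptions, not the paper's exact formulation.

```python
import numpy as np

# Illustrative sketch of converting a predicted uv-coordinate map into
# a 2D vertex loss (assumed correspondence scheme, not the paper's).

def uv_to_vertex_targets(uv_map, mask, vert_uv):
    """Assign each foreground pixel to the vertex with the nearest
    canonical uv-coordinate; return per-vertex mean pixel positions."""
    ys, xs = np.nonzero(mask)
    pix_uv = uv_map[ys, xs]                                   # (N, 2)
    d2 = ((pix_uv[:, None, :] - vert_uv[None, :, :]) ** 2).sum(-1)
    vid = d2.argmin(axis=1)                                   # nearest vertex per pixel
    V = len(vert_uv)
    targets, counts = np.zeros((V, 2)), np.zeros(V)
    np.add.at(targets, vid, np.stack([xs, ys], axis=1).astype(float))
    np.add.at(counts, vid, 1.0)
    seen = counts > 0
    targets[seen] /= counts[seen, None]
    return targets, seen

def vertex_2d_loss(proj_verts, targets, seen):
    """Mean 2D distance between projected vertices and pixel targets."""
    diff = proj_verts[seen] - targets[seen]
    return np.sqrt((diff ** 2).sum(-1)).mean()

# Toy example: two vertices, a 4x4 uv map split left/right between them.
vert_uv = np.array([[0.25, 0.5], [0.75, 0.5]])
uv_map = np.zeros((4, 4, 2)); uv_map[..., 1] = 0.5
uv_map[:, :2, 0] = 0.2; uv_map[:, 2:, 0] = 0.8
mask = np.ones((4, 4), dtype=bool)
targets, seen = uv_to_vertex_targets(uv_map, mask, vert_uv)
loss = vertex_2d_loss(np.array([[0.5, 1.5], [2.5, 1.5]]), targets, seen)
print(loss)   # 0.0 when projections match the pixel targets exactly
```

Because every covered vertex receives a direct 2D target, the gradient pulls the model toward correspondence even from a poor initialization, which is one plausible reading of the "wider basin of attraction" claim.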
New benchmark for single-image face reconstruction

The authors propose a new evaluation benchmark based on the NeRSemble dataset that includes diverse facial expressions, viewpoints, and ethnicities. This benchmark is the first to simultaneously evaluate both posed and neutral facial geometry, enabling better assessment of fitting fidelity and disentanglement of expression and identity.

10 retrieved papers
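A benchmark that scores both posed and neutral geometry needs a mesh-agnostic accuracy metric. A common choice, shown below as a minimal sketch, is a symmetric chamfer distance between predicted and ground-truth point clouds; the point clouds, noise level, and brute-force nearest-neighbour search here are illustrative, and the benchmark's actual metric may differ.

```python
import numpy as np

# Minimal sketch of a geometric-accuracy metric for such a benchmark:
# symmetric chamfer distance between two point clouds, which can be
# evaluated separately for posed and neutral geometry.

def chamfer_distance(pred, gt):
    """Symmetric mean nearest-neighbour distance between (N,3) and (M,3) clouds."""
    d2 = ((pred[:, None, :] - gt[None, :, :]) ** 2).sum(-1)   # (N, M) squared dists
    pred_to_gt = np.sqrt(d2.min(axis=1)).mean()
    gt_to_pred = np.sqrt(d2.min(axis=0)).mean()
    return 0.5 * (pred_to_gt + gt_to_pred)

rng = np.random.default_rng(0)
gt_posed = rng.normal(size=(256, 3))                          # stand-in GT scan points
pred_posed = gt_posed + 0.01 * rng.normal(size=(256, 3))      # small reconstruction error
err_posed = chamfer_distance(pred_posed, gt_posed)
err_self = chamfer_distance(gt_posed, gt_posed)               # identical clouds -> 0
print(err_self, err_posed)
```

Evaluating this metric on posed scans measures fitting fidelity, while evaluating it on the corresponding neutral scans (with expression parameters zeroed) probes how well identity and expression are disentangled.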

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Pixel3DMM: Vision transformers for per-pixel geometric cues

The authors introduce Pixel3DMM, a pair of vision transformer networks that predict per-pixel surface normals and uv-coordinates. These networks exploit DINO foundation model features and are trained on registered high-quality 3D face datasets to provide geometric priors for constraining 3DMM optimization.

Contribution

FLAME fitting optimization using uv-coordinates and normals

The authors develop a novel optimization-based 3D face reconstruction approach that fits FLAME model parameters by leveraging predicted uv-coordinates and surface normals. The method transfers uv-coordinate information into a 2D vertex loss to provide a wider basin of attraction during optimization.

Contribution

New benchmark for single-image face reconstruction

The authors propose a new evaluation benchmark based on the NeRSemble dataset that includes diverse facial expressions, viewpoints, and ethnicities. This benchmark is the first to simultaneously evaluate both posed and neutral facial geometry, enabling better assessment of fitting fidelity and disentanglement of expression and identity.