Pixel3DMM: Versatile Screen-Space Priors for Single-Image 3D Face Reconstruction

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Single Image Face Reconstruction, Face Tracking, Foundation Model Finetuning
Abstract:

We address the 3D reconstruction of human faces from a single RGB image. To this end, we propose Pixel3DMM, a set of highly generalized vision transformers that predict per-pixel geometric cues in order to constrain the optimization of a 3D morphable face model (3DMM). We exploit the latent features of the DINO foundation model and introduce a tailored surface normal and uv-coordinate prediction head. We train our model by registering three high-quality 3D face datasets against the FLAME mesh topology, which results in a total of over 1,000 identities and 976K images. For 3D face reconstruction, we propose a FLAME fitting optimization that solves for the 3DMM parameters from the uv-coordinate and normal estimates. To evaluate our method, we introduce a new benchmark for single-image face reconstruction, which features high diversity in facial expressions, viewing angles, and ethnicities. Crucially, our benchmark is the first to evaluate both posed and neutral facial geometry. Ultimately, our method outperforms the state of the art (SoTA) by over 15% in terms of geometric accuracy for posed facial expressions.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Pixel3DMM, a vision transformer-based approach that predicts per-pixel geometric cues (surface normals and UV coordinates) to guide FLAME 3DMM fitting. It resides in the 'Photometric and Geometric Refinement' leaf under 'Optimization-Based Fitting', which contains only two papers including this one. This leaf represents a relatively focused research direction within the broader parametric model-based reconstruction landscape, emphasizing iterative refinement through photometric and geometric constraints rather than direct parameter regression.

The taxonomy reveals that parametric model-based reconstruction dominates the field, with neighboring branches including regression-based parameter prediction (three sub-leaves with 7+ papers) and detail enhancement techniques (two sub-leaves with 4 papers). The paper's approach bridges optimization-based fitting with learned geometric priors, connecting to both the regression-based methods that predict parameters directly and the detail enhancement methods that add high-frequency geometry. The taxonomy's scope notes clarify that this leaf excludes landmark-only methods and pure photometric optimization, positioning the work at the intersection of learned feature extraction and geometric refinement.

Among the 22 candidates examined across the three claimed contributions, no clearly refuting prior work was identified: 3 candidates were examined for the vision transformer component for per-pixel cues, 9 for the FLAME fitting optimization, and 10 for the benchmark contribution, with no refutations in any group. This limited search scope suggests that, within the top-K semantic matches and citation expansion, the specific combination of transformer-based per-pixel prediction with uv-coordinate-driven FLAME fitting appears relatively unexplored, though the analysis does not claim exhaustive coverage of all related work.

Based on the limited literature search of 22 candidates, the work appears to occupy a relatively sparse position within its immediate taxonomy leaf. The combination of foundation model features, per-pixel geometric prediction, and FLAME optimization represents a specific technical approach not directly matched in the examined candidates. However, the search scope remains constrained to semantic similarity and citations, leaving open the possibility of related work in adjacent research directions not captured by this analysis.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 22
Refutable papers: 0

Research Landscape Overview

Core task: 3D face reconstruction from single RGB images. The field has evolved into several major branches that reflect different modeling philosophies and application priorities. Parametric Model-Based Reconstruction remains a dominant paradigm, leveraging statistical face models (e.g., 3DMM) through optimization-based fitting or regression-based prediction to ensure plausible geometry. Neural Implicit Representations have emerged as a flexible alternative, encoding surfaces via learned functions that can capture fine details without explicit mesh topology. Direct Geometry Prediction methods bypass intermediate representations to output meshes or point clouds end-to-end, while Generative and Adversarial Approaches exploit GANs and diffusion models to synthesize realistic 3D faces. Specialized Reconstruction Scenarios address challenges like occlusion, extreme pose, or video sequences, and a smaller set of works tackles General Scene and Object Reconstruction beyond faces. Survey and Review Papers provide periodic snapshots of progress across these diverse directions.

Within Parametric Model-Based Reconstruction, a particularly active line focuses on Optimization-Based Fitting with Photometric and Geometric Refinement, where methods iteratively adjust model parameters to match image evidence while preserving realistic shape priors. Pixel3DMM[0] sits squarely in this cluster, emphasizing pixel-level photometric consistency and geometric detail refinement to improve reconstruction fidelity. This contrasts with purely regression-based approaches that predict parameters in a single forward pass, trading off iterative accuracy for speed. Nearby works such as Detailed RGB Face[9] similarly pursue high-fidelity geometry through careful alignment and refinement stages, while Learning Detailed Face[1] explores learning-based detail layers atop coarse parametric fits.
The central tension across these branches involves balancing model expressiveness, computational efficiency, and robustness to in-the-wild variations, with Pixel3DMM[0] contributing refined optimization strategies that leverage dense photometric cues for enhanced geometric accuracy.

Claimed Contributions

Pixel3DMM: Vision transformers for per-pixel geometric cues

The authors introduce Pixel3DMM, a pair of vision transformer networks that predict per-pixel surface normals and uv-coordinates. These networks exploit DINO foundation model features and are trained on registered high-quality 3D face datasets to provide geometric priors for constraining 3DMM optimization.

3 retrieved papers
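To make this contribution concrete, the sketch below illustrates the general shape of such a per-pixel prediction head: frozen ViT patch features are upsampled to pixel resolution and passed through small linear heads that output unit-length surface normals and uv-coordinates in [0, 1]. All shapes, dimensions, and the nearest-neighbour upsampling here are illustrative assumptions, not the authors' actual architecture.

```python
import numpy as np

# Hypothetical per-pixel prediction head on frozen ViT patch features.
# PATCH and C_FEAT are assumptions (DINOv2-style backbones use 14x14
# patches); the real Pixel3DMM heads are not specified at this level.
PATCH = 14      # assumed ViT patch size
C_FEAT = 384    # assumed backbone feature dimension

def upsample_nearest(feats, scale):
    """Nearest-neighbour upsample of an (h, w, c) feature map."""
    return np.repeat(np.repeat(feats, scale, axis=0), scale, axis=1)

def predict_pixel_cues(patch_feats, w_normal, w_uv):
    """Map patch features to per-pixel surface normals and uv-coordinates.

    patch_feats: (h, w, C_FEAT) frozen backbone features.
    w_normal:    (C_FEAT, 3) linear head for normals.
    w_uv:        (C_FEAT, 2) linear head for uv-coordinates.
    Returns normals (H, W, 3, unit length) and uvs (H, W, 2 in [0, 1]).
    """
    dense = upsample_nearest(patch_feats, PATCH)            # (H, W, C_FEAT)
    normals = dense @ w_normal                              # (H, W, 3)
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True) + 1e-8
    uvs = 1.0 / (1.0 + np.exp(-(dense @ w_uv)))             # sigmoid -> [0, 1]
    return normals, uvs

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 16, C_FEAT))                   # e.g. a 224x224 image
normals, uvs = predict_pixel_cues(
    feats, rng.normal(size=(C_FEAT, 3)), rng.normal(size=(C_FEAT, 2)))
print(normals.shape, uvs.shape)                             # (224, 224, 3) (224, 224, 2)
```

In practice the linear heads would be replaced by learned transformer decoders, but the input/output contract (dense features in, per-pixel normals and uvs out) is the same.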
FLAME fitting optimization using uv-coordinates and normals

The authors develop a novel optimization-based 3D face reconstruction approach that fits FLAME model parameters by leveraging predicted uv-coordinates and surface normals. The method transfers uv-coordinate information into a 2D vertex loss to provide a wider basin of attraction during optimization.

9 retrieved papers
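The uv-to-2D-vertex idea described above can be sketched as follows: each foreground pixel's predicted uv-coordinate is matched to the mesh vertex with the nearest canonical uv, per-vertex pixel targets are averaged from those matches, and the loss penalizes the 2D distance between projected vertices and their targets. The brute-force nearest-neighbour matching and averaging scheme here are simplifying assumptions, not the paper's exact formulation.

```python
import numpy as np

# Illustrative sketch of converting a predicted uv-coordinate map into
# a 2D vertex loss (assumed correspondence scheme, not the paper's).

def uv_to_vertex_targets(uv_map, mask, vert_uv):
    """Assign each foreground pixel to the vertex with the nearest
    canonical uv-coordinate; return per-vertex mean pixel positions."""
    ys, xs = np.nonzero(mask)
    pix_uv = uv_map[ys, xs]                                   # (N, 2)
    d2 = ((pix_uv[:, None, :] - vert_uv[None, :, :]) ** 2).sum(-1)
    vid = d2.argmin(axis=1)                                   # nearest vertex per pixel
    V = len(vert_uv)
    targets, counts = np.zeros((V, 2)), np.zeros(V)
    np.add.at(targets, vid, np.stack([xs, ys], axis=1).astype(float))
    np.add.at(counts, vid, 1.0)
    seen = counts > 0
    targets[seen] /= counts[seen, None]
    return targets, seen

def vertex_2d_loss(proj_verts, targets, seen):
    """Mean 2D distance between projected vertices and pixel targets."""
    diff = proj_verts[seen] - targets[seen]
    return np.sqrt((diff ** 2).sum(-1)).mean()

# Toy example: two vertices, a 4x4 uv map split left/right between them.
vert_uv = np.array([[0.25, 0.5], [0.75, 0.5]])
uv_map = np.zeros((4, 4, 2)); uv_map[..., 1] = 0.5
uv_map[:, :2, 0] = 0.2; uv_map[:, 2:, 0] = 0.8
mask = np.ones((4, 4), dtype=bool)
targets, seen = uv_to_vertex_targets(uv_map, mask, vert_uv)
loss = vertex_2d_loss(np.array([[0.5, 1.5], [2.5, 1.5]]), targets, seen)
print(loss)   # 0.0 when projections match the pixel targets exactly
```

Because every covered vertex receives a direct 2D target, the gradient pulls the model toward correspondence even from a poor initialization, which is one plausible reading of the "wider basin of attraction" claim.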
New benchmark for single-image face reconstruction

The authors propose a new evaluation benchmark based on the NeRSemble dataset that includes diverse facial expressions, viewpoints, and ethnicities. This benchmark is the first to simultaneously evaluate both posed and neutral facial geometry, enabling better assessment of fitting fidelity and disentanglement of expression and identity.

10 retrieved papers
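A benchmark that scores both posed and neutral geometry needs a mesh-agnostic accuracy metric. A common choice, shown below as a minimal sketch, is a symmetric chamfer distance between predicted and ground-truth point clouds; the point clouds, noise level, and brute-force nearest-neighbour search here are illustrative, and the benchmark's actual metric may differ.

```python
import numpy as np

# Minimal sketch of a geometric-accuracy metric for such a benchmark:
# symmetric chamfer distance between two point clouds, which can be
# evaluated separately for posed and neutral geometry.

def chamfer_distance(pred, gt):
    """Symmetric mean nearest-neighbour distance between (N,3) and (M,3) clouds."""
    d2 = ((pred[:, None, :] - gt[None, :, :]) ** 2).sum(-1)   # (N, M) squared dists
    pred_to_gt = np.sqrt(d2.min(axis=1)).mean()
    gt_to_pred = np.sqrt(d2.min(axis=0)).mean()
    return 0.5 * (pred_to_gt + gt_to_pred)

rng = np.random.default_rng(0)
gt_posed = rng.normal(size=(256, 3))                          # stand-in GT scan points
pred_posed = gt_posed + 0.01 * rng.normal(size=(256, 3))      # small reconstruction error
err_posed = chamfer_distance(pred_posed, gt_posed)
err_self = chamfer_distance(gt_posed, gt_posed)               # identical clouds -> 0
print(err_self, err_posed)
```

Evaluating this metric on posed scans measures fitting fidelity, while evaluating it on the corresponding neutral scans (with expression parameters zeroed) probes how well identity and expression are disentangled.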

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Pixel3DMM: Vision transformers for per-pixel geometric cues

The authors introduce Pixel3DMM, a pair of vision transformer networks that predict per-pixel surface normals and uv-coordinates. These networks exploit DINO foundation model features and are trained on registered high-quality 3D face datasets to provide geometric priors for constraining 3DMM optimization.

Contribution

FLAME fitting optimization using uv-coordinates and normals

The authors develop a novel optimization-based 3D face reconstruction approach that fits FLAME model parameters by leveraging predicted uv-coordinates and surface normals. The method transfers uv-coordinate information into a 2D vertex loss to provide a wider basin of attraction during optimization.

Contribution

New benchmark for single-image face reconstruction

The authors propose a new evaluation benchmark based on the NeRSemble dataset that includes diverse facial expressions, viewpoints, and ethnicities. This benchmark is the first to simultaneously evaluate both posed and neutral facial geometry, enabling better assessment of fitting fidelity and disentanglement of expression and identity.