FastGHA: Generalized Few-Shot 3D Gaussian Head Avatars with Real-Time Animation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Animation, Gaussian Avatar, Feedforward Gaussian Model
Abstract:

Despite recent progress in 3D Gaussian-based head avatar modeling, efficiently generating high-fidelity avatars remains a challenge. Current methods typically rely on extensive multi-view capture setups or monocular videos with per-identity optimization during inference, limiting their scalability and ease of use on unseen subjects. To overcome these limitations, we propose FastGHA, a feed-forward method that generates high-quality Gaussian head avatars from only a few input images while supporting real-time animation. Our approach directly learns a per-pixel Gaussian representation from the input images and aggregates multi-view information using a transformer-based encoder that fuses image features from both DINOv3 and the Stable Diffusion VAE. For real-time animation, we extend the explicit Gaussian representations with per-Gaussian features and introduce a lightweight MLP-based dynamic network to predict 3D Gaussian deformations from expression codes. Furthermore, to enhance the geometric smoothness of the 3D head, we employ point maps from a pre-trained large reconstruction model as geometry supervision. Experiments show that our approach significantly outperforms existing methods in both rendering quality and inference efficiency, while supporting real-time dynamic avatar animation.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes FastGHA, a feed-forward framework for generating 3D Gaussian head avatars from a few input images with real-time animation capability. It resides in the Transformer-Based Generalization leaf, which contains only three papers total, indicating a relatively sparse research direction within the broader few-shot reconstruction landscape. This leaf focuses specifically on using transformer architectures to aggregate multi-view information and learn generalizable representations, distinguishing it from prior-based or single-image approaches in neighboring leaves.

The taxonomy reveals that FastGHA's immediate neighbors include Prior-Based Generalization methods that leverage learned 3D head priors from multi-view datasets, and Single-Image Reconstruction techniques that generate avatars from a single input. The broader Generalized Few-Shot Reconstruction Methods branch contrasts with Identity-Specific Reconstruction from Monocular Video, which requires per-identity optimization, and Multi-View Capture-Based Reconstruction, which demands extensive camera setups. FastGHA's transformer-based aggregation and cross-identity generalization position it at the intersection of efficiency and quality, diverging from both identity-specific optimization and multi-view capture paradigms.

Among the 30 candidates examined, the first contribution (FastGHA framework) shows one refutable candidate out of 10 examined, suggesting some prior work in generalized few-shot reconstruction exists but is limited in scope. The second contribution (lightweight MLP-based deformation network) has two refutable candidates among 10 examined, indicating more substantial overlap in real-time animation techniques. The third contribution (geometry prior regularization using VGGT) shows no refutable candidates among 10 examined, suggesting this specific supervision approach may be more novel within the limited search scope.

Based on the top-30 semantic matches examined, the work appears to occupy a moderately explored niche within transformer-based generalization for few-shot avatar reconstruction. The sparse taxonomy leaf (three papers) and limited refutation evidence suggest the specific combination of feed-forward transformer aggregation with real-time MLP deformation may offer incremental novelty, though the analysis does not cover the full breadth of related work in adjacent reconstruction paradigms or recent unpublished developments.

Taxonomy

Core-task Taxonomy Papers: 33
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: few-shot 3D Gaussian head avatar reconstruction with real-time animation. The field organizes around several complementary directions. Generalized Few-Shot Reconstruction Methods aim to build models that can reconstruct avatars from minimal input across different identities, often leveraging transformer architectures or learned priors to achieve cross-identity generalization. Identity-Specific Reconstruction from Monocular Video focuses on optimizing high-fidelity avatars for individual subjects using longer monocular sequences, trading generalization for per-identity detail. Multi-View Capture-Based Reconstruction exploits synchronized camera rigs to achieve photorealistic quality and relightability, as seen in works like Relightable gaussian codec avatars[1] and Gaussian Pixel Codec Avatars[24]. Audio-Driven Animation tackles the challenge of synthesizing realistic lip sync and facial motion from speech signals, with methods such as GaussianSpeech[22] and EAvatar[23].

Specialized Animation and Generation Tasks explore text-to-avatar synthesis, expression transfer, and other creative applications, while Cross-Domain Applications and Surveys provide broader context and holistic reviews of talking head generation. Within the generalized few-shot branch, a key tension emerges between speed and quality: some approaches prioritize real-time performance by distilling complex priors into efficient feed-forward networks, while others invest in richer geometric or appearance models at the cost of longer inference.

FastGHA[0] sits squarely in the transformer-based generalization cluster, emphasizing rapid reconstruction from sparse views by learning cross-identity patterns. This contrasts with nearby identity-specific methods like StreamME[3], which optimizes per-subject fidelity through iterative refinement, and with multi-view systems such as SEGA[2] that assume denser capture setups.
The trade-off between generalization and per-identity detail remains a central open question, as does the challenge of maintaining temporal coherence and expression fidelity under extreme view sparsity.

Claimed Contributions

FastGHA framework for generalized few-shot 3D Gaussian head avatar reconstruction

The authors introduce FastGHA, a feed-forward method that generates high-quality animatable 3D Gaussian head avatars from only a few input images. The framework learns per-pixel Gaussian representations and aggregates multi-view information using a transformer-based encoder that fuses features from DINOv3 and Stable Diffusion VAE, achieving superior reconstruction quality compared to existing approaches.

10 retrieved papers
Can Refute
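The per-pixel Gaussian prediction described above can be illustrated with a minimal NumPy sketch. All dimensions and the single linear head are hypothetical stand-ins (the paper's actual encoder is a transformer fusing DINOv3 and Stable Diffusion VAE features); the sketch only shows the idea of fusing two feature maps by channel concatenation and decoding one Gaussian per pixel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: DINOv3-like features (384 channels) and
# SD-VAE-like latents (4 channels), both assumed resampled to a
# common H x W grid. Random data stands in for real features.
H, W = 8, 8
dino = rng.standard_normal((H, W, 384)).astype(np.float32)
vae = rng.standard_normal((H, W, 4)).astype(np.float32)

# Fuse by channel concatenation, then a linear head predicts per-pixel
# Gaussian parameters: 3 position offsets, 3 log-scales, 4 quaternion
# components, 1 opacity logit, 3 RGB values = 14 channels per pixel.
fused = np.concatenate([dino, vae], axis=-1)              # (H, W, 388)
W_head = rng.standard_normal((388, 14)).astype(np.float32) * 0.01
params = fused @ W_head                                   # (H, W, 14)

offsets = params[..., 0:3]
scales = np.exp(params[..., 3:6])                         # positive scales
quats = params[..., 6:10]
quats = quats / np.linalg.norm(quats, axis=-1, keepdims=True)  # unit rotations
opacity = 1.0 / (1.0 + np.exp(-params[..., 10:11]))       # sigmoid to (0, 1)
colors = params[..., 11:14]

print(params.shape)  # (8, 8, 14): one 14-parameter Gaussian per pixel
```

The exponential, normalization, and sigmoid activations are the standard way to keep 3D Gaussian parameters in their valid ranges; a real implementation would replace the linear head with the paper's transformer encoder.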
Lightweight MLP-based deformation network for real-time animation

The authors design a lightweight multi-layer perceptron that extends explicit Gaussian representations with learnable per-Gaussian features and predicts 3D Gaussian deformations from FLAME expression codes. This enables real-time dynamic avatar animation by acting independently on each Gaussian point for efficient and parallelizable deformation.

10 retrieved papers
Can Refute
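A minimal NumPy sketch of the point-wise deformation idea follows. The layer sizes, random weights, and two-layer MLP are hypothetical; the point is that the shared expression code is broadcast to every Gaussian and the network acts on each point independently, which is what makes the deformation fully parallelizable.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: N Gaussians, F-dim learnable per-Gaussian
# features, E-dim expression code (FLAME-style), H hidden units.
N, F, E, H = 1000, 32, 50, 64

feats = rng.standard_normal((N, F)).astype(np.float32)  # per-Gaussian features
expr = rng.standard_normal(E).astype(np.float32)        # expression code

W1 = rng.standard_normal((F + E, H)).astype(np.float32) * 0.05
b1 = np.zeros(H, dtype=np.float32)
W2 = rng.standard_normal((H, 3)).astype(np.float32) * 0.05
b2 = np.zeros(3, dtype=np.float32)

def deform(feats, expr):
    # Broadcast the shared expression code to every Gaussian, so the
    # MLP processes each point independently of all others.
    tiled = np.broadcast_to(expr, (feats.shape[0], expr.shape[0]))
    x = np.concatenate([feats, tiled], axis=-1)          # (N, F + E)
    h = np.maximum(x @ W1 + b1, 0.0)                     # ReLU
    return h @ W2 + b2                                   # (N, 3) position deltas

delta = deform(feats, expr)
print(delta.shape)  # (1000, 3)
```

Because every Gaussian's deformation is a small independent matrix-vector computation, the whole pass batches into two matrix multiplies, which is why an MLP of this shape can run in real time on a GPU.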
Geometry prior regularization using VGGT for improved 3D consistency

The authors employ point maps predicted from a pre-trained large reconstruction model (VGGT) as geometry supervision during training. Unlike prior work that directly uses predicted point maps as input, this approach incorporates the geometry prior as a regularization loss to enhance geometric smoothness and robustness without propagating artifacts from the prior.

10 retrieved papers
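The distinction drawn above, using the prior's point maps as a loss rather than as input, can be sketched as a simple masked regularization term. The shapes, the L1 form, and the confidence mask are assumptions for illustration; random arrays stand in for the model's predicted Gaussian centers and the VGGT point maps.

```python
import numpy as np

rng = np.random.default_rng(2)

H, W = 16, 16
# Stand-ins: predicted per-pixel Gaussian centers from the feed-forward
# model, and point maps from a frozen geometry prior (both (H, W, 3)
# xyz maps assumed to live in the same coordinate frame).
pred_points = rng.standard_normal((H, W, 3)).astype(np.float32)
prior_points = pred_points + 0.05 * rng.standard_normal((H, W, 3)).astype(np.float32)
valid = rng.random((H, W)) > 0.1        # hypothetical confidence mask

def geometry_reg_loss(pred, prior, mask):
    # Masked per-pixel L1 between predicted centers and prior point map.
    # Used as an extra training loss alongside the photometric objective,
    # so prior artifacts are softly penalized rather than fed forward
    # into the reconstruction as they would be if used as input.
    diff = np.abs(pred - prior).sum(axis=-1)             # (H, W)
    return float((diff * mask).sum() / max(mask.sum(), 1))

loss = geometry_reg_loss(pred_points, prior_points, valid)
print(loss >= 0.0)
```

Weighting this term against the rendering loss controls how strongly the geometry prior smooths the head surface; a low weight keeps the prior advisory, consistent with the regularization framing above.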

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
