UP2You: Fast Reconstruction of Yourself from Unconstrained Photo Collections

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: 3D clothed human reconstruction, image-based reconstruction, human digitization, SMPL, multi-view diffusion model
Abstract:

We present UP2You, the first tuning-free solution for reconstructing high-fidelity 3D clothed portraits from extremely unconstrained in-the-wild 2D photos. Unlike previous approaches that require "clean" inputs (e.g., full-body images with minimal occlusions, or well-calibrated cross-view captures), UP2You directly processes raw, unstructured photographs that may vary significantly in pose, viewpoint, cropping, and occlusion. Instead of compressing data into tokens for slow online text-to-3D optimization, we introduce a data-rectifier paradigm that efficiently converts unconstrained inputs into clean, orthogonal multi-view images in a single forward pass within seconds, simplifying the 3D reconstruction. Central to UP2You is a pose-correlated feature aggregation module (PCFA) that selectively fuses information from multiple reference images with respect to target poses, enabling better identity preservation and a nearly constant memory footprint as the number of observations grows. Extensive experiments on 4D-Dress, PuzzleIOI, and in-the-wild captures demonstrate that UP2You consistently surpasses previous methods in both geometric accuracy (Chamfer 15% ↓ and P2S 18% ↓ on PuzzleIOI) and texture fidelity (PSNR 21% ↑ and LPIPS 46% ↓ on 4D-Dress). UP2You is efficient (1.5 minutes per person) and versatile (supporting arbitrary pose control and training-free multi-garment 3D virtual try-on), making it practical for real-world scenarios where humans are casually captured. Both models and code will be released to facilitate future research on this underexplored task.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

UP2You introduces a tuning-free data rectifier paradigm that converts unconstrained in-the-wild photos into clean orthogonal multi-view images for 3D clothed human reconstruction. The paper resides in the 'Unconstrained Multi-View Methods' leaf, which contains four papers total, including the original work. This leaf sits within the broader 'Multi-View Sparse Reconstruction' branch, indicating a moderately populated research direction focused on handling uncontrolled capture conditions. The taxonomy reveals this is neither an overcrowded nor entirely sparse area, with sibling leaves addressing calibrated setups and limited-view scenarios, suggesting active exploration of different multi-view constraints.

The taxonomy structure shows UP2You's leaf neighbors calibrated multi-view methods that assume controlled environments and limited-view approaches designed for minimal consumer-device captures. The broader 'Input Modality and Capture Constraints' branch also includes single-view monocular reconstruction (with three distinct sub-approaches) and video-based temporal methods, highlighting alternative strategies for handling input variability. UP2You's focus on unconstrained multi-view inputs positions it between single-image methods that lack geometric consistency and calibrated approaches that sacrifice real-world applicability. The taxonomy's scope notes emphasize this leaf specifically excludes controlled capture, distinguishing it from sibling calibrated methods.

Among sixteen candidates examined across three contributions, no clearly refuting prior work was identified. For the core data-rectifier paradigm, five candidates were examined with zero refutations, suggesting this framing may be relatively novel within the limited search scope. For the PCFA module, ten candidates were analyzed without finding overlapping prior work, though this reflects top-K semantic matches rather than exhaustive coverage. For the Perceiver-based shape predictor, only one candidate was examined, indicating either sparse related work or limited retrieval. These statistics suggest the contributions appear distinct within the examined literature, though the modest search scale (sixteen total candidates) means potentially relevant work outside the top semantic matches remains unexplored.

Based on the limited literature search covering sixteen semantically similar papers, UP2You's contributions appear to occupy a relatively distinct position within unconstrained multi-view reconstruction. The absence of refuting candidates across all three contributions, combined with the moderately populated taxonomy leaf, suggests the work introduces novel technical elements while addressing an established problem space. However, the analysis explicitly does not cover exhaustive prior art beyond top-K retrieval and citation expansion.

Taxonomy

Core-task taxonomy papers: 30
Claimed contributions: 3
Contribution candidate papers compared: 16
Refutable papers: 0

Research Landscape Overview

Core task: 3D clothed human reconstruction from unconstrained photo collections. The field organizes around several complementary dimensions. Input Modality and Capture Constraints distinguishes methods by the type and quality of available imagery, ranging from controlled multi-view setups to casual internet photos, and determines what priors or regularization strategies are needed. Reconstruction Approach and Representation addresses the core algorithmic choices: whether to use parametric body models, implicit surfaces, or layered garment representations, and how to handle texture and geometry jointly. Domain Adaptation and Robustness focuses on bridging the gap between synthetic training data and real-world diversity, tackling challenges like pose variation, occlusion, and lighting. Application-Specific Methods tailors solutions to downstream tasks such as virtual try-on or avatar creation, while Datasets and Benchmarks provides the empirical foundation, offering both controlled captures like the MVP Human Dataset[6] and in-the-wild collections that stress-test generalization.

Within the multi-view sparse reconstruction branch, a central tension emerges between leveraging geometric consistency across views and handling the severe sparsity and pose ambiguity typical of uncontrolled collections. Works like Sparse MultiView Clothed[1] and Normal Maps Sparse[5] exploit multi-view cues to recover fine garment detail, yet must contend with incomplete coverage and inconsistent lighting.

UP2You[0] sits naturally in this cluster of unconstrained multi-view methods, emphasizing robustness to the variability inherent in casual photo sets, in contrast with more controlled approaches that assume dense viewpoints or studio conditions. Compared to HAMSt3R[8], which may prioritize different input assumptions or representation choices, UP2You[0] appears to focus on extracting coherent 3D geometry from minimal and noisy observations, a recurring challenge across this branch.

Claimed Contributions

UP2You: tuning-free data rectifier paradigm for unconstrained photo reconstruction

The authors introduce UP2You, a tuning-free method that acts as a data rectifier, directly converting unconstrained photo collections into clean orthogonal multi-view images and normal maps in a single forward pass (an interface sketch follows below). This paradigm shift enables efficient 3D reconstruction without requiring DreamBooth fine-tuning or SDS optimization.

5 retrieved papers
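To make the paradigm concrete, here is a minimal PyTorch sketch of the rectifier's input/output contract as described above: a variable-size set of casual photos goes in, and a fixed set of clean orthogonal views plus normal maps comes out in one forward pass, with no per-subject fine-tuning. RectifierUNet, the 6-channel RGB-plus-normal output layout, and the naive mean-pooling over references are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class RectifierUNet(nn.Module):
    """Hypothetical stand-in for the multi-view generator: maps a variable-size
    set of unconstrained reference photos to fixed, orthogonal target views."""
    def __init__(self, feat_dim: int = 64, num_views: int = 4):
        super().__init__()
        self.num_views = num_views
        self.encode = nn.Conv2d(3, feat_dim, 3, padding=1)
        # One decoding head for all views; 6 channels each = RGB + normal map.
        self.decode = nn.Conv2d(feat_dim, 6 * num_views, 3, padding=1)

    def forward(self, refs: torch.Tensor) -> torch.Tensor:
        # refs: (N, 3, H, W) -- N unconstrained photos of the same person.
        feats = self.encode(refs).mean(dim=0, keepdim=True)   # naive set pooling
        out = self.decode(feats)                              # (1, 6*V, H, W)
        return out.view(self.num_views, 6, *out.shape[-2:])   # (V, 6, H, W)

refs = torch.rand(7, 3, 128, 128)          # 7 casual photos: any pose or crop
views = RectifierUNet()(refs)              # one forward pass, no fine-tuning
rgb, normals = views[:, :3], views[:, 3:]  # clean orthogonal views + normals
print(rgb.shape, normals.shape)            # torch.Size([4, 3, 128, 128]) each
```

In the real system the generator is presumably a multi-view diffusion model conditioned on the references; the sketch only fixes the interface that makes downstream 3D reconstruction straightforward.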
Pose-Correlated Feature Aggregation (PCFA) module

The authors propose PCFA, a module that predicts correlation maps between reference images and target poses to selectively aggregate the most informative features (see the sketch below). This enables efficient processing of a varying number of input photos with nearly constant memory usage while preserving identity.

10 retrieved papers
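Read purely as a shape contract, the description suggests an aggregation step like the toy implementation below: queries come from a target-pose feature map, a correlation score is computed against every reference, and a softmax across references pools them into one fused feature map of fixed size. All tensor shapes, the dot-product correlation, and the spatial mean-pooling of reference values are simplifying assumptions on our part; the actual module presumably preserves more spatial detail.

```python
import torch
import torch.nn.functional as F

def pcfa(pose_feat: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
    """pose_feat: (C, H, W) target-pose features; ref_feats: (N, C, H, W)."""
    assert ref_feats.shape[1:] == pose_feat.shape
    C, H, W = pose_feat.shape
    q = pose_feat.flatten(1).t()               # (H*W, C) queries from the pose
    k = ref_feats.flatten(2).permute(0, 2, 1)  # (N, H*W, C) keys per reference
    # Correlation map: score each reference against each target location,
    # averaged over reference positions -> one weight per (target pixel, ref).
    corr = torch.einsum('qc,nkc->qnk', q, k).mean(-1) / C ** 0.5  # (H*W, N)
    w = F.softmax(corr, dim=1)                 # soft selection across references
    v = ref_feats.flatten(2).mean(-1)          # (N, C) pooled reference values
    fused = w @ v                              # (H*W, C) aggregated features
    return fused.t().view(C, H, W)             # fixed-size output for any N

pose_feat = torch.rand(32, 16, 16)
for n in (1, 4, 16):                           # more photos, same output size
    print(n, pcfa(pose_feat, torch.rand(n, 32, 16, 16)).shape)
```

Because the references are collapsed along the N axis, the fused output (and anything stored downstream) has the same size whether 1 or 16 photos are given, which is consistent with the near-constant memory claim.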
Perceiver-based multi-reference shape predictor

The authors design a shape predictor based on the Perceiver architecture that directly regresses SMPL-X shape parameters from unconstrained photo collections, eliminating the dependency on ground-truth body shapes or templates required by previous methods. A minimal sketch follows below.

1 retrieved paper
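The Perceiver reference suggests a latent-bottleneck regressor: a small, fixed set of learned latents cross-attends to however many image tokens the photo collection yields, keeping compute bounded, while an MLP head regresses the SMPL-X betas directly. The sketch below is a hedged approximation; the token dimension, the single attention layer, and the 10-dimensional beta output are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class PerceiverShapePredictor(nn.Module):
    def __init__(self, token_dim: int = 256, num_latents: int = 16,
                 num_betas: int = 10):
        super().__init__()
        # Learned latent queries form the fixed-size bottleneck.
        self.latents = nn.Parameter(torch.randn(num_latents, token_dim))
        self.cross_attn = nn.MultiheadAttention(token_dim, num_heads=4,
                                                batch_first=True)
        self.head = nn.Sequential(nn.LayerNorm(token_dim),
                                  nn.Linear(token_dim, num_betas))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (1, T, C) image tokens concatenated over all N photos;
        # T grows with N, but the latent bottleneck keeps cost bounded.
        q = self.latents.unsqueeze(0)                  # (1, L, C)
        fused, _ = self.cross_attn(q, tokens, tokens)  # (1, L, C)
        return self.head(fused.mean(dim=1))            # (1, num_betas)

tokens = torch.rand(1, 5 * 196, 256)   # e.g. 5 photos x 196 patch tokens each
betas = PerceiverShapePredictor()(tokens)
print(betas.shape)                     # torch.Size([1, 10]) SMPL-X shape params
```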
