UP2You: Fast Reconstruction of Yourself from Unconstrained Photo Collections

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: 3D clothed human reconstruction, image-based reconstruction, human digitization, SMPL, multi-view diffusion model
Abstract:

We present UP2You, the first tuning-free solution for reconstructing high-fidelity 3D clothed portraits from extremely unconstrained in-the-wild 2D photos. Unlike previous approaches that require "clean" inputs (e.g., full-body images with minimal occlusions, or well-calibrated cross-view captures), UP2You directly processes raw, unstructured photographs that may vary significantly in pose, viewpoint, cropping, and occlusion. Instead of compressing data into tokens for slow online text-to-3D optimization, we introduce a data-rectifier paradigm that efficiently converts unconstrained inputs into clean, orthogonal multi-view images in a single forward pass within seconds, simplifying the 3D reconstruction. Central to UP2You is a pose-correlated feature aggregation module (PCFA) that selectively fuses information from multiple reference images with respect to target poses, enabling better identity preservation and a nearly constant memory footprint as the number of observations grows. Extensive experiments on 4D-Dress, PuzzleIOI, and in-the-wild captures demonstrate that UP2You consistently surpasses previous methods in both geometric accuracy (Chamfer 15% ↓ and P2S 18% ↓ on PuzzleIOI) and texture fidelity (PSNR 21% ↑ and LPIPS 46% ↓ on 4D-Dress). UP2You is efficient (1.5 minutes per person) and versatile (supporting arbitrary pose control and training-free multi-garment 3D virtual try-on), making it practical for real-world scenarios where humans are casually captured. Both models and code will be released to facilitate future research on this underexplored task.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

UP2You introduces a tuning-free data rectifier paradigm that converts unconstrained in-the-wild photos into clean orthogonal multi-view images for 3D clothed human reconstruction. The paper resides in the 'Unconstrained Multi-View Methods' leaf, which contains four papers total, including the original work. This leaf sits within the broader 'Multi-View Sparse Reconstruction' branch, indicating a moderately populated research direction focused on handling uncontrolled capture conditions. The taxonomy reveals this is neither an overcrowded nor entirely sparse area, with sibling leaves addressing calibrated setups and limited-view scenarios, suggesting active exploration of different multi-view constraints.

The taxonomy structure shows UP2You's leaf neighbors calibrated multi-view methods that assume controlled environments and limited-view approaches designed for minimal consumer-device captures. The broader 'Input Modality and Capture Constraints' branch also includes single-view monocular reconstruction (with three distinct sub-approaches) and video-based temporal methods, highlighting alternative strategies for handling input variability. UP2You's focus on unconstrained multi-view inputs positions it between single-image methods that lack geometric consistency and calibrated approaches that sacrifice real-world applicability. The taxonomy's scope notes emphasize this leaf specifically excludes controlled capture, distinguishing it from sibling calibrated methods.

Among sixteen candidates examined across three contributions, no clearly refuting prior work was identified. For the core data-rectifier paradigm, five candidates were examined with zero refutations, suggesting this framing may be relatively novel within the limited search scope. For the PCFA module, ten candidates were analyzed without finding overlapping prior work, though this reflects top-K semantic matches rather than exhaustive coverage. For the Perceiver-based shape predictor, only one candidate was examined, indicating either sparse related work or limited retrieval. These statistics suggest the contributions appear distinct within the examined literature, though the modest search scale (sixteen total candidates) means potentially relevant work outside the top semantic matches remains unexplored.

Based on the limited literature search covering sixteen semantically similar papers, UP2You's contributions appear to occupy a relatively distinct position within unconstrained multi-view reconstruction. The absence of refuting candidates across all three contributions, combined with the moderately populated taxonomy leaf, suggests the work introduces novel technical elements while addressing an established problem space. However, the analysis explicitly does not cover exhaustive prior art beyond top-K retrieval and citation expansion.

Taxonomy

Core-task taxonomy papers: 30
Claimed contributions: 3
Contribution candidate papers compared: 16
Refutable papers: 0

Research Landscape Overview

Core task: 3D clothed human reconstruction from unconstrained photo collections. The field organizes around several complementary dimensions. Input Modality and Capture Constraints distinguishes methods by the type and quality of available imagery, ranging from controlled multi-view setups to casual internet photos, and determines what priors or regularization strategies are needed. Reconstruction Approach and Representation addresses the core algorithmic choices: whether to use parametric body models, implicit surfaces, or layered garment representations, and how to handle texture and geometry jointly. Domain Adaptation and Robustness focuses on bridging the gap between synthetic training data and real-world diversity, tackling challenges like pose variation, occlusion, and lighting. Application-Specific Methods tailors solutions to downstream tasks such as virtual try-on or avatar creation, while Datasets and Benchmarks provides the empirical foundation, offering both controlled captures like the MVP Human Dataset[6] and in-the-wild collections that stress-test generalization.

Within the multi-view sparse reconstruction branch, a central tension emerges between leveraging geometric consistency across views and handling the severe sparsity and pose ambiguity typical of uncontrolled collections. Works like Sparse MultiView Clothed[1] and Normal Maps Sparse[5] exploit multi-view cues to recover fine garment detail, yet must contend with incomplete coverage and inconsistent lighting.

UP2You[0] sits naturally in this cluster of unconstrained multi-view methods, emphasizing robustness to the variability inherent in casual photo sets, in contrast with more controlled approaches that assume dense viewpoints or studio conditions. Compared to HAMSt3R[8], which may prioritize different input assumptions or representation choices, UP2You[0] appears to focus on extracting coherent 3D geometry from minimal and noisy observations, a recurring challenge across this branch.

Claimed Contributions

UP2You: tuning-free data rectifier paradigm for unconstrained photo reconstruction

The authors introduce UP2You, a tuning-free method that acts as a data rectifier, directly converting unconstrained photo collections into clean orthogonal multi-view images and normal maps in a single forward pass (an interface sketch follows below). This paradigm shift enables efficient 3D reconstruction without requiring DreamBooth fine-tuning or SDS optimization.

5 retrieved papers
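To make the paradigm concrete, here is a minimal PyTorch sketch of the rectifier's input/output contract as described above: a variable-size set of casual photos goes in, and a fixed set of clean orthogonal views plus normal maps comes out in one forward pass, with no per-subject fine-tuning. RectifierUNet, the 6-channel RGB-plus-normal output layout, and the naive mean-pooling over references are illustrative assumptions, not the authors' actual architecture.

```python
import torch
import torch.nn as nn

class RectifierUNet(nn.Module):
    """Hypothetical stand-in for the multi-view generator: maps a variable-size
    set of unconstrained reference photos to fixed, orthogonal target views."""
    def __init__(self, feat_dim: int = 64, num_views: int = 4):
        super().__init__()
        self.num_views = num_views
        self.encode = nn.Conv2d(3, feat_dim, 3, padding=1)
        # One decoding head for all views; 6 channels each = RGB + normal map.
        self.decode = nn.Conv2d(feat_dim, 6 * num_views, 3, padding=1)

    def forward(self, refs: torch.Tensor) -> torch.Tensor:
        # refs: (N, 3, H, W) -- N unconstrained photos of the same person.
        feats = self.encode(refs).mean(dim=0, keepdim=True)   # naive set pooling
        out = self.decode(feats)                              # (1, 6*V, H, W)
        return out.view(self.num_views, 6, *out.shape[-2:])   # (V, 6, H, W)

refs = torch.rand(7, 3, 128, 128)          # 7 casual photos: any pose or crop
views = RectifierUNet()(refs)              # one forward pass, no fine-tuning
rgb, normals = views[:, :3], views[:, 3:]  # clean orthogonal views + normals
print(rgb.shape, normals.shape)            # torch.Size([4, 3, 128, 128]) each
```

In the real system the generator is presumably a multi-view diffusion model conditioned on the references; the sketch only fixes the interface that makes downstream 3D reconstruction straightforward.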
Pose-Correlated Feature Aggregation (PCFA) module

The authors propose PCFA, a module that predicts correlation maps between reference images and target poses to selectively aggregate the most informative features (see the sketch below). This enables efficient processing of a varying number of input photos with nearly constant memory usage while preserving identity.

10 retrieved papers
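Read purely as a shape contract, the description suggests an aggregation step like the toy implementation below: queries come from a target-pose feature map, a correlation score is computed against every reference, and a softmax across references pools them into one fused feature map of fixed size. All tensor shapes, the dot-product correlation, and the spatial mean-pooling of reference values are simplifying assumptions on our part; the actual module presumably preserves more spatial detail.

```python
import torch
import torch.nn.functional as F

def pcfa(pose_feat: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
    """pose_feat: (C, H, W) target-pose features; ref_feats: (N, C, H, W)."""
    assert ref_feats.shape[1:] == pose_feat.shape
    C, H, W = pose_feat.shape
    q = pose_feat.flatten(1).t()               # (H*W, C) queries from the pose
    k = ref_feats.flatten(2).permute(0, 2, 1)  # (N, H*W, C) keys per reference
    # Correlation map: score each reference against each target location,
    # averaged over reference positions -> one weight per (target pixel, ref).
    corr = torch.einsum('qc,nkc->qnk', q, k).mean(-1) / C ** 0.5  # (H*W, N)
    w = F.softmax(corr, dim=1)                 # soft selection across references
    v = ref_feats.flatten(2).mean(-1)          # (N, C) pooled reference values
    fused = w @ v                              # (H*W, C) aggregated features
    return fused.t().view(C, H, W)             # fixed-size output for any N

pose_feat = torch.rand(32, 16, 16)
for n in (1, 4, 16):                           # more photos, same output size
    print(n, pcfa(pose_feat, torch.rand(n, 32, 16, 16)).shape)
```

Because the references are collapsed along the N axis, the fused output (and anything stored downstream) has the same size whether 1 or 16 photos are given, which is consistent with the near-constant memory claim.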
Perceiver-based multi-reference shape predictor

The authors design a shape predictor based on the Perceiver architecture that directly regresses SMPL-X shape parameters from unconstrained photo collections, eliminating the dependency on ground-truth body shapes or templates required by previous methods. A minimal sketch follows below.

1 retrieved paper
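The Perceiver reference suggests a latent-bottleneck regressor: a small, fixed set of learned latents cross-attends to however many image tokens the photo collection yields, keeping compute bounded, while an MLP head regresses the SMPL-X betas directly. The sketch below is a hedged approximation; the token dimension, the single attention layer, and the 10-dimensional beta output are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class PerceiverShapePredictor(nn.Module):
    def __init__(self, token_dim: int = 256, num_latents: int = 16,
                 num_betas: int = 10):
        super().__init__()
        # Learned latent queries form the fixed-size bottleneck.
        self.latents = nn.Parameter(torch.randn(num_latents, token_dim))
        self.cross_attn = nn.MultiheadAttention(token_dim, num_heads=4,
                                                batch_first=True)
        self.head = nn.Sequential(nn.LayerNorm(token_dim),
                                  nn.Linear(token_dim, num_betas))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (1, T, C) image tokens concatenated over all N photos;
        # T grows with N, but the latent bottleneck keeps cost bounded.
        q = self.latents.unsqueeze(0)                  # (1, L, C)
        fused, _ = self.cross_attn(q, tokens, tokens)  # (1, L, C)
        return self.head(fused.mean(dim=1))            # (1, num_betas)

tokens = torch.rand(1, 5 * 196, 256)   # e.g. 5 photos x 196 patch tokens each
betas = PerceiverShapePredictor()(tokens)
print(betas.shape)                     # torch.Size([1, 10]) SMPL-X shape params
```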
