Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: single-view 3D human reconstruction, image-to-3D, multi-view diffusion model, alignment, post-training
Abstract:

Single-view 3D human reconstruction has achieved remarkable progress through the adoption of multi-view diffusion models, yet the recovered 3D humans often exhibit unnatural poses. This phenomenon becomes pronounced when reconstructing 3D humans with dynamic or challenging poses, which we attribute to the limited scale of available 3D human datasets with diverse poses. To address this limitation, we introduce DrPose, a Direct Reward fine-tuning algorithm on Poses, which enables post-training of a multi-view diffusion model on diverse poses without requiring expensive 3D human assets. DrPose trains a model using only human poses paired with single-view images, employing direct reward fine-tuning to maximize PoseScore, our proposed differentiable reward that quantifies consistency between a generated multi-view latent image and a ground-truth human pose. This optimization is conducted on DrPose15K, a novel dataset constructed from an existing human motion dataset and a pose-conditioned video generative model. Because it is built from abundant human pose sequence data, DrPose15K exhibits a broader pose distribution than existing 3D human datasets. We validate our approach through evaluation on conventional benchmark datasets, in-the-wild images, and a newly constructed benchmark, with a particular focus on assessing performance on challenging human poses. Our results demonstrate consistent qualitative and quantitative improvements across all benchmarks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces DrPose, a direct reward fine-tuning algorithm that optimizes multi-view diffusion models for diverse human poses using a differentiable PoseScore reward. It resides in the Diffusion-Based Reconstruction leaf alongside two sibling papers (SiTH and PSHuman), representing a relatively sparse research direction within the broader Generative Model-Driven Reconstruction branch. This leaf contains only three papers out of fifty in the taxonomy, suggesting that diffusion-based approaches for single-view 3D human reconstruction remain an emerging area compared to more established parametric model-based methods.

The taxonomy reveals that DrPose's parent branch, Generative Model-Driven Reconstruction, encompasses diffusion models, GANs, and neural rendering approaches. Neighboring leaves include GAN-Based Synthesis (focusing on adversarial training for normal map prediction) and Neural Rendering (using radiance fields for photorealistic synthesis). The scope note clarifies that diffusion-based methods specifically generate multi-view images or 3D representations conditioned on single-view input, distinguishing them from discriminative regression approaches in Direct Regression from Images and parametric fitting methods in Parametric Model-Based Reconstruction. DrPose's reward-guided sampling strategy represents a methodological departure from standard conditioning mechanisms used by its siblings.

Among twenty-two candidates examined across three contributions, none were identified as clearly refuting the work. For the DrPose algorithm, two candidates were examined with no refuting matches, while the DrPose15K dataset and the MixamoRP benchmark were each compared against ten candidates without finding overlapping prior work. This limited search scope (top-K semantic search plus citation expansion) suggests the analysis captures closely related diffusion-based reconstruction methods but may not encompass the full landscape of pose-aware training strategies or reward-based fine-tuning approaches in adjacent fields. The absence of refutations among the examined candidates indicates potential novelty within the diffusion-based reconstruction paradigm, though the small candidate pool limits definitive conclusions.

Based on the limited literature search covering twenty-two semantically similar papers, DrPose appears to occupy a relatively unexplored niche combining diffusion models with direct reward optimization for pose diversity. The sparse population of its taxonomy leaf and the absence of refutable prior work among examined candidates suggest methodological novelty, though the analysis does not cover potential overlaps with reinforcement learning-based pose refinement or reward-guided generation in broader computer vision contexts beyond the top-K matches.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: single-view 3D human reconstruction with diverse poses. The field has evolved into a rich ecosystem of approaches that can be broadly organized by their underlying representation and learning paradigm. Parametric model-based methods leverage statistical body models like SMPL Reconstruction[2] to constrain the solution space, while implicit representation methods encode geometry in continuous functions for finer detail. Generative model-driven reconstruction has emerged as a powerful branch, employing diffusion models and GANs to handle ambiguity and produce plausible outputs even under occlusion or extreme articulation. Pose estimation from 2D serves as a foundational step in many pipelines, converting image evidence into skeletal structure.

Probabilistic and multi-hypothesis methods address inherent depth ambiguities by modeling distributions over possible 3D configurations, and temporal or motion-based approaches exploit video sequences to resolve static-image uncertainties. Physics-based reconstruction incorporates biomechanical constraints, while specialized scenarios target particular domains such as crowded scenes or clinical gait analysis. Self-supervised and weakly-supervised methods reduce annotation burden, and application-specific branches tailor solutions to tasks like ergonomic assessment or yoga pose analysis.

Within the generative model-driven branch, diffusion-based reconstruction has attracted considerable attention for its ability to sample diverse plausible hypotheses and refine them iteratively. Direct Reward Poses[0] sits squarely in this diffusion-based cluster, emphasizing reward-guided sampling to handle challenging pose diversity. Nearby works such as SiTH[1] and PSHuman[7] also leverage diffusion priors but differ in their conditioning strategies and the granularity of shape detail they target.
In contrast, Complex Poses Cloth[3] focuses on handling intricate garment deformations alongside pose variation, highlighting a trade-off between geometric fidelity and computational efficiency. Multi-Hypothesis Diffusion[13] explores probabilistic ensembles to capture ambiguity, while Real-time RGB Reconstruction[9] prioritizes speed over exhaustive sampling. Direct Reward Poses[0] distinguishes itself by integrating task-specific reward signals directly into the diffusion process, offering a middle ground between purely data-driven generation and explicit constraint enforcement, and positioning it as a flexible framework for scenarios where pose priors alone may be insufficient.

Claimed Contributions

DrPose: Direct Reward Fine-Tuning Algorithm on Poses

The authors introduce DrPose, a direct reward fine-tuning algorithm that post-trains multi-view diffusion models using only human poses paired with single-view images. The method employs a differentiable reward function called PoseScore to maximize consistency between generated multi-view latent images and ground-truth human poses, enabling improved reconstruction of 3D humans in challenging postures without requiring expensive 3D human assets.
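The fine-tuning loop described above can be sketched as gradient ascent on a differentiable reward. The toy below is a minimal illustration, not the paper's code: `generate` stands in for the multi-view diffusion model's decoded latent, `pose_score` stands in for PoseScore (here simply the negative squared error to ground-truth keypoints), and gradients are derived by hand; all function names, the affine generator, and the quadratic reward are illustrative assumptions.

```python
# Toy sketch of direct reward fine-tuning (hypothetical stand-ins, not DrPose itself).

def generate(theta):
    # Stand-in for the generative model: maps parameters to "pose keypoints".
    return [2.0 * t + 1.0 for t in theta]

def pose_score(pred, gt):
    # Stand-in reward: higher is better (negative squared distance to ground truth).
    return -sum((p - g) ** 2 for p, g in zip(pred, gt))

def reward_grad(theta, gt):
    # d(reward)/d(theta_i) = -2 * (pred_i - gt_i) * d(pred_i)/d(theta_i),
    # with d(pred_i)/d(theta_i) = 2 for the affine generator above.
    pred = generate(theta)
    return [-2.0 * (p - g) * 2.0 for p, g in zip(pred, gt)]

def finetune(theta, gt, lr=0.05, steps=200):
    for _ in range(steps):
        grad = reward_grad(theta, gt)
        theta = [t + lr * g for t, g in zip(theta, grad)]  # gradient ascent on reward
    return theta

gt = [0.5, -1.0, 2.0]      # ground-truth "pose" keypoints
theta = [0.0, 0.0, 0.0]    # initial generator parameters
theta = finetune(theta, gt)
print(pose_score(generate(theta), gt))  # approaches 0, the maximal reward
```

The actual method backpropagates PoseScore through the diffusion model's multi-view latents; this sketch only shows the shape of the optimization, i.e. maximizing a pose-consistency reward directly rather than minimizing a reconstruction loss on 3D assets.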

2 retrieved papers
DrPose15K Dataset

The authors construct DrPose15K, a novel training dataset containing 15K diverse human poses paired with corresponding single-view images. The dataset is built by leveraging the Motion-X human motion dataset and a pose-conditioned video generative model, and exhibits broader pose distribution coverage than existing 3D human datasets.
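The construction recipe described above (subsample poses from motion sequences, then pair each pose with a frame from a pose-conditioned generator) can be sketched as follows. Everything here is a hypothetical stand-in: `render_frame` replaces the pose-conditioned video model, the symbolic pose strings replace Motion-X sequences, and the stride is an arbitrary choice for illustration.

```python
# Toy sketch of pose/image pair construction (all names hypothetical).

def render_frame(pose):
    # Stand-in for a pose-conditioned video generative model; returns a placeholder.
    return {"image": f"frame_for_{pose}"}

def build_pairs(motion_sequences, stride=4):
    pairs = []
    for seq in motion_sequences:
        for pose in seq[::stride]:  # subsample frames to diversify the pose set
            pairs.append((pose, render_frame(pose)["image"]))
    return pairs

# Two toy "motion sequences" of symbolic poses.
seqs = [[f"a{i}" for i in range(16)], [f"b{i}" for i in range(8)]]
pairs = build_pairs(seqs)
print(len(pairs))  # 4 + 2 = 6 (pose, image) pairs
```

The point of the design is that motion-capture pose sequences are abundant and cheap, so pairing them with generated frames sidesteps the need for scanned 3D human assets entirely.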

10 retrieved papers
MixamoRP Benchmark

The authors introduce MixamoRP, a new evaluation benchmark specifically designed to assess single-view 3D human reconstruction performance on complex and dynamic human poses. The benchmark contains 60 human scans constructed by assigning distinct poses from Mixamo animations to RenderPeople 3D models.
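The benchmark-assembly idea (one scan per distinct pose, each pose applied to a body model) can be sketched as a simple pairing. The 60/10 split below and the round-robin assignment are assumptions for illustration only; the report states only that 60 scans result from assigning Mixamo poses to RenderPeople models.

```python
# Toy sketch of scan construction by pose-to-model assignment (counts are assumed).
from itertools import cycle

def assign_poses(poses, models):
    # One scan per pose; models are reused in round-robin order.
    return [(p, m) for p, m in zip(poses, cycle(models))]

poses = [f"mixamo_pose_{i}" for i in range(60)]    # hypothetical pose list
models = [f"renderpeople_{j}" for j in range(10)]  # hypothetical model list
scans = assign_poses(poses, models)
print(len(scans))  # 60 scans
```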

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

DrPose: Direct Reward Fine-Tuning Algorithm on Poses

The authors introduce DrPose, a direct reward fine-tuning algorithm that post-trains multi-view diffusion models using only human poses paired with single-view images. The method employs a differentiable reward function called PoseScore to maximize consistency between generated multi-view latent images and ground-truth human poses, enabling improved reconstruction of 3D humans in challenging postures without requiring expensive 3D human assets.

Contribution

DrPose15K Dataset

The authors construct DrPose15K, a novel training dataset containing 15K diverse human poses paired with corresponding single-view images. The dataset is built by leveraging the Motion-X human motion dataset and a pose-conditioned video generative model, and exhibits broader pose distribution coverage than existing 3D human datasets.

Contribution

MixamoRP Benchmark

The authors introduce MixamoRP, a new evaluation benchmark specifically designed to assess single-view 3D human reconstruction performance on complex and dynamic human poses. The benchmark contains 60 human scans constructed by assigning distinct poses from Mixamo animations to RenderPeople 3D models.
