Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: single-view 3D human reconstruction, image-to-3D, multi-view diffusion model, alignment, post-training
Abstract:

Single-view 3D human reconstruction has achieved remarkable progress through the adoption of multi-view diffusion models, yet the recovered 3D humans often exhibit unnatural poses. This phenomenon becomes pronounced when reconstructing 3D humans with dynamic or challenging poses, which we attribute to the limited scale of available 3D human datasets with diverse poses. To address this limitation, we introduce DrPose, a Direct Reward fine-tuning algorithm on Poses, which enables post-training of a multi-view diffusion model on diverse poses without requiring expensive 3D human assets. DrPose trains a model using only human poses paired with single-view images, employing direct reward fine-tuning to maximize PoseScore, our proposed differentiable reward that quantifies consistency between a generated multi-view latent image and a ground-truth human pose. This optimization is conducted on DrPose15K, a novel dataset constructed from an existing human motion dataset and a pose-conditioned video generative model. Because it is built from abundant human pose sequence data, DrPose15K exhibits a broader pose distribution than existing 3D human datasets. We validate our approach through evaluation on conventional benchmark datasets, in-the-wild images, and a newly constructed benchmark, with a particular focus on assessing performance on challenging human poses. Our results demonstrate consistent qualitative and quantitative improvements across all benchmarks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces DrPose, a direct reward fine-tuning algorithm that optimizes multi-view diffusion models for diverse human poses using a differentiable PoseScore reward. It resides in the Diffusion-Based Reconstruction leaf alongside two sibling papers (SiTH and PSHuman), representing a relatively sparse research direction within the broader Generative Model-Driven Reconstruction branch. This leaf contains only three papers out of fifty in the taxonomy, suggesting that diffusion-based approaches for single-view 3D human reconstruction remain an emerging area compared to more established parametric model-based methods.

The taxonomy reveals that DrPose's parent branch, Generative Model-Driven Reconstruction, encompasses diffusion models, GANs, and neural rendering approaches. Neighboring leaves include GAN-Based Synthesis (focusing on adversarial training for normal map prediction) and Neural Rendering (using radiance fields for photorealistic synthesis). The scope note clarifies that diffusion-based methods specifically generate multi-view images or 3D representations conditioned on single-view input, distinguishing them from discriminative regression approaches in Direct Regression from Images and parametric fitting methods in Parametric Model-Based Reconstruction. DrPose's reward-guided sampling strategy represents a methodological departure from standard conditioning mechanisms used by its siblings.

Among twenty-two candidates examined across three contributions, none were identified as clearly refuting the work. For the DrPose algorithm, two candidates were examined with no refuting matches, while the DrPose15K dataset and the MixamoRP benchmark were each compared against ten candidates without finding overlapping prior work. This limited search scope (top-K semantic search plus citation expansion) suggests the analysis captures closely related diffusion-based reconstruction methods but may not encompass the full landscape of pose-aware training strategies or reward-based fine-tuning approaches in adjacent fields. The absence of refutations among the examined candidates indicates potential novelty within the diffusion-based reconstruction paradigm, though the small candidate pool limits definitive conclusions.

Based on the limited literature search covering twenty-two semantically similar papers, DrPose appears to occupy a relatively unexplored niche combining diffusion models with direct reward optimization for pose diversity. The sparse population of its taxonomy leaf and the absence of refutable prior work among examined candidates suggest methodological novelty, though the analysis does not cover potential overlaps with reinforcement learning-based pose refinement or reward-guided generation in broader computer vision contexts beyond the top-K matches.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: single-view 3D human reconstruction with diverse poses. The field has evolved into a rich ecosystem of approaches that can be broadly organized by their underlying representation and learning paradigm. Parametric model-based methods leverage statistical body models like SMPL Reconstruction[2] to constrain the solution space, while implicit representation methods encode geometry in continuous functions for finer detail. Generative model-driven reconstruction has emerged as a powerful branch, employing diffusion models and GANs to handle ambiguity and produce plausible outputs even under occlusion or extreme articulation. Pose estimation from 2D serves as a foundational step in many pipelines, converting image evidence into skeletal structure.

Probabilistic and multi-hypothesis methods address inherent depth ambiguities by modeling distributions over possible 3D configurations, and temporal or motion-based approaches exploit video sequences to resolve static-image uncertainties. Physics-based reconstruction incorporates biomechanical constraints, while specialized scenarios target particular domains such as crowded scenes or clinical gait analysis. Self-supervised and weakly-supervised methods reduce annotation burden, and application-specific branches tailor solutions to tasks like ergonomic assessment or yoga pose analysis.

Within the generative model-driven branch, diffusion-based reconstruction has attracted considerable attention for its ability to sample diverse plausible hypotheses and refine them iteratively. Direct Reward Poses[0] sits squarely in this diffusion-based cluster, emphasizing reward-guided sampling to handle challenging pose diversity. Nearby works such as SiTH[1] and PSHuman[7] also leverage diffusion priors but differ in their conditioning strategies and the granularity of shape detail they target.
In contrast, Complex Poses Cloth[3] focuses on handling intricate garment deformations alongside pose variation, highlighting a trade-off between geometric fidelity and computational efficiency. Multi-Hypothesis Diffusion[13] explores probabilistic ensembles to capture ambiguity, while Real-time RGB Reconstruction[9] prioritizes speed over exhaustive sampling. Direct Reward Poses[0] distinguishes itself by integrating task-specific reward signals directly into the diffusion process, offering a middle ground between purely data-driven generation and explicit constraint enforcement, and positioning it as a flexible framework for scenarios where pose priors alone may be insufficient.

Claimed Contributions

DrPose: Direct Reward Fine-Tuning Algorithm on Poses

The authors introduce DrPose, a direct reward fine-tuning algorithm that post-trains multi-view diffusion models using only human poses paired with single-view images. The method employs a differentiable reward function called PoseScore to maximize consistency between generated multi-view latent images and ground-truth human poses, enabling improved reconstruction of 3D humans in challenging postures without requiring expensive 3D human assets.
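The fine-tuning loop described above can be sketched as gradient ascent on a differentiable reward. The toy below is a minimal illustration, not the paper's code: `generate` stands in for the multi-view diffusion model's decoded latent, `pose_score` stands in for PoseScore (here simply the negative squared error to ground-truth keypoints), and gradients are derived by hand; all function names, the affine generator, and the quadratic reward are illustrative assumptions.

```python
# Toy sketch of direct reward fine-tuning (hypothetical stand-ins, not DrPose itself).

def generate(theta):
    # Stand-in for the generative model: maps parameters to "pose keypoints".
    return [2.0 * t + 1.0 for t in theta]

def pose_score(pred, gt):
    # Stand-in reward: higher is better (negative squared distance to ground truth).
    return -sum((p - g) ** 2 for p, g in zip(pred, gt))

def reward_grad(theta, gt):
    # d(reward)/d(theta_i) = -2 * (pred_i - gt_i) * d(pred_i)/d(theta_i),
    # with d(pred_i)/d(theta_i) = 2 for the affine generator above.
    pred = generate(theta)
    return [-2.0 * (p - g) * 2.0 for p, g in zip(pred, gt)]

def finetune(theta, gt, lr=0.05, steps=200):
    for _ in range(steps):
        grad = reward_grad(theta, gt)
        theta = [t + lr * g for t, g in zip(theta, grad)]  # gradient ascent on reward
    return theta

gt = [0.5, -1.0, 2.0]      # ground-truth "pose" keypoints
theta = [0.0, 0.0, 0.0]    # initial generator parameters
theta = finetune(theta, gt)
print(pose_score(generate(theta), gt))  # approaches 0, the maximal reward
```

The actual method backpropagates PoseScore through the diffusion model's multi-view latents; this sketch only shows the shape of the optimization, i.e. maximizing a pose-consistency reward directly rather than minimizing a reconstruction loss on 3D assets.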

2 retrieved papers
DrPose15K Dataset

The authors construct DrPose15K, a novel training dataset containing 15K diverse human poses paired with corresponding single-view images. The dataset is built by leveraging the Motion-X human motion dataset and a pose-conditioned video generative model, and exhibits broader pose distribution coverage than existing 3D human datasets.
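The construction recipe described above (subsample poses from motion sequences, then pair each pose with a frame from a pose-conditioned generator) can be sketched as follows. Everything here is a hypothetical stand-in: `render_frame` replaces the pose-conditioned video model, the symbolic pose strings replace Motion-X sequences, and the stride is an arbitrary choice for illustration.

```python
# Toy sketch of pose/image pair construction (all names hypothetical).

def render_frame(pose):
    # Stand-in for a pose-conditioned video generative model; returns a placeholder.
    return {"image": f"frame_for_{pose}"}

def build_pairs(motion_sequences, stride=4):
    pairs = []
    for seq in motion_sequences:
        for pose in seq[::stride]:  # subsample frames to diversify the pose set
            pairs.append((pose, render_frame(pose)["image"]))
    return pairs

# Two toy "motion sequences" of symbolic poses.
seqs = [[f"a{i}" for i in range(16)], [f"b{i}" for i in range(8)]]
pairs = build_pairs(seqs)
print(len(pairs))  # 4 + 2 = 6 (pose, image) pairs
```

The point of the design is that motion-capture pose sequences are abundant and cheap, so pairing them with generated frames sidesteps the need for scanned 3D human assets entirely.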

10 retrieved papers
MixamoRP Benchmark

The authors introduce MixamoRP, a new evaluation benchmark specifically designed to assess single-view 3D human reconstruction performance on complex and dynamic human poses. The benchmark contains 60 human scans constructed by assigning distinct poses from Mixamo animations to RenderPeople 3D models.
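The benchmark-assembly idea (one scan per distinct pose, each pose applied to a body model) can be sketched as a simple pairing. The 60/10 split below and the round-robin assignment are assumptions for illustration only; the report states only that 60 scans result from assigning Mixamo poses to RenderPeople models.

```python
# Toy sketch of scan construction by pose-to-model assignment (counts are assumed).
from itertools import cycle

def assign_poses(poses, models):
    # One scan per pose; models are reused in round-robin order.
    return [(p, m) for p, m in zip(poses, cycle(models))]

poses = [f"mixamo_pose_{i}" for i in range(60)]    # hypothetical pose list
models = [f"renderpeople_{j}" for j in range(10)]  # hypothetical model list
scans = assign_poses(poses, models)
print(len(scans))  # 60 scans
```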

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

DrPose: Direct Reward Fine-Tuning Algorithm on Poses

The authors introduce DrPose, a direct reward fine-tuning algorithm that post-trains multi-view diffusion models using only human poses paired with single-view images. The method employs a differentiable reward function called PoseScore to maximize consistency between generated multi-view latent images and ground-truth human poses, enabling improved reconstruction of 3D humans in challenging postures without requiring expensive 3D human assets.

Contribution

DrPose15K Dataset

The authors construct DrPose15K, a novel training dataset containing 15K diverse human poses paired with corresponding single-view images. The dataset is built by leveraging the Motion-X human motion dataset and a pose-conditioned video generative model, and exhibits broader pose distribution coverage than existing 3D human datasets.

Contribution

MixamoRP Benchmark

The authors introduce MixamoRP, a new evaluation benchmark specifically designed to assess single-view 3D human reconstruction performance on complex and dynamic human poses. The benchmark contains 60 human scans constructed by assigning distinct poses from Mixamo animations to RenderPeople 3D models.
