Human3R: Everyone Everywhere All at Once

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Human Motion Estimation, SMPL, 4D Reconstruction
Abstract:

We present Human3R, a unified, feed-forward framework for online 4D human-scene reconstruction, in the world coordinate frame, from casually captured monocular videos. Unlike previous approaches that rely on multi-stage pipelines, iterative contact-aware refinement between humans and scenes, and heavy dependencies (e.g., human detection and cropping, tracking, segmentation, camera pose or metric depth estimation, SLAM for 3D scenes, and local human mesh recovery), Human3R jointly recovers global multi-person SMPL-X bodies (“everyone”), dense 3D scene geometry (“everywhere”), and camera trajectories in a single forward pass (“all at once”). Our method builds upon the 4D reconstruction foundation model CUT3R and leverages parameter-efficient visual prompt tuning to preserve CUT3R's rich spatiotemporal priors while enabling direct readout of SMPL-X parameters. To further improve the accuracy of global human pose and shape estimation, we introduce a bottom-up (one-shot) multi-person SMPL-X regressor trained on human-specific datasets. By removing heavy dependencies and iterative refinement, and by training only on the relatively small-scale synthetic dataset BEDLAM, Human3R achieves state-of-the-art performance with remarkable efficiency: it requires just one day of training on a single consumer GPU (NVIDIA RTX 4090) and operates in real time (15 FPS) with a low memory footprint (8 GB). Extensive experiments demonstrate that a single unified model delivers state-of-the-art or competitive performance across all relevant tasks, including global human motion estimation, local human mesh recovery, video depth estimation, and camera pose estimation. In summary, Human3R achieves one unified model, one-stage inference, one-shot multi-person estimation, and one day of training on one GPU, enabling real-time, online processing of streaming inputs.
We hope that Human3R will serve as a simple yet effective baseline that other researchers can easily extend to new applications, such as 6D object pose estimation (“everything”), thereby facilitating future research in this direction. Code and models will be made publicly available.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

Human3R proposes a unified feed-forward framework for online 4D human-scene reconstruction from monocular video, jointly recovering multi-person SMPL-X bodies, dense scene geometry, and camera trajectories in a single forward pass. The paper sits within the 'Feed-Forward Joint Reconstruction' leaf of the taxonomy, which contains only three papers total. This represents a relatively sparse research direction compared to more crowded areas like Template-Based Human Reconstruction (nine papers) or Gaussian Splatting-Based Methods (six papers), suggesting the feed-forward joint reconstruction paradigm remains an emerging approach rather than a saturated field.

The taxonomy structure reveals that Human3R's closest neighbors are optimization-based joint reconstruction methods (four papers) and human-scene interaction modeling approaches (five papers). While optimization-based methods like HSR and DressRecon emphasize iterative refinement for quality, Human3R diverges by prioritizing single-pass efficiency. The broader Joint Human-Scene Reconstruction branch (twelve papers total) sits between purely human-centric methods (sixteen papers across multiple leaves) and general dynamic scene reconstruction (thirteen papers), positioning Human3R at the intersection of human-specific modeling and holistic scene understanding. The taxonomy's scope notes clarify that feed-forward methods explicitly exclude iterative optimization, distinguishing Human3R's architectural philosophy from refinement-heavy alternatives.

Among the nineteen candidates examined across three contributions, no clearly refuting prior work was identified. The unified feed-forward framework contribution examined nine candidates with zero refutations, suggesting limited direct overlap in the constrained search scope. The parameter-efficient visual prompt tuning method examined only one candidate without refutation, indicating either genuine novelty or insufficient search coverage in this specific technical dimension. The real-time multi-person reconstruction contribution also examined nine candidates with no refutations. These statistics reflect a top-K semantic search rather than exhaustive coverage, meaning the absence of refutations indicates no obvious overlaps within the limited candidate pool examined, not definitive novelty across all prior work.

Based on the limited search scope of nineteen candidates, Human3R appears to occupy a relatively underexplored position within feed-forward joint reconstruction, though the small candidate pool prevents strong conclusions about absolute novelty. The sparse population of its taxonomy leaf and absence of refutations among examined papers suggest the specific combination of feed-forward architecture, joint human-scene modeling, and SMPL-X parameter readout may represent a less-traveled path. However, the analysis does not cover exhaustive literature in related areas like optimization-based methods or human-centric reconstruction, where overlapping ideas might exist outside the semantic search radius.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 19
- Refutable Papers: 0

Research Landscape Overview

Core task: online 4D human-scene reconstruction from monocular video. This field aims to recover dynamic 3D geometry and motion of both humans and their surrounding environments from single-camera footage, often in real-time or near-real-time settings.

The taxonomy reveals several complementary research directions. Human-Centric Reconstruction Methods focus primarily on capturing detailed human body shape and motion, often leveraging parametric models or learned priors. Joint Human-Scene Reconstruction tackles the coupled problem of simultaneously modeling people and their environments, addressing challenges like occlusion handling and consistent spatial alignment. Dynamic Scene Reconstruction emphasizes general non-rigid or articulated scene motion without necessarily privileging human subjects, while Generative and Diffusion-Based Reconstruction explores synthesis-driven approaches that can hallucinate plausible geometry from limited observations. Specialized Reconstruction Scenarios address domain-specific constraints such as endoscopic imaging, robotic manipulation, or aerial capture.

Within Joint Human-Scene Reconstruction, a particularly active line of work explores feed-forward architectures that predict 4D representations in a single pass, balancing speed and fidelity. Human3R[0] exemplifies this feed-forward joint reconstruction approach, aiming for efficient inference without iterative optimization. Nearby methods like ODHSR[15] and Synergistic Global-Space[22] similarly pursue real-time or online processing but may differ in their scene representation choices: some favor Gaussian splatting primitives while others use neural radiance fields or hybrid schemes. Compared to optimization-heavy pipelines such as HSR[4] or DressRecon[5], which refine geometry over many frames, Human3R[0] prioritizes immediacy and generalization across diverse scenes. This trade-off between reconstruction quality and computational efficiency remains a central open question, with recent works exploring how much geometric detail can be recovered from a single forward pass versus how much benefit iterative refinement truly provides in dynamic human-scene settings.

Claimed Contributions

Human3R unified feed-forward framework for online 4D human-scene reconstruction

Human3R is a unified model that jointly recovers global multi-person SMPL-X bodies, dense 3D scene geometry, and camera trajectories from monocular video in a single forward pass, eliminating multi-stage pipelines and heavy dependencies such as human detection, depth estimation, and SLAM preprocessing.
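The single-pass, streaming interface this contribution describes can be sketched as a minimal stand-in. All names, shapes, and the placeholder parameter sizes below are hypothetical illustrations, not the authors' actual API; the point is only that each incoming frame yields bodies, scene geometry, and camera pose together, with no detection, tracking, depth, or SLAM preprocessing stages:

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class FrameOutput:
    """Hypothetical per-frame outputs of one forward pass."""
    smplx_params: np.ndarray  # (num_people, P) pose/shape/translation, P hypothetical
    pointmap: np.ndarray      # (H, W, 3) dense scene geometry in the world frame
    cam_pose: np.ndarray      # (4, 4) camera-to-world transform


def reconstruct_stream(frames):
    """Stand-in for an online model: one FrameOutput per incoming frame,
    produced in a single pass rather than by a multi-stage pipeline."""
    for frame in frames:
        h, w, _ = frame.shape
        yield FrameOutput(
            smplx_params=np.zeros((2, 100)),  # placeholder: two people
            pointmap=np.zeros((h, w, 3)),
            cam_pose=np.eye(4),
        )


# Three dummy 4x4 RGB frames stand in for a monocular video stream.
outputs = list(reconstruct_stream(np.zeros((3, 4, 4, 3))))
print(len(outputs), outputs[0].cam_pose.shape)
```

Contrast this with the multi-stage pipelines the paper criticizes, where each of those three outputs would come from a separate module with its own dependencies.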

9 retrieved papers

Parameter-efficient visual prompt tuning method for human reconstruction

The authors introduce a parameter-efficient finetuning approach that uses visual prompt tuning on CUT3R, detecting human head tokens and transforming them into human prompts via learnable projection layers, while keeping the CUT3R backbone frozen to preserve its spatiotemporal priors.
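The mechanism described here can be illustrated with a minimal NumPy sketch, assuming hypothetical token and dimension sizes (the real CUT3R backbone, its token layout, and the projection design are not reproduced). The key property shown is that the backbone features stay frozen while only a small projection maps detected head tokens into prompts:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8          # token dimension (hypothetical; real backbone dims are larger)
N_TOKENS = 4   # frozen backbone tokens for one frame
N_HUMANS = 2   # human head tokens detected among them

# Frozen backbone tokens (stand-in for CUT3R features; no gradients here).
image_tokens = rng.standard_normal((N_TOKENS, D))

# Hypothetical detected "human head" tokens for the people in the frame.
head_tokens = rng.standard_normal((N_HUMANS, D))

# The only trainable parameters in this sketch: a learnable linear
# projection that turns head tokens into human prompts.
W_prompt = rng.standard_normal((D, D)) * 0.02
b_prompt = np.zeros(D)

human_prompts = head_tokens @ W_prompt + b_prompt

# Prompts join the frozen token sequence; during finetuning only
# W_prompt and b_prompt would receive gradient updates.
augmented = np.concatenate([human_prompts, image_tokens], axis=0)
print(augmented.shape)  # (N_HUMANS + N_TOKENS, D) -> (6, 8)
```

Keeping the backbone frozen is what lets the method retain CUT3R's spatiotemporal priors while adding the SMPL-X readout capability with few new parameters.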

1 retrieved paper

Real-time one-shot multi-person reconstruction with minimal training

Human3R achieves efficient training and inference by requiring only one day of training on a single GPU using the BEDLAM dataset, while enabling real-time reconstruction at 15 FPS with low memory usage and supporting bottom-up multi-person reconstruction in a single forward pass.
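As a quick sanity check on what the claimed throughput implies (simple arithmetic, not a figure from the paper), real-time operation at 15 FPS bounds the entire forward pass, including multi-person SMPL-X readout and scene geometry, to roughly 67 ms per frame:

```python
FPS = 15  # reported real-time throughput

# Per-frame latency budget implied by sustained 15 FPS streaming.
budget_ms = 1000.0 / FPS
print(f"{budget_ms:.1f} ms per frame")  # 66.7 ms per frame
```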

9 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
