Human3R: Everyone Everywhere All at Once

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Human Motion Estimation, SMPL, 4D Reconstruction
Abstract:

We present Human3R, a unified, feed-forward framework for online 4D human-scene reconstruction, in the world coordinate frame, from casually captured monocular videos. Unlike previous approaches that rely on multi-stage pipelines, iterative contact-aware refinement between humans and scenes, and heavy dependencies (e.g., human detection and cropping, tracking, segmentation, camera pose or metric depth estimation, SLAM for 3D scenes, and local human mesh recovery), Human3R jointly recovers global multi-person SMPL-X bodies (“everyone”), dense 3D scene geometry (“everywhere”), and camera trajectories in a single forward pass (“all at once”). Our method builds upon the 4D reconstruction foundation model CUT3R and leverages parameter-efficient visual prompt tuning to preserve CUT3R's rich spatiotemporal priors while enabling direct readout of SMPL-X parameters. To further improve the accuracy of global human pose and shape estimation, we introduce a bottom-up (one-shot) multi-person SMPL-X regressor trained on human-specific datasets. By removing heavy dependencies and iterative refinement, and by training only on the relatively small-scale synthetic dataset BEDLAM, Human3R achieves state-of-the-art performance with remarkable efficiency: it requires just one day of training on a single consumer GPU (NVIDIA RTX 4090) and operates in real time (15 FPS) with a low memory footprint (8 GB). Extensive experiments demonstrate that a single unified model delivers state-of-the-art or competitive performance across all relevant tasks, including global human motion estimation, local human mesh recovery, video depth estimation, and camera pose estimation. In summary, Human3R achieves one unified model, one-stage inference, one-shot multi-person estimation, and one day of training on one GPU, enabling real-time, online processing of streaming inputs.
We hope that Human3R will serve as a simple yet effective baseline that other researchers can easily extend to new applications, such as 6D object pose estimation (“everything”), thereby facilitating future research in this direction. Code and models will be made publicly available.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

Human3R proposes a unified feed-forward framework for online 4D human-scene reconstruction from monocular video, jointly recovering multi-person SMPL-X bodies, dense scene geometry, and camera trajectories in a single forward pass. The paper sits within the 'Feed-Forward Joint Reconstruction' leaf of the taxonomy, which contains only three papers total. This represents a relatively sparse research direction compared to more crowded areas like Template-Based Human Reconstruction (nine papers) or Gaussian Splatting-Based Methods (six papers), suggesting the feed-forward joint reconstruction paradigm remains an emerging approach rather than a saturated field.

The taxonomy structure reveals that Human3R's closest neighbors are optimization-based joint reconstruction methods (four papers) and human-scene interaction modeling approaches (five papers). While optimization-based methods like HSR and DressRecon emphasize iterative refinement for quality, Human3R diverges by prioritizing single-pass efficiency. The broader Joint Human-Scene Reconstruction branch (twelve papers total) sits between purely human-centric methods (sixteen papers across multiple leaves) and general dynamic scene reconstruction (thirteen papers), positioning Human3R at the intersection of human-specific modeling and holistic scene understanding. The taxonomy's scope notes clarify that feed-forward methods explicitly exclude iterative optimization, distinguishing Human3R's architectural philosophy from refinement-heavy alternatives.

Among the nineteen candidates examined across three contributions, no clearly refuting prior work was identified. The unified feed-forward framework contribution examined nine candidates with zero refutations, suggesting limited direct overlap in the constrained search scope. The parameter-efficient visual prompt tuning method examined only one candidate without refutation, indicating either genuine novelty or insufficient search coverage in this specific technical dimension. The real-time multi-person reconstruction contribution also examined nine candidates with no refutations. These statistics reflect a top-K semantic search rather than exhaustive coverage, meaning the absence of refutations indicates no obvious overlaps within the limited candidate pool examined, not definitive novelty across all prior work.

Based on the limited search scope of nineteen candidates, Human3R appears to occupy a relatively underexplored position within feed-forward joint reconstruction, though the small candidate pool prevents strong conclusions about absolute novelty. The sparse population of its taxonomy leaf and absence of refutations among examined papers suggest the specific combination of feed-forward architecture, joint human-scene modeling, and SMPL-X parameter readout may represent a less-traveled path. However, the analysis does not cover exhaustive literature in related areas like optimization-based methods or human-centric reconstruction, where overlapping ideas might exist outside the semantic search radius.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 19
- Refutable Papers: 0

Research Landscape Overview

Core task: online 4D human-scene reconstruction from monocular video. This field aims to recover dynamic 3D geometry and motion of both humans and their surrounding environments from single-camera footage, often in real-time or near-real-time settings.

The taxonomy reveals several complementary research directions. Human-Centric Reconstruction Methods focus primarily on capturing detailed human body shape and motion, often leveraging parametric models or learned priors. Joint Human-Scene Reconstruction tackles the coupled problem of simultaneously modeling people and their environments, addressing challenges like occlusion handling and consistent spatial alignment. Dynamic Scene Reconstruction emphasizes general non-rigid or articulated scene motion without necessarily privileging human subjects, while Generative and Diffusion-Based Reconstruction explores synthesis-driven approaches that can hallucinate plausible geometry from limited observations. Specialized Reconstruction Scenarios address domain-specific constraints such as endoscopic imaging, robotic manipulation, or aerial capture.

Within Joint Human-Scene Reconstruction, a particularly active line of work explores feed-forward architectures that predict 4D representations in a single pass, balancing speed and fidelity. Human3R[0] exemplifies this feed-forward joint reconstruction approach, aiming for efficient inference without iterative optimization. Nearby methods like ODHSR[15] and Synergistic Global-Space[22] similarly pursue real-time or online processing but may differ in their scene representation choices: some favor Gaussian splatting primitives while others use neural radiance fields or hybrid schemes. Compared to optimization-heavy pipelines such as HSR[4] or DressRecon[5], which refine geometry over many frames, Human3R[0] prioritizes immediacy and generalization across diverse scenes. This trade-off between reconstruction quality and computational efficiency remains a central open question, with recent works exploring how much geometric detail can be recovered from a single forward pass versus how much benefit iterative refinement truly provides in dynamic human-scene settings.

Claimed Contributions

Human3R unified feed-forward framework for online 4D human-scene reconstruction

Human3R is a unified model that jointly recovers global multi-person SMPL-X bodies, dense 3D scene geometry, and camera trajectories from monocular video in a single forward pass, eliminating multi-stage pipelines and heavy dependencies such as human detection, depth estimation, and SLAM preprocessing.
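The single-pass, streaming interface this contribution describes can be sketched as a minimal stand-in. All names, shapes, and the placeholder parameter sizes below are hypothetical illustrations, not the authors' actual API; the point is only that each incoming frame yields bodies, scene geometry, and camera pose together, with no detection, tracking, depth, or SLAM preprocessing stages:

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class FrameOutput:
    """Hypothetical per-frame outputs of one forward pass."""
    smplx_params: np.ndarray  # (num_people, P) pose/shape/translation, P hypothetical
    pointmap: np.ndarray      # (H, W, 3) dense scene geometry in the world frame
    cam_pose: np.ndarray      # (4, 4) camera-to-world transform


def reconstruct_stream(frames):
    """Stand-in for an online model: one FrameOutput per incoming frame,
    produced in a single pass rather than by a multi-stage pipeline."""
    for frame in frames:
        h, w, _ = frame.shape
        yield FrameOutput(
            smplx_params=np.zeros((2, 100)),  # placeholder: two people
            pointmap=np.zeros((h, w, 3)),
            cam_pose=np.eye(4),
        )


# Three dummy 4x4 RGB frames stand in for a monocular video stream.
outputs = list(reconstruct_stream(np.zeros((3, 4, 4, 3))))
print(len(outputs), outputs[0].cam_pose.shape)
```

Contrast this with the multi-stage pipelines the paper criticizes, where each of those three outputs would come from a separate module with its own dependencies.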

9 retrieved papers

Parameter-efficient visual prompt tuning method for human reconstruction

The authors introduce a parameter-efficient finetuning approach that uses visual prompt tuning on CUT3R, detecting human head tokens and transforming them into human prompts via learnable projection layers, while keeping the CUT3R backbone frozen to preserve its spatiotemporal priors.
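The mechanism described here can be illustrated with a minimal NumPy sketch, assuming hypothetical token and dimension sizes (the real CUT3R backbone, its token layout, and the projection design are not reproduced). The key property shown is that the backbone features stay frozen while only a small projection maps detected head tokens into prompts:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8          # token dimension (hypothetical; real backbone dims are larger)
N_TOKENS = 4   # frozen backbone tokens for one frame
N_HUMANS = 2   # human head tokens detected among them

# Frozen backbone tokens (stand-in for CUT3R features; no gradients here).
image_tokens = rng.standard_normal((N_TOKENS, D))

# Hypothetical detected "human head" tokens for the people in the frame.
head_tokens = rng.standard_normal((N_HUMANS, D))

# The only trainable parameters in this sketch: a learnable linear
# projection that turns head tokens into human prompts.
W_prompt = rng.standard_normal((D, D)) * 0.02
b_prompt = np.zeros(D)

human_prompts = head_tokens @ W_prompt + b_prompt

# Prompts join the frozen token sequence; during finetuning only
# W_prompt and b_prompt would receive gradient updates.
augmented = np.concatenate([human_prompts, image_tokens], axis=0)
print(augmented.shape)  # (N_HUMANS + N_TOKENS, D) -> (6, 8)
```

Keeping the backbone frozen is what lets the method retain CUT3R's spatiotemporal priors while adding the SMPL-X readout capability with few new parameters.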

1 retrieved paper

Real-time one-shot multi-person reconstruction with minimal training

Human3R achieves efficient training and inference by requiring only one day of training on a single GPU using the BEDLAM dataset, while enabling real-time reconstruction at 15 FPS with low memory usage and supporting bottom-up multi-person reconstruction in a single forward pass.
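As a quick sanity check on what the claimed throughput implies (simple arithmetic, not a figure from the paper), real-time operation at 15 FPS bounds the entire forward pass, including multi-person SMPL-X readout and scene geometry, to roughly 67 ms per frame:

```python
FPS = 15  # reported real-time throughput

# Per-frame latency budget implied by sustained 15 FPS streaming.
budget_ms = 1000.0 / FPS
print(f"{budget_ms:.1f} ms per frame")  # 66.7 ms per frame
```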

9 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
