Sapiens2
Overview
Overall Novelty Assessment
Sapiens2 contributes a family of high-resolution transformers (0.4–5B parameters) for multi-task human-centric vision, combining masked reconstruction with self-distilled contrastive pretraining and scaling to 4K resolution via windowed attention. The paper resides in the 'Multi-Task Human-Centric Foundation Models' leaf, which contains only three papers including Sapiens2 itself. This is a relatively sparse research direction within the broader taxonomy of 50 papers across 18 leaf nodes, suggesting the work targets an emerging but not yet crowded subfield focused on unified architectures for diverse human analysis tasks.
The taxonomy reveals neighboring leaves addressing related but distinct challenges: 'Human Pose Estimation Architectures' explores specialized keypoint detection methods (e.g., token clustering, high-resolution parallel branches), while 'Body Surface Reconstruction and Mesh Recovery' focuses on 3D geometry. Sapiens2 diverges by pursuing a unified multi-task framework rather than task-specific architectures. The 'High-Resolution Image Generation and Synthesis' branch addresses generative modeling, whereas Sapiens2 emphasizes discriminative dense prediction. The scope_note for its leaf explicitly excludes single-task specialists, positioning Sapiens2 as a generalist foundation model rather than a narrow solution.
Among the 30 candidates examined (10 per contribution), the unified pretraining objective (Contribution A) drew 2 potentially refuting matches, indicating some prior work on combining reconstruction and contrastive learning for human-centric tasks. The Humans-750M dataset (Contribution B) drew none, suggesting novelty in curation scale or composition. The 4K hierarchical architecture (Contribution C) drew 3, reflecting existing exploration of windowed attention and multi-resolution strategies. Because the search covers only the top-30 semantic matches, these counts are not exhaustive of all relevant prior work.
Given the sparse taxonomy leaf and the scale of literature examined, Sapiens2 appears to advance an emerging research direction where unified multi-task human-centric models remain relatively underexplored. The contribution-level statistics suggest incremental architectural and pretraining innovations rather than entirely unprecedented techniques, though the dataset and integration choices may offer practical value. This assessment is constrained by the top-30 candidate scope and does not account for concurrent or unpublished work in this rapidly evolving area.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce Sapiens2, a family of vision transformers ranging from 0.4B to 5B parameters that support native 1K and hierarchical 4K resolution. The models combine masked image reconstruction with self-distilled contrastive objectives to learn features capturing both low-level details for dense prediction and high-level semantics for zero-shot or few-label settings.
The authors curate and introduce a large-scale dataset of 750 million high-quality human images from a web-scale corpus through multi-stage filtering. The dataset spans diverse ages, ethnicities, backgrounds, and real-world conditions with no task-specific labels or human-specific priors injected during pretraining.
The authors introduce a hierarchical architecture design for 4K resolution processing that uses windowed self-attention in early layers to capture local structure, followed by spatial downsampling and global attention layers. This design enables high-resolution dense prediction while maintaining computational tractability and compatibility with masked pretraining.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[3] Sapiens: Foundation for Human Vision Models
[7] UniHCP: A unified model for human-centric perceptions
Contribution Analysis
Detailed comparisons for each claimed contribution
Sapiens2 model family with unified pretraining objective
The authors introduce Sapiens2, a family of vision transformers ranging from 0.4B to 5B parameters that support native 1K and hierarchical 4K resolution. The models combine masked image reconstruction with self-distilled contrastive objectives to learn features capturing both low-level details for dense prediction and high-level semantics for zero-shot or few-label settings.
[51] Contrastive masked autoencoders are stronger vision learners
[56] MimCo: Masked image modeling pre-training with contrastive teacher
[52] X-Former: Unifying contrastive and reconstruction learning for MLLMs
[53] Contrastive feature masking open-vocabulary vision transformer
[54] Cross-modal contrastive masked autoencoder for compressed video pre-training
[55] A theoretical analysis of self-supervised learning for vision transformers
[57] Masked contrastive reconstruction for cross-modal medical image-report retrieval
[58] MMCLIP: Cross-modal attention masked modelling for medical language-image pre-training
[59] Unveiling the power of audio-visual early fusion transformers with dense interactions through masked modeling
[60] Bringing masked autoencoders explicit contrastive properties for point cloud self-supervised learning
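The unified objective described for this contribution pairs an MAE-style masked reconstruction loss with a contrastive term against a self-distilled (EMA) teacher. Below is a minimal NumPy sketch of that general recipe, not the paper's implementation: the loss weight `lam`, the InfoNCE form, the temperature, and all tensor shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_reconstruction_loss(pred, target, mask):
    """MAE-style MSE, averaged only over masked patches."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)   # (B, P)
    return float((per_patch * mask).sum() / mask.sum())

def info_nce(student, teacher, temperature=0.1):
    """InfoNCE: the i-th student embedding should match the i-th teacher embedding."""
    s = student / np.linalg.norm(student, axis=-1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=-1, keepdims=True)
    logits = (s @ t.T) / temperature                   # (B, B) cosine similarities
    logits -= logits.max(axis=-1, keepdims=True)       # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-np.diag(log_prob).mean())            # -log p(correct pairing)

B, P, D = 4, 16, 8                                     # batch, patches, embed dim
target = rng.normal(size=(B, P, D))                    # patch reconstruction targets
pred = target + 0.1 * rng.normal(size=(B, P, D))       # decoder output (noisy stand-in)
mask = rng.random((B, P)) < 0.75                       # ~75% of patches masked

student_emb = rng.normal(size=(B, D))                  # online-encoder global features
teacher_emb = student_emb + 0.05 * rng.normal(size=(B, D))  # EMA-teacher features

lam = 0.5  # hypothetical loss weight; not a value from the paper
loss = masked_reconstruction_loss(pred, target, mask) + lam * info_nce(student_emb, teacher_emb)
print(f"combined loss: {loss:.4f}")
```

The reconstruction term drives the low-level detail needed for dense prediction, while the contrastive term against the slowly updated teacher supplies the high-level semantics useful in zero-shot or few-label settings.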
Humans-750M pretraining dataset
The authors curate and introduce a large-scale dataset of 750 million high-quality human images from a web-scale corpus through multi-stage filtering. The dataset spans diverse ages, ethnicities, backgrounds, and real-world conditions with no task-specific labels or human-specific priors injected during pretraining.
[3] Sapiens: Foundation for Human Vision Models
[71] LAION-5B: An open large-scale dataset for training next generation image-text models
[72] Wukong: A 100 million large-scale Chinese cross-modal pre-training benchmark
[73] Large-scale reinforcement learning for diffusion models
[74] Scaling up vision-language pre-training for image captioning
[75] DisCo: Disentangled control for realistic human dance generation
[76] Multimodal C4: An open, billion-scale corpus of images interleaved with text
[77] Image representations learned with unsupervised pre-training contain human-like biases
[78] The neglected tails in vision-language models
[79] PersonViT: Large-scale self-supervised vision transformer for person re-identification
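The multi-stage filtering described for Humans-750M can be pictured as a funnel over a web-scale corpus. The sketch below is a hypothetical illustration only, assuming each image record carries precomputed detector and quality scores; the stage names, thresholds, and the `phash` deduplication key are all invented for the example, not taken from the paper.

```python
def filter_corpus(records,
                  min_side=512,         # hypothetical resolution floor
                  min_person_conf=0.9,  # hypothetical person-detector threshold
                  min_quality=0.5):     # hypothetical image-quality floor
    """Keep only records that survive every filtering stage, in order."""
    seen_hashes = set()
    kept = []
    for r in records:
        if min(r["width"], r["height"]) < min_side:
            continue                    # stage 1: resolution
        if r["person_conf"] < min_person_conf:
            continue                    # stage 2: image contains a person
        if r["quality"] < min_quality:
            continue                    # stage 3: image quality
        if r["phash"] in seen_hashes:
            continue                    # stage 4: near-duplicate removal
        seen_hashes.add(r["phash"])
        kept.append(r)
    return kept

corpus = [
    {"width": 1024, "height": 768,  "person_conf": 0.97, "quality": 0.8, "phash": "a1"},
    {"width": 320,  "height": 240,  "person_conf": 0.99, "quality": 0.9, "phash": "b2"},  # too small
    {"width": 2048, "height": 1536, "person_conf": 0.40, "quality": 0.9, "phash": "c3"},  # no person
    {"width": 1024, "height": 768,  "person_conf": 0.97, "quality": 0.8, "phash": "a1"},  # duplicate
]
print(len(filter_corpus(corpus)))  # → 1
```

Ordering cheap filters (resolution) before expensive ones (detection, quality scoring) is the usual design choice at web scale, since each stage only sees the survivors of the previous one.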
Hierarchical 4K architecture with windowed attention
The authors introduce a hierarchical architecture design for 4K resolution processing that uses windowed self-attention in early layers to capture local structure, followed by spatial downsampling and global attention layers. This design enables high-resolution dense prediction while maintaining computational tractability and compatibility with masked pretraining.
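The hierarchical design described above (windowed attention in early layers, then spatial downsampling, then global attention) can be sketched on a toy patch grid. This NumPy toy assumes unweighted attention (no learned Q/K/V projections) and 2×2 average pooling; the paper's actual layer counts, window sizes, and pooling operator are not specified here.

```python
import numpy as np

def attention(x):
    """Plain softmax self-attention over the token axis (projections omitted)."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])   # (B, N, N)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ x

def windowed_attention(tokens, grid, window):
    """Attend only within non-overlapping window x window tiles of the patch grid."""
    B, N, D = tokens.shape
    g = tokens.reshape(B, grid, grid, D)
    out = np.empty_like(g)
    for i in range(0, grid, window):
        for j in range(0, grid, window):
            tile = g[:, i:i + window, j:j + window].reshape(B, window * window, D)
            out[:, i:i + window, j:j + window] = attention(tile).reshape(B, window, window, D)
    return out.reshape(B, N, D)

def downsample2x(tokens, grid):
    """2x2 average pooling on the patch grid -> 4x fewer tokens."""
    B, N, D = tokens.shape
    g = tokens.reshape(B, grid // 2, 2, grid // 2, 2, D)
    return g.mean(axis=(2, 4)).reshape(B, (grid // 2) ** 2, D)

rng = np.random.default_rng(0)
grid = 8                                    # toy stand-in for a 4K-resolution patch grid
x = rng.normal(size=(1, grid * grid, 3))
x = windowed_attention(x, grid, window=4)   # early layers: local structure, linear in windows
x = downsample2x(x, grid)                   # spatial reduction before going global
x = attention(x)                            # late layers: global attention over fewer tokens
print(x.shape)  # (1, 16, 3)
```

The point of the ordering is cost: global attention is quadratic in token count, which is prohibitive over a full 4K patch grid, so the global layers run only after the windowed stages and the downsampling have reduced the sequence length.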