Sapiens2

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: computer vision, human pose, segmentation, transformers, foundation models
Abstract:

We present Sapiens2, a model family of high-resolution transformers for human-centric vision focused on generalization, versatility, and high-fidelity outputs. Our model sizes range from 0.4 to 5 billion parameters, with native 1K resolution and hierarchical variants that support 4K. Sapiens2 substantially improves over its predecessor in both pretraining and post-training. First, to learn features that capture low-level details (for dense prediction) and high-level semantics (for zero-shot or few-label settings), we combine masked image reconstruction with self-distilled contrastive objectives. Our evaluations show that this unified pretraining objective is better suited for a wider range of downstream tasks. Second, along the data axis, we pretrain on a curated dataset of 1 billion high-quality human images and improve the quality and quantity of task annotations. Third, architecturally, we incorporate advances from frontier models that enable longer training schedules with improved stability. Our 4K models adopt windowed attention to reason over longer spatial context and are pretrained with 2K output resolution. Sapiens2 sets a new state-of-the-art and improves over the first generation on pose (+4 mAP), body-part segmentation (+22.3 mIoU), normal estimation (+29.2 rel-angular error) and extends to new tasks such as pointmap and albedo estimation.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

Sapiens2 contributes a family of high-resolution transformers (0.4–5B parameters) for multi-task human-centric vision, combining masked reconstruction with self-distilled contrastive pretraining and scaling to 4K resolution via windowed attention. The paper resides in the 'Multi-Task Human-Centric Foundation Models' leaf, which contains only three papers including Sapiens2 itself. This is a relatively sparse research direction within the broader taxonomy of 50 papers across 18 leaf nodes, suggesting the work targets an emerging but not yet crowded subfield focused on unified architectures for diverse human analysis tasks.

The taxonomy reveals neighboring leaves addressing related but distinct challenges: 'Human Pose Estimation Architectures' explores specialized keypoint detection methods (e.g., token clustering, high-resolution parallel branches), while 'Body Surface Reconstruction and Mesh Recovery' focuses on 3D geometry. Sapiens2 diverges by pursuing a unified multi-task framework rather than task-specific architectures. The 'High-Resolution Image Generation and Synthesis' branch addresses generative modeling, whereas Sapiens2 emphasizes discriminative dense prediction. The scope_note for its leaf explicitly excludes single-task specialists, positioning Sapiens2 as a generalist foundation model rather than a narrow solution.

Among the 30 candidates examined (10 per contribution), the unified pretraining objective (Contribution A) was matched by 2 refutable candidates, indicating some prior work on combining reconstruction and contrastive learning for human-centric tasks. The Humans-750M dataset (Contribution B) drew no refutations, suggesting novelty in the scale or composition of its curation. The 4K hierarchical architecture (Contribution C) drew 3 refutable candidates, reflecting existing exploration of windowed attention and multi-resolution strategies. These statistics cover only the top-30 semantic matches, not exhaustive coverage of all relevant prior work.

Given the sparse taxonomy leaf and the scale of literature examined, Sapiens2 appears to advance an emerging research direction where unified multi-task human-centric models remain relatively underexplored. The contribution-level statistics suggest incremental architectural and pretraining innovations rather than entirely unprecedented techniques, though the dataset and integration choices may offer practical value. This assessment is constrained by the top-30 candidate scope and does not account for concurrent or unpublished work in this rapidly evolving area.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 5

Research Landscape Overview

Core task: human-centric vision with high-resolution transformers. This field centers on leveraging transformer architectures to process high-resolution imagery for tasks involving human subjects, ranging from pose estimation and body parsing to generation and synthesis. The taxonomy reveals five main branches:

- Dense Prediction Tasks for Human Body Analysis focuses on pixel-level understanding of human anatomy and attributes, often employing multi-task frameworks like ViTPose[6] and unified models such as Unihcp[7].
- High-Resolution Image Generation and Synthesis addresses the creation and manipulation of detailed human imagery, including super-resolution methods like Hybrid Vision SuperResolution[5] and style-based approaches such as StyleSwin[36].
- Scene Understanding and Environmental Analysis extends beyond isolated humans to contextual reasoning in complex environments.
- Efficient Transformer Architectures and Training explores computational optimizations like token clustering (Token Clustering Transformer[2]) and neural architecture search (Hr-nas[12]).
- Domain-Specific Applications targets specialized settings from medical imaging to metaverse avatars (CycleGAN PPO Metaverse[8]).

A particularly active line of work involves multi-task human-centric foundation models that unify diverse prediction tasks under a single architecture, balancing generalization with task-specific performance. Sapiens2[0] exemplifies this direction by extending the capabilities of its predecessor, Sapiens Foundation[3], aiming to handle multiple human analysis tasks at high resolution within a shared framework. Compared to Sapiens Foundation[3], which established foundational multi-task learning for human understanding, Sapiens2[0] pushes toward broader task coverage and improved scalability. Meanwhile, Unihcp[7] represents a parallel effort in unified human-centric prediction, emphasizing modular design for flexible task composition.
These works collectively address the trade-off between model complexity and the ability to capture fine-grained human details, a challenge that remains central as the field moves toward more comprehensive and efficient foundation models for human-centric vision.

Claimed Contributions

Sapiens2 model family with unified pretraining objective

The authors introduce Sapiens2, a family of vision transformers ranging from 0.4B to 5B parameters that support native 1K and hierarchical 4K resolution. The models combine masked image reconstruction with self-distilled contrastive objectives to learn features capturing both low-level details for dense prediction and high-level semantics for zero-shot or few-label settings.

10 retrieved papers · Can Refute
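The combined objective described above pairs a pixel-level reconstruction loss over masked patches with a distribution-level self-distillation loss. The sketch below is a minimal, hypothetical illustration of how such a sum could be formed, assuming a DINO-style teacher/student setup and an illustrative weighting factor `lam`; the paper's actual formulation, temperatures, and weights are not specified here.

```python
import math

def masked_reconstruction_loss(pred, target, mask):
    """Mean squared error computed only over masked patch positions."""
    losses = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    return sum(losses) / max(len(losses), 1)

def self_distillation_loss(student_logits, teacher_logits, temp_s=0.1, temp_t=0.04):
    """Cross-entropy between teacher and student softmax distributions
    (DINO-style; sharper teacher temperature)."""
    def softmax(logits, temp):
        exps = [math.exp(v / temp) for v in logits]
        total = sum(exps)
        return [v / total for v in exps]
    p_teacher = softmax(teacher_logits, temp_t)
    p_student = softmax(student_logits, temp_s)
    return -sum(pt * math.log(ps + 1e-12) for pt, ps in zip(p_teacher, p_student))

def unified_loss(pred, target, mask, student_logits, teacher_logits, lam=1.0):
    """Hypothetical weighted sum of the two pretraining objectives."""
    return (masked_reconstruction_loss(pred, target, mask)
            + lam * self_distillation_loss(student_logits, teacher_logits))
```

The reconstruction term pushes features toward low-level fidelity (useful for dense prediction), while the distillation term shapes semantic structure (useful zero-shot), which matches the motivation stated in the contribution.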
Humans-750M pretraining dataset

The authors curate and introduce a large-scale dataset of 750 million high-quality human images from a web-scale corpus through multi-stage filtering. The dataset spans diverse ages, ethnicities, backgrounds, and real-world conditions with no task-specific labels or human-specific priors injected during pretraining.

10 retrieved papers
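Multi-stage filtering of a web-scale corpus can be pictured as an ordered chain of predicates, where an image survives only if it passes every stage. The stages and thresholds below (minimum resolution, a quality score, a person-detection score) are hypothetical stand-ins for whatever criteria the authors actually used.

```python
def multistage_filter(images, stages):
    """Apply ordered predicate stages; an image survives only if it passes all."""
    kept = images
    for _name, predicate in stages:
        kept = [img for img in kept if predicate(img)]
    return kept

# Hypothetical metadata records standing in for decoded images.
corpus = [
    {"id": 1, "width": 2048, "height": 2048, "quality": 0.9, "person_score": 0.95},
    {"id": 2, "width": 256,  "height": 256,  "quality": 0.8, "person_score": 0.90},
    {"id": 3, "width": 1536, "height": 2048, "quality": 0.3, "person_score": 0.85},
    {"id": 4, "width": 1024, "height": 1024, "quality": 0.7, "person_score": 0.20},
]

stages = [
    ("min_resolution",  lambda im: min(im["width"], im["height"]) >= 1024),
    ("quality",         lambda im: im["quality"] >= 0.5),
    ("contains_person", lambda im: im["person_score"] >= 0.5),
]

survivors = multistage_filter(corpus, stages)  # only image 1 passes all stages
```

Ordering cheap predicates (resolution) before expensive ones (detector scores) is the usual practical choice at web scale, since each stage shrinks the set the next stage must score.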
Hierarchical 4K architecture with windowed attention

The authors introduce a hierarchical architecture design for 4K resolution processing that uses windowed self-attention in early layers to capture local structure, followed by spatial downsampling and global attention layers. This design enables high-resolution dense prediction while maintaining computational tractability and compatibility with masked pretraining.

10 retrieved papers · Can Refute
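The tractability claim can be made concrete with a back-of-the-envelope interaction count: global self-attention scales quadratically in token count, while windowed attention scales linearly in the number of windows. The window size (256 tokens) and 4x-per-axis downsampling below are illustrative values, not the paper's configuration.

```python
def attention_cost(num_tokens, window=None):
    """Pairwise-interaction count for self-attention.
    Global: N^2. Windowed: each window of `window` tokens attends only within itself."""
    if window is None:
        return num_tokens ** 2
    num_windows = num_tokens // window
    return num_windows * window ** 2

# A hypothetical 4K input tokenized into 16x16-pixel patches: 256 x 256 = 65,536 tokens.
n_tokens = 256 * 256
early = attention_cost(n_tokens, window=256)   # windowed attention in early layers
n_down = n_tokens // 16                         # 4x spatial downsampling in each axis
late = attention_cost(n_down)                   # global attention on the coarser grid
full_global = attention_cost(n_tokens)          # naive global attention at full resolution
```

Under these assumed values, the windowed-then-downsampled scheme needs roughly 128x fewer pairwise interactions than naive global attention at full resolution, which is the kind of saving that makes 4K dense prediction feasible.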

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Sapiens2 model family with unified pretraining objective

Contribution: Humans-750M pretraining dataset

Contribution: Hierarchical 4K architecture with windowed attention

Sapiens2 | Novelty Validation