SAM 3: Segment Anything with Concepts

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: foundation models, open vocabulary segmentation, semantic instance segmentation, object tracking
Abstract:

We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., “yellow school bus”), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of a vision backbone shared between an image-level detector and a memory-based video tracker. Recognition and localization are decoupled with a presence head, which significantly boosts detection accuracy. SAM 3 delivers a 2x gain over existing systems in both image and video PCS, and improves previous SAM capabilities in interactive visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Promptable Concept Segmentation (PCS), a task that unifies detection, segmentation, and tracking across images and videos using concept prompts (noun phrases, image exemplars, or both). It resides in the 'Unified Promptable Segmentation Frameworks' leaf alongside three sibling papers: Text Image Prompts, ControlNet-based methods, and UniVS Prompts Queries. This leaf represents a moderately populated research direction within the broader Foundation Model Architectures branch, which anchors the field's core segmentation engines. The taxonomy reveals that unified frameworks constitute one of sixteen leaf nodes across fifty papers, indicating a well-established but not overcrowded area.

The taxonomy tree shows that neighboring leaves include Video Object Segmentation with Memory Mechanisms (four papers on temporal tracking) and Attention and Diffusion-Based Segmentation Mechanisms (three papers on cross-modal alignment). The paper's integration of memory-based video tracking connects it to the video segmentation branch, while its decoupled recognition-localization architecture relates to attention mechanism research. The scope notes clarify that unified frameworks handle multiple prompt types within single architectures, distinguishing them from single-modality methods or domain-specific adaptations found in medical and remote sensing branches. This positioning suggests the work bridges foundational architecture design with video-specific temporal reasoning.

Among the thirty candidates examined, the architecture contribution (decoupled recognition and localization) shows overlap with three prior works, while the PCS task formulation and the data engine appear more distinctive, with no refutable candidates among the ten papers examined for each. The limited search scope means these statistics reflect top-K semantic matches rather than exhaustive coverage. The presence head for decoupling recognition and localization appears to have precedent in the examined literature, whereas the concept-level prompt formulation and the human-AI data engine methodology show less direct overlap within the candidate set. The SA-Co benchmark contribution likewise lacks clear refutation among the examined candidates.

Based on the limited thirty-candidate search, the work appears to make incremental architectural contributions while introducing a novel task formulation and benchmark. The taxonomy context reveals a moderately active research area with established sibling frameworks, suggesting the field has matured beyond initial exploration but remains open to refinement. The analysis does not cover exhaustive prior work in video segmentation or concept-based retrieval systems outside the top semantic matches, leaving open questions about broader novelty claims.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: promptable concept segmentation in images and videos. The field has evolved around enabling flexible, user-guided segmentation through diverse prompt modalities, ranging from points and boxes to text descriptions and even audio cues. The taxonomy reveals six major branches that capture this landscape.

Foundation Model Architectures and Core Mechanisms anchor the field with unified frameworks like SAM Concepts[0] and Segment Everything Everywhere[16], which provide general-purpose segmentation engines adaptable to multiple prompt types. Prompt Design and Learning Strategies explore how to optimize prompt representations, whether through learnable tokens as in Learning Prompt SAM[20] or dynamic adaptation schemes. Domain Adaptation and Transfer Learning addresses the challenge of moving these models into specialized contexts such as medical imaging (SAM Medical Survey[1], Medical SAM 2[8]) or remote sensing (Few-shot Remote Sensing[12]), often with minimal retraining. Task-Specific Applications and Extensions branch into niche problems like surgical tool segmentation (Zero-shot Surgical Tool[17]) and urban flood mapping (Urban Floods Interactive[11]), while Few-Shot and Personalized Segmentation focuses on tailoring models to individual users or rare concepts (Personalize SAM[31]). Finally, Generative Models for Segmentation leverage diffusion architectures (ControlNet[29], Unleashing Diffusion Perception[23]) to integrate segmentation with synthesis.

A particularly active tension lies between building universal, training-free frameworks and pursuing domain-specific fine-tuning: works like Training-free Open-World[9] and PODA Zero-shot[40] pursue broad generalization without additional data, while the medical and remote sensing branches emphasize adaptation to specialized distributions.
SAM Concepts[0] sits squarely within the Foundation Model Architectures branch, specifically under Unified Promptable Segmentation Frameworks, alongside Text Image Prompts[5] and UniVS Prompts Queries[33]. Compared to these neighbors, SAM Concepts[0] emphasizes a holistic approach to handling diverse concept-level prompts within a single architecture, whereas Text Image Prompts[5] pioneered early multimodal prompt fusion and UniVS Prompts Queries[33] focuses on unifying prompt and query mechanisms for video understanding. This positioning reflects a trend toward consolidating multiple prompt modalities into cohesive systems that balance generality with task-specific performance.

Claimed Contributions

Promptable Concept Segmentation (PCS) task and SA-Co benchmark

The authors introduce a new task called Promptable Concept Segmentation where users provide concept prompts (short noun phrases, image exemplars, or both) to detect, segment, and track all matching object instances in images and videos. They also create the SA-Co benchmark containing 214K unique concepts with exhaustive masks in 124K images and 1.7K videos.

Retrieved papers: 10
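As defined above, PCS maps a concept prompt (a short noun phrase, image exemplars, or both) to segmentation masks and unique identities for all matching instances. A minimal sketch of that input/output contract, with hypothetical names (`ConceptPrompt`, `segment_concepts`) and exact-string label matching standing in for the model's open-vocabulary recognition:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ConceptPrompt:
    noun_phrase: Optional[str] = None                    # e.g. "yellow school bus"
    exemplar_boxes: list = field(default_factory=list)   # optional image exemplars

@dataclass
class Instance:
    instance_id: int   # unique identity, stable across video frames
    mask: str          # stand-in for a per-instance segmentation mask

def segment_concepts(prompt: ConceptPrompt, detections: list) -> list:
    """Return a mask and a unique id for *every* detection matching the prompt."""
    matches = [d for d in detections
               if prompt.noun_phrase is None or d["label"] == prompt.noun_phrase]
    return [Instance(instance_id=i, mask=d["mask"]) for i, d in enumerate(matches)]
```

A real system would match phrases and exemplar embeddings in an open vocabulary rather than comparing label strings; the point of the sketch is the contract itself — all matching instances are returned, each with its own identity.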
SAM 3 architecture with decoupled recognition and localization

The authors present SAM 3, a unified model consisting of a detector and tracker that share a vision encoder. A key architectural innovation is the presence head that decouples recognition (what) from localization (where), which significantly improves detection accuracy especially when training with challenging negative phrases.

Retrieved papers: 10 · Verdict: Can Refute
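The presence head described above can be pictured as gating per-proposal localization scores ("where") with a single image-level recognition score ("what"). A minimal sketch under that reading (not the paper's implementation; the function names are illustrative):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def score_proposals(presence_logit: float, localization_logits: list) -> list:
    """Final score = P(concept present in image) * P(this box is the object)."""
    presence = sigmoid(presence_logit)               # "what": is the concept here at all?
    return [presence * sigmoid(l) for l in localization_logits]   # "where"
```

Under this decomposition, a hard negative phrase drives the presence score toward zero and suppresses every proposal at once, even ones with high localization logits, which is one way to read the reported accuracy gain from training with challenging negatives.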
Scalable human- and AI-in-the-loop data engine

The authors develop a data engine that iteratively generates annotated data through a feedback loop involving SAM 3, human annotators, and AI annotators (fine-tuned MLLMs). This engine produces high-quality training data with 4M unique phrases and 52M masks, plus synthetic data with 38M phrases and 1.4B masks, while doubling annotation throughput by delegating verification tasks to AI models.

Retrieved papers: 10
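One round of the feedback loop described above can be sketched as follows, with hypothetical stand-ins (`propose`, `ai_verify`, `human_verify`) for SAM 3, the fine-tuned MLLM verifiers, and human annotators; the confidence threshold is an assumption for illustration:

```python
def run_engine_round(images, propose, ai_verify, human_verify, threshold=0.9):
    """One engine iteration: the model proposes annotations, the AI verifier
    accepts clear cases, and only uncertain ones are routed to humans."""
    accepted = []
    for img in images:
        candidate = propose(img)              # SAM 3 proposes masks/phrases
        confidence = ai_verify(candidate)     # MLLM verifier scores the proposal
        if confidence >= threshold:
            accepted.append(candidate)        # AI handles the easy cases
        elif human_verify(candidate):
            accepted.append(candidate)        # humans resolve the uncertain ones
    return accepted                           # feeds the next training round
```

Routing only the low-confidence candidates to humans is what allows the reported doubling of annotation throughput: the AI verifiers absorb the verification work the model already gets right.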

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Promptable Concept Segmentation (PCS) task and SA-Co benchmark

Contribution 2: SAM 3 architecture with decoupled recognition and localization

Contribution 3: Scalable human- and AI-in-the-loop data engine