SAM 3: Segment Anything with Concepts

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: foundation models, open vocabulary segmentation, semantic instance segmentation, object tracking
Abstract:

We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts, which we define as either short noun phrases (e.g., “yellow school bus”), image exemplars, or a combination of both. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and unique identities for all matching object instances. To advance PCS, we build a scalable data engine that produces a high-quality dataset with 4M unique concept labels, including hard negatives, across images and videos. Our model consists of a vision backbone shared between an image-level detector and a memory-based video tracker. Recognition and localization are decoupled with a presence head, which significantly boosts detection accuracy. SAM 3 delivers a 2x gain over existing systems in both image and video PCS, and improves previous SAM capabilities in interactive visual segmentation tasks. We open source SAM 3 along with our new Segment Anything with Concepts (SA-Co) benchmark.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Promptable Concept Segmentation (PCS), a task that unifies detection, segmentation, and tracking across images and videos using concept prompts (noun phrases, image exemplars, or both). It resides in the 'Unified Promptable Segmentation Frameworks' leaf alongside three sibling papers: Text Image Prompts, ControlNet-based methods, and UniVS Prompts Queries. This leaf represents a moderately populated research direction within the broader Foundation Model Architectures branch, which anchors the field's core segmentation engines. The taxonomy reveals that unified frameworks constitute one of sixteen leaf nodes across fifty papers, indicating a well-established but not overcrowded area.

The taxonomy tree shows that neighboring leaves include Video Object Segmentation with Memory Mechanisms (four papers on temporal tracking) and Attention and Diffusion-Based Segmentation Mechanisms (three papers on cross-modal alignment). The paper's integration of memory-based video tracking connects it to the video segmentation branch, while its decoupled recognition-localization architecture relates to attention mechanism research. The scope notes clarify that unified frameworks handle multiple prompt types within single architectures, distinguishing them from single-modality methods or domain-specific adaptations found in medical and remote sensing branches. This positioning suggests the work bridges foundational architecture design with video-specific temporal reasoning.

Among the thirty candidates examined, the architecture contribution (decoupled recognition and localization) shows overlap with three prior works, while the PCS task formulation and the data engine appear more distinctive, with no refutable candidates among the ten papers examined for each. The limited search scope means these statistics reflect top-K semantic matches rather than exhaustive coverage. The presence head for decoupling recognition and localization appears to have precedent in the examined literature, whereas the concept-level prompt formulation and the human-AI data engine methodology show less direct overlap within the candidate set. The SA-Co benchmark contribution likewise lacks clear refutation among the examined candidates.

Based on the limited thirty-candidate search, the work appears to make incremental architectural contributions while introducing a novel task formulation and benchmark. The taxonomy context reveals a moderately active research area with established sibling frameworks, suggesting the field has matured beyond initial exploration but remains open to refinement. The analysis does not cover exhaustive prior work in video segmentation or concept-based retrieval systems outside the top semantic matches, leaving open questions about broader novelty claims.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: promptable concept segmentation in images and videos. The field has evolved around enabling flexible, user-guided segmentation through diverse prompt modalities, ranging from points and boxes to text descriptions and even audio cues. The taxonomy reveals six major branches that capture this landscape.

Foundation Model Architectures and Core Mechanisms anchor the field with unified frameworks like SAM Concepts[0] and Segment Everything Everywhere[16], which provide general-purpose segmentation engines adaptable to multiple prompt types. Prompt Design and Learning Strategies explore how to optimize prompt representations, whether through learnable tokens as in Learning Prompt SAM[20] or dynamic adaptation schemes. Domain Adaptation and Transfer Learning addresses the challenge of moving these models into specialized contexts such as medical imaging (SAM Medical Survey[1], Medical SAM 2[8]) or remote sensing (Few-shot Remote Sensing[12]), often with minimal retraining. Task-Specific Applications and Extensions branch into niche problems like surgical tool segmentation (Zero-shot Surgical Tool[17]) and urban flood mapping (Urban Floods Interactive[11]), while Few-Shot and Personalized Segmentation focuses on tailoring models to individual users or rare concepts (Personalize SAM[31]). Finally, Generative Models for Segmentation leverage diffusion architectures (ControlNet[29], Unleashing Diffusion Perception[23]) to integrate segmentation with synthesis.

A particularly active tension lies between building universal, training-free frameworks and pursuing domain-specific fine-tuning: works like Training-free Open-World[9] and PODA Zero-shot[40] pursue broad generalization without additional data, while the medical and remote sensing branches emphasize adaptation to specialized distributions.
SAM Concepts[0] sits squarely within the Foundation Model Architectures branch, specifically under Unified Promptable Segmentation Frameworks, alongside Text Image Prompts[5] and UniVS Prompts Queries[33]. Compared to these neighbors, SAM Concepts[0] emphasizes a holistic approach to handling diverse concept-level prompts within a single architecture, whereas Text Image Prompts[5] pioneered early multimodal prompt fusion and UniVS Prompts Queries[33] focuses on unifying prompt and query mechanisms for video understanding. This positioning reflects a trend toward consolidating multiple prompt modalities into cohesive systems that balance generality with task-specific performance.

Claimed Contributions

Promptable Concept Segmentation (PCS) task and SA-Co benchmark

The authors introduce a new task called Promptable Concept Segmentation where users provide concept prompts (short noun phrases, image exemplars, or both) to detect, segment, and track all matching object instances in images and videos. They also create the SA-Co benchmark containing 214K unique concepts with exhaustive masks in 124K images and 1.7K videos.

Retrieved papers: 10
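As defined above, PCS maps a concept prompt (a short noun phrase, image exemplars, or both) to segmentation masks and unique identities for all matching instances. A minimal sketch of that input/output contract, with hypothetical names (`ConceptPrompt`, `segment_concepts`) and exact-string label matching standing in for the model's open-vocabulary recognition:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ConceptPrompt:
    noun_phrase: Optional[str] = None                    # e.g. "yellow school bus"
    exemplar_boxes: list = field(default_factory=list)   # optional image exemplars

@dataclass
class Instance:
    instance_id: int   # unique identity, stable across video frames
    mask: str          # stand-in for a per-instance segmentation mask

def segment_concepts(prompt: ConceptPrompt, detections: list) -> list:
    """Return a mask and a unique id for *every* detection matching the prompt."""
    matches = [d for d in detections
               if prompt.noun_phrase is None or d["label"] == prompt.noun_phrase]
    return [Instance(instance_id=i, mask=d["mask"]) for i, d in enumerate(matches)]
```

A real system would match phrases and exemplar embeddings in an open vocabulary rather than comparing label strings; the point of the sketch is the contract itself — all matching instances are returned, each with its own identity.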
SAM 3 architecture with decoupled recognition and localization

The authors present SAM 3, a unified model consisting of a detector and tracker that share a vision encoder. A key architectural innovation is the presence head that decouples recognition (what) from localization (where), which significantly improves detection accuracy especially when training with challenging negative phrases.

Retrieved papers: 10 · Verdict: Can Refute
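The presence head described above can be pictured as gating per-proposal localization scores ("where") with a single image-level recognition score ("what"). A minimal sketch under that reading (not the paper's implementation; the function names are illustrative):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def score_proposals(presence_logit: float, localization_logits: list) -> list:
    """Final score = P(concept present in image) * P(this box is the object)."""
    presence = sigmoid(presence_logit)               # "what": is the concept here at all?
    return [presence * sigmoid(l) for l in localization_logits]   # "where"
```

Under this decomposition, a hard negative phrase drives the presence score toward zero and suppresses every proposal at once, even ones with high localization logits, which is one way to read the reported accuracy gain from training with challenging negatives.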
Scalable human- and AI-in-the-loop data engine

The authors develop a data engine that iteratively generates annotated data through a feedback loop involving SAM 3, human annotators, and AI annotators (fine-tuned MLLMs). This engine produces high-quality training data with 4M unique phrases and 52M masks, plus synthetic data with 38M phrases and 1.4B masks, while doubling annotation throughput by delegating verification tasks to AI models.

Retrieved papers: 10
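One round of the feedback loop described above can be sketched as follows, with hypothetical stand-ins (`propose`, `ai_verify`, `human_verify`) for SAM 3, the fine-tuned MLLM verifiers, and human annotators; the confidence threshold is an assumption for illustration:

```python
def run_engine_round(images, propose, ai_verify, human_verify, threshold=0.9):
    """One engine iteration: the model proposes annotations, the AI verifier
    accepts clear cases, and only uncertain ones are routed to humans."""
    accepted = []
    for img in images:
        candidate = propose(img)              # SAM 3 proposes masks/phrases
        confidence = ai_verify(candidate)     # MLLM verifier scores the proposal
        if confidence >= threshold:
            accepted.append(candidate)        # AI handles the easy cases
        elif human_verify(candidate):
            accepted.append(candidate)        # humans resolve the uncertain ones
    return accepted                           # feeds the next training round
```

Routing only the low-confidence candidates to humans is what allows the reported doubling of annotation throughput: the AI verifiers absorb the verification work the model already gets right.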

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: Promptable Concept Segmentation (PCS) task and SA-Co benchmark

Contribution 2: SAM 3 architecture with decoupled recognition and localization

Contribution 3: Scalable human- and AI-in-the-loop data engine