SAM-Veteran: An MLLM-Based Human-like SAM Agent for Reasoning Segmentation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: reasoning segmentation, multi-modal large language model, reinforcement learning
Abstract:

Significant progress has been made in reasoning segmentation by combining multi-modal large language models (MLLMs) with the Segment Anything Model (SAM): the former excel at reasoning and vision–language alignment, while the latter offers powerful pixel-level understanding. However, current paradigms fall short of exploiting SAM’s strengths, especially its ability to support iterative mask refinement through interactive segmentation, a process that human users perform naturally. To bridge this gap, we introduce SAM-Veteran, an experienced mask-aware SAM agent capable of emulating human interaction with SAM via a reasoning-driven segmentation workflow that integrates (i) generating bounding boxes from image–query pairs as SAM input, (ii) proposing refinement points based on SAM-generated masks, and (iii) adaptively terminating the process. To this end, we propose a multi-task reinforcement learning framework based on Group Relative Policy Optimization (GRPO), which enhances the MLLM’s abilities in textual grounding and mask comprehension. Furthermore, we introduce a dynamic sampling strategy tailored to generating both boxes and points to stabilize training. Extensive experiments across diverse datasets show that SAM-Veteran achieves human-like interaction with SAM and establishes new state-of-the-art performance on both in-domain and out-of-domain benchmarks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SAM-Veteran, an MLLM-based agent that orchestrates iterative mask refinement through a reasoning-driven workflow combining box generation, refinement point proposal, and adaptive termination. Within the taxonomy, it resides in the Multi-Round Interactive Reasoning leaf under Vision-Language Reasoning Segmentation, which contains only three papers total. This positions the work in a relatively sparse research direction focused on conversational memory and iterative query refinement across multiple interaction rounds, distinguishing it from single-stage reasoning approaches that dominate neighboring leaves.

The taxonomy reveals that Multi-Round Interactive Reasoning sits alongside four other Vision-Language Reasoning Segmentation subcategories: Chain-of-Thought Reasoning Segmentation (four papers emphasizing explicit multi-step decomposition), Direct MLLM-Based Segmentation (four papers coupling LLMs with mask decoders without explicit reasoning steps), High-Resolution Perception Enhancement (two papers addressing encoder resolution limits), and 3D Reasoning Segmentation (two papers extending to point clouds). SAM-Veteran diverges from these neighbors by emphasizing conversational refinement cycles rather than single-pass reasoning or resolution enhancement, while sharing the broader goal of integrating language understanding with segmentation.

Among the thirty candidates examined, none clearly refutes any of the three core contributions. For the SAM-Veteran agent concept, ten candidates were examined with zero refutable overlaps; for the GRPO-based multi-task reinforcement learning framework, ten candidates yielded zero refutations; and for the dynamic sampling strategy for box and point generation, no clear prior work was found among ten candidates. This suggests that, within the limited search scope, the combination of MLLM-driven iterative refinement, reinforcement learning for mask comprehension, and dynamic sampling appears relatively unexplored, though the analysis does not claim exhaustive coverage of all relevant literature.

Based on the top-thirty semantic matches and citation expansion, the work appears to occupy a distinct niche within multi-round interactive reasoning, particularly in its use of reinforcement learning to train an agent for human-like SAM interaction. The limited search scope means potentially relevant work in broader reinforcement learning for vision-language tasks or alternative SAM adaptation strategies may not be fully represented. The sparse population of the Multi-Round Interactive Reasoning leaf and absence of refutable candidates suggest novelty within the examined literature, though a more comprehensive search could reveal additional connections.

Taxonomy

Core-task Taxonomy Papers: 47
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: reasoning segmentation with iterative mask refinement. This field addresses the challenge of producing precise object masks by combining language-based reasoning with progressive boundary correction.

The taxonomy reveals several complementary directions. Vision-Language Reasoning Segmentation explores how multimodal models can ground complex linguistic descriptions in visual scenes, often leveraging large language models to parse instructions and guide segmentation. Referring Expression Segmentation focuses on mapping natural language phrases to target regions, while Foundation Model Adaptation and Prompting investigates how pretrained architectures like SAM can be steered via learned or handcrafted prompts. Interactive and Click-Based Segmentation emphasizes user-driven refinement through point or box inputs, and Automatic Mask Refinement Mechanisms develops self-correcting pipelines that iteratively improve mask quality without human intervention. Specialized Segmentation Contexts applies these ideas to domains such as medical imaging, remote sensing, and 3D scenes, where domain-specific priors and data characteristics demand tailored strategies.

A particularly active line of work centers on multi-round interactive reasoning, where systems engage in dialogue-like exchanges to progressively clarify ambiguous queries and refine masks. SAM-Veteran[0] exemplifies this approach by orchestrating iterative cycles that adjust segmentation boundaries based on feedback from both visual features and language cues, closely aligning with methods like MMC[17] and Chain-of-Ground[40] that also emphasize stepwise reasoning and grounding. In contrast, works such as ClipSAM[3] prioritize efficient prompt engineering to adapt foundation models with minimal retraining, while PixelLM[2] integrates pixel-level language embeddings for fine-grained alignment.

The interplay between automatic refinement loops and interactive guidance remains an open question: some approaches favor fully autonomous correction mechanisms, whereas others rely on human or agent-driven prompts to navigate complex scenes. SAM-Veteran[0] sits within the multi-round interactive reasoning cluster, distinguishing itself by treating refinement as a conversational process that iteratively reconciles language semantics with evolving mask hypotheses.

Claimed Contributions

SAM-Veteran: An MLLM-based SAM agent for human-like reasoning segmentation

The authors propose SAM-Veteran, an MLLM-based agent that mimics human usage of SAM by performing a complete workflow including generating bounding boxes, proposing refinement points based on SAM-generated masks, and adaptively terminating the refinement process.

10 retrieved papers
Multi-task reinforcement learning framework based on GRPO

The authors develop a multi-task RL framework using GRPO that trains the MLLM through three tasks: textual grounding (generating bounding boxes), mask comprehension (evaluating and refining masks), and an auxiliary task (identifying flaws in corrupted masks).

10 retrieved papers
Dynamic sampling strategy for box and point generation

The authors propose a dynamic sampling strategy adapted from DAPO that over-samples candidate boxes and actions to ensure diversity in rewards across rollouts, thereby stabilizing the GRPO-based reinforcement learning process.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SAM-Veteran: An MLLM-based SAM agent for human-like reasoning segmentation

The authors propose SAM-Veteran, an MLLM-based agent that mimics human usage of SAM by performing a complete workflow including generating bounding boxes, proposing refinement points based on SAM-generated masks, and adaptively terminating the refinement process.
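The three-step workflow described above (box generation, point-based refinement, adaptive termination) amounts to a simple control loop around SAM. The following is a minimal, hypothetical sketch with mock stand-ins for both the MLLM agent and the SAM predictor; none of the class or method names come from the paper.

```python
class MockSAM:
    """Stand-in for a SAM predictor; a real system would return a
    binary mask. Here we only record which prompts were supplied."""
    def predict(self, image, box, points=None, labels=None):
        return {"box": box, "n_points": 0 if points is None else len(points)}


class MockAgent:
    """Stand-in for the MLLM agent: one round of corrective points,
    then termination."""
    def __init__(self):
        self.rounds = 0

    def propose_box(self, image, query):
        # (i) textual grounding: a box from the image-query pair
        return (10, 10, 50, 50)

    def inspect_mask(self, image, query, mask):
        # (ii) mask comprehension: propose points; (iii) decide to stop
        self.rounds += 1
        return {"terminate": self.rounds > 1,
                "points": [(30, 30)], "labels": [1]}  # 1 = foreground


def refine_with_sam(agent, sam, image, query, max_rounds=3):
    """Box -> mask -> point-refinement loop with adaptive termination."""
    box = agent.propose_box(image, query)
    mask = sam.predict(image, box=box)
    for _ in range(max_rounds):
        action = agent.inspect_mask(image, query, mask)
        if action["terminate"]:
            break
        mask = sam.predict(image, box=box,
                           points=action["points"], labels=action["labels"])
    return mask
```

The key design point the report highlights is that the loop itself is driven by the agent's reading of the intermediate mask, rather than by a fixed number of refinement rounds.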

Contribution

Multi-task reinforcement learning framework based on GRPO

The authors develop a multi-task RL framework using GRPO that trains the MLLM through three tasks: textual grounding (generating bounding boxes), mask comprehension (evaluating and refining masks), and an auxiliary task (identifying flaws in corrupted masks).
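The group-relative advantage at the heart of GRPO can be illustrated in a few lines. This is a generic sketch of GRPO's advantage normalization, not the paper's implementation; how each task's reward is computed (e.g., box IoU for textual grounding, mask IoU for mask comprehension) is an assumption here.

```python
import statistics

def grpo_advantages(group_rewards):
    """Normalize each rollout's reward by its group's mean and standard
    deviation, as in Group Relative Policy Optimization: rollouts are
    scored relative to siblings sampled for the same prompt."""
    mu = statistics.mean(group_rewards)
    sigma = statistics.pstdev(group_rewards)
    if sigma == 0:  # identical rewards: no learning signal in this group
        return [0.0 for _ in group_rewards]
    return [(r - mu) / sigma for r in group_rewards]
```

Note the degenerate case: when every rollout in a group earns the same reward, all advantages are zero, which is precisely the situation the dynamic sampling strategy below is designed to avoid.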

Contribution

Dynamic sampling strategy for box and point generation

The authors propose a dynamic sampling strategy adapted from DAPO that over-samples candidate boxes and actions to ensure diversity in rewards across rollouts, thereby stabilizing the GRPO-based reinforcement learning process.
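The filtering logic of DAPO-style dynamic sampling can be sketched compactly: over-sample rollout groups, discard groups whose rewards are all identical (their group-relative advantages would be uniformly zero), and fill the batch from the rest. The function name and data layout below are illustrative, not from the paper.

```python
def dynamic_sample(rollout_groups, batch_size):
    """Keep only groups whose rewards differ within the group, so every
    retained group carries a nonzero group-relative advantage signal,
    then truncate to the target batch size. Assumes the caller has
    over-sampled more groups than batch_size."""
    informative = [g for g in rollout_groups if max(g) != min(g)]
    return informative[:batch_size]
```

Under this reading, over-sampling candidate boxes and points serves the same purpose in both sub-tasks: it raises the chance that each group contains a mix of reward outcomes, stabilizing the GRPO update.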