Benchmarking Open-ended Segmentation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Benchmarking, Open-ended Segmentation, Evaluation Protocol, Lexical Alignment
Abstract:

Open-ended segmentation requires models capable of generating free-form descriptions of previously unseen concepts and regions. Despite advances in model development, current evaluation protocols for open-ended segmentation fail to capture the true semantic accuracy of the generated descriptions. We empirically demonstrate that embedding-based similarity score mappings diverge significantly from human judgments. To address this issue, we introduce a novel mapping function that considers multiple lexical relationships between free-form outputs and test-vocabulary labels, yielding much closer alignment with human annotations. We integrate this mapping into a robust evaluation framework and re-benchmark previous state-of-the-art methods. Additionally, we present the first Multi-modal Large Language Model trained with a contrastive objective to jointly align visual regions and textual descriptions, achieving new state-of-the-art results in open-ended panoptic segmentation.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a novel lexical mapping function and evaluation framework for open-ended segmentation, alongside OPAL, a multimodal large language model trained with contrastive learning. It resides in the 'Lexical Alignment Metrics for Segmentation' leaf under 'Evaluation Frameworks for Open-Ended Outputs', where it is currently the sole paper. This isolation suggests the work addresses an underexplored niche: rigorous evaluation protocols for free-form segmentation outputs. The broader taxonomy shows active research in open-vocabulary segmentation methods (e.g., contrastive alignment, prompt-driven approaches) but limited focus on evaluation frameworks, indicating a gap the paper aims to fill.

The taxonomy reveals neighboring branches in open-vocabulary visual segmentation (image-level and video-level methods) and generalist recognition systems, which produce the outputs this paper seeks to evaluate. Sibling evaluation work exists in 'Text Generation Evaluation with Preference Alignment', addressing free-form text but not visual segmentation. The 'Lexical and Subword Segmentation Methods' branch explores lexical alignment in text processing contexts, yet excludes visual tasks. This positioning highlights the paper's bridging role: applying lexical alignment principles from text domains to visual segmentation evaluation, a connection not explicitly formalized in prior taxonomy nodes.

Among sixteen candidates examined, no contributions were clearly refuted. The lexical mapping function (five candidates examined, zero refutable) and Lexical Alignment Curve protocol (one candidate examined, zero refutable) appear novel within the limited search scope. OPAL's contrastive training for open-ended segmentation (ten candidates examined, zero refutable) shows no direct overlap among top semantic matches. However, the search scale is modest: sixteen papers cannot exhaustively cover all contrastive vision-language models or evaluation metrics. The absence of refutations suggests novelty within the examined subset, but broader literature may contain relevant prior work not captured here.

Based on top-sixteen semantic matches and taxonomy structure, the work appears to occupy a sparse research direction, particularly in evaluation methodology. The taxonomy's single-paper leaf and lack of refutable candidates within the examined scope support this impression. Limitations include the narrow search scale and potential for relevant work in adjacent domains (e.g., text generation metrics, vision-language alignment) not surfaced by semantic search. The analysis covers immediate neighbors but cannot confirm exhaustive novelty across all related fields.

Taxonomy

Core-task Taxonomy Papers: 38
Claimed Contributions: 3
Contribution Candidate Papers Compared: 16
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating open-ended segmentation with lexical alignment. The field encompasses diverse approaches to segmentation and alignment problems where outputs are not constrained to fixed categories. The taxonomy reveals several major branches: open-vocabulary visual segmentation methods that extend beyond closed-set recognition (e.g., Scaling Open-Vocabulary Segmentation[1], Patch Aligned Contrastive[3]), evaluation frameworks designed to assess open-ended outputs when ground truth is ambiguous or flexible, multimodal vision-language assistants that integrate visual and textual understanding (e.g., Llava-med[5]), and specialized domains ranging from lexical and subword segmentation to speech alignment and survey response analysis. These branches reflect a shared challenge: how to define, produce, and measure quality when the space of valid answers is large or even unbounded, requiring alignment mechanisms that go beyond exact matching. Within this landscape, particularly active lines of work explore training-free or weakly-supervised techniques (Training-free Attention Prompts[2], Prototypical Weakly Open-Vocabulary[17]) and methods that leverage latent or lexical alignment to bridge modalities or granularities (Latent Alignment Segmentation[10], Lexically Grounded Subword[11]). Benchmarking Open-ended Segmentation[0] sits squarely within the evaluation frameworks branch, focusing on lexical alignment metrics for segmentation tasks where outputs may vary in granularity or terminology. This emphasis on metric design distinguishes it from neighboring works like AlignSAM[14] or Unified Embedding Alignment[13], which prioritize model architectures or embedding strategies, and from domain-specific efforts such as Child-directed Speech Segmentation[16] or Wine Minerality Segmentation[23] that address narrow application contexts. 
The original paper's contribution lies in formalizing how to score segmentation quality when reference labels are open-ended, a recurring challenge across many branches but rarely addressed with rigorous benchmarking.

Claimed Contributions

Novel lexical mapping function for open-ended segmentation evaluation

The authors introduce a mapping function that considers multiple lexical relationships (exact matches, synonyms, hyponyms, meronyms) between free-form descriptions and test vocabulary categories, rather than relying on single embedding-based similarity scores. This approach achieves significantly higher alignment with human annotations than existing methods like Sentence-BERT.

5 retrieved papers
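To make the idea concrete, here is a minimal sketch of a multi-level lexical mapping, assuming the function checks relations from strictest to loosest. The `LEXICON` below is a hypothetical stand-in for a lexical resource such as WordNet, not the paper's actual data or implementation.

```python
# Hypothetical relation sets for one test-vocabulary label; a real system
# would derive these from a lexical resource such as WordNet.
LEXICON = {
    "sofa": {
        "synonyms": {"couch", "lounge"},
        "hyponyms": {"loveseat", "settee"},
        "meronyms": {"cushion", "armrest"},
    },
}

# Lexical levels ordered from strictest to loosest; earlier matches are tighter.
LEVELS = ["exact", "synonyms", "hyponyms", "meronyms"]

def lexical_match(prediction, label):
    """Return the tightest lexical level at which `prediction` matches `label`,
    or None if no lexical relationship is found."""
    prediction, label = prediction.lower().strip(), label.lower().strip()
    if prediction == label:
        return "exact"
    relations = LEXICON.get(label, {})
    for level in LEVELS[1:]:
        if prediction in relations.get(level, set()):
            return level
    return None
```

For example, `lexical_match("couch", "sofa")` returns `"synonyms"`, which a scoring function could credit more highly than a meronym match such as `"cushion"`.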
Lexical Alignment Curve evaluation protocol

The authors develop a comprehensive evaluation framework called Lexical Alignment Curve (LAC) that integrates their lexical mapping function. This protocol computes recognition metrics across all lexical levels and plots them as a curve, providing diagnostic insights into model performance and enabling standardized re-benchmarking of existing methods.

1 retrieved paper
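A curve of this kind can be sketched as follows, assuming the protocol counts a prediction as correct at level k if it matches at level k or any stricter level; the level names and this cumulative reading are assumptions for illustration, not the paper's exact definition.

```python
# Lexical levels ordered from strictest to loosest.
LEVELS = ["exact", "synonyms", "hyponyms", "meronyms"]

def lac_points(matched_levels):
    """Compute one (level, cumulative accuracy) point per lexical level.

    matched_levels: per-sample tightest match level, or None for a miss.
    """
    n = len(matched_levels)
    points = []
    for k, level in enumerate(LEVELS):
        allowed = set(LEVELS[: k + 1])  # this level and all stricter ones
        hits = sum(1 for m in matched_levels if m in allowed)
        points.append((level, hits / n))
    return points
```

Plotting these points yields a monotonically non-decreasing curve: a model that only gets loose meronym matches rises late, while a model with many exact matches starts high, which is the diagnostic contrast the protocol is after.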
OPAL: First MLLM with contrastive learning for open-ended segmentation

The authors present OPAL, which they claim is the first Multi-modal Large Language Model trained with a contrastive objective alongside the standard generative loss for open-ended segmentation. This dual-objective approach jointly aligns visual regions and textual descriptions, achieving state-of-the-art results on open-ended panoptic segmentation benchmarks.

10 retrieved papers
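The dual objective described above can be sketched as a standard generative loss plus a contrastive alignment term over matched region/text embeddings. The symmetric InfoNCE form, the temperature, and the weight `lam` are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def _log_softmax(x):
    """Numerically stable row-wise log-softmax."""
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def info_nce(region_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: row i of each (N, D) matrix is a matched pair."""
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = r @ t.T / temperature                      # pairwise similarities
    loss_r2t = -np.diag(_log_softmax(logits)).mean()    # region -> text
    loss_t2r = -np.diag(_log_softmax(logits.T)).mean()  # text -> region
    return 0.5 * (loss_r2t + loss_t2r)

def dual_objective(generative_loss, region_emb, text_emb, lam=0.5):
    """Generative loss plus a weighted contrastive alignment term."""
    return generative_loss + lam * info_nce(region_emb, text_emb)
```

The contrastive term pulls each region embedding toward its own description and pushes it away from the other descriptions in the batch, complementing the token-level generative loss, which on its own does not enforce region-text discrimination.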

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, a partial signal of novelty that remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Novel lexical mapping function for open-ended segmentation evaluation

The authors introduce a mapping function that considers multiple lexical relationships (exact matches, synonyms, hyponyms, meronyms) between free-form descriptions and test vocabulary categories, rather than relying on single embedding-based similarity scores. This approach achieves significantly higher alignment with human annotations than existing methods like Sentence-BERT.

Contribution

Lexical Alignment Curve evaluation protocol

The authors develop a comprehensive evaluation framework called Lexical Alignment Curve (LAC) that integrates their lexical mapping function. This protocol computes recognition metrics across all lexical levels and plots them as a curve, providing diagnostic insights into model performance and enabling standardized re-benchmarking of existing methods.

Contribution

OPAL: First MLLM with contrastive learning for open-ended segmentation

The authors present OPAL, which they claim is the first Multi-modal Large Language Model trained with a contrastive objective alongside the standard generative loss for open-ended segmentation. This dual-objective approach jointly aligns visual regions and textual descriptions, achieving state-of-the-art results on open-ended panoptic segmentation benchmarks.