Benchmarking Open-ended Segmentation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Benchmarking, Open-ended Segmentation, Evaluation Protocol, Lexical Alignment
Abstract:

Open-ended segmentation requires models capable of generating free-form descriptions of previously unseen concepts and regions. Despite advances in model development, current evaluation protocols for open-ended segmentation fail to capture the true semantic accuracy of the generated descriptions. We empirically demonstrate that embedding-based similarity score mappings diverge significantly from human judgments. To address this issue, we introduce a novel mapping function that considers multiple lexical relationships between free-form outputs and test-vocabulary labels, yielding much closer alignment with human annotations. We integrate this mapping into a robust evaluation framework and re-benchmark previous state-of-the-art methods. Additionally, we present the first Multi-modal Large Language Model trained with a contrastive objective to jointly align visual regions and textual descriptions, achieving new state-of-the-art results in open-ended panoptic segmentation.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces a novel lexical mapping function and evaluation framework for open-ended segmentation, alongside OPAL, a multimodal large language model trained with contrastive learning. It resides in the 'Lexical Alignment Metrics for Segmentation' leaf under 'Evaluation Frameworks for Open-Ended Outputs', where it is currently the sole paper. This isolation suggests the work addresses an underexplored niche: rigorous evaluation protocols for free-form segmentation outputs. The broader taxonomy shows active research in open-vocabulary segmentation methods (e.g., contrastive alignment, prompt-driven approaches) but limited focus on evaluation frameworks, indicating a gap the paper aims to fill.

The taxonomy reveals neighboring branches in open-vocabulary visual segmentation (image-level and video-level methods) and generalist recognition systems, which produce the outputs this paper seeks to evaluate. Sibling evaluation work exists in 'Text Generation Evaluation with Preference Alignment', addressing free-form text but not visual segmentation. The 'Lexical and Subword Segmentation Methods' branch explores lexical alignment in text processing contexts, yet excludes visual tasks. This positioning highlights the paper's bridging role: applying lexical alignment principles from text domains to visual segmentation evaluation, a connection not explicitly formalized in prior taxonomy nodes.

Among sixteen candidates examined, no contributions were clearly refuted. The lexical mapping function (five candidates examined, zero refutable) and Lexical Alignment Curve protocol (one candidate examined, zero refutable) appear novel within the limited search scope. OPAL's contrastive training for open-ended segmentation (ten candidates examined, zero refutable) shows no direct overlap among top semantic matches. However, the search scale is modest: sixteen papers cannot exhaustively cover all contrastive vision-language models or evaluation metrics. The absence of refutations suggests novelty within the examined subset, but broader literature may contain relevant prior work not captured here.

Based on top-sixteen semantic matches and taxonomy structure, the work appears to occupy a sparse research direction, particularly in evaluation methodology. The taxonomy's single-paper leaf and lack of refutable candidates within the examined scope support this impression. Limitations include the narrow search scale and potential for relevant work in adjacent domains (e.g., text generation metrics, vision-language alignment) not surfaced by semantic search. The analysis covers immediate neighbors but cannot confirm exhaustive novelty across all related fields.

Taxonomy

Core-task Taxonomy Papers: 38
Claimed Contributions: 3
Contribution Candidate Papers Compared: 16
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating open-ended segmentation with lexical alignment. The field encompasses diverse approaches to segmentation and alignment problems where outputs are not constrained to fixed categories. The taxonomy reveals several major branches: open-vocabulary visual segmentation methods that extend beyond closed-set recognition (e.g., Scaling Open-Vocabulary Segmentation[1], Patch Aligned Contrastive[3]), evaluation frameworks designed to assess open-ended outputs when ground truth is ambiguous or flexible, multimodal vision-language assistants that integrate visual and textual understanding (e.g., Llava-med[5]), and specialized domains ranging from lexical and subword segmentation to speech alignment and survey response analysis. These branches reflect a shared challenge: how to define, produce, and measure quality when the space of valid answers is large or even unbounded, requiring alignment mechanisms that go beyond exact matching. Within this landscape, particularly active lines of work explore training-free or weakly-supervised techniques (Training-free Attention Prompts[2], Prototypical Weakly Open-Vocabulary[17]) and methods that leverage latent or lexical alignment to bridge modalities or granularities (Latent Alignment Segmentation[10], Lexically Grounded Subword[11]). Benchmarking Open-ended Segmentation[0] sits squarely within the evaluation frameworks branch, focusing on lexical alignment metrics for segmentation tasks where outputs may vary in granularity or terminology. This emphasis on metric design distinguishes it from neighboring works like AlignSAM[14] or Unified Embedding Alignment[13], which prioritize model architectures or embedding strategies, and from domain-specific efforts such as Child-directed Speech Segmentation[16] or Wine Minerality Segmentation[23] that address narrow application contexts. 
The original paper's contribution lies in formalizing how to score segmentation quality when reference labels are open-ended, a recurring challenge across many branches but rarely addressed with rigorous benchmarking.

Claimed Contributions

Novel lexical mapping function for open-ended segmentation evaluation

The authors introduce a mapping function that considers multiple lexical relationships (exact matches, synonyms, hyponyms, meronyms) between free-form descriptions and test vocabulary categories, rather than relying on single embedding-based similarity scores. This approach achieves significantly higher alignment with human annotations than existing methods like Sentence-BERT.

5 retrieved papers
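To make the idea concrete, here is a minimal sketch of a multi-level lexical mapping, assuming the function checks relations from strictest to loosest. The `LEXICON` below is a hypothetical stand-in for a lexical resource such as WordNet, not the paper's actual data or implementation.

```python
# Hypothetical relation sets for one test-vocabulary label; a real system
# would derive these from a lexical resource such as WordNet.
LEXICON = {
    "sofa": {
        "synonyms": {"couch", "lounge"},
        "hyponyms": {"loveseat", "settee"},
        "meronyms": {"cushion", "armrest"},
    },
}

# Lexical levels ordered from strictest to loosest; earlier matches are tighter.
LEVELS = ["exact", "synonyms", "hyponyms", "meronyms"]

def lexical_match(prediction, label):
    """Return the tightest lexical level at which `prediction` matches `label`,
    or None if no lexical relationship is found."""
    prediction, label = prediction.lower().strip(), label.lower().strip()
    if prediction == label:
        return "exact"
    relations = LEXICON.get(label, {})
    for level in LEVELS[1:]:
        if prediction in relations.get(level, set()):
            return level
    return None
```

For example, `lexical_match("couch", "sofa")` returns `"synonyms"`, which a scoring function could credit more highly than a meronym match such as `"cushion"`.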
Lexical Alignment Curve evaluation protocol

The authors develop a comprehensive evaluation framework called Lexical Alignment Curve (LAC) that integrates their lexical mapping function. This protocol computes recognition metrics across all lexical levels and plots them as a curve, providing diagnostic insights into model performance and enabling standardized re-benchmarking of existing methods.

1 retrieved paper
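A curve of this kind can be sketched as follows, assuming the protocol counts a prediction as correct at level k if it matches at level k or any stricter level; the level names and this cumulative reading are assumptions for illustration, not the paper's exact definition.

```python
# Lexical levels ordered from strictest to loosest.
LEVELS = ["exact", "synonyms", "hyponyms", "meronyms"]

def lac_points(matched_levels):
    """Compute one (level, cumulative accuracy) point per lexical level.

    matched_levels: per-sample tightest match level, or None for a miss.
    """
    n = len(matched_levels)
    points = []
    for k, level in enumerate(LEVELS):
        allowed = set(LEVELS[: k + 1])  # this level and all stricter ones
        hits = sum(1 for m in matched_levels if m in allowed)
        points.append((level, hits / n))
    return points
```

Plotting these points yields a monotonically non-decreasing curve: a model that only gets loose meronym matches rises late, while a model with many exact matches starts high, which is the diagnostic contrast the protocol is after.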
OPAL: First MLLM with contrastive learning for open-ended segmentation

The authors present OPAL, which they claim is the first Multi-modal Large Language Model trained with a contrastive objective alongside the standard generative loss for open-ended segmentation. This dual-objective approach jointly aligns visual regions and textual descriptions, achieving state-of-the-art results on open-ended panoptic segmentation benchmarks.

10 retrieved papers
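The dual objective described above can be sketched as a standard generative loss plus a contrastive alignment term over matched region/text embeddings. The symmetric InfoNCE form, the temperature, and the weight `lam` are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def _log_softmax(x):
    """Numerically stable row-wise log-softmax."""
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def info_nce(region_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: row i of each (N, D) matrix is a matched pair."""
    r = region_emb / np.linalg.norm(region_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = r @ t.T / temperature                      # pairwise similarities
    loss_r2t = -np.diag(_log_softmax(logits)).mean()    # region -> text
    loss_t2r = -np.diag(_log_softmax(logits.T)).mean()  # text -> region
    return 0.5 * (loss_r2t + loss_t2r)

def dual_objective(generative_loss, region_emb, text_emb, lam=0.5):
    """Generative loss plus a weighted contrastive alignment term."""
    return generative_loss + lam * info_nce(region_emb, text_emb)
```

The contrastive term pulls each region embedding toward its own description and pushes it away from the other descriptions in the batch, complementing the token-level generative loss, which on its own does not enforce region-text discrimination.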

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, a partial signal of novelty that remains constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Novel lexical mapping function for open-ended segmentation evaluation

The authors introduce a mapping function that considers multiple lexical relationships (exact matches, synonyms, hyponyms, meronyms) between free-form descriptions and test vocabulary categories, rather than relying on single embedding-based similarity scores. This approach achieves significantly higher alignment with human annotations than existing methods like Sentence-BERT.

Contribution

Lexical Alignment Curve evaluation protocol

The authors develop a comprehensive evaluation framework called Lexical Alignment Curve (LAC) that integrates their lexical mapping function. This protocol computes recognition metrics across all lexical levels and plots them as a curve, providing diagnostic insights into model performance and enabling standardized re-benchmarking of existing methods.

Contribution

OPAL: First MLLM with contrastive learning for open-ended segmentation

The authors present OPAL, which they claim is the first Multi-modal Large Language Model trained with a contrastive objective alongside the standard generative loss for open-ended segmentation. This dual-objective approach jointly aligns visual regions and textual descriptions, achieving state-of-the-art results on open-ended panoptic segmentation benchmarks.