Object-Centric Refinement for Enhanced Zero-Shot Segmentation

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Zero-Shot Learning, Vision-Language Models, Semantic Segmentation, Computer Vision
Abstract:

Zero-shot semantic segmentation aims to recognize, pixel-wise, unseen categories without annotated masks, typically by leveraging vision-language models such as CLIP. However, the patch representations obtained from CLIP's vision encoder lack object-centric structure, making it difficult to localize coherent semantic regions. This hinders the performance of the segmentation decoder, especially for unseen categories. To mitigate this issue, we propose object-centric zero-shot segmentation (OC-ZSS), which enhances patch representations using object-level information. To extract object features for patch refinement, we introduce self-supervision-guided object prompts into the encoder. These prompts attend to coarse object regions using attention masks derived from unsupervised clustering of features from a pretrained self-supervised (SSL) model. Although these prompts offer a structured initialization of object-level context, the extracted features remain coarse due to the unsupervised nature of clustering. To further refine the object features and effectively enrich patch representations, we develop a dual-stage Object Refinement Attention (ORA) module that iteratively updates both object and patch features through cross-attention. Finally, to make the refinement more robust and sensitive to objects of varying spatial scales, we incorporate a lightweight granular attention mechanism that operates over multiple receptive fields. OC-ZSS achieves state-of-the-art performance on standard zero-shot segmentation benchmarks across inductive, transductive, and cross-domain settings.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes an object-centric zero-shot segmentation framework (OC-ZSS) that refines CLIP patch representations using object-level information derived from self-supervised clustering and a dual-stage Object Refinement Attention module. Within the taxonomy, it resides in the 'Object-Centric Patch Representation Enhancement' leaf under 'Vision-Language Model Based Zero-Shot Segmentation'. This leaf contains only two papers total, including the original work, indicating a relatively sparse and emerging research direction focused specifically on explicit object-centric refinement mechanisms for patch-level features.

The taxonomy reveals that the broader vision-language segmentation branch includes a sibling category on 'Feature Purification and Outlier Suppression', which addresses semantic misalignment through noise reduction rather than object-centric structuring. Neighboring branches explore self-supervised ViT features without language alignment, instance segmentation in robotic contexts, and generative diffusion-based approaches. The original paper's position suggests it bridges vision-language alignment with self-supervised feature extraction, diverging from purely discriminative or purely generative paradigms by integrating SSL-derived object prompts into a CLIP-based pipeline.

Across three identified contributions, the analysis examined 22 candidate papers total, with 10 candidates per major contribution and 2 for the ORA module. None of the contributions were clearly refuted by prior work within this limited search scope. The object-centric framework and self-supervision-guided prompts each faced 10 candidates without refutation, while the dual-stage ORA module encountered only 2 candidates. This suggests that among the top-K semantic matches examined, no single prior work directly anticipates the combination of SSL-guided object prompts with iterative attention-based refinement for CLIP patch features.

Given the sparse taxonomy leaf and absence of refutations among 22 examined candidates, the work appears to occupy a relatively novel position within object-centric patch refinement for zero-shot segmentation. However, the limited search scope means the analysis captures top semantic neighbors rather than exhaustive prior art. The integration of SSL clustering with vision-language models represents a distinctive methodological choice, though broader field coverage would be needed to assess whether similar hybrid approaches exist outside the examined candidate set.

Taxonomy

Core-task Taxonomy Papers: 8
Claimed Contributions: 3
Contribution Candidate Papers Compared: 22
Refutable Papers: 0

Research Landscape Overview

Core task: zero-shot semantic segmentation using object-centric patch refinement. The field has coalesced around several distinct methodological branches. Vision-language model based zero-shot segmentation leverages pre-trained models like CLIP to align visual patches with textual class descriptions, enabling segmentation without task-specific training data. Self-supervised vision transformer based unsupervised segmentation exploits learned feature representations to discover object boundaries without explicit labels. Zero-shot instance segmentation in robotic environments addresses the challenge of identifying novel objects in interactive settings, while generative model based zero-shot segmentation uses diffusion or other generative frameworks to produce segmentation masks. These branches differ primarily in their reliance on language supervision, the granularity of their outputs, and their deployment contexts, yet all share the goal of generalizing beyond seen categories.

Within vision-language approaches, a particularly active line of work focuses on refining patch-level representations to better capture object-centric information. Object-Centric Refinement[0] exemplifies this direction by enhancing how individual patches are processed and aggregated to improve segmentation quality. Closely related efforts such as Implicit Patch Distillation[6] and Feature Purification[3] similarly emphasize cleaning or distilling patch features to reduce noise and improve alignment with semantic concepts. In contrast, methods like Eagle[1] and ZISVFM[2] may prioritize different aspects of the vision-language pipeline, such as hierarchical feature extraction or multi-scale fusion. The central trade-off in this cluster revolves around balancing computational efficiency with the fidelity of object-centric representations, and open questions remain about how best to integrate spatial context while preserving fine-grained patch details.
Object-Centric Refinement[0] sits squarely within this patch representation enhancement subgroup, sharing the emphasis on localized feature improvement seen in neighboring works.

Claimed Contributions

Object-centric zero-shot segmentation framework (OC-ZSS)

The authors introduce a framework that refines CLIP patch features to be more object-centric by incorporating object-level information, improving zero-shot segmentation performance especially on unseen categories without retraining the encoder.

10 retrieved papers
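For intuition, the framework's final text-alignment step can be sketched as a per-patch cosine match between refined patch features and class text embeddings. This is a toy illustration under invented dimensions, not the authors' implementation; the function name `zero_shot_segment` and the toy data are assumptions made for this sketch.

```python
import numpy as np

def zero_shot_segment(patch_feats, text_feats, h, w):
    """Label each patch with its most similar class embedding.

    patch_feats: (N, D) refined patch features, N = h * w
    text_feats:  (C, D) text embeddings, one per class name
    Returns an (h, w) map of class indices.
    """
    # L2-normalise so dot products become cosine similarities
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sim = p @ t.T                       # (N, C) patch-class similarities
    return sim.argmax(axis=1).reshape(h, w)

# Toy example: a 2x2 grid of patches and 2 classes; each patch is a
# noisy copy of one class embedding, so its label is easy to recover.
rng = np.random.default_rng(0)
text = rng.normal(size=(2, 8))
patches = np.vstack(
    [text[0] + 0.01 * rng.normal(size=8) for _ in range(2)]
    + [text[1] + 0.01 * rng.normal(size=8) for _ in range(2)]
)
seg = zero_shot_segment(patches, text, 2, 2)
```

In the actual method, the patch features would come from the frozen CLIP vision encoder after object-centric refinement, and the text features from CLIP's text encoder applied to class-name prompts; zero-shot generalization follows because unseen class names can be embedded the same way.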
Self-supervision-guided object prompts

The authors propose frozen object prompts injected into the CLIP encoder that use attention masks generated from DINO feature clustering to focus on distinct object regions, providing coarse object features without requiring annotations or encoder fine-tuning.

10 retrieved papers
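The mask-generation idea behind these prompts can be sketched as follows, assuming a naive k-means over frozen SSL patch features followed by masked average pooling. `cluster_masks` and `init_object_prompts` are hypothetical helpers invented for this sketch (the paper clusters DINO features; here we use toy 2-D features), not the paper's implementation.

```python
import numpy as np

def cluster_masks(ssl_feats, k, iters=10, seed=0):
    """Naive k-means over per-patch SSL features; returns (k, N) binary masks.

    ssl_feats: (N, D) patch features from a frozen self-supervised model.
    Each mask marks the patches of one coarse 'object' region and can be
    used to restrict which patches an object prompt attends to.
    """
    rng = np.random.default_rng(seed)
    centers = ssl_feats[rng.choice(len(ssl_feats), k, replace=False)]
    for _ in range(iters):
        # (N, k) distances to each center, then hard assignment
        d = np.linalg.norm(ssl_feats[:, None] - centers[None], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = ssl_feats[assign == j].mean(axis=0)
    return np.stack([(assign == j).astype(float) for j in range(k)])

def init_object_prompts(patch_feats, masks):
    """Masked average pooling: one coarse object feature per cluster."""
    w = masks / masks.sum(axis=1, keepdims=True).clip(min=1e-8)
    return w @ patch_feats   # (k, D)

# Toy features: two well-separated groups of 4 patches each
feats = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                  [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
masks = cluster_masks(feats, k=2)
prompts = init_object_prompts(feats, masks)
```

Because the clustering is unsupervised, the resulting masks (and hence the pooled object features) are coarse, which is what motivates the subsequent refinement stage in the paper.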
Dual-stage Object Refinement Attention (ORA) module

The authors design a refinement module that performs iterative cross-attention between object and patch features in two stages, gradually enhancing object tokens and using them to enrich patch semantics for tighter object-level grouping and improved text alignment.

2 retrieved papers
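The iterative cross-attention idea can be sketched with single-head attention and residual updates. This is an assumption-laden toy, not the ORA module itself: it omits learned projections, normalization layers, and the paper's multi-scale granular attention, and `ora_refine` is a name invented for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q, kv):
    """Single-head cross-attention: queries q (M, D) attend to kv (N, D)."""
    attn = softmax(q @ kv.T * q.shape[1] ** -0.5, axis=-1)  # (M, N)
    return attn @ kv

def ora_refine(obj, patches, iters=2):
    """Dual-stage refinement per iteration: objects first gather patch
    context (stage 1), then patches are enriched with the refined object
    tokens (stage 2), each via a residual cross-attention update."""
    for _ in range(iters):
        obj = obj + cross_attend(obj, patches)          # stage 1: object <- patch
        patches = patches + cross_attend(patches, obj)  # stage 2: patch <- object
    return obj, patches

# Toy shapes: 2 object tokens and 6 patch features of dimension 4
rng = np.random.default_rng(0)
obj_r, pat_r = ora_refine(rng.normal(size=(2, 4)), rng.normal(size=(6, 4)))
```

The key structural point the sketch preserves is the alternation: object tokens are sharpened against the patch grid before they are used to enrich patch semantics, which is what "tighter object-level grouping" refers to in the contribution description.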

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution
Object-centric zero-shot segmentation framework (OC-ZSS)

Contribution
Self-supervision-guided object prompts

Contribution
Dual-stage Object Refinement Attention (ORA) module