MaskInversion: Localized Embeddings via Optimization of Explainability Maps

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: vision encoder, localized embedding, CLIP
Abstract:

Vision-language foundation models such as CLIP have achieved tremendous results in global vision-language alignment, but still show limitations in creating representations for specific image regions. To address this problem, we propose MaskInversion, a method that leverages the feature representations of pre-trained foundation models, such as CLIP, to generate a context-aware embedding for a query image region specified by a mask at test time. MaskInversion starts by initializing an embedding token and comparing its explainability map, derived from the pre-trained model, to the query mask. The embedding token is then iteratively refined to approximate the query region by minimizing the discrepancy between its explainability map and the query mask. During this process, only the embedding vector is updated, while the underlying foundation model is kept frozen, which allows MaskInversion to be used with any pre-trained model. As deriving the explainability map involves computing its gradient, which can be expensive, we propose a gradient decomposition strategy that simplifies this computation. The learned region representation can be used for a broad range of tasks, including open-vocabulary class retrieval, referring expression comprehension, as well as localized captioning and image generation. We evaluate the proposed method on all these tasks on several datasets, including PascalVOC, MSCOCO, RefCOCO, and OpenImagesV7, and demonstrate its capabilities compared to other state-of-the-art approaches.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes MaskInversion, a test-time optimization method that refines embedding tokens to align with query image regions by minimizing discrepancies between explainability maps and input masks. Within the taxonomy, it resides in the 'Embedding Optimization for Localized Representations' leaf under 'Region-Based Inference and Prompting'. This leaf contains only two papers total, indicating a relatively sparse research direction compared to more crowded areas like 'Region-Text Contrastive Learning' (five papers) or 'Visual Prompting and Attention Guidance' (five papers). The work thus occupies a niche focused on iterative embedding refinement rather than fixed prompting or pre-training strategies.

The taxonomy reveals that neighboring leaves emphasize different inference-time strategies: 'Visual Prompting and Attention Guidance' uses markers or bounding boxes to guide attention without embedding optimization, while 'Region-Conditioned Generation and Grounding' focuses on generative outputs conditioned on regions. Broader branches like 'Region-Aware Vision-Language Pre-training' address foundational training with region supervision, which MaskInversion explicitly avoids by keeping models frozen. The scope notes clarify that this leaf excludes fixed visual prompts and pre-training modifications, positioning MaskInversion as a post-hoc optimization approach distinct from both training-time alignment and static prompting methods.

Among thirty candidates examined across three contributions, none were flagged as clearly refuting the proposed methods. The core MaskInversion contribution examined ten candidates with zero refutable overlaps, as did the gradient decomposition strategy and regularization loss components. This suggests that within the limited search scope—primarily top-K semantic matches and citation expansion—no prior work directly anticipates the combination of explainability-map-driven inversion with gradient decomposition for region embedding. However, the small candidate pool and sparse leaf population mean the analysis captures a snapshot rather than exhaustive coverage of potential prior art.

Given the limited search scope and the sparse taxonomy leaf, the work appears to introduce a distinct optimization-centric approach to region embeddings. The absence of refutable candidates among thirty examined papers, combined with only one sibling paper in the taxonomy, suggests the method occupies a relatively unexplored niche. However, the analysis does not cover broader embedding optimization literature outside vision-language models or alternative explainability-based techniques, leaving open questions about connections to related optimization paradigms in other domains.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Generating localized embeddings for specific image regions from vision-language models. The field has organized itself into several major branches that reflect different emphases in how region-level representations are learned and applied. Region-Aware Vision-Language Pre-training and Alignment focuses on foundational training strategies that align visual regions with language at scale, often through contrastive or grounding objectives (e.g., GLIPv2[6], RegionCLIP[20]). Spatial and 3D-Aware Region Representation extends these ideas to capture geometric and depth information, enabling richer spatial reasoning (e.g., SpatialRGPT[1], Spatial 3D LLM[5]). Region-Based Inference and Prompting explores how to optimize or manipulate region embeddings at inference time, while Multi-Modal Task Integration with Region Features and Specialized Region-Based Applications address downstream uses ranging from interactive understanding to domain-specific tasks like medical imaging (e.g., RegionMed CLIP[31]). Finally, Analysis and Understanding of Region Representations investigates what these embeddings capture and how they can be improved.

A particularly active line of work centers on embedding optimization for localized representations, where methods seek to refine region features beyond what pre-training alone provides. MaskInversion[0] sits squarely in this space, focusing on inverting or optimizing embeddings to better capture fine-grained region semantics. This contrasts with approaches like LARE[10], which may emphasize different optimization strategies or prompting mechanisms to achieve localization. Meanwhile, works such as Groma[9] and RegionGPT[8] illustrate how region features can be integrated into large language models for grounded reasoning, highlighting a trade-off between optimization-centric methods and those that rely on architectural innovations or richer pre-training. The central question across these directions is how to balance computational efficiency, generalization across diverse region types, and the fidelity of localized embeddings, challenges that MaskInversion[0] addresses through its inversion-based framework.

Claimed Contributions

MaskInversion method for localized embeddings via explainability map optimization

The authors introduce MaskInversion, a test-time optimization method that learns localized embedding tokens for specific image regions by iteratively refining an embedding to match its explainability map to a query mask, while keeping the foundation model frozen.

10 retrieved papers
Gradient decomposition strategy for efficient explainability map computation

The authors propose a gradient decomposition technique that eliminates the need to compute second-order derivatives at each iteration by decomposing the gradient computation, thereby improving computational efficiency especially when processing multiple masks.

10 retrieved papers
Regularization loss for balancing global and local representations

The authors introduce an auxiliary regularization loss that encourages the localized embedding token to remain close to the original global image embedding, enabling control over the trade-off between regional specificity and global context.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MaskInversion method for localized embeddings via explainability map optimization

The authors introduce MaskInversion, a test-time optimization method that learns localized embedding tokens for specific image regions by iteratively refining an embedding to match its explainability map to a query mask, while keeping the foundation model frozen.
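The described loop can be sketched in a few lines of numpy. This is a toy illustration under stated assumptions, not the paper's implementation: `patch_feats`, `query_mask`, the sigmoid-similarity "explainability map", and all hyperparameters are hypothetical stand-ins for the frozen encoder's features and the paper's gradient-based relevance map. Only the structure matters: the model stays frozen, and gradient descent updates a single embedding token until its map matches the mask.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: patch features from a frozen encoder and a binary query mask.
num_patches, dim = 16, 8
patch_feats = rng.normal(size=(num_patches, dim))
query_mask = (patch_feats[:, 0] > 0).astype(float)  # a linearly recoverable toy region

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def explainability_map(token):
    # Toy relevance map: per-patch similarity to the embedding token.
    return sigmoid(patch_feats @ token)

def loss_and_grad(token):
    m = explainability_map(token)
    diff = m - query_mask
    loss = np.mean(diff ** 2)
    # Analytic gradient of the loss w.r.t. the token for this toy map.
    grad = (2.0 / num_patches) * ((diff * m * (1.0 - m)) @ patch_feats)
    return loss, grad

# MaskInversion-style loop: only the token is updated; the encoder is frozen.
token = np.zeros(dim)
losses = []
for _ in range(500):
    loss, grad = loss_and_grad(token)
    losses.append(loss)
    token -= 1.0 * grad
```

In the actual method the map is derived from gradient-based explainability of the frozen CLIP model rather than a plain dot product, but the optimization pattern, freezing the model and descending only on the token, is the same.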

Contribution

Gradient decomposition strategy for efficient explainability map computation

The authors propose a gradient decomposition technique that eliminates the need to compute second-order derivatives at each iteration by decomposing the gradient computation, thereby improving computational efficiency especially when processing multiple masks.
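The efficiency argument can be illustrated with a toy example. Assume (hypothetically) that the explainability map is linear in the token, i.e., a token-weighted combination of per-patch gradient maps `grad_maps` obtained once from the frozen model. Then the token-independent factors can be folded into a single precomputed matrix, so each optimization step is a matrix-vector product rather than a second pass through the model. The paper's decomposition operates on the actual relevance computation; the names and shapes below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

num_patches, dim = 16, 8
feats = rng.normal(size=(num_patches, dim))               # frozen patch features
grad_maps = rng.normal(size=(num_patches, num_patches))   # per-patch maps, computed once

def map_naive(token):
    # Recompute the weighted combination from scratch on every call
    # (stands in for re-running backprop through the frozen model).
    weights = feats @ token            # (num_patches,)
    return weights @ grad_maps         # (num_patches,)

# Decomposition: fold the token-independent factors into one matrix, once...
A = feats.T @ grad_maps                # (dim, num_patches)

def map_decomposed(token):
    # ...so each iteration reduces to a single matrix-vector product,
    # avoiding repeated higher-order derivative computations.
    return token @ A

t = rng.normal(size=dim)
m_naive = map_naive(t)
m_fast = map_decomposed(t)
```

The two routes are algebraically identical here because `(feats @ t) @ grad_maps == t @ (feats.T @ grad_maps)`; the savings come from hoisting the token-independent product out of the inner loop.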

Contribution

Regularization loss for balancing global and local representations

The authors introduce an auxiliary regularization loss that encourages the localized embedding token to remain close to the original global image embedding, enabling control over the trade-off between regional specificity and global context.
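A minimal sketch of this trade-off, assuming a quadratic stand-in for the mask-alignment term and an L2 penalty toward the global embedding (the paper's exact loss form is not reproduced here; `region_target`, `global_emb`, and the weight `lam` are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

dim = 8
global_emb = rng.normal(size=dim)     # frozen global image embedding
region_target = rng.normal(size=dim)  # direction the mask loss alone would pull toward

def optimize(lam, lr=0.05, steps=500):
    # Minimize: ||token - region_target||^2 + lam * ||token - global_emb||^2
    token = np.zeros(dim)
    for _ in range(steps):
        grad = 2.0 * (token - region_target) + 2.0 * lam * (token - global_emb)
        token -= lr * grad
    return token

t_low = optimize(lam=0.0)    # pure region fit, no regularization
t_high = optimize(lam=10.0)  # strongly tied to the global embedding

d_low = np.linalg.norm(t_low - global_emb)
d_high = np.linalg.norm(t_high - global_emb)
```

For this quadratic toy the minimizer is `(region_target + lam * global_emb) / (1 + lam)`, which makes the knob explicit: `lam = 0` recovers the pure region fit, while large `lam` pins the token near the global embedding, mirroring the specificity-versus-context control described above.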