MaskInversion: Localized Embeddings via Optimization of Explainability Maps

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: vision encoder, localized embedding, CLIP
Abstract:

Vision-language foundation models such as CLIP have achieved tremendous results in global vision-language alignment, but still show limitations in creating representations for specific image regions. To address this problem, we propose MaskInversion, a method that leverages the feature representations of pre-trained foundation models, such as CLIP, to generate a context-aware embedding for a query image region specified by a mask at test time. MaskInversion starts by initializing an embedding token and comparing its explainability map, derived from the pre-trained model, to the query mask. The embedding token is then iteratively refined to approximate the query region by minimizing the discrepancy between its explainability map and the query mask. During this process, only the embedding vector is updated, while the underlying foundation model is kept frozen, which allows MaskInversion to be used with any pre-trained model. As deriving the explainability map involves computing its gradient, which can be expensive, we propose a gradient decomposition strategy that simplifies this computation. The learned region representation can be used for a broad range of tasks, including open-vocabulary class retrieval, referring expression comprehension, as well as localized captioning and image generation. We evaluate the proposed method on all these tasks on several datasets, including PascalVOC, MSCOCO, RefCOCO, and OpenImagesV7, and demonstrate its capabilities compared to other state-of-the-art approaches.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes MaskInversion, a test-time optimization method that refines embedding tokens to align with query image regions by minimizing discrepancies between explainability maps and input masks. Within the taxonomy, it resides in the 'Embedding Optimization for Localized Representations' leaf under 'Region-Based Inference and Prompting'. This leaf contains only two papers total, indicating a relatively sparse research direction compared to more crowded areas like 'Region-Text Contrastive Learning' (five papers) or 'Visual Prompting and Attention Guidance' (five papers). The work thus occupies a niche focused on iterative embedding refinement rather than fixed prompting or pre-training strategies.

The taxonomy reveals that neighboring leaves emphasize different inference-time strategies: 'Visual Prompting and Attention Guidance' uses markers or bounding boxes to guide attention without embedding optimization, while 'Region-Conditioned Generation and Grounding' focuses on generative outputs conditioned on regions. Broader branches like 'Region-Aware Vision-Language Pre-training' address foundational training with region supervision, which MaskInversion explicitly avoids by keeping models frozen. The scope notes clarify that this leaf excludes fixed visual prompts and pre-training modifications, positioning MaskInversion as a post-hoc optimization approach distinct from both training-time alignment and static prompting methods.

Among thirty candidates examined across three contributions, none were flagged as clearly refuting the proposed methods. The core MaskInversion contribution examined ten candidates with zero refutable overlaps, as did the gradient decomposition strategy and regularization loss components. This suggests that within the limited search scope—primarily top-K semantic matches and citation expansion—no prior work directly anticipates the combination of explainability-map-driven inversion with gradient decomposition for region embedding. However, the small candidate pool and sparse leaf population mean the analysis captures a snapshot rather than exhaustive coverage of potential prior art.

Given the limited search scope and the sparse taxonomy leaf, the work appears to introduce a distinct optimization-centric approach to region embeddings. The absence of refutable candidates among thirty examined papers, combined with only one sibling paper in the taxonomy, suggests the method occupies a relatively unexplored niche. However, the analysis does not cover broader embedding optimization literature outside vision-language models or alternative explainability-based techniques, leaving open questions about connections to related optimization paradigms in other domains.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Generating localized embeddings for specific image regions from vision-language models. The field has organized itself into several major branches that reflect different emphases in how region-level representations are learned and applied. Region-Aware Vision-Language Pre-training and Alignment focuses on foundational training strategies that align visual regions with language at scale, often through contrastive or grounding objectives (e.g., GLIPv2[6], RegionCLIP[20]). Spatial and 3D-Aware Region Representation extends these ideas to capture geometric and depth information, enabling richer spatial reasoning (e.g., SpatialRGPT[1], Spatial 3D LLM[5]). Region-Based Inference and Prompting explores how to optimize or manipulate region embeddings at inference time, while Multi-Modal Task Integration with Region Features and Specialized Region-Based Applications address downstream uses ranging from interactive understanding to domain-specific tasks like medical imaging (e.g., RegionMed CLIP[31]). Finally, Analysis and Understanding of Region Representations investigates what these embeddings capture and how they can be improved.

A particularly active line of work centers on embedding optimization for localized representations, where methods seek to refine region features beyond what pre-training alone provides. MaskInversion[0] sits squarely in this space, focusing on inverting or optimizing embeddings to better capture fine-grained region semantics. This contrasts with approaches like LARE[10], which may emphasize different optimization strategies or prompting mechanisms to achieve localization. Meanwhile, works such as Groma[9] and RegionGPT[8] illustrate how region features can be integrated into large language models for grounded reasoning, highlighting a trade-off between optimization-centric methods and those that rely on architectural innovations or richer pre-training. The central question across these directions is how to balance computational efficiency, generalization across diverse region types, and the fidelity of localized embeddings, challenges that MaskInversion[0] addresses through its inversion-based framework.

Claimed Contributions

MaskInversion method for localized embeddings via explainability map optimization

The authors introduce MaskInversion, a test-time optimization method that learns localized embedding tokens for specific image regions by iteratively refining an embedding to match its explainability map to a query mask, while keeping the foundation model frozen.

10 retrieved papers
Gradient decomposition strategy for efficient explainability map computation

The authors propose a gradient decomposition technique that eliminates the need to compute second-order derivatives at each iteration by decomposing the gradient computation, thereby improving computational efficiency especially when processing multiple masks.

10 retrieved papers
Regularization loss for balancing global and local representations

The authors introduce an auxiliary regularization loss that encourages the localized embedding token to remain close to the original global image embedding, enabling control over the trade-off between regional specificity and global context.

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MaskInversion method for localized embeddings via explainability map optimization

The authors introduce MaskInversion, a test-time optimization method that learns localized embedding tokens for specific image regions by iteratively refining an embedding to match its explainability map to a query mask, while keeping the foundation model frozen.
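The described loop can be sketched in a few lines of numpy. This is a toy illustration under stated assumptions, not the paper's implementation: `patch_feats`, `query_mask`, the sigmoid-similarity "explainability map", and all hyperparameters are hypothetical stand-ins for the frozen encoder's features and the paper's gradient-based relevance map. Only the structure matters: the model stays frozen, and gradient descent updates a single embedding token until its map matches the mask.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: patch features from a frozen encoder and a binary query mask.
num_patches, dim = 16, 8
patch_feats = rng.normal(size=(num_patches, dim))
query_mask = (patch_feats[:, 0] > 0).astype(float)  # a linearly recoverable toy region

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def explainability_map(token):
    # Toy relevance map: per-patch similarity to the embedding token.
    return sigmoid(patch_feats @ token)

def loss_and_grad(token):
    m = explainability_map(token)
    diff = m - query_mask
    loss = np.mean(diff ** 2)
    # Analytic gradient of the loss w.r.t. the token for this toy map.
    grad = (2.0 / num_patches) * ((diff * m * (1.0 - m)) @ patch_feats)
    return loss, grad

# MaskInversion-style loop: only the token is updated; the encoder is frozen.
token = np.zeros(dim)
losses = []
for _ in range(500):
    loss, grad = loss_and_grad(token)
    losses.append(loss)
    token -= 1.0 * grad
```

In the actual method the map is derived from gradient-based explainability of the frozen CLIP model rather than a plain dot product, but the optimization pattern, freezing the model and descending only on the token, is the same.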

Contribution

Gradient decomposition strategy for efficient explainability map computation

The authors propose a gradient decomposition technique that eliminates the need to compute second-order derivatives at each iteration by decomposing the gradient computation, thereby improving computational efficiency especially when processing multiple masks.
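The efficiency argument can be illustrated with a toy example. Assume (hypothetically) that the explainability map is linear in the token, i.e., a token-weighted combination of per-patch gradient maps `grad_maps` obtained once from the frozen model. Then the token-independent factors can be folded into a single precomputed matrix, so each optimization step is a matrix-vector product rather than a second pass through the model. The paper's decomposition operates on the actual relevance computation; the names and shapes below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

num_patches, dim = 16, 8
feats = rng.normal(size=(num_patches, dim))               # frozen patch features
grad_maps = rng.normal(size=(num_patches, num_patches))   # per-patch maps, computed once

def map_naive(token):
    # Recompute the weighted combination from scratch on every call
    # (stands in for re-running backprop through the frozen model).
    weights = feats @ token            # (num_patches,)
    return weights @ grad_maps         # (num_patches,)

# Decomposition: fold the token-independent factors into one matrix, once...
A = feats.T @ grad_maps                # (dim, num_patches)

def map_decomposed(token):
    # ...so each iteration reduces to a single matrix-vector product,
    # avoiding repeated higher-order derivative computations.
    return token @ A

t = rng.normal(size=dim)
m_naive = map_naive(t)
m_fast = map_decomposed(t)
```

The two routes are algebraically identical here because `(feats @ t) @ grad_maps == t @ (feats.T @ grad_maps)`; the savings come from hoisting the token-independent product out of the inner loop.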

Contribution

Regularization loss for balancing global and local representations

The authors introduce an auxiliary regularization loss that encourages the localized embedding token to remain close to the original global image embedding, enabling control over the trade-off between regional specificity and global context.
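A minimal sketch of this trade-off, assuming a quadratic stand-in for the mask-alignment term and an L2 penalty toward the global embedding (the paper's exact loss form is not reproduced here; `region_target`, `global_emb`, and the weight `lam` are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

dim = 8
global_emb = rng.normal(size=dim)     # frozen global image embedding
region_target = rng.normal(size=dim)  # direction the mask loss alone would pull toward

def optimize(lam, lr=0.05, steps=500):
    # Minimize: ||token - region_target||^2 + lam * ||token - global_emb||^2
    token = np.zeros(dim)
    for _ in range(steps):
        grad = 2.0 * (token - region_target) + 2.0 * lam * (token - global_emb)
        token -= lr * grad
    return token

t_low = optimize(lam=0.0)    # pure region fit, no regularization
t_high = optimize(lam=10.0)  # strongly tied to the global embedding

d_low = np.linalg.norm(t_low - global_emb)
d_high = np.linalg.norm(t_high - global_emb)
```

For this quadratic toy the minimizer is `(region_target + lam * global_emb) / (1 + lam)`, which makes the knob explicit: `lam = 0` recovers the pure region fit, while large `lam` pins the token near the global embedding, mirroring the specificity-versus-context control described above.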