MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
Overview
Overall Novelty Assessment
The paper introduces MetaEmbed, a framework that uses learnable Meta Tokens to produce compact multi-vector embeddings for multimodal retrieval, combined with Matryoshka Multi-Vector Retrieval training to organize information by granularity. It resides in the Token-Level Interaction and Matching Mechanisms leaf, which contains five papers total, including the original work. This leaf sits within the broader Late Interaction Architecture Design and Optimization branch, indicating a moderately populated research direction focused on fine-grained matching strategies rather than single-vector compression or domain-specific applications.
The taxonomy reveals two sibling leaves within Late Interaction Architecture Design and Optimization: Embedding Efficiency and Scalability (two papers on sparsification and compression) and Unified Multimodal Embedding Models (three papers on joint text-image architectures). Neighboring branches address Domain-Specific Applications (e.g., visual question answering, e-commerce) and Multimodal Fusion Strategies (adaptive fusion, cross-modal adapters). MetaEmbed's focus on flexible token-level interaction distinguishes it from efficiency-centric methods such as PyLate and from unified embedding models that prioritize single-vector objectives, positioning it as an architectural innovation within the late-interaction paradigm.
Among the twenty-seven candidates examined via semantic search, none was flagged as clearly refuting any of the three contributions. For the MetaEmbed framework and Meta Tokens concept, ten candidates were examined with zero refutable overlaps; for the Matryoshka Multi-Vector Retrieval training method, another ten candidates yielded no refutations; and for the test-time scaling mechanism, seven candidates were examined, again with no refutations. Within the limited search scope (top-K semantic matches plus citation expansion), no prior work was found to substantially overlap with the proposed techniques, though the analysis does not claim exhaustive coverage of the entire field.
Given the moderate density of the Token-Level Interaction leaf and the absence of refutable prior work among the examined candidates, the contributions appear relatively novel within the scope analyzed. However, the search examined only twenty-seven papers, and the taxonomy contains twenty-eight total papers across ten leaves, so broader or deeper literature searches might uncover additional related work. The assessment reflects what is visible in the current taxonomy structure and candidate set, not a definitive claim about the entire multimodal retrieval landscape.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose MetaEmbed, a framework that appends a small number of learnable Meta Tokens to input sequences. Their last-layer contextualized representations serve as compact yet expressive multi-vector embeddings for retrieval, reducing the number of vectors needed while maintaining quality.
The authors introduce Matryoshka Multi-Vector Retrieval (MMR), a training approach that organizes embeddings into hierarchical nested groups. By performing contrastive learning across parallel nested groups, the model learns coarse-to-fine multi-vector embeddings that enable flexible retrieval at different granularities.
The authors enable a test-time scaling capability where users can dynamically adjust the number of Meta Embeddings used during retrieval. This allows flexible trade-offs between retrieval accuracy, index size, and latency without retraining the model.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering PDF
[2] PreFLMR: Scaling up fine-grained late-interaction multi-modal retrievers PDF
[3] CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval PDF
[4] Video-ColBERT: Contextualized Late Interaction for Text-to-Video Retrieval PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
MetaEmbed framework with Meta Tokens for multimodal retrieval
The authors propose MetaEmbed, a framework that appends a small number of learnable Meta Tokens to input sequences. Their last-layer contextualized representations serve as compact yet expressive multi-vector embeddings for retrieval, reducing the number of vectors needed while maintaining quality.
[46] HybridToken-VLM: Hybrid Token Compression for Vision-Language Models PDF
[47] Visual Semantic Contextualization Network for Multi-Query Image Retrieval PDF
[48] Efficient token-guided image-text retrieval with consistent multimodal contrastive training PDF
[49] Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning PDF
[50] VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs PDF
[51] Learning Compact Vision Tokens for Efficient Large Multimodal Models PDF
[52] Multi-vector attention models for deep re-ranking PDF
[53] ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling PDF
[54] ReMatch: Boosting Representation through Matching for Multimodal Retrieval PDF
[55] Representation Learning for Visual Tasks: A Study of Attention and Information Selection PDF
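The Meta Token mechanism claimed above can be made concrete with a minimal NumPy sketch. The encoder here is a stand-in (a fixed linear map plus tanh rather than a multimodal transformer), and all names and sizes (`encode`, `embed`, `NUM_META`) are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
SEQ_LEN, NUM_META, DIM = 12, 4, 8          # toy sizes, not the paper's
W = rng.standard_normal((DIM, DIM)) / np.sqrt(DIM)

def encode(x):
    """Stand-in for the backbone's last layer (MetaEmbed would use a
    full multimodal transformer encoder here)."""
    return np.tanh(x @ W)

# learnable Meta Tokens, trained jointly with the backbone
meta_tokens = rng.standard_normal((NUM_META, DIM))

def embed(input_tokens):
    """Append Meta Tokens to the input sequence; their last-layer
    outputs form the compact multi-vector embedding."""
    x = np.concatenate([input_tokens, meta_tokens], axis=0)
    h = encode(x)
    return h[-NUM_META:]

query_tokens = rng.standard_normal((SEQ_LEN, DIM))
multi_vec = embed(query_tokens)
print(multi_vec.shape)  # (4, 8): NUM_META vectors instead of one per token
```

The key point of the contribution survives even in this toy: the index stores `NUM_META` vectors per item rather than one vector per input token.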
Matryoshka Multi-Vector Retrieval (MMR) training method
The authors introduce MMR, a training approach that organizes embeddings into hierarchical nested groups. By performing contrastive learning across parallel nested groups, the model learns coarse-to-fine multi-vector embeddings that enable flexible retrieval at different granularities.
[36] HGCL: Hierarchical Graph Contrastive Learning for User-Item Recommendation PDF
[37] Use all the labels: A hierarchical multi-label contrastive learning framework PDF
[38] Enhanced hierarchical contrastive learning for recommendation PDF
[39] HierVision: Standardized and Reproducible Hierarchical Sources for Vision Datasets PDF
[40] Hierarchical contrastive learning with multiple augmentations for sequential recommendation PDF
[41] HiHPQ: Hierarchical hyperbolic product quantization for unsupervised image retrieval PDF
[42] HierarchicalContrast: A coarse-to-fine contrastive learning framework for cross-domain zero-shot slot filling PDF
[43] HiCLIP: Contrastive language-image pretraining with hierarchy-aware attention PDF
[44] Self-Supervised Learning of Dense Hierarchical Representations for Medical Image Segmentation PDF
[45] Retrieval-style in-context learning for few-shot hierarchical text classification PDF
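The nested-group objective claimed above can be sketched under stated assumptions: ColBERT-style MaxSim scoring and an InfoNCE loss averaged over hypothetical prefix sizes (1, 2, 4). The exact grouping and loss in MMR may differ; this only illustrates the coarse-to-fine idea:

```python
import numpy as np

def maxsim(Q, D):
    """Late-interaction score: each query vector takes its best match in D."""
    return (Q @ D.T).max(axis=1).sum()

def mmr_loss(Q_meta, D_meta, neg_metas, group_sizes=(1, 2, 4)):
    """Matryoshka-style sketch: one InfoNCE contrastive loss per nested
    prefix of the Meta Embeddings, averaged, so every prefix is trained
    to be a usable (coarser) embedding on its own."""
    losses = []
    for g in group_sizes:
        pos = maxsim(Q_meta[:g], D_meta[:g])
        negs = [maxsim(Q_meta[:g], N[:g]) for N in neg_metas]
        logits = np.array([pos] + negs)
        # InfoNCE with the positive at index 0
        losses.append(-logits[0] + np.log(np.exp(logits).sum()))
    return float(np.mean(losses))

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
pos_doc = Q + 0.05 * rng.standard_normal((4, 8))   # near-duplicate positive
neg_docs = [rng.standard_normal((4, 8)) for _ in range(3)]
loss = mmr_loss(Q, pos_doc, neg_docs)
```

Averaging across prefixes is what forces the information to be ordered by granularity: the first Meta Embedding must carry a coarse summary, while later ones add finer detail.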
Test-time scaling mechanism for multimodal retrieval
The authors enable a test-time scaling capability where users can dynamically adjust the number of Meta Embeddings used during retrieval. This allows flexible trade-offs between retrieval accuracy, index size, and latency without retraining the model.
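The trade-off this contribution describes can be illustrated with a small sketch: because the embeddings are trained coarse-to-fine, retrieval can score with only a prefix of the Meta Embeddings. All names, sizes, and the cosine-normalized scoring below are hypothetical choices for the toy:

```python
import numpy as np

rng = np.random.default_rng(1)
NUM_META, DIM = 8, 16                      # hypothetical sizes

def normalize(X):
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def maxsim(Q, D):
    # late-interaction score: sum over query vectors of best match in D
    return (Q @ D.T).max(axis=1).sum()

def retrieve(query_meta, doc_metas, budget):
    """Score with only the first `budget` Meta Embeddings on both sides.
    A smaller budget shrinks the index and cuts latency; coarse-to-fine
    training makes any prefix a valid embedding, so no retraining is needed."""
    scores = [maxsim(query_meta[:budget], d[:budget]) for d in doc_metas]
    return int(np.argmax(scores))

query = normalize(rng.standard_normal((NUM_META, DIM)))
docs = [query.copy()] + [normalize(rng.standard_normal((NUM_META, DIM)))
                         for _ in range(3)]
print(retrieve(query, docs, budget=NUM_META))  # 0: full-budget match
print(retrieve(query, docs, budget=1))         # 0: coarse pass, smaller index
```

In the toy, the exact-duplicate document wins at both budgets; in practice the budget would be chosen per deployment to balance accuracy against index size and latency.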