MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: multimodal retrieval, information retrieval
Abstract:

Universal multimodal embedding models have achieved great success in capturing semantic relevance between queries and candidates. However, current methods either condense queries and candidates into a single vector, potentially limiting the expressiveness for fine-grained information, or produce too many vectors that are prohibitively expensive for multi-vector retrieval. In this work, we introduce MetaEmbed, a new framework for multimodal retrieval that rethinks how multimodal embeddings are constructed and interacted with at scale. During training, a fixed number of learnable Meta Tokens are appended to the input sequence. At test-time, their last-layer contextualized representations serve as compact yet expressive multi-vector embeddings. Through the proposed Matryoshka Multi-Vector Retrieval training, MetaEmbed learns to organize information by granularity across multiple vectors. As a result, we enable test-time scaling in multimodal retrieval where users can balance retrieval quality against efficiency demands by selecting the number of tokens used for indexing and retrieval interactions. Extensive evaluations on the Massive Multimodal Embedding Benchmark (MMEB) and the Visual Document Retrieval Benchmark (ViDoRe) confirm that MetaEmbed achieves state-of-the-art retrieval performance while scaling robustly to models with 32B parameters.
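The budget selection described in the abstract rests on late-interaction scoring, where retrieval cost grows with the number of vectors per side. A minimal NumPy sketch, assuming ColBERT-style MaxSim scoring over L2-normalized vectors (the shapes and the exact operator are illustrative, not taken from the paper):

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, cand_vecs: np.ndarray) -> float:
    """ColBERT-style late interaction: each query vector picks its
    best-matching candidate vector; the scores are summed.

    query_vecs: (num_query_vectors, dim), L2-normalized
    cand_vecs:  (num_cand_vectors, dim), L2-normalized
    """
    sims = query_vecs @ cand_vecs.T           # (q, c) cosine similarities
    return float(sims.max(axis=1).sum())      # MaxSim, summed over the query side

rng = np.random.default_rng(0)

def normed(shape):
    v = rng.standard_normal(shape)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Full budget: 16 vectors per side; reduced budget: only the first 4.
q, d = normed((16, 64)), normed((16, 64))
full_score = maxsim_score(q, d)
cheap_score = maxsim_score(q[:4], d[:4])   # fewer vectors, cheaper interaction
```

Truncating both sides to a prefix of the vectors is what makes the quality/efficiency knob possible at test time: the scoring code does not change, only the slice.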

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MetaEmbed, a framework that uses learnable Meta Tokens to produce compact multi-vector embeddings for multimodal retrieval, combined with Matryoshka Multi-Vector Retrieval training to organize information by granularity. It resides in the Token-Level Interaction and Matching Mechanisms leaf, which contains five papers total, including the original work. This leaf sits within the broader Late Interaction Architecture Design and Optimization branch, indicating a moderately populated research direction focused on fine-grained matching strategies rather than single-vector compression or domain-specific applications.

The taxonomy reveals two sibling leaves within Late Interaction Architecture Design: Embedding Efficiency and Scalability (two papers on sparsification and compression) and Unified Multimodal Embedding Models (three papers on joint text-image architectures). Neighboring branches address Domain-Specific Applications (e.g., visual question answering, e-commerce) and Multimodal Fusion Strategies (adaptive fusion, cross-modal adapters). MetaEmbed's focus on flexible token-level interaction distinguishes it from efficiency-centric methods like PyLate and from unified embedding models that prioritize single-vector objectives, positioning it as an architectural innovation within the late-interaction paradigm.

Among twenty-seven candidates examined via semantic search, none were flagged as clearly refuting any of the three contributions. For the MetaEmbed framework and Meta Tokens concept, ten candidates were examined with zero refutable overlaps; for the Matryoshka Multi-Vector Retrieval training method, another ten candidates yielded no refutations; and for the test-time scaling mechanism, seven candidates likewise showed none. This suggests that within the limited search scope (top-K semantic matches plus citation expansion) no prior work was found to substantially overlap with the proposed techniques, though the analysis does not claim exhaustive coverage of the entire field.

Given the moderate density of the Token-Level Interaction leaf and the absence of refutable prior work among the examined candidates, the contributions appear relatively novel within the scope analyzed. However, the search examined only twenty-seven papers, and the taxonomy contains twenty-eight total papers across ten leaves, so broader or deeper literature searches might uncover additional related work. The assessment reflects what is visible in the current taxonomy structure and candidate set, not a definitive claim about the entire multimodal retrieval landscape.

Taxonomy

Core-task Taxonomy Papers: 28
Claimed Contributions: 3
Contribution Candidate Papers Compared: 27
Refutable Papers: 0

Research Landscape Overview

Core task: multimodal retrieval with flexible late-interaction embeddings. This field centers on architectures that defer the fusion of text, image, video, and other modalities until a late stage, enabling fine-grained token-level matching rather than collapsing representations into single vectors.

The taxonomy reflects four main branches. Late Interaction Architecture Design and Optimization focuses on the mechanics of token-level interaction and matching, exploring how to efficiently compute cross-modal similarities at a granular level (e.g., Fine-grained Late Interaction[1], Preflmr[2], CLaMR[3]). Domain-Specific Multimodal Retrieval Applications adapts these techniques to specialized contexts such as video search (Video-ColBERT[4]) or art retrieval (ArtSeek[9]). Multimodal Fusion Strategies and Cross-Modal Interaction examines broader fusion paradigms, including query-adaptive and bilateral approaches (Query-adaptive Late Fusion[15], Bilateral Adaptive Fusion[25]), while General Multimodal Interface and Interaction Frameworks addresses user-facing systems and retrieval-augmented generation pipelines (Multimodal RAG[11]).

Recent work has concentrated on balancing expressiveness with computational efficiency: some studies push toward richer token-level alignments that capture nuanced semantic correspondences, while others seek lightweight architectures suitable for large-scale deployment (PyLate[16], Jina Embeddings v4[21]). MetaEmbed[0] sits squarely within the Token-Level Interaction and Matching Mechanisms cluster, emphasizing flexible embedding strategies that adapt interaction granularity to query and document characteristics. Compared to neighbors like CLaMR[3], which also targets fine-grained matching, MetaEmbed[0] appears to prioritize parameterized flexibility in how tokens are aligned rather than a fixed interaction schema. This contrasts with Video-ColBERT[4], which specializes in temporal video retrieval, illustrating how the same late-interaction principle can be tailored to different modalities and application constraints. Open questions remain around optimal token budget allocation, cross-modal attention mechanisms, and generalization across diverse retrieval benchmarks.

Claimed Contributions

MetaEmbed framework with Meta Tokens for multimodal retrieval

The authors propose MetaEmbed, a framework that appends a small number of learnable Meta Tokens to input sequences. Their last-layer contextualized representations serve as compact yet expressive multi-vector embeddings for retrieval, reducing the number of vectors needed while maintaining quality.

10 retrieved papers
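The mechanics of this contribution, appending learnable tokens and keeping only their contextualized last-layer states, can be sketched with a tiny PyTorch module. The class name, the toy two-layer encoder, and all dimensions are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class MetaTokenPooler(nn.Module):
    """Appends a fixed number of learnable meta tokens to an input
    sequence; their final hidden states serve as the compact
    multi-vector embedding. Hypothetical sketch, not the paper's model."""

    def __init__(self, dim: int = 64, num_meta: int = 8):
        super().__init__()
        # learnable meta-token embeddings, shared across all inputs
        self.meta = nn.Parameter(torch.randn(num_meta, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.num_meta = num_meta

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, dim) -- already-embedded inputs
        batch = token_embeds.size(0)
        meta = self.meta.unsqueeze(0).expand(batch, -1, -1)
        seq = torch.cat([token_embeds, meta], dim=1)  # append meta tokens
        hidden = self.encoder(seq)
        # keep only the contextualized meta-token states as the embedding
        return hidden[:, -self.num_meta:, :]

pooler = MetaTokenPooler()
out = pooler(torch.randn(2, 10, 64))   # -> (2, 8, 64)
```

The key point is that the number of output vectors is fixed by `num_meta` rather than by the input length, which is what keeps the embedding compact regardless of how long the query or candidate sequence is.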
Matryoshka Multi-Vector Retrieval (MMR) training method

The authors introduce MMR, a training approach that organizes embeddings into hierarchical nested groups. By performing contrastive learning across parallel nested groups, the model learns coarse-to-fine multi-vector embeddings that enable flexible retrieval at different granularities.

10 retrieved papers
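One way to read "contrastive learning across parallel nested groups" is to apply an InfoNCE-style loss at several prefix budgets of the multi-vector embedding and sum the terms. The sketch below assumes MaxSim similarity and in-batch negatives; the budget schedule and loss form are assumptions for illustration, not the paper's exact recipe:

```python
import numpy as np

def maxsim(Q, C):
    # Q: (nq, dim), C: (nc, dim) -> scalar late-interaction score
    return (Q @ C.T).max(axis=1).sum()

def info_nce(queries, candidates, budget):
    """In-batch InfoNCE where queries[i] matches candidates[i],
    scored using only the first `budget` vectors of each embedding."""
    B = len(queries)
    scores = np.array([[maxsim(queries[i][:budget], candidates[j][:budget])
                        for j in range(B)] for i in range(B)])
    # log-softmax over candidates; the diagonal holds the positive pairs
    scores -= scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_probs.diagonal().mean()

def matryoshka_loss(queries, candidates, budgets=(1, 2, 4, 8)):
    # Sum the contrastive loss across nested prefix budgets, so early
    # vectors must carry coarse information and later ones refine it.
    return sum(info_nce(queries, candidates, b) for b in budgets)
```

Because every budget shares the same leading vectors, gradients from the small-budget terms push coarse, self-sufficient information into the prefix, which is what later makes truncation at test time viable.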
Test-time scaling mechanism for multimodal retrieval

The authors enable a test-time scaling capability where users can dynamically adjust the number of Meta Embeddings used during retrieval. This allows flexible trade-offs between retrieval accuracy, index size, and latency without retraining the model.

7 retrieved papers
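The index-size half of this trade-off is simple arithmetic: a multi-vector index stores one vector per retained Meta Embedding per candidate. The function and the example numbers below are hypothetical back-of-the-envelope figures, not measurements from the paper:

```python
def index_size_bytes(num_docs: int, num_vectors: int, dim: int,
                     bytes_per_value: int = 2) -> int:
    """Rough multi-vector index footprint (fp16 values by default).
    Illustrative arithmetic only; real indexes add metadata and may
    compress or quantize vectors."""
    return num_docs * num_vectors * dim * bytes_per_value

# e.g., 10M candidates with 64-dim vectors: cutting the budget from
# 16 vectors to 1 shrinks the index by 16x.
full = index_size_bytes(10_000_000, 16, 64)    # 20,480,000,000 bytes (~20.5 GB)
small = index_size_bytes(10_000_000, 1, 64)    # 1,280,000,000 bytes (~1.3 GB)
```

Since the same trained model serves every budget, the operator can pick the point on this curve that fits the deployment's memory and latency envelope without retraining.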

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
