MRMR: A Realistic and Expert-Level Multidisciplinary Benchmark for Reasoning-Intensive Multimodal Retrieval

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Multimodal Retrieval Benchmark, Reasoning, Multimodal LLMs
Abstract:

We introduce MRMR, the first expert-level multidisciplinary multimodal retrieval benchmark requiring intensive reasoning. MRMR contains 1,435 queries spanning 23 domains, with positive documents carefully verified by human experts. Compared to prior benchmarks, MRMR introduces three key advancements. First, it challenges retrieval systems across diverse areas of expertise, enabling fine-grained model comparison across domains. Second, queries are reasoning-intensive, with images requiring deeper interpretation, such as diagnosing microscopy slides. We further introduce Contradiction Retrieval, a novel task requiring models to identify conflicting concepts. Finally, queries and documents are constructed as image–text interleaved sequences. Unlike earlier benchmarks restricted to single images or unimodal documents, MRMR offers a realistic setting with multi-image queries and mixed-modality corpus documents. We conduct an extensive evaluation of 4 categories of multimodal retrieval systems and 14 frontier models on MRMR. The text embedding model Qwen3-Embedding with LLM-generated image captions achieves the highest performance, highlighting substantial room for improving multimodal retrieval models. Although the latest multimodal models such as Ops-MM-Embedding perform competitively on expert-domain queries, they fall short on reasoning-intensive tasks. We believe that MRMR paves the way for advancing multimodal retrieval in more realistic and challenging scenarios.
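Since the best-performing system above pairs a text embedder with LLM-generated image captions, a minimal sketch of that caption-then-embed pipeline may help make the setup concrete. This is an illustration under stated assumptions, not the paper's implementation: `caption()` is a hypothetical placeholder for a multimodal LLM captioner, the `Qwen/Qwen3-Embedding-0.6B` checkpoint name is assumed, and the documents are toy data.

```python
# Caption-then-embed retrieval sketch (illustrative, not the paper's code).
# Assumptions: sentence-transformers is installed; the assumed checkpoint
# "Qwen/Qwen3-Embedding-0.6B" is available; caption() is a hypothetical
# stand-in for an LLM/VLM image captioner.
from sentence_transformers import SentenceTransformer

def caption(image_path: str) -> str:
    # Hypothetical captioner: in practice, call a multimodal LLM here.
    return "micrograph of tissue with dense clusters of atypical cells"

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# Flatten an image-text interleaved query into plain text via captions.
query_text = " ".join(["Which diagnosis fits this slide?", caption("slide.png")])

corpus = [
    "Report: benign fibroadenoma with uniform, regular nuclei.",
    "Report: invasive ductal carcinoma with atypical mitotic figures.",
]

q = model.encode([query_text], normalize_embeddings=True)
d = model.encode(corpus, normalize_embeddings=True)
scores = (q @ d.T)[0]  # cosine similarity, since embeddings are unit-normalized
for s, doc in sorted(zip(scores.tolist(), corpus), reverse=True):
    print(f"{s:.3f}  {doc}")
```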

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: reasoning-intensive multimodal retrieval. This emerging field sits at the intersection of multimodal understanding, complex reasoning, and information retrieval, requiring systems to not only match queries with relevant documents but also perform sophisticated inference over visual and textual content. The taxonomy reveals six major branches that collectively map the landscape: Multimodal Chain-of-Thought Reasoning Frameworks explore how to elicit step-by-step reasoning across modalities (e.g., Multimodal Chain-of-Thought Survey[1], Progressive Multimodal Reasoning[3]); Retrieval-Augmented Generation Integration examines how external knowledge sources enhance multimodal generation (Multimodal RAG Survey[2], HM-RAG[7]); Multimodal Retrieval Systems and Embeddings focus on representation learning and similarity computation (Similarity Computation Reasoning[4], Reasoning Guided Embeddings[42]); Reasoning Mechanisms and Interpretability investigate the internal processes that enable inference (Deep Thinking[17], Visualization-of-Thought[29]); Evaluation Benchmarks and Datasets provide standardized testbeds for measuring progress; and Specialized Applications and Paradigms address domain-specific challenges in areas like medical imaging and visual question answering.

Several active research directions reveal key trade-offs and open questions. One line emphasizes tighter integration of retrieval with reasoning processes, where systems proactively decide when and what to retrieve during multi-step inference (Proactive Reasoning-with-Retrieval[20], Reason-before-Retrieve[32]), contrasting with pipeline approaches that separate retrieval and reasoning stages. Another thread explores how to design benchmarks that genuinely test reasoning depth rather than pattern matching (MR-Bench[44], MR2-Bench[47]).

MRMR[0] contributes to this evaluation-focused branch by proposing a reasoning-intensive retrieval benchmark, positioning itself alongside MR-Bench[44] and MR2-Bench[47] as part of efforts to establish rigorous testbeds. While MR-Bench[44] and MR2-Bench[47] emphasize multi-hop reasoning and compositional understanding, MRMR[0] appears to stress the retrieval dimension more explicitly, aiming to capture scenarios where finding the right information requires substantial inferential effort rather than surface-level matching.

Claimed Contributions

MRMR benchmark for expert-level multidisciplinary multimodal retrieval

The authors present MRMR, a new benchmark containing 1,435 expert-annotated queries across 23 domains. It evaluates retrieval systems on reasoning-intensive tasks using image-text interleaved sequences, addressing limitations of prior benchmarks that focus on general-domain knowledge and single-image queries.

10 retrieved papers
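For context on how such a benchmark is typically scored: retrieval benchmarks like this one are usually evaluated with ranking metrics such as NDCG@k. The report does not state MRMR's exact metric or cutoff, so the sketch below shows a standard binary-relevance NDCG@10 as an assumption, with expert-verified positives as the relevant set.

```python
# Binary-relevance NDCG@k sketch (assumed metric; the paper's exact
# evaluation protocol is not given in this report).
import math

def ndcg_at_k(ranked_ids: list[str], relevant: set[str], k: int = 10) -> float:
    """NDCG@k with gain 1 for each expert-verified positive document."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc_id in enumerate(ranked_ids[:k]) if doc_id in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

# Example: the single positive sits at rank 2 of the returned list.
print(ndcg_at_k(["d7", "d3", "d9"], {"d3"}))  # ~0.631
```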
Contradiction Retrieval task for multimodal setting

The authors introduce the Contradiction Retrieval task to the multimodal domain for the first time; it requires models to perform logical reasoning to identify documents that conflict with or contradict the user query, going beyond semantic relevance matching.

10 retrieved papers
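Contradiction Retrieval asks for documents that conflict with the query, which pure similarity ranking cannot capture: a paraphrase of the query is maximally similar yet entails rather than contradicts it. The sketch below illustrates one plausible baseline (not the paper's method): re-ranking candidates by the contradiction probability of an off-the-shelf NLI cross-encoder. The `roberta-large-mnli` checkpoint and the toy sentences are assumptions.

```python
# Contradiction-aware re-ranking sketch (illustrative baseline, NOT the
# paper's method). Assumption: the roberta-large-mnli checkpoint, whose
# labels are CONTRADICTION / NEUTRAL / ENTAILMENT.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

query = "This micrograph shows benign tissue with uniform nuclei."
candidates = [
    "The slide is consistent with benign tissue and regular nuclei.",  # entails
    "The slide shows malignant tissue with highly irregular nuclei.",  # contradicts
]

for doc in candidates:
    # Score the (query, document) pair over all NLI labels.
    out = nli({"text": query, "text_pair": doc}, top_k=None)
    scores = {o["label"]: o["score"] for o in out}
    print(f"{scores.get('CONTRADICTION', 0.0):.3f}  {doc}")

# Ranking by CONTRADICTION probability surfaces conflicting documents,
# whereas cosine similarity would rank the entailing paraphrase first.
```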
Image-text interleaved query and document representation

The benchmark represents both queries and documents as interleaved sequences of text and images, enabling more realistic retrieval scenarios with multi-image queries and mixed-modality corpus documents, unlike earlier benchmarks restricted to single images or unimodal documents.

10 retrieved papers
Verdict: Can Refute
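The third contribution concerns the data format itself. As a concrete illustration, the sketch below models a query or corpus document as an ordered list of text and image segments and shows how it can be flattened into caption text for a text-only retriever. The field names and the `captioner` callable are illustrative assumptions, not the benchmark's actual schema.

```python
# Assumed schema sketch for image-text interleaved queries/documents.
# Field names are illustrative, not the benchmark's real format.
from dataclasses import dataclass
from typing import Callable, Literal, Union

@dataclass
class TextSegment:
    kind: Literal["text"]
    content: str   # raw text span

@dataclass
class ImageSegment:
    kind: Literal["image"]
    path: str      # local path or URL to the image

Segment = Union[TextSegment, ImageSegment]

@dataclass
class InterleavedItem:
    """A query or corpus document: an ordered mix of text and images."""
    segments: list[Segment]

    def to_caption_text(self, captioner: Callable[[str], str]) -> str:
        """Flatten to text by replacing each image with a caption,
        preserving the original interleaving order."""
        parts = [seg.content if seg.kind == "text" else captioner(seg.path)
                 for seg in self.segments]
        return " ".join(parts)

# Example: a multi-image query mixing prose and two slide images.
query = InterleavedItem(segments=[
    TextSegment("text", "Compare these two biopsies:"),
    ImageSegment("image", "slide_a.png"),
    TextSegment("text", "versus"),
    ImageSegment("image", "slide_b.png"),
])
print(query.to_caption_text(lambda p: f"<caption of {p}>"))
```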

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: MRMR benchmark for expert-level multidisciplinary multimodal retrieval

Contribution 2: Contradiction Retrieval task for multimodal setting

Contribution 3: Image-text interleaved query and document representation
