MRMR: A Realistic and Expert-Level Multidisciplinary Benchmark for Reasoning-Intensive Multimodal Retrieval

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Multimodal Retrieval Benchmark, Reasoning, Multimodal LLMs
Abstract:

We introduce MRMR, the first expert-level multidisciplinary multimodal retrieval benchmark requiring intensive reasoning. MRMR contains 1,435 queries spanning 23 domains, with positive documents carefully verified by human experts. Compared to prior benchmarks, MRMR introduces three key advancements. First, it challenges retrieval systems across diverse areas of expertise, enabling fine-grained model comparison across domains. Second, queries are reasoning-intensive, with images requiring deeper interpretation, such as diagnosing microscopy slides. We further introduce Contradiction Retrieval, a novel task requiring models to identify conflicting concepts. Finally, queries and documents are constructed as image–text interleaved sequences. Unlike earlier benchmarks restricted to single images or unimodal documents, MRMR offers a realistic setting with multi-image queries and mixed-modality corpus documents. We conduct an extensive evaluation of 4 categories of multimodal retrieval systems and 14 frontier models on MRMR. The text embedding model Qwen3-Embedding with LLM-generated image captions achieves the highest performance, highlighting substantial room for improving multimodal retrieval models. Although the latest multimodal models such as Ops-MM-Embedding perform competitively on expert-domain queries, they fall short on reasoning-intensive tasks. We believe that MRMR paves the way for advancing multimodal retrieval in more realistic and challenging scenarios.
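Since the best-performing system above pairs a text embedder with LLM-generated image captions, a minimal sketch of that caption-then-embed pipeline may help make the setup concrete. This is an illustration under stated assumptions, not the paper's implementation: `caption()` is a hypothetical placeholder for a multimodal LLM captioner, the `Qwen/Qwen3-Embedding-0.6B` checkpoint name is assumed, and the documents are toy data.

```python
# Caption-then-embed retrieval sketch (illustrative, not the paper's code).
# Assumptions: sentence-transformers is installed; the assumed checkpoint
# "Qwen/Qwen3-Embedding-0.6B" is available; caption() is a hypothetical
# stand-in for an LLM/VLM image captioner.
from sentence_transformers import SentenceTransformer

def caption(image_path: str) -> str:
    # Hypothetical captioner: in practice, call a multimodal LLM here.
    return "micrograph of tissue with dense clusters of atypical cells"

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")

# Flatten an image-text interleaved query into plain text via captions.
query_text = " ".join(["Which diagnosis fits this slide?", caption("slide.png")])

corpus = [
    "Report: benign fibroadenoma with uniform, regular nuclei.",
    "Report: invasive ductal carcinoma with atypical mitotic figures.",
]

q = model.encode([query_text], normalize_embeddings=True)
d = model.encode(corpus, normalize_embeddings=True)
scores = (q @ d.T)[0]  # cosine similarity, since embeddings are unit-normalized
for s, doc in sorted(zip(scores.tolist(), corpus), reverse=True):
    print(f"{s:.3f}  {doc}")
```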

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 2

Research Landscape Overview

Core task: reasoning-intensive multimodal retrieval. This emerging field sits at the intersection of multimodal understanding, complex reasoning, and information retrieval, requiring systems to not only match queries with relevant documents but also perform sophisticated inference over visual and textual content. The taxonomy reveals six major branches that collectively map the landscape: Multimodal Chain-of-Thought Reasoning Frameworks explore how to elicit step-by-step reasoning across modalities (e.g., Multimodal Chain-of-Thought Survey[1], Progressive Multimodal Reasoning[3]); Retrieval-Augmented Generation Integration examines how external knowledge sources enhance multimodal generation (Multimodal RAG Survey[2], HM-RAG[7]); Multimodal Retrieval Systems and Embeddings focus on representation learning and similarity computation (Similarity Computation Reasoning[4], Reasoning Guided Embeddings[42]); Reasoning Mechanisms and Interpretability investigate the internal processes that enable inference (Deep Thinking[17], Visualization-of-Thought[29]); Evaluation Benchmarks and Datasets provide standardized testbeds for measuring progress; and Specialized Applications and Paradigms address domain-specific challenges in areas like medical imaging and visual question answering.

Several active research directions reveal key trade-offs and open questions. One line emphasizes tighter integration of retrieval with reasoning processes, where systems proactively decide when and what to retrieve during multi-step inference (Proactive Reasoning-with-Retrieval[20], Reason-before-Retrieve[32]), contrasting with pipeline approaches that separate retrieval and reasoning stages. Another thread explores how to design benchmarks that genuinely test reasoning depth rather than pattern matching (MR-Bench[44], MR2-Bench[47]).

MRMR[0] contributes to this evaluation-focused branch by proposing a reasoning-intensive retrieval benchmark, positioning itself alongside MR-Bench[44] and MR2-Bench[47] as part of efforts to establish rigorous testbeds. While MR-Bench[44] and MR2-Bench[47] emphasize multi-hop reasoning and compositional understanding, MRMR[0] appears to stress the retrieval dimension more explicitly, aiming to capture scenarios where finding the right information requires substantial inferential effort rather than surface-level matching.

Claimed Contributions

MRMR benchmark for expert-level multidisciplinary multimodal retrieval

The authors present MRMR, a new benchmark containing 1,435 expert-annotated queries across 23 domains. It evaluates retrieval systems on reasoning-intensive tasks using image-text interleaved sequences, addressing limitations of prior benchmarks that focus on general-domain knowledge and single-image queries.

10 retrieved papers
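For context on how such a benchmark is typically scored: retrieval benchmarks like this one are usually evaluated with ranking metrics such as NDCG@k. The report does not state MRMR's exact metric or cutoff, so the sketch below shows a standard binary-relevance NDCG@10 as an assumption, with expert-verified positives as the relevant set.

```python
# Binary-relevance NDCG@k sketch (assumed metric; the paper's exact
# evaluation protocol is not given in this report).
import math

def ndcg_at_k(ranked_ids: list[str], relevant: set[str], k: int = 10) -> float:
    """NDCG@k with gain 1 for each expert-verified positive document."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc_id in enumerate(ranked_ids[:k]) if doc_id in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0

# Example: the single positive sits at rank 2 of the returned list.
print(ndcg_at_k(["d7", "d3", "d9"], {"d3"}))  # ~0.631
```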
Contradiction Retrieval task for multimodal setting

The authors introduce the Contradiction Retrieval task to the multimodal domain for the first time; it requires models to perform logical reasoning to identify documents that conflict with or contradict the user query, going beyond semantic relevance matching.

10 retrieved papers
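Contradiction Retrieval asks for documents that conflict with the query, which pure similarity ranking cannot capture: a paraphrase of the query is maximally similar yet entails rather than contradicts it. The sketch below illustrates one plausible baseline (not the paper's method): re-ranking candidates by the contradiction probability of an off-the-shelf NLI cross-encoder. The `roberta-large-mnli` checkpoint and the toy sentences are assumptions.

```python
# Contradiction-aware re-ranking sketch (illustrative baseline, NOT the
# paper's method). Assumption: the roberta-large-mnli checkpoint, whose
# labels are CONTRADICTION / NEUTRAL / ENTAILMENT.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

query = "This micrograph shows benign tissue with uniform nuclei."
candidates = [
    "The slide is consistent with benign tissue and regular nuclei.",  # entails
    "The slide shows malignant tissue with highly irregular nuclei.",  # contradicts
]

for doc in candidates:
    # Score the (query, document) pair over all NLI labels.
    out = nli({"text": query, "text_pair": doc}, top_k=None)
    scores = {o["label"]: o["score"] for o in out}
    print(f"{scores.get('CONTRADICTION', 0.0):.3f}  {doc}")

# Ranking by CONTRADICTION probability surfaces conflicting documents,
# whereas cosine similarity would rank the entailing paraphrase first.
```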
Image-text interleaved query and document representation

The benchmark represents both queries and documents as interleaved sequences of text and images, enabling more realistic retrieval scenarios with multi-image queries and mixed-modality corpus documents, unlike earlier benchmarks restricted to single images or unimodal documents.

10 retrieved papers
Verdict: Can Refute
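The third contribution concerns the data format itself. As a concrete illustration, the sketch below models a query or corpus document as an ordered list of text and image segments and shows how it can be flattened into caption text for a text-only retriever. The field names and the `captioner` callable are illustrative assumptions, not the benchmark's actual schema.

```python
# Assumed schema sketch for image-text interleaved queries/documents.
# Field names are illustrative, not the benchmark's real format.
from dataclasses import dataclass
from typing import Callable, Literal, Union

@dataclass
class TextSegment:
    kind: Literal["text"]
    content: str   # raw text span

@dataclass
class ImageSegment:
    kind: Literal["image"]
    path: str      # local path or URL to the image

Segment = Union[TextSegment, ImageSegment]

@dataclass
class InterleavedItem:
    """A query or corpus document: an ordered mix of text and images."""
    segments: list[Segment]

    def to_caption_text(self, captioner: Callable[[str], str]) -> str:
        """Flatten to text by replacing each image with a caption,
        preserving the original interleaving order."""
        parts = [seg.content if seg.kind == "text" else captioner(seg.path)
                 for seg in self.segments]
        return " ".join(parts)

# Example: a multi-image query mixing prose and two slide images.
query = InterleavedItem(segments=[
    TextSegment("text", "Compare these two biopsies:"),
    ImageSegment("image", "slide_a.png"),
    TextSegment("text", "versus"),
    ImageSegment("image", "slide_b.png"),
])
print(query.to_caption_text(lambda p: f"<caption of {p}>"))
```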

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: MRMR benchmark for expert-level multidisciplinary multimodal retrieval

Contribution 2: Contradiction Retrieval task for multimodal setting

Contribution 3: Image-text interleaved query and document representation
