MR²-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Multimodal Retrieval, Reasoning-intensive Retrieval, Multimodal Embedding, Benchmark
Abstract:

Multimodal retrieval is becoming a crucial component of modern AI applications, yet its evaluation lags behind the demands of more realistic and challenging scenarios. Existing benchmarks primarily probe surface-level semantic correspondence (e.g., object–text matching) while failing to assess the deeper reasoning required to capture complex relationships between visual and textual information. To address this gap, we introduce MR²-Bench, a reasoning-intensive benchmark for multimodal retrieval. MR²-Bench offers three key features: 1) all tasks are reasoning-driven, going beyond shallow matching to effectively assess models' capacity for logical, spatial, and causal inference; 2) it features diverse multimodal data, such as natural images, diagrams, and visual puzzles, enabling comprehensive evaluation across content types; 3) it supports complex queries and documents containing multiple images and covers diverse retrieval scenarios, more accurately reflecting real-world applications. Our benchmark contains 1,309 curated queries, derived either from manual collection and annotation or from selective consolidation of public datasets. Despite achieving strong results on existing benchmarks, current state-of-the-art models still struggle on MR²-Bench: for example, the leading Seed1.6-Embedding model attains a Recall@1 of 77.78 on MMEB, but only 9.91 on MR²-Bench. This substantial performance gap highlights both the increased challenge posed by our benchmark and the pressing need for further advances in reasoning-intensive multimodal retrieval.
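Recall@1 measures the fraction of queries for which the top-ranked retrieved document is the relevant one. For reference on how the scores quoted above are typically computed in embedding-based retrieval, a minimal sketch follows. It assumes cosine-similarity ranking and a single relevant document per query; it is an illustration only, not the benchmark's official evaluation harness.

import numpy as np

def recall_at_k(query_embs: np.ndarray,
                doc_embs: np.ndarray,
                relevant_doc_ids: list[int],
                k: int = 1) -> float:
    """Fraction of queries whose relevant document appears in the
    top-k results ranked by cosine similarity."""
    # Normalize embeddings so the dot product equals cosine similarity.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = q @ d.T                        # (num_queries, num_docs)
    topk = np.argsort(-scores, axis=1)[:, :k]  # top-k doc indices per query
    hits = [rel in row for rel, row in zip(relevant_doc_ids, topk)]
    return float(np.mean(hits))

# Toy example: 3 queries, 5 candidate documents, 4-dim embeddings.
rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 4))
docs = rng.normal(size=(5, 4))
print(recall_at_k(queries, docs, relevant_doc_ids=[0, 2, 4], k=1))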

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MR²-Bench, a benchmark designed to evaluate reasoning-intensive multimodal retrieval rather than surface-level semantic matching. It resides in the 'Reasoning-Intensive Retrieval Benchmarks' leaf of the taxonomy, which contains only two papers total (including this one). This leaf sits within the broader 'Reasoning Evaluation and Benchmarking' branch, indicating a relatively sparse research direction focused specifically on retrieval tasks that demand complex inference. The sibling paper in this leaf addresses similar evaluation challenges, suggesting that reasoning-driven retrieval benchmarking is an emerging but not yet crowded area.

The taxonomy reveals that MR²-Bench occupies a distinct niche between several related directions. Neighboring leaves include 'General Multimodal Reasoning Evaluation' (six papers assessing diverse reasoning abilities) and 'Training Data Construction' (three papers on instruction-tuning datasets). The 'Multimodal Retrieval Models and Embeddings' branch (six papers) focuses on representation learning rather than evaluation. The scope notes clarify that MR²-Bench excludes general reasoning benchmarks and training datasets, instead targeting retrieval-specific reasoning assessment. This positioning suggests the work bridges evaluation methodology with retrieval system design, addressing a gap between pure reasoning tests and embedding-based retrieval methods.

Among the thirty candidates examined, the contribution-level analysis shows varied novelty signals. For the core benchmark contribution (Contribution A), ten candidates were examined with zero refutations, suggesting limited direct prior work on reasoning-intensive retrieval evaluation at this scale. The diverse data domains contribution (Contribution B) likewise drew no refutations among its ten candidates. However, for the complex query support contribution (Contribution C), one refutable candidate was identified among the ten examined, indicating some overlap with existing work on multi-image document retrieval. These statistics reflect a focused search scope rather than exhaustive coverage, with most contributions appearing relatively novel within the examined literature.

Based on the limited search scope of thirty semantically similar papers, the work appears to address an underexplored evaluation gap. The taxonomy structure confirms that reasoning-intensive retrieval benchmarking is a sparse area with only one sibling paper. However, the analysis does not cover the full breadth of multimodal evaluation literature, and the single refutation for complex query support suggests potential overlap with document-centric retrieval systems. The assessment is constrained by the top-K semantic search methodology and may not capture all relevant prior benchmarks.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
30 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: reasoning-intensive multimodal retrieval. This field addresses the challenge of retrieving and reasoning over information that spans text, images, video, and other modalities, requiring models to perform complex inference rather than simple matching. The taxonomy reveals five main branches that capture complementary aspects of this landscape. Multimodal Chain-of-Thought Reasoning (e.g., Multimodal Chain-of-Thought[3], MM-REACT[7]) focuses on eliciting step-by-step reasoning traces that integrate visual and textual cues. Retrieval-Augmented Multimodal Generation explores how external knowledge sources can enhance generation quality, as seen in works like MuRAG[15] and Retrieval-Augmented Multimodal[37]. Multimodal Retrieval Models and Embeddings develops representation-learning techniques for cross-modal matching, including efforts like MM-Embed[11] and Composed Image Retrieval[20]. Reasoning Evaluation and Benchmarking provides datasets and metrics to measure reasoning capabilities, while Advanced Reasoning Paradigms investigates novel inference strategies such as deep thinking and progressive retrieval. Recent work highlights tensions between end-to-end reasoning and modular retrieval-augmented approaches, with many studies exploring how to balance retrieval precision and reasoning depth.

MR²-Bench[0] sits squarely within the Reasoning Evaluation and Benchmarking branch, specifically targeting reasoning-intensive retrieval scenarios. Unlike broader multimodal reasoning surveys (Multimodal Reasoning Survey[5], Multimodal Chain-of-Thought Survey[1]) that catalog diverse reasoning paradigms, MR²-Bench[0] emphasizes rigorous evaluation of retrieval under reasoning demands, sharing thematic overlap with MMIR Benchmark[14] and EMMA Benchmark[16]. Its focus on reasoning-intensive retrieval distinguishes it from purely generation-oriented benchmarks, positioning it as a critical resource for assessing whether models can retrieve relevant multimodal evidence when complex inference is required, rather than relying solely on surface-level similarity.

Claimed Contributions

Contribution A: MR²-Bench, a reasoning-intensive multimodal retrieval benchmark

The authors present the first benchmark specifically designed to evaluate multimodal retrieval systems on reasoning-intensive tasks rather than shallow semantic matching. It contains 1,309 curated queries across 12 sub-tasks organized into three meta-tasks, requiring logical, spatial, and causal inference over diverse multimodal data including natural images, diagrams, and visual puzzles.

10 retrieved papers
Contribution B: Diverse multimodal data domains beyond natural images

The benchmark incorporates a broad range of image types including mathematical visual proofs, visual puzzles, economic charts, and scientific diagrams. These data types have widespread applications and inherently require visual reasoning capabilities that previous multimodal retrieval tasks have largely overlooked.

10 retrieved papers
Contribution C: Support for complex free-form queries and documents with multiple images

Unlike previous multimodal benchmarks, where queries or documents typically contain at most a single image, both queries and documents in MR²-Bench may include multiple images in arbitrary interleaved text-image layouts. This design more closely reflects real-world document structures and retrieval scenarios; a schematic sketch of such an interleaved item follows this entry.

10 retrieved papers
Can Refute (1 refutable paper identified among the 10 retrieved)
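To make the interleaved structure concrete, the following is a minimal sketch of how such multi-image queries and documents might be represented. The schema and field names are hypothetical illustrations, not MR²-Bench's actual data format.

from dataclasses import dataclass, field

@dataclass
class Segment:
    kind: str             # "text" or "image"
    text: str = ""        # populated when kind == "text"
    image_path: str = ""  # populated when kind == "image"

@dataclass
class InterleavedItem:
    """A query or document: an ordered list of text/image segments."""
    item_id: str
    segments: list[Segment] = field(default_factory=list)

    def num_images(self) -> int:
        return sum(1 for s in self.segments if s.kind == "image")

# A query that interleaves two images with text, as MR²-Bench permits.
query = InterleavedItem(
    item_id="q-001",
    segments=[
        Segment(kind="text", text="Which diagram completes the pattern?"),
        Segment(kind="image", image_path="puzzle_grid.png"),
        Segment(kind="image", image_path="candidate_options.png"),
    ],
)
print(query.num_images())  # 2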

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A: MR²-Bench, a reasoning-intensive multimodal retrieval benchmark

Contribution B: Diverse multimodal data domains beyond natural images

Contribution C: Support for complex free-form queries and documents with multiple images