MR²-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Multimodal Retrieval, Reasoning-intensive Retrieval, Multimodal Embedding, Benchmark
Abstract:

Multimodal retrieval is becoming a crucial component of modern AI applications, yet its evaluation lags behind the demands of more realistic and challenging scenarios. Existing benchmarks primarily probe surface-level semantic correspondence (e.g., object–text matching) while failing to assess the deeper reasoning required to capture complex relationships between visual and textual information. To address this gap, we introduce MR²-Bench, a reasoning-intensive benchmark for multimodal retrieval. MR²-Bench offers three key features: 1) all tasks are reasoning-driven, going beyond shallow matching to effectively assess models' capacity for logical, spatial, and causal inference; 2) it features diverse multimodal data, such as natural images, diagrams, and visual puzzles, enabling comprehensive evaluation across content types; 3) it supports complex queries and documents containing multiple images and covers diverse retrieval scenarios, more accurately reflecting real-world applications. Our benchmark contains 1,309 curated queries, derived either from manual collection and annotation or from selective consolidation of public datasets. Despite achieving strong results on existing benchmarks, current state-of-the-art models still struggle on MR²-Bench: for example, the leading Seed1.6-Embedding model attains a Recall@1 of 77.78 on MMEB, but only 9.91 on MR²-Bench. This substantial performance gap highlights both the increased challenge posed by our benchmark and the pressing need for further advances in reasoning-intensive multimodal retrieval.
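Recall@1 measures the fraction of queries for which the top-ranked retrieved document is the relevant one. For reference on how the scores quoted above are typically computed in embedding-based retrieval, a minimal sketch follows. It assumes cosine-similarity ranking and a single relevant document per query; it is an illustration only, not the benchmark's official evaluation harness.

import numpy as np

def recall_at_k(query_embs: np.ndarray,
                doc_embs: np.ndarray,
                relevant_doc_ids: list[int],
                k: int = 1) -> float:
    """Fraction of queries whose relevant document appears in the
    top-k results ranked by cosine similarity."""
    # Normalize embeddings so the dot product equals cosine similarity.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    d = doc_embs / np.linalg.norm(doc_embs, axis=1, keepdims=True)
    scores = q @ d.T                        # (num_queries, num_docs)
    topk = np.argsort(-scores, axis=1)[:, :k]  # top-k doc indices per query
    hits = [rel in row for rel, row in zip(relevant_doc_ids, topk)]
    return float(np.mean(hits))

# Toy example: 3 queries, 5 candidate documents, 4-dim embeddings.
rng = np.random.default_rng(0)
queries = rng.normal(size=(3, 4))
docs = rng.normal(size=(5, 4))
print(recall_at_k(queries, docs, relevant_doc_ids=[0, 2, 4], k=1))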

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MR²-Bench, a benchmark designed to evaluate reasoning-intensive multimodal retrieval rather than surface-level semantic matching. It resides in the 'Reasoning-Intensive Retrieval Benchmarks' leaf of the taxonomy, which contains only two papers total (including this one). This leaf sits within the broader 'Reasoning Evaluation and Benchmarking' branch, indicating a relatively sparse research direction focused specifically on retrieval tasks that demand complex inference. The sibling paper in this leaf addresses similar evaluation challenges, suggesting that reasoning-driven retrieval benchmarking is an emerging but not yet crowded area.

The taxonomy reveals that MR²-Bench occupies a distinct niche between several related directions. Neighboring leaves include 'General Multimodal Reasoning Evaluation' (six papers assessing diverse reasoning abilities) and 'Training Data Construction' (three papers on instruction-tuning datasets). The 'Multimodal Retrieval Models and Embeddings' branch (six papers) focuses on representation learning rather than evaluation. The scope notes clarify that MR²-Bench excludes general reasoning benchmarks and training datasets, instead targeting retrieval-specific reasoning assessment. This positioning suggests the work bridges evaluation methodology with retrieval system design, addressing a gap between pure reasoning tests and embedding-based retrieval methods.

Among the thirty candidates examined, the contribution-level analysis shows varied novelty signals. For the core benchmark contribution (Contribution A), ten candidates were examined with zero refutations, suggesting limited direct prior work on reasoning-intensive retrieval evaluation at this scale. The diverse data domains contribution (Contribution B) likewise drew no refutations among its ten candidates. However, for the complex query support contribution (Contribution C), one refutable candidate was identified among the ten examined, indicating some overlap with existing work on multi-image document retrieval. These statistics reflect a focused search scope rather than exhaustive coverage, with most contributions appearing relatively novel within the examined literature.

Based on the limited search scope of thirty semantically similar papers, the work appears to address an underexplored evaluation gap. The taxonomy structure confirms that reasoning-intensive retrieval benchmarking is a sparse area with only one sibling paper. However, the analysis does not cover the full breadth of multimodal evaluation literature, and the single refutation for complex query support suggests potential overlap with document-centric retrieval systems. The assessment is constrained by the top-K semantic search methodology and may not capture all relevant prior benchmarks.

Taxonomy

50 Core-task Taxonomy Papers
3 Claimed Contributions
30 Contribution Candidate Papers Compared
1 Refutable Paper

Research Landscape Overview

Core task: reasoning-intensive multimodal retrieval. This field addresses the challenge of retrieving and reasoning over information that spans text, images, video, and other modalities, requiring models to perform complex inference rather than simple matching. The taxonomy reveals five main branches that capture complementary aspects of this landscape. Multimodal Chain-of-Thought Reasoning (e.g., Multimodal Chain-of-Thought[3], MM-REACT[7]) focuses on eliciting step-by-step reasoning traces that integrate visual and textual cues. Retrieval-Augmented Multimodal Generation explores how external knowledge sources can enhance generation quality, as seen in works like MuRAG[15] and Retrieval-Augmented Multimodal[37]. Multimodal Retrieval Models and Embeddings develops representation-learning techniques for cross-modal matching, including efforts like MM-Embed[11] and Composed Image Retrieval[20]. Reasoning Evaluation and Benchmarking provides datasets and metrics to measure reasoning capabilities, while Advanced Reasoning Paradigms investigates novel inference strategies such as deep thinking and progressive retrieval. Recent work highlights tensions between end-to-end reasoning and modular retrieval-augmented approaches, with many studies exploring how to balance retrieval precision and reasoning depth.

MR²-Bench[0] sits squarely within the Reasoning Evaluation and Benchmarking branch, specifically targeting reasoning-intensive retrieval scenarios. Unlike broader multimodal reasoning surveys (Multimodal Reasoning Survey[5], Multimodal Chain-of-Thought Survey[1]) that catalog diverse reasoning paradigms, MR²-Bench[0] emphasizes rigorous evaluation of retrieval under reasoning demands, sharing thematic overlap with MMIR Benchmark[14] and EMMA Benchmark[16]. Its focus on reasoning-intensive retrieval distinguishes it from purely generation-oriented benchmarks, positioning it as a critical resource for assessing whether models can retrieve relevant multimodal evidence when complex inference is required, rather than relying solely on surface-level similarity.

Claimed Contributions

Contribution A: MR²-Bench, a reasoning-intensive multimodal retrieval benchmark

The authors present the first benchmark specifically designed to evaluate multimodal retrieval systems on reasoning-intensive tasks rather than shallow semantic matching. It contains 1,309 curated queries across 12 sub-tasks organized into three meta-tasks, requiring logical, spatial, and causal inference over diverse multimodal data including natural images, diagrams, and visual puzzles.

10 retrieved papers
Contribution B: Diverse multimodal data domains beyond natural images

The benchmark incorporates a broad range of image types including mathematical visual proofs, visual puzzles, economic charts, and scientific diagrams. These data types have widespread applications and inherently require visual reasoning capabilities that previous multimodal retrieval tasks have largely overlooked.

10 retrieved papers
Contribution C: Support for complex free-form queries and documents with multiple images

Unlike previous multimodal benchmarks, where queries or documents typically contain at most a single image, both queries and documents in MR²-Bench may include multiple images in arbitrary interleaved text-image layouts. This design more closely reflects real-world document structures and retrieval scenarios; a schematic sketch of such an interleaved item follows this entry.

10 retrieved papers
Can Refute (1 refutable paper identified among the 10 retrieved)
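To make the interleaved structure concrete, the following is a minimal sketch of how such multi-image queries and documents might be represented. The schema and field names are hypothetical illustrations, not MR²-Bench's actual data format.

from dataclasses import dataclass, field

@dataclass
class Segment:
    kind: str             # "text" or "image"
    text: str = ""        # populated when kind == "text"
    image_path: str = ""  # populated when kind == "image"

@dataclass
class InterleavedItem:
    """A query or document: an ordered list of text/image segments."""
    item_id: str
    segments: list[Segment] = field(default_factory=list)

    def num_images(self) -> int:
        return sum(1 for s in self.segments if s.kind == "image")

# A query that interleaves two images with text, as MR²-Bench permits.
query = InterleavedItem(
    item_id="q-001",
    segments=[
        Segment(kind="text", text="Which diagram completes the pattern?"),
        Segment(kind="image", image_path="puzzle_grid.png"),
        Segment(kind="image", image_path="candidate_options.png"),
    ],
)
print(query.num_images())  # 2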

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution A: MR²-Bench, a reasoning-intensive multimodal retrieval benchmark

Contribution B: Diverse multimodal data domains beyond natural images

Contribution C: Support for complex free-form queries and documents with multiple images