MR²-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval
Overview
Overall Novelty Assessment
The paper introduces MR²-Bench, a benchmark designed to evaluate reasoning-intensive multimodal retrieval rather than surface-level semantic matching. It resides in the 'Reasoning-Intensive Retrieval Benchmarks' leaf of the taxonomy, which contains only two papers total (including this one). This leaf sits within the broader 'Reasoning Evaluation and Benchmarking' branch, indicating a relatively sparse research direction focused specifically on retrieval tasks that demand complex inference. The sibling paper in this leaf addresses similar evaluation challenges, suggesting that reasoning-driven retrieval benchmarking is an emerging but not yet crowded area.
The taxonomy reveals that MR²-Bench occupies a distinct niche between several related directions. Neighboring leaves include 'General Multimodal Reasoning Evaluation' (six papers assessing diverse reasoning abilities) and 'Training Data Construction' (three papers on instruction-tuning datasets). The 'Multimodal Retrieval Models and Embeddings' branch (six papers) focuses on representation learning rather than evaluation. The scope notes clarify that MR²-Bench excludes general reasoning benchmarks and training datasets, instead targeting retrieval-specific reasoning assessment. This positioning suggests the work bridges evaluation methodology with retrieval system design, addressing a gap between pure reasoning tests and embedding-based retrieval methods.
Across the thirty candidates examined, the contribution-level analysis shows varied novelty signals. For the core benchmark contribution (Contribution A), ten candidates were examined with zero refutations, suggesting limited direct prior work on reasoning-intensive retrieval evaluation at this scale. The diverse data domains contribution (Contribution B) likewise found no refutations among its ten candidates. However, for the complex query support contribution (Contribution C), one refutable candidate was identified among the ten examined, indicating some overlap with existing work on multi-image document retrieval. These statistics reflect a focused search scope rather than exhaustive coverage, with most contributions appearing relatively novel within the examined literature.
Within the limited search scope of thirty semantically similar papers, the work appears to address an underexplored evaluation gap. The taxonomy structure confirms that reasoning-intensive retrieval benchmarking is a sparse area with only one sibling paper. However, the analysis does not cover the full breadth of multimodal evaluation literature, and the single refutation for complex query support suggests potential overlap with document-centric retrieval systems. The assessment is constrained by the top-K semantic search methodology and may not capture all relevant prior benchmarks.
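To make the search-scope constraint concrete, the snippet below is a minimal sketch of top-K semantic candidate selection over paper abstracts. The embedding model and function names are illustrative assumptions, not the pipeline actually used to produce this assessment.

```python
# Minimal sketch of top-K semantic candidate selection over paper abstracts.
# The embedding model and function names are illustrative assumptions,
# not the pipeline actually used to produce this assessment.
import numpy as np
from sentence_transformers import SentenceTransformer


def top_k_candidates(query_abstract: str, corpus_abstracts: list[str], k: int = 30) -> list[int]:
    """Return indices of the k corpus abstracts most similar to the query abstract."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # any off-the-shelf text encoder
    vectors = model.encode([query_abstract] + corpus_abstracts, normalize_embeddings=True)
    query_vec, corpus_vecs = vectors[0], vectors[1:]
    scores = corpus_vecs @ query_vec          # cosine similarity (embeddings are unit-normalized)
    return np.argsort(-scores)[:k].tolist()   # indices of the top-k most similar papers
```

Any candidate that falls outside the top thirty by this similarity score is never compared at all, which is why the refutation counts above bound, rather than establish, novelty.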
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present the first benchmark specifically designed to evaluate multimodal retrieval systems on reasoning-intensive tasks rather than shallow semantic matching. It contains 1,309 curated queries across 12 sub-tasks organized into three meta-tasks, requiring logical, spatial, and causal inference over diverse multimodal data including natural images, diagrams, and visual puzzles.
The benchmark incorporates a broad range of image types including mathematical visual proofs, visual puzzles, economic charts, and scientific diagrams. These data types have widespread applications and inherently require visual reasoning capabilities that previous multimodal retrieval tasks have largely overlooked.
Unlike previous multimodal benchmarks where queries or documents typically contain at most a single image, both queries and documents in MR²-Bench may include multiple images in arbitrary interleaved text-image layouts. This design more accurately reflects real-world document structures and retrieval scenarios.
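To make the interleaved layout concrete, the following is a minimal sketch of how a query or document mixing multiple images with text segments could be represented. The class and field names are illustrative assumptions for exposition, not the benchmark's released data schema.

```python
# Illustrative representation of an interleaved text-image query or document.
# Class and field names are assumptions for exposition, not MR²-Bench's schema.
from dataclasses import dataclass
from typing import Literal


@dataclass
class Segment:
    kind: Literal["text", "image"]
    content: str  # raw text, or a path/URL to the image file


@dataclass
class InterleavedDoc:
    doc_id: str
    segments: list[Segment]  # arbitrary number and ordering of text/image parts

    def images(self) -> list[str]:
        return [s.content for s in self.segments if s.kind == "image"]


# A query mixing two images with connecting text, as the benchmark allows:
query = InterleavedDoc(
    doc_id="q-0001",
    segments=[
        Segment("text", "Which proof diagram completes the argument started in"),
        Segment("image", "figures/step1.png"),
        Segment("text", "given the auxiliary construction in"),
        Segment("image", "figures/step2.png"),
    ],
)
```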
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[31] Who is in the Spotlight: The Hidden Bias Undermining Multimodal Retrieval-Augmented Generation
Contribution Analysis
Detailed comparisons for each claimed contribution
MR²-Bench: A reasoning-intensive multimodal retrieval benchmark
The authors present the first benchmark specifically designed to evaluate multimodal retrieval systems on reasoning-intensive tasks rather than shallow semantic matching. It contains 1,309 curated queries across 12 sub-tasks organized into three meta-tasks, requiring logical, spatial, and causal inference over diverse multimodal data including natural images, diagrams, and visual puzzles.
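The 12-sub-task, three-meta-task organization implies a two-level score aggregation. The sketch below shows one way to roll per-query retrieval scores up to that hierarchy; the field names and the choice of recall@5 are assumptions, not the benchmark's published protocol.

```python
# One way to roll per-query retrieval scores up to sub-task and meta-task level.
# Field names and the recall@5 metric are assumptions, not MR²-Bench's protocol.
from collections import defaultdict
from statistics import mean


def aggregate_scores(per_query: list[dict]) -> dict:
    """per_query items look like {"meta_task": ..., "sub_task": ..., "recall_at_5": ...}."""
    sub_scores, meta_scores = defaultdict(list), defaultdict(list)
    for row in per_query:
        sub_scores[(row["meta_task"], row["sub_task"])].append(row["recall_at_5"])
        meta_scores[row["meta_task"]].append(row["recall_at_5"])
    return {
        "per_sub_task": {k: mean(v) for k, v in sub_scores.items()},
        "per_meta_task": {k: mean(v) for k, v in meta_scores.items()},
        "overall": mean(row["recall_at_5"] for row in per_query),
    }
```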
[2] PixelLM: Pixel Reasoning with Large Multimodal Model
[3] Multimodal Chain-of-Thought Reasoning in Language Models
[7] MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action
[8] MMAT-1M: A Large Reasoning Dataset for Multimodal Agent Tuning
[14] Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models
[51] MMMU-Pro: A More Robust Multi-Discipline Multimodal Understanding Benchmark
[52] MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
[53] MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
[54] MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding
[55] FCMR: Robust Evaluation of Financial Cross-Modal Multi-Hop Reasoning
Diverse multimodal data domains beyond natural images
The benchmark incorporates a broad range of image types including mathematical visual proofs, visual puzzles, economic charts, and scientific diagrams. These data types have widespread applications and inherently require visual reasoning capabilities that previous multimodal retrieval tasks have largely overlooked.
[12] Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval
[64] MathVerse: Does Your Multi-Modal LLM Truly See the Diagrams in Visual Math Problems?
[65] Measuring Multimodal Mathematical Reasoning with MATH-Vision Dataset
[66] R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
[67] BLINK: Multimodal Large Language Models Can See but Not Perceive
[68] Cross-Modal Attention Guided Visual Reasoning for Referring Image Segmentation
[69] CoNav: Collaborative Cross-Modal Reasoning for Embodied Navigation
[70] CMIRNet: Cross-Modal Interactive Reasoning Network for Referring Image Segmentation
[71] Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities
[72] FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts
Support for complex free-form queries and documents with multiple images
Unlike previous multimodal benchmarks where queries or documents typically contain at most a single image, both queries and documents in MR²-Bench may include multiple images in arbitrary interleaved text-image layouts. This design more accurately reflects real-world document structures and retrieval scenarios.
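As a rough illustration of what retrieval over such inputs involves, the sketch below encodes every segment of an interleaved query or document with a stand-in encoder, mean-pools the segment embeddings, and ranks candidates by cosine similarity. The encoder and the pooling choice are assumptions for exposition, not how any MR²-Bench baseline actually fuses its inputs.

```python
# Rough sketch of ranking interleaved multi-image documents against an
# interleaved query. encode_segment() is a stand-in for whatever multimodal
# encoder a retriever uses; mean-pooling over segments is an assumption, not
# how any particular MR²-Bench baseline fuses its inputs.
import numpy as np

Doc = list[tuple[str, str]]  # [(kind, content), ...] with kind in {"text", "image"}


def encode_segment(kind: str, content: str) -> np.ndarray:
    """Stand-in encoder: a deterministic pseudo-random vector per segment.
    Replace with a real text/image encoder in practice."""
    rng = np.random.default_rng(abs(hash((kind, content))) % (2**32))
    return rng.standard_normal(512)


def embed_doc(doc: Doc) -> np.ndarray:
    """Mean-pool segment embeddings into a single vector, then L2-normalize."""
    vecs = np.stack([encode_segment(kind, content) for kind, content in doc])
    pooled = vecs.mean(axis=0)
    return pooled / np.linalg.norm(pooled)


def rank_candidates(query: Doc, candidates: list[Doc]) -> list[int]:
    """Return candidate indices ordered by cosine similarity to the query."""
    q = embed_doc(query)
    scores = [float(embed_doc(c) @ q) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: -scores[i])
```

Because both sides may contain several images, the fusion step (here, naive mean-pooling) is exactly where interleaved layouts become harder to handle than the single-image queries assumed by earlier benchmarks.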