MRMR: A Realistic and Expert-Level Multidisciplinary Benchmark for Reasoning-Intensive Multimodal Retrieval
Overview
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors present MRMR, a new benchmark containing 1,435 expert-annotated queries across 23 domains. It evaluates retrieval systems on reasoning-intensive tasks using image-text interleaved sequences, addressing limitations of prior benchmarks that focus on general-domain knowledge and single-image queries.
The authors are the first to introduce a Contradiction Retrieval task in the multimodal domain. It requires models to perform logical reasoning to identify documents that conflict with or contradict the user query, going beyond semantic relevance matching.
The benchmark represents both queries and documents as interleaved sequences of text and images, enabling more realistic retrieval scenarios with multi-image queries and mixed-modality corpus documents, unlike earlier benchmarks restricted to single images or unimodal documents.
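The overview does not specify MRMR's actual data schema, but the interleaved query/document representation described above can be sketched as an ordered sequence of text spans and image references. The class and field names below are illustrative assumptions, not the benchmark's format:

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class ImageRef:
    """Placeholder for an image in an interleaved sequence (path or URL)."""
    uri: str

# An interleaved sequence is an ordered mix of text spans and images.
Segment = Union[str, ImageRef]

@dataclass
class InterleavedDoc:
    """Hypothetical container for an MRMR-style query or corpus document."""
    doc_id: str
    segments: List[Segment] = field(default_factory=list)

    def n_images(self) -> int:
        # Count how many segments are images rather than text.
        return sum(isinstance(s, ImageRef) for s in self.segments)

# A multi-image query: text and images interleaved in order.
query = InterleavedDoc(
    doc_id="q1",
    segments=[
        "Which slide shows the same staining artifact as",
        ImageRef("slide_a.png"),
        "but at the magnification used in",
        ImageRef("slide_b.png"),
    ],
)
print(query.n_images())  # 2
```

The same container can represent corpus documents, so a retriever sees queries and documents in one uniform mixed-modality format.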
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[44] MR-Bench: Going Beyond Matching to Reasoning in Multimodal Retrieval
Contribution Analysis
Detailed comparisons for each claimed contribution
MRMR benchmark for expert-level multidisciplinary multimodal retrieval
[6] Retrieval-Augmented Multi-Modal Chain-of-Thoughts Reasoning for Large Language Models
[71] R-Bench: Graduate-Level Multi-Disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation
[72] MMMU: A Massive Multi-Discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
[73] MMMG: A Massive, Multidisciplinary, Multi-Tier Generation Benchmark for Text-to-Image Reasoning
[74] MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding
[75] ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning
[76] MAC: A Live Benchmark for Multimodal Large Language Models in Scientific Understanding
[77] MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
[78] A Survey on Benchmarks of Multimodal Large Language Models
[79] A Survey on Multimodal Benchmarks: In the Era of Large AI Models
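The overview does not name the evaluation metric MRMR reports; a common choice for ranked retrieval benchmarks of this kind is NDCG@10, sketched below under that assumption (the function names are illustrative):

```python
import math
from typing import Dict, List

def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the top-k ranked relevances."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_doc_ids: List[str], qrels: Dict[str, float], k: int = 10) -> float:
    """NDCG@k for one query: a ranked list scored against graded judgments."""
    gains = [qrels.get(d, 0.0) for d in ranked_doc_ids]
    ideal = sorted(qrels.values(), reverse=True)  # best possible ordering
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy example: the single relevant document is ranked second.
qrels = {"d3": 1.0}
print(round(ndcg_at_k(["d1", "d3", "d2"], qrels, k=10), 3))  # 0.631
```

Per-query scores would then be averaged over the benchmark's 1,435 queries to compare retrieval systems.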
Contradiction Retrieval task for multimodal setting
[61] Alleviating the Inconsistency of Multimodal Data in Cross-Modal Retrieval
[62] Why Foundation Models Struggle with Cross-Modal Context
[63] Benchmarking Multimodal Knowledge Conflict for Large Multimodal Models
[64] VERITE: A Robust Benchmark for Multimodal Misinformation Detection Accounting for Unimodal Bias
[65] Complementary-Contradictory Feature Regularization Against Multimodal Overfitting
[66] Crosscheck-Bench: Diagnosing Compositional Failures in Multimodal Conflict Resolution
[67] EXCLAIM: An Explainable Cross-Modal Agentic System for Misinformation Detection with Hierarchical Retrieval
[68] Detecting Fake News by Exploring the Consistency of Multimodal Data
[69] Cross-Modal Ambiguity Learning for Multimodal Fake News Detection
[70] Is Cognition Consistent with Perception? Assessing and Mitigating Multimodal Knowledge Conflicts in Document Understanding
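The overview does not describe how MRMR scores contradiction. To illustrate how Contradiction Retrieval differs from ordinary similarity retrieval, here is a toy sketch in which a crude stand-in scorer (in practice this would be an NLI-style multimodal model) ranks documents by how strongly they contradict the query rather than by how closely they match it:

```python
from typing import Callable, List, Tuple

def toy_contradiction_score(query: str, doc: str) -> float:
    """Crude stand-in scorer: reward documents that share content words
    with the query but flip its polarity (a single negation cue)."""
    q_words, d_words = set(query.lower().split()), set(doc.lower().split())
    overlap = len(q_words & d_words)                 # topical relatedness
    negated = ("not" in d_words) != ("not" in q_words)  # polarity flip
    return overlap + (2.0 if negated else 0.0)

def contradiction_retrieve(
    query: str,
    corpus: List[Tuple[str, str]],  # (doc_id, text) pairs
    score: Callable[[str, str], float] = toy_contradiction_score,
    k: int = 3,
) -> List[str]:
    """Return the ids of the top-k documents that contradict the query."""
    ranked = sorted(corpus, key=lambda d: score(query, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

corpus = [
    ("d1", "the mineral sample is quartz"),
    ("d2", "the mineral sample is not quartz"),
    ("d3", "weather patterns over the pacific"),
]
print(contradiction_retrieve("the mineral sample is quartz", corpus, k=1))  # ['d2']
```

Note that a pure similarity retriever would rank d1 first, since it restates the query; contradiction retrieval instead surfaces d2, the topically related document that asserts the opposite.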
Image-text interleaved query and document representation
The benchmark represents both queries and documents as interleaved sequences of text and images, enabling more realistic retrieval scenarios with multi-image queries and mixed-modality corpus documents, unlike earlier benchmarks restricted to single images or unimodal documents.