MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: multimodal reasoning, multimodal benchmark, multi-image benchmark, thinking models
Abstract:

Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has empowered them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite their promise, MLLMs’ reasoning abilities across different scenarios in real life remain largely unexplored and lack standardized benchmarks for evaluation. To address this gap, we introduce MMR-Life, a comprehensive benchmark designed to evaluate the diverse multimodal multi-image reasoning capabilities of MLLMs across real-life scenarios. MMR-Life consists of 2,676 multiple-choice questions based on 19,367 images primarily sourced from real-world contexts, comprehensively covering seven reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal. Unlike existing reasoning benchmarks, MMR-Life does not rely on domain-specific expertise but instead requires models to integrate information across multiple images and apply diverse reasoning abilities. The evaluation of 37 advanced models highlights the substantial challenge posed by MMR-Life. Even top models like GPT-5 achieve only 58% accuracy and display considerable variance in performance across reasoning types. Moreover, we analyze the reasoning paradigms of existing MLLMs, exploring how factors such as thinking length, reasoning method, and reasoning type affect their performance. In summary, MMR-Life establishes a comprehensive foundation for evaluating, analyzing, and improving the next generation of multimodal reasoning systems.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MMR-Life, a benchmark for evaluating multimodal multi-image reasoning across seven reasoning types in real-life scenarios. It resides in the 'Multi-Image and Multi-Turn Reasoning Benchmarks' leaf, which contains six papers total including the original work. This leaf sits within the broader 'Benchmark Development and Evaluation Frameworks' branch, indicating a moderately populated research direction focused on assessing models' ability to integrate information across multiple images and conversational turns. The taxonomy reveals this is an active but not overcrowded area, with sibling papers like MMIU and MMDU exploring related but distinct emphases.

The taxonomy structure shows MMR-Life's leaf is one of four within the benchmark branch, alongside domain-specific evaluations, real-world scenario benchmarks, and specialized task benchmarks. Neighboring leaves contain works emphasizing expert knowledge requirements or high-resolution perceptual challenges, while MMR-Life explicitly excludes domain-specific expertise in favor of diverse reasoning types. The broader taxonomy includes model architecture and application branches, suggesting the field balances benchmark creation with system development. MMR-Life's focus on real-life scenarios without specialized domain knowledge positions it at the intersection of general-purpose evaluation and practical applicability, distinguishing it from both expert-level and synthetic task benchmarks.

Of the thirty candidate papers examined, ten were retrieved for each contribution. The benchmark contribution shows no clear refutation among its ten papers, suggesting relative novelty in its specific combination of real-life scenarios and seven reasoning types. The evaluation contribution, which examines thirty-seven models, encountered one refutable candidate among its ten, indicating some overlap with prior large-scale model assessments. The analysis of reasoning paradigms found no refutations across its ten candidates. These statistics reflect a limited search scope rather than exhaustive coverage; within the examined literature, the benchmark's design appears more distinctive than its evaluation methodology.

Based on the limited search of thirty semantically similar papers, MMR-Life appears to occupy a recognizable but not heavily saturated position within multi-image reasoning benchmarks. The taxonomy context suggests the work contributes to an evolving conversation about balancing breadth and depth in evaluation design. The analysis does not cover the full landscape of multimodal benchmarking, particularly works outside the top-K semantic matches or recent publications not yet indexed.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: multimodal multi-image reasoning in real-life scenarios. The field organizes around four main branches that together capture the lifecycle of developing and deploying such systems. Benchmark Development and Evaluation Frameworks focuses on creating datasets and metrics to assess multi-image understanding, often emphasizing multi-turn interactions and complex reasoning chains as seen in works like MMIU[6] and MMDU[7]. Model Architectures and Training Methodologies explores the design of vision-language models capable of processing multiple images simultaneously, including innovations in attention mechanisms and training strategies exemplified by efforts such as R1-OneVision[4] and Generative In-Context Learners[3]. Application-Driven Systems and Task-Specific Methods targets concrete use cases, ranging from medical diagnosis to agricultural monitoring, where multi-image reasoning addresses domain-specific challenges. Finally, Foundational Methods and Cross-Domain Techniques provides the underlying algorithmic toolkit, including contrastive learning and cross-modal alignment strategies that generalize across tasks.

Within the benchmark branch, a particularly active line of work centers on evaluating models' ability to reason across image sequences and conversational contexts, balancing breadth of coverage with depth of reasoning difficulty. MMR-Life[0] situates itself in this cluster alongside MMIU[6], MMDU[7], and REMI[28], all of which probe multi-image and multi-turn capabilities but differ in their emphasis: while MMIU[6] stresses interleaved understanding and MMDU[7] targets document-level reasoning, MMR-Life[0] focuses on real-life scenario diversity and practical applicability. Nearby works like MMCR[20] and MIHBench[30] further explore compositional reasoning and hallucination detection, highlighting ongoing questions about how to measure robustness and generalization when models must integrate information from varied visual inputs. This landscape reveals a tension between creating comprehensive benchmarks that cover diverse real-world settings and designing targeted evaluations that isolate specific reasoning skills.

Claimed Contributions

MMR-Life benchmark for multimodal multi-image reasoning in real-life scenarios

The authors propose MMR-Life, a new benchmark containing 2,676 multiple-choice questions based on 19,367 images from real-world contexts. It comprehensively covers seven reasoning types (abductive, analogical, causal, deductive, inductive, spatial, and temporal) and does not rely on domain-specific expertise, instead requiring models to integrate information across multiple images.

10 retrieved papers
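To make the benchmark's structure concrete, the following is a minimal sketch of what one MMR-Life-style item and its exact-match scoring could look like. The schema and field names are hypothetical illustrations inferred from the description above (multiple images per question, multiple-choice answers, one of seven reasoning types), not the authors' actual data format.

```python
from dataclasses import dataclass

# The seven reasoning types named in the benchmark description.
REASONING_TYPES = {
    "abductive", "analogical", "causal", "deductive",
    "inductive", "spatial", "temporal",
}

@dataclass
class BenchmarkItem:
    """Hypothetical schema for one multi-image multiple-choice question."""
    question: str
    image_paths: list[str]   # multiple images per question
    choices: list[str]       # e.g. ["A ...", "B ...", "C ...", "D ..."]
    answer: str              # gold choice label, e.g. "B"
    reasoning_type: str      # one of REASONING_TYPES

    def __post_init__(self) -> None:
        if self.reasoning_type not in REASONING_TYPES:
            raise ValueError(f"unknown reasoning type: {self.reasoning_type}")

def score(items: list[BenchmarkItem], predictions: list[str]) -> float:
    """Exact-match accuracy over predicted choice labels."""
    correct = sum(p == it.answer for it, p in zip(items, predictions))
    return correct / len(items)
```

Exact-match over choice labels is the usual metric for multiple-choice benchmarks of this kind; how model free-form output is mapped to a label is left out of this sketch.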
Extensive evaluation of 37 advanced MLLMs revealing substantial challenges

The authors conduct a comprehensive evaluation of 37 state-of-the-art multimodal large language models on MMR-Life. The results show that even the most advanced models struggle: GPT-5 reaches only 58% accuracy against 72% human performance, and accuracy varies substantially across reasoning types.

10 retrieved papers
Can Refute
Analysis of MLLM reasoning paradigms and their effectiveness

The authors provide an in-depth analysis of current MLLM reasoning paradigms, examining how thinking length, reasoning methods (such as reinforcement learning), and reasoning types influence model performance. Key findings include that long thinking benefits only limited reasoning types, RL shows weaker generalization in small models, and reasoning types cluster into patterns.

10 retrieved papers
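Findings about per-type variance and reasoning-type clusters rest on breaking accuracy down by reasoning type. A minimal sketch of that aggregation is shown below; the `(reasoning_type, correct)` record format is a hypothetical simplification, not the paper's actual analysis code.

```python
from collections import defaultdict

def accuracy_by_type(records: list[tuple[str, bool]]) -> dict[str, float]:
    """Per-reasoning-type accuracy from (reasoning_type, correct) records.

    Returns a mapping like {"causal": 0.5, "spatial": 1.0}, which is the
    kind of breakdown used to compare performance across reasoning types.
    """
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # type -> [correct, total]
    for rtype, correct in records:
        totals[rtype][0] += int(correct)
        totals[rtype][1] += 1
    return {t: c / n for t, (c, n) in totals.items()}
```

Once each type has its own accuracy, the per-type scores can be compared or clustered, which is the kind of analysis the reasoning-paradigm findings above describe.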

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MMR-Life benchmark for multimodal multi-image reasoning in real-life scenarios


Contribution

Extensive evaluation of 37 advanced MLLMs revealing substantial challenges


Contribution

Analysis of MLLM reasoning paradigms and their effectiveness

