MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: multimodal reasoning, multimodal benchmark, multi-image benchmark, thinking models
Abstract:

Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has empowered them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite their promise, MLLMs’ reasoning abilities across different scenarios in real life remain largely unexplored and lack standardized benchmarks for evaluation. To address this gap, we introduce MMR-Life, a comprehensive benchmark designed to evaluate the diverse multimodal multi-image reasoning capabilities of MLLMs across real-life scenarios. MMR-Life consists of 2,676 multiple-choice questions based on 19,367 images primarily sourced from real-world contexts, comprehensively covering seven reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal. Unlike existing reasoning benchmarks, MMR-Life does not rely on domain-specific expertise but instead requires models to integrate information across multiple images and apply diverse reasoning abilities. The evaluation of 37 advanced models highlights the substantial challenge posed by MMR-Life. Even top models like GPT-5 achieve only 58% accuracy and display considerable variance in performance across reasoning types. Moreover, we analyze the reasoning paradigms of existing MLLMs, exploring how factors such as thinking length, reasoning method, and reasoning type affect their performance. In summary, MMR-Life establishes a comprehensive foundation for evaluating, analyzing, and improving the next generation of multimodal reasoning systems.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MMR-Life, a benchmark for evaluating multimodal multi-image reasoning across seven reasoning types in real-life scenarios. It resides in the 'Multi-Image and Multi-Turn Reasoning Benchmarks' leaf, which contains six papers total including the original work. This leaf sits within the broader 'Benchmark Development and Evaluation Frameworks' branch, indicating a moderately populated research direction focused on assessing models' ability to integrate information across multiple images and conversational turns. The taxonomy reveals this is an active but not overcrowded area, with sibling papers like MMIU and MMDU exploring related but distinct emphases.

The taxonomy structure shows MMR-Life's leaf is one of four within the benchmark branch, alongside domain-specific evaluations, real-world scenario benchmarks, and specialized task benchmarks. Neighboring leaves contain works emphasizing expert knowledge requirements or high-resolution perceptual challenges, while MMR-Life explicitly excludes domain-specific expertise in favor of diverse reasoning types. The broader taxonomy includes model architecture and application branches, suggesting the field balances benchmark creation with system development. MMR-Life's focus on real-life scenarios without specialized domain knowledge positions it at the intersection of general-purpose evaluation and practical applicability, distinguishing it from both expert-level and synthetic task benchmarks.

Of the thirty candidate papers examined, ten were retrieved for each contribution. The benchmark contribution shows no clear refutation among its ten papers, suggesting relative novelty in its specific combination of real-life scenarios and seven reasoning types. The evaluation contribution, which examines thirty-seven models, encountered one refutable candidate among its ten, indicating some overlap with prior large-scale model assessments. The analysis of reasoning paradigms found no refutations across its ten candidates. These statistics reflect a limited search scope rather than exhaustive coverage; within the examined literature, the benchmark's design appears more distinctive than its evaluation methodology.

Based on the limited search of thirty semantically similar papers, MMR-Life appears to occupy a recognizable but not heavily saturated position within multi-image reasoning benchmarks. The taxonomy context suggests the work contributes to an evolving conversation about balancing breadth and depth in evaluation design. The analysis does not cover the full landscape of multimodal benchmarking, particularly works outside the top-K semantic matches or recent publications not yet indexed.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: multimodal multi-image reasoning in real-life scenarios. The field organizes around four main branches that together capture the lifecycle of developing and deploying such systems. Benchmark Development and Evaluation Frameworks focuses on creating datasets and metrics to assess multi-image understanding, often emphasizing multi-turn interactions and complex reasoning chains as seen in works like MMIU[6] and MMDU[7]. Model Architectures and Training Methodologies explores the design of vision-language models capable of processing multiple images simultaneously, including innovations in attention mechanisms and training strategies exemplified by efforts such as R1-OneVision[4] and Generative In-Context Learners[3]. Application-Driven Systems and Task-Specific Methods targets concrete use cases, ranging from medical diagnosis to agricultural monitoring, where multi-image reasoning addresses domain-specific challenges. Finally, Foundational Methods and Cross-Domain Techniques provides the underlying algorithmic toolkit, including contrastive learning and cross-modal alignment strategies that generalize across tasks.

Within the benchmark branch, a particularly active line of work centers on evaluating models' ability to reason across image sequences and conversational contexts, balancing breadth of coverage with depth of reasoning difficulty. MMR-Life[0] situates itself in this cluster alongside MMIU[6], MMDU[7], and REMI[28], all of which probe multi-image and multi-turn capabilities but differ in their emphasis: while MMIU[6] stresses interleaved understanding and MMDU[7] targets document-level reasoning, MMR-Life[0] focuses on real-life scenario diversity and practical applicability. Nearby works like MMCR[20] and MIHBench[30] further explore compositional reasoning and hallucination detection, highlighting ongoing questions about how to measure robustness and generalization when models must integrate information from varied visual inputs. This landscape reveals a tension between creating comprehensive benchmarks that cover diverse real-world settings and designing targeted evaluations that isolate specific reasoning skills.

Claimed Contributions

MMR-Life benchmark for multimodal multi-image reasoning in real-life scenarios

The authors propose MMR-Life, a new benchmark containing 2,676 multiple-choice questions based on 19,367 images from real-world contexts. It comprehensively covers seven reasoning types (abductive, analogical, causal, deductive, inductive, spatial, and temporal) and does not rely on domain-specific expertise, instead requiring models to integrate information across multiple images.

10 retrieved papers
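To make the benchmark's structure concrete, the following is a minimal sketch of what one MMR-Life-style item and its exact-match scoring could look like. The schema and field names are hypothetical illustrations inferred from the description above (multiple images per question, multiple-choice answers, one of seven reasoning types), not the authors' actual data format.

```python
from dataclasses import dataclass

# The seven reasoning types named in the benchmark description.
REASONING_TYPES = {
    "abductive", "analogical", "causal", "deductive",
    "inductive", "spatial", "temporal",
}

@dataclass
class BenchmarkItem:
    """Hypothetical schema for one multi-image multiple-choice question."""
    question: str
    image_paths: list[str]   # multiple images per question
    choices: list[str]       # e.g. ["A ...", "B ...", "C ...", "D ..."]
    answer: str              # gold choice label, e.g. "B"
    reasoning_type: str      # one of REASONING_TYPES

    def __post_init__(self) -> None:
        if self.reasoning_type not in REASONING_TYPES:
            raise ValueError(f"unknown reasoning type: {self.reasoning_type}")

def score(items: list[BenchmarkItem], predictions: list[str]) -> float:
    """Exact-match accuracy over predicted choice labels."""
    correct = sum(p == it.answer for it, p in zip(items, predictions))
    return correct / len(items)
```

Exact-match over choice labels is the usual metric for multiple-choice benchmarks of this kind; how model free-form output is mapped to a label is left out of this sketch.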
Extensive evaluation of 37 advanced MLLMs revealing substantial challenges

The authors conduct a comprehensive evaluation of 37 state-of-the-art multimodal large language models on MMR-Life. The results show that even the most advanced models struggle: GPT-5 reaches only 58% accuracy against 72% human performance, and accuracy varies substantially across reasoning types.

10 retrieved papers
Can Refute
Analysis of MLLM reasoning paradigms and their effectiveness

The authors provide an in-depth analysis of current MLLM reasoning paradigms, examining how thinking length, reasoning methods (such as reinforcement learning), and reasoning types influence model performance. Key findings include that long thinking benefits only limited reasoning types, RL shows weaker generalization in small models, and reasoning types cluster into patterns.

10 retrieved papers
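Findings about per-type variance and reasoning-type clusters rest on breaking accuracy down by reasoning type. A minimal sketch of that aggregation is shown below; the `(reasoning_type, correct)` record format is a hypothetical simplification, not the paper's actual analysis code.

```python
from collections import defaultdict

def accuracy_by_type(records: list[tuple[str, bool]]) -> dict[str, float]:
    """Per-reasoning-type accuracy from (reasoning_type, correct) records.

    Returns a mapping like {"causal": 0.5, "spatial": 1.0}, which is the
    kind of breakdown used to compare performance across reasoning types.
    """
    totals: dict[str, list[int]] = defaultdict(lambda: [0, 0])  # type -> [correct, total]
    for rtype, correct in records:
        totals[rtype][0] += int(correct)
        totals[rtype][1] += 1
    return {t: c / n for t, (c, n) in totals.items()}
```

Once each type has its own accuracy, the per-type scores can be compared or clustered, which is the kind of analysis the reasoning-paradigm findings above describe.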

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MMR-Life benchmark for multimodal multi-image reasoning in real-life scenarios


Contribution

Extensive evaluation of 37 advanced MLLMs revealing substantial challenges


Contribution

Analysis of MLLM reasoning paradigms and their effectiveness

