MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Spatial Intelligence, MLLM, VLM, VQA, Benchmark, 3D Understanding
Abstract:

Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoning that real-world deployments demand. We introduce MMSI-Bench, a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours meticulously crafting 1,000 challenging, unambiguous multiple-choice questions from over 120,000 images, each paired with carefully designed distractors and a stepwise reasoning process. We conduct extensive experiments, evaluating 37 open-source and proprietary MLLMs, and observe a wide gap: the strongest open-source model attains roughly 30% accuracy and OpenAI's GPT-5 reasoning model reaches 40%, while humans score 97%. These results underscore the challenging nature of MMSI-Bench and the substantial headroom for future research. Leveraging the annotated reasoning processes, we also provide an automated error analysis pipeline that diagnoses four dominant failure modes: (1) grounding errors, (2) overlap-matching and scene-reconstruction errors, (3) situation-transformation reasoning errors, and (4) spatial-logic errors, offering insights for advancing spatial intelligence.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 0

Research Landscape Overview

Core task: multi-image spatial reasoning for multimodal large language models. The field has organized itself around several complementary branches. Spatial Reasoning Benchmarks and Evaluation focuses on datasets and metrics that test models' ability to understand spatial relationships across multiple images, including specialized benchmarks for multi-image spatial intelligence such as MMSI-Bench[0] and NuScenes-SpatialQA[8]. Training Methodologies for Spatial Reasoning explores techniques to improve spatial understanding through curriculum learning, data augmentation, and specialized training objectives. Model Architectures for Spatial Understanding investigates architectural innovations, such as attention mechanisms and token-efficient designs, that better capture spatial information. General Multi-Image Understanding Evaluation examines broader multi-image capabilities beyond pure spatial reasoning, while Application-Driven Spatial Reasoning targets domain-specific scenarios like robotics and autonomous driving. Surveys and Theoretical Foundations provide conceptual frameworks, and Auxiliary Techniques encompass supporting methods such as prompting strategies and visual grounding.

Several active research directions reveal key trade-offs in the field. One line emphasizes comprehensive benchmark construction to expose model weaknesses, with works like Space-10[9] and Ego-Centric Spatial[35] probing different spatial perspectives and reference frames. Another direction pursues architectural and training innovations, exemplified by Thinking in Space[3] and Mind the Gap[4], which tackle the challenge of integrating spatial reasoning into existing multimodal architectures.

MMSI-Bench[0] sits squarely within the benchmark-focused cluster, providing a systematic evaluation framework for multi-image spatial intelligence. Compared to neighbors like NuScenes-SpatialQA[8], which targets autonomous driving scenarios, MMSI-Bench[0] appears to take a more general-purpose approach to spatial reasoning evaluation. Its emphasis contrasts with Omnidirectional Spatial[49], which explores specific viewpoint challenges, suggesting that MMSI-Bench[0] aims for broader coverage of spatial reasoning phenomena across diverse multi-image contexts.

Claimed Contributions

MMSI-Bench benchmark for multi-image spatial intelligence

The authors present MMSI-Bench, a comprehensive benchmark containing 1,000 human-curated multiple-choice questions designed to evaluate multimodal large language models on spatial reasoning tasks that require integrating information across multiple images. The benchmark covers ten fundamental spatial reasoning task types plus multi-step reasoning, spanning diverse real-world scenarios.

Retrieved papers: 7
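To make the evaluation protocol concrete, the following is a minimal sketch of how a multiple-choice, multi-image benchmark of this kind is typically scored. The JSONL schema (images, question, choices, answer) and the model.answer interface are illustrative assumptions, not MMSI-Bench's actual format.

    import json
    import re

    def load_items(path):
        """Load benchmark items from a JSONL file (assumed schema)."""
        with open(path) as f:
            return [json.loads(line) for line in f]

    def extract_choice(response):
        """Pull the first standalone option letter (A-D) from a free-form reply."""
        match = re.search(r"\b([A-D])\b", response)
        return match.group(1) if match else None

    def evaluate(model, items):
        """Score a model on multiple-choice questions spanning several images."""
        correct = 0
        for item in items:
            prompt = item["question"] + "\n" + "\n".join(
                f"{letter}. {text}"
                for letter, text in zip("ABCD", item["choices"])
            )
            # Hypothetical interface: the model consumes all images jointly.
            response = model.answer(images=item["images"], prompt=prompt)
            if extract_choice(response) == item["answer"]:
                correct += 1
        return correct / len(items)

Accuracy here is simple exact-match over the extracted option letter; robust answer extraction matters in practice because models often embed the letter inside free-form text.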
Human-centric benchmark construction methodology

The authors develop a fully human-centric design approach in which expert researchers manually select image sets, write novel, challenging questions that cannot be answered from any single image, and provide detailed step-by-step reasoning annotations. This methodology yields higher quality, diversity, and difficulty than template-based approaches.

Retrieved papers: 9
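To illustrate what a human-curated item with reasoning annotations might look like, here is a minimal sketch. The field names, the example question, and the reasoning steps are hypothetical, invented for illustration; the actual MMSI-Bench schema may differ.

    from dataclasses import dataclass, field

    @dataclass
    class MMSIItem:
        """One human-curated multi-image question (hypothetical schema)."""
        image_paths: list[str]   # two or more images; no single image suffices
        question: str            # expert-written question requiring cross-image reasoning
        choices: list[str]       # one correct answer plus carefully designed distractors
        answer: str              # option letter, e.g. "A"
        task_type: str           # one of the ten spatial task types, or multi-step
        reasoning_steps: list[str] = field(default_factory=list)  # annotated derivation

    # Hypothetical example item, invented for illustration.
    item = MMSIItem(
        image_paths=["scene_front.jpg", "scene_side.jpg"],
        question="Viewed from the second camera, is the red chair to the left or right of the table?",
        choices=["Left", "Right", "Behind", "Cannot be determined"],
        answer="A",
        task_type="relative-position",
        reasoning_steps=[
            "Match the table across both views using the window as an anchor.",
            "Infer the camera rotation between the two viewpoints.",
            "Re-project the chair's position into the second view's frame.",
        ],
    )

The reasoning_steps annotation is what makes the automated error analysis described next possible: a failure can be localized to the first step where a model's reasoning diverges from the gold trace.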
Automated error analysis pipeline using annotated reasoning

The authors introduce an automated analysis method that uses the human-annotated reasoning processes to systematically categorize model failures into four error types: grounding errors, overlap-matching and scene-reconstruction errors, situation-transformation reasoning errors, and spatial-logic errors. This enables scalable diagnosis of spatial reasoning failures.

Retrieved papers: 9
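A minimal sketch of how such a pipeline could be implemented, assuming an LLM judge exposed as a hypothetical judge callable that maps a prompt string to a reply string; the paper's actual pipeline may differ in prompt design and matching logic.

    # The four failure modes named by the authors.
    ERROR_TYPES = [
        "grounding error",
        "overlap-matching / scene-reconstruction error",
        "situation-transformation reasoning error",
        "spatial-logic error",
    ]

    JUDGE_PROMPT = (
        "You are given a gold step-by-step reasoning trace and a model's reasoning\n"
        "for the same question, which the model answered incorrectly. Find the first\n"
        "step where the model diverges from the gold trace, then label the failure\n"
        "with exactly one of these categories:\n{categories}\n"
        "Reply with the category name only.\n\n"
        "Gold reasoning:\n{gold}\n\nModel reasoning:\n{pred}\n"
    )

    def classify_failure(judge, gold_reasoning, model_reasoning):
        """Label one failure via an LLM judge (hypothetical interface)."""
        prompt = JUDGE_PROMPT.format(
            categories="\n".join(ERROR_TYPES),
            gold=gold_reasoning,
            pred=model_reasoning,
        )
        label = judge(prompt).strip().lower()
        # Map a noisy reply onto a known category; default to the last one.
        return next((t for t in ERROR_TYPES if t.split()[0] in label), ERROR_TYPES[-1])

    def error_histogram(judge, failures):
        """Count failure modes over all incorrectly answered (gold, model) pairs."""
        counts = {t: 0 for t in ERROR_TYPES}
        for gold, pred in failures:
            counts[classify_failure(judge, gold, pred)] += 1
        return counts

Because the gold reasoning localizes the divergence point, the resulting counts can also be broken down per task type to see, for instance, whether a model fails mostly at grounding or at spatial logic.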

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: MMSI-Bench benchmark for multi-image spatial intelligence

Contribution 2: Human-centric benchmark construction methodology

Contribution 3: Automated error analysis pipeline using annotated reasoning

The full descriptions of these contributions are given under Claimed Contributions above.