MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Spatial Intelligence, MLLM, VLM, VQA, Benchmark, 3D Understanding
Abstract:

Spatial intelligence is essential for multimodal large language models (MLLMs) operating in the complex physical world. Existing benchmarks, however, probe only single-image relations and thus fail to assess the multi-image spatial reasoning that real-world deployments demand. We introduce MMSI-Bench, a VQA benchmark dedicated to multi-image spatial intelligence. Six 3D-vision researchers spent more than 300 hours meticulously crafting 1,000 challenging, unambiguous multiple-choice questions from over 120,000 images, each paired with carefully designed distractors and a stepwise reasoning process. We conduct extensive experiments, evaluating 37 open-source and proprietary MLLMs, and observe a wide gap: the strongest open-source model attains roughly 30% accuracy and OpenAI's GPT-5 reasoning model reaches 40%, while humans score 97%. These results underscore the challenging nature of MMSI-Bench and the substantial headroom for future research. Leveraging the annotated reasoning processes, we also provide an automated error analysis pipeline that diagnoses four dominant failure modes: (1) grounding errors, (2) overlap-matching and scene-reconstruction errors, (3) situation-transformation reasoning errors, and (4) spatial-logic errors, offering insights for advancing spatial intelligence.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 0

Research Landscape Overview

Core task: multi-image spatial reasoning for multimodal large language models. The field has organized itself around several complementary branches. Spatial Reasoning Benchmarks and Evaluation focuses on datasets and metrics that test models' ability to understand spatial relationships across multiple images, including specialized benchmarks for multi-image spatial intelligence such as MMSI-Bench[0] and NuScenes-SpatialQA[8]. Training Methodologies for Spatial Reasoning explores techniques to improve spatial understanding through curriculum learning, data augmentation, and specialized training objectives. Model Architectures for Spatial Understanding investigates architectural innovations, such as attention mechanisms and token-efficient designs, that better capture spatial information. General Multi-Image Understanding Evaluation examines broader multi-image capabilities beyond pure spatial reasoning, while Application-Driven Spatial Reasoning targets domain-specific scenarios like robotics and autonomous driving. Surveys and Theoretical Foundations provide conceptual frameworks, and Auxiliary Techniques encompass supporting methods such as prompting strategies and visual grounding.

Several active research directions reveal key trade-offs in the field. One line emphasizes comprehensive benchmark construction to expose model weaknesses, with works like Space-10[9] and Ego-Centric Spatial[35] probing different spatial perspectives and reference frames. Another direction pursues architectural and training innovations, exemplified by Thinking in Space[3] and Mind the Gap[4], which tackle the challenge of integrating spatial reasoning into existing multimodal architectures.

MMSI-Bench[0] sits squarely within the benchmark-focused cluster, providing a systematic evaluation framework for multi-image spatial intelligence. Compared to neighbors like NuScenes-SpatialQA[8], which targets autonomous driving scenarios, MMSI-Bench[0] appears to take a more general-purpose approach to spatial reasoning evaluation. Its emphasis contrasts with Omnidirectional Spatial[49], which explores specific viewpoint challenges, suggesting that MMSI-Bench[0] aims for broader coverage of spatial reasoning phenomena across diverse multi-image contexts.

Claimed Contributions

MMSI-Bench benchmark for multi-image spatial intelligence

The authors present MMSI-Bench, a comprehensive benchmark containing 1,000 human-curated multiple-choice questions designed to evaluate multimodal large language models on spatial reasoning tasks that require integrating information across multiple images. The benchmark covers ten fundamental spatial reasoning task types plus multi-step reasoning, spanning diverse real-world scenarios.

Retrieved papers: 7
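To make the evaluation protocol concrete, the following is a minimal sketch of how a multiple-choice, multi-image benchmark of this kind is typically scored. The JSONL schema (images, question, choices, answer) and the model.answer interface are illustrative assumptions, not MMSI-Bench's actual format.

    import json
    import re

    def load_items(path):
        """Load benchmark items from a JSONL file (assumed schema)."""
        with open(path) as f:
            return [json.loads(line) for line in f]

    def extract_choice(response):
        """Pull the first standalone option letter (A-D) from a free-form reply."""
        match = re.search(r"\b([A-D])\b", response)
        return match.group(1) if match else None

    def evaluate(model, items):
        """Score a model on multiple-choice questions spanning several images."""
        correct = 0
        for item in items:
            prompt = item["question"] + "\n" + "\n".join(
                f"{letter}. {text}"
                for letter, text in zip("ABCD", item["choices"])
            )
            # Hypothetical interface: the model consumes all images jointly.
            response = model.answer(images=item["images"], prompt=prompt)
            if extract_choice(response) == item["answer"]:
                correct += 1
        return correct / len(items)

Accuracy here is simple exact-match over the extracted option letter; robust answer extraction matters in practice because models often embed the letter inside free-form text.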
Human-centric benchmark construction methodology

The authors develop a fully human-centric design approach in which expert researchers manually select image sets, write novel, challenging questions that cannot be answered from any single image, and provide detailed step-by-step reasoning annotations. This methodology yields higher quality, diversity, and difficulty than template-based approaches.

Retrieved papers: 9
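To illustrate what a human-curated item with reasoning annotations might look like, here is a minimal sketch. The field names, the example question, and the reasoning steps are hypothetical, invented for illustration; the actual MMSI-Bench schema may differ.

    from dataclasses import dataclass, field

    @dataclass
    class MMSIItem:
        """One human-curated multi-image question (hypothetical schema)."""
        image_paths: list[str]   # two or more images; no single image suffices
        question: str            # expert-written question requiring cross-image reasoning
        choices: list[str]       # one correct answer plus carefully designed distractors
        answer: str              # option letter, e.g. "A"
        task_type: str           # one of the ten spatial task types, or multi-step
        reasoning_steps: list[str] = field(default_factory=list)  # annotated derivation

    # Hypothetical example item, invented for illustration.
    item = MMSIItem(
        image_paths=["scene_front.jpg", "scene_side.jpg"],
        question="Viewed from the second camera, is the red chair to the left or right of the table?",
        choices=["Left", "Right", "Behind", "Cannot be determined"],
        answer="A",
        task_type="relative-position",
        reasoning_steps=[
            "Match the table across both views using the window as an anchor.",
            "Infer the camera rotation between the two viewpoints.",
            "Re-project the chair's position into the second view's frame.",
        ],
    )

The reasoning_steps annotation is what makes the automated error analysis described next possible: a failure can be localized to the first step where a model's reasoning diverges from the gold trace.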
Automated error analysis pipeline using annotated reasoning

The authors introduce an automated analysis method that uses the human-annotated reasoning processes to systematically categorize model failures into four error types: grounding errors, overlap-matching and scene-reconstruction errors, situation-transformation reasoning errors, and spatial-logic errors. This enables scalable diagnosis of spatial reasoning failures.

Retrieved papers: 9
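A minimal sketch of how such a pipeline could be implemented, assuming an LLM judge exposed as a hypothetical judge callable that maps a prompt string to a reply string; the paper's actual pipeline may differ in prompt design and matching logic.

    # The four failure modes named by the authors.
    ERROR_TYPES = [
        "grounding error",
        "overlap-matching / scene-reconstruction error",
        "situation-transformation reasoning error",
        "spatial-logic error",
    ]

    JUDGE_PROMPT = (
        "You are given a gold step-by-step reasoning trace and a model's reasoning\n"
        "for the same question, which the model answered incorrectly. Find the first\n"
        "step where the model diverges from the gold trace, then label the failure\n"
        "with exactly one of these categories:\n{categories}\n"
        "Reply with the category name only.\n\n"
        "Gold reasoning:\n{gold}\n\nModel reasoning:\n{pred}\n"
    )

    def classify_failure(judge, gold_reasoning, model_reasoning):
        """Label one failure via an LLM judge (hypothetical interface)."""
        prompt = JUDGE_PROMPT.format(
            categories="\n".join(ERROR_TYPES),
            gold=gold_reasoning,
            pred=model_reasoning,
        )
        label = judge(prompt).strip().lower()
        # Map a noisy reply onto a known category; default to the last one.
        return next((t for t in ERROR_TYPES if t.split()[0] in label), ERROR_TYPES[-1])

    def error_histogram(judge, failures):
        """Count failure modes over all incorrectly answered (gold, model) pairs."""
        counts = {t: 0 for t in ERROR_TYPES}
        for gold, pred in failures:
            counts[classify_failure(judge, gold, pred)] += 1
        return counts

Because the gold reasoning localizes the divergence point, the resulting counts can also be broken down per task type to see, for instance, whether a model fails mostly at grounding or at spatial logic.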

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: MMSI-Bench benchmark for multi-image spatial intelligence

Contribution 2: Human-centric benchmark construction methodology

Contribution 3: Automated error analysis pipeline using annotated reasoning

The full descriptions of these contributions are given under Claimed Contributions above.