Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations
Overview
Overall Novelty Assessment
The paper introduces STARE, a benchmark evaluating spatial cognition through multi-step visual simulation across 2D/3D transformations, cube net folding, tangram puzzles, and real-world scenarios. It resides in the 'Multi-Step Visual Transformation and Simulation Tasks' leaf, which contains six papers including the original work. This leaf sits within the broader 'Spatial Reasoning Benchmarks and Evaluation Frameworks' branch, indicating a moderately populated research direction focused specifically on sequential geometric manipulations requiring mental imagery rather than single-step or navigation-based tasks.
The taxonomy reveals neighboring evaluation approaches: 'Perspective-Taking and Viewpoint Transformation Evaluation' (three papers on mental rotation and viewpoint shifts), 'Multi-Image and Cross-View Spatial Reasoning Assessment' (two papers on 3D inference from multiple views), and 'Real-World Simulation and Qualitative Spatial Reasoning Benchmarks' (two papers on realistic 3D scenarios). STARE's emphasis on integrated tasks like tangram puzzles and cube net folding bridges abstract geometric transformations with practical assembly challenges, positioning it between purely synthetic benchmarks and domain-specific evaluations like cartography or construction safety assessments found in adjacent leaves.
Among thirty candidates examined, none clearly refute the three core contributions: the STARE benchmark itself (ten candidates, zero refutable), the evaluation framework comparing reasoning with and without intermediate visual simulations (ten candidates, zero refutable), and the comprehensive analysis of model limitations (ten candidates, zero refutable). The sibling papers in the same leaf—StepGame Benchmark, VisionCube, and others—address related sequential reasoning but differ in domain focus (board games, 3D rotations) or task granularity, suggesting STARE's combination of foundational transformations with integrated puzzles occupies a distinct niche within this limited search scope.
Based on the top-thirty semantic matches and taxonomy structure, STARE appears to contribute a novel task suite blending geometric primitives with complex assembly challenges. The absence of refutable prior work in this limited search does not guarantee exhaustive novelty but indicates that among closely related benchmarks examined, none directly anticipate STARE's specific combination of 2D/3D transformations, tangram puzzles, and cube net folding with explicit visual simulation evaluation.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce STARE, a comprehensive benchmark containing approximately 4,000 tasks spanning foundational geometric transformations (2D and 3D), integrated spatial reasoning (cube net folding and tangram puzzles), and real-world spatial reasoning (perspective and temporal reasoning). The benchmark is specifically designed to evaluate whether multimodal models can perform complex visual reasoning through multi-step visual simulations, similar to how humans solve spatial problems.
The authors develop a systematic evaluation framework that tests models under different conditions: with only questions, with textual step descriptions, and with explicit intermediate visual simulations. This framework enables fine-grained analysis of whether models can effectively leverage visual guidance versus relying solely on internal mental simulation capabilities.
The authors provide extensive experimental analysis demonstrating that current multimodal models struggle with complex spatial reasoning tasks requiring multi-step visual simulations, performing near random chance on tasks like cube net folding and tangram puzzles. They reveal that models exhibit inconsistent performance gains from visual simulations and identify specific failure modes including perception errors and inability to integrate visual context effectively.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[5] Advancing spatial reasoning in large language models: An in-depth evaluation and enhancement using the stepgame benchmark PDF
[12] LEGO-Puzzles: How Good Are MLLMs at Multi-Step Spatial Reasoning? PDF
[14] VisionCube: 3D-Aware Vision-Language Model for Multi-Step Spatial Reasoning PDF
[34] Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark PDF
[35] ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints PDF
Contribution Analysis
Detailed comparisons for each claimed contribution
STARE benchmark for evaluating spatial cognition through visual simulations
The authors introduce STARE, a comprehensive benchmark containing approximately 4,000 tasks spanning foundational geometric transformations (2D and 3D), integrated spatial reasoning (cube net folding and tangram puzzles), and real-world spatial reasoning (perspective and temporal reasoning). The benchmark is specifically designed to evaluate whether multimodal models can perform complex visual reasoning through multi-step visual simulations, similar to how humans solve spatial problems. A minimal sketch of how such a task might be represented follows the comparison papers below.
[18] Govig: Goal-conditioned visual navigation instruction generation PDF
[51] Visulogic: A benchmark for evaluating visual reasoning in multi-modal large language models PDF
[52] Mind the gap: Benchmarking spatial reasoning in vision-language models PDF
[53] Thinking in space: How multimodal large language models see, remember, and recall spaces PDF
[54] Visfactor: Benchmarking fundamental visual cognition in multimodal large language models PDF
[55] What is the visual cognition gap between humans and multimodal LLMs? PDF
[56] Multi-modal learning for geospatial vegetation forecasting PDF
[57] 11plus-bench: Demystifying multimodal LLM spatial reasoning with cognitive-inspired analysis PDF
[58] Rank2Tell: A Multimodal Driving Dataset for Joint Importance Ranking and Reasoning PDF
[59] SPATIA: Multimodal Model for Prediction and Generation of Spatial Cell Phenotypes PDF
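To make the described task composition concrete, the following is a minimal sketch of how a single benchmark item could be represented, assuming a multiple-choice format with an initial image, optional textual step descriptions, and optional intermediate simulation images. The class and field names (SpatialTask, TaskFamily, step_descriptions, simulation_images) are hypothetical illustrations, not STARE's actual data schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class TaskFamily(Enum):
    """Hypothetical grouping mirroring the task families described above."""
    TRANSFORMATION_2D = "2d_transformation"
    TRANSFORMATION_3D = "3d_transformation"
    CUBE_NET_FOLDING = "cube_net_folding"
    TANGRAM_PUZZLE = "tangram_puzzle"
    PERSPECTIVE_REASONING = "perspective_reasoning"
    TEMPORAL_REASONING = "temporal_reasoning"


@dataclass
class SpatialTask:
    """One benchmark item; field names are illustrative, not STARE's actual schema."""
    task_id: str
    family: TaskFamily
    question: str
    input_image: str                    # path to the initial scene or diagram
    answer_options: List[str]           # multiple-choice answer candidates
    correct_option: int                 # index into answer_options
    step_descriptions: List[str] = field(default_factory=list)  # textual steps, when available
    simulation_images: List[str] = field(default_factory=list)  # intermediate visual states, when available
```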
Evaluation framework with and without intermediate visual simulations
The authors develop a systematic evaluation framework that tests models under different conditions: with only questions, with textual step descriptions, and with explicit intermediate visual simulations. This framework enables fine-grained analysis of whether models can effectively leverage visual guidance versus relying solely on internal mental simulation capabilities. A minimal sketch of this three-condition evaluation loop follows the comparison papers below.
[3] Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation PDF
[8] Spatial Mental Modeling from Limited Views PDF
[31] Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning. PDF
[47] Clinical trainee performance on task-based AR/VR-guided surgical simulation is correlated with their 3D image spatial reasoning scores PDF
[60] Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning PDF
[61] SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model PDF
[62] SSR: Enhancing Depth Perception in Vision-Language Models via Rationale-Guided Spatial Reasoning PDF
[63] VisualTrans: A Benchmark for Real-World Visual Transformation Reasoning PDF
[64] Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs PDF
[65] EscapeCraft: A 3D Room Escape Environment for Benchmarking Complex Multimodal Reasoning Ability PDF
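To illustrate the three-condition setup described above, here is a minimal evaluation-loop sketch. It assumes the hypothetical task record sketched earlier; the condition labels, the prompt assembly, and the model_fn wrapper are illustrative assumptions, not the paper's actual harness.

```python
from typing import Callable, Dict, Iterable, List

# Three prompting conditions as described above; labels are illustrative.
CONDITIONS = ("question_only", "with_text_steps", "with_visual_simulation")


def build_prompt(task, condition: str) -> Dict:
    """Assemble one multimodal query for a task under a given condition."""
    images: List[str] = [task.input_image]
    text = task.question
    if condition == "with_text_steps":
        text += "\nIntermediate steps:\n" + "\n".join(task.step_descriptions)
    elif condition == "with_visual_simulation":
        images += task.simulation_images  # expose intermediate visual states explicitly
    return {"text": text, "images": images, "options": task.answer_options}


def evaluate(model_fn: Callable[[Dict], int], tasks: Iterable) -> Dict[str, float]:
    """Accuracy per condition; model_fn stands in for any MLLM wrapper that
    returns the index of the chosen answer option."""
    correct = {c: 0 for c in CONDITIONS}
    total = 0
    for task in tasks:
        total += 1
        for condition in CONDITIONS:
            if model_fn(build_prompt(task, condition)) == task.correct_option:
                correct[condition] += 1
    return {c: correct[c] / max(total, 1) for c in CONDITIONS}
```

Comparing the per-condition accuracies returned by evaluate is what supports the fine-grained analysis described above: if accuracy under with_visual_simulation does not exceed question_only, the model is not benefiting from the explicit visual guidance.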
Comprehensive analysis revealing model limitations in visual simulation
The authors provide extensive experimental analysis demonstrating that current multimodal models struggle with complex spatial reasoning tasks requiring multi-step visual simulations, performing near random chance on tasks like cube net folding and tangram puzzles. They reveal that models exhibit inconsistent performance gains from visual simulations and identify specific failure modes including perception errors and inability to integrate visual context effectively.
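As a rough illustration of what "near random chance" means operationally, the sketch below compares an observed accuracy against the random-guessing baseline using a normal approximation to the binomial. The numbers in the example call are placeholders chosen for illustration, not results reported in the paper.

```python
import math


def chance_level_z(correct: int, total: int, num_options: int) -> float:
    """z-score of observed accuracy against random guessing (normal
    approximation to the binomial); values near zero indicate performance
    statistically indistinguishable from chance."""
    p_chance = 1.0 / num_options
    observed = correct / total
    std_err = math.sqrt(p_chance * (1.0 - p_chance) / total)
    return (observed - p_chance) / std_err


# Placeholder example: 110 correct out of 400 four-option questions.
print(round(chance_level_z(correct=110, total=400, num_options=4), 2))  # ~1.15
```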