Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: spatial reasoning; visual reasoning
Abstract:

Spatial cognition is essential for human intelligence, enabling problem-solving through visual simulations rather than solely relying on verbal reasoning. However, existing AI benchmarks primarily assess verbal reasoning, neglecting the complexities of non-verbal, multi-step visual simulation. We introduce STARE (Spatial Transformations and Reasoning Evaluation), a benchmark designed to rigorously evaluate multimodal large language models on tasks better solved through multi-step visual simulation. STARE features 3K tasks spanning foundational geometric transformations (2D and 3D), integrated spatial reasoning (cube net folding and tangram puzzles), and real-world spatial reasoning (perspective and temporal reasoning), reflecting practical cognitive challenges like object assembly, mechanical diagram interpretation, and everyday spatial navigation. Our evaluations show that models excel at reasoning over simpler 2D transformations but perform close to random chance on more complex tasks, such as 3D cube net folding and tangram puzzles, that require multi-step visual simulation. Humans achieve near-perfect accuracy but take considerable time (up to 28.9s) on complex tasks, speeding up significantly (by 7.5 seconds on average) when given intermediate visual simulations. In contrast, models exhibit inconsistent performance gains from visual simulations, improving on most tasks but declining in specific cases such as tangram puzzles (GPT-4o, o1) and cube net folding (Claude-3.5, Gemini-2.0 Flash), indicating that models may not know how to effectively leverage intermediate visual information.

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces STARE, a benchmark evaluating spatial cognition through multi-step visual simulation across 2D/3D transformations, cube net folding, tangram puzzles, and real-world scenarios. It resides in the 'Multi-Step Visual Transformation and Simulation Tasks' leaf, which contains six papers including the original work. This leaf sits within the broader 'Spatial Reasoning Benchmarks and Evaluation Frameworks' branch, indicating a moderately populated research direction focused specifically on sequential geometric manipulations requiring mental imagery rather than single-step or navigation-based tasks.

The taxonomy reveals neighboring evaluation approaches: 'Perspective-Taking and Viewpoint Transformation Evaluation' (three papers on mental rotation and viewpoint shifts), 'Multi-Image and Cross-View Spatial Reasoning Assessment' (two papers on 3D inference from multiple views), and 'Real-World Simulation and Qualitative Spatial Reasoning Benchmarks' (two papers on realistic 3D scenarios). STARE's emphasis on integrated tasks like tangram puzzles and cube net folding bridges abstract geometric transformations with practical assembly challenges, positioning it between purely synthetic benchmarks and domain-specific evaluations like cartography or construction safety assessments found in adjacent leaves.

Among thirty candidates examined, none clearly refute the three core contributions: the STARE benchmark itself (ten candidates, zero refutable), the evaluation framework comparing reasoning with and without intermediate visual simulations (ten candidates, zero refutable), and the comprehensive analysis of model limitations (ten candidates, zero refutable). The sibling papers in the same leaf—StepGame Benchmark, VisionCube, and others—address related sequential reasoning but differ in domain focus (board games, 3D rotations) or task granularity, suggesting STARE's combination of foundational transformations with integrated puzzles occupies a distinct niche within this limited search scope.

Based on the top-thirty semantic matches and taxonomy structure, STARE appears to contribute a novel task suite blending geometric primitives with complex assembly challenges. The absence of refutable prior work in this limited search does not guarantee exhaustive novelty but indicates that among closely related benchmarks examined, none directly anticipate STARE's specific combination of 2D/3D transformations, tangram puzzles, and cube net folding with explicit visual simulation evaluation.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: evaluating spatial reasoning through multi-step visual simulation. The field has organized itself around several complementary branches. The largest branch, Spatial Reasoning Benchmarks and Evaluation Frameworks, encompasses diverse diagnostic tasks, ranging from multi-step visual transformations like those in StepGame Benchmark[5] and LEGO Puzzles[12] to static spatial relation tests and dynamic simulation challenges such as VisionCube[14]. A second major branch, Spatial Reasoning Enhancement Methods for Vision-Language Models, explores techniques to improve model performance, including chain-of-thought prompting, mental imagery simulation approaches like Mental Imagery Simulation[3], and specialized architectural modifications. Application Domains for Spatial Reasoning demonstrates how these capabilities transfer to real-world settings (navigation, robotics, autonomous driving, and GUI interaction), while Cognitive and Theoretical Foundations of Spatial Reasoning draws on psychology and neuroscience to inform computational design. A smaller Auxiliary Studies branch provides methodological tools and cross-cutting analyses.

Within the benchmarking landscape, a particularly active line of work focuses on multi-step visual transformation and simulation tasks that require models to predict the outcome of sequential physical or geometric operations. Unfolding Spatial Cognition[0] sits squarely in this cluster, emphasizing iterative visual state changes that test whether models can mentally simulate unfolding processes. Nearby efforts like StepGame Benchmark[5] and VisionCube[14] similarly probe step-by-step reasoning but differ in their choice of domain: StepGame uses board-game-like scenarios while VisionCube targets 3D cube rotations. In contrast, Gen ViRe[34] and ORIGAMISPACE[35] explore generative or origami-specific transformations, highlighting trade-offs between procedural fidelity and task complexity.

Across these works, open questions persist about the granularity of intermediate supervision, the role of explicit mental models versus end-to-end learning, and how well performance on synthetic benchmarks transfers to embodied or real-world spatial tasks.

Claimed Contributions

STARE benchmark for evaluating spatial cognition through visual simulations

The authors introduce STARE, a comprehensive benchmark containing approximately 4,000 tasks spanning foundational geometric transformations (2D and 3D), integrated spatial reasoning (cube net folding and tangram puzzles), and real-world spatial reasoning (perspective and temporal reasoning). The benchmark is specifically designed to evaluate whether multimodal models can perform complex visual reasoning through multi-step visual simulations, similar to how humans solve spatial problems (a hypothetical task-record sketch follows this entry).

10 retrieved papers
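To make the benchmark's composition concrete, below is a minimal, hypothetical sketch of what a single STARE-style task record might look like. The schema and every field name (category, step_texts, simulation_frames, and so on) are illustrative assumptions made for this report, not the authors' released format.

```python
from dataclasses import dataclass, field

# Hypothetical schema for one STARE-style task record.
# All field names are illustrative assumptions, not the released format.
@dataclass
class SpatialTask:
    task_id: str
    category: str              # e.g., "2d_transformation", "cube_net_folding", "tangram"
    question: str              # the multiple-choice question text
    input_images: list[str]    # paths to the initial-state image(s)
    choices: list[str]
    answer: str
    step_texts: list[str] = field(default_factory=list)         # textual step descriptions
    simulation_frames: list[str] = field(default_factory=list)  # intermediate visual states

# A toy cube-net-folding item (made-up paths and content).
example = SpatialTask(
    task_id="cube_net_0001",
    category="cube_net_folding",
    question="Which cube can be folded from the net shown?",
    input_images=["images/net_0001.png"],
    choices=["A", "B", "C", "D"],
    answer="C",
    step_texts=["Fold the left flap up.", "Fold the top flap back."],
    simulation_frames=["images/net_0001_step1.png", "images/net_0001_step2.png"],
)
```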
Evaluation framework with and without intermediate visual simulations

The authors develop a systematic evaluation framework that tests models under three conditions: question only, question plus textual step-by-step descriptions, and question plus explicit intermediate visual simulations. This framework enables fine-grained analysis of whether models can effectively leverage visual guidance versus relying solely on internal mental simulation capabilities (a sketch of how such conditions might be assembled follows this entry).

10 retrieved papers
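A minimal sketch of how the three conditions might be assembled into a multimodal prompt, reusing the hypothetical SpatialTask record sketched above. The condition names and the message format mimic common multimodal chat APIs and are assumptions for illustration, not the authors' actual evaluation harness.

```python
# Hypothetical assembly of the three evaluation conditions.
# Condition names and message format are illustrative assumptions.
def build_input(task, condition: str) -> list[dict]:
    content = [{"type": "image", "path": p} for p in task.input_images]
    content.append({"type": "text", "text": task.question})

    if condition == "text_steps":
        # Question plus textual descriptions of each intermediate step.
        steps = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(task.step_texts))
        content.append({"type": "text", "text": steps})
    elif condition == "visual_sim":
        # Question plus the intermediate simulation frames as extra images.
        content += [{"type": "image", "path": p} for p in task.simulation_frames]
    elif condition != "question_only":
        raise ValueError(f"unknown condition: {condition}")

    return [{"role": "user", "content": content}]

# Usage: build_input(example, "visual_sim") yields one user turn containing
# the initial image(s), the question text, and the intermediate frames.
```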
Comprehensive analysis revealing model limitations in visual simulation

The authors provide extensive experimental analysis demonstrating that current multimodal models struggle with complex spatial reasoning tasks requiring multi-step visual simulation, performing near random chance on tasks like cube net folding and tangram puzzles. They reveal that models exhibit inconsistent performance gains from visual simulations and identify specific failure modes, including perception errors and an inability to integrate visual context effectively (an illustrative chance-level check follows this entry).

10 retrieved papers
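To make the "near random chance" claim concrete, here is a small illustrative check, with made-up numbers rather than the paper's results, of whether an observed accuracy on four-choice items is statistically distinguishable from the 25% guessing rate. The binomial test is a standard choice for this; the report does not state that the authors used it.

```python
from scipy.stats import binomtest  # requires scipy >= 1.7

# Illustrative numbers only: 54 correct out of 200 four-choice items (27%).
n_items, n_correct, chance = 200, 54, 0.25
result = binomtest(n_correct, n_items, p=chance, alternative="two-sided")
print(f"accuracy = {n_correct / n_items:.1%}, p-value vs. chance = {result.pvalue:.3f}")
# A large p-value means the null hypothesis "the model guesses at the 25%
# chance rate" cannot be rejected, i.e., performance is near chance.
```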

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

STARE benchmark for evaluating spatial cognition through visual simulations

The authors introduce STARE, a comprehensive benchmark containing approximately 4,000 tasks spanning foundational geometric transformations (2D and 3D), integrated spatial reasoning (cube net folding and tangram puzzles), and real-world spatial reasoning (perspective and temporal reasoning). The benchmark is specifically designed to evaluate whether multimodal models can perform complex visual reasoning through multi-step visual simulations, similar to how humans solve spatial problems.

Contribution

Evaluation framework with and without intermediate visual simulations

The authors develop a systematic evaluation framework that tests models under three conditions: question only, question plus textual step-by-step descriptions, and question plus explicit intermediate visual simulations. This framework enables fine-grained analysis of whether models can effectively leverage visual guidance versus relying solely on internal mental simulation capabilities.

Contribution

Comprehensive analysis revealing model limitations in visual simulation

The authors provide extensive experimental analysis demonstrating that current multimodal models struggle with complex spatial reasoning tasks requiring multi-step visual simulation, performing near random chance on tasks like cube net folding and tangram puzzles. They reveal that models exhibit inconsistent performance gains from visual simulations and identify specific failure modes, including perception errors and an inability to integrate visual context effectively.
