What Lies Beyond the View? Actively Constructing Spatial Beliefs in Foundation Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Large Language Model, Vision-Language Model, Spatial Reasoning, Spatial Agent, Active Exploration
Abstract:

Current foundation models can answer spatial reasoning questions about a given image or text, yet they lack the fundamental ability to build a genuine spatial understanding of an environment through active exploration. This reflects a critical blind spot in prevailing evaluation protocols, which predominantly test passive reasoning on curated data rather than the active construction of knowledge under uncertainty. To address this, we introduce Theory of Space (ToS), a new framework analogous to the Theory of Mind. While Theory of Mind concerns an agent's ability to model the hidden mental states of others, ToS concerns its ability to construct, update, and utilize an internal belief about the unobserved structure of its spatial environment from local, incomplete observations. We implement ToS with a comprehensive benchmark featuring both text-based and visual environments. Instead of performing specific tasks in such environments, the primary objective is to build a complete and accurate spatial belief through curiosity-driven exploration. A core innovation of our framework is the direct probing of this internal belief: we prompt models to explicitly present their cognitive map at each step, allowing us to measure not only task performance but also the quality, consistency, and evolution of the underlying spatial model itself. By evaluating state-of-the-art models as both active explorers and passive reasoners (using logs from scripted proxy agents), we disentangle exploration strategy from reasoning ability. Our analysis reveals common failure modes in spatial belief management, such as egomotion update errors and the inability to maintain a globally consistent map. The ToS framework provides the concepts and tools necessary to evaluate and build agents with more robust, human-like spatial intelligence.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces Theory of Space (ToS), a framework for evaluating how foundation models actively construct spatial beliefs through exploration, analogous to Theory of Mind for mental state modeling. It resides in the 'Theory-Driven Spatial Belief Frameworks' leaf, which contains only three papers total, including this work and two siblings. This represents a notably sparse research direction within the broader taxonomy of 48 papers across the field, suggesting the paper targets a relatively underexplored conceptual niche focused on principled frameworks for spatial belief construction rather than task-specific navigation or perception methods.

The taxonomy shows that neighboring research directions are similarly sized or only slightly larger: 'Memory-Augmented Spatial Reasoning' (3 papers), 'Foundation Model-Guided Exploration' (4 papers), and 'Zero-Shot Object Navigation' (4 papers) all address related but distinct aspects of spatial intelligence. The sibling papers in the same leaf, Spatial Schema Intuitions and Adaptive World Models, examine cognitive primitives and dynamic model adaptation respectively, whereas ToS emphasizes the active exploration process itself. The taxonomy's scope and exclude notes clarify that ToS belongs here because it proposes a theoretical evaluation framework rather than applying existing methods to specific tasks, distinguishing it from application-oriented categories.

Among 30 candidates examined through semantic search, the contribution-level analysis shows varied novelty profiles. The ToS framework itself (10 candidates examined, 0 refutable) and the comprehensive benchmark (10 candidates examined, 0 refutable) appear to have limited direct prior work within the search scope. However, the direct probing mechanism for internal spatial beliefs (10 candidates examined, 1 refutable) shows at least one candidate providing overlapping prior work. This suggests that while the overarching framework may be relatively novel, specific technical components like belief probing have some precedent in the examined literature, though the limited search scale means substantial related work could exist beyond these 30 candidates.

Given the sparse taxonomy leaf and limited search scope, the work appears to occupy a conceptual space with relatively few direct competitors among the examined papers. The framework's emphasis on curiosity-driven exploration and explicit cognitive map probing distinguishes it from task-oriented navigation benchmarks, though the analysis covers only the top 30 semantic matches and does not claim exhaustive coverage of all potentially relevant spatial reasoning literature.

Taxonomy

Core-task Taxonomy Papers: 48
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 1

Research Landscape Overview

Core task: Actively constructing spatial beliefs through exploration in foundation models. The field encompasses a diverse set of approaches that address how agents build, maintain, and reason about spatial knowledge in embodied settings. At the highest level, the taxonomy distinguishes between works focused on spatial belief construction and cognitive mapping (which develop explicit or implicit representations of environments), exploration strategies and active perception (which determine how agents gather spatial information), embodied navigation and object search (which apply spatial reasoning to goal-directed tasks), reinforcement learning and interactive decision-making (which learn policies through environmental interaction), benchmarks and evaluation frameworks (which standardize assessment), foundation models for robotics and embodied AI (which leverage large pretrained models for spatial tasks), domain-specific and applied spatial systems (which target particular application areas), and cognitive and learning sciences perspectives (which draw on human spatial cognition research).

Representative works such as SpatialVLA[1] and Embodied-r[3] illustrate how foundation models are being adapted to spatial reasoning, while Voronav[2] and SSR-ZSON[8] exemplify navigation-centric approaches. A particularly active line of work centers on how agents should balance exploration with exploitation when spatial knowledge is incomplete or uncertain, as seen in Explore Until Confident[6] and Adaptive World Models[13]. Another contrasting theme involves whether to rely on end-to-end learned representations versus structured symbolic or schema-based spatial models, a tension visible across Foundation Models Hypothesis Testing[7] and Spatial Schema Intuitions[4].

Beyond the View[0] sits within the theory-driven spatial belief frameworks cluster, emphasizing principled mechanisms for belief updating during exploration. Compared to Spatial Schema Intuitions[4], which examines cognitive primitives for spatial understanding, and Adaptive World Models[13], which focuses on dynamic model adaptation, Beyond the View[0] appears to prioritize the active construction process itself: how agents iteratively refine spatial hypotheses by strategically choosing where to look next. This positioning highlights ongoing questions about the interplay between model architecture, exploration policy, and the granularity of spatial representations needed for robust embodied intelligence.

Claimed Contributions

Theory of Space (ToS) framework

The authors propose ToS as a conceptual framework for evaluating an agent's ability to actively construct, update, and utilize an internal spatial belief from partial observations. Unlike Theory of Mind, which models hidden mental states of others, ToS models the uncertain, unobserved structure of physical space through curiosity-driven exploration.

10 retrieved papers
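
To make the idea of updating an internal spatial belief from partial, egocentric observations more concrete, the following is a minimal sketch in Python. It is not the authors' implementation: the class name `SpatialBelief`, its fields, the 2D coordinate convention (heading 0 faces +y, clockwise positive), and the bearing-plus-distance observation format are all assumptions made for illustration. It shows the kind of egomotion bookkeeping that the abstract identifies as a common failure mode.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SpatialBelief:
    """Hypothetical belief state: agent pose plus estimated object positions."""
    x: float = 0.0
    y: float = 0.0
    heading_deg: float = 0.0  # assumed convention: 0 = facing +y, clockwise positive
    objects: dict = field(default_factory=dict)  # name -> (x, y) in the world frame

    def rotate(self, delta_deg: float) -> None:
        # Egomotion update for a turn; positive values turn clockwise.
        self.heading_deg = (self.heading_deg + delta_deg) % 360.0

    def move_forward(self, distance: float) -> None:
        # Egomotion update for a forward step along the current heading.
        theta = math.radians(self.heading_deg)
        self.x += distance * math.sin(theta)
        self.y += distance * math.cos(theta)

    def observe(self, name: str, bearing_deg: float, distance: float) -> None:
        # Convert an egocentric observation (bearing relative to the heading,
        # clockwise positive) into a world-frame position and store it.
        theta = math.radians(self.heading_deg + bearing_deg)
        self.objects[name] = (
            self.x + distance * math.sin(theta),
            self.y + distance * math.cos(theta),
        )


belief = SpatialBelief()
belief.observe("lamp", bearing_deg=0.0, distance=2.0)    # straight ahead
belief.rotate(90.0)                                       # turn right
belief.move_forward(1.0)
belief.observe("chair", bearing_deg=-90.0, distance=3.0)  # to the agent's left
```

In this sketch, forgetting to apply `rotate` before converting the second observation would place the chair in the wrong world-frame location, which is one way an egomotion update error can corrupt an otherwise consistent map.
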
Comprehensive benchmark for active spatial belief construction

The authors develop a benchmark that evaluates agents through active exploration in procedurally generated multi-room environments. The benchmark includes both text-based and vision-based modalities, scripted proxy agents for disentangling exploration from reasoning, and a suite of spatial cognition tasks covering route and survey knowledge.

10 retrieved papers
Direct probing mechanism for internal spatial beliefs

The authors introduce a method to explicitly probe the agent's internal spatial representation by requiring it to externalize its cognitive map at any exploration step. This allows measurement of not only task performance but also the quality, consistency, and evolution of the underlying spatial model itself, moving beyond black-box evaluation.

10 retrieved papers, 1 marked Can Refute
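
As an illustration of what measuring the quality and consistency of an externalized cognitive map could look like, here is a hedged Python sketch. The report does not state the paper's actual metrics, so the functions `score_cognitive_map` and `map_drift`, the dictionary map format (object name to 2D coordinates in a shared frame), and the `tolerance` threshold are all hypothetical choices made for this example.

```python
import math


def score_cognitive_map(predicted, ground_truth, tolerance=1.0):
    """Compare a probed map against the true layout (hypothetical metric).

    Both arguments map object names to (x, y) positions in the same frame.
    """
    if not ground_truth:
        return {"coverage": 0.0, "accuracy": 0.0, "mean_error": None}
    shared = predicted.keys() & ground_truth.keys()
    errors = [math.dist(predicted[n], ground_truth[n]) for n in shared]
    return {
        # Fraction of true objects the agent included in its map.
        "coverage": len(shared) / len(ground_truth),
        # Fraction of true objects placed within the tolerance radius.
        "accuracy": sum(e <= tolerance for e in errors) / len(ground_truth),
        # Mean position error over the objects the agent did place.
        "mean_error": sum(errors) / len(errors) if errors else None,
    }


def map_drift(previous, current):
    """Mean displacement of objects present in two successive probes."""
    shared = previous.keys() & current.keys()
    if not shared:
        return 0.0
    return sum(math.dist(previous[n], current[n]) for n in shared) / len(shared)


truth = {"lamp": (0.0, 2.0), "chair": (1.0, 3.0), "door": (4.0, 0.0)}
probe_step_1 = {"lamp": (0.0, 2.0)}
probe_step_2 = {"lamp": (0.0, 2.0), "chair": (1.0, 5.0)}

scores = score_cognitive_map(probe_step_2, truth)
drift = map_drift(probe_step_1, probe_step_2)
```

Scoring each probe separately tracks how the map evolves over exploration, while the drift between successive probes gives a rough consistency signal: an object that has not been re-observed should not move in the agent's map.
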
