From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors
Overview
Overall Novelty Assessment
The paper proposes FALCON, a paradigm that injects 3D spatial tokens from foundation models into the action head of vision-language-action models, aiming to bridge the spatial reasoning gap in existing 2D-encoder-based VLAs. Within the taxonomy, it resides in the 'Spatial Foundation Model Integration' leaf under 'Spatial Representation and Encoding Methods', alongside two sibling papers. This leaf represents a focused research direction within a broader taxonomy of 45 papers across multiple branches, suggesting a moderately active but not overcrowded subfield dedicated to leveraging pretrained spatial models for VLA enhancement.
The taxonomy reveals neighboring leaves addressing related spatial challenges: 'Explicit 3D Input Integration' (3 papers) handles depth sensors and point clouds, 'Implicit Spatial Understanding from 2D' (3 papers) learns geometry without explicit sensors, and 'Ego-Centric and Position Encoding' (2 papers) focuses on position-based representations. FALCON's approach diverges by emphasizing foundation model priors over sensor-specific architectures or learned-from-scratch encoders. The taxonomy's scope notes clarify that methods training spatial encoders from scratch or using only vision-language models belong elsewhere, positioning FALCON's foundation-model-centric design as a distinct strategy within the spatial encoding landscape.
Among the 23 candidates examined, the core FALCON paradigm (Contribution 1) shows substantial prior work: 10 candidates were examined, 5 of them potentially refutable. The Embodied Spatial Model (Contribution 2) appears more novel, with 6 candidates examined and none clearly refutable. For the Spatial-Enhanced Action Head (Contribution 3), 7 candidates were examined, 1 of them refutable. These statistics reflect a limited semantic-search scope, not exhaustive coverage. Among the examined candidates, the paradigm's core idea of injecting spatial tokens into action heads has recognizable precedents, while the flexible modality-integration mechanism appears less explored within this search window.
Given the limited search scope of 23 candidates, the analysis suggests FALCON operates in a moderately explored area where spatial foundation model integration is an active concern, but specific architectural choices around action-head injection and modality flexibility may offer incremental distinctions. The taxonomy context indicates this is one approach among several competing strategies for spatial enhancement, with the field still exploring optimal integration points and architectural patterns for combining geometric priors with vision-language reasoning.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose a new architecture that integrates spatial tokens from foundation models directly into the action prediction component rather than the vision-language backbone. This design preserves language reasoning while providing robust geometric priors from RGB inputs alone.
The authors develop a spatial encoding module that can flexibly incorporate additional 3D inputs such as depth maps or camera poses when available, while maintaining strong performance with RGB-only input. This enables modality transferability without requiring model retraining.
The authors introduce a dedicated fusion mechanism that combines spatial tokens with semantic features at the action prediction stage. This approach avoids disrupting the pre-trained vision-language alignment while enabling precise spatial reasoning for robot control.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[37] VGGT-DP: Generalizable Robot Control via Vision Foundation Models
[43] Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos
Contribution Analysis
Detailed comparisons for each claimed contribution
FALCON paradigm for injecting 3D spatial tokens into VLA action head
The authors propose a new architecture that integrates spatial tokens from foundation models directly into the action prediction component rather than the vision-language backbone. This design preserves language reasoning while providing robust geometric priors from RGB inputs alone.
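As a hedged illustration of this injection paradigm, the routing can be sketched in plain Python. All module names and token representations below are hypothetical stand-ins (the paper's actual interfaces are not specified here); the point is only the claimed data flow: the vision-language backbone sees RGB and language alone, while the spatial foundation model's tokens join the stream only at the action head.

```python
# Hypothetical sketch of the claimed injection paradigm: spatial tokens
# bypass the vision-language backbone and enter only at the action head.
# Token streams are modeled as tagged lists; a real system would use tensors.

def vlm_backbone(rgb_patches, instruction):
    # 2D vision-language backbone: semantic tokens from RGB + text only,
    # so the pretrained vision-language alignment is left untouched.
    return [("semantic", tok) for tok in rgb_patches + instruction.split()]

def spatial_foundation_model(rgb_patches):
    # Frozen 3D foundation prior: spatial tokens from RGB alone.
    return [("spatial", f"geom({tok})") for tok in rgb_patches]

def action_head(semantic_tokens, spatial_tokens):
    # Fusion happens only here, at the action-prediction stage.
    return {"inputs": semantic_tokens + spatial_tokens,
            "action": "predicted_end_effector_command"}

def falcon_style_forward(rgb_patches, instruction):
    semantic = vlm_backbone(rgb_patches, instruction)   # no spatial tokens here
    spatial = spatial_foundation_model(rgb_patches)     # RGB-only geometric prior
    return action_head(semantic, spatial)

out = falcon_style_forward(["img_patch_0", "img_patch_1"], "pick up the mug")
```

The sketch makes the architectural claim checkable: spatial tokens never appear in the backbone's stream, only in the action head's.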
[1] SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Models
[2] 3DS-VLA: A 3D Spatial-Aware Vision-Language-Action Model for Robust Multi-Task Manipulation
[17] GeoVLA: Empowering 3D Representations in Vision-Language-Action Models
[46] PointVLA: Injecting the 3D World into Vision-Language-Action Models
[47] RetoVLA: Reusing Register Tokens for Spatial Reasoning in Vision-Language-Action Models
[11] Improving Vision-Language-Action Models via Chain-of-Affordance
[40] Spatial Forcing: Implicit Spatial Representation Alignment for Vision-Language-Action Models
[48] SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Models
[49] Toward Embodiment Equivariant Vision-Language-Action Policy
[50] MindMap: Spatial Memory in Deep Feature Maps for 3D Action Policies
Embodied Spatial Model for flexible 3D modality integration
The authors develop a spatial encoding module that can flexibly incorporate additional 3D inputs such as depth maps or camera poses when available, while maintaining strong performance with RGB-only input. This enables modality transferability without requiring model retraining.
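A minimal sketch of this kind of modality-flexible encoder, assuming a simple optional-argument interface (the function and token names are hypothetical, not taken from the paper): depth and camera pose are consumed when present, and the RGB-only path is a strict prefix of the full path, so switching sensor suites requires no retraining in this toy model.

```python
# Hypothetical sketch of a flexible spatial encoder: depth and camera pose
# are optional extras, and one unchanged code path serves every modality mix.

def encode_spatial(rgb_patches, depth=None, camera_pose=None):
    tokens = [f"rgb_geom({p})" for p in rgb_patches]  # always available
    if depth is not None:                             # used only when provided
        tokens += [f"depth({d})" for d in depth]
    if camera_pose is not None:
        tokens.append(f"pose({camera_pose})")
    return tokens

# RGB-only deployment and RGB+depth+pose deployment share the same encoder.
rgb_only = encode_spatial(["p0", "p1"])
full = encode_spatial(["p0", "p1"], depth=["d0"], camera_pose="T_world_cam")
```

The design choice illustrated here is that extra modalities only append tokens rather than change the encoder's signature or weights, which is what makes modality transferability possible without retraining.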
[57] Cross-Spatial Fusion and Dynamic-Range Particle Filter-Based FPGA-GPU Architecture for 1-ms RGB-Based Object Pose Tracking
[58] MoreFusion: Multi-Object Reasoning for 6D Pose Estimation from Volumetric Fusion
[59] Tracking and Planning with Spatial World Models
[60] Joint Estimation of Depth and Motion from a Monocular Endoscopy Image Sequence Using a Multi-Loss Rebalancing Network
[61] Embodied VideoAgent: Persistent Memory from Egocentric Videos and Embodied Sensors Enables Dynamic Scene Understanding
[62] A Spatial Pose Detection Method for Scrapers Based on Planar Vision and Laser Range Fusion
Spatial-Enhanced Action Head for multimodal fusion
The authors introduce a dedicated fusion mechanism that combines spatial tokens with semantic features at the action prediction stage. This approach avoids disrupting the pre-trained vision-language alignment while enabling precise spatial reasoning for robot control.
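One plausible form of such a fusion mechanism is cross-attention from semantic features to spatial tokens at the action stage; this is an assumed operator for illustration, and the paper's exact mechanism may differ. The sketch below uses toy 2-dimensional features; note the semantic stream is only additively updated, so nothing upstream of the action head is disturbed.

```python
import math

# Hypothetical fusion sketch: semantic queries cross-attend to spatial
# key/value tokens, and the result is added residually at the action stage.

def softmax(xs):
    m = max(xs)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attend(semantic, spatial):
    fused = []
    for q in semantic:
        # Attention weights of this semantic query over all spatial tokens.
        weights = softmax([dot(q, k) for k in spatial])
        # Weighted mixture of spatial tokens (keys double as values here).
        mixed = [sum(w * v[i] for w, v in zip(weights, spatial))
                 for i in range(len(q))]
        # Residual update: semantic content is preserved, spatial cues added.
        fused.append([qi + mi for qi, mi in zip(q, mixed)])
    return fused

semantic = [[1.0, 0.0], [0.0, 1.0]]   # toy semantic features
spatial = [[0.5, 0.5], [1.0, -1.0]]   # toy spatial tokens
fused = cross_attend(semantic, spatial)
```

Keeping the fusion residual and confining it to the action head is what lets this kind of design claim spatial precision without re-touching the pretrained vision-language alignment.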