Abstract:

Existing vision-language-action (VLA) models act in the 3D real world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. Recent 3D-integration techniques for VLAs either require specialized sensors and transfer poorly across modalities, or inject weak cues that lack geometry and degrade vision-language alignment. In this work, we introduce FALCON (From Spatial to Actions), a novel paradigm that injects rich 3D spatial tokens into the action head. FALCON leverages spatial foundation models to deliver strong geometric priors from RGB alone, and includes an Embodied Spatial Model that can optionally fuse depth or pose for higher fidelity when available, without retraining or architectural changes. To preserve language reasoning, spatial tokens are consumed by a Spatial-Enhanced Action Head rather than being concatenated into the vision-language backbone. These designs enable FALCON to address limitations in spatial representation, modality transferability, and alignment. In comprehensive evaluations across three simulation benchmarks and eleven real-world tasks, FALCON achieves state-of-the-art performance, consistently surpasses competitive baselines, and remains robust under clutter, spatial-prompt conditioning, and variations in object scale and height. Code will be released publicly.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes FALCON, a paradigm that injects 3D spatial tokens from foundation models into the action head of vision-language-action models, aiming to bridge the spatial reasoning gap in existing 2D-encoder-based VLAs. Within the taxonomy, it resides in the 'Spatial Foundation Model Integration' leaf under 'Spatial Representation and Encoding Methods', alongside two sibling papers. This leaf represents a focused research direction within a broader taxonomy of 45 papers across multiple branches, suggesting a moderately active but not overcrowded subfield dedicated to leveraging pretrained spatial models for VLA enhancement.

The taxonomy reveals neighboring leaves addressing related spatial challenges: 'Explicit 3D Input Integration' (3 papers) handles depth sensors and point clouds, 'Implicit Spatial Understanding from 2D' (3 papers) learns geometry without explicit sensors, and 'Ego-Centric and Position Encoding' (2 papers) focuses on position-based representations. FALCON's approach diverges by emphasizing foundation model priors over sensor-specific architectures or learned-from-scratch encoders. The taxonomy's scope notes clarify that methods training spatial encoders from scratch or using only vision-language models belong elsewhere, positioning FALCON's foundation-model-centric design as a distinct strategy within the spatial encoding landscape.

Among 23 candidates examined, the core FALCON paradigm (Contribution 1) shows substantial prior work: 10 candidates examined, 5 potentially refutable. The Embodied Spatial Model (Contribution 2) appears more novel, with 6 candidates examined and none clearly refutable. The Spatial-Enhanced Action Head (Contribution 3) examined 7 candidates with 1 refutable. These statistics reflect a limited semantic search scope, not exhaustive coverage. The paradigm's core idea of injecting spatial tokens into action heads has recognizable precedents among the examined candidates, while the flexible modality integration mechanism appears less explored within this search window.

Given the limited search scope of 23 candidates, the analysis suggests FALCON operates in a moderately explored area where spatial foundation model integration is an active concern, but specific architectural choices around action-head injection and modality flexibility may offer incremental distinctions. The taxonomy context indicates this is one approach among several competing strategies for spatial enhancement, with the field still exploring optimal integration points and architectural patterns for combining geometric priors with vision-language reasoning.

Taxonomy

Core-task Taxonomy Papers: 45
Claimed Contributions: 3
Contribution Candidate Papers Compared: 23
Refutable Papers: 6

Research Landscape Overview

Core task: Integrating spatial foundation priors into vision-language-action models for robotic manipulation. The field has organized itself around several complementary branches that address different facets of this integration challenge. Spatial Representation and Encoding Methods explore how to capture and embed geometric information—ranging from depth maps and point clouds to scene graphs and affordance fields—into model architectures. Spatial Reasoning and Grounding Mechanisms focus on connecting language instructions to physical locations and object relationships, enabling robots to understand where and how to act. Action Generation and Prediction branches develop techniques for translating multimodal inputs into executable motor commands, while Multimodal Integration and Perception addresses the fusion of vision, language, and sometimes tactile or proprioceptive signals. Training Paradigms and Data Utilization examines strategies for leveraging large-scale datasets and foundation models, Model Efficiency and Optimization tackles computational constraints, Benchmarking and Evaluation Frameworks provide standardized testbeds, and Specialized Applications and Task Domains target specific manipulation scenarios such as grasping or assembly.

Within this landscape, a particularly active line of work centers on directly incorporating spatial foundation models—such as depth estimators, segmentation networks, or geometric reasoning modules—into vision-language-action architectures. Papers like SpatialVLA[1], 3DS-VLA[2], and DepthVLA[28] exemplify efforts to enrich visual encoders with explicit 3D or depth cues, while GeoVLA[17] and GeoAware-VLA[34] emphasize geometric awareness for more precise spatial reasoning.
Spatial to Actions[0] situates itself within this cluster by proposing a framework that systematically integrates spatial foundation priors into the action-generation pipeline, aiming to bridge the gap between high-level semantic understanding and low-level geometric control. Compared to works like Evo-0[3] or InternVLA[4], which prioritize scaling and generalization across diverse tasks, Spatial to Actions[0] places stronger emphasis on leveraging pre-trained spatial representations to improve manipulation accuracy and robustness in geometrically demanding scenarios.

Claimed Contributions

FALCON paradigm for injecting 3D spatial tokens into VLA action head

The authors propose a new architecture that integrates spatial tokens from foundation models directly into the action prediction component rather than the vision-language backbone. This design preserves language reasoning while providing robust geometric priors from RGB inputs alone.

10 retrieved papers
Can Refute
Embodied Spatial Model for flexible 3D modality integration

The authors develop a spatial encoding module that can flexibly incorporate additional 3D inputs such as depth maps or camera poses when available, while maintaining strong performance with RGB-only input. This enables modality transferability without requiring model retraining.

6 retrieved papers
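To make the optional-modality claim concrete, the sketch below shows one plausible way a module could accept extra 3D inputs only when they exist, so a single set of weights serves both RGB-only and RGB+depth inference. This is a minimal numpy illustration under assumed shapes and an assumed token-concatenation mechanism; the function name `embodied_spatial_model` and all dimensions are hypothetical, not the paper's actual design.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # toy feature dimension

def embodied_spatial_model(rgb_feat, depth_feat=None, pose_feat=None):
    """Hypothetical optional-modality fusion: each extra 3D input, when
    present, contributes additional spatial tokens; when absent, it is
    simply skipped rather than zero-filled, so no retraining or
    architectural change is needed to switch modality sets."""
    tokens = [rgb_feat]
    if depth_feat is not None:
        tokens.append(depth_feat)
    if pose_feat is not None:
        tokens.append(pose_feat)
    # Downstream consumers see a variable-length spatial token sequence.
    return np.concatenate(tokens, axis=0)

rgb = rng.standard_normal((4, D))
depth = rng.standard_normal((4, D))

rgb_only = embodied_spatial_model(rgb)           # works from RGB alone
with_depth = embodied_spatial_model(rgb, depth)  # richer tokens when depth exists
print(rgb_only.shape, with_depth.shape)
```

The key design point this toy captures is that modality availability changes only the token sequence length, not the model's parameters.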
Spatial-Enhanced Action Head for multimodal fusion

The authors introduce a dedicated fusion mechanism that combines spatial tokens with semantic features at the action prediction stage. This approach avoids disrupting the pre-trained vision-language alignment while enabling precise spatial reasoning for robot control.

7 retrieved papers
Can Refute
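As a rough illustration of what fusing spatial tokens at the action stage (rather than inside the vision-language backbone) might look like, the sketch below uses cross-attention from semantic tokens onto spatial tokens followed by a linear action readout. Everything here is an assumption for illustration: the toy dimensions, the cross-attention choice, the 7-DoF output, and the function name `action_head` are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy feature dimension

# Backbone output: semantic vision-language tokens (frozen alignment preserved,
# since spatial tokens never enter the backbone in this sketch).
semantic_tokens = rng.standard_normal((4, D))
# Spatial foundation model output: 3D-aware tokens from the same RGB frame.
spatial_tokens = rng.standard_normal((6, D))

def action_head(semantic, spatial, w_out):
    """Toy action head: semantic tokens (queries) attend over spatial
    tokens (keys/values), fuse residually, then pool and project to an
    action vector. Weights are random placeholders."""
    scores = semantic @ spatial.T / np.sqrt(semantic.shape[1])
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)   # row-wise softmax
    fused = semantic + attn @ spatial          # residual spatial fusion
    return fused.mean(axis=0) @ w_out          # pooled features -> action

w_out = rng.standard_normal((D, 7))            # e.g. 7-DoF pose + gripper
action = action_head(semantic_tokens, spatial_tokens, w_out)
print(action.shape)  # (7,)
```

The point of the sketch is structural: spatial geometry influences the action prediction through the head's attention, while the pretrained vision-language pathway is left untouched.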

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

FALCON paradigm for injecting 3D spatial tokens into VLA action head

The authors propose a new architecture that integrates spatial tokens from foundation models directly into the action prediction component rather than the vision-language backbone. This design preserves language reasoning while providing robust geometric priors from RGB inputs alone.

Contribution

Embodied Spatial Model for flexible 3D modality integration

The authors develop a spatial encoding module that can flexibly incorporate additional 3D inputs such as depth maps or camera poses when available, while maintaining strong performance with RGB-only input. This enables modality transferability without requiring model retraining.

Contribution

Spatial-Enhanced Action Head for multimodal fusion

The authors introduce a dedicated fusion mechanism that combines spatial tokens with semantic features at the action prediction stage. This approach avoids disrupting the pre-trained vision-language alignment while enabling precise spatial reasoning for robot control.

From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors | Novelty Validation