Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-Language Navigation

ICLR 2026 Conference Submission

Anonymous Authors

OpenReview Score: 6.0

Keywords: navigation foundation models, Vision-and-Language Navigation

Abstract:

While recent large vision-language models (VLMs) have improved generalization in vision-language navigation (VLN), existing methods typically rely on end-to-end pipelines that map vision-language inputs directly to short-horizon discrete actions. Such designs often produce fragmented motions, incur high latency, and struggle with real-world challenges like dynamic obstacle avoidance. We propose DualVLN, the first dual-system VLN foundation model that synergistically integrates high-level reasoning with low-level action execution. System 2, a VLM-based global planner, "grounds slowly" by predicting mid-term waypoint goals via image-grounded reasoning. System 1, a lightweight, multi-modal conditioning Diffusion Transformer policy, "moves fast" by leveraging both explicit pixel goals and latent features from System 2 to generate smooth and accurate trajectories. The dual-system design enables robust real-time control and adaptive local decision-making in complex, dynamic environments. By decoupling training, the VLM retains its generalization, while System 1 achieves interpretable and effective local navigation. DualVLN outperforms prior methods across all VLN benchmarks, and real-world experiments demonstrate robust long-horizon planning and real-time adaptability in dynamic environments.

Disclaimer

This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.

NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.

If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes DualVLN, a dual-system VLN foundation model that decouples VLM-based global planning (System 2) from a diffusion-based local policy (System 1). It resides in the 'Vision-Language Navigation Dual-System Foundations' leaf, which contains only three papers total, including this work and two siblings. This indicates a relatively sparse research direction within the broader dual-system architecture space, suggesting the specific combination of VLN tasks with decoupled reasoning and trajectory execution remains underexplored compared to more crowded areas like autonomous driving VLA models or general VLA foundations.

The taxonomy reveals that dual-system architectures branch into unified adaptive models, decoupled reasoning-action frameworks, and domain-specific applications. DualVLN's leaf sits under 'Decoupled Dual-System Models with Separate Reasoning and Action Modules,' adjacent to robotic manipulation dual-systems and distinct from unified single-model approaches. Neighboring branches include general VLA architectures, spatial reasoning methods, and learning paradigms for VLN. The scope notes clarify that explicit dual-system separation for VLN distinguishes this work from end-to-end models and from manipulation-focused dual-systems, positioning it at the intersection of architectural modularity and navigation-specific challenges.

Among 21 candidates examined across three contributions, only one refutable pair emerged. The core dual-system VLN contribution examined 10 candidates with zero refutations, while the multi-modal diffusion transformer examined 6 with none. The Social-VLN benchmark contribution examined 5 candidates and found 1 potential overlap. This limited search scope—top-K semantic matches plus citation expansion—suggests that within the examined literature, the dual-system VLN architecture and diffusion-based local policy appear relatively novel, though the dynamic obstacle benchmark may have closer prior work. The small candidate pool and sparse taxonomy leaf indicate the analysis covers a focused but not exhaustive slice of the field.

Given the limited 21-candidate search and the sparse three-paper taxonomy leaf, the dual-system VLN framework appears to occupy a less crowded niche. The absence of refutations for the core architectural contributions among examined candidates suggests potential novelty, though the analysis does not cover the full breadth of VLN or diffusion policy literature. The benchmark contribution's single overlap warrants closer scrutiny of prior dynamic navigation evaluation methods.

Taxonomy

Research Landscape Overview

Core task: vision-language navigation with dual-system reasoning and control.

The field has evolved around several complementary branches that address how agents integrate visual perception, linguistic instructions, and action policies. At the highest level, Dual-System Architectures for Vision-Language-Action Models explore decoupled reasoning and action modules, often separating slow deliberative planning from fast reactive control, while Vision-Language-Action Model Architectures and Techniques focus on end-to-end or tightly integrated neural designs that unify perception and policy learning. Vision-Language Navigation Reasoning and Representation Methods emphasize spatial memory, graph-based scene understanding, and semantic grounding mechanisms, whereas Learning Paradigms and Robustness for Vision-Language Navigation investigate continual learning, domain generalization, and adversarial robustness. Finally, Control and Decision-Making Frameworks address hierarchical planning, attention mechanisms, and decision-theoretic formulations that guide navigation behavior. Representative works such as VLA Models Foundations[2] and VLM VLA Manipulation Survey[4] illustrate the breadth of architectural choices, while OneTwoVLA[5] and Fast in Slow[6] exemplify dual-system designs that balance deliberation with reactive execution.

A particularly active line of research contrasts tightly coupled end-to-end models with explicitly decoupled dual-system approaches. The former often achieve strong performance through large-scale pretraining and unified vision-language encoders, as seen in VLA Autonomous Driving[3] and RationalVLA[9], but may struggle with interpretability and fine-grained spatial reasoning. In contrast, decoupled architectures, such as Ground Slow Move Fast[0] and its closely related variant Ground Slow Move Fast Dual[10], separate high-level grounding and planning from low-level motor control, enabling modular reasoning and potentially greater robustness in novel environments.

Ground Slow Move Fast[0] sits squarely within this dual-system paradigm, emphasizing the interplay between a slow grounding module that interprets instructions and a fast action module that executes navigation. Compared to FSR VLN[21], which also explores dual-stream reasoning, Ground Slow Move Fast[0] places stronger emphasis on explicit temporal decoupling and control handoff. This design choice reflects ongoing debates about modularity versus end-to-end optimization, and whether interpretable reasoning stages can improve generalization without sacrificing real-time responsiveness.

Claimed Contributions

DualVLN: A dual-system VLN foundation model

10 retrieved papers

The authors introduce DualVLN, a novel architecture that decouples vision-language navigation into two complementary systems: System 2 (a VLM-based global planner for pixel-goal grounding) and System 1 (a lightweight diffusion transformer policy for trajectory generation). This dual-system design enables asynchronous inference, robust real-time control, and adaptive local decision-making in dynamic environments.
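To make the two update rates in this description concrete, the following is a minimal single-threaded sketch of a slow planner feeding a fast controller. It is not the authors' implementation: `slow_planner`, `fast_policy`, `run_episode`, the 2D `Pose` type, and all rates and step sizes are hypothetical stand-ins chosen for illustration, and a real system would presumably run the VLM in a separate process rather than on a tick counter.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose:
    x: float
    y: float


def slow_planner(pose: Pose, final_goal: Pose, horizon: float = 2.0) -> Pose:
    """Hypothetical System 2 stand-in: pick a mid-term waypoint toward the goal."""
    dx, dy = final_goal.x - pose.x, final_goal.y - pose.y
    dist = math.hypot(dx, dy)
    if dist <= horizon:
        return Pose(final_goal.x, final_goal.y)
    scale = horizon / dist
    return Pose(pose.x + dx * scale, pose.y + dy * scale)


def fast_policy(pose: Pose, waypoint: Pose, step: float = 0.25) -> Pose:
    """Hypothetical System 1 stand-in: take one small step toward the waypoint."""
    dx, dy = waypoint.x - pose.x, waypoint.y - pose.y
    dist = math.hypot(dx, dy)
    if dist <= step:
        return Pose(waypoint.x, waypoint.y)
    return Pose(pose.x + dx / dist * step, pose.y + dy / dist * step)


def run_episode(start: Pose, goal: Pose, ticks: int = 40, replan_every: int = 8):
    """Step the fast loop every tick; refresh the waypoint every `replan_every` ticks."""
    pose, waypoint, planner_calls = start, start, 0
    for t in range(ticks):
        if t % replan_every == 0:  # slow system: low-frequency update
            waypoint = slow_planner(pose, goal)
            planner_calls += 1
        pose = fast_policy(pose, waypoint)  # fast system: runs on every tick
    return pose, planner_calls


final_pose, calls = run_episode(Pose(0.0, 0.0), Pose(3.0, 4.0))
```

In this toy setup the controller is stepped 40 times while the planner is queried only on every eighth tick, which is the kind of rate separation the contribution describes.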

Multi-modal conditioning diffusion transformer with latent goal representation

6 retrieved papers

The authors develop a diffusion-based policy (System 1) that conditions on both explicit pixel goals and implicit latent goal embeddings extracted from the VLM through learnable queries. This multi-modal conditioning approach enables the policy to generate continuous, smooth trajectories at high frequency while maintaining strong coordination with the reasoning system.
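The report does not specify how the learnable queries extract latent goal embeddings. One common mechanism is cross-attention from a fixed set of query vectors over the VLM's token features, and the sketch below illustrates that idea under this assumption. The function names, dimensions, and random values are hypothetical, the projection matrices of a full attention layer are omitted, and the diffusion transformer that would consume the resulting conditioning vector is not shown.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def query_latent_goal(queries, vlm_tokens):
    """Single-head cross-attention without projections.

    Each query returns an attention-weighted average of the token features.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    latents = []
    for q in queries:
        weights = softmax([dot(q, tok) * scale for tok in vlm_tokens])
        latent = [
            sum(w * tok[i] for w, tok in zip(weights, vlm_tokens))
            for i in range(len(q))
        ]
        latents.append(latent)
    return latents


def build_condition(pixel_goal, image_size, latents):
    """Concatenate the normalized pixel goal (explicit) with the latents (implicit)."""
    u, v = pixel_goal
    width, height = image_size
    explicit = [u / width, v / height]
    implicit = [x for latent in latents for x in latent]
    return explicit + implicit


rng = random.Random(0)
dim, num_tokens, num_queries = 4, 6, 2
# Random placeholders: real token features would come from the VLM,
# and real queries would be trained parameters.
vlm_tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_tokens)]
queries = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_queries)]

latents = query_latent_goal(queries, vlm_tokens)
condition = build_condition((320, 120), (640, 480), latents)
```

The point of the sketch is only that the policy input can carry both an explicit goal (two normalized pixel coordinates) and a fixed-size implicit summary of the VLM's features, regardless of how many tokens the VLM produced.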

Social-VLN benchmark for dynamic obstacle avoidance evaluation

Can Refute

5 retrieved papers

The authors create Social-VLN, a new benchmark built upon R2R-CE that incorporates dynamic humanoid agents strategically placed along navigation trajectories. This benchmark evaluates navigation systems on social awareness, dynamic obstacle avoidance, and trajectory recovery capabilities, introducing a Human Collision Rate metric to quantify safety in human-centric environments.
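The report names the Human Collision Rate metric but does not define it. A plausible reading, assumed here purely for illustration, is the fraction of evaluation episodes in which the agent collides with a humanoid at least once. The `Episode` record and both function names below are hypothetical, and the benchmark's actual definition may differ (for example, it could count collisions per step instead).

```python
from dataclasses import dataclass


@dataclass
class Episode:
    success: bool          # did the agent reach the goal?
    human_collisions: int  # number of contacts with humanoid agents


def human_collision_rate(episodes):
    """Assumed definition: share of episodes with at least one human collision."""
    if not episodes:
        raise ValueError("need at least one episode")
    collided = sum(1 for ep in episodes if ep.human_collisions > 0)
    return collided / len(episodes)


def success_rate(episodes):
    """Share of episodes that reached the goal, reported alongside the collision rate."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(1 for ep in episodes if ep.success) / len(episodes)


# Made-up results for four episodes, not data from the paper.
results = [
    Episode(success=True, human_collisions=0),
    Episode(success=True, human_collisions=2),
    Episode(success=False, human_collisions=0),
    Episode(success=True, human_collisions=0),
]
hcr = human_collision_rate(results)
sr = success_rate(results)
```

Reporting the two numbers together matters under this reading, because an agent could lower its collision rate simply by refusing to move.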

Core Task Comparisons

Comparisons with papers in the same taxonomy category

[10] Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation

Meng Wei, Chenyang Wan, Jiaqi Peng, Xiqian Yu, Yuqiang Yang, Delin Feng, Wenzhe Cai, Chenming Zhu, Tai Wang, Jiangmiao Pang, Xihui Liu (2025)

[21] FSR-VLN: Fast and Slow Reasoning for Vision-Language Navigation with Hierarchical Multi-modal Scene Graph

Xiaolin Zhou, Liu Liu, Tingyang Xiao, Yucheng Wang, Xinrui Meng, Maiyue Chen, Xinjie Wang, Wei Feng, Wei Sui, Zhizhong Su (2025)

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

DualVLN: A dual-system VLN foundation model

[5] OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning

Cannot Refute

[10] Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation

Cannot Refute

[12] Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation

Cannot Refute

[13] Omninav: A unified framework for prospective exploration and visual-language navigation

Cannot Refute

[17] A Dual Semantic-Aware Recurrent Global-Adaptive Network For Vision-and-Language Navigation

Cannot Refute

[30] JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation

Cannot Refute

[31] DOPE: Dual Object Perception-Enhancement Network for Vision-and-Language Navigation

Cannot Refute

[32] A VLM-Drone System for Indoor Navigation Assistance with Semantic Reasoning for the Visually Impaired

Cannot Refute

[33] Robix: A unified model for robot interaction, reasoning and planning

Cannot Refute

[34] Lovon: Legged open-vocabulary object navigator

Cannot Refute

Contribution

Multi-modal conditioning diffusion transformer with latent goal representation

[10] Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation

Cannot Refute

[35] Multimodal diffusion transformer: Learning versatile behavior from multimodal goals

Cannot Refute

[36] A survey on diffusion policy for robotic manipulation: Taxonomy, analysis, and future directions

Cannot Refute

[37] VDT-Auto: End-to-end Autonomous Driving with VLM-Guided Diffusion Transformers

Cannot Refute

[38] Efficient Virtuoso: A Latent Diffusion Transformer Model for Goal-Conditioned Trajectory Planning

Cannot Refute

[39] Diffusion Policy

Cannot Refute

Contribution

Social-VLN benchmark for dynamic obstacle avoidance evaluation

[25] HA-VLN: A Benchmark for Human-Aware Navigation in Discrete-Continuous Environments with Dynamic Multi-Human Interactions, Real-World Validation, and an …

Can Refute

[26] Conav: A benchmark for human-centered collaborative navigation

Cannot Refute

[27] Static and dynamic approaches for Embodied Social Navigation from the perspective of an autonomous agent

Cannot Refute

[28] SANGO: Socially Aware Navigation through Grouped Obstacles

Cannot Refute

[29] Emotional awareness based adaptive social navigation for humanoid robots

Cannot Refute
