Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-Language Navigation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: navigation foundation models, Vision-and-Language Navigation
Abstract:

While recent large vision-language models (VLMs) have improved generalization in vision-language navigation (VLN), existing methods typically rely on end-to-end pipelines that map vision-language inputs directly to short-horizon discrete actions. Such designs often produce fragmented motions, incur high latency, and struggle with real-world challenges like dynamic obstacle avoidance. We propose DualVLN, the first dual-system VLN foundation model that synergistically integrates high-level reasoning with low-level action execution. System 2, a VLM-based global planner, "grounds slowly" by predicting mid-term waypoint goals via image-grounded reasoning. System 1, a lightweight, multi-modal conditioning Diffusion Transformer policy, "moves fast" by leveraging both explicit pixel goals and latent features from System 2 to generate smooth and accurate trajectories. The dual-system design enables robust real-time control and adaptive local decision-making in complex, dynamic environments. By decoupling training, the VLM retains its generalization, while System 1 achieves interpretable and effective local navigation. DualVLN outperforms prior methods across all VLN benchmarks, and real-world experiments demonstrate robust long-horizon planning and real-time adaptability in dynamic environments.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes DualVLN, a dual-system VLN foundation model that decouples VLM-based global planning (System 2) from a diffusion-based local policy (System 1). It resides in the 'Vision-Language Navigation Dual-System Foundations' leaf, which contains only three papers: this work and two siblings. This indicates a relatively sparse research direction within the broader dual-system architecture space, suggesting that the specific combination of VLN tasks with decoupled reasoning and trajectory execution remains underexplored compared to more crowded areas like autonomous driving VLA models or general VLA foundations.

The taxonomy reveals that dual-system architectures branch into unified adaptive models, decoupled reasoning-action frameworks, and domain-specific applications. DualVLN's leaf sits under 'Decoupled Dual-System Models with Separate Reasoning and Action Modules,' adjacent to robotic manipulation dual-systems and distinct from unified single-model approaches. Neighboring branches include general VLA architectures, spatial reasoning methods, and learning paradigms for VLN. The scope notes clarify that explicit dual-system separation for VLN distinguishes this work from end-to-end models and from manipulation-focused dual-systems, positioning it at the intersection of architectural modularity and navigation-specific challenges.

Among the 21 candidates examined across the three contributions, only one refutable pair emerged. For the core dual-system VLN contribution, 10 candidates were examined with zero refutations; for the multi-modal diffusion transformer, 6 were examined with none. For the Social-VLN benchmark, 5 candidates were examined and 1 potential overlap was found. Within this limited search scope (top-K semantic matches plus citation expansion), the dual-system VLN architecture and diffusion-based local policy appear relatively novel, though the dynamic obstacle benchmark may have closer prior work. The small candidate pool and sparse taxonomy leaf indicate that the analysis covers a focused but not exhaustive slice of the field.
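
The per-contribution counts quoted above can be checked against the reported totals with a short tally. This is only a bookkeeping sketch: the numbers are copied from this report, while the dictionary keys and variable names are illustrative.

```python
# Candidate counts and refutable pairs per claimed contribution,
# as stated in this report. Names are illustrative labels only.
candidates_examined = {
    "dual-system VLN model": 10,
    "multi-modal conditioning diffusion transformer": 6,
    "Social-VLN benchmark": 5,
}
refutable_pairs = {
    "dual-system VLN model": 0,
    "multi-modal conditioning diffusion transformer": 0,
    "Social-VLN benchmark": 1,
}

total_candidates = sum(candidates_examined.values())
total_refutable = sum(refutable_pairs.values())

print(total_candidates, total_refutable)
```

The totals agree with the summary figures of 21 compared candidate papers and 1 refutable paper.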

Given the limited 21-candidate search and the sparse three-paper taxonomy leaf, the dual-system VLN framework appears to occupy a less crowded niche. The absence of refutations for the core architectural contributions among examined candidates suggests potential novelty, though the analysis does not cover the full breadth of VLN or diffusion policy literature. The benchmark contribution's single overlap warrants closer scrutiny of prior dynamic navigation evaluation methods.

Taxonomy

Core-task Taxonomy Papers: 24
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Paper: 1

Research Landscape Overview

Core task: vision-language navigation with dual-system reasoning and control.

The field has evolved around several complementary branches that address how agents integrate visual perception, linguistic instructions, and action policies. At the highest level, Dual-System Architectures for Vision-Language-Action Models explore decoupled reasoning and action modules, often separating slow deliberative planning from fast reactive control, while Vision-Language-Action Model Architectures and Techniques focus on end-to-end or tightly integrated neural designs that unify perception and policy learning. Vision-Language Navigation Reasoning and Representation Methods emphasize spatial memory, graph-based scene understanding, and semantic grounding mechanisms, whereas Learning Paradigms and Robustness for Vision-Language Navigation investigate continual learning, domain generalization, and adversarial robustness. Finally, Control and Decision-Making Frameworks address hierarchical planning, attention mechanisms, and decision-theoretic formulations that guide navigation behavior. Representative works such as VLA Models Foundations[2] and VLM VLA Manipulation Survey[4] illustrate the breadth of architectural choices, while OneTwoVLA[5] and Fast in Slow[6] exemplify dual-system designs that balance deliberation with reactive execution.

A particularly active line of research contrasts tightly coupled end-to-end models with explicitly decoupled dual-system approaches. The former often achieve strong performance through large-scale pretraining and unified vision-language encoders, as seen in VLA Autonomous Driving[3] and RationalVLA[9], but may struggle with interpretability and fine-grained spatial reasoning. In contrast, decoupled architectures, such as Ground Slow Move Fast[0] and its closely related variant Ground Slow Move Fast Dual[10], separate high-level grounding and planning from low-level motor control, enabling modular reasoning and potentially greater robustness in novel environments. Ground Slow Move Fast[0] sits squarely within this dual-system paradigm, emphasizing the interplay between a slow grounding module that interprets instructions and a fast action module that executes navigation. Compared to FSR VLN[21], which also explores dual-stream reasoning, Ground Slow Move Fast[0] places stronger emphasis on explicit temporal decoupling and control handoff. This design choice reflects ongoing debates about modularity versus end-to-end optimization, and whether interpretable reasoning stages can improve generalization without sacrificing real-time responsiveness.

Claimed Contributions

DualVLN: A dual-system VLN foundation model

The authors introduce DualVLN, a novel architecture that decouples vision-language navigation into two complementary systems: System 2 (a VLM-based global planner for pixel-goal grounding) and System 1 (a lightweight diffusion transformer policy for trajectory generation). This dual-system design enables asynchronous inference, robust real-time control, and adaptive local decision-making in dynamic environments.

10 retrieved papers
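
To make the slow/fast split concrete, the following is a minimal sketch of a two-rate control loop in which a slow planner refreshes a goal only occasionally while a fast policy acts on the most recent goal at every step. It is a single-threaded tick simulation, not true asynchronous inference, and it is not the authors' implementation: `PixelGoal`, `slow_planner`, `fast_policy`, `run`, and the period of 4 steps are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PixelGoal:
    u: int          # hypothetical pixel column of the goal
    v: int          # hypothetical pixel row of the goal
    issued_at: int  # control step at which the planner produced it


def slow_planner(step: int) -> PixelGoal:
    """Stand-in for System 2: returns a dummy pixel goal (assumption)."""
    return PixelGoal(u=100 + step, v=200, issued_at=step)


def fast_policy(goal: PixelGoal) -> Tuple[float, float]:
    """Stand-in for System 1: maps the latest goal to a dummy command."""
    return (0.5, 0.001 * (goal.u - 100))


def run(total_steps: int, planner_period: int) -> List[int]:
    """Run the fast loop every step; refresh the goal every planner_period steps.

    Returns, for each control step, the step at which the goal in use was issued.
    """
    goal: Optional[PixelGoal] = None
    goal_stamps: List[int] = []
    for step in range(total_steps):
        if step % planner_period == 0:
            goal = slow_planner(step)  # slow update, low frequency
        assert goal is not None
        fast_policy(goal)              # fast update, every step
        goal_stamps.append(goal.issued_at)
    return goal_stamps


print(run(total_steps=10, planner_period=4))
# [0, 0, 0, 0, 4, 4, 4, 4, 8, 8]
```

The printed list shows the fast loop reusing a stale goal between planner updates, which is the property that lets the low-level controller keep running while the planner is still reasoning.
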
Multi-modal conditioning diffusion transformer with latent goal representation

The authors develop a diffusion-based policy (System 1) that conditions on both explicit pixel goals and implicit latent goal embeddings extracted from the VLM through learnable queries. This multi-modal conditioning approach enables the policy to generate continuous, smooth trajectories at high frequency while maintaining strong coordination with the reasoning system.

6 retrieved papers
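
As a rough illustration of conditioning on both an explicit pixel goal and latent goal embeddings read out by learnable queries, the sketch below pools a set of token features with single-head dot-product attention and concatenates the result with a normalized pixel coordinate. Everything here is an assumption for illustration: the dimensions, the random "queries" (which would be trained parameters in practice), the absence of key/value projections, and the names `cross_attend` and `build_condition`. It does not reproduce the paper's architecture.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries: List[Vector], tokens: List[Vector]) -> List[Vector]:
    """Each query attends over all tokens and returns one pooled vector.

    Simplification: tokens serve as both keys and values (no projections).
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    pooled = []
    for q in queries:
        weights = softmax([dot(q, t) * scale for t in tokens])
        pooled.append(
            [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)]
        )
    return pooled


def build_condition(
    pixel_goal: Tuple[int, int],
    image_size: Tuple[int, int],
    latent_goals: List[Vector],
) -> Vector:
    """Concatenate the normalized pixel goal with the flattened latent goals."""
    (u, v), (width, height) = pixel_goal, image_size
    explicit = [u / width, v / height]
    implicit = [x for vec in latent_goals for x in vec]
    return explicit + implicit


rng = random.Random(0)
dim, num_tokens, num_queries = 8, 5, 4  # illustrative sizes only
vlm_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]
queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]

latent = cross_attend(queries, vlm_tokens)
condition = build_condition((320, 240), (640, 480), latent)
print(len(condition))  # 2 + 4 * 8 = 34
```

The resulting vector would then be fed to the trajectory generator as its conditioning input; how the actual model injects it into the diffusion transformer is not specified in this report.
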
Social-VLN benchmark for dynamic obstacle avoidance evaluation

The authors create Social-VLN, a new benchmark built upon R2R-CE that incorporates dynamic humanoid agents strategically placed along navigation trajectories. This benchmark evaluates navigation systems on social awareness, dynamic obstacle avoidance, and trajectory recovery capabilities, introducing a Human Collision Rate metric to quantify safety in human-centric environments.

5 retrieved papers (1 marked Can Refute)
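
This report does not give the formula for the Human Collision Rate. As a hedged illustration, the sketch below assumes one plausible reading: the fraction of evaluation episodes in which the agent collides with a humanoid at least once. The `EpisodeResult` record, the function names, and the episode data are all made up for the example and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class EpisodeResult:
    success: bool          # whether the agent reached the goal
    human_collisions: int  # number of collisions with humanoid agents


def human_collision_rate(episodes: Sequence[EpisodeResult]) -> float:
    """Assumed definition: share of episodes with at least one human collision."""
    if not episodes:
        raise ValueError("need at least one episode")
    hit = sum(1 for ep in episodes if ep.human_collisions > 0)
    return hit / len(episodes)


def success_rate(episodes: Sequence[EpisodeResult]) -> float:
    """Share of episodes that ended in success."""
    if not episodes:
        raise ValueError("need at least one episode")
    return sum(1 for ep in episodes if ep.success) / len(episodes)


# Fabricated toy episodes, for illustration only.
toy_episodes = [
    EpisodeResult(success=True, human_collisions=0),
    EpisodeResult(success=False, human_collisions=2),
    EpisodeResult(success=True, human_collisions=0),
    EpisodeResult(success=True, human_collisions=1),
]

print(human_collision_rate(toy_episodes))  # 0.5
print(success_rate(toy_episodes))          # 0.75
```

Under this reading, an episode can count as successful and still contribute to the collision rate, so the two numbers are best reported together.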
