Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-Language Navigation
Overview
Overall Novelty Assessment
The paper proposes DualVLN, a dual-system VLN foundation model that decouples VLM-based global planning (System 2) from a diffusion-based local policy (System 1). It sits in the 'Vision-Language Navigation Dual-System Foundations' leaf, which contains three papers: this work and two siblings. The leaf is therefore a sparse research direction within the broader dual-system architecture space, which suggests that combining VLN with decoupled reasoning and trajectory execution remains underexplored compared with more crowded areas such as autonomous driving VLA models or general VLA foundations.
The taxonomy reveals that dual-system architectures branch into unified adaptive models, decoupled reasoning-action frameworks, and domain-specific applications. DualVLN's leaf sits under 'Decoupled Dual-System Models with Separate Reasoning and Action Modules,' adjacent to robotic manipulation dual-systems and distinct from unified single-model approaches. Neighboring branches include general VLA architectures, spatial reasoning methods, and learning paradigms for VLN. The scope notes clarify that explicit dual-system separation for VLN distinguishes this work from end-to-end models and from manipulation-focused dual-systems, positioning it at the intersection of architectural modularity and navigation-specific challenges.
Across the three contributions, 21 candidates were examined and only one refutable pair emerged. For the core dual-system VLN contribution, 10 candidates were examined with zero refutations; for the multi-modal diffusion transformer, 6 were examined with none; for the Social-VLN benchmark, 5 were examined and 1 potential overlap was found. The search was limited to top-K semantic matches plus citation expansion, so within the examined literature the dual-system VLN architecture and the diffusion-based local policy appear relatively novel, though the dynamic obstacle benchmark may have closer prior work. The small candidate pool and sparse taxonomy leaf mean the analysis covers a focused but not exhaustive slice of the field.
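The per-contribution counts quoted above can be checked with a short tally. This is only a bookkeeping sketch: the dictionary keys are shorthand labels invented here, and the numbers are copied from the paragraph, not recomputed from any search output.

```python
# Candidate counts per contribution, copied from the paragraph above.
# The keys are shorthand labels chosen for this sketch, not names from the report.
examined = {
    "dual_system_vln": 10,
    "multimodal_diffusion_transformer": 6,
    "social_vln_benchmark": 5,
}
refutable = {
    "dual_system_vln": 0,
    "multimodal_diffusion_transformer": 0,
    "social_vln_benchmark": 1,
}


def summarize():
    """Return (total candidates, total refutable pairs, labels with an overlap)."""
    flagged = [name for name, count in refutable.items() if count > 0]
    return sum(examined.values()), sum(refutable.values()), flagged


print(summarize())  # (21, 1, ['social_vln_benchmark'])
```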
Given the limited 21-candidate search and the sparse three-paper taxonomy leaf, the dual-system VLN framework appears to occupy a less crowded niche. The absence of refutations for the core architectural contributions among examined candidates suggests potential novelty, though the analysis does not cover the full breadth of VLN or diffusion policy literature. The benchmark contribution's single overlap warrants closer scrutiny of prior dynamic navigation evaluation methods.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce DualVLN, a novel architecture that decouples vision-language navigation into two complementary systems: System 2 (a VLM-based global planner for pixel-goal grounding) and System 1 (a lightweight diffusion transformer policy for trajectory generation). This dual-system design enables asynchronous inference, robust real-time control, and adaptive local decision-making in dynamic environments.
The authors develop a diffusion-based policy (System 1) that conditions on both explicit pixel goals and implicit latent goal embeddings extracted from the VLM through learnable queries. This multi-modal conditioning approach enables the policy to generate continuous, smooth trajectories at high frequency while maintaining strong coordination with the reasoning system.
The authors create Social-VLN, a new benchmark built upon R2R-CE that incorporates dynamic humanoid agents strategically placed along navigation trajectories. This benchmark evaluates navigation systems on social awareness, dynamic obstacle avoidance, and trajectory recovery capabilities, introducing a Human Collision Rate metric to quantify safety in human-centric environments.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[10] Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation
[21] FSR-VLN: Fast and Slow Reasoning for Vision-Language Navigation with Hierarchical Multi-modal Scene Graph
Contribution Analysis
Detailed comparisons for each claimed contribution
DualVLN: A dual-system VLN foundation model
The authors introduce DualVLN, a novel architecture that decouples vision-language navigation into two complementary systems: System 2 (a VLM-based global planner for pixel-goal grounding) and System 1 (a lightweight diffusion transformer policy for trajectory generation). This dual-system design enables asynchronous inference, robust real-time control, and adaptive local decision-making in dynamic environments.
[5] OneTwoVLA: A Unified Vision-Language-Action Model with Adaptive Reasoning
[10] Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation
[12] Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation
[13] Omninav: A unified framework for prospective exploration and visual-language navigation
[17] A Dual Semantic-Aware Recurrent Global-Adaptive Network For Vision-and-Language Navigation
[30] JanusVLN: Decoupling Semantics and Spatiality with Dual Implicit Memory for Vision-Language Navigation
[31] DOPE: Dual Object Perception-Enhancement Network for Vision-and-Language Navigation
[32] A VLM-Drone System for Indoor Navigation Assistance with Semantic Reasoning for the Visually Impaired
[33] Robix: A unified model for robot interaction, reasoning and planning
[34] Lovon: Legged open-vocabulary object navigator
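The asynchronous split claimed for this contribution can be illustrated with a single-clock simulation: a slow planner refreshes a pixel goal every few control ticks, while a fast policy acts on whatever goal is most recent. This is a minimal sketch under stated assumptions, not the authors' implementation. `PixelGoal`, `slow_planner`, `fast_policy`, and `planner_period` are hypothetical names, the planner returns dummy coordinates instead of calling a VLM, and planner latency is ignored.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PixelGoal:
    """Image-space target produced by the slow planner (hypothetical type)."""
    u: int
    v: int
    issued_at: int  # control tick at which the goal was produced


def slow_planner(tick: int) -> PixelGoal:
    """Stand-in for System 2; a VLM call would go here. Returns a dummy goal."""
    return PixelGoal(u=100 + tick, v=200, issued_at=tick)


def fast_policy(goal: PixelGoal, tick: int) -> Tuple[int, int, int]:
    """Stand-in for System 1; reports (tick, goal.u, goal staleness in ticks)."""
    return (tick, goal.u, tick - goal.issued_at)


def run(num_ticks: int, planner_period: int) -> List[Tuple[int, int, int]]:
    """Run the fast policy every tick and the slow planner every `planner_period` ticks."""
    latest: Optional[PixelGoal] = None
    log = []
    for tick in range(num_ticks):
        if tick % planner_period == 0:
            latest = slow_planner(tick)  # low-frequency goal refresh
        assert latest is not None  # tick 0 always triggers the planner
        log.append(fast_policy(latest, tick))  # high-frequency control step
    return log


for entry in run(num_ticks=6, planner_period=3):
    print(entry)
```

The third field of each log entry shows how stale the goal is when the fast policy acts, which is the quantity a decoupled design has to tolerate between planner updates.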
Multi-modal conditioning diffusion transformer with latent goal representation
The authors develop a diffusion-based policy (System 1) that conditions on both explicit pixel goals and implicit latent goal embeddings extracted from the VLM through learnable queries. This multi-modal conditioning approach enables the policy to generate continuous, smooth trajectories at high frequency while maintaining strong coordination with the reasoning system.
[10] Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation
[35] Multimodal diffusion transformer: Learning versatile behavior from multimodal goals
[36] A survey on diffusion policy for robotic manipulation: Taxonomy, analysis, and future directions
[37] VDT-Auto: End-to-end Autonomous Driving with VLM-Guided Diffusion Transformers
[38] Efficient Virtuoso: A Latent Diffusion Transformer Model for Goal-Conditioned Trajectory Planning
[39] Diffusion Policy
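The conditioning described for this contribution, an explicit pixel goal combined with latent goal tokens, can be sketched as a token sequence that steers an iterative refinement from noise to waypoints. Everything below is an illustrative assumption: the function names, the token dimension, and the example latent values are invented, and `toy_denoiser` is a hand-written stand-in, not a trained diffusion transformer. The loop only mimics the shape of a sampling procedure (start from noise, refine over a fixed number of steps) and does not implement a real noise schedule.

```python
import random
from typing import List, Tuple

Token = List[float]
Waypoint = Tuple[float, float]


def pixel_goal_token(u: float, v: float, width: int, height: int, dim: int) -> Token:
    """Embed an explicit pixel goal as normalized (u, v), zero-padded to `dim`."""
    return [u / width, v / height] + [0.0] * (dim - 2)


def build_condition(pixel_token: Token, latent_tokens: List[Token]) -> List[Token]:
    """Concatenate the explicit goal token with the implicit latent goal tokens."""
    return [pixel_token] + list(latent_tokens)


def toy_denoiser(noisy: List[Waypoint], condition: List[Token]) -> List[Waypoint]:
    """Hypothetical stand-in for a trained network.

    Ignores the noisy input and predicts a straight line from the origin to the
    mean of the first two dimensions of the conditioning tokens.
    """
    gx = sum(t[0] for t in condition) / len(condition)
    gy = sum(t[1] for t in condition) / len(condition)
    n = len(noisy)
    return [(gx * (i + 1) / n, gy * (i + 1) / n) for i in range(n)]


def sample_trajectory(condition: List[Token], horizon: int = 3,
                      steps: int = 4, seed: int = 0) -> List[Waypoint]:
    """Start from Gaussian noise and move toward the prediction at every step."""
    rng = random.Random(seed)
    traj = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(horizon)]
    for k in range(steps):
        pred = toy_denoiser(traj, condition)
        alpha = 1.0 / (steps - k)  # reaches 1.0 on the final step
        traj = [(x + alpha * (px - x), y + alpha * (py - y))
                for (x, y), (px, py) in zip(traj, pred)]
    return traj


# Example latent tokens; in the described system these would come from the VLM.
latents = [[0.1, 0.3, 0.0, 0.0], [0.3, 0.1, 0.0, 0.0]]
cond = build_condition(pixel_goal_token(320, 240, 640, 480, dim=4), latents)
print(sample_trajectory(cond))
```

Because the step size reaches 1.0 on the last iteration, the output equals the stand-in prediction regardless of the initial noise; a trained model would instead be evaluated at each noise level.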
Social-VLN benchmark for dynamic obstacle avoidance evaluation
The authors create Social-VLN, a new benchmark built upon R2R-CE that incorporates dynamic humanoid agents strategically placed along navigation trajectories. This benchmark evaluates navigation systems on social awareness, dynamic obstacle avoidance, and trajectory recovery capabilities, introducing a Human Collision Rate metric to quantify safety in human-centric environments.
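The report names a Human Collision Rate metric but does not give its formula. One plausible reading, assumed here purely for illustration, is the fraction of evaluation episodes in which the agent collides with a humanoid at least once. The function name and input format below are hypothetical.

```python
from typing import Iterable


def human_collision_rate(collisions_per_episode: Iterable[int]) -> float:
    """Fraction of episodes with at least one human collision.

    Assumed definition for this sketch; the benchmark's exact formula is not
    stated in the text above.
    """
    episodes = list(collisions_per_episode)
    if not episodes:
        raise ValueError("at least one episode is required")
    return sum(1 for count in episodes if count > 0) / len(episodes)


# Four episodes with 0, 2, 0, and 1 human collisions respectively.
print(human_collision_rate([0, 2, 0, 1]))  # 0.5
```

Under this reading, an episode with several collisions counts once, so the metric measures how often an episode is unsafe, not how many contacts occur.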