From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Urban Navigation, Foundation Models, Reinforcement Learning
Abstract:

Navigation foundation models trained on massive web-scale data enable agents to generalize across diverse environments and embodiments. However, because these models are trained solely on offline data, they often lack the capacity to reason about the consequences of their actions or to adapt through counterfactual understanding. They therefore face significant limitations in real-world urban navigation, where interactive and safe behaviors, such as avoiding obstacles and moving pedestrians, are critical. To tackle these challenges, we introduce the Seeing-to-Experiencing (S2E) learning framework, which scales the capability of navigation foundation models with reinforcement learning. S2E combines the strengths of pretraining on offline videos and post-training through reinforcement learning: it maintains the model's generalizability acquired from large-scale real-world videos while enhancing its interactivity through reinforcement learning in simulation environments. Specifically, we introduce two innovations:

  1. an Anchor-Guided Distribution Matching strategy for offline pretraining, which stabilizes learning and models diverse motion patterns through anchor-based supervision; and
  2. a Residual-Attention Module for reinforcement learning, which acquires reactive behaviors from simulation environments without erasing the model's pretrained knowledge.

Moreover, we establish NavBench-GS, a comprehensive end-to-end evaluation benchmark built on photorealistic 3D Gaussian Splatting reconstructions of real-world scenes that incorporate physical interactions. It systematically assesses the generalizability and safety of navigation foundation models.
Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's task and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes the Seeing-to-Experiencing (S2E) framework to scale navigation foundation models by combining offline video pretraining with reinforcement learning post-training. It resides in the 'Scaling and Lifelong Learning' leaf under 'Zero-Shot and Generalization', a leaf that contains only two papers, including this one. This sparse positioning suggests that the specific combination of large-scale visual pretraining and RL-driven scaling for navigation foundation models remains relatively underexplored. The sibling paper focuses on lifelong skill accumulation in persistent environments, whereas S2E emphasizes bridging passive observation and active interaction through RL.

The taxonomy reveals neighboring research directions that contextualize S2E's approach. The 'Foundation Model Post-Training' leaf (three papers) addresses RL or supervised fine-tuning of foundation models but does not explicitly focus on scaling through experiential learning. The 'Vision-Based Navigation' and 'Offline and Batch RL' leaves explore RL from visual inputs and pre-collected datasets respectively, yet lack the foundation model integration central to S2E. The 'VFM Distillation and Transfer' branch examines knowledge transfer from visual foundation models but without the RL-driven post-training component. S2E appears to occupy a niche bridging offline foundation model pretraining and online RL adaptation.

Among thirty candidates examined, the S2E framework shows partial overlap with prior work: three candidates can refute aspects of the core contribution. The Anchor-Guided Distribution Matching strategy and NavBench-GS benchmark appear more novel, with zero refutable candidates among ten examined for each. However, this analysis reflects a limited semantic search scope, not an exhaustive literature review. The statistics suggest that while the overarching S2E framework has some precedent in combining pretraining and RL, the specific technical mechanisms (anchor-based supervision, residual components) and evaluation infrastructure may offer incremental novelty within the examined candidate set.

Given the sparse taxonomy leaf and limited search scope, S2E appears to address a relatively underexplored intersection of foundation model scaling and RL-driven navigation. The framework's novelty likely resides in its specific technical choices rather than the high-level concept of combining offline and online learning. A more exhaustive search across robotics, computer vision, and RL venues would be necessary to fully assess whether the anchor-based and residual innovations constitute significant departures from existing methods or represent incremental refinements of established techniques.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: scaling navigation foundation models with reinforcement learning. The field structure reflects a multifaceted effort to integrate large-scale pretrained models with RL-driven navigation capabilities. Foundation Model Integration and Adaptation explores how vision-language models and behavioral foundation models can be adapted for navigation tasks, often leveraging offline data or multimodal inputs (e.g., Multimodal Web Navigation[12], Behavioral Foundation Models[13]). RL-Centric Navigation Approaches focuses on classical and modern RL techniques tailored to navigation, including offline RL methods (Offline RL Visual Navigation[10]) and curriculum-based training (Curriculum Swarm Navigation[8]). Domain-Specific Navigation Applications addresses specialized settings such as autonomous driving (HighwayLLM[28]), medical guidewire navigation (Zero-shot Guidewire Navigation[44]), and spacecraft trajectory planning (Spacecraft Trajectory Transformer[38]). Methodological Foundations and Analysis provides theoretical grounding and benchmarking (Benchmarking RL Navigation[47]), while Zero-Shot and Generalization examines how models transfer across environments and scale over time, including lifelong learning paradigms (Lifelong Navigation Improvement[6]).

A central tension across these branches concerns the trade-off between supervised pretraining and online RL fine-tuning, with works like SFT Memorizes RL Generalizes[2] highlighting that RL often yields better generalization despite supervised methods' data efficiency. Within the Zero-Shot and Generalization branch, Seeing to Experiencing[0] sits alongside Lifelong Navigation Improvement[6], both emphasizing continual adaptation and scaling over extended interaction horizons.
While Lifelong Navigation Improvement[6] focuses on incremental skill accumulation in persistent environments, Seeing to Experiencing[0] appears to bridge passive visual pretraining with active RL-driven exploration, aiming to scale foundation models through experiential feedback. This contrasts with purely zero-shot approaches (Action-Aware Zero-Shot Navigation[37]) that rely on pretrained representations without further online learning, underscoring an ongoing debate about when and how to inject RL into foundation model pipelines for robust, generalizable navigation.

Claimed Contributions

Seeing-to-Experiencing (S2E) learning framework

A hybrid learning framework that combines pretraining on large-scale offline videos with post-training through reinforcement learning in simulation environments. This approach maintains the model's generalizability from real-world videos while enhancing its interactivity and reactive behaviors through RL.

10 retrieved papers · Can Refute
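The framework's "without erasing the model's pretrained knowledge" property is attributed to a Residual-Attention Module whose internals are not detailed in this report. A common way to obtain that property is a residual attention branch whose output projection is zero-initialized, so the branch is exactly the identity over the pretrained features when RL post-training begins. The sketch below illustrates this under those assumptions; the class name, shapes, and initialization scheme are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ResidualAttentionBranch:
    """Attention branch added residually on top of pretrained features.

    The output projection is zero-initialized, so at the start of RL
    post-training the branch is exactly the identity map: updates can
    only add reactive behavior on top of what pretraining learned,
    rather than overwriting it in the first gradient steps.
    """
    def __init__(self, dim):
        self.wq = rng.normal(0.0, 0.02, (dim, dim))
        self.wk = rng.normal(0.0, 0.02, (dim, dim))
        self.wv = rng.normal(0.0, 0.02, (dim, dim))
        self.wo = np.zeros((dim, dim))  # zero-init: residual starts inactive

    def __call__(self, feats):
        q, k, v = feats @ self.wq, feats @ self.wk, feats @ self.wv
        attn = softmax(q @ k.T / np.sqrt(feats.shape[-1]))
        return feats + (attn @ v) @ self.wo  # residual connection

feats = rng.normal(size=(4, 8))            # 4 tokens of pretrained features
branch = ResidualAttentionBranch(8)
assert np.allclose(branch(feats), feats)   # identity at initialization
```

Because early rollouts behave identically to the pretrained policy while gradients still flow into the branch, RL can shape reactive behavior without the catastrophic drift that full fine-tuning risks.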
Anchor-Guided Distribution Matching strategy

A pretraining strategy that uses anchor-based supervision to model multimodal distributions in normalized motion trajectory space. It employs a Gaussian Mixture Model with uniformly sampled anchors to represent diverse navigation behaviors, providing stable learning and supporting cross-embodiment deployment.

10 retrieved papers
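The description suggests matching a Gaussian-mixture-style trajectory distribution against uniformly sampled anchors. One plausible instantiation, hard-assigning each ground-truth trajectory to its nearest anchor and combining anchor classification with per-anchor residual regression, is sketched below; the anchor count, trajectory dimensionality, and the hard-assignment rule are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# K anchors sampled uniformly in a normalized trajectory space of dimension D
# (think: a flattened sequence of future waypoints scaled into [-1, 1]).
K, D = 16, 6
anchors = rng.uniform(-1.0, 1.0, size=(K, D))

def anchor_guided_loss(pred_logits, pred_residuals, gt_traj):
    """One plausible anchor-guided distribution-matching loss.

    pred_logits:    (K,)   unnormalized mixture weights over anchors
    pred_residuals: (K, D) per-anchor offsets refining each anchor
    gt_traj:        (D,)   ground-truth normalized trajectory
    """
    # Hard-assign the ground truth to its nearest anchor.
    target = int(np.argmin(np.linalg.norm(anchors - gt_traj, axis=1)))

    # Classification term: push mixture weight toward the matched anchor.
    log_probs = pred_logits - np.log(np.sum(np.exp(pred_logits)))
    cls_loss = -log_probs[target]

    # Regression term: only the matched component regresses its offset,
    # leaving the other modes free to represent alternative behaviors.
    reg_loss = np.sum((anchors[target] + pred_residuals[target] - gt_traj) ** 2)
    return cls_loss + reg_loss

gt = rng.uniform(-1.0, 1.0, size=D)
loss = anchor_guided_loss(np.zeros(K), np.zeros((K, D)), gt)
assert loss > 0.0  # uninformed predictions pay both terms
```

The winner-takes-all assignment is what would make training stable relative to fitting a free-form mixture: each sample supervises exactly one mode, so the remaining anchors can keep representing alternative maneuvers instead of collapsing to the mean.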
NavBench-GS evaluation benchmark

A comprehensive end-to-end evaluation benchmark built on photorealistic 3D Gaussian Splatting reconstructions of real-world scenes with accurate physics and interactive dynamics. It enables closed-loop policy assessment in photo-realistic 3D scenes and systematically evaluates the generalizability and safety of navigation models.

10 retrieved papers
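The benchmark's concrete metrics and API are not given in this report; a closed-loop evaluation of the kind described would roll each policy out in the reconstructed scene and aggregate goal-reaching and safety statistics together. The toy harness below illustrates that bookkeeping in a two-dimensional stand-in world; the metric names, environment, and both policies are hypothetical.

```python
import math

def evaluate_closed_loop(policy, episodes, max_steps=50, goal_x=10.0):
    """Roll a policy out against one obstacle per episode and aggregate
    goal-reaching and safety statistics (metric names are hypothetical)."""
    successes, collisions = 0, 0
    for ox in episodes:                      # obstacle sits at (ox, 0)
        x, y, collided = 0.0, 0.0, False
        for _ in range(max_steps):
            dx, dy = policy(x, y, ox)
            x, y = x + dx, y + dy
            if math.hypot(x - ox, y) < 0.5:  # contact: safety violation
                collided = True
                break
            if x >= goal_x:                  # reached the goal
                successes += 1
                break
        collisions += collided
    n = len(episodes)
    return {"success_rate": successes / n, "collision_rate": collisions / n}

def reactive(x, y, ox):
    """Sidesteps while passing the obstacle, then returns to the centerline."""
    if abs(x - ox) < 2.5:
        return 1.0, (1.0 if y < 1.0 else 0.0)
    return 1.0, -min(y, 1.0)

def blind(x, y, ox):
    """Ignores the obstacle entirely and drives straight at the goal."""
    return 1.0, 0.0

episodes = [3.2, 5.2, 7.2]  # obstacle x-positions, one per episode
print(evaluate_closed_loop(reactive, episodes))  # all goals reached, no contact
print(evaluate_closed_loop(blind, episodes))     # every episode ends in contact
```

A benchmark like NavBench-GS would report analogous aggregates, with the toy dynamics replaced by physics-aware rollouts inside the 3D Gaussian Splatting reconstructions, which is what lets it score generalizability and safety in the same closed loop.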

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Seeing-to-Experiencing (S2E) learning framework

A hybrid learning framework that combines pretraining on large-scale offline videos with post-training through reinforcement learning in simulation environments. This approach maintains the model's generalizability from real-world videos while enhancing its interactivity and reactive behaviors through RL.

Contribution

Anchor-Guided Distribution Matching strategy

A pretraining strategy that uses anchor-based supervision to model multimodal distributions in normalized motion trajectory space. It employs a Gaussian Mixture Model with uniformly sampled anchors to represent diverse navigation behaviors, providing stable learning and supporting cross-embodiment deployment.

Contribution

NavBench-GS evaluation benchmark

A comprehensive end-to-end evaluation benchmark built on photorealistic 3D Gaussian Splatting reconstructions of real-world scenes with accurate physics and interactive dynamics. It enables closed-loop policy assessment in photo-realistic 3D scenes and systematically evaluates the generalizability and safety of navigation models.