From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Urban Navigation, Foundation Models, Reinforcement Learning
Abstract:

Navigation foundation models trained on massive web-scale data enable agents to generalize across diverse environments and embodiments. However, because these models are trained solely on offline data, they often lack the capacity to reason about the consequences of their actions or to adapt through counterfactual understanding. They therefore face significant limitations in real-world urban navigation, where interactive and safe behaviors, such as avoiding obstacles and moving pedestrians, are critical. To tackle these challenges, we introduce the Seeing-to-Experiencing (S2E) learning framework, which scales the capability of navigation foundation models with reinforcement learning. S2E combines the strengths of pretraining on offline videos and post-training through reinforcement learning: it maintains the model's generalizability acquired from large-scale real-world videos while enhancing its interactivity through reinforcement learning in simulation environments. Specifically, we introduce two innovations:

  1. an Anchor-Guided Distribution Matching strategy for offline pretraining, which stabilizes learning and models diverse motion patterns through anchor-based supervision; and
  2. a Residual-Attention Module for reinforcement learning, which acquires reactive behaviors from simulation environments without erasing the model's pretrained knowledge.

Moreover, we establish NavBench-GS, a comprehensive end-to-end evaluation benchmark built on photorealistic 3D Gaussian Splatting reconstructions of real-world scenes that incorporate physical interactions. It systematically assesses the generalizability and safety of navigation foundation models.
Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's task and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes the Seeing-to-Experiencing (S2E) framework to scale navigation foundation models by combining offline video pretraining with reinforcement learning post-training. It resides in the 'Scaling and Lifelong Learning' leaf under 'Zero-Shot and Generalization', a leaf that contains only two papers, including this one. This sparse positioning suggests that the specific combination of large-scale visual pretraining and RL-driven scaling for navigation foundation models remains relatively underexplored. The sibling paper focuses on lifelong skill accumulation in persistent environments, whereas S2E emphasizes bridging passive observation and active interaction through RL.

The taxonomy reveals neighboring research directions that contextualize S2E's approach. The 'Foundation Model Post-Training' leaf (three papers) addresses RL or supervised fine-tuning of foundation models but does not explicitly focus on scaling through experiential learning. The 'Vision-Based Navigation' and 'Offline and Batch RL' leaves explore RL from visual inputs and pre-collected datasets respectively, yet lack the foundation model integration central to S2E. The 'VFM Distillation and Transfer' branch examines knowledge transfer from visual foundation models but without the RL-driven post-training component. S2E appears to occupy a niche bridging offline foundation model pretraining and online RL adaptation.

Among thirty candidates examined, the S2E framework shows partial overlap with prior work: three candidates can refute aspects of the core contribution. The Anchor-Guided Distribution Matching strategy and NavBench-GS benchmark appear more novel, with zero refutable candidates among ten examined for each. However, this analysis reflects a limited semantic search scope, not an exhaustive literature review. The statistics suggest that while the overarching S2E framework has some precedent in combining pretraining and RL, the specific technical mechanisms (anchor-based supervision, residual components) and evaluation infrastructure may offer incremental novelty within the examined candidate set.

Given the sparse taxonomy leaf and limited search scope, S2E appears to address a relatively underexplored intersection of foundation model scaling and RL-driven navigation. The framework's novelty likely resides in its specific technical choices rather than the high-level concept of combining offline and online learning. A more exhaustive search across robotics, computer vision, and RL venues would be necessary to fully assess whether the anchor-based and residual innovations constitute significant departures from existing methods or represent incremental refinements of established techniques.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 3

Research Landscape Overview

Core task: scaling navigation foundation models with reinforcement learning. The field structure reflects a multifaceted effort to integrate large-scale pretrained models with RL-driven navigation capabilities. Foundation Model Integration and Adaptation explores how vision-language models and behavioral foundation models can be adapted for navigation tasks, often leveraging offline data or multimodal inputs (e.g., Multimodal Web Navigation[12], Behavioral Foundation Models[13]). RL-Centric Navigation Approaches focuses on classical and modern RL techniques tailored to navigation, including offline RL methods (Offline RL Visual Navigation[10]) and curriculum-based training (Curriculum Swarm Navigation[8]). Domain-Specific Navigation Applications addresses specialized settings such as autonomous driving (HighwayLLM[28]), medical guidewire navigation (Zero-shot Guidewire Navigation[44]), and spacecraft trajectory planning (Spacecraft Trajectory Transformer[38]). Methodological Foundations and Analysis provides theoretical grounding and benchmarking (Benchmarking RL Navigation[47]), while Zero-Shot and Generalization examines how models transfer across environments and scale over time, including lifelong learning paradigms (Lifelong Navigation Improvement[6]).

A central tension across these branches concerns the trade-off between supervised pretraining and online RL fine-tuning, with works like SFT Memorizes RL Generalizes[2] highlighting that RL often yields better generalization despite supervised methods' data efficiency. Within the Zero-Shot and Generalization branch, Seeing to Experiencing[0] sits alongside Lifelong Navigation Improvement[6], both emphasizing continual adaptation and scaling over extended interaction horizons.
While Lifelong Navigation Improvement[6] focuses on incremental skill accumulation in persistent environments, Seeing to Experiencing[0] appears to bridge passive visual pretraining with active RL-driven exploration, aiming to scale foundation models through experiential feedback. This contrasts with purely zero-shot approaches (Action-Aware Zero-Shot Navigation[37]) that rely on pretrained representations without further online learning, underscoring an ongoing debate about when and how to inject RL into foundation model pipelines for robust, generalizable navigation.

Claimed Contributions

Seeing-to-Experiencing (S2E) learning framework

A hybrid learning framework that combines pretraining on large-scale offline videos with post-training through reinforcement learning in simulation environments. This approach maintains the model's generalizability from real-world videos while enhancing its interactivity and reactive behaviors through RL.

10 retrieved papers · Can Refute
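The framework's "without erasing the model's pretrained knowledge" property is attributed to a Residual-Attention Module whose internals are not detailed in this report. A common way to obtain that property is a residual attention branch whose output projection is zero-initialized, so the branch is exactly the identity over the pretrained features when RL post-training begins. The sketch below illustrates this under those assumptions; the class name, shapes, and initialization scheme are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ResidualAttentionBranch:
    """Attention branch added residually on top of pretrained features.

    The output projection is zero-initialized, so at the start of RL
    post-training the branch is exactly the identity map: updates can
    only add reactive behavior on top of what pretraining learned,
    rather than overwriting it in the first gradient steps.
    """
    def __init__(self, dim):
        self.wq = rng.normal(0.0, 0.02, (dim, dim))
        self.wk = rng.normal(0.0, 0.02, (dim, dim))
        self.wv = rng.normal(0.0, 0.02, (dim, dim))
        self.wo = np.zeros((dim, dim))  # zero-init: residual starts inactive

    def __call__(self, feats):
        q, k, v = feats @ self.wq, feats @ self.wk, feats @ self.wv
        attn = softmax(q @ k.T / np.sqrt(feats.shape[-1]))
        return feats + (attn @ v) @ self.wo  # residual connection

feats = rng.normal(size=(4, 8))            # 4 tokens of pretrained features
branch = ResidualAttentionBranch(8)
assert np.allclose(branch(feats), feats)   # identity at initialization
```

Because early rollouts behave identically to the pretrained policy while gradients still flow into the branch, RL can shape reactive behavior without the catastrophic drift that full fine-tuning risks.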
Anchor-Guided Distribution Matching strategy

A pretraining strategy that uses anchor-based supervision to model multimodal distributions in normalized motion trajectory space. It employs a Gaussian Mixture Model with uniformly sampled anchors to represent diverse navigation behaviors, providing stable learning and supporting cross-embodiment deployment.

10 retrieved papers
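The description suggests matching a Gaussian-mixture-style trajectory distribution against uniformly sampled anchors. One plausible instantiation, hard-assigning each ground-truth trajectory to its nearest anchor and combining anchor classification with per-anchor residual regression, is sketched below; the anchor count, trajectory dimensionality, and the hard-assignment rule are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# K anchors sampled uniformly in a normalized trajectory space of dimension D
# (think: a flattened sequence of future waypoints scaled into [-1, 1]).
K, D = 16, 6
anchors = rng.uniform(-1.0, 1.0, size=(K, D))

def anchor_guided_loss(pred_logits, pred_residuals, gt_traj):
    """One plausible anchor-guided distribution-matching loss.

    pred_logits:    (K,)   unnormalized mixture weights over anchors
    pred_residuals: (K, D) per-anchor offsets refining each anchor
    gt_traj:        (D,)   ground-truth normalized trajectory
    """
    # Hard-assign the ground truth to its nearest anchor.
    target = int(np.argmin(np.linalg.norm(anchors - gt_traj, axis=1)))

    # Classification term: push mixture weight toward the matched anchor.
    log_probs = pred_logits - np.log(np.sum(np.exp(pred_logits)))
    cls_loss = -log_probs[target]

    # Regression term: only the matched component regresses its offset,
    # leaving the other modes free to represent alternative behaviors.
    reg_loss = np.sum((anchors[target] + pred_residuals[target] - gt_traj) ** 2)
    return cls_loss + reg_loss

gt = rng.uniform(-1.0, 1.0, size=D)
loss = anchor_guided_loss(np.zeros(K), np.zeros((K, D)), gt)
assert loss > 0.0  # uninformed predictions pay both terms
```

The winner-takes-all assignment is what would make training stable relative to fitting a free-form mixture: each sample supervises exactly one mode, so the remaining anchors can keep representing alternative maneuvers instead of collapsing to the mean.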
NavBench-GS evaluation benchmark

A comprehensive end-to-end evaluation benchmark built on photorealistic 3D Gaussian Splatting reconstructions of real-world scenes with accurate physics and interactive dynamics. It enables closed-loop policy assessment in photo-realistic 3D scenes and systematically evaluates the generalizability and safety of navigation models.

10 retrieved papers
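The benchmark's concrete metrics and API are not given in this report; a closed-loop evaluation of the kind described would roll each policy out in the reconstructed scene and aggregate goal-reaching and safety statistics together. The toy harness below illustrates that bookkeeping in a two-dimensional stand-in world; the metric names, environment, and both policies are hypothetical.

```python
import math

def evaluate_closed_loop(policy, episodes, max_steps=50, goal_x=10.0):
    """Roll a policy out against one obstacle per episode and aggregate
    goal-reaching and safety statistics (metric names are hypothetical)."""
    successes, collisions = 0, 0
    for ox in episodes:                      # obstacle sits at (ox, 0)
        x, y, collided = 0.0, 0.0, False
        for _ in range(max_steps):
            dx, dy = policy(x, y, ox)
            x, y = x + dx, y + dy
            if math.hypot(x - ox, y) < 0.5:  # contact: safety violation
                collided = True
                break
            if x >= goal_x:                  # reached the goal
                successes += 1
                break
        collisions += collided
    n = len(episodes)
    return {"success_rate": successes / n, "collision_rate": collisions / n}

def reactive(x, y, ox):
    """Sidesteps while passing the obstacle, then returns to the centerline."""
    if abs(x - ox) < 2.5:
        return 1.0, (1.0 if y < 1.0 else 0.0)
    return 1.0, -min(y, 1.0)

def blind(x, y, ox):
    """Ignores the obstacle entirely and drives straight at the goal."""
    return 1.0, 0.0

episodes = [3.2, 5.2, 7.2]  # obstacle x-positions, one per episode
print(evaluate_closed_loop(reactive, episodes))  # all goals reached, no contact
print(evaluate_closed_loop(blind, episodes))     # every episode ends in contact
```

A benchmark like NavBench-GS would report analogous aggregates, with the toy dynamics replaced by physics-aware rollouts inside the 3D Gaussian Splatting reconstructions, which is what lets it score generalizability and safety in the same closed loop.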

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Seeing-to-Experiencing (S2E) learning framework

A hybrid learning framework that combines pretraining on large-scale offline videos with post-training through reinforcement learning in simulation environments. This approach maintains the model's generalizability from real-world videos while enhancing its interactivity and reactive behaviors through RL.

Contribution

Anchor-Guided Distribution Matching strategy

A pretraining strategy that uses anchor-based supervision to model multimodal distributions in normalized motion trajectory space. It employs a Gaussian Mixture Model with uniformly sampled anchors to represent diverse navigation behaviors, providing stable learning and supporting cross-embodiment deployment.

Contribution

NavBench-GS evaluation benchmark

A comprehensive end-to-end evaluation benchmark built on photorealistic 3D Gaussian Splatting reconstructions of real-world scenes with accurate physics and interactive dynamics. It enables closed-loop policy assessment in photo-realistic 3D scenes and systematically evaluates the generalizability and safety of navigation models.