Towards Physically Executable 3D Gaussian for Embodied Navigation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: 3D Gaussian Splatting; Vision-and-Language Navigation
Abstract:

3D Gaussian Splatting (3DGS), a 3D representation method with photorealistic real-time rendering capabilities, is regarded as an effective tool for narrowing the sim-to-real gap. However, it lacks the fine-grained semantics and physical executability needed for Vision-and-Language Navigation (VLN). To address this, we propose SAGE-3D (Semantically and Physically Aligned Gaussian Environments for 3D Navigation), a new paradigm that upgrades 3DGS into an executable, semantically and physically aligned environment. It comprises two components: (1) Object-Centric Semantic Grounding, which adds fine-grained object-level annotations to 3DGS; and (2) Physics-Aware Execution Jointing, which embeds collision objects into 3DGS and constructs rich physical interfaces. We release InteriorGS, containing 1K object-annotated 3DGS indoor scenes, and introduce SAGE-Bench, the first 3DGS-based VLN benchmark, with 2M VLN data samples. Experiments show that training on 3DGS scene data is harder to converge but generalizes strongly, improving baseline performance by 31% on the VLN-CE Unseen task.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (A scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SAGE-3D, a paradigm that enhances 3D Gaussian Splatting with object-level semantic annotations and physics-aware collision interfaces for Vision-and-Language Navigation. It resides in the 'Language-Guided Task Execution' leaf alongside three sibling papers (LagMemo, ATLAS Navigator, and one other), forming a small cluster within the broader 'Vision-Language Navigation in Continuous Environments' branch. This leaf is one focused research direction in a taxonomy of 32 papers across 12 leaf nodes, which suggests a moderate amount of prior work at this specific intersection of semantic grounding and task execution.

The taxonomy tree reveals that SAGE-3D sits adjacent to 'Trajectory Planning and Viewpoint Synthesis' (2 papers) and 'Image-Goal and Instance-Level Navigation' (5 papers), both under the same parent branch. Neighboring branches include 'Semantic 3D Gaussian Splatting Representations' (10 papers across three leaves) and 'Sim-to-Real Transfer and Embodied AI Platforms' (4 papers). The scope notes clarify that SAGE-3D's emphasis on physical executability and object-centric grounding distinguishes it from purely semantic representation methods (excluded from this leaf) and from trajectory synthesis approaches that lack explicit task-level reasoning.

Across the three contributions, 24 candidate papers were examined, and no clearly refuting prior work was identified: 4 candidates for the SAGE-3D paradigm, 10 for the InteriorGS dataset, and 10 for the SAGE-Bench benchmark, each with 0 refutations. The search scope was limited to top-K semantic matches plus citation expansion. Within the examined literature, the combination of object-level semantic grounding, physics-aware execution interfaces, and a dedicated VLN benchmark appears relatively unexplored. However, the analysis does not claim exhaustive coverage of all possible prior work.
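The per-contribution counts reported above can be cross-checked with simple arithmetic. The short Python sketch below only tallies the numbers stated in this section; the dictionary layout and variable names are illustrative and are not part of the report's pipeline.

```python
# Counts as reported in this section:
# contribution -> (candidates examined, refutable candidates found).
examined = {
    "SAGE-3D paradigm": (4, 0),
    "InteriorGS dataset": (10, 0),
    "SAGE-Bench benchmark": (10, 0),
}

# Sum the two columns independently.
total_candidates = sum(n for n, _ in examined.values())
total_refutable = sum(r for _, r in examined.values())

print(total_candidates, total_refutable)  # 24 0
```

The totals agree with the 24 candidates and 0 refutable papers quoted in the text.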

Given the constrained search scope (24 candidates, not hundreds), the contributions appear to occupy a niche where semantic 3DGS, physical executability, and VLN benchmarking converge. The taxonomy structure indicates this is a moderately populated research area with clear boundaries separating representation methods from navigation policies. The absence of refutable candidates among examined papers suggests potential novelty, though a broader literature review would be needed to confirm whether similar integrations exist outside the top-K semantic neighborhood.

Taxonomy

Core-task taxonomy papers: 32
Claimed contributions: 3
Contribution candidate papers compared: 24
Refutable papers: 0

Research Landscape Overview

Core task: Vision-and-Language Navigation in 3D Gaussian Splatting environments. This emerging field sits at the intersection of neural scene representations and embodied AI, where agents must interpret natural language instructions to navigate photorealistic 3D spaces reconstructed via Gaussian splatting. The taxonomy reveals four main branches: (1) semantic 3D Gaussian splatting representations that enrich scene geometry with language-grounded features (e.g., Vision-Language Gaussian Splatting[21], FMGS[4]); (2) navigation frameworks and policies that design control strategies for continuous or discrete movement in these environments (e.g., Splat-nav[1], GaussNav[6]); (3) sim-to-real transfer and embodied AI platforms addressing the gap between synthetic training and physical deployment (e.g., BEINGS[25], RealMirror[13]); and (4) related embodied AI applications extending beyond pure navigation to manipulation and multi-task scenarios (e.g., RoboTidy[20], VR-Robo[16]). These branches collectively capture how Gaussian splatting's rendering efficiency and geometric fidelity enable richer vision-language grounding compared to traditional mesh or voxel representations.

Recent work has concentrated on two contrasting themes: memory-augmented architectures that maintain spatial-semantic histories for long-horizon tasks (e.g., LagMemo[18], ATLAS Navigator[5]) versus end-to-end policies that directly map observations to actions without explicit memory modules. Physically Executable Gaussian[0] falls within the language-guided task execution cluster, emphasizing the generation of physically plausible action sequences grounded in Gaussian-based scene understanding. Compared to LagMemo[18], which prioritizes episodic memory for multi-step reasoning, and ATLAS Navigator[5], which focuses on hierarchical planning with topological maps, Physically Executable Gaussian[0] appears to stress the executability constraint: ensuring that predicted trajectories respect physical dynamics and scene affordances. This positions it as a bridge between high-level language grounding and low-level motion feasibility, a trade-off that remains an open question as the field scales to more complex real-world environments.

Claimed Contributions

SAGE-3D paradigm for semantically and physically aligned 3D Gaussian environments

The authors introduce SAGE-3D, a paradigm that upgrades 3D Gaussian Splatting from a rendering-only representation into an executable environment foundation by adding object-level semantics and physics-aware execution capabilities for embodied navigation tasks.

4 retrieved papers
InteriorGS dataset with object-level annotated 3DGS scenes

The authors release InteriorGS, a dataset containing 1,000 manually annotated 3D Gaussian Splatting indoor scenes with over 554,000 object instances across 755 categories, providing fine-grained object-level semantics including instance IDs, categories, and bounding boxes.

10 retrieved papers
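As a rough illustration of what an object-level annotation with an instance ID, category, and bounding box could look like, and how such boxes might double as simple collision objects, here is a minimal Python sketch. The field names, the axis-aligned box format, the metre unit, and the `collides` helper are all assumptions made for illustration; this report does not specify InteriorGS's actual schema or SAGE-3D's collision representation.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class ObjectAnnotation:
    """Hypothetical object-level record: instance ID, category, axis-aligned box."""
    instance_id: int
    category: str
    box_min: Vec3  # lower (x, y, z) corner; metres assumed
    box_max: Vec3  # upper (x, y, z) corner; metres assumed


def collides(obj: ObjectAnnotation, point: Vec3, radius: float = 0.0) -> bool:
    """Return True if a sphere of `radius` centred at `point` overlaps the box."""
    dist_sq = 0.0
    for p, lo, hi in zip(point, obj.box_min, obj.box_max):
        nearest = min(max(p, lo), hi)  # clamp the coordinate onto the box
        dist_sq += (p - nearest) ** 2
    return dist_sq <= radius ** 2


# Illustrative object; the values are made up for the example.
sofa = ObjectAnnotation(1, "sofa", (0.0, 0.0, 0.0), (2.0, 1.0, 0.8))
print(collides(sofa, (1.0, 0.5, 0.4)))       # point inside the box -> True
print(collides(sofa, (3.0, 0.5, 0.4), 0.5))  # 1.0 m from the box, radius 0.5 -> False
```

A sphere-versus-box test is used here only because it is the simplest way to approximate an agent body against annotated boxes; a real physics interface would typically rely on a simulator's collision meshes.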
SAGE-Bench VLN benchmark with hierarchical instructions and continuity metrics

The authors introduce SAGE-Bench, the first 3DGS-based Vision-and-Language Navigation benchmark. It features 2 million trajectory-instruction pairs, hierarchical instruction generation that combines high-level semantic goals with low-level actions, and three novel metrics for the natural continuity of navigation (Continuous Success Ratio, Integrated Collision Penalty, and Path Smoothness).

10 retrieved papers
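This report names the three continuity metrics but does not give their formulas. To make the idea of a path-smoothness measure concrete, the Python sketch below computes a generic proxy: the mean absolute heading change between consecutive segments of a 2D path. The function name `mean_turning_angle` and the definition itself are assumptions for illustration only and should not be read as SAGE-Bench's Path Smoothness metric.

```python
import math
from typing import List, Tuple


def mean_turning_angle(path: List[Tuple[float, float]]) -> float:
    """Mean absolute heading change (radians) between consecutive segments.

    Generic illustrative proxy: lower values mean a smoother path.
    This is NOT the benchmark's own definition.
    """
    # Heading of each non-degenerate segment.
    headings = [
        math.atan2(y1 - y0, x1 - x0)
        for (x0, y0), (x1, y1) in zip(path, path[1:])
        if (x0, y0) != (x1, y1)  # skip zero-length segments
    ]
    if len(headings) < 2:
        return 0.0
    turns = []
    for h0, h1 in zip(headings, headings[1:]):
        d = h1 - h0
        d = math.atan2(math.sin(d), math.cos(d))  # wrap into [-pi, pi]
        turns.append(abs(d))
    return sum(turns) / len(turns)


straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
corner = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
print(mean_turning_angle(straight))  # 0.0
print(mean_turning_angle(corner))    # about pi/2 for a single right-angle turn
```

A straight path scores zero and a single right-angle turn scores roughly pi/2, which shows how such a measure separates smooth trajectories from jerky ones.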
