MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Models for Embodied Task Planning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Scene Graph, Task Planning, Spatial Understanding, Mobile Manipulation
Abstract:

Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To overcome these shortcomings, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. To address this, we construct MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, and design MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision–language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments show that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments. More visualizations and robot demonstrations are available at https://momagraph.github.io/.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MomaGraph, a unified scene representation integrating spatial-functional relationships and part-level interactive elements for mobile manipulators. It resides in the 'LLM-Based Task Planning with Scene Graphs' leaf, which contains five papers including the original work. This leaf sits within the broader 'Task Planning and Reasoning with Scene Graphs' branch, indicating a moderately populated research direction. The taxonomy shows that while scene graph construction and navigation have multiple specialized subcategories, task planning with LLMs represents a focused but active area where structured spatial knowledge meets language-driven reasoning.

The taxonomy reveals neighboring work in 'Schema-Guided and Multi-Agent Reasoning' (two papers) and 'Classical and Symbolic Planning with Scene Graphs' (two papers), suggesting that task planning approaches vary in their reliance on foundation models versus symbolic methods. The 'Functional and Interactive Scene Graphs' leaf (four papers) in the construction branch addresses similar concerns about affordances and part-level modeling, though it focuses on representation rather than planning. The 'Dynamic and Temporal Scene Graph Updating' leaf (three papers) tackles temporal consistency, a challenge MomaGraph addresses through its emphasis on object states and updates, bridging construction and planning concerns.

Across the three contributions, the analysis examined twenty-nine candidates in total, with zero refutable pairs identified. The unified-representation contribution examined ten candidates, none of which provided clear overlap; the vision-language-model contribution likewise found no refutations among its nine candidates; and the dataset-and-benchmark contribution encountered no conflicting prior work among its ten. This limited search scope, twenty-nine semantically similar papers drawn from a fifty-paper taxonomy, suggests the analysis captures closely related work but may not reflect the full breadth of scene graph research. The absence of refutations indicates that, among these top matches, no single prior work directly anticipates MomaGraph's specific combination of spatial-functional integration, part-level modeling, and task-driven evaluation.

Given that the search covered roughly sixty percent of the taxonomy's papers through semantic similarity, the findings suggest MomaGraph occupies a relatively distinct position within its immediate neighborhood. The lack of refutations across all three contributions, combined with the paper's placement in a moderately populated leaf, implies that the work synthesizes ideas from multiple branches (construction, planning, and benchmarking) in a novel configuration. However, the limited scope means that potential overlaps in the broader literature, particularly in functional scene graphs or memory-augmented planning, remain unexplored by this analysis.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: task-oriented scene graph generation for embodied agents. The field centers on building structured, graph-based representations of environments that enable robots and virtual agents to reason about objects, spatial relationships, and affordances in service of concrete tasks. The taxonomy reveals several major branches: Scene Graph Construction and Representation focuses on how to extract and maintain these graphs from sensor data, often leveraging vision-language models or open-vocabulary methods like Open Vocabulary Functional Graphs[1]. Task Planning and Reasoning with Scene Graphs explores how agents use these structures to decompose high-level goals into executable steps, with many studies integrating large language models to ground symbolic reasoning in spatial context. Navigation with Scene Graphs addresses how agents exploit relational cues for efficient exploration and goal-directed movement, while Manipulation and Interaction branches examine grasping, tool use, and dynamic updates to the graph as the agent acts. Additional branches cover Scene Generation and Simulation for synthetic training data, Domain-Specific Applications such as surgical or household robotics, Representation Learning for continuous embeddings of graph elements, and Benchmarking and Datasets that provide standardized evaluation.

A particularly active line of work lies at the intersection of LLM-based planning and scene graph reasoning, where systems like SayPlan[9] and EmbodiedRAG[27] demonstrate how pretrained language models can be grounded in structured spatial knowledge to produce more interpretable and adaptable plans. MomaGraph[0] sits squarely within this LLM-based task planning cluster, emphasizing how multi-modal scene graphs can inform language-driven decision-making for embodied agents.
Compared to Hierarchical Action Generation[28], which focuses on decomposing actions into sub-goals, MomaGraph[0] places greater emphasis on the interplay between visual scene understanding and linguistic task specifications. Meanwhile, Context Matters[30] highlights the importance of situational context in grounding, a theme that complements MomaGraph[0]'s approach to integrating rich relational information. Open questions remain around scalability to large, dynamic environments, the trade-off between symbolic and continuous representations, and how to maintain graph consistency under partial observability and real-time constraints.
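The "Graph-then-Plan" pattern discussed above can be sketched as a two-stage pipeline: serialize the scene graph into text, then prompt a language model to plan against it. The sketch below is purely illustrative; the serialization format, prompt wording, and the `query_llm` callable are assumptions, not the paper's actual implementation.

```python
# Hypothetical Graph-then-Plan sketch: turn a scene graph into text,
# then ask a language model to produce a grounded, step-by-step plan.

def serialize_graph(triplets):
    """Render (subject, relation, object) triplets as one line each."""
    return "\n".join(f"{s} --{r}--> {o}" for s, r, o in triplets)

def graph_then_plan(task, triplets, query_llm):
    """Stage 1: serialize the graph. Stage 2: prompt the model to plan."""
    prompt = (
        f"Scene graph:\n{serialize_graph(triplets)}\n\n"
        f"Task: {task}\n"
        "Produce a numbered step-by-step plan using only objects above."
    )
    return query_llm(prompt)

# Usage with a stub in place of a real model:
triplets = [("milk", "inside", "fridge"), ("fridge_handle", "part_of", "fridge")]
plan = graph_then_plan("fetch the milk", triplets,
                       lambda p: "1. open fridge\n2. grasp milk")
```

Grounding the planner in an explicit graph, rather than raw images, is what makes the resulting plans inspectable: each step can be checked against the serialized relations.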

Claimed Contributions

MomaGraph: Unified Scene Graph Representation

MomaGraph is a novel scene representation that unifies spatial and functional relationships while introducing part-level interactive nodes. It provides a compact, adaptive, and task-relevant structured representation for embodied agents, addressing limitations of prior work that separated spatial and functional relations or treated scenes as static snapshots.

10 retrieved papers
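A representation unifying spatial and functional relations with part-level interactive nodes and mutable object states could be structured roughly as below. All node, relation, and state names here are hypothetical illustrations; the paper's actual schema may differ.

```python
from dataclasses import dataclass, field

# Illustrative sketch of a state-aware, unified scene graph:
# object- and part-level nodes, typed (spatial vs. functional) edges,
# and temporal state updates as the agent acts.

@dataclass
class Node:
    name: str
    kind: str                                   # "object" or "part"
    state: dict = field(default_factory=dict)   # e.g. {"open": False}

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)   # (src, relation, dst, rel_type)

    def add_node(self, node):
        self.nodes[node.name] = node

    def relate(self, src, relation, dst, rel_type):
        # rel_type distinguishes "spatial" from "functional" relations
        self.edges.append((src, relation, dst, rel_type))

    def update_state(self, name, **changes):
        # temporal update: agent actions mutate object/part states
        self.nodes[name].state.update(changes)

# Build a tiny kitchen scene.
g = SceneGraph()
g.add_node(Node("fridge", "object", {"open": False}))
g.add_node(Node("fridge_door_handle", "part"))   # part-level interactive element
g.add_node(Node("milk", "object"))
g.relate("fridge_door_handle", "part_of", "fridge", "functional")
g.relate("milk", "inside", "fridge", "spatial")
g.update_state("fridge", open=True)              # agent opened the fridge
```

Keeping both relation types in one edge list, rather than two separate graphs, is what lets a single query answer both "where is the milk?" and "which part do I pull to reach it?".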
MomaGraph-R1: Vision-Language Model with Reinforcement Learning

MomaGraph-R1 is a 7B vision-language model trained using the DAPO reinforcement learning algorithm with a graph-alignment reward function. It generates task-oriented scene graphs and serves as a zero-shot task planner within a Graph-then-Plan framework, improving reasoning effectiveness and interpretability.

9 retrieved papers
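One plausible form for a graph-alignment reward is set overlap between predicted and reference edges, e.g. F1 over (subject, relation, object) triplets; the sketch below illustrates that idea only, and the paper's actual reward design may differ.

```python
# Hypothetical graph-alignment reward: F1 over (subject, relation, object)
# triplets of the predicted vs. reference scene graph. Such a scalar score
# could serve as the per-sample reward in an RL loop like DAPO.

def graph_alignment_reward(pred_triplets, gold_triplets):
    pred, gold = set(pred_triplets), set(gold_triplets)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)          # correctly predicted edges
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = [("milk", "inside", "fridge"), ("handle", "part_of", "fridge")]
pred = [("milk", "inside", "fridge"), ("milk", "on", "table")]
r = graph_alignment_reward(pred, gold)   # one of two gold edges recovered
```

An F1-style reward penalizes both hallucinated edges (via precision) and missing ones (via recall), which matters when the model is free to emit graphs of any size.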
MomaGraph-Scenes Dataset and MomaGraph-Bench Evaluation Suite

MomaGraph-Scenes is the first dataset jointly modeling spatial and functional relationships with part-level annotations, encompassing multi-view observations and task-aligned scene graphs. MomaGraph-Bench is a comprehensive benchmark evaluating six reasoning capabilities from high-level planning to fine-grained scene understanding.

10 retrieved papers
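A benchmark spanning several reasoning capabilities is typically scored per capability as well as overall. The sketch below shows one way such scoring could work; the capability names and record format are assumptions, not MomaGraph-Bench's actual schema.

```python
from collections import defaultdict

# Illustrative benchmark scoring: per-capability and overall accuracy
# over a list of (capability, is_correct) question records.

def score(records):
    per_cap = defaultdict(list)
    for cap, ok in records:
        per_cap[cap].append(ok)
    cap_acc = {c: sum(v) / len(v) for c, v in per_cap.items()}
    overall = sum(ok for _, ok in records) / len(records)
    return cap_acc, overall

records = [("planning", True), ("planning", False), ("spatial", True)]
cap_acc, overall = score(records)
```

Reporting both granularities is what lets a benchmark distinguish a model that is uniformly mediocre from one that excels at fine-grained scene understanding but fails at high-level planning.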

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MomaGraph: Unified Scene Graph Representation

Contribution

MomaGraph-R1: Vision-Language Model with Reinforcement Learning

Contribution

MomaGraph-Scenes Dataset and MomaGraph-Bench Evaluation Suite