MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Models for Embodied Task Planning

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Scene Graph, Task Planning, Spatial Understanding, Mobile Manipulation
Abstract:

Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To overcome these shortcomings, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. To address this, we construct MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, and design MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision–language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments show that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments. More visualizations and robot demonstrations are available at https://momagraph.github.io/.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MomaGraph, a unified scene representation integrating spatial-functional relationships and part-level interactive elements for mobile manipulators. It resides in the 'LLM-Based Task Planning with Scene Graphs' leaf, which contains five papers including the original work. This leaf sits within the broader 'Task Planning and Reasoning with Scene Graphs' branch, indicating a moderately populated research direction. The taxonomy shows that while scene graph construction and navigation have multiple specialized subcategories, task planning with LLMs represents a focused but active area where structured spatial knowledge meets language-driven reasoning.

The taxonomy reveals neighboring work in 'Schema-Guided and Multi-Agent Reasoning' (two papers) and 'Classical and Symbolic Planning with Scene Graphs' (two papers), suggesting that task planning approaches vary in their reliance on foundation models versus symbolic methods. The 'Functional and Interactive Scene Graphs' leaf (four papers) in the construction branch addresses similar concerns about affordances and part-level modeling, though it focuses on representation rather than planning. The 'Dynamic and Temporal Scene Graph Updating' leaf (three papers) tackles temporal consistency, a challenge MomaGraph addresses through its emphasis on object states and updates, bridging construction and planning concerns.

Across the three contributions, the analysis examined twenty-nine candidates in total, with zero refutable pairs identified. The unified-representation contribution examined ten candidates, none of which provided clear overlap; the vision-language-model contribution likewise found no refutations among its nine candidates; and the dataset-and-benchmark contribution encountered no conflicting prior work among its ten. This limited search scope, twenty-nine semantically similar papers drawn from a fifty-paper taxonomy, suggests the analysis captures closely related work but may not reflect the full breadth of scene graph research. The absence of refutations indicates that, among these top matches, no single prior work directly anticipates MomaGraph's specific combination of spatial-functional integration, part-level modeling, and task-driven evaluation.

Given that the search covered roughly sixty percent of the taxonomy's papers through semantic similarity, the findings suggest MomaGraph occupies a relatively distinct position within its immediate neighborhood. The lack of refutations across all three contributions, combined with the paper's placement in a moderately populated leaf, implies that the work synthesizes ideas from multiple branches (construction, planning, and benchmarking) in a novel configuration. However, the limited scope means that potential overlaps in the broader literature, particularly in functional scene graphs or memory-augmented planning, remain unexplored by this analysis.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: task-oriented scene graph generation for embodied agents. The field centers on building structured, graph-based representations of environments that enable robots and virtual agents to reason about objects, spatial relationships, and affordances in service of concrete tasks. The taxonomy reveals several major branches: Scene Graph Construction and Representation focuses on how to extract and maintain these graphs from sensor data, often leveraging vision-language models or open-vocabulary methods like Open Vocabulary Functional Graphs[1]. Task Planning and Reasoning with Scene Graphs explores how agents use these structures to decompose high-level goals into executable steps, with many studies integrating large language models to ground symbolic reasoning in spatial context. Navigation with Scene Graphs addresses how agents exploit relational cues for efficient exploration and goal-directed movement, while Manipulation and Interaction branches examine grasping, tool use, and dynamic updates to the graph as the agent acts. Additional branches cover Scene Generation and Simulation for synthetic training data, Domain-Specific Applications such as surgical or household robotics, Representation Learning for continuous embeddings of graph elements, and Benchmarking and Datasets that provide standardized evaluation.

A particularly active line of work lies at the intersection of LLM-based planning and scene graph reasoning, where systems like SayPlan[9] and EmbodiedRAG[27] demonstrate how pretrained language models can be grounded in structured spatial knowledge to produce more interpretable and adaptable plans. MomaGraph[0] sits squarely within this LLM-based task planning cluster, emphasizing how multi-modal scene graphs can inform language-driven decision-making for embodied agents.
Compared to Hierarchical Action Generation[28], which focuses on decomposing actions into sub-goals, MomaGraph[0] places greater emphasis on the interplay between visual scene understanding and linguistic task specifications. Meanwhile, Context Matters[30] highlights the importance of situational context in grounding, a theme that complements MomaGraph[0]'s approach to integrating rich relational information. Open questions remain around scalability to large, dynamic environments, the trade-off between symbolic and continuous representations, and how to maintain graph consistency under partial observability and real-time constraints.
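The "Graph-then-Plan" pattern discussed above can be sketched as a two-stage pipeline: serialize the scene graph into text, then prompt a language model to plan against it. The sketch below is purely illustrative; the serialization format, prompt wording, and the `query_llm` callable are assumptions, not the paper's actual implementation.

```python
# Hypothetical Graph-then-Plan sketch: turn a scene graph into text,
# then ask a language model to produce a grounded, step-by-step plan.

def serialize_graph(triplets):
    """Render (subject, relation, object) triplets as one line each."""
    return "\n".join(f"{s} --{r}--> {o}" for s, r, o in triplets)

def graph_then_plan(task, triplets, query_llm):
    """Stage 1: serialize the graph. Stage 2: prompt the model to plan."""
    prompt = (
        f"Scene graph:\n{serialize_graph(triplets)}\n\n"
        f"Task: {task}\n"
        "Produce a numbered step-by-step plan using only objects above."
    )
    return query_llm(prompt)

# Usage with a stub in place of a real model:
triplets = [("milk", "inside", "fridge"), ("fridge_handle", "part_of", "fridge")]
plan = graph_then_plan("fetch the milk", triplets,
                       lambda p: "1. open fridge\n2. grasp milk")
```

Grounding the planner in an explicit graph, rather than raw images, is what makes the resulting plans inspectable: each step can be checked against the serialized relations.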

Claimed Contributions

MomaGraph: Unified Scene Graph Representation

MomaGraph is a novel scene representation that unifies spatial and functional relationships while introducing part-level interactive nodes. It provides a compact, adaptive, and task-relevant structured representation for embodied agents, addressing limitations of prior work that separated spatial and functional relations or treated scenes as static snapshots.

10 retrieved papers
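A representation unifying spatial and functional relations with part-level interactive nodes and mutable object states could be structured roughly as below. All node, relation, and state names here are hypothetical illustrations; the paper's actual schema may differ.

```python
from dataclasses import dataclass, field

# Illustrative sketch of a state-aware, unified scene graph:
# object- and part-level nodes, typed (spatial vs. functional) edges,
# and temporal state updates as the agent acts.

@dataclass
class Node:
    name: str
    kind: str                                   # "object" or "part"
    state: dict = field(default_factory=dict)   # e.g. {"open": False}

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)   # (src, relation, dst, rel_type)

    def add_node(self, node):
        self.nodes[node.name] = node

    def relate(self, src, relation, dst, rel_type):
        # rel_type distinguishes "spatial" from "functional" relations
        self.edges.append((src, relation, dst, rel_type))

    def update_state(self, name, **changes):
        # temporal update: agent actions mutate object/part states
        self.nodes[name].state.update(changes)

# Build a tiny kitchen scene.
g = SceneGraph()
g.add_node(Node("fridge", "object", {"open": False}))
g.add_node(Node("fridge_door_handle", "part"))   # part-level interactive element
g.add_node(Node("milk", "object"))
g.relate("fridge_door_handle", "part_of", "fridge", "functional")
g.relate("milk", "inside", "fridge", "spatial")
g.update_state("fridge", open=True)              # agent opened the fridge
```

Keeping both relation types in one edge list, rather than two separate graphs, is what lets a single query answer both "where is the milk?" and "which part do I pull to reach it?".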
MomaGraph-R1: Vision-Language Model with Reinforcement Learning

MomaGraph-R1 is a 7B vision-language model trained using the DAPO reinforcement learning algorithm with a graph-alignment reward function. It generates task-oriented scene graphs and serves as a zero-shot task planner within a Graph-then-Plan framework, improving reasoning effectiveness and interpretability.

9 retrieved papers
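One plausible form for a graph-alignment reward is set overlap between predicted and reference edges, e.g. F1 over (subject, relation, object) triplets; the sketch below illustrates that idea only, and the paper's actual reward design may differ.

```python
# Hypothetical graph-alignment reward: F1 over (subject, relation, object)
# triplets of the predicted vs. reference scene graph. Such a scalar score
# could serve as the per-sample reward in an RL loop like DAPO.

def graph_alignment_reward(pred_triplets, gold_triplets):
    pred, gold = set(pred_triplets), set(gold_triplets)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)          # correctly predicted edges
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = [("milk", "inside", "fridge"), ("handle", "part_of", "fridge")]
pred = [("milk", "inside", "fridge"), ("milk", "on", "table")]
r = graph_alignment_reward(pred, gold)   # one of two gold edges recovered
```

An F1-style reward penalizes both hallucinated edges (via precision) and missing ones (via recall), which matters when the model is free to emit graphs of any size.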
MomaGraph-Scenes Dataset and MomaGraph-Bench Evaluation Suite

MomaGraph-Scenes is the first dataset jointly modeling spatial and functional relationships with part-level annotations, encompassing multi-view observations and task-aligned scene graphs. MomaGraph-Bench is a comprehensive benchmark evaluating six reasoning capabilities from high-level planning to fine-grained scene understanding.

10 retrieved papers
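A benchmark spanning several reasoning capabilities is typically scored per capability as well as overall. The sketch below shows one way such scoring could work; the capability names and record format are assumptions, not MomaGraph-Bench's actual schema.

```python
from collections import defaultdict

# Illustrative benchmark scoring: per-capability and overall accuracy
# over a list of (capability, is_correct) question records.

def score(records):
    per_cap = defaultdict(list)
    for cap, ok in records:
        per_cap[cap].append(ok)
    cap_acc = {c: sum(v) / len(v) for c, v in per_cap.items()}
    overall = sum(ok for _, ok in records) / len(records)
    return cap_acc, overall

records = [("planning", True), ("planning", False), ("spatial", True)]
cap_acc, overall = score(records)
```

Reporting both granularities is what lets a benchmark distinguish a model that is uniformly mediocre from one that excels at fine-grained scene understanding but fails at high-level planning.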

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MomaGraph: Unified Scene Graph Representation

Contribution

MomaGraph-R1: Vision-Language Model with Reinforcement Learning

Contribution

MomaGraph-Scenes Dataset and MomaGraph-Bench Evaluation Suite