MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Models for Embodied Task Planning
Overview
Overall Novelty Assessment
The paper introduces MomaGraph, a unified scene representation integrating spatial-functional relationships and part-level interactive elements for mobile manipulators. It resides in the 'LLM-Based Task Planning with Scene Graphs' leaf, which contains five papers including the original work. This leaf sits within the broader 'Task Planning and Reasoning with Scene Graphs' branch, indicating a moderately populated research direction. The taxonomy shows that while scene graph construction and navigation have multiple specialized subcategories, task planning with LLMs represents a focused but active area where structured spatial knowledge meets language-driven reasoning.
The taxonomy reveals neighboring work in 'Schema-Guided and Multi-Agent Reasoning' (two papers) and 'Classical and Symbolic Planning with Scene Graphs' (two papers), suggesting that task planning approaches vary in their reliance on foundation models versus symbolic methods. The 'Functional and Interactive Scene Graphs' leaf (four papers) in the construction branch addresses similar concerns about affordances and part-level modeling, though it focuses on representation rather than planning. The 'Dynamic and Temporal Scene Graph Updating' leaf (three papers) tackles temporal consistency, a challenge MomaGraph addresses through its emphasis on object states and updates, bridging construction and planning concerns.
Across the three contributions, the analysis examined thirty candidate papers in total and identified no refuting prior work. For the unified-representation contribution, none of the ten candidates showed clear overlap; the vision-language-model contribution likewise yielded no refutations among its ten candidates; and no anticipating prior work surfaced among the ten papers examined for the dataset-and-benchmark contribution. This limited search scope (thirty semantically similar papers drawn from a fifty-paper taxonomy) suggests the analysis captures closely related work but may not reflect the full breadth of scene graph research. The absence of refutations indicates that, among these top matches, no single prior work directly anticipates MomaGraph's specific combination of spatial-functional integration, part-level modeling, and task-driven evaluation.
Given that the search examined roughly sixty percent of the taxonomy's papers via semantic similarity, the findings suggest MomaGraph occupies a relatively distinct position within its immediate neighborhood. The lack of refutations across all three contributions, combined with the paper's placement in a moderately populated leaf, implies the work synthesizes ideas from multiple branches (construction, planning, and benchmarking) in a novel configuration. However, the limited scope means potential overlaps in the broader literature, particularly in functional scene graphs or memory-augmented planning, remain unexplored by this analysis.
Taxonomy
Research Landscape Overview
Claimed Contributions
MomaGraph is a novel scene representation that unifies spatial and functional relationships while introducing part-level interactive nodes. It provides a compact, adaptive, and task-relevant structured representation for embodied agents, addressing limitations of prior work that separated spatial and functional relations or treated scenes as static snapshots.
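To make the shape of this representation concrete, the following is a minimal sketch, assuming a graph whose nodes exist at both object and part level, carry states, and are linked by edges typed as spatial or functional. The class names, fields, and the fridge-handle example are hypothetical illustrations, not the paper's actual data structures.

```python
# Minimal sketch (assumed structure, not the paper's actual API) of a unified
# scene graph: object- and part-level nodes carry states, and edges are typed
# as either spatial or functional.
from dataclasses import dataclass, field
from typing import Dict, List, Literal, Optional

@dataclass
class Node:
    node_id: str
    label: str                        # e.g. "fridge" or "fridge_handle"
    level: Literal["object", "part"]  # part-level nodes model interactive elements
    state: Optional[str] = None       # e.g. "closed"; states keep the graph from being a static snapshot

@dataclass
class Edge:
    source: str
    target: str
    relation: str                     # e.g. "part_of", "opens", "next_to"
    kind: Literal["spatial", "functional"]

@dataclass
class UnifiedSceneGraph:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def relate(self, source: str, target: str, relation: str,
               kind: Literal["spatial", "functional"]) -> None:
        self.edges.append(Edge(source, target, relation, kind))

# Hypothetical example: a fridge whose handle is a part-level interactive node.
g = UnifiedSceneGraph()
g.add_node(Node("fridge_1", "fridge", "object", state="closed"))
g.add_node(Node("handle_1", "fridge_handle", "part"))
g.relate("handle_1", "fridge_1", "part_of", "spatial")
g.relate("handle_1", "fridge_1", "opens", "functional")
```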
MomaGraph-R1 is a 7B vision-language model trained using the DAPO reinforcement learning algorithm with a graph-alignment reward function. It generates task-oriented scene graphs and serves as a zero-shot task planner within a Graph-then-Plan framework, improving reasoning effectiveness and interpretability.
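The graph-alignment reward is described only at a high level, so the sketch below assumes a simple F1-style score over (subject, relation, object) triples between the predicted and reference graphs. The formulation, function name, and example triples are assumptions for illustration; the paper's actual reward, and how it is plugged into DAPO, may differ.

```python
# Assumed graph-alignment reward: F1 over relation triples shared between the
# model's predicted scene graph and the reference graph.
def graph_alignment_reward(predicted: set, reference: set) -> float:
    """Return an F1-style reward in [0, 1] over (subject, relation, object) triples."""
    if not predicted or not reference:
        return 0.0
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: the model recovers one of two reference relations plus a spurious one.
pred = {("handle_1", "opens", "fridge_1"), ("cup_1", "inside", "fridge_1")}
ref = {("handle_1", "opens", "fridge_1"), ("handle_1", "part_of", "fridge_1")}
print(graph_alignment_reward(pred, ref))  # 0.5
```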
MomaGraph-Scenes is the first dataset jointly modeling spatial and functional relationships with part-level annotations, encompassing multi-view observations and task-aligned scene graphs. MomaGraph-Bench is a comprehensive benchmark evaluating six reasoning capabilities from high-level planning to fine-grained scene understanding.
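As a rough illustration of how a benchmark spanning several reasoning capabilities could be scored, the sketch below aggregates per-capability exact-match accuracy over tagged items. The capability names, record fields, and metric are assumptions, not MomaGraph-Bench's actual schema.

```python
# Assumed evaluation loop: each benchmark item is tagged with one reasoning
# capability, and accuracy is reported per capability.
from collections import defaultdict

def evaluate(items, predict):
    """Aggregate exact-match accuracy per reasoning capability."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        cap = item["capability"]  # e.g. "high_level_planning"
        total[cap] += 1
        if predict(item["observations"], item["question"]) == item["answer"]:
            correct[cap] += 1
    return {cap: correct[cap] / total[cap] for cap in total}

# Usage with a trivial stand-in model and a hypothetical item.
items = [
    {"capability": "high_level_planning", "observations": ["view_0.png"],
     "question": "Which part must be grasped to open the fridge?",
     "answer": "fridge_handle"},
]
print(evaluate(items, lambda obs, q: "fridge_handle"))  # {'high_level_planning': 1.0}
```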
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[9] SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Task Planning
[27] EmbodiedRAG: Dynamic 3D Scene Graph Retrieval for Efficient and Scalable Robot Task Planning
[28] Hierarchical Generation of Action Sequence for Service Robots Based on Scene Graph via Large Language Models
[30] Context Matters! Relaxing Goals with LLMs for Feasible 3D Scene Planning
Contribution Analysis
Detailed comparisons for each claimed contribution
MomaGraph: Unified Scene Graph Representation
MomaGraph is a novel scene representation that unifies spatial and functional relationships while introducing part-level interactive nodes. It provides a compact, adaptive, and task-relevant structured representation for embodied agents, addressing limitations of prior work that separated spatial and functional relations or treated scenes as static snapshots.
[5] Commonsense scene graph-based target localization for object search
[15] 3D scene graphs in robotics: A unified representation bridging geometry, semantics, and action
[51] Dynamic open-vocabulary 3D scene graphs for long-term language-guided mobile manipulation
[52] Scene Graph Generation with Role-Playing Large Language Models
[53] Hydra: A Real-time Spatial Perception System for 3D Scene Graph Construction and Optimization
[54] Fast Contextual Scene Graph Generation with Unbiased Context Augmentation
[55] Visual knowledge graph for human action reasoning in videos
[56] FunGraph: Functionality Aware 3D Scene Graphs for Language-Prompted Scene Interaction
[57] SceneHGN: Hierarchical Graph Networks for 3D Indoor Scene Generation With Fine-Grained Geometry
[58] Part-level scene reconstruction affords robot interaction
MomaGraph-R1: Vision-Language Model with Reinforcement Learning
MomaGraph-R1 is a 7B vision-language model trained using the DAPO reinforcement learning algorithm with a graph-alignment reward function. It generates task-oriented scene graphs and serves as a zero-shot task planner within a Graph-then-Plan framework, improving reasoning effectiveness and interpretability.
[65] UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning
[66] Sense, Imagine, Act: Multimodal Perception Improves Model-Based Reinforcement Learning for Head-to-Head Autonomous Racing
[67] Prompt Informed Reinforcement Learning for Visual Coverage Path Planning
[68] Atari-GPT: Benchmarking Multimodal Large Language Models as Low-Level Policies in Atari Games
[69] Ariadne: A Controllable Framework for Probing and Extending VLM Reasoning Boundaries
[70] Spatial-SSRL: Enhancing Spatial Understanding via Self-Supervised Reinforcement Learning
[71] Multimodal Visual Transformer for Sim2real Transfer in Visual Reinforcement Learning
[72] Brain-Inspired Planning for Better Generalization in Reinforcement Learning
[73] SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization
MomaGraph-Scenes Dataset and MomaGraph-Bench Evaluation Suite
MomaGraph-Scenes is the first dataset jointly modeling spatial and functional relationships with part-level annotations, encompassing multi-view observations and task-aligned scene graphs. MomaGraph-Bench is a comprehensive benchmark evaluating six reasoning capabilities from high-level planning to fine-grained scene understanding.