MoMaGen: Generating Demonstrations under Soft and Hard Constraints for Multi-Step Bimanual Mobile Manipulation

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: Data Generation for Robot Learning · Bimanual Mobile Manipulation · Imitation Learning for Robotics
Abstract:

Imitation learning from large-scale, diverse human demonstrations has proven effective for training robots, but collecting such data is costly and time-consuming. This challenge is amplified for multi-step bimanual mobile manipulation, where humans must teleoperate both a mobile base and two high-degree-of-freedom arms. Prior automated data generation frameworks have addressed static bimanual manipulation by augmenting a few human demonstrations in simulation, but they fall short for mobile settings due to two key challenges: (1) determining base placement to ensure reachability, and (2) positioning the camera to provide sufficient visibility for visuomotor policies. To address these issues, we introduce MoMaGen, which formulates data generation as a constrained optimization problem that enforces hard constraints (e.g., reachability) while balancing soft constraints (e.g., visibility during navigation). This formulation generalizes prior approaches and provides a principled foundation for future methods. We evaluate MoMaGen on four multi-step bimanual mobile manipulation tasks and show that it generates significantly more diverse datasets than existing methods. Leveraging this diversity, MoMaGen can train successful imitation learning policies from a single source demonstration, and these policies can be fine-tuned with as few as 40 real-world demonstrations to achieve deployment on physical robotic hardware. More details are available at our project page: momagen-iclr2026.github.io.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.

Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MoMaGen, a framework for automatically generating demonstration data for multi-step bimanual mobile manipulation tasks through constrained optimization. It resides in the Simulation-Based Demonstration Synthesis leaf, which contains four papers total, indicating a moderately populated research direction. The sibling papers include DexMimicGen, RoboTwin, and one other work, all addressing automated demonstration synthesis in simulation. This leaf sits within the broader Demonstration Generation and Data Synthesis branch, suggesting the paper targets a recognized but not overcrowded problem space where scalable data generation remains an active challenge.

The taxonomy reveals neighboring research directions that contextualize MoMaGen's positioning. The adjacent Real-World Demonstration Augmentation leaf contains methods that expand a small set of human demonstrations rather than synthesizing data purely in simulation. Downstream, the Imitation Learning for Bimanual Manipulation branch shows how generated demonstrations feed into policy learning architectures. The Mobile Manipulation Integration branch addresses navigation-manipulation coordination, which MoMaGen explicitly targets through its base placement and camera visibility constraints. The paper appears to bridge simulation-based synthesis with mobile manipulation challenges, connecting two previously separate research threads in the taxonomy.

Among thirty candidates examined across three contributions, none were identified as clearly refuting the proposed work. The MoMaGen framework contribution examined ten candidates with zero refutable matches, as did the unified constrained optimization formulation and the reachability-visibility constraints. This suggests that within the limited search scope, the specific combination of constrained optimization for mobile bimanual demonstration generation appears relatively unexplored. However, the analysis explicitly notes this reflects top-K semantic search results rather than exhaustive coverage, meaning potentially relevant prior work in adjacent optimization-based planning domains may exist outside the examined candidate set.

Based on the limited literature search, the work appears to occupy a distinct position, combining simulation-based synthesis with mobile manipulation constraints. The taxonomy structure shows related work in static bimanual synthesis and separate efforts in mobile manipulation coordination, but the examined candidates did not reveal direct overlap with MoMaGen's integrated approach. The analysis acknowledges its limited scope: it examined thirty semantically similar papers rather than the field comprehensively, and adjacent constraint-based planning and trajectory optimization work may contain relevant methodological precedents.

Taxonomy

Core-task Taxonomy Papers: 42
Claimed Contributions: 3
Contribution Candidate Papers Compared: 30
Refutable Papers: 0

Research Landscape Overview

Core task: Generating demonstrations for multi-step bimanual mobile manipulation. The field addresses the challenge of creating training data for robots that must coordinate two arms while navigating through environments to complete complex, sequential tasks. The taxonomy reveals several complementary research directions:

- Demonstration Generation and Data Synthesis focuses on creating scalable training data through simulation and automated methods, exemplified by works like DexMimicGen[5] and RoboTwin[14].
- Data Collection Systems and Benchmarks establishes standardized platforms and datasets such as Bigym[3] and Mobile ALOHA[2].
- Imitation Learning for Bimanual Manipulation develops policies that learn coordinated dual-arm behaviors from demonstrations.
- Task and Motion Planning tackles the symbolic and geometric reasoning required for long-horizon tasks.
- Reinforcement Learning and Optimization-Based Control explores learning-based and model-based approaches to skill acquisition.
- Mobile Manipulation Integration addresses the unique challenges of combining locomotion with manipulation.

A central tension emerges between simulation-based synthesis approaches, which promise scalability, and real-world data collection, which captures physical nuances. Within Demonstration Generation, some works pursue fully automated synthesis pipelines while others, like Mobile ALOHA[12], emphasize teleoperation systems for high-quality human demonstrations. MoMaGen[0] sits within the Simulation-Based Demonstration Synthesis cluster alongside DexMimicGen[5] and RoboTwin[14], focusing on automated generation of diverse bimanual mobile manipulation trajectories in simulation. Compared to DexMimicGen[5], which emphasizes dexterous in-hand manipulation, MoMaGen[0] appears to tackle the broader integration of mobility with dual-arm coordination. The interplay between these synthesis methods and imitation learning architectures like RDT-1B[1] highlights ongoing questions about how to bridge the sim-to-real gap while maintaining demonstration diversity and task coverage.

Claimed Contributions

MoMaGen framework for bimanual mobile manipulation data generation

The authors introduce MoMaGen, a framework that formulates automated demonstration generation for bimanual mobile manipulation as a constrained optimization problem. This formulation addresses reachability and visibility challenges unique to mobile manipulators by incorporating both hard constraints that must be satisfied and soft constraints that are optimized.

Candidate papers compared: 10
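To make the formulation concrete, the generation step can be read as a generic constrained optimization problem. The notation below is an illustrative reconstruction rather than the paper's own: q is a sampled configuration (base placement, arm configurations, camera pose), the g_i are hard constraints, and the f_j are soft costs with weights w_j.

```latex
\min_{q \in \mathcal{Q}} \; \sum_{j} w_j \, f_j(q)
\qquad \text{subject to} \qquad g_i(q) \le 0 \quad \forall i
```

Hard constraints gate whether a sample is kept at all (e.g., reachability, visibility during manipulation), while the soft costs rank the feasible samples (e.g., visibility during navigation, retraction).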
Unified constrained optimization formulation for X-Gen methods

The authors provide a unified framework that interprets existing X-Gen family methods (MimicGen, SkillMimicGen, DexMimicGen) as instances of constrained optimization with different constraint sets. This generalization offers a principled foundation for understanding and developing automated data generation approaches.

Candidate papers compared: 10
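Below is a minimal sketch of how this unified reading might look in code, under the assumption that each X-Gen method is characterized purely by its constraint sets; every name here (GenerationProblem, collision_free, and so on) is a hypothetical illustration, not the paper's or any library's API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Config = Dict  # stand-in for a sampled configuration (base pose, arms, camera)

@dataclass
class GenerationProblem:
    """A demonstration-generation method viewed as constrained optimization:
    a sample is feasible only if every hard constraint holds; feasible
    samples are then ranked by a weighted sum of soft costs."""
    hard: List[Callable[[Config], bool]]
    soft: List[Tuple[float, Callable[[Config], float]]]

    def score(self, q: Config) -> float:
        if not all(check(q) for check in self.hard):
            return float("inf")  # hard-constraint violation: reject outright
        return sum(w * cost(q) for w, cost in self.soft)

# Placeholder predicates and costs (illustrative stubs only).
def collision_free(q): return True
def reachable(q): return True
def visible_during_manipulation(q): return True
def navigation_visibility_cost(q): return 0.0
def retraction_cost(q): return 0.0

# A static-manipulation method would populate only a small hard set,
# while a mobile bimanual method adds reachability/visibility terms.
static_gen = GenerationProblem(hard=[collision_free], soft=[])
mobile_gen = GenerationProblem(
    hard=[collision_free, reachable, visible_during_manipulation],
    soft=[(1.0, navigation_visibility_cost), (1.0, retraction_cost)],
)
```

Under this reading, MimicGen, SkillMimicGen, and DexMimicGen differ only in which predicates populate the hard set and which costs populate the soft set; score(q) returns infinity whenever any hard constraint fails, so infeasible samples are discarded before soft costs are ever weighed.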
Novel reachability and visibility constraints for mobile manipulation

The authors introduce several technical innovations: reachability as a hard constraint to ensure manipulability; object visibility during manipulation as a hard constraint for visuomotor policy training; object visibility during navigation as a soft constraint; and retraction as a soft constraint to promote safe navigation after manipulation.

Candidate papers compared: 10
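As a rough illustration of the two kinds of hard checks, the sketch below tests reachability with an inverse-kinematics query and visibility with a camera-frustum test. The solver interface and geometry helpers are assumptions for illustration, not the paper's implementation; a full visibility check would also account for occlusion, and the same frustum test averaged along a planned path could serve as the soft navigation-visibility cost.

```python
import numpy as np

def is_reachable(base_pose: np.ndarray, target_pose: np.ndarray, ik_solver) -> bool:
    """Hard constraint: the arm must admit an IK solution for the grasp
    pose expressed in the frame of the sampled base placement.
    `ik_solver` is an assumed interface returning joint angles or None."""
    target_in_base = np.linalg.inv(base_pose) @ target_pose  # 4x4 homogeneous poses
    return ik_solver.solve(target_in_base) is not None

def is_visible(point_world: np.ndarray, cam_pose: np.ndarray,
               fov_rad: float, near: float = 0.1, far: float = 5.0) -> bool:
    """Visibility test: the object point must lie inside the camera's
    viewing frustum (modeled here as a simple cone around the optical axis)."""
    p_cam = (np.linalg.inv(cam_pose) @ np.append(point_world, 1.0))[:3]
    depth = p_cam[2]  # camera looks down its +z axis by convention
    if not (near < depth < far):
        return False
    angle = np.arccos(depth / np.linalg.norm(p_cam))  # angle off the optical axis
    return angle < fov_rad / 2.0
```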

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Each of the three claimed contributions, restated above, was compared against its ten retrieved candidate papers, and none of the candidates was judged to refute it:

1. MoMaGen framework for bimanual mobile manipulation data generation
2. Unified constrained optimization formulation for X-Gen methods
3. Novel reachability and visibility constraints for mobile manipulation