RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Robot Datasets and Benchmarking; Vision-Language-Action Models; Robot Simulation
Abstract:

Recent advances in robot learning have accelerated progress toward generalist robots that can operate across diverse tasks and environments. Yet despite this momentum, it remains difficult to gauge how close we are to this goal, as the field lacks a reproducible, large-scale benchmark for systematic evaluation. To address this gap, we present RoboCasa365, a comprehensive robot simulation benchmark for everyday tasks. Built on the RoboCasa platform, RoboCasa365 introduces 365 everyday tasks across 2,500 diverse kitchen environments, over 600 hours of human demonstration data, and over 1,600 hours of synthetically generated demonstration data, making it one of the most diverse and large-scale resources for studying generalist policies. We design the benchmark to support evaluation across key settings, including multi-task learning, robot foundation model training, and lifelong learning. We present extensive experiments with state-of-the-art methods and analyze how task diversity, dataset scale, and environment variation shape generalization. Our results provide new insights into what factors most strongly affect the performance of generalist robots and help inform strategies for future progress in the field.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces RoboCasa365, a large-scale simulation benchmark comprising 365 everyday kitchen tasks across 2,500 environments with over 600 hours of human demonstrations and 1,600 hours of synthetic data. It resides in the Multi-Task Manipulation Benchmarks leaf, which contains four papers including RLBench, Calvin, and Meta World. This leaf sits within the moderately populated Benchmark Datasets and Task Suites branch, indicating an active but not overcrowded research direction focused on standardized evaluation protocols for generalist manipulation policies.

The taxonomy reveals that Multi-Task Manipulation Benchmarks neighbor Meta-Learning and Generalization Benchmarks (two papers) and Unified Data Standards (one paper), forming a cohesive cluster under Benchmark Datasets. Adjacent branches include Simulation Frameworks (twelve papers across three leaves) and Policy Learning Architectures (fifteen papers across five leaves). RoboCasa365 bridges these areas by providing both a task suite and infrastructure for training foundation models, connecting to the Generalist and Foundation Models leaf where works like VIMA and Gr00t explore multimodal instruction following and embodied control.

Among the thirty candidates examined, the core benchmark contribution was refuted by one of its ten candidates, suggesting some overlap with prior kitchen-focused benchmarks such as the original RoboCasa platform. The systematic benchmarking framework contribution was compared against ten candidates with zero refutations, indicating relative novelty in its structured evaluation across multi-task learning, foundation model training, and lifelong learning settings. The experimental analysis contribution similarly found no refutations among its ten candidates, suggesting that the specific factors studied (task diversity, dataset scale, and environment variation) have not been systematically analyzed at this scale in prior work.

Given the limited search scope of thirty semantically similar papers, the analysis captures immediate neighbors but cannot claim exhaustive coverage of the broader manipulation benchmarking literature. The work appears to advance an established research direction by scaling up task diversity and environment complexity, while its structured evaluation framework and factor analysis offer incremental but substantive contributions to understanding generalist policy performance. The single refutation for the benchmark contribution reflects expected overlap with closely related kitchen simulation platforms rather than fundamental lack of novelty.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 30
- Refutable Papers: 1

Research Landscape Overview

Core task: training and benchmarking generalist robots in simulation. The field organizes around several complementary branches that together support the development of versatile robotic systems. Simulation Frameworks and Environments provide the foundational platforms (e.g., Orbit[6], ManiSkill3[16]) where robots can be trained at scale, while Benchmark Datasets and Task Suites define standardized evaluation protocols spanning manipulation, navigation, and household activities. Policy Learning Architectures and Methods explore neural network designs and training algorithms, from imitation learning to reinforcement learning and transformer-based approaches, that enable robots to generalize across diverse tasks. Task Planning and Execution addresses higher-level reasoning, and Sim-to-Real Transfer tackles the challenge of deploying policies trained in simulation onto physical hardware. Specialized Applications target domains such as space robotics or construction, and Perspectives papers articulate long-term visions for generalist agents.

Within the Benchmark Datasets and Task Suites branch, a particularly active line focuses on Multi-Task Manipulation Benchmarks that stress-test policy generalization. Works like RLBench[10], Calvin[14], and Meta World[4] established early testbeds with tens to hundreds of tasks, while more recent efforts such as RoboCasa[12] and RoboCasa365[0] scale up both task diversity and scene complexity by incorporating realistic kitchen environments and procedurally generated layouts. RoboCasa365[0] extends this trajectory by offering an even larger repertoire of household manipulation scenarios, positioning itself alongside other high-capacity benchmarks like Massively Parallel Benchmarking[21] that leverage parallelization for rapid evaluation.
Meanwhile, neighboring works such as VIMA[8] emphasize multimodal instruction following, and Gr00t[3] explores foundation models for embodied control, illustrating the interplay between rich task suites and advanced policy architectures. These benchmarks collectively drive progress by revealing where current methods succeed and where open challenges—such as long-horizon reasoning and robust sim-to-real transfer—remain.

Claimed Contributions

RoboCasa365 simulation benchmark for generalist robots

The authors introduce RoboCasa365, a large-scale simulation framework built on RoboCasa that includes 365 everyday tasks across 2,500 diverse kitchen environments, over 600 hours of human demonstration data and over 1600 hours of synthetically generated demonstration data. The benchmark is designed to support evaluation across multi-task learning, robot foundation model training, and lifelong learning settings.

10 retrieved papers; one can refute this contribution.
Systematic benchmarking framework for three learning settings

The authors design a comprehensive suite of benchmarks to systematically study and compare state-of-the-art approaches across three distinct learning paradigms: multi-task training at scale, pretraining followed by post-training (foundation model training), and lifelong learning with sequential task acquisition. This enables reproducible large-scale experiments and analysis of factors influencing generalization.

10 retrieved papers; none can refute this contribution.
Extensive experimental analysis of factors affecting generalist robot performance

The authors perform systematic experiments using RoboCasa365 to investigate how different factors such as task diversity, dataset scale, environment variation, and pretraining data composition impact the performance and generalization capabilities of generalist robot policies. Their results provide insights into what factors most strongly affect generalist robot performance.

10 retrieved papers; none can refute this contribution.

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: RoboCasa365 simulation benchmark for generalist robots

Contribution 2: Systematic benchmarking framework for three learning settings

Contribution 3: Extensive experimental analysis of factors affecting generalist robot performance