RoboCasa365: A Large-Scale Simulation Framework for Training and Benchmarking Generalist Robots

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Robot Datasets and Benchmarking; Vision-Language-Action Models; Robot Simulation
Abstract:

Recent advances in robot learning have accelerated progress toward generalist robots that can operate across diverse tasks and environments. Yet despite this momentum, it remains difficult to gauge how close we are to this goal, as the field lacks a reproducible, large-scale benchmark for systematic evaluation. To address this gap, we present RoboCasa365, a comprehensive robot simulation benchmark for everyday tasks. Built on the RoboCasa platform, RoboCasa365 introduces 365 everyday tasks across 2,500 diverse kitchen environments, over 600 hours of human demonstration data, and over 1,600 hours of synthetically generated demonstration data, making it one of the most diverse and large-scale resources for studying generalist policies. We design the benchmark to support evaluation across key settings, including multi-task learning, robot foundation model training, and lifelong learning. We present extensive experiments with state-of-the-art methods and analyze how task diversity, dataset scale, and environment variation shape generalization. Our results provide new insights into what factors most strongly affect the performance of generalist robots and help inform strategies for future progress in the field.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces RoboCasa365, a large-scale simulation benchmark comprising 365 everyday kitchen tasks across 2,500 environments with over 600 hours of human demonstrations and 1,600 hours of synthetic data. It resides in the Multi-Task Manipulation Benchmarks leaf, which contains four papers including RLBench, Calvin, and Meta World. This leaf sits within the moderately populated Benchmark Datasets and Task Suites branch, indicating an active but not overcrowded research direction focused on standardized evaluation protocols for generalist manipulation policies.

The taxonomy reveals that Multi-Task Manipulation Benchmarks neighbor Meta-Learning and Generalization Benchmarks (two papers) and Unified Data Standards (one paper), forming a cohesive cluster under Benchmark Datasets. Adjacent branches include Simulation Frameworks (twelve papers across three leaves) and Policy Learning Architectures (fifteen papers across five leaves). RoboCasa365 bridges these areas by providing both a task suite and infrastructure for training foundation models, connecting to the Generalist and Foundation Models leaf where works like VIMA and Gr00t explore multimodal instruction following and embodied control.

Among the thirty candidates examined, the core benchmark contribution was refuted by one of its ten candidates, suggesting some overlap with prior kitchen-focused benchmarks such as the original RoboCasa platform. The systematic benchmarking framework contribution was compared against ten candidates with zero refutations, indicating relative novelty in its structured evaluation across multi-task learning, foundation model training, and lifelong learning settings. The experimental analysis contribution similarly found no refutations among its ten candidates, suggesting that the specific factors studied (task diversity, dataset scale, and environment variation) have not been systematically analyzed at this scale in prior work.

Given the limited search scope of thirty semantically similar papers, the analysis captures immediate neighbors but cannot claim exhaustive coverage of the broader manipulation benchmarking literature. The work appears to advance an established research direction by scaling up task diversity and environment complexity, while its structured evaluation framework and factor analysis offer incremental but substantive contributions to understanding generalist policy performance. The single refutation for the benchmark contribution reflects expected overlap with closely related kitchen simulation platforms rather than fundamental lack of novelty.

Taxonomy

- Core-task Taxonomy Papers: 50
- Claimed Contributions: 3
- Contribution Candidate Papers Compared: 30
- Refutable Papers: 1

Research Landscape Overview

Core task: training and benchmarking generalist robots in simulation. The field organizes around several complementary branches that together support the development of versatile robotic systems. Simulation Frameworks and Environments provide the foundational platforms (e.g., Orbit[6], ManiSkill3[16]) where robots can be trained at scale, while Benchmark Datasets and Task Suites define standardized evaluation protocols spanning manipulation, navigation, and household activities. Policy Learning Architectures and Methods explore neural network designs and training algorithms, from imitation learning to reinforcement learning and transformer-based approaches, that enable robots to generalize across diverse tasks. Task Planning and Execution addresses higher-level reasoning, and Sim-to-Real Transfer tackles the challenge of deploying policies trained in simulation onto physical hardware. Specialized Applications target domains such as space robotics or construction, and Perspectives papers articulate long-term visions for generalist agents.

Within the Benchmark Datasets and Task Suites branch, a particularly active line focuses on Multi-Task Manipulation Benchmarks that stress-test policy generalization. Works like RLBench[10], Calvin[14], and Meta World[4] established early testbeds with tens to hundreds of tasks, while more recent efforts such as RoboCasa[12] and RoboCasa365[0] scale up both task diversity and scene complexity by incorporating realistic kitchen environments and procedurally generated layouts. RoboCasa365[0] extends this trajectory by offering an even larger repertoire of household manipulation scenarios, positioning itself alongside other high-capacity benchmarks like Massively Parallel Benchmarking[21] that leverage parallelization for rapid evaluation.
Meanwhile, neighboring works such as VIMA[8] emphasize multimodal instruction following, and Gr00t[3] explores foundation models for embodied control, illustrating the interplay between rich task suites and advanced policy architectures. These benchmarks collectively drive progress by revealing where current methods succeed and where open challenges—such as long-horizon reasoning and robust sim-to-real transfer—remain.

Claimed Contributions

RoboCasa365 simulation benchmark for generalist robots

The authors introduce RoboCasa365, a large-scale simulation framework built on RoboCasa that includes 365 everyday tasks across 2,500 diverse kitchen environments, over 600 hours of human demonstration data and over 1600 hours of synthetically generated demonstration data. The benchmark is designed to support evaluation across multi-task learning, robot foundation model training, and lifelong learning settings.

10 retrieved papers; one can refute this contribution.
Systematic benchmarking framework for three learning settings

The authors design a comprehensive suite of benchmarks to systematically study and compare state-of-the-art approaches across three distinct learning paradigms: multi-task training at scale, pretraining followed by post-training (foundation model training), and lifelong learning with sequential task acquisition. This enables reproducible large-scale experiments and analysis of factors influencing generalization.

10 retrieved papers; none can refute this contribution.
Extensive experimental analysis of factors affecting generalist robot performance

The authors perform systematic experiments using RoboCasa365 to investigate how different factors such as task diversity, dataset scale, environment variation, and pretraining data composition impact the performance and generalization capabilities of generalist robot policies. Their results provide insights into what factors most strongly affect generalist robot performance.

10 retrieved papers; none can refute this contribution.

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution 1: RoboCasa365 simulation benchmark for generalist robots

Contribution 2: Systematic benchmarking framework for three learning settings

Contribution 3: Extensive experimental analysis of factors affecting generalist robot performance