Learning Massively Multitask World Models for Continuous Control

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: reinforcement learning, world models, continuous control
Abstract:

General-purpose control demands agents that act across many tasks and embodiments, yet research on reinforcement learning (RL) for continuous control remains dominated by single-task or offline regimes, reinforcing a view that online RL does not scale. Inspired by the foundation model recipe (large-scale pretraining followed by light RL), we ask whether a single agent can be trained on hundreds of tasks with online interaction. To accelerate research in this direction, we introduce a new benchmark with 200 diverse tasks spanning many domains and embodiments, each with language instructions, demonstrations, and optionally image observations. We then present Newt, a language-conditioned multitask world model that is first pretrained on demonstrations to acquire task-aware representations and action priors, and then jointly optimized with online interaction across all tasks. Experiments show that Newt yields better multitask performance and data efficiency than a set of strong baselines, exhibits strong open-loop control, and enables rapid adaptation to unseen tasks. We release our environments, demonstrations, code for training and evaluation, as well as 200+ checkpoints. Website: https://newt-world-models.github.io

Disclaimer
This report is AI-GENERATED using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces MMBench, a 200-task benchmark spanning diverse domains and embodiments, and Newt, a language-conditioned world model pretrained on demonstrations then fine-tuned with online RL. It resides in the 'Generalist World Model Pretraining' leaf, which contains only three papers total. This is a relatively sparse research direction within the broader taxonomy of 50 papers across 36 topics, suggesting the work targets an emerging area where large-scale multitask world model pretraining remains underexplored compared to more established branches like cross-embodiment transfer or policy architectures.

The taxonomy reveals neighboring directions that share conceptual ground but differ in scope. 'Latent World Models for Continuous Control' focuses on trajectory optimization without large-scale pretraining, while sibling categories like 'Cross-Embodiment Transfer' and 'Language and Multimodal Grounding' address complementary challenges of morphology generalization and linguistic task specification. The paper's emphasis on massively multitask online RL distinguishes it from purely offline or single-task model-based methods, positioning it at the intersection of world model learning, language conditioning, and scalable online interaction across hundreds of tasks.

Among the 25 candidates examined, the contribution-level analysis shows mixed novelty signals. For the benchmark contribution (MMBench), 10 candidates were examined with zero refutations, suggesting little prior work on 200-task continuous control benchmarks at this scale. For the Newt architecture, 5 candidates were examined with no refutations, indicating that the specific combination of language conditioning and world model pretraining may be relatively novel. For the demonstration of massively multitask online RL, however, 10 candidates were examined and 1 refutable match was found, suggesting some prior exploration of large-scale multitask online learning, though the search scope was limited.

Based on the top-25 semantic matches examined, the work appears to occupy a sparsely populated research direction, particularly in combining world model pretraining with hundreds of tasks and online interaction. The limited search scope means the analysis cannot rule out relevant prior work outside the candidate set, especially in adjacent areas like offline multitask RL or smaller-scale world model benchmarks. The taxonomy structure suggests the field is still consolidating around how to scale model-based multitask learning effectively.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 1

Research Landscape Overview

Core task: multitask reinforcement learning for continuous control across diverse tasks and embodiments. The field is organized around several complementary directions:

- Multitask Policy Architectures and Parameter Sharing explores how to design networks that efficiently handle multiple tasks, often through attention mechanisms or mixture-of-experts structures like Attention Mixture Experts[3].
- Cross-Task Knowledge Transfer and Guidance investigates how learned policies can inform or accelerate training on new tasks, as seen in Cross-Task Policy Guidance[4].
- Cross-Embodiment Transfer and Generalization addresses the challenge of applying policies across different robot morphologies, with works like Cross-Embodied Learning[8] and Universal Morphology Control[34] tackling morphology-agnostic control.
- World Models and Model-Based Multitask Learning focuses on learning predictive models that generalize across tasks, exemplified by TD-MPC2[5] and GenRL[16].
- Language and Multimodal Grounding for Embodied Control leverages linguistic or multimodal signals to guide policies, as in Code as Policies[6] and Multimodal LLMs Embodied[20].
- Imitation Learning and Demonstration-Based Multitask Methods use expert data to bootstrap multitask policies, while Temporal Abstraction and Hierarchical Multitask Learning decomposes complex behaviors into reusable skills.
- Specialized Multitask Applications target domain-specific challenges, and Benchmarks, Frameworks, and Evaluation Infrastructure provides standardized testbeds like Benchmark Multitask Continuous[41].

A particularly active line of work centers on generalist world model pretraining, where the goal is to learn a single predictive model that can support planning or policy learning across a wide range of tasks and embodiments. Massively Multitask World Models[0] sits squarely in this branch, emphasizing large-scale pretraining of world models to enable zero-shot or few-shot transfer.
This approach contrasts with more task-specific model-based methods and shares conceptual ground with GenRL[16] and Generalist World Model[19], which similarly pursue broad generalization through learned dynamics. Compared to TD-MPC2[5], which focuses on efficient online planning with a compact model, Massively Multitask World Models[0] scales up the diversity of training data and tasks to achieve broader coverage. The central trade-off in this area involves balancing model capacity and training scale against sample efficiency and computational cost, with open questions around how much task diversity is needed and whether a single world model can truly capture the full spectrum of embodied control challenges.

Claimed Contributions

MMBench: A benchmark for massively multitask reinforcement learning

The authors introduce MMBench, the first benchmark designed for massively multitask RL, comprising 200 continuous control tasks across 10 domains with language instructions, demonstrations, and support for both state and RGB observations. This includes 41 new tasks and a new task suite called MiniArcade with 19 arcade-style environments.

10 retrieved papers
Newt: A language-conditioned multitask world model

The authors present Newt, a model-based RL agent that extends TD-MPC2 to the massively multitask online setting. It uses a self-predictive world model conditioned on language instructions and optionally images, with algorithmic improvements including model-based pretraining on demonstrations, additional action supervision via behavior cloning loss, and constrained planning.

5 retrieved papers
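The objective described above combines a self-predictive world model loss (predicted next latent matched against the encoded next observation) with a behavior-cloning term on demonstration actions. The toy below sketches that combination with linear stand-ins for the networks; it is a minimal illustration under stated assumptions, not the authors' implementation, and every name (`encode`, `dynamics`, `policy`, `bc_weight`) is hypothetical. The real model would also condition on a language embedding and use stop-gradient targets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "networks" standing in for the learned components.
W_enc = rng.normal(size=(8, 4))      # observation (8-d) -> latent (4-d)
W_dyn = rng.normal(size=(4 + 2, 4))  # [latent, action] -> predicted next latent
W_pi = rng.normal(size=(4, 2))       # latent -> action

def encode(obs):
    return obs @ W_enc

def dynamics(z, a):
    return np.concatenate([z, a], axis=-1) @ W_dyn

def policy(z):
    return np.tanh(z @ W_pi)

def newt_style_loss(obs, next_obs, demo_action, bc_weight=0.1):
    """Self-predictive consistency loss plus a behavior-cloning term."""
    z = encode(obs)
    z_next_pred = dynamics(z, demo_action)
    z_next_target = encode(next_obs)  # stop-gradient target in practice
    consistency = np.mean((z_next_pred - z_next_target) ** 2)
    bc = np.mean((policy(z) - demo_action) ** 2)
    return consistency + bc_weight * bc

# One batch of (obs, next_obs, demo_action) transitions from demonstrations.
obs = rng.normal(size=(16, 8))
next_obs = rng.normal(size=(16, 8))
demo_action = np.tanh(rng.normal(size=(16, 2)))
loss = newt_style_loss(obs, next_obs, demo_action)
```

During the demonstration-pretraining phase both terms are driven by expert data; once online interaction begins, the BC term can be kept as a regularizer while reward and value losses (omitted here) take over.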
Demonstration of feasibility and effectiveness of massively multitask online RL

The authors demonstrate that training a single agent via online RL on hundreds of tasks simultaneously is feasible and effective. Their experiments show Newt outperforms strong baselines in multitask performance and data efficiency, can perform open-loop control over long horizons, and transfers well to unseen tasks through few-shot finetuning.

10 retrieved papers
Can Refute: 1 retrieved paper

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

MMBench: A benchmark for massively multitask reinforcement learning

The authors introduce MMBench, the first benchmark designed for massively multitask RL, comprising 200 continuous control tasks across 10 domains with language instructions, demonstrations, and support for both state and RGB observations. This includes 41 new tasks and a new task suite called MiniArcade with 19 arcade-style environments.

Contribution

Newt: A language-conditioned multitask world model

The authors present Newt, a model-based RL agent that extends TD-MPC2 to the massively multitask online setting. It uses a self-predictive world model conditioned on language instructions and optionally images, with algorithmic improvements including model-based pretraining on demonstrations, additional action supervision via behavior cloning loss, and constrained planning.

Contribution

Demonstration of feasibility and effectiveness of massively multitask online RL

The authors demonstrate that training a single agent via online RL on hundreds of tasks simultaneously is feasible and effective. Their experiments show Newt outperforms strong baselines in multitask performance and data efficiency, can perform open-loop control over long horizons, and transfers well to unseen tasks through few-shot finetuning.