Composition of Memory Experts for Diffusion World Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: World Model, Diffusion Model, Memory, Generative Models, Video Generation
Abstract:

World models aim to predict plausible futures consistent with past observations, a capability central to planning and decision-making in reinforcement learning. Yet existing architectures face a fundamental memory trade-off: transformers preserve local detail but are bottlenecked by quadratic attention, while recurrent and state-space models scale more efficiently but compress history at the cost of fidelity. To overcome this trade-off, we propose decoupling future–past consistency from any single architecture and instead leveraging a set of specialized experts. We introduce a diffusion-based framework that integrates heterogeneous memory models through a contrastive product-of-experts formulation. Our approach instantiates three complementary roles: a short-term memory expert that captures fine local dynamics, a long-term memory expert that stores episodic history in external diffusion weights via lightweight test-time finetuning, and a spatial long-term memory expert that enforces geometric and spatial coherence. This compositional design avoids mode collapse and scales to long contexts without incurring a quadratic cost. Across simulated and real-world benchmarks, our method improves temporal consistency, recall of past observations, and navigation performance, establishing a novel paradigm for building and operating memory-augmented diffusion world models.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a compositional memory framework for diffusion-based world models, integrating multiple specialized experts through a product-of-experts formulation. According to the taxonomy, this work is the sole member of the 'Compositional and Multi-Expert Memory' leaf under 'Memory Architecture and Integration Mechanisms'. This leaf is distinct from sibling approaches like 'State-Space Model Integration' (4 papers), 'External Memory Systems' (3 papers), and 'Recurrent and Autoregressive Memory' (2 papers), indicating that compositional multi-expert memory is a relatively sparse research direction within the broader memory architecture landscape.

The taxonomy reveals that neighboring leaves focus on single-architecture memory solutions: state-space models compress history through structured recurrence, external memory banks maintain explicit episodic storage, and recurrent methods propagate hidden states sequentially. The paper's compositional design diverges by decoupling memory roles across heterogeneous experts rather than relying on a unified architecture. This positions the work at the intersection of memory integration mechanisms and temporal consistency enhancement, bridging architectural innovation with the goal of long-horizon coherence addressed in the 'Temporal Consistency and Long-Horizon Generation' branch.

Among 29 candidates examined across three contributions, no refutable prior work was identified. The 'Product of Contrastive Experts' mechanism examined 10 candidates with 0 refutations, the 'Compositional memory framework' examined 9 candidates with 0 refutations, and the 'External diffusion model as long-term memory' examined 10 candidates with 0 refutations. This suggests that within the limited search scope, the specific combination of contrastive product-of-experts formulation, test-time finetuning for episodic memory, and multi-scale expert decomposition appears novel relative to the examined literature.

The analysis is constrained by the top-K semantic search scope and does not constitute an exhaustive survey of all related work. The absence of sibling papers in the same taxonomy leaf and the zero refutations across contributions indicate that this compositional multi-expert approach occupies a distinct niche, though the limited candidate pool means potentially relevant work outside the search radius may exist. The novelty assessment reflects what was examined, not a definitive claim about the entire field.

Taxonomy

Core-task Taxonomy Papers: 32
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: memory-augmented diffusion world models for long-term temporal consistency. The field centers on enabling diffusion-based generative models to maintain coherent predictions over extended horizons by incorporating memory mechanisms. The taxonomy reveals several main branches. Memory Architecture and Integration Mechanisms explores how memory modules are designed and coupled with diffusion processes, ranging from compositional multi-expert systems like Memory Experts[0] to plug-and-play approaches such as Plug and Play Memory[26] and recurrent structures like Recurrent Autoregressive[16]. Temporal Consistency and Long-Horizon Generation addresses methods for preserving coherence across time, including works like Infinimotion[11] and Long Context Video[12]. Application Domains span robotics (TrackVLA[13]), autonomous driving (Autonomous Driving Survey[2]), and video synthesis (Talkingface Generation[9], Coherent Story[10]). Training and Optimization Strategies examine learning paradigms such as reinforcement-based diffusion (Reinforced Diffusions[22]) and state-space formulations (StateSpaceDiffuser[1]), while Theoretical Foundations provide broader context.

A particularly active line of work investigates how different memory architectures trade off between flexibility and computational efficiency. Some approaches like WORLDMEM[3] and Spatiotemporal Memory[29] emphasize explicit memory banks that store and retrieve past states, while others such as MALT Diffusion[5] and Memory Imagination Consistency[4] integrate memory more implicitly within the diffusion process itself. Memory Experts[0] sits within the compositional and multi-expert memory cluster, distinguishing itself by decomposing memory into specialized components that can be selectively activated. This contrasts with unified memory strategies like WORLDMEM[3], which maintains a single global memory representation, and with recurrent methods like Recurrent Autoregressive[16], which propagate hidden states sequentially. The central tension across these branches involves balancing long-term consistency against the risk of error accumulation and the computational cost of maintaining extensive memory over many timesteps.

Claimed Contributions

Product of Contrastive Experts (PoCE) for memory integration

The authors propose a contrastive product-of-experts formulation that factors out spurious distribution modes when composing heterogeneous memory experts in diffusion models. This approach prevents mode collapse and over-confidence that occur with naive product-of-experts, enabling principled integration of multiple memory models without retraining.

10 retrieved papers
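To make the mechanism described above concrete: in diffusion models, a product of expert densities corresponds to a sum of their scores, and a contrastive variant can let each expert contribute only its deviation from a shared base model. The sketch below is a minimal illustration of that idea, not the paper's exact formulation; the function name, the per-expert weights, and the choice of a single shared base score are all assumptions.

```python
import numpy as np

def compose_scores(base_score, expert_scores, weights):
    """Contrastive product-of-experts in score space (illustrative sketch).

    Each expert contributes w * (expert_score - base_score), so modes that
    every expert merely inherits from the shared pretrained base are factored
    out rather than multiplied and over-sharpened, which is the failure mode
    a naive product of experts can exhibit.
    """
    composed = base_score.copy()
    for score, w in zip(expert_scores, weights):
        composed += w * (score - base_score)
    return composed
```

With all weights set to zero this reduces to the base model, mirroring how classifier-free-guidance-style composition degrades gracefully; here it is generalized to several experts.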
Compositional memory framework with specialized experts

The authors introduce a diffusion-based framework that decouples memory from any single architecture by composing specialized experts: a short-term memory expert for local dynamics, a long-term memory expert that stores episodic history via test-time finetuning, and a spatial long-term memory expert for geometric coherence. This compositional design avoids the memory-fidelity trade-off of existing architectures.

9 retrieved papers
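One way to picture the compositional design described above is a denoising rollout that queries each specialized expert at every step and blends their predictions against a shared base. The sketch below is a simplified stand-in: the expert names (short_term, long_term, spatial), the callable signatures, the blending rule, and the linear update are all assumptions, not the paper's algorithm.

```python
import numpy as np

def rollout(x0, experts, weights, base, steps=8):
    """Illustrative denoising rollout consulting several memory experts.

    `experts` maps names (e.g. 'short_term', 'long_term', 'spatial') to
    callables taking (frame, step) and returning a noise estimate; `base`
    is a shared base predictor. Expert deviations from the base are blended
    per step, so each expert only has to be good at its own role.
    """
    x = x0
    for t in range(steps, 0, -1):
        eps_base = base(x, t)
        eps = eps_base + sum(
            weights[name] * (fn(x, t) - eps_base)
            for name, fn in experts.items()
        )
        x = x - eps / steps  # simplified linear update, not exact DDPM
    return x
```

Because each expert is an opaque callable, heterogeneous architectures (transformer, state-space, retrieval-based) can be swapped in without retraining the others, which is the point of decoupling memory roles.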
External diffusion model as long-term memory with finetuning strategy

The authors propose using an external diffusion model as long-term memory that stores episodic knowledge directly in its weights through lightweight test-time finetuning with LoRA adapters. This enables constant-time reuse of past experience across hundreds of frames without quadratic scaling costs, while preserving the generalization capacity of pretrained models.

10 retrieved papers
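The contribution above hinges on a standard property of LoRA: a frozen pretrained weight plus a trainable low-rank update, so test-time finetuning touches only a handful of adapter parameters. The class below is a minimal from-scratch sketch of that idea (the class name, rank, and SGD update are illustrative choices, not the paper's implementation); it shows why the per-update cost is constant in episode length and why the pretrained weights, and hence their generalization, stay untouched.

```python
import numpy as np

class LoRALinear:
    """Frozen pretrained weight W plus a trainable low-rank update B @ A.

    Only A (r x d_in) and B (d_out x r) change during test-time finetuning,
    so episodic knowledge is stored in O(r * (d_in + d_out)) adapter
    parameters while W stays frozen.
    """
    def __init__(self, W, r=4, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                   # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, (r, d_in))    # trainable down-projection
        self.B = np.zeros((d_out, r))                # zero-init: no-op at start

    def forward(self, x):
        return self.W @ x + self.B @ (self.A @ x)

    def sgd_step(self, x, grad_out, lr=1e-2):
        # y = W x + B (A x); backpropagate grad_out = dL/dy into A and B only.
        Ax = self.A @ x
        grad_B = np.outer(grad_out, Ax)
        grad_A = np.outer(self.B.T @ grad_out, x)
        self.B -= lr * grad_B
        self.A -= lr * grad_A
```

Zero-initializing B makes the adapter an exact no-op before any finetuning, so the module starts out identical to the pretrained model and only drifts as episodic updates accumulate.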

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Product of Contrastive Experts (PoCE) for memory integration

The authors propose a contrastive product-of-experts formulation that factors out spurious distribution modes when composing heterogeneous memory experts in diffusion models. This approach prevents mode collapse and over-confidence that occur with naive product-of-experts, enabling principled integration of multiple memory models without retraining.

Contribution

Compositional memory framework with specialized experts

The authors introduce a diffusion-based framework that decouples memory from any single architecture by composing specialized experts: a short-term memory expert for local dynamics, a long-term memory expert that stores episodic history via test-time finetuning, and a spatial long-term memory expert for geometric coherence. This compositional design avoids the memory-fidelity trade-off of existing architectures.

Contribution

External diffusion model as long-term memory with finetuning strategy

The authors propose using an external diffusion model as long-term memory that stores episodic knowledge directly in its weights through lightweight test-time finetuning with LoRA adapters. This enables constant-time reuse of past experience across hundreds of frames without quadratic scaling costs, while preserving the generalization capacity of pretrained models.