Composition of Memory Experts for Diffusion World Models

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: World Model, Diffusion Model, Memory, Generative Models, Video Generation
Abstract:

World models aim to predict plausible futures consistent with past observations, a capability central to planning and decision-making in reinforcement learning. Yet existing architectures face a fundamental memory trade-off: transformers preserve local detail but are bottlenecked by quadratic attention, while recurrent and state-space models scale more efficiently but compress history at the cost of fidelity. To overcome this trade-off, we propose decoupling future–past consistency from any single architecture and instead leveraging a set of specialized experts. We introduce a diffusion-based framework that integrates heterogeneous memory models through a contrastive product-of-experts formulation. Our approach instantiates three complementary roles: a short-term memory expert that captures fine local dynamics, a long-term memory expert that stores episodic history in external diffusion weights via lightweight test-time finetuning, and a spatial long-term memory expert that enforces geometric and spatial coherence. This compositional design avoids mode collapse and scales to long contexts without incurring a quadratic cost. Across simulated and real-world benchmarks, our method improves temporal consistency, recall of past observations, and navigation performance, establishing a novel paradigm for building and operating memory-augmented diffusion world models.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes a compositional memory framework for diffusion-based world models, integrating multiple specialized experts through a product-of-experts formulation. According to the taxonomy, this work is the sole member of the 'Compositional and Multi-Expert Memory' leaf under 'Memory Architecture and Integration Mechanisms'. This leaf is distinct from sibling approaches like 'State-Space Model Integration' (4 papers), 'External Memory Systems' (3 papers), and 'Recurrent and Autoregressive Memory' (2 papers), indicating that compositional multi-expert memory is a relatively sparse research direction within the broader memory architecture landscape.

The taxonomy reveals that neighboring leaves focus on single-architecture memory solutions: state-space models compress history through structured recurrence, external memory banks maintain explicit episodic storage, and recurrent methods propagate hidden states sequentially. The paper's compositional design diverges by decoupling memory roles across heterogeneous experts rather than relying on a unified architecture. This positions the work at the intersection of memory integration mechanisms and temporal consistency enhancement, bridging architectural innovation with the goal of long-horizon coherence addressed in the 'Temporal Consistency and Long-Horizon Generation' branch.

Among 29 candidates examined across three contributions, no refutable prior work was identified. The 'Product of Contrastive Experts' mechanism examined 10 candidates with 0 refutations, the 'Compositional memory framework' examined 9 candidates with 0 refutations, and the 'External diffusion model as long-term memory' examined 10 candidates with 0 refutations. This suggests that within the limited search scope, the specific combination of contrastive product-of-experts formulation, test-time finetuning for episodic memory, and multi-scale expert decomposition appears novel relative to the examined literature.

The analysis is constrained by the top-K semantic search scope and does not constitute an exhaustive survey of all related work. The absence of sibling papers in the same taxonomy leaf and the zero refutations across contributions indicate that this compositional multi-expert approach occupies a distinct niche, though the limited candidate pool means potentially relevant work outside the search radius may exist. The novelty assessment reflects what was examined, not a definitive claim about the entire field.

Taxonomy

Core-task Taxonomy Papers: 32
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 0

Research Landscape Overview

Core task: memory-augmented diffusion world models for long-term temporal consistency. The field centers on enabling diffusion-based generative models to maintain coherent predictions over extended horizons by incorporating memory mechanisms. The taxonomy reveals several main branches. Memory Architecture and Integration Mechanisms explores how memory modules are designed and coupled with diffusion processes, ranging from compositional multi-expert systems like Memory Experts[0] to plug-and-play approaches such as Plug and Play Memory[26] and recurrent structures like Recurrent Autoregressive[16]. Temporal Consistency and Long-Horizon Generation addresses methods for preserving coherence across time, including works like Infinimotion[11] and Long Context Video[12]. Application Domains span robotics (TrackVLA[13]), autonomous driving (Autonomous Driving Survey[2]), and video synthesis (Talkingface Generation[9], Coherent Story[10]). Training and Optimization Strategies examine learning paradigms such as reinforcement-based diffusion (Reinforced Diffusions[22]) and state-space formulations (StateSpaceDiffuser[1]), while Theoretical Foundations provide broader context.

A particularly active line of work investigates how different memory architectures trade off between flexibility and computational efficiency. Some approaches like WORLDMEM[3] and Spatiotemporal Memory[29] emphasize explicit memory banks that store and retrieve past states, while others such as MALT Diffusion[5] and Memory Imagination Consistency[4] integrate memory more implicitly within the diffusion process itself. Memory Experts[0] sits within the compositional and multi-expert memory cluster, distinguishing itself by decomposing memory into specialized components that can be selectively activated. This contrasts with unified memory strategies like WORLDMEM[3], which maintains a single global memory representation, and with recurrent methods like Recurrent Autoregressive[16], which propagate hidden states sequentially. The central tension across these branches involves balancing long-term consistency against the risk of error accumulation and the computational cost of maintaining extensive memory over many timesteps.

Claimed Contributions

Product of Contrastive Experts (PoCE) for memory integration

The authors propose a contrastive product-of-experts formulation that factors out spurious distribution modes when composing heterogeneous memory experts in diffusion models. This approach prevents mode collapse and over-confidence that occur with naive product-of-experts, enabling principled integration of multiple memory models without retraining.

10 retrieved papers
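To make the mechanism described above concrete: in diffusion models, a product of expert densities corresponds to a sum of their scores, and a contrastive variant can let each expert contribute only its deviation from a shared base model. The sketch below is a minimal illustration of that idea, not the paper's exact formulation; the function name, the per-expert weights, and the choice of a single shared base score are all assumptions.

```python
import numpy as np

def compose_scores(base_score, expert_scores, weights):
    """Contrastive product-of-experts in score space (illustrative sketch).

    Each expert contributes w * (expert_score - base_score), so modes that
    every expert merely inherits from the shared pretrained base are factored
    out rather than multiplied and over-sharpened, which is the failure mode
    a naive product of experts can exhibit.
    """
    composed = base_score.copy()
    for score, w in zip(expert_scores, weights):
        composed += w * (score - base_score)
    return composed
```

With all weights set to zero this reduces to the base model, mirroring how classifier-free-guidance-style composition degrades gracefully; here it is generalized to several experts.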
Compositional memory framework with specialized experts

The authors introduce a diffusion-based framework that decouples memory from any single architecture by composing specialized experts: a short-term memory expert for local dynamics, a long-term memory expert that stores episodic history via test-time finetuning, and a spatial long-term memory expert for geometric coherence. This compositional design avoids the memory-fidelity trade-off of existing architectures.

9 retrieved papers
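One way to picture the compositional design described above is a denoising rollout that queries each specialized expert at every step and blends their predictions against a shared base. The sketch below is a simplified stand-in: the expert names (short_term, long_term, spatial), the callable signatures, the blending rule, and the linear update are all assumptions, not the paper's algorithm.

```python
import numpy as np

def rollout(x0, experts, weights, base, steps=8):
    """Illustrative denoising rollout consulting several memory experts.

    `experts` maps names (e.g. 'short_term', 'long_term', 'spatial') to
    callables taking (frame, step) and returning a noise estimate; `base`
    is a shared base predictor. Expert deviations from the base are blended
    per step, so each expert only has to be good at its own role.
    """
    x = x0
    for t in range(steps, 0, -1):
        eps_base = base(x, t)
        eps = eps_base + sum(
            weights[name] * (fn(x, t) - eps_base)
            for name, fn in experts.items()
        )
        x = x - eps / steps  # simplified linear update, not exact DDPM
    return x
```

Because each expert is an opaque callable, heterogeneous architectures (transformer, state-space, retrieval-based) can be swapped in without retraining the others, which is the point of decoupling memory roles.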
External diffusion model as long-term memory with finetuning strategy

The authors propose using an external diffusion model as long-term memory that stores episodic knowledge directly in its weights through lightweight test-time finetuning with LoRA adapters. This enables constant-time reuse of past experience across hundreds of frames without quadratic scaling costs, while preserving the generalization capacity of pretrained models.

10 retrieved papers
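The contribution above hinges on a standard property of LoRA: a frozen pretrained weight plus a trainable low-rank update, so test-time finetuning touches only a handful of adapter parameters. The class below is a minimal from-scratch sketch of that idea (the class name, rank, and SGD update are illustrative choices, not the paper's implementation); it shows why the per-update cost is constant in episode length and why the pretrained weights, and hence their generalization, stay untouched.

```python
import numpy as np

class LoRALinear:
    """Frozen pretrained weight W plus a trainable low-rank update B @ A.

    Only A (r x d_in) and B (d_out x r) change during test-time finetuning,
    so episodic knowledge is stored in O(r * (d_in + d_out)) adapter
    parameters while W stays frozen.
    """
    def __init__(self, W, r=4, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                   # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, (r, d_in))    # trainable down-projection
        self.B = np.zeros((d_out, r))                # zero-init: no-op at start

    def forward(self, x):
        return self.W @ x + self.B @ (self.A @ x)

    def sgd_step(self, x, grad_out, lr=1e-2):
        # y = W x + B (A x); backpropagate grad_out = dL/dy into A and B only.
        Ax = self.A @ x
        grad_B = np.outer(grad_out, Ax)
        grad_A = np.outer(self.B.T @ grad_out, x)
        self.B -= lr * grad_B
        self.A -= lr * grad_A
```

Zero-initializing B makes the adapter an exact no-op before any finetuning, so the module starts out identical to the pretrained model and only drifts as episodic updates accumulate.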

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current TopK core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, but still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Product of Contrastive Experts (PoCE) for memory integration

The authors propose a contrastive product-of-experts formulation that factors out spurious distribution modes when composing heterogeneous memory experts in diffusion models. This approach prevents mode collapse and over-confidence that occur with naive product-of-experts, enabling principled integration of multiple memory models without retraining.

Contribution

Compositional memory framework with specialized experts

The authors introduce a diffusion-based framework that decouples memory from any single architecture by composing specialized experts: a short-term memory expert for local dynamics, a long-term memory expert that stores episodic history via test-time finetuning, and a spatial long-term memory expert for geometric coherence. This compositional design avoids the memory-fidelity trade-off of existing architectures.

Contribution

External diffusion model as long-term memory with finetuning strategy

The authors propose using an external diffusion model as long-term memory that stores episodic knowledge directly in its weights through lightweight test-time finetuning with LoRA adapters. This enables constant-time reuse of past experience across hundreds of frames without quadratic scaling costs, while preserving the generalization capacity of pretrained models.