Mixture of Contexts for Long Video Generation

ICLR 2026 Conference Submission
Anonymous Authors
Video Generation
Abstract:

Long video generation is fundamentally a long-context memory problem: models must retain and retrieve salient events across long ranges without collapsing or drifting. However, scaling diffusion transformers to long-context video is limited by the quadratic cost of self-attention, which makes memory and computation intractable for long sequences. We recast long-context video generation as an internal information retrieval task and propose Mixture of Contexts (MoC), a simple, learnable sparse attention routing module that serves as an effective long-term memory retrieval engine. In MoC, each query dynamically selects a few informative chunks plus mandatory anchors (the caption and local windows) to attend to, with causal routing that prevents loop closures. As we scale the data and gradually sparsify the routing, the model learns to allocate compute to salient history, preserving identities, actions, and scenes over minutes of content. Efficiency follows as a byproduct of retrieval (near-linear scaling), enabling practical training and synthesis, and memory and consistency emerge at the scale of minutes.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Mixture of Contexts (MoC), a learnable sparse attention routing module that dynamically selects informative chunks and mandatory anchors for long-context video generation. It resides in the 'Dynamic and Adaptive Sparse Attention' leaf, which contains four papers total, including the original work. This leaf sits within the broader 'Sparse Attention Mechanisms for Video Diffusion Transformers' branch, indicating a moderately populated research direction focused on reducing quadratic attention costs through content-aware sparsity rather than fixed patterns.

The taxonomy reveals neighboring leaves exploring static sparse patterns (five papers on radial, diagonal, or block structures), training-free acceleration (four papers on heuristic pruning), and hybrid multi-scale designs (three papers on hierarchical attention). The original paper diverges from these by emphasizing learned routing over predefined geometry or post-hoc pruning. Adjacent branches address attention optimization via step distillation and long-term memory management, suggesting the field balances efficiency gains with temporal coherence challenges. The scope notes clarify that dynamic methods like MoC differ from static patterns by adapting attention based on input content during inference.

Among eighteen candidates examined across three contributions, the core MoC framework (Contribution A) shows one refutable candidate out of ten examined, while the content-aligned chunking strategy (Contribution B) found no refutations among eight candidates. Contribution C (causal routing) was not evaluated against prior work. The limited search scope (eighteen papers rather than an exhaustive survey) means these statistics reflect top-K semantic matches and citation expansion, not comprehensive coverage. The single refutation for the main contribution suggests some overlap with existing adaptive routing ideas, though most examined candidates appear non-overlapping or unclear.

Given the moderate density of the dynamic sparse attention leaf and the limited literature search, the work appears to occupy a recognizable niche within an active but not overcrowded subfield. The analysis covers top semantic neighbors and immediate taxonomy siblings but does not claim exhaustive prior art review. The chunking and causal routing components show less direct overlap in the examined set, though the core routing framework encounters at least one closely related prior approach among the candidates surveyed.

Taxonomy

Core-task Taxonomy Papers: 45
Claimed Contributions: 3
Contribution Candidate Papers Compared: 18
Refutable Papers: 1

Research Landscape Overview

Core task: long-context video generation with sparse attention routing. The field addresses the computational challenge of generating extended video sequences by reducing the quadratic cost of full attention in diffusion transformers. The taxonomy reveals several major branches: some focus on designing specialized sparse attention patterns for video diffusion transformers (including dynamic and adaptive routing strategies), while others emphasize attention optimization through distillation or architectural variants. Additional branches tackle long-context temporal coherence, controllable generation conditioned on various inputs, multi-shot narrative synthesis, and training-free or few-shot methods. Theoretical foundations and general sparse attention mechanisms form another strand, alongside specialized topics such as human-object interaction and egocentric video modeling.

Representative works like Hunyuanvideo[8] and InfiniteVL[5] illustrate how large-scale systems integrate these sparse routing ideas, whereas methods such as PAROAttention[1] and Radial Attention[2] propose specific geometric or hierarchical sparsity patterns. Within the dynamic and adaptive sparse attention cluster, several lines of work explore how to route attention based on content or learned policies rather than fixed patterns. Mixture of Contexts[0] exemplifies this adaptive approach by dynamically selecting relevant context subsets during generation, contrasting with methods like Bidirectional Sparse Attention[3] that impose structured bidirectional connectivity or VORTA[22] that leverages token-level routing.

A key trade-off across these branches is between the flexibility of learned routing (which can better capture complex temporal dependencies) and the simplicity of predefined sparse masks (which offer predictable memory savings). Open questions include how to balance sparsity with quality for very long sequences and whether adaptive routing can generalize across diverse video domains.
Mixture of Contexts[0] sits naturally among these adaptive strategies, sharing the goal of content-aware sparsity with neighbors like MoGA[32], yet differing in how context selection is orchestrated across layers and frames.

Claimed Contributions

Mixture of Contexts (MoC) framework with learnable sparse attention routing

The authors introduce MoC, a learnable sparse attention routing mechanism that reformulates long-context video generation as an internal information retrieval process. Each query dynamically selects a few informative chunks plus mandatory anchors (caption, local windows) to attend to, with causal routing that prevents loop closures, enabling minute-scale video generation at near short-video computational cost.

10 retrieved papers (one candidate can refute)
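The routing step described above can be captured in a few lines. The following is an illustrative sketch, not the paper's implementation: the mean-pooled chunk descriptor, the function name `moc_route`, and the convention that chunk 0 is the caption anchor are all assumptions made for the example.

```python
import numpy as np

def moc_route(query, chunk_keys, k=2, anchors=(0,)):
    """Select which context chunks a query token attends to.

    query:      (dim,) query vector for one token.
    chunk_keys: (num_chunks, dim) one descriptor per chunk
                (here assumed to be a mean-pooled key).
    Returns indices of the k highest-scoring chunks plus the
    mandatory anchor chunks (e.g. index 0 for the caption).
    """
    scores = chunk_keys @ query              # relevance of each chunk
    top_k = np.argsort(-scores)[:k]          # k most informative chunks
    return sorted(set(top_k.tolist()) | set(anchors))
```

Full attention is then computed only over the returned chunks, which is where the near-linear scaling comes from: each query touches k plus a handful of anchor chunks instead of the entire history.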
Content-aligned chunking strategy for video sequences

The authors propose a content-aligned chunking approach that partitions heterogeneous multi-modal video token streams along natural boundaries (frames, shots, text segments) rather than using uniform windows. This design preserves semantic coherence and enables more discriminative top-k retrieval while maintaining compatibility with existing video generation architectures.

8 retrieved papers
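As a minimal sketch of the chunking idea, assuming boundary start indices are already known (e.g. from frame or shot metadata); the function name and signature are illustrative, not the paper's API:

```python
def content_aligned_chunks(tokens, boundary_starts):
    """Partition a token stream at natural content boundaries
    (frame, shot, or text-segment starts) rather than fixed-size
    uniform windows, so each chunk stays semantically coherent."""
    edges = sorted(set(boundary_starts) | {0, len(tokens)})
    return [tokens[a:b] for a, b in zip(edges[:-1], edges[1:])]
```

Because each chunk aligns with a coherent piece of content, its pooled descriptor is more discriminative, which in turn makes the top-k retrieval over chunks sharper than it would be with arbitrary uniform windows.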
Causal routing mechanism to prevent pathological loop closures

The authors introduce a causal masking constraint at the routing stage that restricts each chunk to attend only to earlier positions in the sequence, transforming the routing graph into a directed acyclic graph. This design eliminates isolated feedback loops and ensures information flows strictly forward in time, resulting in smoother temporal dynamics and more stable training.

0 retrieved papers
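A hedged sketch of the causal constraint (the relevance matrix and all names are assumptions for illustration): restricting each chunk's candidate set to strictly earlier chunks makes the routing graph acyclic by construction, so the feedback loops described above cannot form.

```python
import numpy as np

def causal_route_mask(scores, k=1):
    """Build a chunk-to-chunk routing mask where chunk i may only
    select its top-k among chunks j < i. Edges therefore run only
    from later to earlier chunks, and the routing graph is a DAG.

    scores: (n, n) relevance of chunk j (column) to chunk i (row).
    """
    n = scores.shape[0]
    mask = np.zeros((n, n), dtype=bool)
    for i in range(1, n):
        top_k = np.argsort(-scores[i, :i])[:k]   # earlier chunks only
        mask[i, top_k] = True
    return mask
```

Note that the first chunk selects nothing under this mask; in practice it would still see its mandatory anchors (caption, local window).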

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Mixture of Contexts (MoC) framework with learnable sparse attention routing


Contribution

Content-aligned chunking strategy for video sequences


Contribution

Causal routing mechanism to prevent pathological loop closures
