Mixture of Contexts for Long Video Generation

ICLR 2026 Conference Submission
Anonymous Authors
Video Generation
Abstract:

Long video generation is fundamentally a long-context memory problem: models must retain and retrieve salient events across long ranges without collapsing or drifting. However, scaling diffusion transformers to long-context video is limited by the quadratic cost of self-attention, which makes memory and computation intractable for long sequences. We recast long-context video generation as an internal information retrieval task and propose Mixture of Contexts (MoC), a simple, learnable sparse attention routing module that serves as an effective long-term memory retrieval engine. In MoC, each query dynamically selects a few informative chunks plus mandatory anchors (the caption and local windows) to attend to, with causal routing that prevents loop closures. As we scale the data and gradually sparsify the routing, the model learns to allocate compute to salient history, preserving identities, actions, and scenes over minutes of content. Efficiency follows as a byproduct of retrieval (near-linear scaling), enabling practical training and synthesis, and memory and consistency emerge at the scale of minutes.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Mixture of Contexts (MoC), a learnable sparse attention routing module that dynamically selects informative chunks and mandatory anchors for long-context video generation. It resides in the 'Dynamic and Adaptive Sparse Attention' leaf, which contains four papers total, including the original work. This leaf sits within the broader 'Sparse Attention Mechanisms for Video Diffusion Transformers' branch, indicating a moderately populated research direction focused on reducing quadratic attention costs through content-aware sparsity rather than fixed patterns.

The taxonomy reveals neighboring leaves exploring static sparse patterns (five papers on radial, diagonal, or block structures), training-free acceleration (four papers on heuristic pruning), and hybrid multi-scale designs (three papers on hierarchical attention). The original paper diverges from these by emphasizing learned routing over predefined geometry or post-hoc pruning. Adjacent branches address attention optimization via step distillation and long-term memory management, suggesting the field balances efficiency gains with temporal coherence challenges. The scope notes clarify that dynamic methods like MoC differ from static patterns by adapting attention based on input content during inference.

Among eighteen candidates examined across three contributions, the core MoC framework (Contribution A) shows one refutable candidate out of ten examined, while the content-aligned chunking strategy (Contribution B) found no refutations among eight candidates. Contribution C (causal routing) was not evaluated against prior work. The limited search scope (eighteen papers rather than an exhaustive survey) means these statistics reflect top-K semantic matches and citation expansion, not comprehensive coverage. The single refutation for the main contribution suggests some overlap with existing adaptive routing ideas, though most examined candidates appear non-overlapping or unclear.

Given the moderate density of the dynamic sparse attention leaf and the limited literature search, the work appears to occupy a recognizable niche within an active but not overcrowded subfield. The analysis covers top semantic neighbors and immediate taxonomy siblings but does not claim exhaustive prior art review. The chunking and causal routing components show less direct overlap in the examined set, though the core routing framework encounters at least one closely related prior approach among the candidates surveyed.

Taxonomy

Core-task Taxonomy Papers: 45
Claimed Contributions: 3
Contribution Candidate Papers Compared: 18
Refutable Papers: 1

Research Landscape Overview

Core task: long-context video generation with sparse attention routing. The field addresses the computational challenge of generating extended video sequences by reducing the quadratic cost of full attention in diffusion transformers. The taxonomy reveals several major branches: some focus on designing specialized sparse attention patterns for video diffusion transformers (including dynamic and adaptive routing strategies), while others emphasize attention optimization through distillation or architectural variants. Additional branches tackle long-context temporal coherence, controllable generation conditioned on various inputs, multi-shot narrative synthesis, and training-free or few-shot methods. Theoretical foundations and general sparse attention mechanisms form another strand, alongside specialized topics such as human-object interaction and egocentric video modeling.

Representative works like Hunyuanvideo[8] and InfiniteVL[5] illustrate how large-scale systems integrate these sparse routing ideas, whereas methods such as PAROAttention[1] and Radial Attention[2] propose specific geometric or hierarchical sparsity patterns. Within the dynamic and adaptive sparse attention cluster, several lines of work explore how to route attention based on content or learned policies rather than fixed patterns. Mixture of Contexts[0] exemplifies this adaptive approach by dynamically selecting relevant context subsets during generation, contrasting with methods like Bidirectional Sparse Attention[3] that impose structured bidirectional connectivity or VORTA[22] that leverages token-level routing.

A key trade-off across these branches is between the flexibility of learned routing (which can better capture complex temporal dependencies) and the simplicity of predefined sparse masks (which offer predictable memory savings). Open questions include how to balance sparsity with quality for very long sequences and whether adaptive routing can generalize across diverse video domains.
Mixture of Contexts[0] sits naturally among these adaptive strategies, sharing the goal of content-aware sparsity with neighbors like MoGA[32], yet differing in how context selection is orchestrated across layers and frames.

Claimed Contributions

Mixture of Contexts (MoC) framework with learnable sparse attention routing

The authors introduce MoC, a learnable sparse attention routing mechanism that reformulates long-context video generation as an internal information retrieval process. Each query dynamically selects a few informative chunks plus mandatory anchors (caption, local windows) to attend to, with causal routing that prevents loop closures, enabling minute-scale video generation at near short-video computational cost.

10 retrieved papers (one candidate can refute)
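The routing step described above can be captured in a few lines. The following is an illustrative sketch, not the paper's implementation: the mean-pooled chunk descriptor, the function name `moc_route`, and the convention that chunk 0 is the caption anchor are all assumptions made for the example.

```python
import numpy as np

def moc_route(query, chunk_keys, k=2, anchors=(0,)):
    """Select which context chunks a query token attends to.

    query:      (dim,) query vector for one token.
    chunk_keys: (num_chunks, dim) one descriptor per chunk
                (here assumed to be a mean-pooled key).
    Returns indices of the k highest-scoring chunks plus the
    mandatory anchor chunks (e.g. index 0 for the caption).
    """
    scores = chunk_keys @ query              # relevance of each chunk
    top_k = np.argsort(-scores)[:k]          # k most informative chunks
    return sorted(set(top_k.tolist()) | set(anchors))
```

Full attention is then computed only over the returned chunks, which is where the near-linear scaling comes from: each query touches k plus a handful of anchor chunks instead of the entire history.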
Content-aligned chunking strategy for video sequences

The authors propose a content-aligned chunking approach that partitions heterogeneous multi-modal video token streams along natural boundaries (frames, shots, text segments) rather than using uniform windows. This design preserves semantic coherence and enables more discriminative top-k retrieval while maintaining compatibility with existing video generation architectures.

8 retrieved papers
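As a minimal sketch of the chunking idea, assuming boundary start indices are already known (e.g. from frame or shot metadata); the function name and signature are illustrative, not the paper's API:

```python
def content_aligned_chunks(tokens, boundary_starts):
    """Partition a token stream at natural content boundaries
    (frame, shot, or text-segment starts) rather than fixed-size
    uniform windows, so each chunk stays semantically coherent."""
    edges = sorted(set(boundary_starts) | {0, len(tokens)})
    return [tokens[a:b] for a, b in zip(edges[:-1], edges[1:])]
```

Because each chunk aligns with a coherent piece of content, its pooled descriptor is more discriminative, which in turn makes the top-k retrieval over chunks sharper than it would be with arbitrary uniform windows.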
Causal routing mechanism to prevent pathological loop closures

The authors introduce a causal masking constraint at the routing stage that restricts each chunk to attend only to earlier positions in the sequence, transforming the routing graph into a directed acyclic graph. This design eliminates isolated feedback loops and ensures information flows strictly forward in time, resulting in smoother temporal dynamics and more stable training.

0 retrieved papers
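A hedged sketch of the causal constraint (the relevance matrix and all names are assumptions for illustration): restricting each chunk's candidate set to strictly earlier chunks makes the routing graph acyclic by construction, so the feedback loops described above cannot form.

```python
import numpy as np

def causal_route_mask(scores, k=1):
    """Build a chunk-to-chunk routing mask where chunk i may only
    select its top-k among chunks j < i. Edges therefore run only
    from later to earlier chunks, and the routing graph is a DAG.

    scores: (n, n) relevance of chunk j (column) to chunk i (row).
    """
    n = scores.shape[0]
    mask = np.zeros((n, n), dtype=bool)
    for i in range(1, n):
        top_k = np.argsort(-scores[i, :i])[:k]   # earlier chunks only
        mask[i, top_k] = True
    return mask
```

Note that the first chunk selects nothing under this mask; in practice it would still see its mandatory anchors (caption, local window).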

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Mixture of Contexts (MoC) framework with learnable sparse attention routing


Contribution

Content-aligned chunking strategy for video sequences


Contribution

Causal routing mechanism to prevent pathological loop closures
