MoGA: Mixture-of-Groups Attention for End-to-End Long Video Generation
Overview
Overall Novelty Assessment
The paper proposes Mixture-of-Groups Attention (MoGA), a learnable sparse attention mechanism that uses token routing for long video generation, and demonstrates end-to-end minute-level video synthesis at 480p/24 fps with a context length of roughly 580k tokens. The work resides in the 'Learnable Routing and Adaptive Selection' leaf, which contains only four papers, indicating a relatively sparse research direction within the broader 50-paper taxonomy. This leaf focuses specifically on trainable modules for dynamic token selection, distinguishing it from the more populated 'Fixed Pattern Sparse Attention' branch (seven papers), which employs predefined structures.
The taxonomy reveals that MoGA's leaf sits within 'Sparse Attention Mechanism Design', adjacent to branches exploring fixed patterns, training-free methods, and hierarchical structures. Neighboring leaves like 'Fixed Pattern Sparse Attention' address similar efficiency goals but through static geometric patterns (radial, block-sparse), while 'Training-Free Sparse Attention' (seven papers) applies heuristics without retraining. The 'Task-Specific Applications' branch, particularly 'Long-Form and Multi-Shot Video Generation' (three papers), represents the application domain where MoGA's contributions materialize, showing how mechanism design connects to downstream synthesis tasks.
Among the 28 candidates examined, contribution-level analysis shows mixed novelty signals. For the MoGA mechanism, 10 candidates were examined and 1 refutable match was found, suggesting moderate overlap with prior work on learnable routing. For the end-to-end video generation model, 10 candidates were examined and 2 refutable matches were found, indicating more substantial precedent among long-form synthesis systems. For the data pipeline contribution, 8 candidates were examined and 1 refutable match was found. These statistics reflect a limited semantic search scope rather than exhaustive coverage, so unexamined work may exist beyond the top-K matches analyzed.
Based on the 28-candidate search scope, the work appears to incrementally advance learnable routing within a sparsely populated taxonomy leaf, though the end-to-end system shows more overlap with existing long video generation approaches. The analysis captures semantic neighbors but cannot confirm absence of related work in adjacent research communities or recent preprints outside the search window. The taxonomy structure suggests the mechanism design is less crowded than application-focused directions, but definitive novelty assessment requires broader literature coverage.
Taxonomy
Research Landscape Overview
Claimed Contributions
MoGA is a novel sparse attention mechanism that employs a single linear layer as a token router to assign tokens directly to semantic groups, avoiding the coarse block-level estimation used in prior methods. This enables efficient long-range interactions while remaining compatible with modern attention kernels like FlashAttention and sequence parallelism.
The authors develop a video generation model built on MoGA that can generate minute-long, multi-shot videos at 480p resolution and 24 fps in an end-to-end manner, handling context lengths of around 580k tokens without requiring multi-stage pipelines.
The authors construct a two-stage data pipeline that converts raw long videos into one-minute, multi-shot clips with dense annotations, including video-level filtering and shot-level processing with VQA, OCR-based cropping, and multimodal captioning to enable shot-level text conditioning.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[1] Mixture of contexts for long video generation
[29] Faster video diffusion with trainable sparse attention
[32] VORTA: Efficient Video Diffusion via Routing Sparse Attention
Contribution Analysis
Detailed comparisons for each claimed contribution
Mixture-of-Groups Attention (MoGA)
MoGA is a novel sparse attention mechanism that employs a single linear layer as a token router to assign tokens directly to semantic groups, avoiding the coarse block-level estimation used in prior methods. This enables efficient long-range interactions while remaining compatible with modern attention kernels like FlashAttention and sequence parallelism.
[63] Optimizing Mixture of Block Attention
[35] OmniSparse: Training-Aware Fine-Grained Sparse Attention for Long-Video MLLMs
[58] Biformer: Vision transformer with bi-level routing attention
[59] Transformer learns optimal variable selection in group-sparse classification
[60] DiT: Efficient vision transformers with dynamic token routing
[61] GSAformer: Group sparse attention transformer for functional brain network analysis
[62] SpecAttn: Speculating Sparse Attention
[64] Mixture of Weight-shared Heterogeneous Group Attention Experts for Dynamic Token-wise KV Optimization
[65] Astraea: A Token-wise Acceleration Framework for Video Diffusion Transformers
[66] Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs
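The routing idea behind this contribution can be made concrete with a minimal sketch. The code below is a hypothetical single-head variant with hard top-1 routing: one linear map scores each token against the groups, and dense attention runs only among tokens assigned to the same group. It is an illustrative NumPy sketch under those assumptions, not the paper's implementation, which uses multi-head routing and fused attention kernels.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def moga_attention(x, router_w, num_groups):
    """Single-head sketch of mixture-of-groups attention (hypothetical).

    A single linear layer (router_w) assigns each token to one semantic
    group; full attention is then computed only within each group.
    """
    n, d = x.shape
    group_ids = (x @ router_w).argmax(axis=-1)    # hard top-1 routing, (n,)
    out = np.zeros_like(x)
    for g in range(num_groups):
        idx = np.where(group_ids == g)[0]         # tokens routed to group g
        if idx.size == 0:
            continue
        xg = x[idx]
        scores = softmax(xg @ xg.T / np.sqrt(d))  # dense in-group attention
        out[idx] = scores @ xg
    return out
```

In practice the per-group loop would be replaced by a variable-length or block-diagonal attention kernel, which is what keeps this family of methods compatible with FlashAttention-style implementations and sequence parallelism.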
End-to-end long video generation model
The authors develop a video generation model built on MoGA that can generate minute-long, multi-shot videos at 480p resolution and 24 fps in an end-to-end manner, handling context lengths of around 580k tokens without requiring multi-stage pipelines.
[1] Mixture of contexts for long video generation
[69] MAGI-1: Autoregressive Video Generation at Scale
[67] Phenaki: Variable length video generation from open domain textual description
[68] StreamingT2V: Consistent, dynamic, and extendable long video generation from text
[70] Stable video infinity: Infinite-length video generation with error recycling
[71] Video-infinity: Distributed long video generation
[72] Generating long videos of dynamic scenes
[73] Video World Models with Long-term Spatial Memory
[74] SkyReels-V2: Infinite-length film generative model
[75] WorldWeaver: Generating long-horizon video worlds via rich perception
Multi-shot long video data pipeline
The authors construct a two-stage data pipeline that converts raw long videos into one-minute, multi-shot clips with dense annotations, including video-level filtering and shot-level processing with VQA, OCR-based cropping, and multimodal captioning to enable shot-level text conditioning.
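To make the two-stage structure concrete, the skeleton below sketches such a pipeline. Every function and type name here is a hypothetical placeholder, not the authors' code; the actual shot-detection, VQA, OCR, and captioning models are stood in for by injected callables.

```python
from dataclasses import dataclass, field

@dataclass
class Shot:
    frames: list            # decoded frames for one shot
    caption: str = ""       # shot-level caption used as text conditioning

@dataclass
class Clip:
    shots: list = field(default_factory=list)  # a ~1-minute multi-shot clip

def stage1_video_filter(video, min_duration_s=60):
    """Video-level filtering: keep only source videos long enough to
    yield a one-minute multi-shot clip (placeholder criterion)."""
    return video["duration_s"] >= min_duration_s

def stage2_shot_processing(video, detect_shots, vqa_ok, ocr_crop, caption):
    """Shot-level processing: split into shots, gate quality with VQA,
    crop text regions found by OCR, then caption each surviving shot."""
    clip = Clip()
    for frames in detect_shots(video):
        if not vqa_ok(frames):      # drop low-quality shots
            continue
        frames = ocr_crop(frames)   # remove burned-in text regions
        clip.shots.append(Shot(frames=frames, caption=caption(frames)))
    return clip
```

Injecting the models as callables keeps the sketch self-contained; in a real pipeline they would be heavyweight VQA, OCR, and multimodal-captioning inference services.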