Abstract:

Long video generation with Diffusion Transformers (DiTs) is bottlenecked by the quadratic scaling of full attention with sequence length. Since attention is highly redundant, outputs are dominated by a small subset of query–key pairs. Existing sparse methods rely on blockwise coarse estimation, whose accuracy–efficiency trade-offs are constrained by block size. This paper introduces Mixture-of-Groups Attention (MoGA), an efficient sparse attention mechanism that uses a lightweight, learnable token router to match tokens precisely without blockwise estimation. Through semantic-aware routing, MoGA enables effective long-range interactions. As a kernel-free method, MoGA integrates seamlessly with modern attention stacks, including FlashAttention and sequence parallelism. Building on MoGA, we develop an efficient long video generation model that produces minute-level, multi-shot, 480p, 24 fps videos end to end, with a context length of approximately 580k tokens. Comprehensive experiments on various video generation tasks validate the effectiveness of our approach. We provide an anonymous link \url{https://anonymous.4open.science/r/MoGA} to showcase the generated videos.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While the system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND ITS JUDGMENTS ARE APPROXIMATE. The results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Mixture-of-Groups Attention (MoGA), a learnable sparse attention mechanism using token routing for long video generation, and demonstrates end-to-end minute-level video synthesis at 480p/24fps with ~580k context length. It resides in the 'Learnable Routing and Adaptive Selection' leaf, which contains only four papers total, indicating a relatively sparse research direction within the broader taxonomy of 50 papers. This leaf focuses specifically on trainable modules for dynamic token selection, distinguishing it from the more populated 'Fixed Pattern Sparse Attention' branch (seven papers) that employs predefined structures.

The taxonomy reveals that MoGA's leaf sits within 'Sparse Attention Mechanism Design', adjacent to branches exploring fixed patterns, training-free methods, and hierarchical structures. Neighboring leaves like 'Fixed Pattern Sparse Attention' address similar efficiency goals but through static geometric patterns (radial, block-sparse), while 'Training-Free Sparse Attention' (seven papers) applies heuristics without retraining. The 'Task-Specific Applications' branch, particularly 'Long-Form and Multi-Shot Video Generation' (three papers), represents the application domain where MoGA's contributions materialize, showing how mechanism design connects to downstream synthesis tasks.

Among the 28 candidates examined, the contribution-level analysis shows mixed novelty signals. For the MoGA mechanism, 10 candidates were examined and 1 refutable match was found, suggesting moderate overlap with prior work on learnable routing. For the end-to-end video generation model, 10 candidates were examined and 2 refutable matches were found, indicating more substantial precedent in long-form synthesis systems. For the data pipeline contribution, 8 candidates were examined and 1 refutable match was found. These statistics reflect a limited semantic search scope rather than exhaustive coverage, so related work may exist beyond the top-K matches analyzed.

Based on the 28-candidate search scope, the work appears to incrementally advance learnable routing within a sparsely populated taxonomy leaf, though the end-to-end system shows more overlap with existing long video generation approaches. The analysis captures semantic neighbors but cannot confirm absence of related work in adjacent research communities or recent preprints outside the search window. The taxonomy structure suggests the mechanism design is less crowded than application-focused directions, but definitive novelty assessment requires broader literature coverage.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 4

Research Landscape Overview

Core task: efficient sparse attention for long video generation. The field addresses the computational bottleneck of full attention in video diffusion models by developing mechanisms that selectively attend to relevant tokens across spatial and temporal dimensions.

The taxonomy reveals five major branches: Sparse Attention Mechanism Design explores architectural patterns for reducing quadratic complexity, including fixed patterns like sliding windows and strided attention, as well as learnable routing strategies; Training and Optimization Strategies focuses on how to effectively learn sparse patterns during model training or adapt pretrained models; Inference Acceleration and Deployment targets runtime efficiency through caching, pruning, and quantization; Task-Specific Applications examines domain-tailored solutions for personalized generation, long-form video synthesis, and multimodal understanding; and Analysis and Theoretical Foundations provides empirical studies and mathematical grounding for sparsity choices.

Representative works span from early architectural innovations like Nuwa[18] to recent systems like Hunyuanvideo[28] and Sana Video[10] that integrate multiple efficiency techniques. Particularly active lines of work contrast fixed geometric patterns against adaptive, input-dependent selection. Fixed approaches such as Radial Attention[2] and Bidirectional Sparse Attention[3] offer predictable memory footprints but may miss task-relevant dependencies, while learnable methods like Trainable Sparse Attention[29] and VORTA[32] dynamically route attention based on content but introduce training overhead. MoGA[0] sits within the learnable routing cluster, emphasizing mixture-of-experts-style gating to adaptively select attention contexts, closely aligning with Mixture of Contexts[1] in its use of dynamic selection mechanisms.

Compared to VORTA[32], which focuses on token-level routing, MoGA[0] appears to leverage coarser-grained expert assignment, trading fine-grained control for reduced routing complexity. This positioning reflects broader tensions in the field between maximizing adaptivity and maintaining training stability, with ongoing questions about how to balance sparsity ratios, routing granularity, and generalization across diverse video generation tasks.

Claimed Contributions

Mixture-of-Groups Attention (MoGA)

MoGA is a novel sparse attention mechanism that employs a single linear layer as a token router to assign tokens directly to semantic groups, avoiding the coarse block-level estimation used in prior methods. This enables efficient long-range interactions while remaining compatible with modern attention kernels like FlashAttention and sequence parallelism.
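The routing described above can be sketched in a few lines. The following is a minimal single-head NumPy illustration under our own assumptions (hard argmax assignment, q = k = v, no output projection), not the authors' implementation:

```python
import numpy as np

def moga_self_attention(x, w_router, num_groups):
    """Sketch of group-routed sparse attention: a single linear layer
    assigns each token to one group, and attention is computed only
    among tokens sharing a group (all details are assumptions)."""
    logits = x @ w_router                 # (T, G): router is one linear layer
    groups = logits.argmax(axis=-1)       # hard assignment per token
    out = np.zeros_like(x)
    scale = 1.0 / np.sqrt(x.shape[-1])
    for g in range(num_groups):
        idx = np.flatnonzero(groups == g)
        if idx.size == 0:
            continue
        t = x[idx]                        # tokens routed to group g
        scores = (t @ t.T) * scale        # attention restricted to the group
        scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[idx] = w @ t                  # q = k = v = x for brevity
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
w_router = rng.standard_normal((8, 4))
y = moga_self_attention(x, w_router, num_groups=4)
print(y.shape)  # (16, 8)
```

Because each group's attention is an ordinary dense attention over a shorter sequence, the per-group computation could in principle be delegated to a standard kernel such as FlashAttention, which is consistent with the "kernel-free" compatibility claim.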

Retrieved papers: 10 (can refute)
End-to-end long video generation model

The authors develop a video generation model built on MoGA that can generate minute-long, multi-shot videos at 480p resolution and 24 fps in an end-to-end manner, handling context lengths of around 580k tokens without requiring multi-stage pipelines.
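The ~580k figure is consistent with a back-of-envelope token count. The compression factors and frame size below (4x temporal and 8x spatial VAE compression, 2x2 patchification, 480x832 frames) are our assumptions, not numbers taken from the report:

```python
# Rough context-length estimate for one minute of 480p video at 24 fps.
frames = 60 * 24                               # 1440 raw frames
latent_frames = frames // 4                    # assumed 4x temporal compression
tokens_per_frame = (480 // 16) * (832 // 16)   # assumed 8x VAE * 2x patchify per axis
context = latent_frames * tokens_per_frame
print(context)  # 561600, the same order as the ~580k reported
```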

Retrieved papers: 10 (can refute)
Multi-shot long video data pipeline

The authors construct a two-stage data pipeline that converts raw long videos into one-minute, multi-shot clips with dense annotations, including video-level filtering and shot-level processing with VQA, OCR-based cropping, and multimodal captioning to enable shot-level text conditioning.
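The two-stage flow can be sketched as a skeleton. Every function name, threshold, and record field below is an illustrative assumption; real pipelines would call external shot-detection, VQA, OCR, and captioning models:

```python
def video_level_filter(videos, min_seconds=60):
    """Stage 1 (placeholder): keep raw videos long enough to yield a one-minute clip."""
    return [v for v in videos if v["duration"] >= min_seconds]

def shot_level_process(video):
    """Stage 2 (placeholder): split a clip into shots and attach per-shot annotations."""
    shots = []
    for start, end in video["shot_boundaries"]:
        shots.append({
            "start": start,
            "end": end,
            # Stand-ins for the VQA score, OCR-based crop, and multimodal
            # caption mentioned in the description.
            "vqa_score": None,
            "crop": None,
            "caption": None,
        })
    return {"video_id": video["id"], "shots": shots}

videos = [
    {"id": "a", "duration": 95, "shot_boundaries": [(0, 40), (40, 95)]},
    {"id": "b", "duration": 20, "shot_boundaries": [(0, 20)]},
]
dataset = [shot_level_process(v) for v in video_level_filter(videos)]
print(len(dataset), len(dataset[0]["shots"]))  # 1 2
```

The per-shot records are what would enable shot-level text conditioning: each shot carries its own caption rather than one caption for the whole minute-long clip.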

Retrieved papers: 8 (can refute)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Mixture-of-Groups Attention (MoGA)

MoGA is a novel sparse attention mechanism that employs a single linear layer as a token router to assign tokens directly to semantic groups, avoiding the coarse block-level estimation used in prior methods. This enables efficient long-range interactions while remaining compatible with modern attention kernels like FlashAttention and sequence parallelism.

Contribution

End-to-end long video generation model

The authors develop a video generation model built on MoGA that can generate minute-long, multi-shot videos at 480p resolution and 24 fps in an end-to-end manner, handling context lengths of around 580k tokens without requiring multi-stage pipelines.

Contribution

Multi-shot long video data pipeline

The authors construct a two-stage data pipeline that converts raw long videos into one-minute, multi-shot clips with dense annotations, including video-level filtering and shot-level processing with VQA, OCR-based cropping, and multimodal captioning to enable shot-level text conditioning.