Abstract:

We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720×1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality, long videos with strong text-video alignment at remarkably fast speed, and is deployable on an RTX 5090 GPU. Two core designs enable efficient, effective, long video generation: (1) Linear DiT: we use linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-memory KV cache for block linear attention: we design a block-wise autoregressive approach for long video generation that employs a constant-memory state, derived from the cumulative properties of linear attention. This cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, reducing the training cost to 12 days on 64 H100 GPUs, only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves performance competitive with modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, cutting the inference time for a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation. Code and model will be publicly released.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

SANA-Video introduces a linear diffusion transformer for efficient high-resolution, minute-length video generation, combining linear attention with a constant-memory KV cache for block-wise autoregressive synthesis. The paper resides in the 'Block Linear Attention for Long Video Synthesis' leaf, which contains only three papers including SANA-Video itself. This is a relatively sparse research direction within the broader taxonomy of 36 papers across the field, suggesting the specific combination of block linear attention and constant-memory caching for extended video generation remains an emerging area with limited prior exploration.

The taxonomy reveals that SANA-Video's parent branch, 'Linear Attention Mechanisms for Diffusion Transformers', encompasses neighboring approaches like gated linear attention and post-training adaptation methods. Adjacent branches explore sparse-linear fusion strategies and state space models, which offer alternative pathways to linear complexity. The taxonomy's scope notes clarify that block linear attention specifically targets minute-length synthesis through structured decomposition, distinguishing it from full-sequence linear methods and hybrid sparse approaches. SANA-Video's positioning suggests it bridges architectural efficiency (linear attention) with practical deployment constraints (constant-memory caching), connecting to but diverging from pure architectural innovations in neighboring leaves.

Among the three contributions analyzed, the linear DiT architecture examined 10 candidates with 2 appearing to provide overlapping prior work, while the constant-memory KV cache examined 5 candidates with none clearly refuting novelty. The training strategy contribution also examined 10 candidates with 2 potential overlaps. Given the limited search scope of 25 total candidates from semantic search, these statistics suggest the core architectural innovation (linear DiT) operates in a more crowded space, whereas the constant-memory caching mechanism for block attention appears less directly addressed in the examined literature. The analysis does not claim exhaustive coverage but indicates differential novelty across contributions within the sampled candidate set.

Based on the limited literature search, SANA-Video appears to occupy a sparsely populated niche combining block linear attention with constant-memory mechanisms for long video synthesis. The taxonomy structure and sibling paper count suggest this specific integration is relatively underexplored, though individual components (linear attention, block decomposition) have precedents in the examined candidates. The analysis reflects top-25 semantic matches and does not capture the full landscape of video generation research, particularly work outside the linear attention paradigm or published after the search cutoff.

Taxonomy

Core-task Taxonomy Papers: 36
Claimed Contributions: 3
Contribution Candidate Papers Compared: 25
Refutable Papers: 4

Research Landscape Overview

Core task: efficient video generation with linear attention. The field addresses the computational bottleneck of quadratic attention in video diffusion transformers by exploring diverse architectural strategies. The taxonomy reveals several major branches: Linear Attention Mechanisms for Diffusion Transformers develop direct replacements for standard attention that scale linearly with sequence length, often through kernel approximations or block-wise decompositions (e.g., LinGen[4], LinVideo[6]). Sparse and Hybrid Attention Strategies selectively compute attention over subsets of tokens or combine local and global patterns to reduce complexity while preserving quality. State Space Models for Video Generation leverage recurrent formulations like Mamba to achieve linear scaling, trading the flexibility of attention for efficiency (e.g., Diffusion Mamba[29]). Additional branches tackle Consistency and Temporal Coherence Methods to maintain frame-to-frame stability, Efficient Inference and Deployment Optimization for real-time or on-device scenarios (e.g., On-device Sora[10]), and domain-specific applications ranging from video editing to multimodal synthesis.

Within the Linear Attention Mechanisms branch, a particularly active line of work focuses on block linear attention for long video synthesis, where models partition sequences into manageable chunks to balance memory and quality. SANA-Video[0] exemplifies this approach by employing block-structured linear attention to generate extended sequences efficiently, positioning itself alongside LinGen[4] and LinGen-Uni[20], which similarly decompose attention across temporal blocks. These methods contrast with global linear attention schemes (e.g., Global Linear Attention[15]) that apply a single linear operator across all frames, trading fine-grained temporal modeling for simplicity.
Meanwhile, hybrid strategies like Radial Attention[3] and plug-and-play modules (Plug-and-Play Linear[16]) explore middle grounds, injecting linear components into existing architectures without full redesigns. The central tension across these directions lies in balancing computational savings, temporal coherence, and the ability to capture long-range dependencies—challenges that SANA-Video[0] addresses through its block-wise design, which shares conceptual ground with LinGen[4] but may differ in block size, attention kernel, or training objectives.

Claimed Contributions

Linear DiT for efficient video generation

The authors extend SANA's linear DiT design to video by replacing all attention modules with efficient linear attention, reducing complexity from O(N²) to O(N). They integrate Rotary Position Embeddings (RoPE) and introduce a 1D temporal convolution to the Mix-FFN for improved spatio-temporal modeling.
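The claimed O(N²) to O(N) reduction comes from reordering the attention computation around a kernel feature map. The following NumPy sketch (an illustration only; the ELU+1 feature map and tensor shapes are assumptions, not the authors' exact design) shows the mechanism: once queries and keys pass through φ, the key-value product can be summed a single time and reused for every query, so cost scales with N·d² rather than N²·d.

```python
import numpy as np

def phi(x):
    # ELU(x) + 1: a positive feature map commonly used in linear attention
    # (an illustrative choice; the paper's kernel may differ)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Non-causal linear attention in O(N * d^2) instead of O(N^2 * d).

    Q, K, V: arrays of shape (N, d). Because phi(Q) @ (phi(K).T @ V)
    equals (phi(Q) @ phi(K).T) @ V, the (d, d) summary Kf.T @ V is
    computed once instead of forming the (N, N) attention matrix.
    """
    Qf, Kf = phi(Q), phi(K)           # (N, d) feature-mapped queries/keys
    KV = Kf.T @ V                     # (d, d) key-value summary, one pass
    Z = Kf.sum(axis=0)                # (d,)   normalizer summary
    num = Qf @ KV                     # (N, d) per-token numerators
    den = Qf @ Z                      # (N,)   per-token normalizers
    return num / den[:, None]
```

Because φ maps into positive values, the denominator is strictly positive, so no masking or epsilon is needed in this sketch.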

10 retrieved papers (status: Can Refute)
Constant-memory KV cache for block linear attention

The authors reformulate causal linear attention to maintain a fixed-memory KV cache that provides global context at constant memory cost. This enables efficient minute-long video generation without the memory overhead of traditional KV caches used in full attention models.
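The constant-memory state can be made concrete: for causal linear attention, all past context is summarized by the running sums S = Σ φ(k_t) v_tᵀ and z = Σ φ(k_t), whose sizes depend only on the head dimension, not on how many frames have been generated. The sketch below (NumPy; the class name, feature map, and block handling are illustrative assumptions, not the paper's implementation) processes video blocks autoregressively against this fixed-size state.

```python
import numpy as np

def phi(x):
    # ELU(x) + 1 positive feature map (illustrative choice)
    return np.where(x > 0, x + 1.0, np.exp(x))

class LinearAttentionState:
    """Constant-memory stand-in for a KV cache under causal linear attention.

    Instead of storing every past key/value pair (memory grows with sequence
    length), only the cumulative sums S (d_k x d_v) and z (d_k) are kept, so
    memory stays fixed no matter how many blocks have been processed.
    """
    def __init__(self, d_k, d_v):
        self.S = np.zeros((d_k, d_v))   # running sum of outer(phi(k), v)
        self.z = np.zeros(d_k)          # running sum of phi(k)

    def attend_and_update(self, Q, K, V):
        """Attend one new block to all prior context, then fold it in."""
        Qf, Kf = phi(Q), phi(K)
        out = np.empty((Q.shape[0], V.shape[1]))
        S, z = self.S.copy(), self.z.copy()
        for t in range(Q.shape[0]):
            # extend the cumulative state by one token (causal order)
            S = S + np.outer(Kf[t], V[t])
            z = z + Kf[t]
            out[t] = (Qf[t] @ S) / (Qf[t] @ z)
        self.S, self.z = S, z           # constant-size state carried forward
        return out
```

Processing blocks through this state reproduces full causal linear attention over the whole sequence, which is why the global context survives even though nothing per-token is cached.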

5 retrieved papers
Efficient training strategy and data filtering

The authors develop a multi-stage training approach that leverages pre-trained text-to-image models, applies resolution-specific data filtering criteria, and uses a coarse-to-fine training paradigm. This reduces training costs to approximately 1% of MovieGen while achieving competitive performance.

10 retrieved papers (status: Can Refute)

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Linear DiT for efficient video generation

The authors extend SANA's linear DiT design to video by replacing all attention modules with efficient linear attention, reducing complexity from O(N²) to O(N). They integrate Rotary Position Embeddings (RoPE) and introduce a 1D temporal convolution to the Mix-FFN for improved spatio-temporal modeling.

Contribution

Constant-memory KV cache for block linear attention

The authors reformulate causal linear attention to maintain a fixed-memory KV cache that provides global context at constant memory cost. This enables efficient minute-long video generation without the memory overhead of traditional KV caches used in full attention models.

Contribution

Efficient training strategy and data filtering

The authors develop a multi-stage training approach that leverages pre-trained text-to-image models, applies resolution-specific data filtering criteria, and uses a coarse-to-fine training paradigm. This reduces training costs to approximately 1% of MovieGen while achieving competitive performance.