SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer
Overview
Overall Novelty Assessment
SANA-Video introduces a linear diffusion transformer for efficient high-resolution, minute-length video generation, combining linear attention with a constant-memory KV cache for block-wise autoregressive synthesis. The paper sits in the 'Block Linear Attention for Long Video Synthesis' leaf, which contains only three papers, SANA-Video included. Within the broader taxonomy of 36 papers, this is a sparse direction, suggesting that the specific combination of block linear attention and constant-memory caching for extended video generation remains an emerging area with limited prior exploration.
SANA-Video's parent branch, 'Linear Attention Mechanisms for Diffusion Transformers', encompasses neighboring approaches such as gated linear attention and post-training adaptation methods. Adjacent branches explore sparse-linear fusion strategies and state space models, which offer alternative routes to linear complexity. The taxonomy's scope notes clarify that block linear attention specifically targets minute-length synthesis through structured decomposition, distinguishing it from full-sequence linear methods and hybrid sparse approaches. SANA-Video thus bridges architectural efficiency (linear attention) and practical deployment constraints (constant-memory caching), connecting to but diverging from the pure architectural innovations in neighboring leaves.
Among the three contributions analyzed, the linear DiT architecture was compared against 10 candidates, of which 2 appear to overlap with prior work; the constant-memory KV cache against 5 candidates, none of which clearly refutes novelty; and the training strategy against 10 candidates, with 2 potential overlaps. Given the limited scope of 25 total candidates from semantic search, these statistics suggest that the core architectural innovation (the linear DiT) occupies a more crowded space, whereas the constant-memory caching mechanism for block attention is less directly addressed in the examined literature. The analysis does not claim exhaustive coverage, but it indicates differential novelty across the contributions within the sampled candidate set.
Based on the limited literature search, SANA-Video appears to occupy a sparsely populated niche combining block linear attention with constant-memory mechanisms for long video synthesis. The taxonomy structure and sibling paper count suggest this specific integration is relatively underexplored, though individual components (linear attention, block decomposition) have precedents in the examined candidates. The analysis reflects top-25 semantic matches and does not capture the full landscape of video generation research, particularly work outside the linear attention paradigm or published after the search cutoff.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors extend SANA's linear DiT design to video by replacing all attention modules with efficient linear attention, reducing complexity from O(N²) to O(N). They integrate Rotary Position Embeddings (RoPE) and introduce a 1D temporal convolution to the Mix-FFN for improved spatio-temporal modeling.
The authors reformulate causal linear attention to maintain a fixed-memory KV cache that provides global context at constant memory cost. This enables efficient minute-long video generation without the memory overhead of traditional KV caches used in full attention models.
The authors develop a multi-stage training approach that leverages pre-trained text-to-image models, applies resolution-specific data filtering criteria, and uses a coarse-to-fine training paradigm. This reduces training costs to approximately 1% of MovieGen's while achieving competitive performance.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[4] LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
[20] LinGen-Uni: A Universal Linear-Complexity Framework for High-Resolution Minute-Length Text-to-Video Generation
Contribution Analysis
Detailed comparisons for each claimed contribution
Linear DiT for efficient video generation
The authors extend SANA's linear DiT design to video by replacing all attention modules with efficient linear attention, reducing complexity from O(N²) to O(N). They integrate Rotary Position Embeddings (RoPE) and introduce a 1D temporal convolution to the Mix-FFN for improved spatio-temporal modeling.
[4] LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity
[53] SANA: Efficient high-resolution text-to-image synthesis with linear diffusion transformers
[3] Radial Attention: Sparse Attention with Energy Decay for Long Video Generation
[10] On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices
[11] Memo: Memory-guided diffusion for expressive talking video generation
[12] Efficient Long-duration Talking Video Synthesis with Linear Diffusion Transformer under Multimodal Guidance
[21] Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile
[52] Photorealistic video generation with diffusion models
[54] ConceptAttention: Diffusion Transformers Learn Highly Interpretable Features
[55] Efficient diffusion transformer with step-wise dynamic attention mediators
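The attention substitution at the heart of this contribution can be sketched in a few lines. The snippet below is a minimal NumPy illustration of (non-causal) linear attention with a ReLU feature map, assumed here for concreteness; SANA-Video's actual kernels, RoPE integration, and the temporal Mix-FFN convolution are not reproduced.

```python
import numpy as np

def linear_attention(q, k, v, eps=1e-6):
    """Non-causal linear attention with a ReLU feature map (assumed).

    out = phi(Q) (phi(K)^T V) / (phi(Q) phi(K)^T 1)
    Because phi(K)^T V is a (d, d_v) summary computed once, the cost is
    O(N * d * d_v) rather than the O(N^2 * d) of softmax attention.
    Shapes: q, k: (N, d); v: (N, d_v).
    """
    phi_q = np.maximum(q, 0.0)   # ReLU feature map
    phi_k = np.maximum(k, 0.0)
    kv = phi_k.T @ v             # (d, d_v): summed key-value outer products
    z = phi_k.sum(axis=0)        # (d,): normalizer accumulator
    num = phi_q @ kv             # (N, d_v)
    den = phi_q @ z + eps        # (N,)
    return num / den[:, None]
```

Associativity of the matrix products is what removes the N×N attention map: the queries never meet the keys directly, only the fixed-size summaries `kv` and `z`.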
Constant-memory KV cache for block linear attention
The authors reformulate causal linear attention to maintain a fixed-memory KV cache that provides global context at constant memory cost. This enables efficient minute-long video generation without the memory overhead of traditional KV caches used in full attention models.
[37] LongLive: Real-time Interactive Long Video Generation
[38] GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression
[39] SneakPeek: Future-Guided Instructional Streaming Video Generation
[40] LoRATv2: Enabling Low-Cost Temporal Modeling in One-Stream Trackers
[41] InfVSR: Breaking Length Limits of Generic Video Super-Resolution
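The constant-memory claim follows from the same associativity that makes linear attention linear: the entire past can be folded into running sums whose size depends only on the head dimensions. The sketch below is an illustrative reconstruction under that reading, not the authors' implementation; the class name, ReLU feature map, and block interface are assumptions.

```python
import numpy as np

class LinearAttentionCache:
    """Constant-memory stand-in for a KV cache in causal linear attention.

    A sketch, not the authors' implementation: instead of storing every
    past key/value, we keep the running sums S = sum_j phi(k_j) v_j^T and
    z = sum_j phi(k_j). Their size depends only on the head dimensions,
    not on how many blocks (frames) have been generated.
    """

    def __init__(self, d, d_v):
        self.S = np.zeros((d, d_v))  # accumulated key-value outer products
        self.z = np.zeros(d)         # accumulated feature-map normalizer

    @staticmethod
    def _phi(x):
        return np.maximum(x, 0.0)    # ReLU feature map (assumed)

    def attend(self, q, eps=1e-6):
        """Attend a new block's queries against the cached global context."""
        pq = self._phi(q)
        return (pq @ self.S) / (pq @ self.z + eps)[:, None]

    def update(self, k, v):
        """Fold a finished block's keys/values into the fixed-size state."""
        pk = self._phi(k)
        self.S += pk.T @ v
        self.z += pk.sum(axis=0)
```

Generation then proceeds block by block: `attend` reads the cached global context, the block is synthesized, and `update` folds it into the state, so memory stays constant regardless of video length.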
Efficient training strategy and data filtering
The authors develop a multi-stage training approach that leverages pre-trained text-to-image models, applies resolution-specific data filtering criteria, and uses a coarse-to-fine training paradigm. This reduces training costs to approximately 1% of MovieGen's while achieving competitive performance.
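As a rough illustration of what resolution-specific filtering inside a coarse-to-fine schedule might look like, the sketch below defines per-stage thresholds and a filter over clip metadata. Every stage name, metadata field, and threshold value here is a hypothetical placeholder, not the paper's actual criteria.

```python
# Hypothetical coarse-to-fine schedule: all names and thresholds below are
# illustrative placeholders, not the paper's actual filtering criteria.
STAGES = [
    {"name": "low-res pretrain",  "min_height": 256, "min_aesthetic": 4.0},
    {"name": "mid-res refine",    "min_height": 480, "min_aesthetic": 5.0},
    {"name": "high-res finetune", "min_height": 720, "min_aesthetic": 5.5},
]

def filter_clips(clips, stage):
    """Keep only clips meeting this stage's resolution/quality thresholds."""
    return [c for c in clips
            if c["height"] >= stage["min_height"]
            and c["aesthetic"] >= stage["min_aesthetic"]]
```

Later stages see progressively fewer, higher-quality clips, which is one way a coarse-to-fine paradigm can trade data volume for resolution as training advances.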