Neodragon: Mobile Video Generation Using Diffusion Transformer

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Text-to-Video Generation, Flow Matching, Diffusion Transformer, Diffusion Models, Mobile Video Generation, Step Distillation, Block Pruning, Text-Encoder Distillation, Asymmetric Decoder Distillation
Abstract:

We propose Neodragon, a video DiT (Diffusion Transformer) designed to run on the low-power NPUs found in devices such as phones and laptops. We demonstrate that, despite the huge memory and compute cost of video transformers, mobile devices can run these models when they are carefully optimised for efficiency. To reach this level of efficiency, (i) we replace the original large text encoder with a much smaller one at minimal quality loss through our novel distillation framework, which doesn't require any image or video data; (ii) we propose an asymmetric decoder distillation approach that allows us to replace the native codec-latent-VAE decoder with a more efficient one without disturbing the generative latent space of the video generation pipeline; (iii) with our block pruning strategy, we remove entire blocks from the MMDiT denoiser based on their relative importance and recover the original performance through a two-stage distillation process; and (iv) we reduce the diffusion sampling cost using our novel extension of DMD (Distribution Matching Distillation) to the pyramidal flow-matching objective. Neodragon generates 49 frames at 640×1024 resolution within 7.6 seconds on the Qualcomm Hexagon NPU with a VBench total score of 81.61, setting a new state of the art for mobile video generation.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Neodragon, a mobile-optimized video diffusion transformer combining four distinct efficiency techniques: text-encoder distillation, asymmetric decoder distillation, block pruning, and extended distribution matching distillation for pyramidal flow-matching. Within the taxonomy, it occupies the 'Distribution Matching Distillation' leaf under 'Denoising Process Acceleration', which currently contains no sibling papers. This isolation suggests that the specific combination of pyramidal flow-matching with distribution matching distillation for mobile video generation represents a relatively unexplored niche, though the broader denoising-acceleration direction includes related adversarial methods.

The taxonomy reveals that efficient mobile video generation research clusters around three main strategies: attention optimization (linear/hybrid attention, token merging), model compression (pruning, compact architectures), and denoising acceleration (distillation, adversarial methods). Neodragon bridges multiple branches by combining denoising acceleration with block pruning (from 'Model Compression') and decoder optimization (from 'Decoder and Autoencoder Optimization'). Neighboring leaves like 'Channel and Temporal Block Pruning' and 'Compact Diffusion Transformer Design' address complementary efficiency dimensions, while 'Adversarial Denoising Reduction' offers an alternative acceleration paradigm. The taxonomy's scope notes clarify that denoising methods exclude attention mechanism changes, positioning Neodragon's multi-pronged approach as integrative rather than purely specialized.

Among 21 candidates examined, the MMDiT Block Pruning contribution shows overlap with 2 prior works from 8 candidates reviewed, suggesting established precedent for transformer block removal strategies. The Text-Encoder Distillation (10 candidates, 0 refutations) and Asymmetric Decoder Distillation (3 candidates, 0 refutations) appear more distinctive within this limited search scope. The extended distribution matching distillation for pyramidal flow-matching occupies an unpopulated taxonomy leaf, though the small candidate pool (21 total) means potentially relevant work in broader diffusion distillation or flow-matching literature may exist beyond the top-K semantic matches examined. The analysis captures immediate neighbors but cannot claim exhaustive coverage of the rapidly evolving mobile diffusion landscape.

Taxonomy

Core-task Taxonomy Papers: 12
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 2

Research Landscape Overview

Core task: efficient video generation on mobile devices using diffusion transformers. The field addresses the challenge of deploying computationally expensive diffusion-based video generation models on resource-constrained mobile hardware.

The taxonomy reveals several complementary optimization strategies: Attention Mechanism Optimization focuses on reducing the quadratic complexity of self-attention operations, while Denoising Process Acceleration targets the iterative sampling bottleneck through techniques like step reduction and distillation. Model Compression and Architecture Optimization encompasses pruning, quantization, and lightweight architectural designs, whereas Decoder and Autoencoder Optimization improves the efficiency of latent-space encoding. Application-Specific Mobile Video Generation tailors solutions to particular use cases, and Related Video Processing Tasks covers adjacent problems that inform mobile deployment strategies.

Representative works like Mobile Video Diffusion[5] and MobileVidFactory[8] demonstrate early efforts to adapt video diffusion models for on-device inference, while more recent approaches such as On-device Sora[2] and Taming Diffusion Transformer[3] push the boundaries of what is achievable on mobile platforms. A particularly active research direction involves distribution matching distillation methods that compress multi-step diffusion processes into fewer iterations without sacrificing quality. Neodragon[0] exemplifies this approach by employing distillation techniques to accelerate the denoising process specifically for mobile deployment. This positions it alongside works like Taming Diffusion Efficient[4] and Lightning Video[12], which similarly prioritize inference speed through step reduction. In contrast, approaches such as Attention Surgery[1] and Wavelet Dynamic Transformer[6] emphasize architectural modifications to attention mechanisms and frequency-domain representations.
The trade-off between distillation-based acceleration and architectural redesign remains an open question: distillation methods often achieve dramatic speedups but require teacher models and careful training, while architectural optimizations offer more direct efficiency gains but may involve greater design complexity. Neodragon[0] sits firmly within the distillation-focused cluster, sharing methodological kinship with step-reduction strategies while differing from attention-centric or application-specific branches.

Claimed Contributions

Text-Encoder Distillation Framework

A prompt-only distillation framework that compresses the 4.762B-parameter T5XXL text encoder by 35× into a 0.130B-parameter DistilT5 model using a trainable ContextAdapter module, achieving minimal quality degradation without requiring image or video data for training.

Retrieved candidate papers: 10
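The prompt-only recipe can be illustrated with a toy sketch: a small student's token embeddings are lifted through a linear ContextAdapter into the frozen teacher's embedding space and trained to match the teacher's outputs on text prompts alone. The dimensions, the plain linear adapter, and the MSE objective below are illustrative assumptions, not the paper's actual architecture or loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a T5XXL-like teacher vs. a much smaller student,
# both shrunk drastically for illustration.
TEACHER_DIM, STUDENT_DIM, SEQ_LEN = 64, 16, 8

def adapter_forward(student_tokens, W):
    """ContextAdapter sketch: a linear map lifting student token
    embeddings (seq_len, STUDENT_DIM) into the teacher's space."""
    return student_tokens @ W

def distill_loss(student_tokens, teacher_tokens, W):
    """Prompt-only distillation objective: match the adapted student
    embeddings to the frozen teacher's -- no images or videos needed."""
    diff = adapter_forward(student_tokens, W) - teacher_tokens
    return float(np.mean(diff ** 2))

def adapter_grad_step(student_tokens, teacher_tokens, W, lr=0.1):
    """One explicit gradient step on the adapter weights for the MSE loss."""
    n, d = student_tokens.shape[0], teacher_tokens.shape[1]
    resid = adapter_forward(student_tokens, W) - teacher_tokens
    grad = (2.0 / (n * d)) * student_tokens.T @ resid
    return W - lr * grad
```

In the real pipeline the student encoder itself is also trained; here only the adapter is fitted, which already shows why no pixel data is required: the supervision signal is entirely in the teacher's text-embedding space.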
Asymmetric Decoder Distillation Approach

A distillation method that replaces the native VAE decoder with a device-friendly architecture achieving over 20× parameter reduction, while preserving the frozen encoder and generative latent space through end-to-end fine-tuning with reconstruction objectives.

Retrieved candidate papers: 3
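The "asymmetric" part of this design can be sketched in a linear toy: the encoder is frozen (so the generative latent space that the denoiser was trained against is untouched), and only a lightweight replacement decoder is fitted with a reconstruction objective. The shapes and the linear encoder/decoder below are stand-ins, not the real codec-latent VAE.

```python
import numpy as np

rng = np.random.default_rng(1)
FRAMES, PIX, LATENT = 4, 32, 8  # toy sizes, not the real VAE shapes

# Frozen encoder: a fixed random projection standing in for the
# codec-latent-VAE encoder. Keeping it frozen is what preserves the
# generative latent space the denoiser expects.
E = rng.standard_normal((PIX, LATENT)) / np.sqrt(PIX)

def encode(frames):
    """Frozen: never updated during decoder distillation."""
    return frames @ E

def student_decode(latents, D):
    """Lightweight stand-in for the distilled, device-friendly decoder."""
    return latents @ D

def recon_loss(frames, D):
    """Reconstruction objective; gradients flow only into D."""
    return float(np.mean((student_decode(encode(frames), D) - frames) ** 2))
```

In this linear toy the optimal student decoder is just a least-squares fit against the frozen encoder's latents; the actual method instead fine-tunes a smaller neural decoder end-to-end with reconstruction losses.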
MMDiT Block Pruning Strategy

A novel block-pruning strategy for MMDiT architecture that removes entire blocks based on importance scores computed via cosine distance between input and output tokens, followed by a two-stage fine-tuning process that recovers performance while achieving over 25% parameter reduction.

Retrieved candidate papers: 8 (refuting prior work found)
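The importance scoring described above can be sketched in a few lines, assuming the score for each block is the mean cosine distance between its input and output tokens and that the lowest-scoring blocks are removed; the two-stage recovery fine-tuning is not shown, and the tensor shapes are illustrative.

```python
import numpy as np

def block_importance(block_inputs, block_outputs):
    """One score per block: mean cosine distance between the tokens
    entering and leaving the block. A block whose output barely rotates
    its input (low distance) contributes little and is a pruning candidate."""
    scores = []
    for x, y in zip(block_inputs, block_outputs):
        # x, y: (num_tokens, hidden_dim)
        num = np.sum(x * y, axis=-1)
        den = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1) + 1e-8
        cos_sim = num / den
        scores.append(float(np.mean(1.0 - cos_sim)))
    return scores

def prune(scores, keep_ratio=0.75):
    """Return the (sorted) indices of the blocks to keep: the top
    keep_ratio fraction by importance score."""
    k = max(1, int(round(len(scores) * keep_ratio)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Note that cosine distance is scale-invariant: a block that only rescales its tokens scores near zero, the same as an identity block, which matches the intuition that such blocks can be folded away.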

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Text-Encoder Distillation Framework (as summarized under Claimed Contributions above)

Contribution: Asymmetric Decoder Distillation Approach (as summarized under Claimed Contributions above)

Contribution: MMDiT Block Pruning Strategy (as summarized under Claimed Contributions above)