Neodragon: Mobile Video Generation Using Diffusion Transformer

ICLR 2026 Conference Submission (Anonymous Authors)
Keywords: Text-to-Video Generation, Flow Matching, Diffusion Transformer, Diffusion Models, Mobile Video Generation, Step Distillation, Block Pruning, Text-Encoder Distillation, Asymmetric Decoder Distillation
Abstract:

We propose Neodragon, a video DiT (Diffusion Transformer) designed to run on the low-power NPUs found in devices such as phones and laptops. We demonstrate that, despite the huge memory and compute cost of video transformers, mobile devices can run these models when they are carefully optimised for efficiency. To reach this level of efficiency, (i) we replace the original large text encoder with a much smaller one at minimal quality loss through our novel distillation framework, which doesn't require any image or video data; (ii) we propose an asymmetric decoder distillation approach that allows us to replace the native codec-latent-VAE decoder with a more efficient one without disturbing the generative latent space of the video generation pipeline; (iii) with our block pruning strategy, we remove entire blocks from the MMDiT denoiser based on their relative importance and recover the original performance through a two-stage distillation process; and (iv) we reduce the diffusion sampling cost using our novel extension of DMD (Distribution Matching Distillation) to the pyramidal flow-matching objective. Neodragon generates 49 frames at 640×1024 resolution within 7.6 seconds on the Qualcomm Hexagon NPU with a VBench total score of 81.61, setting a new state of the art for mobile video generation.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper proposes Neodragon, a mobile-optimized video diffusion transformer combining four distinct efficiency techniques: text-encoder distillation, asymmetric decoder distillation, block pruning, and extended distribution matching distillation for pyramidal flow-matching. Within the taxonomy, it occupies the 'Distribution Matching Distillation' leaf under 'Denoising Process Acceleration', which currently contains no sibling papers. This isolation suggests that the specific combination of pyramidal flow-matching with distribution matching distillation for mobile video generation represents a relatively unexplored niche, though the broader denoising-acceleration direction includes related adversarial methods.

The taxonomy reveals that efficient mobile video generation research clusters around three main strategies: attention optimization (linear/hybrid attention, token merging), model compression (pruning, compact architectures), and denoising acceleration (distillation, adversarial methods). Neodragon bridges multiple branches by combining denoising acceleration with block pruning (from 'Model Compression') and decoder optimization (from 'Decoder and Autoencoder Optimization'). Neighboring leaves like 'Channel and Temporal Block Pruning' and 'Compact Diffusion Transformer Design' address complementary efficiency dimensions, while 'Adversarial Denoising Reduction' offers an alternative acceleration paradigm. The taxonomy's scope notes clarify that denoising methods exclude attention mechanism changes, positioning Neodragon's multi-pronged approach as integrative rather than purely specialized.

Among 21 candidates examined, the MMDiT Block Pruning contribution shows overlap with 2 prior works from 8 candidates reviewed, suggesting established precedent for transformer block removal strategies. The Text-Encoder Distillation (10 candidates, 0 refutations) and Asymmetric Decoder Distillation (3 candidates, 0 refutations) appear more distinctive within this limited search scope. The extended distribution matching distillation for pyramidal flow-matching occupies an unpopulated taxonomy leaf, though the small candidate pool (21 total) means potentially relevant work in broader diffusion distillation or flow-matching literature may exist beyond the top-K semantic matches examined. The analysis captures immediate neighbors but cannot claim exhaustive coverage of the rapidly evolving mobile diffusion landscape.

Taxonomy

Core-task Taxonomy Papers: 12
Claimed Contributions: 3
Contribution Candidate Papers Compared: 21
Refutable Papers: 2

Research Landscape Overview

Core task: efficient video generation on mobile devices using diffusion transformers. The field addresses the challenge of deploying computationally expensive diffusion-based video generation models on resource-constrained mobile hardware.

The taxonomy reveals several complementary optimization strategies: Attention Mechanism Optimization focuses on reducing the quadratic complexity of self-attention operations, while Denoising Process Acceleration targets the iterative sampling bottleneck through techniques like step reduction and distillation. Model Compression and Architecture Optimization encompasses pruning, quantization, and lightweight architectural designs, whereas Decoder and Autoencoder Optimization improves the efficiency of latent-space encoding. Application-Specific Mobile Video Generation tailors solutions to particular use cases, and Related Video Processing Tasks covers adjacent problems that inform mobile deployment strategies.

Representative works like Mobile Video Diffusion[5] and MobileVidFactory[8] demonstrate early efforts to adapt video diffusion models for on-device inference, while more recent approaches such as On-device Sora[2] and Taming Diffusion Transformer[3] push the boundaries of what is achievable on mobile platforms. A particularly active research direction involves distribution matching distillation methods that compress multi-step diffusion processes into fewer iterations without sacrificing quality. Neodragon[0] exemplifies this approach by employing distillation techniques to accelerate the denoising process specifically for mobile deployment. This positions it alongside works like Taming Diffusion Efficient[4] and Lightning Video[12], which similarly prioritize inference speed through step reduction. In contrast, approaches such as Attention Surgery[1] and Wavelet Dynamic Transformer[6] emphasize architectural modifications to attention mechanisms and frequency-domain representations.
The trade-off between distillation-based acceleration and architectural redesign remains an open question: distillation methods often achieve dramatic speedups but require teacher models and careful training, while architectural optimizations offer more direct efficiency gains but may involve greater design complexity. Neodragon[0] sits firmly within the distillation-focused cluster, sharing methodological kinship with step-reduction strategies while differing from attention-centric or application-specific branches.

Claimed Contributions

Text-Encoder Distillation Framework

A prompt-only distillation framework that compresses the 4.762B-parameter T5XXL text encoder by 35× into a 0.130B-parameter DistilT5 model using a trainable ContextAdapter module, achieving minimal quality degradation without requiring image or video data for training.

Retrieved candidate papers: 10
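The prompt-only recipe can be illustrated with a toy sketch: a small student's token embeddings are lifted through a linear ContextAdapter into the frozen teacher's embedding space and trained to match the teacher's outputs on text prompts alone. The dimensions, the plain linear adapter, and the MSE objective below are illustrative assumptions, not the paper's actual architecture or loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a T5XXL-like teacher vs. a much smaller student,
# both shrunk drastically for illustration.
TEACHER_DIM, STUDENT_DIM, SEQ_LEN = 64, 16, 8

def adapter_forward(student_tokens, W):
    """ContextAdapter sketch: a linear map lifting student token
    embeddings (seq_len, STUDENT_DIM) into the teacher's space."""
    return student_tokens @ W

def distill_loss(student_tokens, teacher_tokens, W):
    """Prompt-only distillation objective: match the adapted student
    embeddings to the frozen teacher's -- no images or videos needed."""
    diff = adapter_forward(student_tokens, W) - teacher_tokens
    return float(np.mean(diff ** 2))

def adapter_grad_step(student_tokens, teacher_tokens, W, lr=0.1):
    """One explicit gradient step on the adapter weights for the MSE loss."""
    n, d = student_tokens.shape[0], teacher_tokens.shape[1]
    resid = adapter_forward(student_tokens, W) - teacher_tokens
    grad = (2.0 / (n * d)) * student_tokens.T @ resid
    return W - lr * grad
```

In the real pipeline the student encoder itself is also trained; here only the adapter is fitted, which already shows why no pixel data is required: the supervision signal is entirely in the teacher's text-embedding space.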
Asymmetric Decoder Distillation Approach

A distillation method that replaces the native VAE decoder with a device-friendly architecture achieving over 20× parameter reduction, while preserving the frozen encoder and generative latent space through end-to-end fine-tuning with reconstruction objectives.

Retrieved candidate papers: 3
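The "asymmetric" part of this design can be sketched in a linear toy: the encoder is frozen (so the generative latent space that the denoiser was trained against is untouched), and only a lightweight replacement decoder is fitted with a reconstruction objective. The shapes and the linear encoder/decoder below are stand-ins, not the real codec-latent VAE.

```python
import numpy as np

rng = np.random.default_rng(1)
FRAMES, PIX, LATENT = 4, 32, 8  # toy sizes, not the real VAE shapes

# Frozen encoder: a fixed random projection standing in for the
# codec-latent-VAE encoder. Keeping it frozen is what preserves the
# generative latent space the denoiser expects.
E = rng.standard_normal((PIX, LATENT)) / np.sqrt(PIX)

def encode(frames):
    """Frozen: never updated during decoder distillation."""
    return frames @ E

def student_decode(latents, D):
    """Lightweight stand-in for the distilled, device-friendly decoder."""
    return latents @ D

def recon_loss(frames, D):
    """Reconstruction objective; gradients flow only into D."""
    return float(np.mean((student_decode(encode(frames), D) - frames) ** 2))
```

In this linear toy the optimal student decoder is just a least-squares fit against the frozen encoder's latents; the actual method instead fine-tunes a smaller neural decoder end-to-end with reconstruction losses.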
MMDiT Block Pruning Strategy

A novel block-pruning strategy for MMDiT architecture that removes entire blocks based on importance scores computed via cosine distance between input and output tokens, followed by a two-stage fine-tuning process that recovers performance while achieving over 25% parameter reduction.

Retrieved candidate papers: 8 (refuting prior work found)
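The importance scoring described above can be sketched in a few lines, assuming the score for each block is the mean cosine distance between its input and output tokens and that the lowest-scoring blocks are removed; the two-stage recovery fine-tuning is not shown, and the tensor shapes are illustrative.

```python
import numpy as np

def block_importance(block_inputs, block_outputs):
    """One score per block: mean cosine distance between the tokens
    entering and leaving the block. A block whose output barely rotates
    its input (low distance) contributes little and is a pruning candidate."""
    scores = []
    for x, y in zip(block_inputs, block_outputs):
        # x, y: (num_tokens, hidden_dim)
        num = np.sum(x * y, axis=-1)
        den = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1) + 1e-8
        cos_sim = num / den
        scores.append(float(np.mean(1.0 - cos_sim)))
    return scores

def prune(scores, keep_ratio=0.75):
    """Return the (sorted) indices of the blocks to keep: the top
    keep_ratio fraction by importance score."""
    k = max(1, int(round(len(scores) * keep_ratio)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])
```

Note that cosine distance is scale-invariant: a block that only rescales its tokens scores near zero, the same as an identity block, which matches the intuition that such blocks can be folded away.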

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution: Text-Encoder Distillation Framework (as summarized under Claimed Contributions above)

Contribution: Asymmetric Decoder Distillation Approach (as summarized under Claimed Contributions above)

Contribution: MMDiT Block Pruning Strategy (as summarized under Claimed Contributions above)