Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Diffusion Transformer, Joint Audio-Video Generation, Synchronization
Abstract:

This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer for synchronized audio-video generation (JAVG). Built on the Diffusion Transformer (DiT) architecture, JavisDiT generates high-quality audio and video content simultaneously from open-ended user prompts within a unified framework. To ensure audio-video synchronization, we introduce a fine-grained spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator, which extracts both global and fine-grained spatio-temporal priors to guide synchronization between the visual and auditory components. We further propose a new benchmark, JavisBench, which consists of 10,140 high-quality text-captioned sounding videos and focuses on synchronization evaluation in diverse and complex real-world scenarios, together with a robust metric for measuring the synchrony between generated audio-video pairs on such content. Experimental results demonstrate that JavisDiT significantly outperforms existing methods, delivering both high-quality generation and precise synchronization and setting a new standard for JAVG tasks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

JavisDiT introduces a joint audio-video diffusion transformer featuring a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator that extracts multi-level spatio-temporal priors to guide synchronization. The paper resides in the Dual-Branch Diffusion Transformer Architectures leaf, which contains eight papers and is thus a moderately populated research direction within the broader unified joint generation category. This leaf focuses on parallel diffusion pathways with cross-modal interaction, distinguishing itself from single-backbone shared models and from multi-stage cascaded pipelines that handle audio and video sequentially.

The taxonomy tree reveals that Dual-Branch Diffusion Transformer Architectures sits alongside Shared-Backbone Unified Models (four papers) and Expert Model Fusion (three papers) within the parent category of Unified Joint Audio-Video Generation Architectures. Neighboring branches include Multi-Stage Cascaded Pipelines and Speech-Driven Talking Head Synthesis, which address related but distinct problems. The scope note clarifies that dual-branch methods employ separate but interacting branches, while shared-backbone approaches use a single architecture with modality-specific adapters. JavisDiT's hierarchical prior estimator differentiates it from sibling works that rely primarily on cross-attention or flow-based formulations for alignment.

Among 30 candidates examined across three contributions, no clearly refutable prior work was identified. The JavisDiT architecture contribution examined 10 candidates with none refuting the hierarchical spatio-temporal prior mechanism. The JavisBench dataset contribution also examined 10 candidates, finding no existing benchmark specifically targeting synchronization evaluation in diverse real-world scenarios at comparable scale. The JavisScore metric contribution similarly examined 10 candidates without encountering a prior metric explicitly designed for measuring real-world audio-video synchrony. These statistics suggest that within the limited search scope, the paper's specific combination of hierarchical prior extraction, benchmark design, and evaluation metric appears distinct from examined prior work.

Given the moderately populated leaf and the limited 30-candidate search, the work appears to introduce novel components, particularly the HiST-Sypo mechanism and the synchronization-focused evaluation infrastructure, within an active research direction. However, the analysis does not cover the full corpus of dual-branch diffusion architectures or exhaustively compare against all synchronization mechanisms in neighboring leaves. The absence of refutable candidates reflects the search scope rather than definitive field-wide novelty, and a broader literature review might reveal closer antecedents or parallel developments in related branches.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 0

Research Landscape Overview

Core task: synchronized audio-video generation from text prompts. The field has organized itself around several complementary strategies for producing coherent audiovisual content. Unified joint architectures aim to generate both modalities simultaneously within a single model, often leveraging dual-branch diffusion transformers or shared latent representations to maintain tight synchronization. Multi-stage and cascaded pipelines decompose the problem into sequential steps (first generating one modality, then conditioning the other), while speech-driven talking head synthesis focuses on the specialized case of animating human faces from audio. Video-conditioned audio generation reverses the dependency, producing soundtracks that match visual events, and dedicated synchronization mechanisms ensure temporal alignment across modalities.

Benchmarking frameworks and application-specific systems address evaluation and real-world deployment, while retrieval and cross-modal understanding methods explore how to leverage existing audiovisual data. Long-form generation and theoretical surveys round out the taxonomy by tackling scalability and providing conceptual overviews.

Within the unified joint architectures, a particularly active line of work centers on dual-branch diffusion transformers, where separate but interacting branches handle the audio and video streams. Joint Audio Video Diffusion[0] exemplifies this approach by employing parallel diffusion pathways with cross-modal attention to ensure frame-level synchronization. Nearby efforts such as AV DiT[14] and AV DiT Taming[27] explore similar dual-branch designs, experimenting with different attention mechanisms and training strategies to balance generation quality against computational cost. In contrast, works like Uniavgen[3] and SyncFlow[5] emphasize tighter integration or flow-based formulations, trading architectural simplicity for potentially stronger alignment guarantees. The main open questions revolve around how much cross-modal interaction is necessary during generation versus post-hoc alignment, and whether fully unified models can match the flexibility of cascaded pipelines without sacrificing synchronization fidelity.
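
To make the dual-branch pattern concrete, below is a minimal sketch of a single dual-branch transformer block with bidirectional cross-modal attention, in the spirit of the parallel-pathway designs surveyed above. Module names, dimensions, and the residual wiring are illustrative assumptions, not any cited paper's exact design.

```python
# Minimal sketch of a dual-branch diffusion-transformer block with
# bidirectional cross-modal attention. Illustrative only: module names,
# dimensions, and residual wiring are assumptions, not any paper's design.
import torch
import torch.nn as nn

class DualBranchBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Each modality keeps its own self-attention pathway ...
        self.video_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        # ... and exchanges information through cross-attention.
        self.a2v_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, video_tokens, audio_tokens):
        # Intra-modal self-attention with residual connections.
        v = video_tokens + self.video_self(video_tokens, video_tokens, video_tokens)[0]
        a = audio_tokens + self.audio_self(audio_tokens, audio_tokens, audio_tokens)[0]
        # Cross-modal exchange: each branch queries the other's tokens.
        v = v + self.a2v_cross(self.norm_v(v), a, a)[0]
        a = a + self.v2a_cross(self.norm_a(a), v, v)[0]
        return v, a

# Usage: a batch of 2 samples with 16 video tokens and 32 audio tokens.
video = torch.randn(2, 16, 512)
audio = torch.randn(2, 32, 512)
v_out, a_out = DualBranchBlock()(video, audio)
```

Real systems also add timestep conditioning, feed-forward sublayers, and text cross-attention; the point here is only the separate-but-interacting branch structure that defines this leaf of the taxonomy.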

Claimed Contributions

JavisDiT: Joint Audio-Video Diffusion Transformer with HiST-Sypo Estimator

The authors propose JavisDiT, a diffusion transformer architecture for joint audio-video generation that incorporates a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts global coarse-grained and fine-grained spatio-temporal priors from text prompts to guide precise synchronization between generated audio and video content.

10 retrieved papers
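
As a rough illustration of how a hierarchical prior estimator of this kind might be organized, the sketch below pools text-token embeddings into a global coarse-grained prior and uses learned queries to extract fine-grained spatial and temporal priors. All names, shapes, and design choices (mean pooling, a shared cross-attention layer) are assumptions inferred from the summary above, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HierarchicalPriorEstimator(nn.Module):
    """Hypothetical sketch of a hierarchical prior estimator: maps text-token
    embeddings to one global coarse-grained prior plus fine-grained spatial
    and temporal prior sequences. Shapes, pooling, and the shared attention
    layer are assumptions, not the authors' implementation."""

    def __init__(self, text_dim: int = 768, prior_dim: int = 256, n_fine: int = 8):
        super().__init__()
        self.global_proj = nn.Linear(text_dim, prior_dim)
        # Learned queries that cross-attend to the prompt to extract
        # fine-grained spatial and temporal priors.
        self.spatial_queries = nn.Parameter(torch.randn(n_fine, text_dim))
        self.temporal_queries = nn.Parameter(torch.randn(n_fine, text_dim))
        self.attn = nn.MultiheadAttention(text_dim, 8, batch_first=True)
        self.fine_proj = nn.Linear(text_dim, prior_dim)

    def forward(self, text_tokens):  # text_tokens: (batch, seq_len, text_dim)
        b = text_tokens.size(0)
        # Global coarse-grained prior: mean-pool the whole prompt embedding.
        global_prior = self.global_proj(text_tokens.mean(dim=1))
        # Fine-grained priors: learned queries attend over the prompt tokens.
        sq = self.spatial_queries.expand(b, -1, -1)
        tq = self.temporal_queries.expand(b, -1, -1)
        spatial = self.fine_proj(self.attn(sq, text_tokens, text_tokens)[0])
        temporal = self.fine_proj(self.attn(tq, text_tokens, text_tokens)[0])
        return global_prior, spatial, temporal

# Usage with dummy text features (e.g. from a frozen text encoder).
tokens = torch.randn(2, 24, 768)
g, s, t = HierarchicalPriorEstimator()(tokens)  # (2, 256), (2, 8, 256), (2, 8, 256)
```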
JavisBench: A challenging benchmark dataset for joint audio-video generation

The authors introduce JavisBench, a benchmark consisting of 10,140 high-quality text-captioned sounding videos spanning 5 dimensions and 19 scene categories. The dataset emphasizes complex multi-event scenarios with diverse spatial and temporal compositions to enable comprehensive evaluation of joint audio-video generation systems in real-world contexts.

10 retrieved papers
JavisScore: A robust metric for audio-video synchronization evaluation

The authors develop JavisScore, a new evaluation metric based on temporal-aware semantic alignment that measures spatio-temporal synchronization in diverse real-world scenarios. This metric addresses limitations of existing metrics such as AV-Align by using a windowed approach with ImageBind encoders to assess audio-visual alignment across video segments.

10 retrieved papers
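
The windowed-alignment idea described above can be sketched as follows: slide a fixed-size window over time, embed each visual and audio segment with a joint embedding model (the summary mentions ImageBind; here the encoders are abstracted as caller-supplied functions), and average per-window cosine similarities. This is a hypothetical reconstruction of the general approach, not the published JavisScore definition; window size, stride, and aggregation are guesses.

```python
import numpy as np

def windowed_av_alignment(video_frames, audio_chunks, embed_video, embed_audio,
                          window=16, stride=8):
    """Hypothetical windowed audio-visual alignment score.

    video_frames / audio_chunks: temporally aligned sequences of equal length.
    embed_video / embed_audio: caller-supplied segment encoders (e.g. thin
    wrappers around ImageBind) returning 1-D embeddings. Window size, stride,
    and cosine aggregation are assumptions, not the published definition.
    """
    scores = []
    for start in range(0, len(video_frames) - window + 1, stride):
        v = embed_video(video_frames[start:start + window])
        a = embed_audio(audio_chunks[start:start + window])
        # Cosine similarity between the two segment embeddings.
        cos = float(np.dot(v, a) / (np.linalg.norm(v) * np.linalg.norm(a) + 1e-8))
        scores.append(cos)
    # Averaging per-window scores penalizes local desynchronization even
    # when the clip-level semantics match globally.
    return float(np.mean(scores)) if scores else 0.0
```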

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

JavisDiT: Joint Audio-Video Diffusion Transformer with HiST-Sypo Estimator


Contribution

JavisBench: A challenging benchmark dataset for joint audio-video generation


Contribution

JavisScore: A robust metric for audio-video synchronization evaluation
