Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: Diffusion Transformer, Joint Audio-Video Generation, Synchronization
Abstract:

This paper introduces JavisDiT, a novel Joint Audio-Video Diffusion Transformer for synchronized audio-video generation (JAVG). Built on the Diffusion Transformer (DiT) architecture, JavisDiT generates high-quality audio and video content simultaneously from open-ended user prompts within a unified framework. To ensure audio-video synchronization, we introduce a fine-grained spatio-temporal alignment mechanism through a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator, which extracts both global and fine-grained spatio-temporal priors to guide synchronization between the visual and auditory components. We further propose a new benchmark, JavisBench, which consists of 10,140 high-quality text-captioned sounding videos and focuses on synchronization evaluation in diverse and complex real-world scenarios, together with a robust metric for measuring the synchrony between generated audio-video pairs on such content. Experimental results demonstrate that JavisDiT significantly outperforms existing methods, delivering both high-quality generation and precise synchronization and setting a new standard for JAVG tasks.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholarly search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

JavisDiT introduces a joint audio-video diffusion transformer featuring a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator that extracts multi-level spatio-temporal priors to guide synchronization. The paper resides in the Dual-Branch Diffusion Transformer Architectures leaf, which contains eight papers and is thus a moderately populated research direction within the broader unified joint generation category. This leaf focuses on parallel diffusion pathways with cross-modal interaction, distinguishing itself from single-backbone shared models and from multi-stage cascaded pipelines that handle audio and video sequentially.

The taxonomy tree reveals that Dual-Branch Diffusion Transformer Architectures sits alongside Shared-Backbone Unified Models (four papers) and Expert Model Fusion (three papers) within the parent category of Unified Joint Audio-Video Generation Architectures. Neighboring branches include Multi-Stage Cascaded Pipelines and Speech-Driven Talking Head Synthesis, which address related but distinct problems. The scope note clarifies that dual-branch methods employ separate but interacting branches, while shared-backbone approaches use a single architecture with modality-specific adapters. JavisDiT's hierarchical prior estimator differentiates it from sibling works that rely primarily on cross-attention or flow-based formulations for alignment.

Among 30 candidates examined across three contributions, no clearly refutable prior work was identified. The JavisDiT architecture contribution examined 10 candidates with none refuting the hierarchical spatio-temporal prior mechanism. The JavisBench dataset contribution also examined 10 candidates, finding no existing benchmark specifically targeting synchronization evaluation in diverse real-world scenarios at comparable scale. The JavisScore metric contribution similarly examined 10 candidates without encountering a prior metric explicitly designed for measuring real-world audio-video synchrony. These statistics suggest that within the limited search scope, the paper's specific combination of hierarchical prior extraction, benchmark design, and evaluation metric appears distinct from examined prior work.

Given the moderately populated leaf and the limited 30-candidate search, the work appears to introduce novel components, particularly the HiST-Sypo mechanism and the synchronization-focused evaluation infrastructure, within an active research direction. However, the analysis does not cover the full corpus of dual-branch diffusion architectures or exhaustively compare against all synchronization mechanisms in neighboring leaves. The absence of refutable candidates reflects the search scope rather than definitive field-wide novelty, and a broader literature review might reveal closer antecedents or parallel developments in related branches.

Taxonomy

Core-task taxonomy papers: 50
Claimed contributions: 3
Contribution candidate papers compared: 30
Refutable papers: 0

Research Landscape Overview

Core task: synchronized audio-video generation from text prompts. The field has organized itself around several complementary strategies for producing coherent audiovisual content. Unified joint architectures aim to generate both modalities simultaneously within a single model, often leveraging dual-branch diffusion transformers or shared latent representations to maintain tight synchronization. Multi-stage and cascaded pipelines decompose the problem into sequential steps (first generating one modality, then conditioning the other), while speech-driven talking head synthesis focuses on the specialized case of animating human faces from audio. Video-conditioned audio generation reverses the dependency, producing soundtracks that match visual events, and dedicated synchronization mechanisms ensure temporal alignment across modalities.

Benchmarking frameworks and application-specific systems address evaluation and real-world deployment, while retrieval and cross-modal understanding methods explore how to leverage existing audiovisual data. Long-form generation and theoretical surveys round out the taxonomy by tackling scalability and providing conceptual overviews.

Within the unified joint architectures, a particularly active line of work centers on dual-branch diffusion transformers, where separate but interacting branches handle the audio and video streams. Joint Audio Video Diffusion[0] exemplifies this approach by employing parallel diffusion pathways with cross-modal attention to ensure frame-level synchronization. Nearby efforts such as AV DiT[14] and AV DiT Taming[27] explore similar dual-branch designs, experimenting with different attention mechanisms and training strategies to balance generation quality against computational cost. In contrast, works like Uniavgen[3] and SyncFlow[5] emphasize tighter integration or flow-based formulations, trading architectural simplicity for potentially stronger alignment guarantees. The main open questions revolve around how much cross-modal interaction is necessary during generation versus post-hoc alignment, and whether fully unified models can match the flexibility of cascaded pipelines without sacrificing synchronization fidelity.
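
To make the dual-branch pattern concrete, below is a minimal sketch of a single dual-branch transformer block with bidirectional cross-modal attention, in the spirit of the parallel-pathway designs surveyed above. Module names, dimensions, and the residual wiring are illustrative assumptions, not any cited paper's exact design.

```python
# Minimal sketch of a dual-branch diffusion-transformer block with
# bidirectional cross-modal attention. Illustrative only: module names,
# dimensions, and residual wiring are assumptions, not any paper's design.
import torch
import torch.nn as nn

class DualBranchBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Each modality keeps its own self-attention pathway ...
        self.video_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        # ... and exchanges information through cross-attention.
        self.a2v_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, video_tokens, audio_tokens):
        # Intra-modal self-attention with residual connections.
        v = video_tokens + self.video_self(video_tokens, video_tokens, video_tokens)[0]
        a = audio_tokens + self.audio_self(audio_tokens, audio_tokens, audio_tokens)[0]
        # Cross-modal exchange: each branch queries the other's tokens.
        v = v + self.a2v_cross(self.norm_v(v), a, a)[0]
        a = a + self.v2a_cross(self.norm_a(a), v, v)[0]
        return v, a

# Usage: a batch of 2 samples with 16 video tokens and 32 audio tokens.
video = torch.randn(2, 16, 512)
audio = torch.randn(2, 32, 512)
v_out, a_out = DualBranchBlock()(video, audio)
```

Real systems also add timestep conditioning, feed-forward sublayers, and text cross-attention; the point here is only the separate-but-interacting branch structure that defines this leaf of the taxonomy.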

Claimed Contributions

JavisDiT: Joint Audio-Video Diffusion Transformer with HiST-Sypo Estimator

The authors propose JavisDiT, a diffusion transformer architecture for joint audio-video generation that incorporates a Hierarchical Spatial-Temporal Synchronized Prior (HiST-Sypo) Estimator. This module extracts global coarse-grained and fine-grained spatio-temporal priors from text prompts to guide precise synchronization between generated audio and video content.

10 retrieved papers
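
As a rough illustration of how a hierarchical prior estimator of this kind might be organized, the sketch below pools text-token embeddings into a global coarse-grained prior and uses learned queries to extract fine-grained spatial and temporal priors. All names, shapes, and design choices (mean pooling, a shared cross-attention layer) are assumptions inferred from the summary above, not the authors' implementation.

```python
import torch
import torch.nn as nn

class HierarchicalPriorEstimator(nn.Module):
    """Hypothetical sketch of a hierarchical prior estimator: maps text-token
    embeddings to one global coarse-grained prior plus fine-grained spatial
    and temporal prior sequences. Shapes, pooling, and the shared attention
    layer are assumptions, not the authors' implementation."""

    def __init__(self, text_dim: int = 768, prior_dim: int = 256, n_fine: int = 8):
        super().__init__()
        self.global_proj = nn.Linear(text_dim, prior_dim)
        # Learned queries that cross-attend to the prompt to extract
        # fine-grained spatial and temporal priors.
        self.spatial_queries = nn.Parameter(torch.randn(n_fine, text_dim))
        self.temporal_queries = nn.Parameter(torch.randn(n_fine, text_dim))
        self.attn = nn.MultiheadAttention(text_dim, 8, batch_first=True)
        self.fine_proj = nn.Linear(text_dim, prior_dim)

    def forward(self, text_tokens):  # text_tokens: (batch, seq_len, text_dim)
        b = text_tokens.size(0)
        # Global coarse-grained prior: mean-pool the whole prompt embedding.
        global_prior = self.global_proj(text_tokens.mean(dim=1))
        # Fine-grained priors: learned queries attend over the prompt tokens.
        sq = self.spatial_queries.expand(b, -1, -1)
        tq = self.temporal_queries.expand(b, -1, -1)
        spatial = self.fine_proj(self.attn(sq, text_tokens, text_tokens)[0])
        temporal = self.fine_proj(self.attn(tq, text_tokens, text_tokens)[0])
        return global_prior, spatial, temporal

# Usage with dummy text features (e.g. from a frozen text encoder).
tokens = torch.randn(2, 24, 768)
g, s, t = HierarchicalPriorEstimator()(tokens)  # (2, 256), (2, 8, 256), (2, 8, 256)
```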
JavisBench: A challenging benchmark dataset for joint audio-video generation

The authors introduce JavisBench, a benchmark consisting of 10,140 high-quality text-captioned sounding videos spanning 5 dimensions and 19 scene categories. The dataset emphasizes complex multi-event scenarios with diverse spatial and temporal compositions to enable comprehensive evaluation of joint audio-video generation systems in real-world contexts.

10 retrieved papers
JavisScore: A robust metric for audio-video synchronization evaluation

The authors develop JavisScore, a new evaluation metric based on temporal-aware semantic alignment that measures spatio-temporal synchronization in diverse real-world scenarios. This metric addresses limitations of existing metrics such as AV-Align by using a windowed approach with ImageBind encoders to assess audio-visual alignment across video segments.

10 retrieved papers
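
The windowed-alignment idea described above can be sketched as follows: slide a fixed-size window over time, embed each visual and audio segment with a joint embedding model (the summary mentions ImageBind; here the encoders are abstracted as caller-supplied functions), and average per-window cosine similarities. This is a hypothetical reconstruction of the general approach, not the published JavisScore definition; window size, stride, and aggregation are guesses.

```python
import numpy as np

def windowed_av_alignment(video_frames, audio_chunks, embed_video, embed_audio,
                          window=16, stride=8):
    """Hypothetical windowed audio-visual alignment score.

    video_frames / audio_chunks: temporally aligned sequences of equal length.
    embed_video / embed_audio: caller-supplied segment encoders (e.g. thin
    wrappers around ImageBind) returning 1-D embeddings. Window size, stride,
    and cosine aggregation are assumptions, not the published definition.
    """
    scores = []
    for start in range(0, len(video_frames) - window + 1, stride):
        v = embed_video(video_frames[start:start + window])
        a = embed_audio(audio_chunks[start:start + window])
        # Cosine similarity between the two segment embeddings.
        cos = float(np.dot(v, a) / (np.linalg.norm(v) * np.linalg.norm(a) + 1e-8))
        scores.append(cos)
    # Averaging per-window scores penalizes local desynchronization even
    # when the clip-level semantics match globally.
    return float(np.mean(scores)) if scores else 0.0
```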

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

JavisDiT: Joint Audio-Video Diffusion Transformer with HiST-Sypo Estimator


Contribution

JavisBench: A challenging benchmark dataset for joint audio-video generation


Contribution

JavisScore: A robust metric for audio-video synchronization evaluation
