SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation
Overview
Overall Novelty Assessment
The paper introduces SpeakerVid-5M, a large-scale dataset comprising over 8,743 hours of footage across 5.2 million video clips for audio-visual dyadic interactive virtual human generation. Within the taxonomy, it occupies the 'Datasets and Benchmarks' leaf, which currently contains no other papers. This positions the work in a notably sparse research direction: while the broader field includes 23 papers across 19 leaf nodes, the dataset-focused category stands alone, indicating limited prior infrastructure for large-scale training and evaluation in this domain.
The taxonomy reveals that neighboring branches—Multimodal Behavior Generation and Synthesis, Interaction Analysis, and Application Domains—collectively address generation methods, design frameworks, and deployment contexts. SpeakerVid-5M connects to these areas by providing the empirical foundation they require: generation methods like AV-Flow and MAViD Framework depend on annotated corpora, while application studies in social VR and telepresence need benchmarks to validate user experience. The dataset's dual structure (pre-training and SFT subsets) bridges the gap between data-driven realism and controllable model tuning, aligning with the field's tension between scale and interpretability.
Among 28 candidates examined, none clearly refute the three core contributions: the SpeakerVid-5M dataset (10 candidates, 0 refutable), the VidChatBench benchmark (8 candidates, 0 refutable), and the autoregressive baseline method (10 candidates, 0 refutable). The GeSTICS Corpus, the only other dataset-focused work in the taxonomy, targets gesture analysis rather than large-scale generation training. This limited overlap suggests that within the examined scope—top-K semantic matches and citation expansion—the paper addresses a distinct gap in dataset availability for dyadic interactive virtual human tasks.
Based on the 28-candidate search, the work appears to occupy a relatively underserved niche. The absence of sibling papers in the Datasets and Benchmarks leaf and the lack of refutable prior work among examined candidates indicate that large-scale, structured datasets for this specific task are scarce. However, this assessment is constrained by the search scope and does not preclude the existence of related resources in adjacent domains or unpublished efforts.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors introduce SpeakerVid-5M, a large-scale dataset containing over 8,743 hours of footage across 5.2 million video clips, specifically designed for audio-visual dyadic interactive virtual human generation. The dataset includes rich multi-modal annotations and is structured into different interaction types and quality tiers to support various 2D virtual human tasks.
The authors develop VidChatBench, a benchmark consisting of 500 test samples with unseen speaker IDs and a comprehensive set of evaluation metrics. The benchmark assesses model performance across five dimensions: video quality, identity preservation, dialogue coherence, audio-visual consistency, and emotional alignment.
The authors propose an autoregressive framework that jointly generates audio and video responses based on multimodal input. The method incorporates Qwen2.5-Omni for multimodal understanding, next-chunk prediction for token generation, and a diffusion MLP for enhanced visual realism, trained progressively across three stages.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
SpeakerVid-5M dataset for audio-visual dyadic interactive virtual human generation
The authors introduce SpeakerVid-5M, a large-scale dataset containing over 8,743 hours of footage across 5.2 million video clips, specifically designed for audio-visual dyadic interactive virtual human generation. The dataset includes rich multi-modal annotations and is structured into different interaction types and quality tiers to support various 2D virtual human tasks.
[6] GeSTICS: A Multimodal Corpus for Studying Gesture Synthesis in Two-party Interactions with Contextualized Speech
[7] INFP: Audio-Driven Interactive Head Generation in Dyadic Conversations
[24] Beyond Talking – Generating Holistic 3D Human Dyadic Motion for Communication
[39] Interactive Conversational Head Generation
[40] Creating Multimodal Interactive Digital Twin Characters from Videos: A Dataset and Baseline
[41] Slovak Dialogue Corpus with Backchannel Annotation
[42] The USC CreativeIT Database of Multimodal Dyadic Interactions: From Speech and Full Body Motion Capture to Continuous Emotional Annotations
[43] The ALICO Corpus: Analysing the Active Listener
[44] CNAMD Corpus: A Chinese Natural Audiovisual Multimodal Database of Conversations for Social Interactive Agents
[45] MSP-AVATAR Corpus: Motion Capture Recordings to Study the Role of Discourse Functions in the Design of Intelligent Virtual Agents
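To make the dataset structure described for this contribution concrete (interaction types, quality tiers, and the pre-training vs. SFT split), the following is a minimal Python sketch of how such a corpus index might be filtered. All field names, tier conventions, and interaction labels are assumptions for illustration only; they are not the released SpeakerVid-5M schema.

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class ClipRecord:
    """One SpeakerVid-5M-style entry; field names are illustrative, not the released schema."""
    clip_id: str
    duration_s: float
    interaction_type: str   # e.g. "dialogue", "listening", "monologue" (hypothetical labels)
    quality_tier: int       # lower = higher quality in this sketch
    annotations: dict       # multi-modal annotations (transcripts, keypoints, ...)

def select_sft_subset(records: Iterable[ClipRecord],
                      max_tier: int = 1,
                      allowed_types: tuple = ("dialogue",)) -> List[ClipRecord]:
    """Keep only high-quality clips of the desired interaction type for supervised fine-tuning."""
    return [r for r in records
            if r.quality_tier <= max_tier and r.interaction_type in allowed_types]

if __name__ == "__main__":
    demo = [
        ClipRecord("clip_0001", 12.4, "dialogue", 0, {"transcript": "..."}),
        ClipRecord("clip_0002", 8.1, "monologue", 2, {"transcript": "..."}),
    ]
    print(len(select_sft_subset(demo)))  # -> 1 (only the high-quality dialogue clip survives)
```

The point of the sketch is only the tiered filtering pattern; in practice the SFT subset would be drawn directly from the dataset's own quality annotations and split definitions.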
VidChatBench benchmark for evaluating dyadic interactive virtual human models
The authors develop VidChatBench, a benchmark consisting of 500 test samples with unseen speaker IDs and a comprehensive set of evaluation metrics. The benchmark assesses model performance across five dimensions: video quality, identity preservation, dialogue coherence, audio-visual consistency, and emotional alignment.
[31] Dyadformer: A Multi-modal Transformer for Long-range Modeling of Dyadic Interactions
[32] The Importance of Multimodal Emotion Conditioning and Affect Consistency for Embodied Conversational Agents
[33] The Interaction Behavior Dataset: A Dataset of Smiles and Laughs in Dyadic Interaction
[34] Embodied Conversational Agents: Deep Learning Based Multimodal Sentiment Analysis
[35] Socially-aware Virtual Agents: Automatically Assessing Dyadic Rapport from Temporal Patterns of Behavior
[36] Evaluating Multimodal Interactive Agents
[37] Virtual Social Environments as a Tool for Psychological Assessment: Dynamics of Interaction with a Virtual Spouse
[38] A Multimodal Approach to Improve Performance Evaluation of Call Center Agent
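As an illustration of how VidChatBench's five evaluation dimensions could be rolled up into a single report over the 500 test samples, here is a minimal aggregation sketch. The dimension names follow the summary above, but the 0-1 scoring scale, uniform weighting, and function names are assumptions rather than the benchmark's official protocol.

```python
from statistics import mean
from typing import Dict, List

# The five VidChatBench dimensions named in the contribution summary; the scoring
# scale and equal weighting here are illustrative stand-ins, not the official metrics.
DIMENSIONS = (
    "video_quality",
    "identity_preservation",
    "dialogue_coherence",
    "audio_visual_consistency",
    "emotional_alignment",
)

def aggregate_scores(per_sample: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each dimension over the benchmark's test samples (e.g. 500 unseen-ID clips)."""
    report = {dim: mean(sample[dim] for sample in per_sample) for dim in DIMENSIONS}
    report["overall"] = mean(report[dim] for dim in DIMENSIONS)
    return report

if __name__ == "__main__":
    fake_results = [{d: 0.8 for d in DIMENSIONS}, {d: 0.6 for d in DIMENSIONS}]
    print(aggregate_scores(fake_results))  # each dimension averages to ~0.7
```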
Autoregressive baseline method for audio-visual dyadic human generation
The authors propose an autoregressive framework that jointly generates audio and video responses based on multimodal input. The method incorporates Qwen2.5-Omni for multimodal understanding, next-chunk prediction for token generation, and a diffusion MLP for enhanced visual realism, trained progressively across three stages.
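To clarify the next-chunk prediction pattern, the sketch below shows a generic chunk-wise autoregressive loop in Python. The backbone and refinement callables are abstract stand-ins: the paper's actual pipeline builds on Qwen2.5-Omni and a diffusion MLP, whose APIs are not reproduced here, so all function names and the toy token format are hypothetical.

```python
from typing import Callable, List, Sequence

def generate_response(
    context_tokens: Sequence[int],
    predict_next_chunk: Callable[[Sequence[int]], List[int]],
    refine_visual: Callable[[List[int]], List[int]],
    num_chunks: int = 8,
) -> List[int]:
    """Chunk-wise autoregressive loop: each step predicts the next block of
    interleaved audio/visual tokens conditioned on everything generated so far,
    then passes it through a refinement head (standing in for a diffusion MLP)."""
    tokens = list(context_tokens)
    for _ in range(num_chunks):
        chunk = predict_next_chunk(tokens)   # backbone forward pass (e.g. an omni-modal LLM)
        chunk = refine_visual(chunk)         # sharpen visual tokens before they re-enter the context
        tokens.extend(chunk)
    return tokens

if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    def dummy_backbone(ctx: Sequence[int]) -> List[int]:
        return [len(ctx) % 100 + i for i in range(4)]

    def identity_refiner(chunk: List[int]) -> List[int]:
        return chunk

    print(generate_response([1, 2, 3], dummy_backbone, identity_refiner, num_chunks=2))
```

The loop captures the key design choice implied by next-chunk prediction: each generated chunk is appended to the context, so later audio-visual output is conditioned on both the interlocutor's input and the model's own prior responses.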