SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Video generation, Digital human, Human-centric dataset
Abstract:

The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual humans. To facilitate research in this emerging area, we present the SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch, and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR)-based video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data to serve as a benchmark (VidChatBench) for future work. Both the dataset and the corresponding data processing code will be publicly released.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SpeakerVid-5M, a large-scale dataset comprising over 8,743 hours of footage and more than 5.2 million video clips for audio-visual dyadic interactive virtual human generation. Within the taxonomy, it occupies the 'Datasets and Benchmarks' leaf, which currently contains no sibling papers. This positions the work in a notably sparse research direction: while the broader field includes 23 papers across 19 leaf nodes, the dataset-focused category stands alone, indicating limited prior infrastructure for large-scale training and evaluation in this domain.

The taxonomy reveals that neighboring branches—Multimodal Behavior Generation and Synthesis, Interaction Analysis, and Application Domains—collectively address generation methods, design frameworks, and deployment contexts. SpeakerVid-5M connects to these areas by providing the empirical foundation they require: generation methods like AV-Flow and MAViD Framework depend on annotated corpora, while application studies in social VR and telepresence need benchmarks to validate user experience. The dataset's dual structure (pre-training and SFT subsets) bridges the gap between data-driven realism and controllable model tuning, aligning with the field's tension between scale and interpretability.

Among 28 candidates examined, none clearly refute the three core contributions: the SpeakerVid-5M dataset (10 candidates, 0 refutable), the VidChatBench benchmark (8 candidates, 0 refutable), and the autoregressive baseline method (10 candidates, 0 refutable). The GeSTICS Corpus, the only other dataset-focused work in the taxonomy, targets gesture analysis rather than large-scale generation training. This limited overlap suggests that within the examined scope—top-K semantic matches and citation expansion—the paper addresses a distinct gap in dataset availability for dyadic interactive virtual human tasks.

Based on the 28-candidate search, the work appears to occupy a relatively underserved niche. The absence of sibling papers in the Datasets and Benchmarks leaf and the lack of refutable prior work among examined candidates indicate that large-scale, structured datasets for this specific task are scarce. However, this assessment is constrained by the search scope and does not preclude the existence of related resources in adjacent domains or unpublished efforts.

Taxonomy

Core-task Taxonomy Papers: 23
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 0

Research Landscape Overview

Core task: audio-visual dyadic interactive virtual human generation. This field aims to create believable virtual agents that can engage in face-to-face conversations by synthesizing coordinated speech, facial expressions, gestures, and responsive behaviors. The taxonomy reveals four main branches that collectively address the technical, empirical, and applied dimensions of this challenge.

Multimodal Behavior Generation and Synthesis focuses on computational methods for producing realistic audio-visual outputs—ranging from probabilistic models of facial gestures (Probabilistic Facial Gestures[2]) to recent flow-based and neural approaches (AV-Flow[10], MAViD Framework[11])—that capture the temporal dynamics of human interaction. Datasets and Benchmarks provide the empirical foundation, offering annotated corpora of dyadic exchanges (GeSTICS Corpus[6]) and evaluation protocols that ground model development. Interaction Analysis and Design Frameworks examine the structure of turn-taking, backchanneling (Predicting Listener Backchannels[13]), and conversational timing (Avatar Dialog Turn-Yielding[20]), while Application Domains and User Studies explore deployment contexts such as social VR (Social VR Design[4]), telepresence, and user-aware systems (User-Aware Virtual Agents[1]).

Recent work has intensified around end-to-end generation pipelines that unify audio and visual modalities, with some studies emphasizing real-time responsiveness (Real-Time Motion Augmentation[16], X-streamer[3]) and others prioritizing expressive diversity or style adaptation (Style Matching Agents[19]). A key tension lies between data-driven realism—requiring large, high-quality corpora—and the need for controllable, interpretable models that designers can tune for specific interaction scenarios.

SpeakerVid[0] sits squarely within the Datasets and Benchmarks branch, contributing a resource that supports training and evaluation of dyadic generation systems. Its emphasis on structured data collection complements generation-focused works like AV-Flow[10] and MAViD Framework[11], which rely on such benchmarks to validate their synthesis quality. By providing a curated dataset, SpeakerVid[0] addresses a foundational bottleneck and enables more rigorous comparisons across the diverse generation methods emerging in this rapidly evolving landscape.

Claimed Contributions

SpeakerVid-5M dataset for audio-visual dyadic interactive virtual human generation

The authors introduce SpeakerVid-5M, a large-scale dataset containing over 8,743 hours and 5.2 million video clips specifically designed for audio-visual dyadic interactive virtual human generation. The dataset includes rich multi-modal annotations and is structured into different interaction types and quality tiers to support various 2D virtual human tasks.

10 retrieved papers
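
To make the dataset's dual structure concrete, the sketch below shows one way clip metadata organized by interaction branch and quality tier could be filtered. This is a minimal illustration, not the authors' released tooling: the record fields (clip_id, branch, tier, duration_s) and the branch/tier string values are assumed placeholders rather than the paper's actual schema.

```python
# Minimal sketch (not the authors' released loader): filtering SpeakerVid-5M-style
# metadata by interaction branch and quality tier. Field names such as "branch",
# "tier", and "duration_s" are hypothetical placeholders; the paper's exact schema
# is not specified in this report.
from dataclasses import dataclass
from typing import List

@dataclass
class ClipRecord:
    clip_id: str
    branch: str        # "dialogue" | "single" | "listening" | "multi_turn"
    tier: str          # "pretrain" | "sft" (large-scale vs. curated high-quality)
    duration_s: float  # clip length in seconds

def select_clips(records: List[ClipRecord], branch: str, tier: str) -> List[ClipRecord]:
    """Return clips from one interaction branch and one quality tier."""
    return [r for r in records if r.branch == branch and r.tier == tier]

# Example: assemble a curated SFT split of dyadic dialogue clips.
records = [
    ClipRecord("clip_000001", "dialogue", "sft", 12.4),
    ClipRecord("clip_000002", "listening", "pretrain", 8.1),
]
sft_dialogue = select_clips(records, branch="dialogue", tier="sft")
print(len(sft_dialogue), "dialogue clips in the SFT tier")
```

In this framing, the pre-training and SFT subsets described in the contribution map onto the tier field, while the four interaction scenarios map onto the branch values.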
VidChatBench benchmark for evaluating dyadic interactive virtual human models

The authors develop VidChatBench, a benchmark consisting of 500 test samples with unseen speaker IDs and a comprehensive set of evaluation metrics. The benchmark assesses model performance across five dimensions: video quality, identity preservation, dialogue coherence, audio-visual consistency, and emotional alignment.

8 retrieved papers
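
The sketch below illustrates, under stated assumptions, how per-sample scores along the five reported dimensions could be aggregated into benchmark-level results. VidChatBench's concrete metric implementations and any weighting scheme are not given in this report, so the dimension keys and the uniform per-dimension averaging are illustrative assumptions, not the benchmark's actual protocol.

```python
# Illustrative sketch only: averaging per-sample scores along VidChatBench's five
# reported dimensions. The benchmark's actual metrics and any weighting scheme are
# not specified here, so uniform per-dimension means are an assumption.
from statistics import mean
from typing import Dict, List

DIMENSIONS = [
    "video_quality",
    "identity_preservation",
    "dialogue_coherence",
    "audio_visual_consistency",
    "emotional_alignment",
]

def aggregate(per_sample_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Mean score per dimension over the (e.g., 500-sample) test set."""
    return {d: mean(s[d] for s in per_sample_scores) for d in DIMENSIONS}

# Example with two dummy samples (a real run would cover all 500 unseen-ID samples).
scores = [
    {d: 0.8 for d in DIMENSIONS},
    {d: 0.6 for d in DIMENSIONS},
]
print(aggregate(scores))
```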
Autoregressive baseline method for audio-visual dyadic human generation

The authors propose an autoregressive framework that jointly generates audio and video responses based on multimodal input. The method incorporates Qwen2.5-Omni for multimodal understanding, next-chunk prediction for token generation, and a diffusion MLP for enhanced visual realism, trained progressively across three stages.

10 retrieved papers
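
The loop below is a schematic sketch of the described next-chunk autoregressive generation with a diffusion-based visual refinement step. It is not the authors' implementation: the backbone call stands in for Qwen2.5-Omni's multimodal prediction, the token layout is invented for illustration, and the diffusion-MLP refinement is reduced to a stub, so only the control flow (predict a chunk, refine its visual tokens, append it to the context) reflects the contribution's high-level description.

```python
# Schematic sketch of a next-chunk autoregressive loop with a diffusion refinement
# head. The backbone, tokenization, and refinement below are stand-in stubs, not
# Qwen2.5-Omni's or the authors' actual interfaces.
from typing import List, Tuple
import random

Chunk = Tuple[List[int], List[int]]  # (audio_tokens, video_tokens) for one chunk

def backbone_next_chunk(context: List[Chunk]) -> Chunk:
    """Stub for the multimodal backbone's next-chunk prediction."""
    return ([random.randrange(1024) for _ in range(8)],
            [random.randrange(8192) for _ in range(16)])

def diffusion_mlp_refine(video_tokens: List[int]) -> List[int]:
    """Stub for the diffusion-MLP visual refinement step (identity here)."""
    return video_tokens

def generate_response(prompt_chunks: List[Chunk], n_chunks: int) -> List[Chunk]:
    """Autoregressively emit audio-video chunks conditioned on the running context."""
    context = list(prompt_chunks)
    out: List[Chunk] = []
    for _ in range(n_chunks):
        audio, video = backbone_next_chunk(context)
        video = diffusion_mlp_refine(video)
        chunk = (audio, video)
        out.append(chunk)
        context.append(chunk)  # feed the new chunk back as context
    return out

response = generate_response(prompt_chunks=[([1, 2], [3, 4])], n_chunks=3)
print(len(response), "chunks generated")
```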

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SpeakerVid-5M dataset for audio-visual dyadic interactive virtual human generation


Contribution

VidChatBench benchmark for evaluating dyadic interactive virtual human models


Contribution

Autoregressive baseline method for audio-visual dyadic human generation
