SpeakerVid-5M: A Large-Scale High-Quality Dataset for Audio-Visual Dyadic Interactive Human Generation

ICLR 2026 Conference Submission, Anonymous Authors
Keywords: Video generation, Digital human, Human-centric dataset
Abstract:

The rapid development of large-scale models has catalyzed significant breakthroughs in the digital human domain. These advanced methodologies offer high-fidelity solutions for avatar driving and rendering, leading academia to focus on the next major challenge: audio-visual dyadic interactive virtual humans. To facilitate research in this emerging area, we present the SpeakerVid-5M dataset, the first large-scale, high-quality dataset designed for audio-visual dyadic interactive virtual human generation. Totaling over 8,743 hours, SpeakerVid-5M contains more than 5.2 million video clips of human portraits. It covers diverse scales and interaction types, including monadic talking, listening, and dyadic conversations. Crucially, the dataset is structured along two key dimensions: interaction type and data quality. First, it is categorized into four types (dialogue branch, single branch, listening branch, and multi-turn branch) based on the interaction scenario. Second, it is stratified into a large-scale pre-training subset and a curated, high-quality subset for Supervised Fine-Tuning (SFT). This dual structure accommodates a wide array of 2D virtual human tasks. In addition, we provide an autoregressive (AR)-based video chat baseline trained on this data, accompanied by a dedicated set of metrics and test data to serve as a benchmark (VidChatBench) for future work. Both the dataset and the corresponding data processing code will be publicly released.

Disclaimer
This report is AI-GENERATED using Large Language Models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies POTENTIAL overlaps and novel directions, ITS COVERAGE IS NOT EXHAUSTIVE AND JUDGMENTS ARE APPROXIMATE. These results are intended to assist human reviewers and SHOULD NOT be relied upon as a definitive verdict on novelty.
NOTE that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces SpeakerVid-5M, a large-scale dataset comprising over 8,743 hours of footage and more than 5.2 million video clips for audio-visual dyadic interactive virtual human generation. Within the taxonomy, it occupies the 'Datasets and Benchmarks' leaf, which currently contains no sibling papers. This positions the work in a notably sparse research direction: while the broader field includes 23 papers across 19 leaf nodes, the dataset-focused category stands alone, indicating limited prior infrastructure for large-scale training and evaluation in this domain.

The taxonomy reveals that neighboring branches—Multimodal Behavior Generation and Synthesis, Interaction Analysis, and Application Domains—collectively address generation methods, design frameworks, and deployment contexts. SpeakerVid-5M connects to these areas by providing the empirical foundation they require: generation methods like AV-Flow and MAViD Framework depend on annotated corpora, while application studies in social VR and telepresence need benchmarks to validate user experience. The dataset's dual structure (pre-training and SFT subsets) bridges the gap between data-driven realism and controllable model tuning, aligning with the field's tension between scale and interpretability.

Among 28 candidates examined, none clearly refute the three core contributions: the SpeakerVid-5M dataset (10 candidates, 0 refutable), the VidChatBench benchmark (8 candidates, 0 refutable), and the autoregressive baseline method (10 candidates, 0 refutable). The GeSTICS Corpus, the only other dataset-focused work in the taxonomy, targets gesture analysis rather than large-scale generation training. This limited overlap suggests that within the examined scope—top-K semantic matches and citation expansion—the paper addresses a distinct gap in dataset availability for dyadic interactive virtual human tasks.

Based on the 28-candidate search, the work appears to occupy a relatively underserved niche. The absence of sibling papers in the Datasets and Benchmarks leaf and the lack of refutable prior work among examined candidates indicate that large-scale, structured datasets for this specific task are scarce. However, this assessment is constrained by the search scope and does not preclude the existence of related resources in adjacent domains or unpublished efforts.

Taxonomy

Core-task Taxonomy Papers: 23
Claimed Contributions: 3
Contribution Candidate Papers Compared: 28
Refutable Papers: 0

Research Landscape Overview

Core task: audio-visual dyadic interactive virtual human generation. This field aims to create believable virtual agents that can engage in face-to-face conversations by synthesizing coordinated speech, facial expressions, gestures, and responsive behaviors. The taxonomy reveals four main branches that collectively address the technical, empirical, and applied dimensions of this challenge.

Multimodal Behavior Generation and Synthesis focuses on computational methods for producing realistic audio-visual outputs—ranging from probabilistic models of facial gestures (Probabilistic Facial Gestures[2]) to recent flow-based and neural approaches (AV-Flow[10], MAViD Framework[11])—that capture the temporal dynamics of human interaction. Datasets and Benchmarks provide the empirical foundation, offering annotated corpora of dyadic exchanges (GeSTICS Corpus[6]) and evaluation protocols that ground model development. Interaction Analysis and Design Frameworks examine the structure of turn-taking, backchanneling (Predicting Listener Backchannels[13]), and conversational timing (Avatar Dialog Turn-Yielding[20]), while Application Domains and User Studies explore deployment contexts such as social VR (Social VR Design[4]), telepresence, and user-aware systems (User-Aware Virtual Agents[1]).

Recent work has intensified around end-to-end generation pipelines that unify audio and visual modalities, with some studies emphasizing real-time responsiveness (Real-Time Motion Augmentation[16], X-streamer[3]) and others prioritizing expressive diversity or style adaptation (Style Matching Agents[19]). A key tension lies between data-driven realism—requiring large, high-quality corpora—and the need for controllable, interpretable models that designers can tune for specific interaction scenarios.

SpeakerVid[0] sits squarely within the Datasets and Benchmarks branch, contributing a resource that supports training and evaluation of dyadic generation systems. Its emphasis on structured data collection complements generation-focused works like AV-Flow[10] and MAViD Framework[11], which rely on such benchmarks to validate their synthesis quality. By providing a curated dataset, SpeakerVid[0] addresses a foundational bottleneck and enables more rigorous comparisons across the diverse generation methods emerging in this rapidly evolving landscape.

Claimed Contributions

SpeakerVid-5M dataset for audio-visual dyadic interactive virtual human generation

The authors introduce SpeakerVid-5M, a large-scale dataset containing over 8,743 hours and 5.2 million video clips specifically designed for audio-visual dyadic interactive virtual human generation. The dataset includes rich multi-modal annotations and is structured into different interaction types and quality tiers to support various 2D virtual human tasks.

10 retrieved papers
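
To make the dataset's dual structure concrete, the sketch below shows one way clip metadata organized by interaction branch and quality tier could be filtered. This is a minimal illustration, not the authors' released tooling: the record fields (clip_id, branch, tier, duration_s) and the branch/tier string values are assumed placeholders rather than the paper's actual schema.

```python
# Minimal sketch (not the authors' released loader): filtering SpeakerVid-5M-style
# metadata by interaction branch and quality tier. Field names such as "branch",
# "tier", and "duration_s" are hypothetical placeholders; the paper's exact schema
# is not specified in this report.
from dataclasses import dataclass
from typing import List

@dataclass
class ClipRecord:
    clip_id: str
    branch: str        # "dialogue" | "single" | "listening" | "multi_turn"
    tier: str          # "pretrain" | "sft" (large-scale vs. curated high-quality)
    duration_s: float  # clip length in seconds

def select_clips(records: List[ClipRecord], branch: str, tier: str) -> List[ClipRecord]:
    """Return clips from one interaction branch and one quality tier."""
    return [r for r in records if r.branch == branch and r.tier == tier]

# Example: assemble a curated SFT split of dyadic dialogue clips.
records = [
    ClipRecord("clip_000001", "dialogue", "sft", 12.4),
    ClipRecord("clip_000002", "listening", "pretrain", 8.1),
]
sft_dialogue = select_clips(records, branch="dialogue", tier="sft")
print(len(sft_dialogue), "dialogue clips in the SFT tier")
```

In this framing, the pre-training and SFT subsets described in the contribution map onto the tier field, while the four interaction scenarios map onto the branch values.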
VidChatBench benchmark for evaluating dyadic interactive virtual human models

The authors develop VidChatBench, a benchmark consisting of 500 test samples with unseen speaker IDs and a comprehensive set of evaluation metrics. The benchmark assesses model performance across five dimensions: video quality, identity preservation, dialogue coherence, audio-visual consistency, and emotional alignment.

8 retrieved papers
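
The sketch below illustrates, under stated assumptions, how per-sample scores along the five reported dimensions could be aggregated into benchmark-level results. VidChatBench's concrete metric implementations and any weighting scheme are not given in this report, so the dimension keys and the uniform per-dimension averaging are illustrative assumptions, not the benchmark's actual protocol.

```python
# Illustrative sketch only: averaging per-sample scores along VidChatBench's five
# reported dimensions. The benchmark's actual metrics and any weighting scheme are
# not specified here, so uniform per-dimension means are an assumption.
from statistics import mean
from typing import Dict, List

DIMENSIONS = [
    "video_quality",
    "identity_preservation",
    "dialogue_coherence",
    "audio_visual_consistency",
    "emotional_alignment",
]

def aggregate(per_sample_scores: List[Dict[str, float]]) -> Dict[str, float]:
    """Mean score per dimension over the (e.g., 500-sample) test set."""
    return {d: mean(s[d] for s in per_sample_scores) for d in DIMENSIONS}

# Example with two dummy samples (a real run would cover all 500 unseen-ID samples).
scores = [
    {d: 0.8 for d in DIMENSIONS},
    {d: 0.6 for d in DIMENSIONS},
]
print(aggregate(scores))
```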
Autoregressive baseline method for audio-visual dyadic human generation

The authors propose an autoregressive framework that jointly generates audio and video responses based on multimodal input. The method incorporates Qwen2.5-Omni for multimodal understanding, next-chunk prediction for token generation, and a diffusion MLP for enhanced visual realism, trained progressively across three stages.

10 retrieved papers
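
The loop below is a schematic sketch of the described next-chunk autoregressive generation with a diffusion-based visual refinement step. It is not the authors' implementation: the backbone call stands in for Qwen2.5-Omni's multimodal prediction, the token layout is invented for illustration, and the diffusion-MLP refinement is reduced to a stub, so only the control flow (predict a chunk, refine its visual tokens, append it to the context) reflects the contribution's high-level description.

```python
# Schematic sketch of a next-chunk autoregressive loop with a diffusion refinement
# head. The backbone, tokenization, and refinement below are stand-in stubs, not
# Qwen2.5-Omni's or the authors' actual interfaces.
from typing import List, Tuple
import random

Chunk = Tuple[List[int], List[int]]  # (audio_tokens, video_tokens) for one chunk

def backbone_next_chunk(context: List[Chunk]) -> Chunk:
    """Stub for the multimodal backbone's next-chunk prediction."""
    return ([random.randrange(1024) for _ in range(8)],
            [random.randrange(8192) for _ in range(16)])

def diffusion_mlp_refine(video_tokens: List[int]) -> List[int]:
    """Stub for the diffusion-MLP visual refinement step (identity here)."""
    return video_tokens

def generate_response(prompt_chunks: List[Chunk], n_chunks: int) -> List[Chunk]:
    """Autoregressively emit audio-video chunks conditioned on the running context."""
    context = list(prompt_chunks)
    out: List[Chunk] = []
    for _ in range(n_chunks):
        audio, video = backbone_next_chunk(context)
        video = diffusion_mlp_refine(video)
        chunk = (audio, video)
        out.append(chunk)
        context.append(chunk)  # feed the new chunk back as context
    return out

response = generate_response(prompt_chunks=[([1, 2], [3, 4])], n_chunks=3)
print(len(response), "chunks generated")
```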

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is a partial signal of novelty, though one constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

SpeakerVid-5M dataset for audio-visual dyadic interactive virtual human generation


Contribution

VidChatBench benchmark for evaluating dyadic interactive virtual human models


Contribution

Autoregressive baseline method for audio-visual dyadic human generation
