VibeVoice: Expressive Podcast Generation with Next-Token Diffusion

ICLR 2026 Conference Submission
Anonymous Authors
Text-to-Speech; Podcast Generation
Abstract:

Generating long-form, multi-speaker conversational audio like podcasts poses significant challenges for traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking. We present VibeVoice, a novel model designed to synthesize expressive, long-form speech with multiple speakers in a zero-shot manner. A core component of our approach is a pair of continuous speech tokenizers operating at an ultra-low frame rate of 7.5 Hz. These tokenizers effectively preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. To facilitate training on authentic conversational dynamics, we have developed an annotation pipeline that generates pseudo transcriptions and turn-taking labels for extensive podcast data. Leveraging this data and our efficient tokenizers, VibeVoice employs the next-token diffusion framework. This enables VibeVoice to: (1) synthesize long-form speech (up to 30 minutes) with up to 4 speakers, surpassing the typical 1-2 speaker limits of many prior models; and (2) achieve a high degree of naturalness in turn-taking, pacing, and the rendition of subtle non-lexical cues (such as breaths and lip smacks), which are crucial for listener immersion and capturing the authentic vibe of expressive conversations.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

VibeVoice contributes a framework for generating expressive multi-speaker podcast-style conversations up to 30 minutes with up to four speakers, using ultra-low frame rate continuous speech tokenizers (7.5 Hz) and next-token diffusion. The paper sits within the 'Podcast-Style Long-Form Dialogue Synthesis' leaf, which contains four papers total including VibeVoice itself. This represents a relatively sparse but emerging research direction, suggesting the problem of extended multi-speaker conversational synthesis remains under-explored compared to shorter-form or single-speaker tasks.

The taxonomy reveals that podcast-style synthesis is one specialized branch under 'Conversational Dialogue Speech Generation', which also includes interactive dual-speaker dialogue and agent-based multi-party generation. Neighboring branches address emotion/style transfer, context-aware long-form synthesis (audiobooks, storytelling), and conversational processing/understanding. VibeVoice's focus on podcast-length naturalness and turn-taking connects it to context-aware synthesis approaches, but its emphasis on zero-shot multi-speaker generation distinguishes it from audiobook methods that typically assume character-level control. The taxonomy's scope notes clarify that podcast-style work must handle extended duration and naturalness, separating it from shorter interactive dialogue systems.

Among 24 candidates examined, the VibeVoice framework contribution shows one refutable candidate out of four examined, while the ultra-low frame rate tokenizer contribution has four refutable candidates among ten examined. The annotation pipeline contribution appears more novel, with zero refutable candidates among ten examined. These statistics suggest that while the overall framework and tokenizer design have some overlapping prior work within the limited search scope, the specific approach to generating pseudo transcriptions and turn-taking labels for podcast data may represent a less-explored methodological contribution. The search scale (24 candidates) indicates this assessment is based on top-K semantic matches rather than exhaustive field coverage.

Given the limited search scope and the sparse taxonomy leaf (four papers), VibeVoice appears to address an emerging problem space where prior work is still accumulating. The framework and tokenizer contributions show moderate overlap with examined candidates, while the annotation pipeline shows less. The analysis covers top semantic matches and does not claim exhaustive coverage of all relevant speech synthesis literature, particularly work outside the podcast-style conversational domain.

Taxonomy

Core-task taxonomy papers: 23
Claimed contributions: 3
Contribution candidate papers compared: 24
Refutable papers: 5

Research Landscape Overview

Core task: expressive long-form multi-speaker conversational speech synthesis. The field organizes around several complementary branches that address different facets of generating natural, emotionally rich dialogue. 'Conversational Dialogue Speech Generation' focuses on producing multi-turn exchanges with appropriate turn-taking and prosodic cues, while 'Emotion and Style Transfer in Multi-Speaker TTS' emphasizes controlling affective dimensions and speaker identity. 'Context-Aware Long-Form Speech Synthesis' tackles the challenge of maintaining coherence and naturalness over extended durations, and 'Conversational Speech Processing and Understanding' provides the analysis tools needed to model dialogue structure. 'Dialogue Response Generation with Emotion' and 'Personalized Speech and Dialog Modeling' round out the taxonomy by addressing content planning and speaker-specific adaptation. Representative works span from early listener vocalization studies to recent systems like FireRedTTS[2] and DialogueAgents[3] that integrate multiple expressive dimensions.

A particularly active line of work centers on podcast-style long-form dialogue synthesis, where systems must balance naturalness, speaker consistency, and narrative flow over minutes rather than seconds. VibeVoice[0] sits squarely within this emerging cluster, alongside SoulX Podcast[5] and Podagent[6], all of which tackle the challenge of generating extended multi-speaker conversations with appropriate emotional arcs and turn-taking dynamics. Compared to SoulX Podcast[5], which emphasizes content-driven narrative structure, VibeVoice[0] appears to focus more directly on the expressive acoustic modeling required for sustained conversational realism.
Meanwhile, works like Controllable Emotional Speech[4] and DialoSpeech[14] explore finer-grained control over prosody and emotion within shorter dialogue contexts, highlighting an ongoing tension between expressiveness at the utterance level versus coherence across long-form interactions. The field continues to grapple with how to scale emotional richness and speaker variability to podcast-length scenarios without sacrificing naturalness.

Claimed Contributions

VibeVoice framework for expressive multi-speaker podcast generation

The authors introduce VibeVoice, a framework that synthesizes expressive, long-form conversational audio (up to 90 minutes) with up to 4 speakers in a zero-shot setting. It employs a next-token diffusion architecture integrated with an LLM to achieve natural turn-taking, pacing, and subtle non-lexical cues crucial for authentic conversational dynamics.
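The report gives no implementation detail for this loop, but the described data flow (an autoregressive backbone whose per-step state conditions a diffusion head emitting continuous latents) can be traced with stub models. Everything below, including the toy denoising rule, the dimensions, and the function names, is an illustrative assumption, not the authors' architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stub stand-ins: the described system pairs an LLM backbone with a learned
# diffusion head; here a toy update rule merely traces the data flow.
def backbone_step(context: np.ndarray) -> np.ndarray:
    """Summarize the running context into a per-step conditioning vector."""
    return np.tanh(context.mean(axis=0))

def diffusion_head(cond: np.ndarray, n_denoise: int = 4) -> np.ndarray:
    """Iteratively refine a continuous latent, conditioned on the backbone state."""
    z = rng.normal(size=cond.shape)      # start from Gaussian noise
    for _ in range(n_denoise):
        z = z + 0.5 * (cond - z)         # toy update pulling z toward cond
    return z

def generate(n_frames: int, dim: int = 8) -> np.ndarray:
    """Emit continuous speech latents autoregressively, one frame per step."""
    context = [rng.normal(size=dim)]     # stands in for text/speaker conditioning
    frames = []
    for _ in range(n_frames):
        cond = backbone_step(np.stack(context))
        frame = diffusion_head(cond)     # continuous output: no discrete codebook
        frames.append(frame)
        context.append(frame)            # fed back in, like a next token
    return np.stack(frames)

latents = generate(n_frames=5)
print(latents.shape)  # (5, 8)
```

The key property the sketch captures is that each generated frame is a continuous vector (no quantization step) that is appended back to the context before the next prediction.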

4 retrieved papers (Can Refute)
Ultra-low frame rate continuous speech tokenizers

The authors develop specialized acoustic and semantic tokenizers that both operate at an ultra-low frame rate of 7.5 Hz. The acoustic tokenizer uses a sigma-VAE to preserve audio fidelity while the semantic tokenizer extracts linguistic content, together forming a hybrid representation that significantly boosts computational efficiency for long sequences.
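The efficiency claim can be made concrete with simple arithmetic: at 7.5 Hz, the number of frames for a long recording stays small. The 50 Hz comparison rate below is a hypothetical baseline chosen for illustration, not a figure from the paper:

```python
# Sequence length implied by a tokenizer's frame rate:
# frames = duration_seconds * frame_rate_hz

def num_frames(duration_min: float, frame_rate_hz: float) -> int:
    """Number of token frames for an audio clip of the given length."""
    return int(duration_min * 60 * frame_rate_hz)

# VibeVoice's tokenizers run at 7.5 Hz (per the claimed contribution).
vibevoice = num_frames(30, 7.5)   # 30-minute podcast -> 13,500 frames
# A 50 Hz tokenizer is an assumed baseline for comparison only.
baseline = num_frames(30, 50.0)   # -> 90,000 frames

print(vibevoice, baseline)  # 13500 90000
```

Under this assumed baseline, the 7.5 Hz tokenizer shortens the sequence roughly 6.7×, which is what makes 30-minute contexts tractable for an autoregressive model.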

10 retrieved papers (Can Refute)
Annotation pipeline for podcast data with conversational dynamics

The authors propose a novel automatic annotation pipeline tailored for extended multi-speaker speech data. This pipeline generates pseudo transcriptions and speaker turn-taking labels for large-scale podcast datasets, enabling the model to learn realistic intonation, turn-taking, and subtle expressive cues from authentic conversational material.
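The pipeline's internals are not described in this report, but the kind of turn-taking labels it produces can be illustrated with a toy sketch: hypothetical diarized, pseudo-transcribed segments are merged into speaker turns. The `Segment` structure, `to_turns` helper, and sample data are illustrative assumptions, not the authors' format:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # diarization label from a hypothetical upstream stage
    start: float   # seconds
    end: float
    text: str      # pseudo transcription for this stretch of audio

def to_turns(segments: list[Segment]) -> list[Segment]:
    """Merge consecutive same-speaker segments into turn-taking labels."""
    turns: list[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            turns[-1] = Segment(last.speaker, last.start, seg.end,
                                f"{last.text} {seg.text}".strip())
        else:
            turns.append(seg)
    return turns

segs = [Segment("A", 0.0, 4.1, "welcome back"),
        Segment("A", 4.1, 7.9, "to the show"),
        Segment("B", 7.9, 12.3, "thanks for having me")]
print(to_turns(segs))  # two turns: A (0.0-7.9), B (7.9-12.3)
```

The point of labels like these is that the model sees where one speaker yields to another, which is the turn-taking signal the contribution says is learned from authentic podcast material.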

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
