VibeVoice: Expressive Podcast Generation with Next-Token Diffusion

ICLR 2026 Conference Submission
Anonymous Authors
Text-to-Speech; Podcast Generation
Abstract:

Generating long-form, multi-speaker conversational audio like podcasts poses significant challenges for traditional Text-to-Speech (TTS) systems, particularly in scalability, speaker consistency, and natural turn-taking. We present VibeVoice, a novel model designed to synthesize expressive, long-form speech with multiple speakers in a zero-shot manner. A core component of our approach is a pair of continuous speech tokenizers operating at an ultra-low frame rate of 7.5 Hz. These tokenizers effectively preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. To facilitate training on authentic conversational dynamics, we have developed an annotation pipeline that generates pseudo transcriptions and turn-taking labels for extensive podcast data. Leveraging this data and our efficient tokenizers, VibeVoice employs the next-token diffusion framework. This enables VibeVoice to: (1) synthesize long-form speech (up to 30 minutes) with up to 4 speakers, surpassing the typical 1-2 speaker limits of many prior models; and (2) achieve a high degree of naturalness in turn-taking, pacing, and the rendition of subtle non-lexical cues (such as breaths and lip smacks), which are crucial for listener immersion and capturing the authentic vibe of expressive conversations.

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes an academic paper's tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

VibeVoice contributes a framework for generating expressive multi-speaker podcast-style conversations up to 30 minutes with up to four speakers, using ultra-low frame rate continuous speech tokenizers (7.5 Hz) and next-token diffusion. The paper sits within the 'Podcast-Style Long-Form Dialogue Synthesis' leaf, which contains four papers total including VibeVoice itself. This represents a relatively sparse but emerging research direction, suggesting the problem of extended multi-speaker conversational synthesis remains under-explored compared to shorter-form or single-speaker tasks.

The taxonomy reveals that podcast-style synthesis is one specialized branch under 'Conversational Dialogue Speech Generation', which also includes interactive dual-speaker dialogue and agent-based multi-party generation. Neighboring branches address emotion/style transfer, context-aware long-form synthesis (audiobooks, storytelling), and conversational processing/understanding. VibeVoice's focus on podcast-length naturalness and turn-taking connects it to context-aware synthesis approaches, but its emphasis on zero-shot multi-speaker generation distinguishes it from audiobook methods that typically assume character-level control. The taxonomy's scope notes clarify that podcast-style work must handle extended duration and naturalness, separating it from shorter interactive dialogue systems.

Among 24 candidates examined, the VibeVoice framework contribution shows one refutable candidate out of four examined, while the ultra-low frame rate tokenizer contribution has four refutable candidates among ten examined. The annotation pipeline contribution appears more novel, with zero refutable candidates among ten examined. These statistics suggest that while the overall framework and tokenizer design have some overlapping prior work within the limited search scope, the specific approach to generating pseudo transcriptions and turn-taking labels for podcast data may represent a less-explored methodological contribution. The search scale (24 candidates) indicates this assessment is based on top-K semantic matches rather than exhaustive field coverage.

Given the limited search scope and the sparse taxonomy leaf (four papers), VibeVoice appears to address an emerging problem space where prior work is still accumulating. The framework and tokenizer contributions show moderate overlap with examined candidates, while the annotation pipeline shows less. The analysis covers top semantic matches and does not claim exhaustive coverage of all relevant speech synthesis literature, particularly work outside the podcast-style conversational domain.

Taxonomy

Core-task taxonomy papers: 23
Claimed contributions: 3
Contribution candidate papers compared: 24
Refutable papers: 5

Research Landscape Overview

Core task: expressive long-form multi-speaker conversational speech synthesis. The field organizes around several complementary branches that address different facets of generating natural, emotionally rich dialogue. 'Conversational Dialogue Speech Generation' focuses on producing multi-turn exchanges with appropriate turn-taking and prosodic cues, while 'Emotion and Style Transfer in Multi-Speaker TTS' emphasizes controlling affective dimensions and speaker identity. 'Context-Aware Long-Form Speech Synthesis' tackles the challenge of maintaining coherence and naturalness over extended durations, and 'Conversational Speech Processing and Understanding' provides the analysis tools needed to model dialogue structure. 'Dialogue Response Generation with Emotion' and 'Personalized Speech and Dialog Modeling' round out the taxonomy by addressing content planning and speaker-specific adaptation. Representative works span from early listener vocalization studies to recent systems like FireRedTTS[2] and DialogueAgents[3] that integrate multiple expressive dimensions.

A particularly active line of work centers on podcast-style long-form dialogue synthesis, where systems must balance naturalness, speaker consistency, and narrative flow over minutes rather than seconds. VibeVoice[0] sits squarely within this emerging cluster, alongside SoulX Podcast[5] and Podagent[6], all of which tackle the challenge of generating extended multi-speaker conversations with appropriate emotional arcs and turn-taking dynamics. Compared to SoulX Podcast[5], which emphasizes content-driven narrative structure, VibeVoice[0] appears to focus more directly on the expressive acoustic modeling required for sustained conversational realism.
Meanwhile, works like Controllable Emotional Speech[4] and DialoSpeech[14] explore finer-grained control over prosody and emotion within shorter dialogue contexts, highlighting an ongoing tension between expressiveness at the utterance level versus coherence across long-form interactions. The field continues to grapple with how to scale emotional richness and speaker variability to podcast-length scenarios without sacrificing naturalness.

Claimed Contributions

VibeVoice framework for expressive multi-speaker podcast generation

The authors introduce VibeVoice, a framework that synthesizes expressive, long-form conversational audio (up to 90 minutes) with up to 4 speakers in a zero-shot setting. It employs a next-token diffusion architecture integrated with an LLM to achieve natural turn-taking, pacing, and subtle non-lexical cues crucial for authentic conversational dynamics.
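The report gives no implementation detail for this loop, but the described data flow (an autoregressive backbone whose per-step state conditions a diffusion head emitting continuous latents) can be traced with stub models. Everything below, including the toy denoising rule, the dimensions, and the function names, is an illustrative assumption, not the authors' architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stub stand-ins: the described system pairs an LLM backbone with a learned
# diffusion head; here a toy update rule merely traces the data flow.
def backbone_step(context: np.ndarray) -> np.ndarray:
    """Summarize the running context into a per-step conditioning vector."""
    return np.tanh(context.mean(axis=0))

def diffusion_head(cond: np.ndarray, n_denoise: int = 4) -> np.ndarray:
    """Iteratively refine a continuous latent, conditioned on the backbone state."""
    z = rng.normal(size=cond.shape)      # start from Gaussian noise
    for _ in range(n_denoise):
        z = z + 0.5 * (cond - z)         # toy update pulling z toward cond
    return z

def generate(n_frames: int, dim: int = 8) -> np.ndarray:
    """Emit continuous speech latents autoregressively, one frame per step."""
    context = [rng.normal(size=dim)]     # stands in for text/speaker conditioning
    frames = []
    for _ in range(n_frames):
        cond = backbone_step(np.stack(context))
        frame = diffusion_head(cond)     # continuous output: no discrete codebook
        frames.append(frame)
        context.append(frame)            # fed back in, like a next token
    return np.stack(frames)

latents = generate(n_frames=5)
print(latents.shape)  # (5, 8)
```

The key property the sketch captures is that each generated frame is a continuous vector (no quantization step) that is appended back to the context before the next prediction.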

4 retrieved papers (Can Refute)
Ultra-low frame rate continuous speech tokenizers

The authors develop specialized acoustic and semantic tokenizers that both operate at an ultra-low frame rate of 7.5 Hz. The acoustic tokenizer uses a sigma-VAE to preserve audio fidelity while the semantic tokenizer extracts linguistic content, together forming a hybrid representation that significantly boosts computational efficiency for long sequences.
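The efficiency claim can be made concrete with simple arithmetic: at 7.5 Hz, the number of frames for a long recording stays small. The 50 Hz comparison rate below is a hypothetical baseline chosen for illustration, not a figure from the paper:

```python
# Sequence length implied by a tokenizer's frame rate:
# frames = duration_seconds * frame_rate_hz

def num_frames(duration_min: float, frame_rate_hz: float) -> int:
    """Number of token frames for an audio clip of the given length."""
    return int(duration_min * 60 * frame_rate_hz)

# VibeVoice's tokenizers run at 7.5 Hz (per the claimed contribution).
vibevoice = num_frames(30, 7.5)   # 30-minute podcast -> 13,500 frames
# A 50 Hz tokenizer is an assumed baseline for comparison only.
baseline = num_frames(30, 50.0)   # -> 90,000 frames

print(vibevoice, baseline)  # 13500 90000
```

Under this assumed baseline, the 7.5 Hz tokenizer shortens the sequence roughly 6.7×, which is what makes 30-minute contexts tractable for an autoregressive model.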

10 retrieved papers (Can Refute)
Annotation pipeline for podcast data with conversational dynamics

The authors propose a novel automatic annotation pipeline tailored for extended multi-speaker speech data. This pipeline generates pseudo transcriptions and speaker turn-taking labels for large-scale podcast datasets, enabling the model to learn realistic intonation, turn-taking, and subtle expressive cues from authentic conversational material.
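The pipeline's internals are not described in this report, but the kind of turn-taking labels it produces can be illustrated with a toy sketch: hypothetical diarized, pseudo-transcribed segments are merged into speaker turns. The `Segment` structure, `to_turns` helper, and sample data are illustrative assumptions, not the authors' format:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # diarization label from a hypothetical upstream stage
    start: float   # seconds
    end: float
    text: str      # pseudo transcription for this stretch of audio

def to_turns(segments: list[Segment]) -> list[Segment]:
    """Merge consecutive same-speaker segments into turn-taking labels."""
    turns: list[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        if turns and turns[-1].speaker == seg.speaker:
            last = turns[-1]
            turns[-1] = Segment(last.speaker, last.start, seg.end,
                                f"{last.text} {seg.text}".strip())
        else:
            turns.append(seg)
    return turns

segs = [Segment("A", 0.0, 4.1, "welcome back"),
        Segment("A", 4.1, 7.9, "to the show"),
        Segment("B", 7.9, 12.3, "thanks for having me")]
print(to_turns(segs))  # two turns: A (0.0-7.9), B (7.9-12.3)
```

The point of labels like these is that the model sees where one speaker yields to another, which is the turn-taking signal the contribution says is learned from authentic podcast material.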

10 retrieved papers

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution
