YuE: Scaling Open Foundation Models for Long-Form Music Generation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: lyrics2song, song generation, long-form, foundation model, music generation
Abstract:

We tackle the task of long-form music generation, particularly the challenging lyrics-to-song problem, by introducing YuE (乐), a family of open-source music generation foundation models. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through track-decoupled next-token prediction to overcome dense mixture signals, and structural progressive conditioning for long-context lyrical alignment. In addition, we redesign the in-context learning technique for music generation, enabling bidirectional content creation, style cloning, and improved musicality. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some proprietary systems in musicality and vocal agility (as of 2025-01). We strongly encourage readers to listen to our demo: https://yue-anonymous.github.io

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces YuE, a foundation model for long-form music generation from lyrics, contributing track-decoupled next-token prediction, structural progressive conditioning, and redesigned in-context learning for music. Within the taxonomy, YuE occupies the 'Long-Form Music Generation' leaf, which contains only this single paper among fifty total works surveyed. This isolation indicates a sparse research direction: while the broader field addresses text-to-speech prosody transfer and short-form audio generation extensively, extended musical composition with lyrical alignment remains underexplored in the current taxonomy structure.

The taxonomy reveals that neighboring branches—particularly 'Text-to-Speech Synthesis' and 'Text-to-Audio Generation'—contain dense clusters addressing reference-based prosody modeling, zero-shot voice cloning, and temporal audio synthesis. YuE's use of reference audio and target text aligns conceptually with prosody-transfer paradigms in TTS (e.g., Prosody Transfer Transformer, Exact Prosody Cloning), yet diverges by targeting musical structure and five-minute durations rather than speech fluency. The 'Temporal and Long-Form Audio Generation' leaf addresses extended synthesis but focuses on environmental sounds, not music with lyrical coherence, highlighting YuE's distinct positioning.

Among the twenty candidates examined across the three contributions, no refutable prior work was identified. Track-Decoupled Next-Token Prediction was compared against ten candidates with no clear refutations; Structural Progressive Conditioning had zero retrieved candidates; the redesigned in-context learning was compared against ten candidates, also yielding no refutations. This limited search scope of twenty papers from semantic retrieval suggests the analysis captures immediate neighbors but may not reflect exhaustive coverage of the music generation or long-context modeling literature. The absence of refutations within this scope indicates the contributions appear novel relative to the examined subset.

Given the sparse taxonomy leaf and limited search scale, YuE's contributions appear distinctive within the surveyed literature, particularly for long-form lyrical music generation. However, the analysis does not cover broader music generation systems outside the top-twenty semantic matches, nor does it exhaustively survey autoregressive music models or large-scale audio foundation work. The novelty assessment reflects what is visible in this constrained retrieval context, not a comprehensive field survey.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: Given a reference audio and text pair, generate new audio that matches a target text while preserving stylistic or prosodic characteristics from the reference.

The field structure reflects diverse approaches to audio generation and understanding. Text-to-Speech Synthesis encompasses methods for converting text to natural speech, often emphasizing prosody transfer, style control, and zero-shot voice cloning, as seen in works like StyleTTS[2] and FireRedTTS[8]. Text-to-Audio Generation focuses on producing general sound effects or environmental audio from textual descriptions, with representative efforts such as EzAudio[4] and Make An Audio[16]. Multimodal Audio-Visual Generation bridges audio and visual modalities, exemplified by Ta2v[3] and DiffAVA[42], while Audio Understanding and Retrieval addresses tasks like audio captioning, retrieval, and source separation, including Separate Anything[5] and Egocentric Audio Retrieval[28]. Long-Form Music Generation targets extended musical compositions, a less densely populated but emerging branch.

Across these branches, a central theme is the trade-off between controllability and naturalness: many studies explore how to steer prosody, emotion, or timbre without sacrificing fluency or realism. Within Text-to-Speech Synthesis, a substantial cluster investigates prosody and style transfer using reference audio, with works like Prosody Transfer Transformer[29] and Exact Prosody Cloning[34] emphasizing fine-grained control, while others such as MRMI TTS[6] and DMP TTS[22] pursue robust multi-reference or diffusion-based strategies. YuE[0] sits naturally within the Long-Form Music Generation branch, yet its reliance on reference audio and target text aligns it closely with prosody-transfer paradigms found in TTS research.
Compared to shorter-form TTS methods like RiTTA[19] or LiveSpeech[21], YuE[0] likely emphasizes temporal coherence and structural consistency over extended durations, addressing challenges unique to music rather than speech. This positioning highlights an open question: how to adapt reference-driven generation techniques to maintain musical structure and expressiveness at scale.

Claimed Contributions

Track-Decoupled Next-Token Prediction (Dual-NTP)

A dual-token strategy that separately models vocal and accompaniment tracks at each time step, overcoming the limitations of standard next-token prediction when encoding both vocals and accompaniment simultaneously. This approach maintains lyric intelligibility even in acoustically complex genres.

10 retrieved papers
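To make the dual-token idea concrete, here is a minimal sketch of track-decoupled interleaving as the description above frames it: at each time step the model emits one vocal token and one accompaniment token, rather than a single token for the mixed signal. All token values, tags, and function names here are illustrative assumptions, not YuE's actual representation.

```python
# Hypothetical sketch of track-decoupled ("dual") next-token prediction input.
# Instead of tokenizing the dense vocal+accompaniment mixture, each time step
# contributes one vocal token and one accompaniment token to a single
# autoregressive stream. Tags "V"/"A" and integer token IDs are assumptions.

def interleave_dual_tracks(vocal_tokens, accomp_tokens):
    """Zip time-aligned vocal/accompaniment tokens into one token stream."""
    assert len(vocal_tokens) == len(accomp_tokens), "tracks must be time-aligned"
    stream = []
    for v, a in zip(vocal_tokens, accomp_tokens):
        stream.append(("V", v))  # vocal token for this frame
        stream.append(("A", a))  # accompaniment token for the same frame
    return stream

def split_dual_stream(stream):
    """Recover the two tracks from an interleaved stream (inverse operation)."""
    vocal = [tok for tag, tok in stream if tag == "V"]
    accomp = [tok for tag, tok in stream if tag == "A"]
    return vocal, accomp

vocal = [11, 12, 13]
accomp = [21, 22, 23]
stream = interleave_dual_tracks(vocal, accomp)
# An ordinary next-token-prediction model is then trained on `stream`; keeping
# the vocal track as its own token per frame is what preserves lyric
# intelligibility when the accompaniment is acoustically dense.
```

The key design point is that decoupling happens at the sequence level, so a standard decoder-only architecture can be reused unchanged.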
Structural Progressive Conditioning (SPC)

A conditioning strategy that leverages musical structural priors by segmenting songs into sections and interleaving text conditions with audio tokens. This enables the model to handle minutes-long contexts for full-song generation while maintaining lyrical alignment.

0 retrieved papers
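The interleaving of section-level text conditions with audio tokens can be sketched as follows. This is a toy illustration under stated assumptions: the section labels, marker syntax, and string-valued "audio tokens" are hypothetical, chosen only to show how segmenting a song keeps each lyric condition local to its section rather than prepending the full song text.

```python
# Hypothetical sketch of structural progressive conditioning (SPC): a song is
# split into labeled sections (verse, chorus, ...) and each section's lyric
# text is interleaved with that section's audio tokens, so the model attends
# to one section of lyrics at a time across a minutes-long context.

def build_spc_sequence(sections):
    """sections: list of (label, lyric_text, audio_tokens) triples.

    Returns one flat context sequence with per-section structural markers.
    """
    seq = []
    for label, lyrics, audio_tokens in sections:
        seq.append(f"[{label}]")   # structural marker, e.g. [verse]
        seq.append(lyrics)         # text condition scoped to this section only
        seq.extend(audio_tokens)   # audio tokens generated under that condition
    return seq

song = [
    ("verse", "city lights are fading", ["a1", "a2"]),
    ("chorus", "hold on to the night", ["a3", "a4", "a5"]),
]
context = build_spc_sequence(song)
```

Because each lyric chunk sits immediately before the audio it conditions, alignment errors cannot accumulate across the whole song the way they can with a single global text prefix.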
Redesigned In-Context Learning for Music

A novel in-context learning framework specifically designed for music generation that enables style transfer, voice cloning, and bidirectional content creation, going beyond the continuation-based approach used in speech synthesis.

10 retrieved papers
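One way to read "bidirectional content creation" is as a choice of where the reference clip sits in the in-context prompt: new content can be generated after the reference (classic continuation) or be conditioned to precede it. The sketch below is a hypothetical prompt-layout illustration; the marker strings and function signature are assumptions, not YuE's interface.

```python
# Hypothetical sketch of a redesigned in-context-learning prompt for music.
# A reference clip's tokens are placed in-context so its style can be cloned,
# and the "direction" flag models bidirectional creation: generating a segment
# that follows the reference, or one intended to come before it.

def build_icl_prompt(reference_tokens, lyrics, direction="forward"):
    """Assemble an ICL prompt. Marker tokens <ref>, <gen>, etc. are assumed."""
    if direction == "forward":
        # continuation-style: reference first, then generate the new segment
        return ["<ref>"] + reference_tokens + ["</ref>", lyrics, "<gen>"]
    elif direction == "backward":
        # generate a segment meant to precede the reference clip
        return [lyrics, "<gen-before>", "<ref>"] + reference_tokens + ["</ref>"]
    raise ValueError("direction must be 'forward' or 'backward'")

prompt = build_icl_prompt(["r1", "r2"], "sing of the sea", direction="forward")
```

The contrast with speech-synthesis ICL is that continuation is only one of the supported layouts, rather than the whole mechanism.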

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, though still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

The per-contribution comparisons cover the three contributions described under Claimed Contributions above: Track-Decoupled Next-Token Prediction (Dual-NTP), Structural Progressive Conditioning (SPC), and the redesigned in-context learning for music. None of the retrieved candidate papers refuted any of the three claims.