YuE: Scaling Open Foundation Models for Long-Form Music Generation

ICLR 2026 Conference Submission
Anonymous Authors
Keywords: lyrics2song, song generation, long-form, foundation model, music generation
Abstract:

We tackle the task of long-form music generation, particularly the challenging lyrics-to-song problem, by introducing YuE (乐), a family of open-source music generation foundation models. Specifically, YuE scales to trillions of tokens and generates up to five minutes of music while maintaining lyrical alignment, coherent musical structure, and engaging vocal melodies with appropriate accompaniment. It achieves this through track-decoupled next-token prediction to overcome dense mixture signals, and structural progressive conditioning for long-context lyrical alignment. In addition, we redesign the in-context learning technique for music generation, enabling bidirectional content creation, style cloning, and improved musicality. Through extensive evaluation, we demonstrate that YuE matches or even surpasses some proprietary systems in musicality and vocal agility (as of 2025-01). We strongly encourage readers to listen to our demo: https://yue-anonymous.github.io

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholarly search engine). It analyzes a paper's claimed tasks and contributions against retrieved prior work. While the system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. The results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs), and the system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

The paper introduces YuE, a foundation model for long-form music generation from lyrics, contributing track-decoupled next-token prediction, structural progressive conditioning, and redesigned in-context learning for music. Within the taxonomy, YuE occupies the 'Long-Form Music Generation' leaf, which contains only this single paper among fifty total works surveyed. This isolation indicates a sparse research direction: while the broader field addresses text-to-speech prosody transfer and short-form audio generation extensively, extended musical composition with lyrical alignment remains underexplored in the current taxonomy structure.

The taxonomy reveals that neighboring branches—particularly 'Text-to-Speech Synthesis' and 'Text-to-Audio Generation'—contain dense clusters addressing reference-based prosody modeling, zero-shot voice cloning, and temporal audio synthesis. YuE's use of reference audio and target text aligns conceptually with prosody-transfer paradigms in TTS (e.g., Prosody Transfer Transformer, Exact Prosody Cloning), yet diverges by targeting musical structure and five-minute durations rather than speech fluency. The 'Temporal and Long-Form Audio Generation' leaf addresses extended synthesis but focuses on environmental sounds, not music with lyrical coherence, highlighting YuE's distinct positioning.

Among the twenty candidates examined across the three contributions, no refutable prior work was identified. Track-Decoupled Next-Token Prediction was compared against ten candidates with no clear refutations; Structural Progressive Conditioning had zero retrieved candidates; the redesigned in-context learning was compared against ten candidates, also yielding no refutations. This limited search scope of twenty papers from semantic retrieval suggests the analysis captures immediate neighbors but may not reflect exhaustive coverage of the music generation or long-context modeling literature. The absence of refutations within this scope indicates the contributions appear novel relative to the examined subset.

Given the sparse taxonomy leaf and limited search scale, YuE's contributions appear distinctive within the surveyed literature, particularly for long-form lyrical music generation. However, the analysis does not cover broader music generation systems outside the top-twenty semantic matches, nor does it exhaustively survey autoregressive music models or large-scale audio foundation work. The novelty assessment reflects what is visible in this constrained retrieval context, not a comprehensive field survey.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 20
Refutable Papers: 0

Research Landscape Overview

Core task: Given a reference audio and text pair, generate new audio that matches a target text while preserving stylistic or prosodic characteristics from the reference.

The field structure reflects diverse approaches to audio generation and understanding. Text-to-Speech Synthesis encompasses methods for converting text to natural speech, often emphasizing prosody transfer, style control, and zero-shot voice cloning, as seen in works like StyleTTS[2] and FireRedTTS[8]. Text-to-Audio Generation focuses on producing general sound effects or environmental audio from textual descriptions, with representative efforts such as EzAudio[4] and Make An Audio[16]. Multimodal Audio-Visual Generation bridges audio and visual modalities, exemplified by Ta2v[3] and DiffAVA[42], while Audio Understanding and Retrieval addresses tasks like audio captioning, retrieval, and source separation, including Separate Anything[5] and Egocentric Audio Retrieval[28]. Long-Form Music Generation targets extended musical compositions, a less densely populated but emerging branch.

Across these branches, a central theme is the trade-off between controllability and naturalness: many studies explore how to steer prosody, emotion, or timbre without sacrificing fluency or realism. Within Text-to-Speech Synthesis, a substantial cluster investigates prosody and style transfer using reference audio, with works like Prosody Transfer Transformer[29] and Exact Prosody Cloning[34] emphasizing fine-grained control, while others such as MRMI TTS[6] and DMP TTS[22] pursue robust multi-reference or diffusion-based strategies. YuE[0] sits naturally within the Long-Form Music Generation branch, yet its reliance on reference audio and target text aligns it closely with prosody-transfer paradigms found in TTS research.
Compared to shorter-form TTS methods like RiTTA[19] or LiveSpeech[21], YuE[0] likely emphasizes temporal coherence and structural consistency over extended durations, addressing challenges unique to music rather than speech. This positioning highlights an open question: how to adapt reference-driven generation techniques to maintain musical structure and expressiveness at scale.

Claimed Contributions

Track-Decoupled Next-Token Prediction (Dual-NTP)

A dual-token strategy that separately models vocal and accompaniment tracks at each time step, overcoming the limitations of standard next-token prediction when encoding both vocals and accompaniment simultaneously. This approach maintains lyric intelligibility even in acoustically complex genres.

10 retrieved papers
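To make the dual-token idea concrete, here is a minimal sketch of track-decoupled interleaving as the description above frames it: at each time step the model emits one vocal token and one accompaniment token, rather than a single token for the mixed signal. All token values, tags, and function names here are illustrative assumptions, not YuE's actual representation.

```python
# Hypothetical sketch of track-decoupled ("dual") next-token prediction input.
# Instead of tokenizing the dense vocal+accompaniment mixture, each time step
# contributes one vocal token and one accompaniment token to a single
# autoregressive stream. Tags "V"/"A" and integer token IDs are assumptions.

def interleave_dual_tracks(vocal_tokens, accomp_tokens):
    """Zip time-aligned vocal/accompaniment tokens into one token stream."""
    assert len(vocal_tokens) == len(accomp_tokens), "tracks must be time-aligned"
    stream = []
    for v, a in zip(vocal_tokens, accomp_tokens):
        stream.append(("V", v))  # vocal token for this frame
        stream.append(("A", a))  # accompaniment token for the same frame
    return stream

def split_dual_stream(stream):
    """Recover the two tracks from an interleaved stream (inverse operation)."""
    vocal = [tok for tag, tok in stream if tag == "V"]
    accomp = [tok for tag, tok in stream if tag == "A"]
    return vocal, accomp

vocal = [11, 12, 13]
accomp = [21, 22, 23]
stream = interleave_dual_tracks(vocal, accomp)
# An ordinary next-token-prediction model is then trained on `stream`; keeping
# the vocal track as its own token per frame is what preserves lyric
# intelligibility when the accompaniment is acoustically dense.
```

The key design point is that decoupling happens at the sequence level, so a standard decoder-only architecture can be reused unchanged.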
Structural Progressive Conditioning (SPC)

A conditioning strategy that leverages musical structural priors by segmenting songs into sections and interleaving text conditions with audio tokens. This enables the model to handle minutes-long contexts for full-song generation while maintaining lyrical alignment.

0 retrieved papers
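The interleaving of section-level text conditions with audio tokens can be sketched as follows. This is a toy illustration under stated assumptions: the section labels, marker syntax, and string-valued "audio tokens" are hypothetical, chosen only to show how segmenting a song keeps each lyric condition local to its section rather than prepending the full song text.

```python
# Hypothetical sketch of structural progressive conditioning (SPC): a song is
# split into labeled sections (verse, chorus, ...) and each section's lyric
# text is interleaved with that section's audio tokens, so the model attends
# to one section of lyrics at a time across a minutes-long context.

def build_spc_sequence(sections):
    """sections: list of (label, lyric_text, audio_tokens) triples.

    Returns one flat context sequence with per-section structural markers.
    """
    seq = []
    for label, lyrics, audio_tokens in sections:
        seq.append(f"[{label}]")   # structural marker, e.g. [verse]
        seq.append(lyrics)         # text condition scoped to this section only
        seq.extend(audio_tokens)   # audio tokens generated under that condition
    return seq

song = [
    ("verse", "city lights are fading", ["a1", "a2"]),
    ("chorus", "hold on to the night", ["a3", "a4", "a5"]),
]
context = build_spc_sequence(song)
```

Because each lyric chunk sits immediately before the audio it conditions, alignment errors cannot accumulate across the whole song the way they can with a single global text prefix.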
Redesigned In-Context Learning for Music

A novel in-context learning framework specifically designed for music generation that enables style transfer, voice cloning, and bidirectional content creation, going beyond the continuation-based approach used in speech synthesis.

10 retrieved papers
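One way to read "bidirectional content creation" is as a choice of where the reference clip sits in the in-context prompt: new content can be generated after the reference (classic continuation) or be conditioned to precede it. The sketch below is a hypothetical prompt-layout illustration; the marker strings and function signature are assumptions, not YuE's interface.

```python
# Hypothetical sketch of a redesigned in-context-learning prompt for music.
# A reference clip's tokens are placed in-context so its style can be cloned,
# and the "direction" flag models bidirectional creation: generating a segment
# that follows the reference, or one intended to come before it.

def build_icl_prompt(reference_tokens, lyrics, direction="forward"):
    """Assemble an ICL prompt. Marker tokens <ref>, <gen>, etc. are assumed."""
    if direction == "forward":
        # continuation-style: reference first, then generate the new segment
        return ["<ref>"] + reference_tokens + ["</ref>", lyrics, "<gen>"]
    elif direction == "backward":
        # generate a segment meant to precede the reference clip
        return [lyrics, "<gen-before>", "<ref>"] + reference_tokens + ["</ref>"]
    raise ValueError("direction must be 'forward' or 'backward'")

prompt = build_icl_prompt(["r1", "r2"], "sing of the sea", direction="forward")
```

The contrast with speech-synthesis ICL is that continuation is only one of the supported layouts, rather than the whole mechanism.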

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Within the taxonomy built over the current top-K core-task papers, the original paper is assigned to a leaf with no direct siblings and no cousin branches under the same grandparent topic. In this retrieved landscape, it appears structurally isolated, which is one partial signal of novelty, though still constrained by search coverage and taxonomy granularity.

Contribution Analysis

Detailed comparisons for each claimed contribution

The per-contribution comparisons cover the three contributions described under Claimed Contributions above: Track-Decoupled Next-Token Prediction (Dual-NTP), Structural Progressive Conditioning (SPC), and the redesigned in-context learning for music. None of the retrieved candidate papers refuted any of the three claims.