Music Flamingo: Scaling Music Understanding in Audio Language Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: music, audio, multi-modal, language model
Abstract:

We introduce Music Flamingo, a novel large audio–language model designed to advance music (including song) understanding in foundational audio models. While audio–language research has progressed rapidly, music remains challenging due to its dynamic, layered, and information-dense nature. Progress has been further limited by the difficulty of scaling open audio understanding models, primarily because of the scarcity of high-quality music data and annotations. As a result, prior models are restricted to producing short, high-level captions, answering only surface-level questions, and showing limited generalization across diverse musical cultures. To address these challenges, we curate MF-Skills, a large-scale dataset labeled through a multi-stage pipeline that yields rich captions and question–answer pairs covering harmony, structure, timbre, lyrics, and cultural context. We fine-tune an enhanced Audio Flamingo 3 backbone on MF-Skills and further strengthen multiple skills relevant to music understanding. To improve the model's reasoning abilities, we introduce a post-training recipe: we first cold-start with MF-Think, a novel chain-of-thought dataset grounded in music theory, followed by GRPO-based reinforcement learning with custom rewards. Music Flamingo achieves state-of-the-art results across 10+ benchmarks for music understanding and reasoning, establishing itself as a generalist and musically intelligent audio–language model. Beyond strong empirical results, Music Flamingo sets a new standard for advanced music understanding by demonstrating how models can move from surface-level recognition toward layered, human-like perception of songs. We believe this work provides both a benchmark and a foundation for the community to build the next generation of models that engage with music as richly and meaningfully as humans do. Demo: https://musicflamingo.github.io

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

Music Flamingo proposes a large audio-language model specialized for music understanding, addressing harmony, structure, timbre, lyrics, and cultural context through curated datasets and a post-training recipe involving chain-of-thought reasoning. The paper resides in the 'Music-Centric Audio-Language Models' leaf, which contains four papers total, including Mumu-llama, GAMA, and Musilingo. This leaf represents a focused research direction within the broader 'Music-Specific Understanding and Reasoning' branch, indicating a moderately populated area where specialized music models are actively being developed alongside general audio-language systems.

The taxonomy reveals neighboring work in 'Multi-Modal Audio Encoders and Fusion' (seven papers) and 'Multi-Domain Audio-Language Models' (six papers), which explore general audio understanding without music-specific specialization. The 'Music Perception and Knowledge Evaluation' leaf (three papers) addresses music theory tasks like chord recognition, while 'Lyrics and Multi-Modal Music Analysis' (four papers) integrates textual and audio modalities. Music Flamingo bridges these areas by combining multi-modal fusion techniques with music-centric training, distinguishing itself from general audio models like Audio Flamingo and from purely symbolic or theory-focused approaches.

Among 29 candidates examined, the model architecture contribution shows one refutable candidate out of ten examined, suggesting some overlap with prior audio-language model designs. The dataset contributions (MF-Skills and MF-Think) encountered no refutable candidates across ten examined papers, indicating potential novelty in curating music-specific annotations and chain-of-thought data. The post-training recipe contribution identified one refutable candidate among nine examined, reflecting partial overlap with existing reinforcement learning or reasoning enhancement methods. These statistics reflect a limited semantic search scope, not an exhaustive literature review.

Given the search scope of 29 candidates, the work appears to offer incremental architectural refinement alongside potentially novel dataset curation for music understanding. The taxonomy context shows Music Flamingo occupies a moderately active research niche, with sibling papers pursuing similar music-centric goals. The analysis does not cover exhaustive prior work in music information retrieval or symbolic music processing, which may contain additional relevant comparisons beyond the top-K semantic matches examined here.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 2

Research Landscape Overview

Core task: music understanding in audio language models. The field has evolved around several complementary branches. Audio-Language Model Architectures and Training explores foundational designs that bridge audio encoders with large language models, exemplified by systems like Qwen-Audio[2] and SALMONN[7]. Music-Specific Understanding and Reasoning focuses on models tailored to musical content, addressing tasks such as genre classification, mood analysis, and music-centric question answering. Music Generation and Editing encompasses text-to-music synthesis (e.g., MusicLM[10], MusicLDM[14]) and instruction-based editing workflows. General Audio Understanding and Multi-Domain Systems develops versatile architectures that handle speech, environmental sounds, and music within unified frameworks, while Evaluation and Benchmarking provides datasets and metrics (e.g., AIR-Bench[9], MMAU[11]) to assess model capabilities across diverse audio tasks.

Recent work reveals a tension between general-purpose audio-language models and music-centric specialization. General systems such as Audio Flamingo[8] and its successors (Audio Flamingo 2[15], Audio Flamingo 3[5]) aim for broad audio understanding, whereas music-focused models such as Mumu-llama[3], GAMA[12], and Musilingo[16] prioritize deep musical reasoning and domain-specific knowledge. Music Flamingo[0] sits within this music-centric cluster, emphasizing structured music understanding alongside these specialized approaches. Compared to Mumu-llama[3], which targets comprehensive music question answering, and Musilingo[16], which integrates symbolic music representations, Music Flamingo[0] explores how to adapt Flamingo-style architectures specifically for musical content. This specialization trend highlights an open question: whether future progress will favor unified multi-domain models or continue to benefit from domain-tailored designs that capture music's unique structural and perceptual properties.

Claimed Contributions

Music Flamingo model for advanced music understanding

The authors present Music Flamingo, a large audio-language model that moves beyond surface-level music recognition toward layered, human-like perception of songs. The model is designed to handle the dynamic, layered, and information-dense nature of music through enhanced training strategies and reasoning capabilities.

10 retrieved papers
Can Refute

MF-Skills and MF-Think datasets for music understanding

The authors introduce two large-scale datasets: MF-Skills contains over 4 million samples with detailed multi-aspect captions and question-answer pairs covering full-length, multi-cultural songs; MF-Think provides 300,000 chain-of-thought examples grounded in music theory to enable deliberate reasoning about music.

10 retrieved papers

Post-training recipe with chain-of-thought and reinforcement learning

The authors propose a novel post-training approach that combines supervised fine-tuning on chain-of-thought examples from MF-Think with GRPO-based reinforcement learning using custom-designed rewards. This enables the model to perform explicit step-by-step musical reasoning rather than simple pattern matching.

9 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Music Flamingo model for advanced music understanding

The authors present Music Flamingo, a large audio-language model that moves beyond surface-level music recognition toward layered, human-like perception of songs. The model is designed to handle the dynamic, layered, and information-dense nature of music through enhanced training strategies and reasoning capabilities.
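To make the claimed design concrete, the sketch below assumes the common Flamingo-style pattern also used by the Audio Flamingo family: a pretrained audio encoder produces frame embeddings, a projector maps them into the LLM's token-embedding space, and the language model attends over the concatenated audio and text tokens. All module names, dimensions, and the toy backbone are illustrative assumptions, not Music Flamingo's actual implementation.

```python
# Minimal sketch of a Flamingo-style audio-language model (illustrative only).
import torch
import torch.nn as nn


class AudioToLLMProjector(nn.Module):
    """Maps audio-encoder frame embeddings into the LLM's token-embedding space."""

    def __init__(self, audio_dim: int = 128, llm_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_frames: torch.Tensor) -> torch.Tensor:
        # audio_frames: (batch, n_frames, audio_dim) from a pretrained audio encoder
        return self.net(audio_frames)


class ToyAudioLanguageModel(nn.Module):
    """Prepends projected audio embeddings to text embeddings before the backbone."""

    def __init__(self, vocab_size: int = 1000, audio_dim: int = 128, llm_dim: int = 256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.projector = AudioToLLMProjector(audio_dim, llm_dim)
        # Stand-in for a decoder-only LLM; a real system would use a causal transformer.
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, audio_frames: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        audio_tokens = self.projector(audio_frames)   # (batch, n_frames, llm_dim)
        text_tokens = self.text_embed(text_ids)       # (batch, n_text, llm_dim)
        joint = torch.cat([audio_tokens, text_tokens], dim=1)
        return self.lm_head(self.backbone(joint))     # next-token logits


# Example: 30 encoded audio frames and a 16-token text prompt.
model = ToyAudioLanguageModel()
logits = model(torch.randn(2, 30, 128), torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # torch.Size([2, 46, 1000])
```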

Contribution

MF-Skills and MF-Think datasets for music understanding

The authors introduce two large-scale datasets: MF-Skills contains over 4 million samples with detailed multi-aspect captions and question-answer pairs covering full-length, multi-cultural songs; MF-Think provides 300,000 chain-of-thought examples grounded in music theory to enable deliberate reasoning about music.
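For illustration, the sketch below shows what individual MF-Skills (caption plus question-answer pairs) and MF-Think (chain-of-thought) records might look like, given the aspects the report describes. The field names and sample values are hypothetical; the datasets' actual schema is not specified here.

```python
# Hypothetical record layouts; field names and values are illustrative guesses.
mf_skills_example = {
    "audio": "song_000123.flac",  # full-length track
    "caption": (
        "A mid-tempo Afrobeat track in E minor; call-and-response horns over a "
        "syncopated bass groove, with lyrics about city life."
    ),
    "qa_pairs": [
        {"question": "What is the predominant key?", "answer": "E minor"},
        {"question": "Which section repeats after the bridge?", "answer": "The chorus"},
    ],
}

mf_think_example = {
    "audio": "song_000456.flac",
    "question": "Does the harmony modulate during the final chorus?",
    "chain_of_thought": (
        "The verses center on A major (I). Before the last chorus the bass walks up "
        "a whole step and the melody shifts with it, so the final chorus sits a tone "
        "higher, in B major."
    ),
    "answer": "Yes, it modulates up a whole step to B major.",
}
```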

Contribution

Post-training recipe with chain-of-thought and reinforcement learning

The authors propose a novel post-training approach that combines supervised fine-tuning on chain-of-thought examples from MF-Think with GRPO-based reinforcement learning using custom-designed rewards. This enables the model to perform explicit step-by-step musical reasoning rather than simple pattern matching.
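As a rough illustration of the GRPO component, the sketch below scores a group of sampled answers with a toy reward (exact-match correctness plus a small bonus for visible reasoning) and converts the rewards into group-relative advantages. The reward design, group size, and the surrounding clipped policy-gradient and KL-penalty machinery are simplified assumptions and do not reproduce the authors' custom rewards.

```python
# Minimal sketch of GRPO-style group-relative advantages with a toy reward.
from statistics import mean, pstdev


def toy_reward(response: str, reference_answer: str) -> float:
    correct = 1.0 if reference_answer.lower() in response.lower() else 0.0
    shows_reasoning = 0.2 if "because" in response.lower() else 0.0  # format-style bonus
    return correct + shows_reasoning


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO's advantage: reward minus group mean, divided by group std."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]


# Example: four sampled answers to "What key is the chorus in?" (reference: "D major").
responses = [
    "D major, because the melody resolves to D over an F sharp in the bass.",
    "It is in D major.",
    "Probably B minor.",
    "G major.",
]
rewards = [toy_reward(r, "D major") for r in responses]
advantages = group_relative_advantages(rewards)
# Responses with above-average reward receive positive advantages and are reinforced.
```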