Music Flamingo: Scaling Music Understanding in Audio Language Models

ICLR 2026 Conference Submission · Anonymous Authors
Keywords: music, audio, multi-modal, language model
Abstract:

We introduce Music Flamingo, a novel large audio–language model designed to advance music (including song) understanding in foundational audio models. While audio–language research has progressed rapidly, music remains challenging due to its dynamic, layered, and information-dense nature. Progress has been further limited by the difficulty of scaling open audio understanding models, primarily because of the scarcity of high-quality music data and annotations. As a result, prior models are restricted to producing short, high-level captions, answering only surface-level questions, and showing limited generalization across diverse musical cultures. To address these challenges, we curate MF-Skills, a large-scale dataset labeled through a multi-stage pipeline that yields rich captions and question–answer pairs covering harmony, structure, timbre, lyrics, and cultural context. We fine-tune an enhanced Audio Flamingo 3 backbone on MF-Skills and further strengthen multiple skills relevant to music understanding. To improve the model's reasoning abilities, we introduce a post-training recipe: we first cold-start with MF-Think, a novel chain-of-thought dataset grounded in music theory, followed by GRPO-based reinforcement learning with custom rewards. Music Flamingo achieves state-of-the-art results across 10+ benchmarks for music understanding and reasoning, establishing itself as a generalist and musically intelligent audio–language model. Beyond strong empirical results, Music Flamingo sets a new standard for advanced music understanding by demonstrating how models can move from surface-level recognition toward layered, human-like perception of songs. We believe this work provides both a benchmark and a foundation for the community to build the next generation of models that engage with music as richly and meaningfully as humans do. Demo: https://musicflamingo.github.io

Disclaimer
This report is AI-generated using large language models and WisPaper (a scholar search engine). It analyzes academic papers' tasks and contributions against retrieved prior work. While this system identifies potential overlaps and novel directions, its coverage is not exhaustive and its judgments are approximate. These results are intended to assist human reviewers and should not be relied upon as a definitive verdict on novelty.
Note that some papers exist in multiple, slightly different versions (e.g., with different titles or URLs). The system may retrieve several versions of the same underlying work. The current automated pipeline does not reliably align or distinguish these cases, so human reviewers will need to disambiguate them manually.
If you have any questions, please contact: mingzhang23@m.fudan.edu.cn

Overview

Overall Novelty Assessment

Music Flamingo proposes a large audio-language model specialized for music understanding, addressing harmony, structure, timbre, lyrics, and cultural context through curated datasets and a post-training recipe involving chain-of-thought reasoning. The paper resides in the 'Music-Centric Audio-Language Models' leaf, which contains four papers total, including Mumu-llama, GAMA, and Musilingo. This leaf represents a focused research direction within the broader 'Music-Specific Understanding and Reasoning' branch, indicating a moderately populated area where specialized music models are actively being developed alongside general audio-language systems.

The taxonomy reveals neighboring work in 'Multi-Modal Audio Encoders and Fusion' (seven papers) and 'Multi-Domain Audio-Language Models' (six papers), which explore general audio understanding without music-specific specialization. The 'Music Perception and Knowledge Evaluation' leaf (three papers) addresses music theory tasks like chord recognition, while 'Lyrics and Multi-Modal Music Analysis' (four papers) integrates textual and audio modalities. Music Flamingo bridges these areas by combining multi-modal fusion techniques with music-centric training, distinguishing itself from general audio models like Audio Flamingo and from purely symbolic or theory-focused approaches.

Among 29 candidates examined, the model architecture contribution shows one refutable candidate out of ten examined, suggesting some overlap with prior audio-language model designs. The dataset contributions (MF-Skills and MF-Think) encountered no refutable candidates across ten examined papers, indicating potential novelty in curating music-specific annotations and chain-of-thought data. The post-training recipe contribution identified one refutable candidate among nine examined, reflecting partial overlap with existing reinforcement learning or reasoning enhancement methods. These statistics reflect a limited semantic search scope, not an exhaustive literature review.

Given the search scope of 29 candidates, the work appears to offer incremental architectural refinement alongside potentially novel dataset curation for music understanding. The taxonomy context shows Music Flamingo occupies a moderately active research niche, with sibling papers pursuing similar music-centric goals. The analysis does not cover exhaustive prior work in music information retrieval or symbolic music processing, which may contain additional relevant comparisons beyond the top-K semantic matches examined here.

Taxonomy

Core-task Taxonomy Papers: 50
Claimed Contributions: 3
Contribution Candidate Papers Compared: 29
Refutable Papers: 2

Research Landscape Overview

Core task: music understanding in audio language models. The field has evolved around several complementary branches. Audio-Language Model Architectures and Training explores foundational designs that bridge audio encoders with large language models, exemplified by systems like Qwen-Audio[2] and SALMONN[7]. Music-Specific Understanding and Reasoning focuses on models tailored to musical content, addressing tasks such as genre classification, mood analysis, and music-centric question answering. Music Generation and Editing encompasses text-to-music synthesis (e.g., MusicLM[10], MusicLDM[14]) and instruction-based editing workflows. General Audio Understanding and Multi-Domain Systems develops versatile architectures that handle speech, environmental sounds, and music within unified frameworks, while Evaluation and Benchmarking provides datasets and metrics (e.g., AIR-Bench[9], MMAU[11]) to assess model capabilities across diverse audio tasks.

Recent work reveals a tension between general-purpose audio-language models and music-centric specialization. General systems such as Audio Flamingo[8] and its successors (Audio Flamingo 2[15], Audio Flamingo 3[5]) aim for broad audio understanding, whereas music-focused models such as Mumu-llama[3], GAMA[12], and Musilingo[16] prioritize deep musical reasoning and domain-specific knowledge. Music Flamingo[0] sits within this music-centric cluster, emphasizing structured music understanding alongside these specialized approaches. Compared to Mumu-llama[3], which targets comprehensive music question answering, and Musilingo[16], which integrates symbolic music representations, Music Flamingo[0] explores how to adapt Flamingo-style architectures specifically for musical content. This specialization trend highlights an open question: whether future progress will favor unified multi-domain models or continue to benefit from domain-tailored designs that capture music's unique structural and perceptual properties.

Claimed Contributions

Music Flamingo model for advanced music understanding

The authors present Music Flamingo, a large audio-language model that moves beyond surface-level music recognition toward layered, human-like perception of songs. The model is designed to handle the dynamic, layered, and information-dense nature of music through enhanced training strategies and reasoning capabilities.

10 retrieved papers
Can Refute

MF-Skills and MF-Think datasets for music understanding

The authors introduce two large-scale datasets: MF-Skills contains over 4 million samples with detailed multi-aspect captions and question-answer pairs covering full-length, multi-cultural songs; MF-Think provides 300,000 chain-of-thought examples grounded in music theory to enable deliberate reasoning about music.

10 retrieved papers

Post-training recipe with chain-of-thought and reinforcement learning

The authors propose a novel post-training approach that combines supervised fine-tuning on chain-of-thought examples from MF-Think with GRPO-based reinforcement learning using custom-designed rewards. This enables the model to perform explicit step-by-step musical reasoning rather than simple pattern matching.

9 retrieved papers
Can Refute

Core Task Comparisons

Comparisons with papers in the same taxonomy category

Contribution Analysis

Detailed comparisons for each claimed contribution

Contribution

Music Flamingo model for advanced music understanding

The authors present Music Flamingo, a large audio-language model that moves beyond surface-level music recognition toward layered, human-like perception of songs. The model is designed to handle the dynamic, layered, and information-dense nature of music through enhanced training strategies and reasoning capabilities.
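To make the claimed design concrete, the sketch below assumes the common Flamingo-style pattern also used by the Audio Flamingo family: a pretrained audio encoder produces frame embeddings, a projector maps them into the LLM's token-embedding space, and the language model attends over the concatenated audio and text tokens. All module names, dimensions, and the toy backbone are illustrative assumptions, not Music Flamingo's actual implementation.

```python
# Minimal sketch of a Flamingo-style audio-language model (illustrative only).
import torch
import torch.nn as nn


class AudioToLLMProjector(nn.Module):
    """Maps audio-encoder frame embeddings into the LLM's token-embedding space."""

    def __init__(self, audio_dim: int = 128, llm_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_frames: torch.Tensor) -> torch.Tensor:
        # audio_frames: (batch, n_frames, audio_dim) from a pretrained audio encoder
        return self.net(audio_frames)


class ToyAudioLanguageModel(nn.Module):
    """Prepends projected audio embeddings to text embeddings before the backbone."""

    def __init__(self, vocab_size: int = 1000, audio_dim: int = 128, llm_dim: int = 256):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.projector = AudioToLLMProjector(audio_dim, llm_dim)
        # Stand-in for a decoder-only LLM; a real system would use a causal transformer.
        layer = nn.TransformerEncoderLayer(d_model=llm_dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, audio_frames: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        audio_tokens = self.projector(audio_frames)   # (batch, n_frames, llm_dim)
        text_tokens = self.text_embed(text_ids)       # (batch, n_text, llm_dim)
        joint = torch.cat([audio_tokens, text_tokens], dim=1)
        return self.lm_head(self.backbone(joint))     # next-token logits


# Example: 30 encoded audio frames and a 16-token text prompt.
model = ToyAudioLanguageModel()
logits = model(torch.randn(2, 30, 128), torch.randint(0, 1000, (2, 16)))
print(logits.shape)  # torch.Size([2, 46, 1000])
```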

Contribution

MF-Skills and MF-Think datasets for music understanding

The authors introduce two large-scale datasets: MF-Skills contains over 4 million samples with detailed multi-aspect captions and question-answer pairs covering full-length, multi-cultural songs; MF-Think provides 300,000 chain-of-thought examples grounded in music theory to enable deliberate reasoning about music.
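For illustration, the sketch below shows what individual MF-Skills (caption plus question-answer pairs) and MF-Think (chain-of-thought) records might look like, given the aspects the report describes. The field names and sample values are hypothetical; the datasets' actual schema is not specified here.

```python
# Hypothetical record layouts; field names and values are illustrative guesses.
mf_skills_example = {
    "audio": "song_000123.flac",  # full-length track
    "caption": (
        "A mid-tempo Afrobeat track in E minor; call-and-response horns over a "
        "syncopated bass groove, with lyrics about city life."
    ),
    "qa_pairs": [
        {"question": "What is the predominant key?", "answer": "E minor"},
        {"question": "Which section repeats after the bridge?", "answer": "The chorus"},
    ],
}

mf_think_example = {
    "audio": "song_000456.flac",
    "question": "Does the harmony modulate during the final chorus?",
    "chain_of_thought": (
        "The verses center on A major (I). Before the last chorus the bass walks up "
        "a whole step and the melody shifts with it, so the final chorus sits a tone "
        "higher, in B major."
    ),
    "answer": "Yes, it modulates up a whole step to B major.",
}
```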

Contribution

Post-training recipe with chain-of-thought and reinforcement learning

The authors propose a novel post-training approach that combines supervised fine-tuning on chain-of-thought examples from MF-Think with GRPO-based reinforcement learning using custom-designed rewards. This enables the model to perform explicit step-by-step musical reasoning rather than simple pattern matching.
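As a rough illustration of the GRPO component, the sketch below scores a group of sampled answers with a toy reward (exact-match correctness plus a small bonus for visible reasoning) and converts the rewards into group-relative advantages. The reward design, group size, and the surrounding clipped policy-gradient and KL-penalty machinery are simplified assumptions and do not reproduce the authors' custom rewards.

```python
# Minimal sketch of GRPO-style group-relative advantages with a toy reward.
from statistics import mean, pstdev


def toy_reward(response: str, reference_answer: str) -> float:
    correct = 1.0 if reference_answer.lower() in response.lower() else 0.0
    shows_reasoning = 0.2 if "because" in response.lower() else 0.0  # format-style bonus
    return correct + shows_reasoning


def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO's advantage: reward minus group mean, divided by group std."""
    mu, sigma = mean(rewards), pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]


# Example: four sampled answers to "What key is the chorus in?" (reference: "D major").
responses = [
    "D major, because the melody resolves to D over an F sharp in the bass.",
    "It is in D major.",
    "Probably B minor.",
    "G major.",
]
rewards = [toy_reward(r, "D major") for r in responses]
advantages = group_relative_advantages(rewards)
# Responses with above-average reward receive positive advantages and are reinforced.
```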