Learning Music Style For Piano Arrangement Through Cross-Modal Bootstrapping
Overview
Overall Novelty Assessment
The paper proposes a cross-modal framework using a Querying Transformer (Q-Former) to extract implicit style representations from audio and condition symbolic piano generation. It resides in the 'Audio-to-Symbolic Piano Arrangement' leaf, which contains four papers total, including the original work. This leaf sits within the broader 'Cross-Modal Style Transfer and Arrangement' branch, indicating a moderately active research direction focused on bridging audio and symbolic modalities for piano arrangement tasks.
The taxonomy reveals neighboring work in 'Multi-Modal Music Style Transfer' (two papers on unsupervised polyphonic transfer) and 'Phrase-Based Arrangement with Style Transfer' (one paper combining phrase selection with neural style transfer). The 'Conditioned Symbolic Music Generation' branch explores genre-conditioned generation without cross-modal audio input, while 'Music Representation Learning and Classification' focuses on style embeddings for classification rather than generation. The original paper's emphasis on implicit style learning from audio distinguishes it from purely symbolic conditioning approaches.
Of the sixteen candidates examined in total, ten were compared against Contribution A (Q-Former for implicit style learning); none clearly refuted it, suggesting relative novelty in this specific architectural choice. Contribution B (two-stage training strategy) was refuted by one of the three candidates examined against it, indicating some overlap with existing audio-symbolic alignment methods. Contribution C (style disentanglement methodology) likewise had one refuting candidate among its three, pointing to prior work on separating style from content in music language models.
Given the limited search scope of sixteen candidates, the framework appears to occupy a moderately explored niche within audio-to-symbolic piano arrangement. The Q-Former architecture for style extraction is the most distinctive element, while the training strategy and the disentanglement methodology have more substantial prior work. The analysis is not exhaustive: it covers only top-K semantic matches and their citation expansion.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors propose a novel framework that leverages a Querying Transformer (Q-Former) to extract implicit music style representations from audio using a pre-trained audio language model and apply them to condition a symbolic language model for piano arrangement generation. This extends the Q-Former's role beyond content alignment in vision-language tasks to align audio and symbolic modalities through implicit music style.
The authors present a two-stage training methodology where Stage-I uses contrastive learning, matching, and generative objectives to learn cross-modal style representations, and Stage-II performs generative modeling for piano arrangement. This approach enables bootstrapping audio-to-symbolic arrangement without re-training either language model backbone.
The authors develop a methodology that disentangles music style from content by treating the Q-Former as a bottleneck to transfer only style-related information (such as grooving patterns and dynamics) from audio while preserving content from lead sheets. This provides a scalable approach compared to traditional latent-variable disentanglement methods.
Core Task Comparisons
Comparisons with papers in the same taxonomy category
[5] Pop2piano: Pop audio-based piano cover generation
[6] Audio-To-Symbolic Arrangement Via Cross-Modal Music Representation Learning
[9] PiCoGen2: Piano cover generation with transfer learning approach and weakly aligned data
Contribution Analysis
Detailed comparisons for each claimed contribution
Cross-modal framework using Q-Former for implicit music style learning
The authors propose a novel framework that leverages a Querying Transformer (Q-Former) to extract implicit music style representations from audio using a pre-trained audio language model and apply them to condition a symbolic language model for piano arrangement generation. This extends the Q-Former's role beyond content alignment in vision-language tasks to align audio and symbolic modalities through implicit music style.
[18] VampNet: Music Generation via Masked Acoustic Token Modeling
[19] X-dancer: Expressive music to human dance video generation
[20] Let's Dance Together! AI Dancers Can Dance to Your Favorite Music and Style
[21] Music style classification by jointly using CNN and Transformer
[22] MuseMorphose: Full-Song and Fine-Grained Piano Music Style Transfer With One Transformer VAE
[23] Joint Learning of Emotion and Singing Style for Enhanced Music Style Understanding
[24] Cross-Modal Transformer with Dynamic Attention Fusion for Emotion Recognition in Music via Audio-Lyrics Alignment
[25] Human Body Synthesis
[26] CNN-Transformer architecture for piano performance style recognition and AI-based real-time music accompaniment
[27] CyberTune - Dynamic Remixing and Hack your Playlist to match Beat Alchemy to transform your Sound for Human-Centric AI
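The Q-Former conditioning described in this contribution can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: a small set of learnable query vectors cross-attends to frozen audio-LM features and emits a fixed-size "style" embedding that can be prepended to a symbolic language model's input. All shapes, names, and the single-head attention are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def qformer_style_tokens(audio_feats, queries, Wq, Wk, Wv):
    """Single-head cross-attention: queries attend over audio frames.
    audio_feats: (T, d) frozen audio-LM features
    queries:     (n_q, d) learnable query tokens (the bottleneck)
    returns:     (n_q, d) style tokens, fixed size regardless of T
    """
    Q = queries @ Wq          # (n_q, d)
    K = audio_feats @ Wk      # (T, d)
    V = audio_feats @ Wv      # (T, d)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (n_q, T)
    return attn @ V           # (n_q, d)

d, n_q, T = 16, 4, 50
queries = rng.normal(size=(n_q, d))       # learned in training; random here
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
audio = rng.normal(size=(T, d))

style = qformer_style_tokens(audio, queries, Wq, Wk, Wv)
print(style.shape)  # (4, 16): fixed-size style condition for the symbolic LM
```

The point of the sketch is the shape contract: however many audio frames T arrive, the symbolic model receives exactly n_q conditioning vectors.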
Two-stage training strategy for audio-symbolic alignment
The authors present a two-stage training methodology where Stage-I uses contrastive learning, matching, and generative objectives to learn cross-modal style representations, and Stage-II performs generative modeling for piano arrangement. This approach enables bootstrapping audio-to-symbolic arrangement without re-training either language model backbone.
[14] BOSSA: Learning Music Style Through Cross-Modal Bootstrapping
[28] Contrastive Audio-Language Learning for Music
[29] A Contribution to Music Theory Enhanced and Emotion Aware; Deep Learning Based Symbolic Music Generation
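The Stage-I contrastive objective named above can be illustrated with an InfoNCE-style loss. This is a hedged sketch under assumed shapes and names, not the paper's code: it pulls each clip's audio style embedding toward its paired symbolic embedding and pushes it away from the other clips in the batch.

```python
import numpy as np

def info_nce(audio_emb, sym_emb, tau=0.07):
    """audio_emb, sym_emb: (B, d) batch embeddings; matched pairs share a row."""
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    s = sym_emb / np.linalg.norm(sym_emb, axis=1, keepdims=True)
    logits = a @ s.T / tau                              # (B, B) similarities
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))                 # matched pairs on diagonal

rng = np.random.default_rng(1)
sym = rng.normal(size=(8, 32))
# Aligned pairs: audio side is a slightly perturbed copy of the symbolic side.
aligned = info_nce(sym + 0.01 * rng.normal(size=sym.shape), sym)
# Unrelated pairs: independent random embeddings.
unrelated = info_nce(rng.normal(size=(8, 32)), sym)
print(aligned < unrelated)  # prints True: aligned pairs score a lower loss
```

In the two-stage framing, minimizing a loss like this (alongside matching and generative objectives) shapes the style space in Stage-I, and Stage-II then trains only the generative pathway, leaving both language-model backbones frozen.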
Style disentanglement methodology for music language models
The authors develop a methodology that disentangles music style from content by treating the Q-Former as a bottleneck to transfer only style-related information (such as grooving patterns and dynamics) from audio while preserving content from lead sheets. This provides a scalable approach compared to traditional latent-variable disentanglement methods.
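The bottleneck argument can be made concrete with a toy sketch (all names and the pooling stand-in are assumptions, not the authors' method): however long the audio clip, only n_q summary vectors reach the symbolic model, so fine-grained audio content cannot pass through, while the lead sheet supplies content directly.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_q = 16, 4

def summarize(audio_feats, n_q):
    """Stand-in for the Q-Former bottleneck: collapse T audio frames into
    n_q summary vectors (here by mean-pooling n_q equal segments; the real
    model uses learned cross-attention queries)."""
    return np.stack([seg.mean(axis=0) for seg in np.array_split(audio_feats, n_q)])

def condition_input(style_tokens, lead_sheet_emb):
    """Prepend the fixed-size style tokens to the lead-sheet (content)
    embeddings to form the symbolic LM's conditioned input."""
    return np.concatenate([style_tokens, lead_sheet_emb], axis=0)

lead_sheet = rng.normal(size=(32, d))   # content side, taken from the lead sheet
for T in (100, 1000):                   # short vs long audio clip
    style = summarize(rng.normal(size=(T, d)), n_q)
    seq = condition_input(style, lead_sheet)
    print(T, seq.shape)  # (36, 16) both times: only n_q style vectors get through
```

Because the conditioned sequence length is n_q + len(lead sheet) regardless of T, the capacity available for audio-side information is capped at n_q × d numbers, which is the sense in which the Q-Former acts as a style bottleneck.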